A curated collection of transcriptome datasets to investigate the molecular mechanisms of immunoglobulin E-mediated atopic diseases

Abstract Prevalence of allergies has reached ~20% of population in developed countries and sensitization rate to one or more allergens among school age children are approaching 50%. However, the combination of the complexity of atopic allergy susceptibility/development and environmental factors has made identification of gene biomarkers challenging. The amount of publicly accessible transcriptomic data presents an unprecedented opportunity for mechanistic discoveries and validation of complex disease signatures across studies. However, this necessitates structured methodologies and visual tools for the interpretation of results. Here, we present a curated collection of transcriptomic datasets relevant to immunoglobin E-mediated atopic diseases (ranging from allergies to primary immunodeficiencies). Thirty-three datasets from the Gene Expression Omnibus, encompassing 1860 transcriptome profiles, were made available on the Gene Expression Browser (GXB), an online and open-source web application that allows for the query, visualization and annotation of metadata. The thematic compositions, disease categories, sample number and platforms of the collection are described. Ranked gene lists and sample grouping are used to facilitate data visualization/interpretation and are available online via GXB (http://ige.gxbsidra.org/dm3/geneBrowser/list). Dataset validation using associated publications showed good concordance in GXB gene expression trend and fold-change.


Introduction
Allergic disease is highly prevalent and currently reaches ∼20% of the populations in developed nations and with sensitization rate to one or more allergens among school age children approaching 40%-50% (1). Although the generation of allergic responses is well understood, the early sensitization steps and factors contributing to the development of immunoglobin E (IgE)-mediated diseases remain unclear. IgE is the major mediator of atopic response in humans. Atopy is defined by the American Academy of Allergy, Asthma and Immunology as the genetic tendency to develop allergic diseases such as allergic rhinitis, asthma and atopic dermatitis (eczema) and is typically associated with heightened immune responses to common allergens, especially inhaled allergens and food allergens. However, not all encounters with a potential allergen will lead to sensitization. Similarly, not all sensitizations will result in a symptomatic allergic response even in atopic individuals.
The effect of IgE spans across multiple systems. In the circulation system, IgE increases flow and permeability of the blood vessels, fluid and protein in tissues, as well as flow to the lymph nodes. In the airway, IgE decreases air conduct diameter, increases mucus congestion and can induce blockage. In the gastrointestinal tract, IgE increases fluid secretion, peristalsis and expulsion diarrhea. The immunoglobin has also been suggested to play a role in the defense against parasite infection and as a general gate keeper for any foreign materials entering the body (2,3). Ultimately, IgE is involved in the normal spectrum of reaction to expulse foreign material from the body, hence protecting by elimination. Nevertheless, oversensitization, which is developed by unknown mechanism(s), can lead to imbalance and pathology in affected individuals.
Two main groups of immune signals initiate the production of IgE in response to an antigen: (i) the signals that drive the differentiation of CD4 naive T cell to T helper type 2 (Th2) cells and (ii) the cytokines and co-stimulatory molecules secreted by Th2 cells, which subsequently promote T follicular helper cells-induced immunoglobulin B cells class-switch toward IgE production. Antigen characteristics, such as concentration and localization of the encounter (i.e. tissue, mucosa, circulation), can also affect Th2 cell induction. IL-4, IL-13 and STAT6 are key mediators of Th2 responses and IgM class switch to IgE. IL-4 secretion and mast cell CD40 surface expression also contribute to the IgE class switch to IgE at the site of allergic reaction.
Systemic levels of IgE alone is not a sufficient indicator for allergy risk (4). Peripheral blood IgE level can increase upon sensitization, but not reliable enough for deducing a diagnosis of allergy or allergen type (5). Concentration, binding strength or affinity, specificity and portion of specific IgE to total IgE are all factors in translating a humoral IgE response into a clinical symptom (6). However, genetic component exists in allergic disease. Studies have demonstrated strong heritable components of allergic diseases and atopy, estimated at 33%-76% (7,8). Genetic components of food allergy and asthma have also been reviewed (9)(10)(11)(12). Although single susceptibility genes have been identified for certain allergies (13)(14)(15)(16), most genomewide association studies reported multiple loci of susceptibility or gene regions to a wide range of allergy development (17)(18)(19). Environmental factors further complicate allergic susceptibility-the hygiene hypothesis of allergic susceptibility has received much attention in the last few decades (20). Other exposure-related factors (e.g. diet, pollution, tobacco smoke) are also likely to have contributed to the increasing susceptibility to allergic diseases in developed nations (1).
IgE may also be implicated in some primary immunodeficiency diseases (PIDs). For CD40 ligand and CD40 deficiency, circulating IgE+ plasma cells are absent (21). For others, IgE plasma levels are elevated: DOCK8 deficiency, autosomal dominant hyper-IgE syndrome (AD-HIES) Job's syndrome, Comel-Netherton syndrome, PGM3 deficiency, IPEX (immune dysregulation, polyendocrinopathy, enteropathy X-linked) and Tyk2 deficiency. Large phenotypic heterogeneity is observed among these PIDs; however, the underlying mechanisms are not completely understood. This knowledge gap mirrors our limited understanding of the involved gene products mediating phenotypes of atopic diseases.
There is a crucial need to pursue genetic markers of an atopic constitution amidst growing concern over the increasing abundance of IgE atopy and that current treatments remain mostly palliative. As both environmental interaction and genetic predisposition form the determinant of allergy development, extrapolating potential gene markers of allergy atopy from a broad range of studies with associated clinical data is highly relevant, particularly given the high variations of the disease clinical phenotypes. With the advances in high-throughput platforms for transcriptomic analyses, a large number of gene expression datasets is regularly deposited on publicly accessible repositories such as the NCBI Gene Expression Omnibus (GEO). GEO currently hosts over 90 000 datasets comprising over 2 million samples (https://www.ncbi.nlm.nih.gov/geo/) and presents an enormous opportunity for data mining across multiple studies.
The web-based Gene Expression Browser (GXB) is an open-source interface that allows for custom compilation of selected datasets (i.e. of interest to users) and facilitates visualization of gene expression data. GXB has been previously described (22) and used to generate a number Figure 1. Workflow of search strategies for datasets retrieval from the GEO database. Datasets and search terms obtained from the seven independent search strategies were merged and with duplicates removed. The results were then filtered for human studies and microarray or NGS planforms. The studies were manually confirmed for IgE-related pathogenesis and only datasets with appropriate study design, sample type and disease groups were kept. The final collection consists of 33 datasets. Brackets indicate the number of datasets obtained after each search step.
A search strategy was implemented to identify GEO datasets relevant to IgE-related atopic disease, including PIDs, and the datasets uploaded to the GXB platform online. The associated study metadata, such as detailed gene information, relevant literature, study design and sample information, were also uploaded to facilitate in-depth interpretation. In creating this collection of datasets, we aimed to provide a resource that will facilitate risk prediction and interception of allergic diseases. The datasets were retrieved from publicly available GEO series and selected by relevance to IgE and atopic diseases and filtered by analysis platform and species. As previously demonstrated, data mining and reanalysis/reinterpretation of large and publicly available dataset is a promising avenue (31) to elucidate complicated diseases such as IgE-related atopy.

Justification of data selection and filter
The focus of the GEO dataset selection was primarily on whether or not the dataset involved (i) IgE production or function or (ii) if the pathology studied implicated directly or indirectly IgE production or function. The dataset curation workflow is shown in Figure 1. In brief, we ensured a broad search approach by conducting seven independent search strategies. The independent search results (435) were merged using the "Merge collection" function available on 'My NCBI' (https://www.ncbi.nlm.nih.gov/myncbi/) and compiled with the results from a second search using an assembly of all the terms used in the first sets of strategies. After removal of duplicates, the two queries generated 196 and 117 datasets, respectively. The combined results were then manually filtered to restrict datasets to human sample, expression profiling by microarrays and next generation sequencing (NGS) (115) and relevance to IgE-related atopic diseases (53). Out of the remaining 53 datasets, study description, design and sample type for each dataset were inspected, with 33 datasets in the final curated collection. The following criteria were deemed very important: clear description of tissue type, comparisons between patient vs healthy or stimulated vs unstimulated samples and disease category implicating IgE. Furthermore, datasets/studies that are indirectly relevant to IgE-mediated atopic disease, such as gene expression in B cells after tonsillectomy, were also included as they were deemed valuable to (i) the discovery of putative novel gene-disease association, (ii) to improving our knowledge of adaptive immunity and/or (iii) to increasing our knowledge about factors affecting IgE production. A descriptive summary of the final dataset collection can be found in Table 1.
The web-based GXB platform: a visualization tool for gene expression data GXB is a valuable tool for training researchers about the reductionist investigative approaches (31). The creation of a web-based GXB platform was previously described in detail (22). In brief, the GXB is a simple interactive interface designed for visualization of large quantities of heterogenous data (Supplementary Figure 1). The platform allows for customizable data plots with overlapping metadata information, changeable sample order, as well as generation of sharable mini URLs that encapsulate information about the display settings in use. The dataset navigation page allows for quick identification of datasets of interest, either through filtering using predefined lists or via query terms. The user has access to multiple functionalities within the GXB graphic interface. In brief, the data-viewing interface enables interactive browsing and graphic representation of large-scale data. This interpretable format displays ranked gene lists and expression results. The interface also allows for user flexibility in terms of changing how the gene list is ranked, the method used for ranking, sample grouping (i.e. disease type), sample sorting (i.e. gender or age) and view type (i.e. bar or chart). The end user can browse through the datasets, format graph for a selected gene within a dataset and export data (i.e. annotation, FC, signal, groups) as is. The original GEO data and annotation are accessible from the GXB interface (Downloads tabs) as well as from their GEO page (links provided on Study tab). The associated source code and R scripts are publicly available at https:// github.com/BenaroyaResearch/gxbrowser and https:// github.com/BenaroyaResearch/gxrscripts, respectively. A detailed description of the GXB platform usage can be found under the Data Browsing and Visualization Interface section in the Supplementary Information.

Construction of the dataset collection on GXB
The selected datasets were downloaded from GEO in SOFT file format and uploaded (http://ige.gxbsidra.org/ dm3/geneBrowser/list) via the GXB interface (accessed by navigating the Tools menu located in the top-right corner of the GXB webpage; Tools/Chips Loaded/Upload Expression Data). The SOFT files contain metadata and normalized signal intensity data, generated by methods indicated by the author(s). These SOFT files can be analyzed directly for differentially expressed genes; thus, no additional processing was required. Using the 'Sample Set Annotation Tool' of the GXB interface, the datasets were annotated according to the information provided on GEO. The raw signal data type of the dataset (e.g. raw signal, log 2 transformed and etc.) are also specified for each GEO entry. The default data display is in linear scale (see further details below in 'Presentation of datasets'). When necessary, the sample annotation file (which is part of the SOFT file) were edited in order to add group information; as it was important to identify groups in order to compute fold-change (FC). For each dataset, individual samples were grouped based on relevant study variables. Three datasets (GSE19190, GSE75603 and GSE8507) were split via GXB's 'Sample Set Annotation Tool/Group Sets' interface to provide a more meaningful group comparison (these datasets share the same GSE number in Table 1). Genes were then ranked based on FCs of the specified two-group comparison. All the information annotated and presented on the GXB is assembled in a SQL database (Data citation 1).

Data availability
The curated datasets collected that have been described in this data note were assembled from the public repository NCBI GEO website: http://www.ncbi.nlm.nih.gov/gds/. In this study, we cited the GEO accession number of each dataset and the raw signal and annotation files are made available for download from the GXB web application (http://ige.gxbsidra.org/dm3/geneBrowser/list).

Description of datasets
After applying the filtering strategy as previously described (Figure 1), we curated 33 datasets encompassing 1860 transcriptome profiles relevant to IgE-related atopic diseases. Detailed information on each dataset is presented in Table 1 and the summary of the data collection is presented in an aggregation of pie charts (Supplementary Figure 2).
The data collection includes a wide range of studies, sample types, as well as diseases. In total, three RNAseq and 12 different microarray platforms are represented, with the majority in the latter being Affymetrix Human Array chips (various version). Twenty datasets were generated from in vitro studies and 13 from ex vivo studies. Nine sample types are represented, with the most abundant being peripheral blood mononuclear cell (PBMC) (n = 9) and airway epithelial cells (n = 5). Sample size of each dataset ranged from 5 to 628, with most studies having 10-50 samples. The dataset collection covered seven main disease categories, including Meaning, the ranklist displayed by default in GXB for the dataset.
c Ex vivo: Experimentation or measurements done in or on tissue from an organism in an external environment with minimal alteration of natural conditions; e.g. purification of specific cell types. In vitro: Experimentation or measurements done in or on altered tissue from an organism in an altered external environment; e.g. stimulation of cultured bilological specimen. allergy, asthma, healthy responses, HIES, dermatitis, atopic irritable bowel syndrome (IBS) and urticaria. In the majority of the studies, comparisons are made between patients and health controls (n = 19), followed by stimulation (n = 6) and time course (n = 5). It is also possible to rank the genes based on expression difference by selecting 'Advanced' in the 'Rank Lists' dropdown menu of the graphic interface. This display option can be more robust than FC when low expression is observed in one group. Additional annotation features can be used to query specific gene list from the individual datasets. Using the 'Rank Lists' dropdown function, users can select either pathways or diseases from gene list category to selectively display the FCs of the desired gene sets. These annotated gene lists come from the KEGG database (32) or are constituents of the immune-relevant genes or of genes associated with known disease signatures. The pathways option covers five gene lists, including cytokine and chemokine receptors, cytokine and chemokine ligands, Tcell signaling, B-cell signaling and antigen presenting cell processing. The nine diseases gene lists covers asthma, allograph rejection, tolerance associated genes, rheumatoid arthritis, graft vs host, type one diabetes, primary immunodeficiency, systemic lupus erythematosus and auto-immune thyroid.

Presentation of datasets
It is important to note that integration and reanalysis of the datasets is not the intent of this collection. Therefore, a more meaningful use of this collection lies in the comparison of FC expression for a gene across multiple relevant datasets (i.e. reductionist approach (31)). To illustrate the cross-project view function, we presented the FC of CD40 in two datasets, GSE56681 and GSE44956, that were queried using the default cut-off of 2.0 FC (Supplementary Figure 3). The default pairwise rank comparison for each dataset and the associated FC are shown below the dataset titles. A detailed description of the cross-study features can be found in the Data Browsing and Visualization Interface section in the Supplementary Information.

Dataset validation
Quality assessment of the datasets was performed by looking for key gene expression, i.e. marker genes, and highly differentiated genes as indicated in the associated publications. Table 1 includes the type of validation, the genes used, as well as the strength of the validation and the associated comments. Certain datasets were split in two to facilitate comparisons (GSE19190, GSE75603 and GSE8507); in these cases, validation is indicated for one of the two datasets as indicated in Table 1. Data validation was achieved for most datasets except for 7, due to having no publications (GSE14842, GSE37157 and GSE41861) or incomparable units (i.e. z-score) and/or absence of FC information (GSE64639, GSE54522 and GSE44956). One dataset (GSE70050) has a discrepancy on sample names between GEO information and SOFT files; thus, the group comparison reported in the associated publication could not be replicated in the GXB. Trend validations were done on five datasets that had no linked publications (GSE13619 and GSE99948) or had incomparable units to GXB (GSE87399, GSE70760, GSE72542). But when possible, literature values were used for additional validation of the trend (i.e. GSE92688and GSE99948). Examples of the validation results are presented in Figure 2.

Potential application of the dataset collection
To demonstrate the potential use of the dataset collection, gene expression profile of different tissues for house dust mite (HDM) allergy were compared. Nasal epithelial (GSE19190) and PBMC-derived CD4 T cells (GSE70760) gene expression profiles (detailed in Table 1) were compared. In both datasets, genes associated with the Th2 pathway/axis were increased in HDM-sensitized patients compared against healthy controls (33,34); illustrating that Th2 response may results in the symptomatic phase. This is further supported by the IgE levels and sIgG/sIgE ratio in the same study (34). SERPINB2, a gene coding for the inhibitor of plasminogen activator PAI-2, is markedly upregulated in patients in both datasets, hence tissues ( Figure 3A from GSE70760 and Figure 3B from GSE19190). This is consistent with previous association of asthma severity and biomarker panel including SERPINB2 from PBMC (35) and correlation of SER-PINB2 expression in respiratory epithelial cells with atopic asthma severity (36). SERPINB2 has been reported to have a role in the interleukin-12-mediated signaling pathway; evidence from mice showed that SERPINB2 regulates IFNg production, causing down regulation of Th1 cytokines in macrophages (37). The expression profiles of SERPINB2 in both tissues suggest an important role of the gene as a first line regulator of immune response, perhaps by preventing excessive Th1 response. Interestingly, IFNg was not decreased in PBMCderived T cells, but interferon-gamma receptor 1 (IFNGR1) was, suggesting that in T cells, SERPINB2 may exert its role on the expression of the receptor rather than on IFNg itself. In Schroder et al. (2010) (37), stimulation of SER-PINB2 −/− cells with antiCD40/IFNg resulted in greater Th1 cytokine production, thus supporting the idea that SERPINB2 affects IFNGR.
However, certain genes are found to be differently expressed between nasal epithelium or PBMC-derived T cells. An explanation is that the activation of Th1 immune response may be different in the target tissue and peripheral circulation. For example, interleukin 1 beta (IL1B), a Th1 promoting cytokine is found to be increased in CD4+ T cells in allergic patients ( Figure 3C, from GSE70760), but in the nasal epithelium, the gene is not upregulated in patients with severe case of allergic rhinitis ('uncontrol asthma') compared against healthy control ( Figure 3D, from GSE19190). Further hypothesisgenerating comparisons can be made by comparing the top differentially expressed genes in both datasets as listed in Supplementary Table 1. The annotated feature can be used to query disease or pathway-specific gene lists and be applied to crossstudy comparison. As an example, two specific gene lists were queried from three datasets (GSE22528, GSE15823 and GSE19190), comparing asthma or allergic and asthmatic patients to healthy controls, using (i) disease asthma and ( Figure 4A) and (ii) pathway-cytokine and chemokine receptors ( Figure 4B). Number of unique and shared differentially expressed genes among the three studies can be identified. In this example, 30 cytokine and chemokine receptor genes ( Figure 4B) were uniquely identified to the allergic asthma disease type (GSE19190). This interesting pool of genes may potentially be used to investigate the molecular mechanisms of the co-occurrence of allergy and asthma.

Conclusion
The dataset collection may be useful for exploring specific gene signatures in response to natural antigen/allergen exposure (i.e. allergic patients vs control) and delineate the major genetic drivers associated with increased level of IgE (i.e. cellular responses to specific antigen/allergen). Furthermore, comparative investigation of similarities and differences in expression of genes between datasets can highlight key mechanistic differences of immune signaling and/or provide insights for tissue-targeted intervention. In compiling the present dataset collection, we hope to offer a resource that may improve accessibility of public omics data to researchers in this field.

Author contributions
MG and SSYH contributed to conceptualization. MG, SSYH and FA contributed to data curation and validation. MG and SSYH led, and FA supported, investigation and visualization. MG and SSYH performed formal analyses. SB and MT contributed to the maintenance of software. MG and SSYH contributed writing -original draft, methodology and project administration. MG and SSYH led, and FA, SB, MT and DC supported, writing-review and editing. MG lead and SSYH supported supervision. DC contributed funding acquisition and resources. The contributor's roles listed above (underlined) follow the Contributor Roles Taxonomy (CRediT) described in Nature Communication 2014 (38) and managed by The Consortia Advancing Standards in Research Administration Information (CAS-RAI) (https://casrai.org/credit/).

Supplementary data
Supplementary data are available at Database Online.