A multi-omics data simulator for complex disease studies and its application to evaluate multi-omics data analysis methods for disease classification

Abstract Background An integrative multi-omics analysis approach that combines multiple types of omics data including genomics, epigenomics, transcriptomics, proteomics, metabolomics, and microbiomics has become increasing popular for understanding the pathophysiology of complex diseases. Although many multi-omics analysis methods have been developed for complex disease studies, only a few simulation tools that simulate multiple types of omics data and model their relationships with disease status are available, and these tools have their limitations in simulating the multi-omics data. Results We developed the multi-omics data simulator OmicsSIMLA, which simulates genomics (i.e., single-nucleotide polymorphisms [SNPs] and copy number variations), epigenomics (i.e., bisulphite sequencing), transcriptomics (i.e., RNA sequencing), and proteomics (i.e., normalized reverse phase protein array) data at the whole-genome level. Furthermore, the relationships between different types of omics data, such as methylation quantitative trait loci (SNPs influencing methylation), expression quantitative trait loci (SNPs influencing gene expression), and expression quantitative trait methylations (methylations influencing gene expression), were modeled. More importantly, the relationships between these multi-omics data and the disease status were modeled as well. We used OmicsSIMLA to simulate a multi-omics dataset for breast cancer under a hypothetical disease model and used the data to compare the performance among existing multi-omics analysis methods in terms of disease classification accuracy and runtime. We also used OmicsSIMLA to simulate a multi-omics dataset with a scale similar to an ovarian cancer multi-omics dataset. The neural network–based multi-omics analysis method ATHENA was applied to both the real and simulated data and the results were compared. Our results demonstrated that complex disease mechanisms can be simulated by OmicsSIMLA, and ATHENA showed the highest prediction accuracy when the effects of multi-omics features (e.g., SNPs, copy number variations, and gene expression levels) on the disease were strong. Furthermore, similar results can be obtained from ATHENA when analyzing the simulated and real ovarian multi-omics data. Conclusions OmicsSIMLA will be useful to evaluate the performace of different multi-omics analysis methods. Sample sizes and power can also be calculated by OmicsSIMLA when planning a new multi-omics disease study.


Introduction
Complex diseases such as hypertension, type 2 diabetes, and autism are caused by multiple genetic and environmental factors (Timpson et al. 2018). Genome-wide association studies have identified many genetic variants (i.e., SNPs) associated with the complex diseases.
However, it remains difficult to understand the roles of the associated SNPs in the molecular pathophysiology of the disease and how the SNPs interact with other SNPs in a biological network (Karczewski and Snyder 2018). With the advancement of high-throughput sequencing technology such as next-generation sequencing (NGS) and massive parallel technology such as mass spectrometry, multiple types of omics data (i.e., multi-omics data) including genomics, epigenomics, transcriptomics, proteomics, metabolomics, and microbiomics are rapidly generated (Hasin et al. 2017). As a single type of data generally cannot capture the complexity of molecular events causing the disease, an integrative approach to combining the multi-omics data would be ideal to help elucidate the pathophysiology of the disease (Karczewski and Snyder 2018).
Integrative methods to combine multi-omics data for disease studies have been developed rapidly (Tyekucheva et al. 2011;Jennings et al. 2013;Holzinger et al. 2014;Ruffalo et al. 2015;Yan et al. 2017). They can be generally classified into two categories: multi-staged and meta-dimensional approaches (Ritchie et al. 2015). The multi-staged approach aims to first identify relationships between the multi-omics data, and then test the associations between the multi-omics data and the phenotype. For example, Jennings et al. (Jennings et al. 2013) constructed a Bayesian hierarchical model consisting of two stages. The first stage partitioned gene expression into factors accounted by methylation, copy number variation (CNV), and other unknown causes. These factors were subsequently used as predictors for clinical outcomes in the second stage model. One advantage of this approach is that the causal relationships between multi-omics data can be modeled. In contrast, the meta-dimensional approach combines the multi-omics data simultaneously. Raw or the transformed data from the multi-omics data are combined into a single matrix for the analysis. This approach allows for a more flexible inference of the relationships among the multi-omics data, without the assumptions of the causal relationships between these data. Although many multi-omics analysis methods for disease studies are available, they were generally evaluated by simulations with data generated specifically to the methods. To compare the performance among these methods, it is necessary to use the same simulated multi-omics dataset with disease status. However, current simulation tools for disease studies mainly focused on simulating a certain type of omics data. For example, more than 25 simulators are available for simulating genetic data with phenotypic trait, according to the Genetic Simulation Resources website (https://popmodels.cancercontrol.cancer.gov/gsr/). Tools such as WGBSSuite (Rackham et al. 2015) and pWGBSSimla (Chung and Kang 2018) can simulate whole-genome bisulphite sequencing (WGBS) data in case-control samples.
Moreover, tools such as Polyester (Frazee et al. 2015) and SimSeq (Benidt and Nettleton 2015) simulate RNA-seq data with differential gene expression between two groups of samples. To our knowledge, there is currently no simulation tool that is capable of simulating a variety of omics data types and modeling the complex relationships between the data and the disease. Furthermore, sample size estimation when planning a multi-omics study to ensure sufficient power also becomes important (Hasin et al. 2017). This also requires a simulation tool that simulates realistic multi-omics data structures and models the architecture of the complex disease.
Here, we developed the multi-omics data simulator OmicsSIMLA, which simulates genomics data including SNPs and CNVs, epigenomics data such as the WGBS data, transcriptomics data (i.e., RNA-seq), and proteomics data such as the normalized reverse phase protein array (RPPA) data at a whole-genome level. Furthermore, the relationships between different types of omics data, such as meQTLs (SNPs influencing methylation), eQTLs (SNPs influencing gene expression), and eQTM (methylation influencing gene expression), were modeled. More importantly, the relationships between these multi-omics data and disease status were modeled as well. The disease models in OmicsSIMLA are flexible so that the main effects and/or interaction effects (either risk or protective) of SNPs and CNVs on the disease can be specified. Differential methylation and differential gene and protein expression between cases and controls can also be simulated. We demonstrated the usefulness of OmicsSIMLA by simulating a multi-omics dataset for breast cancer under a hypothetical disease model, and compared the performance among existing multi-omics analysis tools based on the data. Figure 1 shows the framework of OmicsSIMLA. The genomics data that can be simulated include SNPs and CNVs. Genotypes at SNPs in unrelated and/or family samples are simulated based on the SeqSIMLA2 algorithm (Chung et al. 2015). CNV status (i.e., a deletion, normal, one duplication or two duplications) on a chromosome is simulated based on the user-specified chromosomal regions and CNV frequencies. Affection status of each sample is determined by a logistic penetrance function conditional on the causal SNPs and CNVs, and/or the interactions among the causal SNPs. The epigenomics data are the methylated and total read counts at CpGs based on bisulphite sequencing, simulated using the pWGBSSimla algorithm incorporating methylation profiles for 29 human cell and tissue types (Chung and Kang 2018). Allele-specific methylation (ASM), in which paternal and maternal alleles have different methylation rates, and differentially methylated region (DMR), where the same CpGs in the region have different methylation rates among different cell types, can also be simulated. Furthermore, the transcriptomics data (i.e., RNA-seq read counts) are simulated with a parametric model assuming a negative-binomial distribution.

Results
Finally, the mass-action kinetic action model (Teo et al. 2015) is used to simulate proteomics data at a certain time point incorporating the gene expression data. Some SNPs can be specified as meQTLs and eQTLs, and some CpGs can be specified as eQTM. Allele-specific expression (ASE), which alleles in a gene have different expression levels, caused by cis-eQTL can also be simulated. The differential methylation, gene expression, and protein expression levels between cases and controls are simulated conditional on the affection status.  Using OmicsSIMLA, we simulated a multi-omics dataset based on hypothetical pathways for breast cancer as described in Ritchie et al. (Ritchie et al. 2015) and illustrated in Figure 2.
The data included a deletion with a protective effect in the CYP1A1 gene, 3 common SNPs with risk effects in the CYP1B1 gene, 5 rare SNPs in the COMT gene, which had interaction effects with a meQTL for the XRCC1 gene, and 5 rare SNPs each in the GSTM1 and GSTT1 genes, which also had interaction effects with eQTLs affecting the gene and protein expression of the XRCC3 gene. A total of 2,022 SNPs in the four genes (i.e., CYP1B1, COMT, GSTM1, and GSTT1) and a regulatory region consisting of the eQTLs and meQTL, 1 CNV in CYP1A1, 688 CpGs in XRCC1, and gene and protein expression levels for 100 genes (including the expression for XRCC3 and 99 other hypothetical genes in the pathways) were simulated. More details about the simulations can be found in the Methods section.
We compared the performance of three multi-omics data analysis methods for disease prediction using the area under the curve (AUC) measures. The three methods included the random forest-based method (RFomics), a graph-based integration method (CANetwork) (Yan et al. 2017), and a model-based integration method (ATHENA) (Holzinger et al. 2014).
The RFomics combines the preprocessed multi-omics data in a single matrix for constructing the prediction model. As described in the Methods section, a gene-based risk score is calculated based on SNPs for each gene. Then the risk scores and other multi-omics data are normalized so that they can be evaluated on the same scale by the RF algorithm. In contrast, CANetwork calculates a graph matrix to measure the distance between samples using the composite association network algorithm (Mostafavi et al. 2008), and the prediction model is created based on the distance matrix using the graph-based semi-supervised learning algorithm (Tsuda et al. 2005). Finally, ATHENA creates a neural network model for each type of omics data and a final integrative model is generated based on these models. Table 1 shows the area under the curve (AUC) for the three methods under three scenarios.
Scenario 1 had 500 cases and 500 controls in the training set, and 100 cases and 100 controls in the validation set. Scenario 2 had the same sample sizes as those in Scenario 1, but the multi-omics data had less strong effects on the disease compared to Scenario 1. The effects of the multi-omics data were the same in Scenarios 3 as those in Scenario 1, but Scenarios 3 had larger sample size (i.e., 1,500 cases and 1,500 controls in the training data and 500 cases and 500 controls in the validation data). More details of the three scenarios are provided in the Methods section. Prediction models for the three methods were created based on the training dataset, and their prediction accuracies were evaluated by the validation dataset. As seen in Table 1, RFomics has the highest AUC in all 3 scenarios followed by ATHENA and CANetwork. Table 2 shows the run time for the three methods. In Scenario 1, RFomics and CANetwork had similar performance, while ATHENA required more than 20-times the runtime of RFomics and CANetwork. In Scenario 3, CANetwork was the most efficient method followed by RFomics, and ATHENA also required significantly more time than the other two methods.

Discussion
We have developed OmicsSIMLA, which simulates multi-omics data (i.e., genomics, epigenomics, transcriptomics, and proteomics data) with disease status. In contrast to the current omics data simulators that mainly focused on simulating one type of omics data, OmicsSIMLA simulates multiple types of omics data while the relationships between different types of omics data and the relationships between the omics data and the disease are modeled. As the development of integrative methods for analyzing multi-omics data has attracted substantial interest from researchers, OmicsSIMLA will be very useful to simulate benchmark datasets for comparisons of these methods. Furthermore, as more and more disease studies take advantages of multi-omics data, OmicsSIMLA will also be very useful for power calculations and sample size estimations when planning a new study.
We used OmicsSIMLA to simulate a multi-omics dataset for breast cancer based on hypothetical pathways. Three analysis tools were compared using the dataset. The results suggest that when the different types of data were properly normalized on the same scale, the RF-based method (i.e., RFomics) achieved the highest AUC. Furthermore, RFomics had comparable runtime efficiency as that of CANetwork, while ATHENA was computationally expensive. Therefore, RFomics can potentially be a useful analysis tool for disease prediction using multi-omics data.
As studies for quantitative traits are also important, it is our future work to extend OmicsSIMLA to simulate quantitative traits based on the classic quantitative genetics model (Falconer and Mackay 1996). Furthermore, environmental factors and the interactions between genes and environments can also play important roles in complex disease etiology. Therefore, simulating exposome data such as the climate and air quality data and modeling their interactions with genes are also important in the future extensions of OmicsSIMLA.
In conclusion, we developed a useful multi-omics data simulator for complex disease studies.
As many parameters can be adjusted in OmicsSIMLA, a user-friendly web interface is provided at https://omicssimla.sourceforge.io/generateCommand.html to conveniently specify these parameters.

Simulation of DNA sequences
The SeqSIMLA2 package (Chung et al. 2015) is integrated in OmicsSIMLA to generate DNA sequences in unrelated/related individuals. Similar to SeqSIMLA2, OmicsSIMLA expects a set of external reference sequences (i.e., haplotypes) generated by an external sequence generator, such as COSI (Schaffner et al. 2005) or HAPGEN2 (Su et al. 2011) that has been widely adopted in genetics studies. Generally, a set of 10,000 or more reference sequences are expected. Optional files consisting of recombination rate information and pedigree structures are also accepted. A gene dropping algorithm assuming random mating with crossovers is performed based on the reference sequences, recombination rates, and pedigree structures to generate haplotypes in each individual.

Simulation of CNVs
For the simulation of CNVs, we considered four CNV states including deletion (D), normal (N), one duplication (U), and two duplications (UU) on a chromosome. Therefore, there are 10 types of CNV states on the two chromosomes in an individual, as shown in Supplemental Table S1, and the total copy numbers on the two chromosomes range from 0 to 6. The user will provide frequencies and ranges of the four CNV states.
During meiosis, we use the single-copy crossover model, assuming all crossovers occurred between CNVs (Hartasanchez et al. 2014).

Simulation of affection status
Genetic variants, including SNPs and CNVs, are used to determine the affection status of an individual based on a logistic penetrance function as follows: where P(affected) is the probability of being affected, 0  determines the baseline prevalence,  ,  , and  are sets of causal CNVs, SNPs with main effects, and SNPs with interaction effects, respectively, specified by the user, Ci1 and Ci2 are the CNV states for the first and second haplotypes at CNV i, respectively, Gj is the genotype coding at SNP j,

Simulation of DNA methylation data
The pWGBSSimla package (Chung and Kang 2018) is integrated into OmicsSIMLA to generate the WGBS data. The pWGBSSimla algorithm simulates data using methylation profiles generated based on 41 WGBS datasets for 29 human cell and tissue types. The profiles contain the information for each CpG, such as its distance to the next site, methylation rate, methylation status (i.e., methylated, unmethylated, and fuzzily methylated), and read counts for each type of methylation status. CpGs and the distances between the CpGs are first determined based on the profiles, and then the total read count and methylated read count are simulated for each CpG. Methylation level at a CpG influenced by a meQTL is simulated based on a genotype-specific methylation probability, which is the methylation rate of the CpG in the profiles multiplied by a ratio following an exponential distribution.
Furthermore, ASMs are simulated based on father-and mother-specific methylation rates for paternal and maternal alleles, respectively. Finally, a DMR is generated by simulating the same genomic region using profiles for different cell or tissue types. More details of the pWGBSSimla algorithm can be found in Chung and Kang (Chung and Kang 2018).

Simulation of RNA-seq data
We implemented a parametric simulation procedure for simulating the RNA-seq data similar to that described in Benidt and Nettleton (Benidt and Nettleton 2015).

Simulation of eQTL and allele-specific reads
We followed the procedure in the simulation study in Sun (Sun 2012) to simulate eQTL and read counts for ASE. For eQTL l with a user-specified fold change hl, the means for the three genotypes AA, Aa, and aa at the eQTL are ij  , l ij h  , and (2 1) l ij h   , respectively, and the dispersion parameter is i  in the NB distribution for gene i influenced by the eQTL. ASE for a gene caused by a cis-eQTL is simulated by assuming reads were mapped to heterozygous SNPs (i.e., allele-specific reads) in the gene. A cis-eQTL refers to the eQTL being located in the cis-regulatory elements of the gene. Because the alleles at the cis-eQTL can be in the same haplotype as the alleles of the gene, ASE can be observed using the allelespecific reads of the gene. Furthermore, only heterozygous SNPs can be tested for cis-eQTL with the allele-specific reads. Therefore, we simulate allele-specific reads for heterozygous eQTLs. Assuming tij is the total read count for gene i in individual j, the total number of allele-specific reads is calculated as 0.005tij, where 0.005 was estimated from real data by Sun (Sun 2012). Furthermore, also suggested by Sun (Sun 2012), the number of allelespecific reads for a haplotype is simulated using a beta-binomial distribution with a mean determined by the effect size of the cis-eQTL and an overdispersion parameter of 0.1. The effect size is defined as log2(expression of the alternative allele at the eQTL/expression of the reference allele at the eQTL) (Mohammadi et al. 2017) for a heterozygous cis-eQTL and is set to 0 for a homozygous cis-eQTL.

Simulation of eQTM
We used linear regression to model the relationship between gene expression and methylation:

Protein expression simulation
We assumed that the protein expression level for protein k at a time point t in sample j follows a normal distribution with a mean kjt  and a standard deviation k  after normalization. We used the mass-action kinetic action model (Teo et al. 2015) (Fundel et al. 2008) based on the RNA-seq data simulated from the previous section. Similar to the simulation study in Teo et al. (Teo et al. 2015), d jt  is fixed to be 1, and s jt  with a default value of 1 can be changed by the user. A vector of standard deviations  were estimated from the level 4 protein expression data of primary tumor tissue in 874 breast cancer patients from the TCGA project downloaded from the cancer proteome atlas (TCPA) (Li et al. 2013) website. The level 4 data consist of protein expression data for 224 proteins that have been normalized across the samples as well as across the proteins, and a replication-based method was used to account for differences in protein expression among different batches. The parameter j  is then randomly sampled with replacement from  .

A random-forest based method for integrating multi-omics data for disease studies
Multi-omics data can have different data types (e.g., discrete data for SNP genotypes, categorical data for CNV statuses, and continuous data for proportions of methylated reads, RNA-seq read counts, and normalized protein expression) and different variations (e.g., three possible values of 0, 1, and 2 for minor allele counts at SNPs, and real numbers ranging between 0 and 1 for the proportions of methylated reads). When developing a method for integrating these data, it is important to account for the properties of different data types so that the analysis results would not be biased toward certain variables (Ritchie et al. 2015). We developed a preprocessing algorithm for the multi-omics data. A gene-based risk score, which is a weighted sum of the numbers of risk alleles at SNPs in the gene, for each individual is constructed. The weights are the effect sizes of the risk alleles at the SNPs. More details for calculating the risk score are provided in the Supplemental Methods. Then each variable from different omics data, including the gene-based risk scores, CNV statuses of genes, methylation proportions at CpGs, gene and protein expression levels, is normalized so that it has a mean 0 and a standard deviation of 1. The normalized variables are then used in RF for classification.

Simulation studies
We used OmicsSIMLA to evaluate the performance of the proposed RF-based method, compared with CANetwork and ATHENA. A hypothetical disease model for breast cancer involving multi-omics data (Ritchie et al. 2015) was simulated, as shown in Figure 2. To be more specific, a deletion with a frequency of 20%, which had a protective effect with an odds ratio (OR) of 0.67, in the CYP1A1 gene and 3 common variants, which had main effects (ORs = 1.5) with minor allele frequencies (MAFs) > 10%, in the CYP1B1 gene were simulated. We also simulated 5 rare variants with MAFs < 3% in the COMT gene, which had interaction effects (ORs = 5) with a meQTL for the XRCC1 gene. The CpG in XRCC1 influenced by the meQTL caused a difference in methylation rates of 10% between cases and controls. Furthermore, we simulated 5 rare variants in the GSTM1 gene, which had interaction effects (ORs = 5) with a cis-eQTL for the XRCC3 gene, and 5 rare variants in the GSTT1 gene, which had interaction effects (ORs = 5) with a trans-eQTL for XRCC3. The eQTL caused a fold change of 1.5 in the XRCC3 gene expression compared to the reference genotype, and a fold change of 1.5 was simulated for the differential gene expression of XRCC3 between cases and controls. In summary, the total variables consisted of 200, 687, 264, and 176 SNPs in the CYP1B1, COMT, GSTM1, and GSTT1 genes, respectively, and 695 SNPs harboring the meQTL and two eQTLs in the regulatory region, a variable for CNV status in CYP1A1, methylation levels at 688 CpGs in XRCC1, and gene and protein expression levels for 100 genes and their encoded proteins. More details for generating the reference sequences in the genes and the simulations for each omics data type are provided in the Supplemental Methods.
We simulated a training dataset consisting of 500 cases and 500 controls as well as a validation dataset consisting of 100 cases and 100 controls. The training dataset was used by RFomics, CANetwork, or ATHENA to construct a prediction model. The validation dataset was then used to calculate the AUC based on the prediction model. Note that a 5-fold crossvalidation was performed in ATHENA, and a best model based on the testing dataset (i.e., one of the five random 20% of the training dataset) was created for each cross-validation. The model with the highest AUC based on the testing dataset was selected and applied to the validation dataset. This simulation scenario was referred to as Scenario 1. We also simulated 22 a scenario with less strong genetic effects (Scenario 2) and a scenario with larger sample size (Scenario 3). More details about Scenarios 2 and 3 are provided in the Supplemental Methods. For each scenario, 1,000 batches of training and validation datasets were simulated, and the AUC for each algorithm was averaged over the 1,000 batches.

Data access
The simulated data generated in the simulation studies can be downloaded from the OmicsSIMLA website (https://omicssimla.sourceforge.io/download.html).