Correcting for experiment-specific variability in expression compendia can remove underlying signals

Abstract Motivation In the past two decades, scientists in different laboratories have assayed gene expression from millions of samples. These experiments can be combined into compendia and analyzed collectively to extract novel biological patterns. Technical variability, or "batch effects," may result from combining samples collected and processed at different times and in different settings. Such variability may distort our ability to extract true underlying biological patterns. As more integrative analysis methods arise and data collections get bigger, we must determine how technical variability affects our ability to detect desired patterns when many experiments are combined. Objective We sought to determine the extent to which an underlying signal was masked by technical variability by simulating compendia comprising data aggregated across multiple experiments. Method We developed a generative multi-layer neural network to simulate compendia of gene expression experiments from large-scale microbial and human datasets. We compared simulated compendia before and after introducing varying numbers of sources of undesired variability. Results The signal from a baseline compendium was obscured when the number of added sources of variability was small. Applying statistical correction methods rescued the underlying signal in these cases. However, as the number of sources of variability increased, it became easier to detect the original signal even without correction. In fact, statistical correction reduced our power to detect the underlying signal. Conclusion When combining a modest number of experiments, it is best to correct for experiment-specific noise. However, when many experiments are combined, statistical correction reduces our ability to extract underlying patterns.


Introduction:
For the last two decades, unprecedented amounts of transcriptome-wide gene expression profiling data have been generated, most of which are shared in public platforms for the research community. 1 Researchers are now combining samples across different experiments to form compendia, and analyzing these compendia is revealing new biology. [2][3][4][5][6] It is wellunderstood that technical sources of variability pervade large-scale data analysis such as transcriptome-wide expression profiling studies. [7][8][9][10] Numerous methods have been designed to correct for various types of effects. 7,11,12 Despite the prevalence of technical sources of variability, researchers have successfully extracted biological patterns from multi-experiment compendia without applying correction methods. [2][3][4][5]13 We sought to determine the basis of these seemingly contradictory results by examining the extent to which underlying statistical structure can be extracted from compendium-style datasets in the presence of sources of undesired variability.
A number of methods have been developed to simulate transcriptome-wide expression experiments. [14][15][16][17] However, simulating a compendium of many experiments with existing approaches would require defining a statistical model that describes the process by which researchers design and carry out experiments, which is likely to be very challenging. Instead, we developed an approach to simulate compendia by sampling from the low-dimensional representation produced by multi-layer generative neural networks trained on gene expression data from an existing compendium. This allowed us to simulate gene expression experiments that mimic real experimental configurations. We combined these experiments to create compendia.
Using this simulation approach, we studied how adding varying amounts of experiment-specific noise affects the statistical structure of gene expression compendia and our ability to detect underlying patterns. This topic is becoming pressing as more large-scale expression compendia become available. We found that prior reports of pervasive technical noise and analyses that succeed without correcting for it are, in fact, consistent. In settings with relatively few experiment-specific sources of undesired variation, the added noise substantially alters the structure of the data. In these settings, statistical correction produces a data representation that better captures the original variability in the data. On the other hand, when the number of experiment-specific sources of undesired variability becomes large, attempting to correct for these sources does more harm than good.

Results:
We characterized publicly available data compendia using refine.bio 18 , a meta-repository that integrates data from multiple different repositories. We found that an average experiment contained hundreds to thousands of samples in most widely studied organisms (Table 1). These samples were derived from hundreds to thousands of experiments, and the most common experimental designs had relatively few samples (medians from [5][6][7][8][9][10][11][12]. We compared these to two readily available compendia, recount2 and one for P. aeruginosa, that have been used for compendium-wide analyses. 2,3,6 The compendia that have been successfully used in prior work 2,3,6 have similar median numbers of samples per experiment (recount2 = 4, P. aeruginosa = 6) to the current publicly available data.

Constructing a generative model for gene expression samples
We developed an approach to simulate new gene expression compendia using generative multi-layer neural networks. Specifically, we trained a variational autoencoder (VAE) 22 , which was comprised of an encoder and decoder neural network. The encoder neural network compressed the input data through two layers into a low-dimensional representation and the decoder neural network expanded the dimensionality back to the original input size. The VAE learned a low-dimensional representation that can reconstruct the original input data.
Simultaneously, the VAE optimized the lowest dimensional representation to follow a normal distribution ( Figure 1A). This normal distribution constraint, distinguishes VAE's from other types of autoencoders and allowed us to generate variations of the input data by sampling from a continuous latent space. 22 We trained VAEs for each dataset (recount2 and P. aeruginosa). We evaluated the training and validation set loss at each epoch, which stabilized after roughly 100 epochs ( Figure 1B). We observed a similar stabilization after 40 epochs for recount2 ( Figure S1A). We simulated new genome-wide gene expression data by sampling from the latent space of the VAE using a normal distribution ( Figure 1C). We used UMAP 23 to visualize the structure of the original and data and found that the simulated data generally fell near original data for both compendia ( Figure 1D; Figure S1B).

Simulating gene expression compendia with synthetic samples
We designed a simulation study to assess the extent to which artifactual noise associated with individual partitions of a large compendium affects the structure of the overall compendium. Our simulation is akin to asking: if different labs performing transcriptome-wide experiments randomly sampled from the available set of possible conditions, to what extent would experiment-specific biases dominate the signal of the data. We simulated new compendia, randomly divided the samples into partitions, and then added noise to each partition, and compared the simulated compendia with added noise to the unpartitioned one ( Figure 2A). Each partition represents groups of samples with shared experiment-specific noise. We evaluated the similarity before and after applying an algorithm designed to correct for technical noise in each partition -given the linear noise added we used limma 24 to correct.
We performed a study with this design using the VAE trained from the P. aeruginosa compendium for 2 to 6,000 partitions. We found that adding technical noise to partitions always reduced the similarity between the simulated data without partitions and the partitioned data.
However, the nature of the change in similarity differed substantially between the partitioned sets before and after the correction step ( Figure 2B). With the correction step, similarity dropped throughout the range of the study, eventually reaching the same level as the permuted data.
Without the correction step, similarity dropped immediately to near the permuted level and then recovered throughout the rest of the tested range. Examining simulated data on the top 2 principle components from the original data with the corrected and uncorrected data at various numbers of partitions revealed that the correction step removes both wanted and unwanted variability, eventually removing all variability in the data ( Figure 2C). Without correction, the data were initially dramatically transformed; however, as the number of partitions grows the effect on the structure of the data was diminished.
To determine whether or not this was a more general property of such compendia, we repeated the same simulation study using a VAE trained on a recount2 compendium. recount2 is a compendium comprised of human RNA-seq samples, so it is generated using a different technology and consists of assays of a very different organism. Results with recount2 mirrored our findings with the P. aeruginosa compendium. The correction step initially retained more similarity, but performance crossed over and by the end of the study the uncorrected data were more similar to the unpartitioned simulated compendium ( Figure 2D). Examining the top principle components again revealed that correction better retained the structure of the original data with few partitions, but with many partitions the structure was better retained without correction ( Figure 2E). We observed the same trends when we varied the magnitude of the noise added ( Figure S2) or used a different noise correction method, such as COMBAT ( Figure   S3).

Constructing a generative model for gene expression experiments
We randomly selected samples from the range of all possible samples in the compendium. For the next simulation, we developed an approach that could simulate realistic experimental structure. This next simulation added another level of complexity to the model, by simulating experiments as opposed to samples, in order to make the simulated compendia more representative of true expression data. The technique that we developed uses the same underlying approach of sampling from a VAE. However, in this case we randomly selected a template experiment and a vector that would move that template experiment to a new location in the gene expression space ( Figure 3A). The simulation preserved the relationship between samples within the template experiment while also shifting the activity of the samples in the latent space ( Figure 3B). Intuitively, this process maintained the relationship between samples but changed the underlying perturbation. We used this process to generate compendia of new gene expression experiments. We exampled how the original samples in an experiment (E-GEOD-51409) and a simulated experiment generated using E-GEOD-51409 as a template have consistent clustering of samples ( Figure 3C original and experiment level simulated experiment). 25 However the genes that were differentially expressed were different between the two datasets. This demonstrated that the perturbation intensity and experimental design were relatively consistent in gene expression space, even though the nature of the perturbation differed. The simulated experiment had a lower variance compared to the original dataset due to the normality assumption made by the VAE, which compresses the latent space data representation. 22 However, in general, the clustering of samples is conserved between the simulated and original experiments, as observed in the additional template experiments with more complex structures ( Figure S4).
Given the fact that we preserved the association between samples and experiments in this new simulation, we would expect that new experiments would preserve the correlation in expression of genes that are in the same pathway. In our previous example experiment, E-GEOD-51409, the simulated experiment generated using the original E-GEOD-51409 as a template (i.e. experiment-level) identified 14 differentially expressed genes ( Figure 3C). In contrast, the simulated experiment generated by randomly sampling (i.e. sample-level) did not identify any differentially expressed genes; the median log2 fold change was 0.08. Furthermore, when simulating 100 new experiments using E-GEOD-51409 as a template, the experiments generated using the workflow in Figure 3A identified a median of 2,588 differentially expressed genes compared to those new experiments generated by randomly sampling from the compendium resulting from the workflow in Figure 1C ( Figure 3D) which identified a median of 0 differentially expressed genes. Additionally, the median number of enriched KEGG pathways is 1 using the workflow in Figure 3A compared to 0 using the random sampling approach using the previous simulation strategy ( Figure 3E). Overall, it appears that this new simulation approach generated a compendium of experiments with some real underlying biology and therefore this new simulation represents a more realistic simulation compared to the previous one. Examples of the significantly enriched pathways can be seen in Table 2. The top over-represented pathway is the ribosome pathway, which is likely a commonly altered pathway found in many experiments regardless of experiment type. 26 The remaining pathways found in the original experiment were generally metabolism related, which is consistent with the finding from the original publication. 25 The simulated experiment was particularly enriched in sulfur metabolism and ABC transporters, which is consistent with a different previous experiment that found upregulation of transport systems in response to sulfate limitations. 27 Overall, in accordance with real gene expression experiments, the new simulated experiments contain related groups of enriched pathways which reflect the specific hypothesis being tested.

Simulating gene expression compendia with synthetic experiments
We used our method to simulate new experiments that follow existing patterns to examine the patterns that we observed for generic partitions ( Figure 4A). We simulated 600 experiments using the P. aeruginosa compendium. We divided these experiments into partitions. These partitions represent groupings of experiments with shared noise, such as experiments from the same lab or experiments with the same experimental design. Each partition contains technical sources of variance within and between experiments. Results with simulated experiments were similar to those from arbitrarily partitioned samples. We observed a monotonic loss of similarity after the correction step as the number of partitions increased ( Figure 4B). Visualizing the top principal components revealed that statistical correction initially better recapitulated the overall structure of the data but that with many partitions similarity decreased ( Figure 4C, dark blue).
Without correction there was a larger initial drop in similarity but a later recovery ( Figure 4B) and visualizing the top principal components recapitulated this finding ( Figure 4C, observed consistent results with this dataset using both SVCCA similarity ( Figure 4D) and visual inspection of the top principal components ( Figure 4E). In summary, as the number of partitions increase the experiment-specific technical sources contribute less overall to the signal and the underlying patterns dominate the overall signal. When many partitions are present, even ideal statistical approaches to correct for noise over-corrects and removes the underlying signal.

Discussion:
Our findings reveal that compendia-wide analyses do not always require correction for experiment-specific technical variance and that correcting for such variance may remove signal.
This simulation study provides an explanation for the observation that past studies 2 preclude us from defining a specific threshold for noise correction, these simulations define a set of general properties that will guide compendia analyses moving forward.
We introduce a new method to simulate genome-wide gene expression experiments, using existing gene expression data as starting material, which goes beyond simulating individual samples. This allowed us to examine the extent to which our findings hold with realistic experimental designs. The ability to simulate gene expression experiments with a realistic structure may have many potential legitimate uses: e.g., pre-training for machine learning models, providing synthetic test data for software, and other such applications. Additionally, this simulation technique can be used to explore hypothetical experiments that have not been previously performed and generate hypotheses. However, such approaches could also be used by nefarious actors to generate synthetic data for publications. Forensic tools that detect synthetic genome-wide data may be needed to combat potential fraudulent uses.
Our study has several limitations. We assume a certain noise model that differs between experiments. However, the sources of noise are multifaceted and any such assumption will necessarily be an oversimplification, though such assumptions are not uncommon. 10,12,28 By selecting a specific noise model and using an ideal noise-removal step, we provide a best case scenario for artifact removal. While any simulation study will necessarily make simplifying assumptions, this work is the first to use deep generative models as part of a simulation study to probe the long-standing assumption that correcting for technical variability is necessary for analyses that span multiple experiments. Our findings reveal that in settings with hundreds or thousands of experiments, correcting for experiment-specific effects can harm performance and that it can be best to do nothing.
Our study also has broader implications for efforts to standardize scientific processes.
Centralization of large-scale data generation has the potential to reduce experiment-specific technical noise, though it comes at a cost of flexibility. Our results suggest that a highly distributed process where experiments are carried out in many different locations with their own specific sources of technical noise can also lead to valuable data collections.

Pseudomonas aeruginosa gene expression compendium
We downloaded a compendium of P. aeruginosa data that was previously used for compendium-wide analyses. 2 Previous studies identified biologically-relevant processes such as oxygen deprivation 2 and phosphate starvation 3 by applying denoising autoencoders. We obtained the processed and normalized gene expression matrices from the associated GitHub repository. The P. aeruginosa dataset was previously processed by Tan et. al. 2 During processing, raw microarray data were downloaded as .cel files, rma was used to convert probe intensity values from the .cel files to log2 base gene expression measurements, and these gene expression values were then normalized to 0-1 range per genes.

recount2 gene expression compendium
We downloaded human RNA-seq data from recount2. 29 The dataset includes over 70,000 samples collected from Sequencing Read Archive (SRA). It is comprised of more than 50,000 samples from different types of experiments, roughly 10,000 samples from Genotype-Tissue Expression project (GTEx v6) covering 44 types of normal tissue, and more than 10,000 samples from The Cancer Genome Atlas (TCGA) measuring 33 cancer types. 19,30,31 The recount2 authors uniformly processed and quantified these data. We downloaded data using the recount library in Bioconductor (version 1.14.0). 29 The entire recount2 dataset is 8TB. Based on the P. aeruginosa compendium we expected that a subset of the compendium would be sufficient for this simulation, so we selected a random subset of 50 NCBI studies, which resulted in 896 samples with 58,037 genes for our simulation. Each project (imported from NCBI BioProject) is akin to an experiment in the P. aeruginosa compendium, and we used the term experiment instead of projects in order to maintain consistency in this paper. The downloaded recount2 dataset was in the form of raw read counts, which was normalized to produce RPKMs used in our analysis. The normalized gene expression data was then scaled to a 0-1 range per gene.

Constructing a generative model of gene expression compendia
We designed an approach to simulate gene expression compendia with a multi-layer variational autoencoder (VAE). We built this model in Keras (version 2.1.6) with a TensorFlow backend (version 1. 10.0) based on the Tybalt software for gene expression VAEs. [32][33][34] Our architecture used each input gene as a feature. We compressed these genes to 2,500 intermediate features using a rectified linear unit (ReLU) activation function to combine weighted nodes from the previous layer. We encoded these features into 30 latent space features, which we optimized to follow a standard normal distribution via the addition of a Kullbach-Leibler (KL) divergence term into the loss function. We then reconstructed these features back to the input space using decoding layers that mirror the encoder network. We trained the VAE using 90% of the input dataset, leaving 10% as a validation set. We determined training hyperparameters by manually adjusting parameters and selecting parameters that optimized the validation loss. These were a learning rate of 0.001, a batch size of 100, warmups set to 0.01, 100 epochs for the P.
aeruginosa compendium and 20 epochs for the recount2 compendium.
We used the VAE trained from each compendium to generate new compendia by randomly sampling from the latent space. We generated a simulated compendium containing 6,000 P.
aeruginosa samples or 500 recount2 samples. For our first simulation, we sampled randomlyignoring the relationship between samples with a specific experiment. We simulated experiment-specific sources of undesired variability within compendia by dividing the data into partitions and adding noise to each partition.
We divided the P. aeruginosa simulated compendium into [1, 2, 5, 10, 20, 50, 100, 500, 1000, Though linear noise is an over-simplification of the types of noise that affect gene expression data, it allowed us to design an approach to optimally remove noise. Adjusting the choices of normalization, noise magnitude, and noise patterns will result in different selections of the precise cross-over point where it becomes beneficial to perform correction. With this design, we do not expect that it is possible to estimate exactly where this precise cross-over point is. That would require a compendium where investigators systematically performed the same combination of different experiments in multiple labs at different times. We were unable to identify such a compendium on the scale of thousands of samples from tens to hundreds of labs. Thus, though our analysis necessarily includes simplifying assumptions that limit our ability to precisely define the thresholds for correction for arbitrary datasets and noise sources, it remains suitable for examining the overriding principles that govern compendium-wide analyses.
We also designed an approach to generate gene expression compendia with realistic experimental designs. There was no consistent set of annotated experimental designs, so we developed a simulation method that did not depend on a priori knowledge of experimental design. For each synthetic experiment, we randomly sampled a "template experiment" from the set of P. aeruginosa or recount2 experiments. We then simulated new data that matched the template experiment by selecting a random sample from the low dimensional representation of the simulated compendia and calculating the vector (Z shift ) that connected this random sample (Z random ) and the encoded template experiment (X template _ experiment ) using equation 1: Z shift = centroid(Z template_experiment ) -Z random (eqn 1) We then linearly shifted the template experiment in the low-dimensional latent space by adding this vector to each sample in the experiment as seen in equation 2.
Z new_experiment = Z template_experiment + Z shift (eqn 2) This process preserves the relationship between samples within the experiment but shifts the samples to a new location in the latent space. Intuitively this simulation maintains the same experimental design but is akin to studying a distinct biological process. Repeating this process for each experiment allowed us to generate new simulated compendia comprised of realistic experimental designs.
We divided the P. aeruginosa simulated compendium into

Removing technical variability from noisy compendia
Our model of undesired variability is a linear signature applied separately to each partition of the data, which we consider akin to experiments or groups of experiments in a compendium of gene expression data. We used the removeBatchEffect function in the R library, limma (version 3.44.0), to correct for the technical variation that was artificially added to the simulated compendia. 24 Limma removes the technical noise by first fitting a linear model to describe the relationship between the input gene expression data and the experiment labels. The input expression data contains both a biological signal and technical noise component. By fitting a linear model, limma will extract the noise contribution and then subtract this from the total input expression data. This method presents a best-case scenario for removing the undesired variability in the simulated compendia because the model matches the noise pattern we've used in the simulation.

Measuring the similarity of matched compendia
We used Singular Vector Canonical Correlation Analysis (SVCCA) 35 to estimate similarities between different compendia. SVCCA is a method designed to compare two data representations 35 . Given two multivariate datasets, X 1 and X 2 , the goal of SVCCA is to find the basis vectors, w and s, to maximize the correlation between w T X 1 and s T X 2 . In other words, SVCCA attempts to find the space, defined by a set of basis vectors, such that the projection of the data onto that space is most correlated. Two datasets are considered similar if their linearly invariant correlation is high (i.e., if X 1 is a shift or rotation of X 2 then X 1 and X 2 are considered similar).
We compared the statistical structure of the gene expression, projected onto the first 10 principle components, in the baseline simulated compendia (those with only one experiment or partition, X 1 ) versus those with multiple experiments or partitions (X 2 ). Our SVCCA analysis was designed to measure the extent to which the gene expression structure of the compendia without noise is similar to the gene expression structure of the compendia with multiple sources of technical variance has been added as well as those where correction has been applied. Here we use 10 principle components for computational simplicity. Selecting a different value would affect the crossover point but not the general trends that we describe.

A case study of differential expression in a template experiment
We compared the E-GEOD-51409 experiment 36 with two different simulated representations to provide a case study for experiment-based simulation. E-GEOD-51409 included P. aeruginosa in two different growth conditions. For one simulation, we generated random samples and randomly assigned them to conditions, which we termed the sample-simulated experiment. For the second we used the latent space transformation process described above, which we termed the experiment-simulated experiment. We used the eBayes module in the limma library to

Comparing sample-level and experiment-level simulated datasets
We simulated 100 experiments using the template E-GEOD-51409 experiment 36 . We sought to compare the sample-level and experiment-level simulation processes. We set a threshold for differentially expressed genes at a Bonferroni-corrected p-value cutoff of 0.05. We used the enrichKEGG module in the clusterProfiler library to conduct an over-representation analysis 37 .
We used the Fisher's exact test to calculate a p-value for over-representation of pathways in the set of differentially expressed genes. We considered pathways to be over-represented if the Bonferroni corrected p-value was less than 0.05.

Implementation and Software Availability
All scripts to reproduce this analysis are available the GitHub repository     workflow describing how experiment-specific noise was added to the simulated compendia and how the simulated compendia were evaluated for similarity compared to the original input compendia. B,D) SVCCA curve measuring the similarity between a compendia without noise versus a compendium with noise (light blue), compendium with noise corrected for (dark blue).
As a negative control, we used the similarity between the gene expression pattern of the simulated data with a single partition compared with the simulated data that has been permuted to destroy any meaningful structure in the data. C,E) Subsampled gene expression data (500 samples per compendia) projected onto the first two principal components showing the overlap in structure between the compendia without noise (gray) versus the compendia with noise (light blue), compendia with noise corrected for (dark blue).