MetaNorm: incorporating meta-analytic priors into normalization of NanoString nCounter data

Abstract. Motivation: Non-informative or diffuse prior distributions are widely employed in Bayesian data analysis to maintain objectivity. However, when meaningful prior information exists and can be identified, using an informative prior distribution to accurately reflect current knowledge may lead to superior outcomes and greater efficiency. Results: We propose MetaNorm, a Bayesian algorithm for normalizing NanoString nCounter gene expression data. MetaNorm builds on RCRnorm, a powerful method designed around an integrated series of hierarchical models that allow various sources of error to be explained by different types of probes in the nCounter system. Despite its impressive performance, RCRnorm is weakened by a lack of accurate prior information, low computational efficiency, and occasional instability of estimates. MetaNorm employs priors carefully constructed from a rigorous meta-analysis to leverage information from large public datasets. Combined with additional algorithmic enhancements, MetaNorm improves on RCRnorm by yielding more stable estimates of normalized values, better convergence diagnostics, and superior computational efficiency. Availability and implementation: R code for replicating the meta-analysis and the normalization function can be found at github.com/jbarth216/MetaNorm.


Normalization methods for nCounter gene expression data
The medium-throughput platform NanoString nCounter has quickly become one of the most popular and efficient ways to analyze complex mRNA transcripts (Geiss et al. 2008). One reason for this is its ability to process formalin-fixed, paraffin-embedded (FFPE) tissue samples effectively. While fresh frozen (FF) samples experience less degradation in storage, FFPE samples have less stringent storage requirements and are therefore cheaper and easier to retain (Perlmutter et al. 2004). This has led to a ubiquity of FFPE samples and has made the nCounter system an invaluable resource in medical research. Among other areas, the nCounter framework has been used to analyze FFPE samples in studies of colon, breast, and lung cancer (Chen et al. 2016, Walter et al. 2016, Lim et al. 2020). However, a significant downside to using FFPE samples is the level of mRNA "modification" during the preservation and storage of the sample (Masuda et al. 1999), which causes higher levels of variability and uncertainty in the readings. Thus, it is critical that the gene read counts undergo an efficient normalization procedure before being formally analyzed.
In addition to degradation in FFPE samples, normalization can also help to remove background noise or lane-by-lane variation, which is common in FF samples as well. Perhaps the simplest approach to normalization is NanoStringNorm, an R package that implements NanoString guidelines and relies on summary statistics from the positive, negative, and housekeeping probes to account for separate types of variation (Waggott et al. 2012). Other normalization methods, such as NAPPA (R package: http://CRAN.R-project.org/package=NAPPA) and NanoStringDiff (Wang et al. 2016), use model-based approaches that can better account for the complexities in the data. Despite the improvements of NAPPA and NanoStringDiff, all three of these approaches use each type of probe in the same way, in that positive probes only remove lane-by-lane variation, negative probes only remove background noise, and housekeeping genes only remove variation in the amount of input sample material. More complex approaches include RUVseq (Risso et al. 2014), which relies on factor analysis to remove unwanted technical variation, and RCRnorm (random coefficient hierarchical regression normalization; Jia et al. 2019). RCRnorm is unique in that (i) it is an integrated system of hierarchical models for the different types of probes, (ii) it is designed specifically to normalize FFPE data (but can be used with other types of samples), (iii) it allows for sample-to-sample variation in the housekeeping genes (Eisenberg and Levanon 2013), and (iv) it uses a Bayesian framework for normalization, thus allowing statistical inference about key model parameters and uncertainty quantification for estimates while leading to better interpretability. Essentially, the RCRnorm model assumes that all probes in a given sample share an intercept and slope in a linear regression of log10(gene read count) on log10(RNA amount). While other probe-specific effects are included, this allows all types of probes to influence the normalization process, yielding more robust estimates (technical details of RCRnorm are outlined in Section 1.3). In general, RCRnorm is much more flexible than its frequentist counterparts, with the Gibbs sampler able to explore the extremely high-dimensional parameter space and land on the targeted posterior distribution. The authors also show a significant improvement in the correlation between FFPE samples and their corresponding FF samples, whose gene read counts are thought to be closer to the ground truth than the FFPE counts, with RCRnorm outperforming other normalization methodology (including RUVseq) at both the gene and patient level (Jia et al. 2019). Additionally, RCRnorm, along with RUVseq, has been found to remove more technical variation than its counterparts, including nSolver and NanoStringDiff (Bhattacharya et al. 2021).

Motivation to improve RCRnorm
Despite its impressive performance, RCRnorm is not without its drawbacks. The weakest point is its high computational cost, especially compared to other approaches (Bhattacharya et al. 2021). Large datasets require a significant amount of time and working memory to produce results, making RCRnorm an occasionally inconvenient choice for normalization. Another area of concern is the high number of parameters, which is typical of Bayesian hierarchical models, where it is common to rely on priors to regularize these parameters. Still, in a highly complex system, an abundance of parameters combined with an absence of accurate prior information can cause weak convergence. Further, a concern closer to the crux of the RCRnorm model is the construction of the prior distributions for the hyper-parameters μ_a and μ_b. These hyper-parameters represent the means of the sample-specific intercept and slope terms (respectively), which should reflect how the nCounter system inherently operates rather than the characteristics of an individual dataset. They have a significant impact on model estimates, and so the results of RCRnorm are sensitive to the choice of prior distribution placed on these hyper-parameters. RCRnorm uses a jackknife approach on the positive probe data to estimate the prior mean and variance for μ_a and μ_b (a normal distribution is assumed), so that the priors vary across datasets (Jia et al. 2019). This makes the results of RCRnorm more sensitive to the size and quality of the dataset being normalized, since any data-based prior has no ability to mitigate potential bias that can arise from sparse or unreliable data. For these reasons, identifying and implementing new prior distributions for these quantities will improve model estimation, especially for messy or sparse datasets.
Since μ_a and μ_b are important global parameters in the RCRnorm model, replacing the jackknife priors with non-informative priors would not effectively mitigate the potential bias and noise stemming from low-quality data. Rather, μ_a and μ_b reflect the underlying mechanism of the nCounter system, and constructing an informative prior distribution based largely on existing data of the same type will help avoid excessive data-dependence and provide external calibration for low-quality data. In this big-data era, accessing data from multiple independent studies has become increasingly easy, due to federal regulations and substantial efforts made in data sharing. Thus, leveraging such information to construct informative yet objective priors becomes feasible.
There are two goals of this study. The first is to create these prior distributions via a rigorous meta-analysis of public FFPE gene expression datasets from independent studies using the NanoString nCounter platform. To do so, we devise a Bayesian hierarchical model, which adopts the linear regression setup with random coefficients from RCRnorm. However, it is distinct from the RCRnorm model in several aspects. First, it deals with data from multiple studies and includes a layer to account for study-specific effects. Second, for reasons detailed at the beginning of Section 2, this new model focuses on modeling data from positive control probes only. Thus, it is able to allow probe-specific variance terms, removing a simplifying assumption made by RCRnorm (i.e. constant variance across all positive controls). Third, unlike RCRnorm, the new model, with much more data available in a meta-analysis, is able to account for the dependence between the sample-specific intercepts and slopes.
The second goal is to implement these informative prior distributions along with other important enhancements to improve the performance of RCRnorm. We refer to this new algorithm as MetaNorm. Among these enhancements are (i) optimized code and data structures, (ii) implementation of constraints on the positive and negative probe residuals, and (iii) a simplified sampling approach for normalized RNA expression estimates for housekeeping and regular genes. We show that these adjustments, along with the informative meta-analysis prior, improve computational cost, model convergence, prediction error, and performance with low-quality data. In the remainder of this section, we outline in detail the original RCRnorm model (Jia et al. 2019). In Section 2, we provide an overview of our meta-analysis and a detailed summary of other improvements made to RCRnorm. In Section 3, we compare the performance of MetaNorm with RCRnorm on four real-world datasets. We make a similar comparison to RUVseq in Section 4. Finally, we provide a discussion of our findings in Section 5. Technical details, including the full probability model and derivations of full conditionals for the meta-analysis, are provided in the supplementary material, along with a table displaying the details of the datasets involved.

Overview of the RCRnorm model
We conclude the Introduction with an overview of the notation used in this paper and a description of the RCRnorm model. Each FFPE dataset has I patient samples (indexed by i) with P positive probes, N negative probes, H housekeeping genes, and R regular genes (indexed by p, n, h, and r, respectively). The datasets have a factorial structure, with each sample/gene combination represented exactly once per dataset (I·(P + N + H + R) total observations). Typically, there are six positive probes (P = 6), a relatively small number of negative probes and housekeeping genes (i.e. N < 10 and H < 20), and a high number of regular genes (R > 80). Each of the six positive probes contains a known amount of mRNA, while the amount of mRNA for a regular or housekeeping gene is unknown and varies across samples. These mRNA levels must be estimated using the read counts. For all negative probes, target transcripts are absent, and so their mRNA levels should ideally be zero. The aim of RCRnorm is to create consistent, unbiased estimates of the mRNA amount for every gene/sample combination that adjust for unwanted biological and technical effects. The remainder of this section summarizes the outline and technical details of the model (Jia et al. 2019).
The RCRnorm model is based on a series of integrated linear regression models with random coefficients. At the lowest level, the log10 gene read count Y for each of the four probe types is modeled as

$$Y^{+}_{ip} = a_i + b_i X^{+}_{p} + d_{ip} \quad (1)$$
$$Y^{-}_{in} = a_i + b_i c + d^{-}_{in} \quad (2)$$
$$Y^{*}_{ih} = a_i + b_i X^{*}_{ih} + e^{*}_{ih} \quad (3)$$
$$Y_{ir} = a_i + b_i X_{ir} + e_{ir} \quad (4)$$

where the superscripts (+, −, *) indicate positive probes, negative probes, and housekeeping genes, respectively (no superscript on gene-specific variables indicates regular genes); the log10 RNA levels are known and fixed for each positive probe (X⁺_p) in all samples but are unknown and differ across all housekeeping and regular gene observations (X*_ih, X_ir); and the unknown scalar c for negative probes can be interpreted as the mean non-specific binding level due to background noise. The intercept and slope terms (a_i, b_i) reflect how the log read count Y relates to the log RNA level X in each sample and are assumed to be independent and identically distributed normal variables: a_i ~ N(μ_a, σ²_a) and b_i ~ N(μ_b, σ²_b). Note that for a given sample, these terms are shared by all four probe categories and so appear in all four equations. The term φ_i represents the sample-specific degradation level (with the constraint Σ_i φ_i = 0), whereas κ_ir and κ*_ih signify sample/gene-specific mRNA levels before degradation (these parameters can be thought of as the model output), so that X_ir = κ_ir + φ_i and X*_ih = κ*_ih + φ_i, with κ_ir ~ N(λ_r, σ²_κ) and, analogously, κ*_ih ~ N(λ*_h, σ²_κ). The terms (d, e) represent deviations from the general linear pattern and can be thought of as probe-specific residuals. Jia et al. (2019) suggest that all probe types have similar variability except for the negative probes, which tend to have higher variances; hence the negative-probe variance σ²_{d⁻} in (2) is allowed to differ from the variances σ²_d and σ²_e shared by the other probe types in (1), (3), and (4). The left panel of Fig. 1 summarizes the hierarchical structure of the RCRnorm model. Note that the hyper-parameters μ_a and μ_b have data-based jackknife priors, which we seek to improve with the results of our meta-analysis. See Sections 3 and 4 of Jia et al. (2019) for more details of RCRnorm.
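As a concrete illustration, the positive- and negative-probe regressions can be simulated directly. This is a minimal sketch of the data-generating process only; all hyperparameter values below (and the background level c) are invented for illustration and are not RCRnorm's priors or estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative hyperparameter values only; NOT RCRnorm's priors or estimates.
mu_a, sigma_a = 2.36, 0.10       # mean/SD of sample intercepts a_i
mu_b, sigma_b = 0.95, 0.02       # mean/SD of sample slopes b_i
I, P, N = 4, 6, 3                # samples, positive probes, negative probes
X_pos = np.log10([128, 32, 8, 2, 0.5, 0.125])   # known log10 RNA levels

a = rng.normal(mu_a, sigma_a, size=I)   # a_i ~ N(mu_a, sigma_a^2)
b = rng.normal(mu_b, sigma_b, size=I)   # b_i ~ N(mu_b, sigma_b^2)

# Positive probes: Y+_ip = a_i + b_i * X+_p + residual
Y_pos = a[:, None] + b[:, None] * X_pos[None, :] + rng.normal(0, 0.05, (I, P))

# Negative probes: Y-_in = a_i + b_i * c + residual (c = assumed background level)
c = -1.5
Y_neg = a[:, None] + b[:, None] * c + rng.normal(0, 0.15, (I, N))

print(Y_pos.shape, Y_neg.shape)
```

Every negative-probe column shares the same mean within a sample, reflecting that negative probes carry no target transcript and read only background.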

Materials and methods
The first step in improving RCRnorm is to identify realistic, informative prior distributions to replace the current priors for μ_a and μ_b. We maintain the assumption that the priors are independent normal distributions (the conjugate priors for μ_a and μ_b), but rather than relying on the data currently being analyzed, the MetaNorm prior comes from a meta-analysis. We identified multiple independent NanoString nCounter gene expression datasets of FFPE samples from past studies and designed a Bayesian hierarchical model for meta-analysis to attain parameter estimates for the prior mean and variance of μ_a and μ_b irrespective of any specific dataset. For positive control probes, both Y (the log-transformed count) and X (the input RNA amount) are known, so we have strong information from the data about the intercept and slope terms. In contrast, for housekeeping and regular genes, only Y is known and X is latent with a complex underlying structure, so such information is much weaker. Since our main interest is in the intercept and slope terms, the meta-analysis only considers these positive controls (six per sample) to find the posterior distributions for μ_a and μ_b. As will be shown later, this also allows us to remove simplifying assumptions made by RCRnorm for the purpose of estimation stability. The remainder of this section outlines the design, data, and results of this meta-analysis.

Datasets
We identified 17 potential datasets containing mRNA FFPE samples from human subjects: 14 identified in the Gene Expression Omnibus with search terms "mRNA," "FFPE," and "nCounter," and three identified through collaborators or made available by authors on GitHub. Two of these datasets (ids 4 and 5) were used for testing the RCRnorm model and were removed from the meta-analysis pool so as to be completely independent of the original study. Two other datasets were removed for low data quality, one having a high number of low-quality samples by the NanoString quality control guidelines (R² of a linear regression on the positive probe data below 0.95) and the other having a high positive correlation between a_i and b_i estimates, contrary to the rest of the pool. This left us with K = 13 datasets, which range in size from n_13 = 8 up to n_9 = 1,950 samples. All 13 datasets had six positive probes, each having the same control RNA levels (128, 32, 8, 2, 0.5, and 0.125 fM for probes 1-6, respectively).

Design
This section outlines the hierarchical model used to create our meta-analysis prior. We assume that the log10 count of each sample/probe/study combination, Y_ijk, is characterized by

$$Y_{ijk} = a_{ik} + b_{ik} X^{+}_{j} + r_{ijk} \quad (5)$$

where X⁺_j is the log10 RNA level for each probe j, which is fixed across all samples and studies at log10(128, 32, 8, 2, 0.5, 0.125); a_ik and b_ik are the intercept and slope terms for sample i of study k; and r_ijk is the residual term reflecting the remaining variability of Y_ijk after taking into account the linear trend, for i = 1, ..., I_k, j = 1, ..., 6, and k = 1, ..., K (I_k is the number of samples in study k and K = 13). This structure maintains the original design of RCRnorm while accounting for multiple data sources.
First, we focus on modeling the random regression coefficients a_ik and b_ik in (5). Unlike RCRnorm, the meta-analysis includes samples from a variety of datasets, meaning heterogeneity between data sources is a potential factor. To examine this, we first calculated empirical estimates of all a_ik and b_ik by fitting individual linear regressions of the form

$$Y_{ijk} = \hat{a}_{ik} + \hat{b}_{ik} X^{+}_{j} + \hat{r}_{ijk} \quad (6)$$

where each (â_ik, b̂_ik) is calculated from the six positive control observations of sample i (j = 1, ..., 6). Figure 2a and b summarizes the estimates obtained from this analysis. There is clear heterogeneity in the distributions, both in center and variance, when broken down by dataset. Therefore, it is reasonable to assume that the parameters associated with a_ik and b_ik differ across datasets. This leads us to a separate bivariate normal distribution in (7) for each study k, where the study-specific means a_k and b_k are the analogues of μ_a and μ_b in the RCRnorm model.
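These per-sample fits are ordinary least-squares regressions on six points each. A minimal sketch, with simulated counts standing in for real nCounter data (the "true" coefficients below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.log10([128, 32, 8, 2, 0.5, 0.125])   # fixed positive-probe log10 RNA levels

# Simulated log10 counts for 3 samples; real Y would come from nCounter output
true_a = np.array([2.3, 2.4, 2.5])
true_b = np.array([0.94, 0.96, 0.95])
Y = true_a[:, None] + true_b[:, None] * X[None, :] + rng.normal(0, 0.02, (3, 6))

# One OLS fit per sample: six positive-control points yield (a_hat_ik, b_hat_ik)
design = np.column_stack([np.ones(6), X])              # intercept + slope columns
coef, *_ = np.linalg.lstsq(design, Y.T, rcond=None)    # 2 x 3: row 0 = a_hat, row 1 = b_hat
a_hat, b_hat = coef[0], coef[1]
print(np.abs(a_hat - true_a).max() < 0.05, np.abs(b_hat - true_b).max() < 0.05)
```

With only six points per sample, these empirical estimates are noisy on their own, which is exactly why the hierarchical model pools them within and across studies.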
The next step is to determine the nature of the covariance matrix Σ^(k), i.e. whether or not a_ik is independent of b_ik. Figure 2c shows a breakdown of the correlation between the empirical estimates â_ik and b̂_ik by dataset. Several datasets show positive correlation while others show negative correlation, justifying modeling a study-specific correlation rather than assuming a common correlation value across all studies. Although a few datasets do not have enough evidence to reject ρ = 0 at the α = 0.05 level, 12 of the 17 datasets have 95% confidence bounds strictly above or below 0, meaning we cannot assume that a_ik and b_ik are independent of each other within the same dataset k. The implication is that Σ^(k) must be modeled with a multivariate distribution rather than independent univariate distributions. Several candidate distributions were considered, but a thorough simulation showed little difference in performance between prior distributions (interested readers can find the results of this simulation in Supplementary Section S3). We therefore chose the path of least resistance by implementing a classic inverse-Wishart distribution in (8), which is the conjugate prior for the covariance matrix of a multivariate normal distribution and allows for both simplicity and computational ease. Finally, the mean parameters a_k and b_k are modeled independently, since their empirical estimates have a correlation near 0.
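For intuition, an inverse-Wishart prior with three degrees of freedom places mass on 2×2 covariance matrices whose implied correlations span both signs, which is what the dataset-specific correlations in Fig. 2c require. A quick sketch; the identity scale matrix here is an assumption made for illustration, not the value used in the meta-analysis:

```python
import numpy as np
from scipy.stats import invwishart

# IW_3 prior on a 2x2 covariance of (a_ik, b_ik); identity scale is illustrative
draws = invwishart(df=3, scale=np.eye(2)).rvs(size=1000, random_state=2)

# Implied correlations between a_ik and b_ik under this prior
corrs = draws[:, 0, 1] / np.sqrt(draws[:, 0, 0] * draws[:, 1, 1])
print(corrs.min() < 0 < corrs.max())   # both signs occur across draws
```

Conjugacy means the full conditional of Σ^(k) is again inverse-Wishart, so the Gibbs update is a single draw like the one above with an updated scale matrix.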
The hyperparameters μ_a and μ_b have "safe range" uniform priors, while σ²_a and σ²_b have diffuse inverse-gamma priors. Safe-range prior distributions are uniform distributions whose range comfortably covers all plausible values of the parameter (usually with a half-width of 3 or 5 SDs).
The full hierarchical structure of the slope and intercept terms is summarized below, where IW_3 is an inverse-Wishart distribution with three degrees of freedom:

$$(a_{ik}, b_{ik})^\top \sim N_2\big((a_k, b_k)^\top, \Sigma^{(k)}\big) \quad (7)$$
$$\Sigma^{(k)} \sim IW_3(\cdot) \quad (8)$$
$$a_k \sim N(\mu_a, \sigma^2_a), \qquad b_k \sim N(\mu_b, \sigma^2_b)$$
We are now left with the residual term r_ijk from (5), which is modeled by the following hierarchical structure:

$$r_{ijk} = s_{jk} + d_{ijk} \quad (9)$$
$$\sum_{j=1}^{6} s_{jk} = 0, \qquad \sum_{j=1}^{6} s_{jk} X^{+}_{j} = 0 \quad (10)$$

This structure was carefully chosen to reflect what was observed in the data and to achieve computational efficiency and stability. Figure 2d shows the mean residual for each probe by dataset. While some probes seem to be centered at 0, there is a clear pattern in the residuals, especially for probes 5 and 6. This indicates that the model for r_ijk should have probe-specific terms. Since there is significant heterogeneity in the probe effect between datasets, these parameters are specific to both the probe and the dataset (i.e. indexed by j and k). The residual term is therefore broken into two pieces, shown in (9). The parameters s_jk are assumed to have a probe-specific mean t_j and a shared variance s², as the points by probe in Fig. 2d have centers clearly distinct from one another but similar variability. We further enforce the constraints shown in (10) to avoid issues with identifiability, making s_jk dependent on the other probe effects from the same dataset. We include these constraints to stabilize our distribution in the absence of strong prior information, relying on the fact that the log-transformed positive probe data have a strong linear relationship with the log-transformed control RNA expression levels in each study alone. Since these terms represent probe-specific residuals, we can reasonably expect them to follow the same rules as residuals, namely that they sum to 0 both with and without multiplication by the covariate values. While this is a frequentist technique, there is significant value in using frequentist and Bayesian approaches together, despite the tendency to think of them as separate, binary categories (Bayarri and Berger 2004). In this case, our analysis benefits from the constraints by reaching quick convergence without using artificially informative prior distributions. A "safe range" uniform prior is applied to t_j (subject to the same constraints used for s_jk), while a non-informative IG prior is used for s². Finally, d_ijk acts as our pure residual term with no pattern remaining, which we center at 0. Given the diversity shown in this section by probe and by dataset, we assume that the variance of this residual differs by both factors, and we use a diffuse IG prior for the σ²_jk terms. The right panel of Fig. 1 gives a concise summary of the variable relationships in the meta-analysis; the values we want to identify are represented by the green parameters.
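The two identifiability constraints amount to forcing the probe effects to be orthogonal to both the intercept direction and the covariate, which can be imposed by projection. A sketch under that interpretation; the function name is illustrative, not the package's API, and the sampler's actual constrained update is described in the supplementary material:

```python
import numpy as np

X = np.log10([128, 32, 8, 2, 0.5, 0.125])   # positive-probe log10 RNA levels
C = np.column_stack([np.ones(6), X])         # directions the constraints zero out

def constrain(s_raw):
    """Project raw probe effects so that sum(s) = 0 and sum(s * X) = 0."""
    beta, *_ = np.linalg.lstsq(C, s_raw, rcond=None)   # component in span{1, X}
    return s_raw - C @ beta                            # orthogonal remainder

s = constrain(np.array([0.30, -0.10, 0.05, 0.00, -0.20, 0.40]))
print(abs(s.sum()) < 1e-10, abs(s @ X) < 1e-10)
```

The projection removes any component of the probe effects that the sample-level intercepts and slopes could absorb, which is exactly the source of the non-identifiability.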

Diagnostics and results
The MCMC algorithm for posterior sampling was written in R (R Core Team 2019) and implemented via a Gibbs sampler using the Rcpp package (Eddelbuettel and Francois 2011), which greatly reduced the computational cost. The meta-analysis model was run via four parallel chains of 12 000 samples, with each chain needing no more than 5000 samples to converge.
For each parameter, the trace plots show that all four chains converge to the same distribution, which is further supported by the Gelman-Rubin diagnostics (median and 95% potential scale reduction factors at 1.00 for all four global parameters). Table 1 shows summary statistics for the posterior draws across all chains. Since we observed relatively low correlation among sequential draws, with all variables below 5% correlation at a lag of 2, we thin the chains after burn-in by one draw, so every other sample is used to calculate the summary statistics (a total of 14 000 samples). Therefore, based on posterior means for the μ parameters and posterior medians for the σ² parameters, we take our meta-analysis priors for μ_a and μ_b to be μ_a ~ N(2.357, 0.041) and μ_b ~ N(0.952, 0.003).
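The potential scale reduction factor cited above can be computed from parallel chains. A minimal sketch of the classic Gelman-Rubin statistic (simplified; modern variants add splitting and rank-normalization), with simulated chains standing in for MCMC output:

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor for an array of shape (m_chains, n_draws)."""
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)        # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()          # within-chain variance
    return np.sqrt(((n - 1) / n * W + B / n) / W)  # sqrt of pooled/within ratio

rng = np.random.default_rng(3)
mixed = rng.normal(0, 1, size=(4, 5000))           # four well-mixed chains
stuck = mixed + np.array([0, 0, 0, 3])[:, None]    # one chain off in its own mode
print(gelman_rubin(mixed) < 1.05, gelman_rubin(stuck) > 1.1)
```

Values near 1.00, as reported for all four global parameters, indicate that between-chain variability has shrunk to the within-chain level.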

Additional enhancements
Three major changes were made to the Gibbs sampler besides the informative prior. The first is the implementation of constraints on the d_p and d_n parameters to improve model convergence. While this may seem counterintuitive, such identifiability constraints have been shown to dramatically improve convergence of Gibbs samplers for Gaussian mixture models (Zanella and Roberts 2021). The other changes are updating λ_r, λ_h, and φ_i with fixed calculations and editing the underlying code to improve computational efficiency. A detailed description and justification of these changes can be found in Supplementary Section S4.

Application and comparison to RCRnorm
We compare the performance and diagnostics of MetaNorm to those of RCRnorm in five areas: computation time, convergence, stability of estimates, model bias, and performance on low-quality datasets. We use four datasets of varying quality, three that were excluded from the meta-analysis and one with a low number of samples. These datasets are summarized in Table 2. Dataset 4 is the lung cancer FFPE dataset used to evaluate RCRnorm (Jia et al. 2019) and contained in the RCRnorm R package. It contains 28 samples and 104 genes after cleaning and is thought to be a high-quality dataset. Dataset 12 was excluded from the meta-analysis due to a high number of low-quality samples (low R² in the positive probe data), and dataset 15 was excluded for having a very high positive correlation between a_i and b_i (see Fig. 2c). Despite being included in the meta-analysis, dataset 13 is also used for testing due to its low number of samples and high number of genes.

Computation time
After restructuring the algorithm, MetaNorm significantly improves computation time. Table 3 shows a breakdown of computation time for the testing datasets in RStudio on a laptop with an Intel i7 processor and 16 GB of RAM. The results show that MetaNorm outperforms RCRnorm in computational efficiency, with a minimum speed-up of 5.7-fold across the 17 meta-analysis datasets. The effect is seen most clearly when the number of samples is large (I > 20), with RCRnorm taking 25-40 times longer to produce the same number of samples. In addition, MetaNorm tends to reach convergence quickly with little autocorrelation, typically requiring fewer than 500 draws to reach a stationary distribution, which yields at least another 5-fold improvement. MetaNorm thus improves upon RCRnorm both by being faster per draw and by requiring fewer draws, increasing efficiency by anywhere from 30-fold for low-sample datasets to 200-fold for large-sample datasets. Local memory usage also improves with MetaNorm, as shown in Section 4.

Convergence
All datasets were run with five chains (randomized starting points) of 15 000 draws each, except for dataset 12. Convergence of the global parameters is summarized in panel (a) of Table 4, where RCRnorm shows decent convergence for dataset 4 and (to a lesser extent) dataset 15, but MetaNorm produces better convergence in either scenario. MetaNorm also converges within only a few hundred draws, much quicker than RCRnorm. Panel (a) of Table 4 also shows that datasets 12 and 13 achieve good convergence for global parameters regardless of which model is used.
However, the next level of the hierarchy tells a different story. Supplementary Figure S1 shows trace plots of a_1, a_2, a_3 for the first three patients from the RCRnorm (top) and MetaNorm (bottom) chains. Clearly, RCRnorm is not converging to the same distribution in every chain, which is a significant issue. Consequently, the normalized gene read counts vary significantly between runs with different starting points, casting doubt on the results [see panel (b) of Table 4]. MetaNorm resolves this issue, with all eight samples converging in their a_i estimates. The issues raised in this section indicate that RCRnorm suffers from a kind of "faux" convergence, where good convergence of global parameters masks weak or absent convergence of lower-level parameters. It should be noted that this does not occur in all datasets when using RCRnorm (such as dataset 4), but (to our knowledge) it does not occur at all when using MetaNorm.

Stability of κ_ir estimates
The most critical piece of both the RCRnorm and MetaNorm models is the set of κ estimates for housekeeping and regular genes. Representing the RNA amount after accounting for sample degradation, the κ parameters are the true model output, and any value the models offer lies in these estimates. With this in mind, if we run separate chains on the same dataset, we would ideally observe very similar estimates for these parameters. We once again show that MetaNorm improves upon RCRnorm in this regard, providing more stable estimates for these parameters. Panel (b) of Table 4 shows the median between-chain standard deviation among all κ_ir estimates. For datasets 12, 13, and 15, which do not have all chains of sample-specific parameters converging to the same distribution, the stark difference in standard deviation should not be surprising; if the a_i and b_i terms differ, of course the κ_ir terms will be wildly different. But even when the data are of good quality and convergence is somewhat stable (as with dataset 4), there is still a significant reduction in standard error with MetaNorm. The left panel of Fig. 3 shows the densities of these standard errors.

Bias of κ_ir estimates
A small simulation study was performed to compare the bias of the κ_ir estimates between RCRnorm and MetaNorm.

Data for the simulation were generated based on dataset 4 (lung cancer FFPE), under the constraints $\sum_{p=1}^{P} d_p = 0$, $\sum_{p=1}^{P} d_p X^{+}_{p} = 0$, and $\sum_{n=1}^{N} d_n = 0$. Bias was estimated using the average mean difference between the actual κ_ir values from which the data were generated and the κ_ir estimates from the normalization procedures. In 100 runs of RCRnorm, we found that the normalization produced output that was off by an average of −0.0651 per κ_ir. This translates to an average 13.9% reduction from actual to estimated normalized gene read counts. In 100 runs of MetaNorm, we found that the output differed by an average of 0.0018 per κ_ir, which translates to an average increase of 0.4% from actual to estimated normalized gene read counts. While this shows that MetaNorm has lower bias than RCRnorm, the result is not as significant as the reduction in prediction variance (especially as it pertains to the mean-error contribution).
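Because the biases are on the log10 scale, the reported percentages follow from exponentiating; a quick check of the arithmetic:

```python
# Biases are on the log10 scale, so the multiplicative effect on the
# normalized read counts is 10**bias.
rcr_bias, meta_bias = -0.0651, 0.0018
pct_rcr = (10 ** rcr_bias - 1) * 100    # percent change implied by RCRnorm's bias
pct_meta = (10 ** meta_bias - 1) * 100  # percent change implied by MetaNorm's bias
print(round(pct_rcr, 1), round(pct_meta, 1))   # -13.9 0.4
```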

Performance with messy data
This section presents a specific example of MetaNorm producing more intuitive and interpretable estimates for data with a large number of low-quality samples. Dataset 12 contains 61 samples (24%) with R² < 0.95 from a linear regression on the positive probe data. Since the nCounter system is designed to maintain a very strong linear relationship between the log-transformed counts and the log-transformed control RNA expression levels, NanoString's recommendation is to discard any samples with R² < 0.95. In this section, we included all 253 samples for testing. Three chains of 10 000 draws were run for each model, with the first 5000 discarded as burn-in. The estimates for b_i (the sample-specific slope terms) were drastically different, as the right panel of Fig. 3 shows, despite the fact that the jackknife prior implemented in RCRnorm keeps the μ_b samples from deviating far from the empirical estimate. While the MetaNorm estimates are within a reasonable neighborhood of μ_b, the RCRnorm estimates are much lower, with a significant portion of them (≈34%) below 0 (implying a negative relationship between read counts and mRNA). As a point of comparison, the empirical estimates for b_i from dataset 12 have a minimum of 0.829. While some variation from the empirical estimates is expected, the amount seen from RCRnorm is counterintuitive both to the original data and to the nCounter system in general. Since b_i drives the relationship between output and mRNA for sample i, this causes significant downstream effects by producing inaccurate κ_ir estimates, which will ultimately lead to biased analyses. In contrast, MetaNorm is able to overcome the pitfalls of messy data and produces reasonably intuitive estimates for b_i. For this reason, the κ_ir estimates from MetaNorm are more plausible than those from RCRnorm for this dataset.

Comparison to RUVSeq
We conclude with a comparison to a leading normalization approach, RUVSeq (Risso et al. 2014), using the large Carolina Breast Cancer Study (CBCS) nCounter dataset (Bhattacharya et al. 2021). The MetaNorm package contains an alternate function that runs the same normalization procedure but keeps only the most recent draws for the κ parameters and calculates a rolling average for the posterior estimates. This allows researchers to avoid storing all draws for all parameters, which can cause major computational problems when large datasets are involved. Since the algorithm continually frees space by dumping old draws, the maximum amount of local memory needed is under 1 GB for CBCS, making it much more comparable to approaches such as RUVSeq.
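The rolling-average idea can be sketched as an incremental mean update, so only the current draw and the running estimate are ever held in memory. Class and method names here are illustrative, not the MetaNorm package's actual API:

```python
import numpy as np

class RunningMean:
    """Accumulate a posterior mean one draw at a time (illustrative sketch)."""
    def __init__(self):
        self.n, self.mean = 0, None

    def update(self, draw):
        self.n += 1
        if self.mean is None:
            self.mean = np.asarray(draw, dtype=float).copy()
        else:
            self.mean += (draw - self.mean) / self.n   # incremental mean update

rng = np.random.default_rng(4)
draws = rng.normal(2.0, 0.3, size=(500, 1000))   # 500 draws of 1000 kappa values
acc = RunningMean()
for d in draws:                                   # only one draw in memory at a time
    acc.update(d)
print(np.allclose(acc.mean, draws.mean(axis=0)))
```

The running estimate matches the batch posterior mean exactly (up to floating-point error), while storage drops from all draws to a single vector.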
Each sample in the CBCS dataset is categorized by study phase (the time period in which the sample was taken), which accounts for a significant amount of variability. Of the 406 regular genes included, only six (1.5%) did not show statistically significant evidence of different means by study phase using an F-test on log-transformed read counts. When normalized estimates were used, these numbers jumped to 34 and 37 for MetaNorm and RUVSeq, respectively, suggesting that MetaNorm performs at least as well as RUVSeq in removing technical variation. A similar analysis of gene expression variation due to ER (estrogen receptor) status finds that 77%, 90%, and 86% of genes had statistically significant F-statistics for the raw data, MetaNorm estimates, and RUVSeq estimates, respectively. These results suggest that MetaNorm outperforms RUVSeq in retaining biologically relevant variation.
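A minimal sketch of this per-gene one-way F-test on simulated data (the data, effect sizes, and function name below are hypothetical, not the CBCS analysis itself):

```python
import numpy as np
from scipy.stats import f_oneway

def genes_with_group_effect(log_expr, group, alpha=0.05):
    """Count genes whose log expression differs by group (one-way F-test)."""
    n_sig = 0
    for g in range(log_expr.shape[0]):
        samples_by_group = [log_expr[g, group == lvl] for lvl in np.unique(group)]
        _, pval = f_oneway(*samples_by_group)
        if pval < alpha:
            n_sig += 1
    return n_sig

rng = np.random.default_rng(1)
phase = np.repeat([0, 1, 2], 40)          # three study phases, 40 samples each
expr = rng.normal(0.0, 1.0, size=(50, 120))
expr[:25] += 1.5 * phase                  # first 25 genes carry a strong phase effect
n_sig = genes_with_group_effect(expr, phase)
```

Applied to raw versus normalized counts, a drop in the number of significant genes for a technical covariate (study phase) indicates removed technical variation, while retained significance for a biological covariate (ER status) indicates preserved biological signal.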
Finally, in an extension of the FF-FFPE correlation analysis summarized in Table 2 of Jia et al. (2019), we find that MetaNorm maintains strong correlation with the "ground truth" FF samples across both genes and patients. Paired FFPE and FF samples from dataset 4 normalized with MetaNorm show an average gene-wise correlation of 0.494, just under RCRnorm at 0.55 but above the rest of the tested methods, including RUVSeq at 0.30 (Jia et al. 2019). MetaNorm also shows a strong average patient-wise correlation of 0.862, followed by RCRnorm (0.851) and RUVSeq (0.803) (Jia et al. 2019).
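The gene-wise and patient-wise correlation metrics can be sketched as follows on simulated paired FFPE/FF expression matrices (all data, noise levels, and names here are hypothetical illustrations of the metric, not the dataset 4 analysis):

```python
import numpy as np

def paired_correlations(ffpe, ff):
    """Average gene-wise and patient-wise Pearson correlation for paired
    FFPE/FF expression matrices of shape (genes, patients)."""
    gene_r = [np.corrcoef(ffpe[g, :], ff[g, :])[0, 1] for g in range(ffpe.shape[0])]
    pat_r = [np.corrcoef(ffpe[:, j], ff[:, j])[0, 1] for j in range(ffpe.shape[1])]
    return float(np.mean(gene_r)), float(np.mean(pat_r))

rng = np.random.default_rng(2)
truth = rng.normal(0.0, 1.0, size=(30, 10))          # latent expression: 30 genes x 10 patients
ff = truth + rng.normal(0.0, 0.2, size=truth.shape)  # FF samples: little degradation
ffpe = truth + rng.normal(0.0, 0.8, size=truth.shape)  # FFPE samples: heavier noise

gene_corr, patient_corr = paired_correlations(ffpe, ff)
```

Higher values on both axes indicate that a normalization method recovers, from degraded FFPE material, expression profiles closer to the fresh-frozen reference.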

Discussion
When conducting statistical analysis in the Bayesian framework, many researchers rely on non-informative or diffuse prior distributions to maintain objectivity. However, when meaningful prior information exists and can be identified, using an informative prior distribution to accurately reflect current knowledge may lead to superior outcomes and greater efficiency. This paper demonstrates a practical example of achieving such improvement by incorporating prior information, obtained via meta-analysis, into data normalization.
RCRnorm is a groundbreaking normalization tool that improved upon existing methodology by leveraging the complexities of nCounter data. While other methods rely on frequentist techniques, the Bayesian framework allows RCRnorm to thoroughly explore the parameter space, identifying hidden information in the data, casting aside implausible assumptions, and increasing model transparency and interpretability. But despite its improvements, RCRnorm struggles to provide reliable estimates when the quality of the data is in question. One reason for this is the "non-informative" data-based prior implemented for the hyperparameters μ_a and μ_b.

Figure 1 .
Figure 1. Graphic representation of the original RCRnorm model (left) and the meta-analysis model (right). Squares denote observed or fixed values, while circles represent model parameters (random variables). All σ² hyperparameters have diffuse inverse gamma priors; the location parameters/hyperparameters c, φ_i, λ_h*, λ_r have "safe range" uniform priors, with all but c having ranges determined by the positive probe data. The hyperparameters μ_a and μ_b have priors created by a jackknife analysis of the positive probe data. On the right, the hyperparameters in the top row are those estimated by the meta-analysis.

Figure 2 .
Figure 2. Meta-analysis: (a) and (b) show boxplots without outliers for the empirical a_ik (left) and b_ik (right) estimates by dataset; (c) shows error bar plots of the (a_ik, b_ik) correlation by dataset; (d) shows the average empirical residual for positive probes by dataset. Red denotes datasets that were excluded from the meta-analysis (ids 4, 5, 12, and 15). In plot (c), the dots represent the Pearson correlation between the empirical estimates of a_ik and b_ik, while the bars reflect 95% confidence bounds. In plot (d), points on the same line come from the same dataset; this plot shows only the datasets included in the meta-analysis.

Figure 3 .
Figure 3. Comparison between RCRnorm and MetaNorm: the left panel shows the density of the standard errors of the κ_ir's for dataset 4; the right panel shows boxplots of the b_i estimates for dataset 12, where the dashed line represents the meta-analysis estimate of the prior mean μ_b and the dotted line represents the empirical estimate of μ_b from the positive probe data.

Table 1 .
Summary statistics for global parameters in our Bayesian meta-analysis after a burn-in of 5000 and thinning of every other draw.

Table 2 .
Characteristics of testing datasets.

Table 3 .
Comparison between RCRnorm and MetaNorm on computation time using real datasets (1000 draws per run).

… limited to five chains of 5000 draws each due to the size of the dataset. Convergence diagnostics for the global parameter μ_a are shown in panel (a) of Table 4.

Table 4 .
Comparison between RCRnorm and MetaNorm on (a) convergence, using Gelman-Rubin diagnostics for μ_a, and (b) the median standard deviation of the κ_ir's.
We compare MetaNorm, RCRnorm, and two other approaches [nanostringdiff (Wang et al. 2016) and nSolver (NanoString Technologies 2018)] on their ability to remove technical variation and retain relevant biological variation from the Carolina Breast Cancer Study (CBCS) dataset (Patel et al. 2022). While RCRnorm and RUVSeq outperform the other observed approaches, the computational load is much heavier for RCRnorm, requiring 20.93 GB of local memory to obtain normalized estimates due to the size of the dataset (>500 000 total observations). Using the same dataset, we compare MetaNorm to RUVSeq on similar criteria.