Robustly detecting differential expression in RNA sequencing data using observation weights

A popular approach for comparing gene expression levels between (replicated) conditions of RNA sequencing data relies on counting reads that map to features of interest. Within such count-based methods, many flexible and advanced statistical approaches now exist and offer the ability to adjust for covariates (e.g. batch effects). Often, these methods include some sort of ‘sharing of information’ across features to improve inferences in small samples. It is important to achieve an appropriate tradeoff between statistical power and protection against outliers. Here, we study the robustness of existing approaches for count-based differential expression analysis and propose a new strategy based on observation weights that can be used within existing frameworks. The results suggest that outliers can have a global effect on differential analyses. We demonstrate the effectiveness of our new approach with real data and simulated data that reflects properties of real datasets (e.g. dispersion-mean trend) and develop an extensible framework for comprehensive testing of current and future methods. In addition, we explore the origin of such outliers, in some cases highlighting additional biological or technical factors within the experiment. Further details can be downloaded from the project website: http://imlspenticton.uzh.ch/robinson_lab/edgeR_robust/.


Introduction
RNA sequencing (RNA-seq) is widely used for numerous biological applications, including the detection of alternative splice forms, RNA editing, allele-specific expression profiling, novel transcript discovery but most notably, for detecting changes in expression between experimental conditions or treatments.Compared to microarray technology, RNA-seq offers an open system, higher resolution, lower relative cost and less bias [38].A typical RNA-seq experiment includes: i) capture of an RNA subpopulation (e.g., polyA-enriched, depletion of ribosomal RNA) from cells of interest; ii) reverse transcription into complementary DNA (cDNA); iii) preparation and sequencing of millions of short cDNA fragments (∼200bp); iv) mapping to a reference genome or (assembled) transcriptome; v) counting according to a catalog of features.This last counting step can be conducted by excluding ambiguous reads between genes [4], or with advanced tools that portion ambiguous reads to transcripts [23] or can be done in combination with assembly tools [17].The focus here is on methods for count-based differential expression (DE) analyses and the robustness thereof; thus, the starting point here is a count table of features-by-samples, such as those available from the ReCount project [15].
Considerable recent effort has been paid by the statistical community to the discovery of DE features, given a count table; recent comparisons have shown that no method dominates the spectrum of possible situations [36,28].RNA-seq remains expensive and in many cases researchers are studying precious samples or rare cell types, so the number of biological replicates is often limiting.It is clear that the most successful methods implement some form of information sharing across the whole dataset to improve DE inference [4], and this becomes an intricate exercise to tradeoff power, false discovery control and protection against outliers.To highlight this distinction, we describe two popular software implementations for the negative binomial (NB) model, which arguably is the de facto standard for accounting for biological variability in such genome-scale count datasets.The latest version of edgeR moderates dispersion estimates towards a trended-by-mean estimate [26], whereas DESeq takes the maximum of a fitted dispersion-mean trend or the individual feature-wise dispersion estimate [3].The effect imposed on features with "outliers" is illustrated in Figure 1.Ten randomly selected samples from individuals from the HapMap project (denoted as Pickrell [27]) are divided into two groups of 5, forming an artificial "null" scenario.While very little true differential expression is expected, a low rate of false detections occur; in particular, edgeR detects a small number of genes with low estimated false discovery rate that exhibit one or two observations that are generally much higher in expression (Figure 1a-c).We believe that there are two causes for this: i) the sensitivity of relative expression estimates to these "outlying" observations; ii) moderation of the dispersion estimates toward the trend.In contrast, DESeq remains largely unaffected by these outliers, since the dispersion estimation policy is to keep the maximum; in what follows, we will explore effect of this maximum policy on power.All computed statistics for this dataset are stored in Supplementary Table 1.
The downstream effect of these dispersion estimation strategies suggest: i) DESeq is conservative but robust; ii) edgeR can be sensitive to outliers when there is sufficient dispersion smoothing towards the trend (effectively underestimating the dispersion in the shrinking process), but should be more powerful in the absence of such extreme observations [4].Our goal in the current study is to achieve a suitable middle ground, perhaps forfeiting a small amount in statistical efficiency, similar to established robustness frameworks, to reduce the influence of extreme observations in differential expression calls.As hinted above and in general, robustness is not solely determined by the dispersion parameter, but also by controlling the influence of outliers to other parameters in the model (e.g., those representing changes in expression).We explore these aspects in both simulated and real data, provide a extensible framework for evaluating the tradeoffs and highlight some instances of biology or technical factors that may give outliers.
The literature is rich in alternatives for count-based DE analyses and in particular, dispersion estimation, yet it remains increasingly difficult to assess the performance across the range of possibilities.For example, recent evidence suggests that one can suitably transform count data and analyze with methods developed for microarrays, with special treatment [21].The mainstream strategy is to directly fit count data to extensions of the Poisson model and in particular, the NB model.Many implementations are available as R/Bioconductor packages [16], such as edgeR [30], DESeq [3], ShrinkBayes [37], baySeq [19] and variations of dispersion estimation that can be used within existing implementations [40]; the main differences lie in the estimation of the dispersion or in the inference machinery (e.g., Bayesian versus frequentist).Recent comparisons and summaries of the methods available can be found in [4], [36] and [28].
Some early and existing count-based DE analysis tools only allowed 2-group comparisons.That is, they could not handle more complex situations, such as paired samples, time courses or batch effects.Recently, McCarthy et al. developed generalized linear model (GLM) capabilities in edgeR [26], allowing a much broader class of experimental designs to be analyzed and other frameworks have followed suit.However, GLMs require iterative fitting and more complicated dispersion estimation machinery [26].As shown in Figure 1, this framework can suffer a lack of robustness, whereby even a single extreme value (outlier) could largely affect estimates of regression parameters (e.g., mean of experimental condition), as highlighted by recent comparative studies [36] (see also Figure 1).In addition, the moderation of the dispersion parameter towards a trended value is actually contributing to the lack of robustness, forcing the dispersion to be underestimated 1).Although the details are not yet published, DESeq2 (successor of DESeq) takes an altogether different stance on robustness: using a Cook's distance metric, features that exhibit an extreme value are not considered for downstream statistical testing.
Figure 2: The flow chart of the robust algorithm implemented in edgeR.β is the estimated GLM regression coefficient and φ is the moderated dispersion estimate by maximizing APL w (Equation 10).r gi is the Pearson residual corresponding to count y gi from Equation 7. w gi is the observation weight from Equation 8. LRT (glmLRT in edgeR) computes likelihood ratio tests.
The strategy proposed in this paper is that of observation weights, effectively downweighting outliers to dampen their influence.There is already some precedent for doing this in GLM settings: Carroll and Pederson [9] introduced weighted maximum likelihood estimators for the logistic model; Cantoni [8] presented a robust quasi-likelihood approach for inference in binomial and Poisson models; Agostinelli and Alqallaf [2] derived weighted likelihood equation for GLMs by directly inserting observation weights into iterative re-weighted least squares algorithm (IRLS).Of particular importance, after adding observation weights, the asymptotic theory suggests that likelihood ratio statistics of model parameters still converge to approximate chi-squared distributions under the null hypothesis [1].At present, no "off-the-shelf" robust approach is readily available for the negative binomial model in the context of genome-scale computations.In this paper, we build an outlier-resistant framework that maintains high power and achieves decent false discovery control and make it available in the edgeR software package; the same strategy could be employed in other frameworks.We benchmark its performance on real and simulated data and explore the origins of outlying observations.

A standard setup of NB model in GLM framework
To most easily explain the addition of observation weights, we follow closely the notation used in McCarthy et al. [26].Let the Y gi be the read count in sample i for feature g.Assume Y gi follows a NB distribution with mean µ gi and dispersion φ g , denoted by Y gi ∼ N B(µ gi , φ g ).Feature g's variance equals µ gi + φ g • µ gi 2 , while the dispersion φ g represents the square of the biological coefficient of variation [26].In the GLM setting, the mean response, µ gi , is linked to a linear predictor, here with the canonical logarithm link according to: where X is the design matrix containing the covariates (e.g., experimental conditions, batch effects, etc.), β g is a vector of regression parameters (a subset of which are of interest for differential expression inference) and N i is the (effective) library size for sample i.
For estimation of the regression parameters, maximum likelihood estimation is used.The derivative of the log-likelihood, l(β g ), with respect to the coefficient β g is X T z g , where z gi = (y gi − µ gi )/(1 + φ g µ gi ).The estimated value of β g can be obtained by the iteratively re-weighted least squares algorithm (IRLS) in the form: where X T Ω g X is the Fisher information matrix (also denoted below as I g ), and Ω g is the diagonal matrix of working weights, which are µ g /(1 + φ g µ g ) for the NB model.

Moderated and trended dispersion estimates
The adjusted profile likelihood (APL) introduced by Cox and Reid [11] has shown good performance for dispersion estimation in the context of genome-scale count data [26,33].The AP L g is a likelihood in terms of φ g , penalized for the estimation of the regression parameters, β g , as follows: where y g is the vector of counts for gene g, βg is the estimated coefficient vector, () is the loglikelihood function, I g is the Fisher information matrix and |.| is the determinant.The early strategy to accomplish moderation for the dispersion was by squeezing the tagwise dispersion toward a common dispersion that is estimated over all features [32].This weighted likelihood approach involves maximizing a linear weighting of the individual likelihood and the common (averaged) likelihood, the two terms, respectively, in: where α is a suitably chosen weight.
A slight variation on this, which is now commonly applied after experience in many datasets showing a dispersion-mean relationship, is to shrink towards a dispersion estimated from features with similar average expression level [26].This so-called trended dispersion is constructed using local shared log-likelihood for feature g (more precisely, a smooth fit to common dispersions that are calculated in bins of averaged counts per million) and its neighbouring features in terms of expression strength.Specifically, individual tagwise estimates for each feature can be estimated by maximizing a linearly weighted function between individual dispersion and local shared dispersion: where φg is moderated tagwise dispersion, γ is the prior degrees of freedom afforded to the shared likelihood and: where the set C represents features that are close to feature g in average log counts per million.

A robust negative binomial GLM
Our approach to induce robustness is to attach a weight to each observation; observations that deviate strongly from the model fit are given lower weight.In particular, Pearson residuals from the current fit are sent through a weight function, which gets passed to the next iteration of estimation.A flow chart of our robust method given the components discussed below is presented in Figure 2. The dispersion estimation machinery (i.e., trended APL) also receives the same observation weight, so that the influence of outliers is dampened on both the regression and dispersion estimates.The Pearson residual of an observed count y gi from the NB GLM fit can be calculated as: where μgi is the fitted value (from β) and φg is the moderated dispersion estimate.The Pearson residual is converted to weights using, for example, the Huber function: where k represents a tuning constant for Huber estimator and usually set to 1.345 in normallydistributed settings to achieve 95% efficiency [14].This weight, w gi , gets used in the next iteration of GLM fitting; the IRLS Equation becomes: where W g the diagonal matrix of observation weights for feature g.The Fisher information matrix with observation weight becomes In this approach, the APL for dispersion φ g with observation weights can be written as: where W (.) ≡ i w gi l(.) is the weighted log-likelihood function and I W g is the Fisher information matrix with observation weights.Then, using these dispersion estimates, the regression parameters are estimated, again using the observation weights.
For users of edgeR, only a small change in the standard pipeline is required.

A simulation framework with parameters based on the joint distribution of mean and dispersion estimates from RNA-seq data
We built a simulation framework that aims to accurately reflect the reality of RNA sequencing data.
In order to evaluate the performance of our robust method and other methods across a variety of reasonable conditions, we created several options: We generate true NB model parameters, µ and φ, using the joint distribution of estimates, μ and φ, from a real datasets, such as the published count tables at ReCount [15]: Pickrell [27], Cheung et al. [10,39].In particular, the joint distribution preserves the dispersion-mean trend, which can vary from dataset to dataset.After the removal of extremely high dispersions and low means (analogous to typical recommended filters), the derived-from-real-data parameters are used to simulate the counts, from a NB distribution and optionally with true DE.
To test robustness, we add outliers to the simulated counts.Outliers are large values and can be produced by two different mechanisms (outlierMech): first, counts are multiplied by a random factor between 2 and 10, as employed by Soneson and Delorenzi [36], and includes both the "single" (S) and "random" (R) method.In S, a gene is chosen at some probability to have a single outlier randomly added.In R, each observation can become an outlier with some probability.The second mechanism, called "model" (M), each observation can be become an outlier with some probability and if so, is sampled from a second NB distribution with large µ; R and M methods induce the same overall outlier rate.
Recently, van de Wiel et al. modeled genome-scale count data as zero-inflated negative binomial model (ZINB), which seemed to explain some of the dispersion-mean relationship [37].We have not considered simulations from ZINB distributions, since they do not appear to explain all of the observed dispersion-mean relationship observed in the datasets that we tested (see Supplementary Figure 1).

Methods compared
We evaluated and compared several methods for DE analysis, including edgeR, edgeR-robust, limma-voom, DESeq-pool, DESeq-glm, DESeq2, baySeq, SAMseq [24], EBSeq [22] and ShrinkBayes; the R-based performance evaluation system that we developed allows arbitrary additions (assuming they are implemented in R). limma-voom is an extension to DE analysis of RNA-seq count data from limma [21]; it transforms the count data with special treatment given to fitting the mean-variance relationship.DESeq is tested as two separate methods: DESeq-pool is the default setting method to estimate the empirical dispersion from all the conditions with replicates; DESeqglm fits models according to a design matrix and estimates dispersion by maximizing APL.edgeR, DESeq and DESeq2 differ in how the dispersion is estimated: edgeR moderates dispersion towards a trended estimate [26], edgeR-robust expands this with observation weights, DESeq takes the maximum of a fitted trend of dispersion or the individual feature-wise dispersion estimate [3].At time of writing, the details of DESeq2 were not yet published, but from the documentation, the approach offers a zero-mean normal prior on the regression parameters of interest with an empirical variance estimated from data; for outlier protection, a Cook's distance is calculated and those features with an extreme value are not promoted to formal statistical testing.The default normalization method is also different among edgeR, DESeq and DESeq2.edgeR uses trimmed-mean-of-M-values (TMM) [31], while DESeq and DESeq2 use a relative-log-expression approach.SAMseq, a non-parametric method, employs Wilcoxon rank-sum statistics to estimate false discovery rate (FDR) through sample permutations.
baySeq, EBSeq and ShrinkBayes use Bayesian inference.baySeq employs the NB model and assumes that samples can be classified as different groups by their treatment conditions; samples within the same group should follow the same distribution and share parameters.Using an empirical Bayes approach, baySeq estimates the posterior probability of the null state.ShrinkBayes introduces the ZINB and performs inference using integrated nested Laplace approximations (INLA) [35,25] and provides Bayesian FDR and local false discovery rate (lfdr) [13] estimates.Since the computational cost of ShrinkBayes is high, some comparisons are skipped.EBSeq is similar to baySeq, providing posterior probability of DE, as well as EE (equally expressed), based on a parametric mixture model.Compared with other methods tested here, EBSeq can also detect DE isoforms in EE features, yet this is not our primary question here.
Notably, new methods, or variations of existing ones can be easily added to our comparison framework, simply by providing a wrapper to an R function that contains the correct inputs (count table, grouping variable) and outputs (P-values).See Supplementary web site for details.

Comparison metrics
To test the performance of each DE method (in the presence of outliers), we employ several standard metrics and plots: false discovery (FD) plots, receiver operating characteristic (ROC) curves, partial ROC curves and power curves.Power (TP) curves and (partial) ROC curves (i.e., up to a certain false positive rate) evaluate the ability to distinguish, through statistical evidence, DE and non-DE.FD procedures gauge the control of the expected proportion of incorrectly rejected null hypotheses [6].Another useful plot is the relationship between TP rate and achieved false discovery rate across multiple thresholds.

An open graphical tool and R code for re-analysis: evaluating DE analysis methods
One disadvantage of current method comparisons (e.g., [36,28]) and those that accompany every new method published, is that they are a snapshot in time.If new methods come along, the developer must demonstrate that their method is better, by some metric.This task is important but somewhat repetitive, since many of the same metrics, plots and simulation models are (re-)implemented.We endeavored to create a system for performing standardized simulation-based testing.
In addition, all analyses presented in this paper are freely available from our website.Moreover, our simulation and evaluation framework is made available as a web-sourceable script that consists of 3 modules: simulation, evaluation (running of the software packages) and metric computation.Each module can be extended, using simple wrapper functions to existing R-based code, ensuring that our comparison results are reproducible, extensible and relatively easy for the user to track exactly what code segments (and versions) were run.
In addition to R code, we make available a web-based shiny "app" that can be used to look at simulation results across a wide number of conditions [34].Since there are often too many methods to be easily displayed together, our app gives users the ability to present results for a user-selected subset of methods; the results update automatically as the user selects different simulation settings.

Functional category analysis for outliers
To explore potential biological or technical factors that may manifest as outliers, we performed hypergeometric-based functional category analyses on the set of genes with weights less than some cutoff (here, set to 1) on a per-sample basis.Our goal with such an analysis is to identify possible biological or technical factors that affect a subset of genes for a particular experimental unit.In some cases, this may shed light on why the expression levels of some genes for a given sample are very different than that of their replicates.Furthermore, we can investigate whether the downweighting is driven by technical factors.As a positive control for this, we compared the observed weights to the sample-specific GC content effects observed in the Pickrell dataset [18,27] 3 RESULTS

edgeR-robust dampens the effect of outliers
To highlight how edgeR-robust dampens the influence of outliers, we return to the dataset shown in Figure 1. Figure 3(a) shows the trajectories for the 3 outliers in terms of their average log-CPM and dispersion estimates and how the dispersion-mean trend changes over 6 iterations of the edgeRrobust reweighted estimation scheme.As expected, the outliers appear "extreme" according to the model, as also reflected by their residuals.Extreme residuals are then down weighted, iteratively, and both the dispersion and average log-CPM estimates are altered (Figure 3a).In particular, we notice large changes to both the average log-CPM (or equally, to the regression parameter estimates) and they become in better accordance with the other features in the dataset.Notably, Figure 3(a) highlights a global drop in dispersion-mean trend after the iterative robust estimation, which suggests that outliers present in sufficient frequency may have a global effect on the statistical detection of DE within a dataset.Thus, we speculate that gains in statistical power (see Sections below) may be achieved in part by this global drop in trended dispersion.
In their manuscript, Li and Tibshirani [24] show some extreme examples of outliers affecting differential count analysis of miRNA-seq data (in particular, see their Figure 2).Figure 3(b) shows one of those examples, mir-133b, and highlights the estimated mean CPM by group, before and after downweighting; the observation weights after 6 iterations are shown in Figure 3(c).Notably, for this example, there still exists strong evidence for differential expression, even after careful reassessment of the outlying observations.Supplementary Table 1 shows the full details of these analyses, before and after reweighting.

Simulation reflects real data
To test the method on a wide range of simulated settings, we first generate count data from a model that reflects real data as well as possible.As described in the Methods, we choose to take the joint distribution of estimated log-CPM and dispersion for a large dataset, after applying no moderation to the parameter estimates, as the basis for the parameter settings and we use library sizes that mimic those from typical datasets.For example, the Pickrell dataset [27] consists of more than 50 replicates, which should represent a reasonably accurate reflection of the range of abundances observed as well as, in particular, the dispersion-mean relationship.We generate all data from the NB model and introduce outliers by various mechanisms (see Methods).Supplementary Figure 2 shows the dispersion-mean trend for the Pickrell dataset (Panel a) and an example simulated dataset based on the estimated parameters (Panel b), respectively, as well as the marginal distributions of both log-CPMs and dispersion.The framework for these simulations (see Methods) is designed to take an initial dataset that seeds the simulation parameters, so datasets spanning the range of biological variation could easily be tested.

Standard metrics across various methods for various simulation settings
Next, we present a representative simulation and performance results under a single "reasonable" setting of the parameters.We sampled NB model parameters µ and φ from the joint distribution of estimates from the Pickrell data [27] (dataset); we filtered out the top 10 percent of the extreme dispersion values; 10000 features are generated (nTags), with a 5 versus 5 two-group comparison (group); 10% of them are defined as DE genes (pDiff=.1),symmetrically (pUp=.5)with fold difference 3 (foldDiff=3); outliers are introduced to 10% of the features (pOutlier=.1)using the "simple" outlier generation mechanism (outlierMech="S"); outliers are randomly distributed amongst all genes; further details are described in Methods.Original simulated counts and the counts with outliers introduced are separately recorded.Figure 4 shows the set of standard metrics: panels (a)-(c) and (d)-(f) show false discovery plots, ROC curves and power numbers, respectively, for the original and original-with-outliers datasets under the setting of simulation parameters discussed above.Overall, the introduction of outliers results in more false positives (Figure 4a versus 4d) and/or less true positives at the same false positive rate (Figure 4b versus 4e).In the absence of outliers, all methods exhibit similar patterns of false discovery rates, with the Bayesian methods, ShrinkBayes and EBSeq having a slightly higher rates.Similarly, in terms of separating the truly DE from non-DE features using a P-value (or P-value-like score in the case of Bayesian methods), all methods are very close in performance.When outliers are introduced, edgeR-robust shows some advantages over edgeR, but many other methods are close.In terms of statistical power, all methods drop in overall power with the introduction of outliers (Figure 4c versus 4f), but edgeR-robust drops by the smallest amount.While DESeq maintains a good ranking of P-values, it becomes very conservative due to the maximum-of-trend-and-individual dispersion policy; in this respect, presence of outliers affect the whole dataset (see Supplementary Figure 3).
Since the direction of differential expression and the outlier introduction are applied at random, we can further split the DE features according to the position of the outlier relative to the direction of change in abundance (Figure 4g-i; "DEupOutlier" represents the situation where the outlier is added to the higher expressed group; "DEdownOutlier" represents those features where the outlier was added to the lower expressed condition; "DEnoOutlier" represent DE features with no introduced outlier).Notably, edgeR shows the highest power in the "DEupOutlier" setting, but this is artificial since the introduction of the outliers actually helps detection in this situation.The "DEdownOutlier" is the situation where edgeR-robust comes to the forefront, as expected, given that outliers strongly eliminate the differential expression in this setting.In the absence of outliers, edgeR-robust still remains a strong competitor, closely followed by DESeq2, ShrinkBayes and edgeR.
It is also interesting to consider how well the robust observation-weight-based method and DE-Seq2's Cook's distance policy perform in simply identifying the simulated true outliers.Supplementary Figure 4 shows an ROC curve depicting how well the observation weights separate outliers from non-outliers.Similarly, the default setting of DESeq2 leads to a similar tradeoff between false positives (here, falsely detected as an outlier) and false negatives (failing to identify an outlier).Notably, the edgeR-robust strategy smoothly identifies outliers and downweights them according to the magnitude of discordance, instead of setting a hard threshold where statistical tests are no longer conducted.One byproduct of DESeq2's hard threshold is a loss of power (e.g., Figure 4panels h and i), since genes with true differential expression are excluded from statistical testing.
The above discussion was in regard to a single dataset under a single set of simulation parameters.To provide a much wider scope of simulation settings, we created a web-based shiny app, that serves up pre-computed results over a range of simulation parameters, including different datasets, sample sizes and so on.In addition, it allows users to plot results for only the subset of desired methods and metrics from Figure 4.While new methods can only be added to the shiny app by us, the same simulations can be easily recreated in a local R environment, as described on our website.In general, the conclusions observed from the broader range of simulation settings are in agreement with those mentioned above (Supplementary Figure 5).We also tested DESeq2 when turning off the Cook's distance metric and found Pearson residuals to outperform deviance residuals (Supplementary Figure 6).

Across multiple simulations over a range of settings, edgeR-robust is somewhat liberal but achieves the best power-to-effective-FDR tradeoff
To complement the simulation results for individual parameter settings, we endeavoured to create a compact summary of a wider range of simulations and explore another important aspect of these methods: do methods accurately report false discovery rate? Figure 5 shows a series of 15 simulations divided into 3 different blocks based on the degree of difficulty: "hard" (foldDiff∈ [2, 2.2]), "medium" (foldDiff ∈ [3, 3.3]) and "easy" (foldDiff∈ [6, 6.6]), including 5 simulations within each group to illustrate sampling variability.For each dataset, lines connect the true positive rates and effective FDRs across three thresholds of the estimated FDR (.02, .05,.1).The rest of the simulation parameters are kept fixed: the NB model parameters originate from the Pickrell dataset [27], there are 10000 features, we consider a two group comparison (5 versus 5), 10% of features are DE and each dataset contains 10% "S" outliers; comparisons for 3 versus 3 and 10 versus 10 are shown in Supplementary Figure 7.
Overall, there is a broad range of power-to-achieved-FDR tradeoffs.On the one hand, EBSeq appears to be a bit conservative and more so in large samples, but quite liberal in small samples.In general, DESeq is conservative and achieves lower power, as reported earlier [36].Altogether, the collection of methods, such as voom, edgeR, edgeR-robust and DESeq2 achieve similar power-toachieved-FDR tradeoffs across the sample sizes, with perhaps a tendency to be more liberal in large sample sizes for edgeR and edgeR-robust.As expected and as highlighted above, edgeR-robust appears to have advantages in the presence of outliers, with only a minor decrease in power when no outliers are present.Overall, edgeR achieves the best tradeoff between power at the same achieved FDR, even if the target FDR is not quite met.

Outliers may originate from technical or biological sources
While the strategy based on observation weights appears useful for dampening the effect of outliers in differential expression analysis, it may also be of interest to investigate the origin of such outlying observations.In some cases, we know of technical artifacts that affect the profile of RNA-seq expression data, such as sample-specific GC content biases, as highlighted and mitigated by the analyses of the HapMap consortium as well as in follow-up methodology development (e.g., cqn normalization [18]).In this dataset, there are no experimental conditions to detect differential expression, so we fit an intercept-only model, using the iterative robust estimation scheme.Not surprisingly, we first observe that the two samples highlighted by Hansen et al. also exhibit a relatively higher number of down weighted observations (Figure 6a).As expected, the degree of downweighting is strongly related to the GC content of the cDNA sequences of the genes involved (Figure 6b).
In an unrelated dataset from Blekhman [7] comparing expression in human male and female livers, we observe that the most significantly overrepresented functional categories were associated with a signal sample (SRX014822and3, green circles in Figure 6c).These include several categories involving the extracellular matrix, as well as collagen catabolism and plasma membrane.We show the third most overrepresented category, "extracellular matrix" (Figure 6c) because the size of this category allows individual genes to be visualised (further details are given in Supplementary Table 2).Although we cannot confirm the exact cause of the overrepresented gene ontology categories, we note that accumulation of collagen and excessive production of extracellular matrix proteins are associated with the development of liver fibrosis (e.g.[20,5]), and we suggest that analyses such as these may assist biologists in identifying the source of outliers in gene expression.

DISCUSSION
Various method developers have shown that statistical methods for the task of discerning differential expression from RNA-seq data represented as counts can be sensitive to outlying observations.In this report, we have studied in detail the effects of outliers on various approaches and developed a new method based on observation weights that can detect and dampen the effect of outliers.In fact, it requires a delicate tradeoff between maintaining high power while at the same time achieving a decent resistance to the presence of outliers.In particular, it is difficult to know exactly what an outlier is and where the line should be drawn to identify it as such.In this respect, we take a "smooth" approach of dampening their effects, when there is evidence to support departure from the model.We have also explored the origin of such outliers and in some cases, we may be able to identify either a technical or biological effect to explain them.Our robust approach follows the strategy of classical robustness methods that are commonly applied to the linear regression problem.In our approach, we adopted the calculation of the residuals and observation weights to the specifics of the flexible dispersion estimation and standard GLM regression estimation of the negative binomial model.
As mentioned above, one reason the edgeR is sensitive to outlying observations is that the dispersion estimate used in the downstream inference is pulled towards the dispersion-mean trend, which may underestimate the variability.Therefore, another way to dampen the effect of outliers is to decrease the degree of moderation toward the dispersion-mean trend.Although we have not studied it here, there is again a delicate tradeoff between the degree of moderation to use and the average inference performance; it still remains an open question as to how exactly to set this value for a given dataset.
Though motivated and tested on real datasets, we employed simulations to explore the broad range of possible settings and developed a comprehensive system for such evaluations.Our strategy to mimick real datasets is to take the joint distribution of mean and dispersion estimates from a large dataset as the basis for parameters to sample from.From such a dataset, outliers and differential expression at a specified level can be readily introduced.In fact, because these are estimates and not true values, we expect the sampled dispersion to potentially exhibit more variation than observed in a real dataset.In terms of evaluating the different methods across the spectrum of simulation settings, it is important to consider it from all points of view: false discoveries amongst the list of top called features, the ability to separate the truly differential from non-differential (i.e., ranking by statistical evidence), the statistical power at thresholds that are typically used in practice and the degree to which methods achieve their purported false discovery rates.
Overall, the observation weight robust method performs well and achieves the goal of suffering only minimal loss of power, while maintaing resistance to introduced outliers.We have investigated the outlier policy in other packages and highlight that smoothly downweighting outlying observations appears preferable.In DESeq, a hard line against outliers is taken by using the maximum of a dispersion-mean trend and the individual estimate; with the addition of outliers, this has a global effect of increasing the variance to all features and gives a resulting loss of power.In DESeq2, a Cook's distance metric is used to remove features with outliers entirely from further consideration; in this case, features that have outliers and differential expression are excluded, again resulting in loss of power.
With the simulation system that we have created, we can now make a call to the community of both developers and users to check the effect of various settings.All that is required to test a new method and compare it against existing methods is to write a wrapper function with the correct inputs and outputs.In addition, if the exact simulation settings that we use in this report are not adequate, we can easily extend this framework into an open testing system that allows additional variations on the sampling model, perhaps including additional distributions or constructed truths, such as plasmodes [29].
The current edgeR framework does not always achieve its false discovery rate target.However, even if it were forced to be more conservative, it still achieves power as good or better than existing approaches across the simulation settings that we have tested, even with the addition of observation weights.The exact source of the liberality is beyond the scope of the current investigation, but there may be room for improvement, such as borrowing ideas from small sample asymptotic approximations [12].

CONCLUSION
We developed an approach to dampen the effect of outliers on count-based differential expression analyses.Overall, the method appears to achieve the desired "efficiency": a resistance to outliers while maintaining high power.We provided an implementation for the edgeR Bioconductor package, but the reweighting idea could easily be adopted to other packages.In addition, we developed an extensible simulation system (at the count table level) that readily performs the simulations based on an existing dataset and provides the infrastructure for producing the standard battery of evaluations.In particular, this allows new methods or variations (e.g., alternative settings) of existing packages to be quickly explored.Instead of preparing a large number of Supplementary Figures, we provide an interactive web-based shiny "app" to display simulation results across a broad range of simulation settings.(f) show corresponding plots with datasets containing 10% outliers (i.e., 10% of genes have a single outlier) using "S" method.(g), (h) and (i) split the results from panel (f) into three categories: features without outliers (g); outliers in the higher expression group (h); outliers in the lower expression group (i).All power results are shown as overall (single dot on left of plot) and split across five equally-sized average-log-CPM groups.The X on panels (b) and (e) highlights the achieved power (TP) according to each method's 5% FDR cutoff.Note that while panel (g) presents the situation with no outliers, there are outliers present in other features within the dataset and is therefore different from panel (c).

Figure 1 :
Figure 1: From Pickrell [27] data ten randomly selected samples from individuals are divided into two groups of 5, forming an artificial "null" scenario.(a), (b) and (c) show barplots of three genes from top 10 DE genes with one or two extremely large observations.(d) and (e) plot genewise biological coefficient of variation (BCV) against gene abundance (in log2 counts per million) for edgeR and DESeq.In panel (d), gray dots shows unmoderated biological BCV estimates ( √ φ i ) (prior degrees of freedom= 0).Steel blue dots shows moderated biological BCV with prior degree 10 (default setting for edgeR).Three outlier genes on (a), (b) and (c) are labelled by large blue dots.For (e) DESeq uses the maximum (steel blue dots) of a fitted dispersion-mean trend (red line) or the individual feature-wise (tagwise) dispersion estimate.Three outlier genes are also pointed out by large blue dots.

Figure 3 :
Figure 3: (a) For the random 5 versus 5 split of the Pickrell data [27] shown in Figure 1, the trajectories of overall trended dispersion and for the 3 individual genes are shown over 6 iterations of the edgeRrobust reweighted estimation scheme.(b) a barplot of miR-133b expression from Witten et al. [39], including an observation with very high count.(c) weights for miR-133b after 6 iterations of the reestimation from edgeR-robust.Dashed lines in panel (b) shown the group-wise counts-per-million (CPM) before and after weighting.

Figure 4 :
Figure 4: (a), (b) and (c) present FD, partial ROC (up to FP rate of 40%) and power plots (at each methods' 5% FDR) across several tested methods for datasets with no introduced outliers; (d), (e) and (f) show corresponding plots with datasets containing 10% outliers (i.e., 10% of genes have a single outlier) using "S" method.(g), (h) and (i) split the results from panel (f) into three categories: features without outliers (g); outliers in the higher expression group (h); outliers in the lower expression group (i).All power results are shown as overall (single dot on left of plot) and split across five equally-sized average-log-CPM groups.The X on panels (b) and (e) highlights the achieved power (TP) according to each method's 5% FDR cutoff.Note that while panel (g) presents the situation with no outliers, there are outliers present in other features within the dataset and is therefore different from panel (c).

Figure 6 :
Figure 6: Technical ((a) and (b)) and biological (c) sources of outlier genes.The number of down weighted observations (a) and distribution of outlier weights as a function of the gene GC% in three samples from the HapMap RNA-Seq data [27] are plotted (b).Two of the samples shown (NA11918 and NA12761) were shown by Hansen et al to have strong, opposing relationships between GC% and RPKM.The third sample (NA12156) had the least number of genes down weighted after applying our robust down weighting procedure.(c) shows the log(CPM) and the inverse of the down weighting value for genes in the "extracellular matrix" gene ontology category, where a value of 1 indicates no down weighting and larger inverse weights indicate stronger down weighting.