-
PDF
- Split View
-
Views
-
Cite
Cite
Mark D. Robinson, Gordon K. Smyth, Small-sample estimation of negative binomial dispersion, with applications to SAGE data, Biostatistics, Volume 9, Issue 2, April 2008, Pages 321–332, https://doi.org/10.1093/biostatistics/kxm030
Close - Share Icon Share
Abstract
We derive a quantile-adjusted conditional maximum likelihood estimator for the dispersion parameter of the negative binomial distribution and compare its performance, in terms of bias, to various other methods. Our estimation scheme outperforms all other methods in very small samples, typical of those from serial analysis of gene expression studies, the motivating data for this study. The impact of dispersion estimation on hypothesis testing is studied. We derive an “exact” test that outperforms the standard approximate asymptotic tests.
1. INTRODUCTION
Count data arise in numerous biological applications and can often be modeled by a Poisson distribution, where the mean and variance are the same. However, in situations where there is a positive correlation in the occurrence of events, the observed variation is significantly greater than the mean and an extension to the Poisson model is more appropriate. A popular alternative is the negative binomial (NB) model, also known as the gamma-Poisson model, since the Poisson rate parameter is a mixture of gamma random variables with fixed coefficient of variation.
Serial analysis of gene expression (SAGE) is a celebrated molecular biology technique that affords the simultaneous measurement of the expressed messenger ribonucleic acid (mRNA) population at a given time or state of a biological organism (Velculescu and others, 1995). Briefly, the procedure works as follows. mRNA, an intermediate molecule between the DNA code and an eventual protein, is extracted from a cell, and many short (e.g. 14 bp) regions called “tags” are sequenced and recorded. Typically, there are upward of 30 000 total tags sequenced for a single sample, called a “library”, and many tags occur multiple times. The tag counts are representative of the abundance of the corresponding mRNA molecule. Unfortunately, the technique is expensive due to the current cost of sequencing and we often only have access to a very small number of libraries, thus making estimation and inference a challenge.
We explore the use of an NB model for each tag across multiple libraries and discuss the estimation of the dispersion parameter. The data on hand consist of counts for many seemingly unrelated tags, with each tag being observed through a small number of libraries. The major statistical inference task with SAGE data is to find mRNA transcripts which differ in expression between experimental conditions or between classes of patients. To do this, a statistical model such as the NB can be used to determine whether observed differences in tag counts can be attributable to chance or not. If there are no replicates in each class, a χ2 test for 2-by-2 contingency tables or Fisher's exact test can be conducted for each tag (Kal and others, 1999). Even with replication, previous analyses have ignored this and pooled the tag counts over libraries (Kal and others, 1999). This, however, overestimates the significance of each difference because it fails to allow for interlibrary variation. Here, interlibrary variation is due to a combination of biological (e.g. Zhang and others, 1997, used 2 different patients for each tissue) and technical sources; however, due to the current expense of SAGE, technical replication is rarely available. Models used to capture this additional variation include overdispersed generalized linear models (GLMs), with the counts as either binomial (Baggerley and others, 2004) or Poisson (Lu and others, 2005), the latter proving to be more powerful in most situations.
We describe a new approach for estimating NB dispersion that uses simple distributional adjustments in combination with conditional maximum likelihood (CML). Even though we typically only have a small number of observations (e.g. 2 libraries for a given experimental condition), we do have counts for many tags and can leverage what little information that does give about the dispersion.
The paper is organized as follows: Section 2 describes the NB model and the generalizations we explore. Section 3 reviews the methods for estimating dispersion in NB models. Section 4 introduces our CML approach to estimation and Section 5 discusses hypothesis testing, including our new “exact” test. Performance comparisons are made in Sections 6 and 7, and discussion follows in Section 8. Software for all calculations is available from the authors.
2. NB MODELS
2.1 Genesis

Note also that as φ→0, the distribution reduces to the Poisson. Based on recent work in the analysis of SAGE data (Lu and others, 2005), the NB distribution is a robust alternative to the beta-binomial (i.e. overdispersed logistic regression) distribution and other models. In the SAGE context, φ accounts for the library-to-library variability.
2.2 Single tag with unequal library sizes
SAGE data motivate the following model for the counts for a single tag across n libraries. Consider Y1,…,Yn as independent and NB(μi = miλ,φ), where mi is the library size (i.e. total number of tags sequenced for library i) and λ represents the proportion of the library that is a particular tag. An important special case is mi≡m, where the Yi are identically distributed. This simple situation is integral to our proposed estimation mechanism. With an identically distributed sample, the maximum likelihood estimator (MLE) of λ will always be the sum of counts divided by sum of library sizes, independent of φ. If m = 1, the MLE of λ is the mean, as with the Poisson model. In the case of different mi, the MLE of λ will depend on φ and maximum likelihood estimation of the 2 parameters proceeds jointly.
2.3 Many tags with unequal library sizes: SAGE data
In a SAGE experiment, we simultaneously make observations on T tags over n libraries. Typically, T is in excess of 5000–10 000, depending on the diversity of the expressed mRNA and the sequencing time. We assume that all tags are independent for the purposes of estimation and inference. This is not strictly true since some sets of genes are involved in related biological functions and are inherently coregulated in their expression levels. However, our primary concern is getting an unbiased estimate of the dispersion parameter and consistency of the estimator is unaffected by this correlation.
For tag t and library i, we denote the random variables as Yti and model them as NB with mean miλt and dispersion φ. Note that all tags are assumed to have a “common dispersion,” φ. Since we are always dealing with very small s (e.g. n = 3), estimating a different dispersion for each tag separately is unrealistic and, in any case, evidence for truly different φ is unclear.
3. REVIEW OF DISPERSION ESTIMATORS
It is well known that MLEs in general tend to underestimate variance parameters because they fail to adjust for the fact that the mean is estimated from the same data.

Quasi-likelihood (QL) models estimate the dispersion in an identical fashion but replace the Pearson statistic with a deviance statistic (Nelder, 2000). The estimating equation is

Thus, the PL and QL methods do allow for estimation of the mean in computing the residual degrees of freedom. Nelder and Lee (1992) compared estimators for the NB and found PL almost always less efficient than QL and often less efficient than the MLE.

Each of the above estimators can be extended to the many-tag SAGE scenario simply by summing quantities over tags.
4. CONDITIONAL DISPERSION ESTIMATION
4.1 Conditional maximum likelihood

4.2 Quantile-adjusted CML
When the library sizes are unequal, we devise a simple but effective approximate approach to equate the library sizes and create quantile-adjusted “pseudodata,” allowing us to use the above CML machinery to achieve an accurate estimate of φ.
Let , the geometric mean of the library sizes. We adjust the observed data as if they were all sampled as identically distributed NB(m*λ,φ), as follows:
1. Initialize φ (e.g. the unadjusted CML estimate).
2. Given the current value of φ, estimate the rate λ.
3. Assuming each observation yi was sampled from an NB distribution with mean miλ and dispersion φ, calculate the observed percentiles

4. Using a linear interpolation of the quantile function, calculate pseudodata from an NB distribution with mean m*λ and dispersion φ, having quantiles pi. Note that pseudodata will then be continuous and can be as negative as − 0.5. The pseudodata for each tag is approximately identically distributed.
5. Calculate φ using CML on the pseudodata.
6. Repeat Steps 2–5 until φ converges.
Effectively, this adjusts observed counts upward for libraries with size below the geometric mean and downward for counts from an above-average–sized library. We call this quantile-adjusted conditional maximum likelihood (qCML). In the case of many samples (i.e. many tags), the above procedure creates pseudodata for each tag and the estimation of the common φ is done over all tags.
5. HYPOTHESIS TESTING
5.1 Asymptotic tests
We discuss the simple setting of a 2-sample comparison (e.g. cancer versus normal) that would be repeated for each tag in the SAGE context. For all tests, the common dispersion is estimated from a large number of tags and therefore is treated as fixed when applying these tests.
We wish to test whether the relative abundance under experimental condition A is the same as that in condition B, leading to a null hypothesis of H0:λtA = λtB, for each tag. There are no exact tests for testing such a hypothesis under the NB with unequal library sizes, but the standard approximate options based on asymptotics include the Wald test, the score test, or the likelihood ratio (LR) test. For the small sample sizes, we expect from SAGE experiments that the large-sample approximations are questionable. For this reason, we develop a new test.
5.2 A small-sample test procedure
The quantile adjustment, discussed in Section 4.2, provides the opportunity for an exact test. If the quantile adjustment is applied under the null model of no difference, the pseudodata are then approximately identically distributed, leading to known distributions of the within-condition pseudodata totals for each tag. Also, the sum of the total tag pseudocount over all libraries has a known distribution. For the 2-sample test discussed in Section 4, let ZtA and ZtB be the sum of pseudocounts for class A and class B, respectively, over the number of libraries, nA and nB. Under the null hypothesis, Ztk∼NB(nkm*λt,φnk − 1),k∈A,B. One can construct an exact test similar to the Fisher's exact test for contingency tables but replacing the hypergeometric probabilities with NB. Conditioning on the total pseudo-sum, ZtA + ZtB, also an NB random variable, we can calculate the probability of observing class totals as or more extreme than that observed, giving an exact method of assessing differential expression. In other words, the 2-sided p-value is defined as the sum of the probabilities of class totals that are no more likely than those observed.
The test is approximate in the sense that the quantile adjustment makes the pseudodata approximately identically distributed, but the probabilities are otherwise exact. As with all the asymptotic tests, the new test also has a many-class analog, which we do not consider here.
6. COMPARISON OF DISPERSION ESTIMATORS
6.1 General considerations
Since the SAGE setting leads us to a common dispersion model, we are primarily concerned with having an estimator with minimal bias to ensure that significance statements concerning λ are accurate. An analysis of significance testing is given in Section 7.
Based on previous experience, we would expect to find that ML has smallest variance but largest bias. CML should have smaller bias but larger variance. CR should be closer to qCML and should perform well in larger samples. QL should outperform PL for nonnormal models and especially in the small samples we consider here. In terms of bias, we expect qCML to be the best, followed by CR, QL, PL, and MLE.
We have chosen 4 scenarios, with small and large μ and small and large φ. For φ, we have chosen small at 0.25 and large at 1. The primary performance criterion is bias of the estimate of φ, since in real settings we have a large number of tags and the bias of φ will affect the significance of test statistics in the search for differential tags. We focus here on the smallest samples possible where we can estimate 2 parameters (n = 3), and these are typical of SAGE studies.
We choose library sizes uniformly between 20 000 and 80 000 to simulate the effect of differing SAGE library sizes and select λ = 0.0001 and λ = 0.0005 as small and large, giving means between 2 and 8 and 10 and 40, respectively. These are meant to imitate typical SAGE counts, and the settings are similar to those used in the simulations in Lu and others (2005).
Since there is nonzero probability of data situations in which the ML or CML estimate is infinite, we present results on the scale that is constrained to (0,1). It is important to note that biases observed on the δ scale are always more extreme when converted to the original φ scale.
6.2 Single tag with unequal library sizes
To achieve a baseline for comparison, we consider estimation for a single tag. Later, we consider estimation of a common dispersion using multiple tags. Figure 1 gives box plots of 1000 independent samples of 1 tag over 3 libraries. These show the bias of each of the 6 estimators. In this situation, most methods underestimate the true dispersion. Here, CML acts as a control as it applies CML using incorrect assumptions and should overestimate the true value. Fortuitously, CML appears least biased and would be considered the top performer in this situation. Beyond CML, it appears that the QL performs best in this situation.
Estimates of φ for 1000 samples of a single “tag” (i.e. each observation from a different library size) of size n = 3 on the scale. The horizontal line indicates the correct value.
6.3 Many tags with unequal library sizes: SAGE data
Here, we consider the actual SAGE setting, where many tags from unequally sized libraries are used to estimate the common dispersion. We consider a “mini-SAGE” setting with 100 tags being used to estimate the common dispersion. The library sizes and values for λ and φ are chosen as before, and the results are shown in Figure 2. Here, we see that qCML is the only estimator that is reliable under the whole spectrum of λ and φ values. CML should overestimate the true value and clearly does, especially for large means and somewhat surprisingly, for small φ. PL also consistently underestimates the true value, whereas ML grossly underestimates in all cases. QL overestimates the true value, especially for the more difficult small-mean, large-dispersion case. CR performs quite well in most situations, except the small-mean, large-dispersion state.
Estimates of common φ for 1000 samples with 100 tags (i.e. each dispersion is estimated from 100 randomly sampled tags) with n = 3 libraries. Results are presented on the scale. The horizontal line indicates the correct value.
6.4 Other settings
As the number of tags increases, the performance of the qCML is even more apparent (data not shown). We have tried 1000 tags, and the results observed in Figure 3 are identical, yet more defined. qCML is unbiased in all 4 scenarios, ML and PL underestimate, and QL and CML overestimate the true value. Similarly, CR performs very well, except in the small-mean, large-dispersion situation.
Achieved false-positive rates sampled under the null hypothesis (i.e. no differential expression) for 4 statistical tests, in either a 2 versus 2 comparison or a 5 versus 5 comparison.
7. COMPARISON OF TESTING PROCEDURES
7.1 Infinite evidence against the null hypothesis
There is a special case where some asymptotic tests seem to break down for the 2-sample NB model, and in fact, the same problem would exist for Poisson, binomial, and beta-binomial models.
The following table illustrates this. For a 2-class comparison, 2 libraries for each class, equal library sizes, suppose we observe zero total count in one class and varying levels of nonzero counts in the other class. In this case, the true value of λ for class 1 is 0, on the boundary of the parameter space. Assuming an arbitrary selection of φ = 0.5, we calculate the various significance tests mentioned in the previous sections.
| Class 1 | Class 2 | Wald | Score | LR | Exact | |||||
| Y11 | Y12 | Y21 | Y22 | z | p | z | p | χ21 | p | p |
| 0 | 0 | 6 | 8 | 1.23 × 10−03 | 0.999 | 2.26 | 0.024 | 9.77 | 1.77 × 10−03 | 1.17 × 10−02 |
| 0 | 0 | 60 | 80 | 1.30 × 10−03 | 0.999 | 2.75 | 0.006 | 25.69 | 4.01 × 10−07 | 3.75 × 10−06 |
| 0 | 0 | 600 | 800 | 1.37 × 10−03 | 0.999 | 2.82 | 0.005 | 43.81 | 3.62 × 10−11 | 4.31 × 10−10 |
| 0 | 0 | 6000 | 8000 | 1.45 × 10−03 | 0.999 | 2.83 | 0.005 | 62.20 | 3.10 × 10−15 | 4.37 × 10−14 |
| Class 1 | Class 2 | Wald | Score | LR | Exact | |||||
| Y11 | Y12 | Y21 | Y22 | z | p | z | p | χ21 | p | p |
| 0 | 0 | 6 | 8 | 1.23 × 10−03 | 0.999 | 2.26 | 0.024 | 9.77 | 1.77 × 10−03 | 1.17 × 10−02 |
| 0 | 0 | 60 | 80 | 1.30 × 10−03 | 0.999 | 2.75 | 0.006 | 25.69 | 4.01 × 10−07 | 3.75 × 10−06 |
| 0 | 0 | 600 | 800 | 1.37 × 10−03 | 0.999 | 2.82 | 0.005 | 43.81 | 3.62 × 10−11 | 4.31 × 10−10 |
| 0 | 0 | 6000 | 8000 | 1.45 × 10−03 | 0.999 | 2.83 | 0.005 | 62.20 | 3.10 × 10−15 | 4.37 × 10−14 |
| Class 1 | Class 2 | Wald | Score | LR | Exact | |||||
| Y11 | Y12 | Y21 | Y22 | z | p | z | p | χ21 | p | p |
| 0 | 0 | 6 | 8 | 1.23 × 10−03 | 0.999 | 2.26 | 0.024 | 9.77 | 1.77 × 10−03 | 1.17 × 10−02 |
| 0 | 0 | 60 | 80 | 1.30 × 10−03 | 0.999 | 2.75 | 0.006 | 25.69 | 4.01 × 10−07 | 3.75 × 10−06 |
| 0 | 0 | 600 | 800 | 1.37 × 10−03 | 0.999 | 2.82 | 0.005 | 43.81 | 3.62 × 10−11 | 4.31 × 10−10 |
| 0 | 0 | 6000 | 8000 | 1.45 × 10−03 | 0.999 | 2.83 | 0.005 | 62.20 | 3.10 × 10−15 | 4.37 × 10−14 |
| Class 1 | Class 2 | Wald | Score | LR | Exact | |||||
| Y11 | Y12 | Y21 | Y22 | z | p | z | p | χ21 | p | p |
| 0 | 0 | 6 | 8 | 1.23 × 10−03 | 0.999 | 2.26 | 0.024 | 9.77 | 1.77 × 10−03 | 1.17 × 10−02 |
| 0 | 0 | 60 | 80 | 1.30 × 10−03 | 0.999 | 2.75 | 0.006 | 25.69 | 4.01 × 10−07 | 3.75 × 10−06 |
| 0 | 0 | 600 | 800 | 1.37 × 10−03 | 0.999 | 2.82 | 0.005 | 43.81 | 3.62 × 10−11 | 4.31 × 10−10 |
| 0 | 0 | 6000 | 8000 | 1.45 × 10−03 | 0.999 | 2.83 | 0.005 | 62.20 | 3.10 × 10−15 | 4.37 × 10−14 |
There are 2 notable consequences. First of all, the Wald test tends to zero since the estimated standard error approaches infinity. Second, the score test gets more extreme but seems to approach an asymptote at approximately 2.83. As a result, the p-values never get very extreme. In a real SAGE setting, the tests would need to be adjusted for the large number of tests being conducted and the score test would lack power. Therefore, one can rule out both the Wald and score tests as reasonable approaches for testing. The LR and exact probabilities get more extreme with increasing evidence against the null hypothesis, but as we shall see in Section 7.2, only the exact test holds its size.
7.2 Size of the test: φ known
Assuming the dispersion is known, we look at how well each of the approximate tests achieves a 5% false-positive rate given the 5% cutoff of their respective null distributions. Random samples under the null hypothesis of no difference are taken over 1000 tags, with low mean and high dispersion, as used previously, and sampled 30 times. For a fair comparison of the tests, the known value of φ is used. Figure 3 shows the 30 observed false-positive rates (i.e. test statistics more extreme than the 5% cutoff) for a 2 versus 2 library comparison and 5 versus 5 library comparison.
Because of the discrete nature of count data, it is not surprising that the exact test is somewhat conservative because an exact 5% cutoff can only rarely be achieved. For the larger samples, there becomes a smaller discreteness effect, as expected. Both the Wald test and the LR test give test statistics that are more extreme than their approximate χ12 random variables, leading to high false discovery rates. The asymptotic normal score test appears to be the best for small samples, in terms of achieving the 5% nominal error rate, but fails with larger samples. The exact test performs reliably in both situations and is the only one to be correct or conservative.
7.3 Size of the test: small-sample test with different estimators of φ
In Section 7.2, we considered the range of approximate tests with the dispersion known. Here, we show the effect of the dispersion bias on the false-positive rate of the test. Figure 4 shows the effect of dispersion estimation on the false-positive rates. Again, we make 30 repeat samples of 1000 tags, make a 2 versus 2 comparison with no difference, and estimate the fraction of tags below a 5% probability. We chose the low-mean, high-dispersion case, as this was a situation where the estimators differed significantly (recall top-left panel of Figure 2).
Achieved false-positive rates sampled under the null hypothesis (i.e. no differential expression) for the exact test using different estimators.
Not surprisingly, overestimating the dispersion (CML, QL, CR) results in a conservative test, whereas underestimation (ML, PL) results in a liberal test and therefore more false discoveries than expected. qCML results in a slightly conservative test due to the discreteness as mentioned previously.
7.4 Power considerations
Lastly, we implant known differences and illustrate the power of the different tests.
We show the results as area under the receiver operating characteristic curve (AUC). Here, we sample 1000 tags from NB in the low-mean, high-dispersion setting with 10% having been sampled with a known difference between the classes ( or ). The class sizes are set at nA = nB = 2, with library sizes at 10 000 and 100 000, for each class. Differences are embedded so that the true means of both classes are affected, one up and one down, picked randomly. Though 8-fold may appear to be quite a large difference, in the low-mean and high-dispersion setting, it is still a challenging statistical inference task. Generating 100 such data sets of 1000 tags, we compute the AUC for each combination of the 4 statistical tests and 6 estimators. Figure 5 shows box plots of the AUCs.
Achieved AUCs for 4 statistical tests and 6 estimators. The tests and estimators are presented in the same order as all previous figures.
It is evident that the larger effect on the ability to find the true differences is the statistical test used, not the estimate of dispersion. The Wald test suffers from its flaw in detecting differences where one of the estimates of relative abundance is 0, thus introducing false negatives and lowering the achievable AUC. The results suggest that the dispersion estimate has only a small impact on the ordering. This is perhaps not surprising, as the dispersion estimate does not come into the significance equation directly, as it does in a Gaussian testing situation, for instance.
Another way to represent the ability of different tests and different estimators to rank the genes is to record the mean rank of the truly different tags, with lower rankings being better. To focus on differences between the estimators, we compared the mean ranks of the truly different tags on a per–data set basis. Figure 6 shows box plots of mean ranks for the same 100 data sets sampled above. The vertical axis shows the difference between the mean rank for the qCML estimator and that for each of the other estimators, for each of the simulated data sets. In all cases, the genes have been ranked using the exact test. The qCML estimator outperforms all the other estimators by 1–10 false discoveries. Although the differences are relatively small, they are remarkably consistent, so that qCML beats QL and QR on almost every data set. Surprisingly, the ML and PL estimators, both of which are negatively biased, perform well. This suggests that it may be advantageous to have a slight negative bias in the dispersion estimator for the purpose of gene ranking.
Difference in mean ranks of the truly differential tags between the qCML estimator and all other estimators on a per–data set basis, using the exact test.
8. DISCUSSION
We have derived an estimator, qCML, for the dispersion parameter of the NB distribution and compared it against several other estimators using an extensive simulation study. While qCML does not outperform all estimators under all circumstances, it is the most reliable in terms of bias on a wide range of conditions and specifically performs best in the situation of many small samples with a common dispersion, the model which is applicable to SAGE data. We have deliberately focused on very small samples due to the fact that DNA sequencing costs prevent large number of replicates for SAGE experiments.
For a single tag with a small number of libraries, all estimators offer mediocre performance and here is no clear winner. As the number of tags used to estimate the common dispersion increases while holding the number of libraries at a small number, qCML is clearly the best estimator. With more libraries (e.g. n = 10), CR performs about as well as qCML.
The same quantile adjustment we use for dispersion estimation affords us a new “exact” statistical test, similar in flavor to the Fisher's exact test for contingency tables. The exact test is compared to the standard Wald, LR, and score tests in terms of achieving their nominal false discovery rates and on power. The exact test is the only test to achieve its nominal false discovery rate, due in part to the fact that sample sizes are not large enough for the asymptotics to achieve a reasonable approximation. Bias in the estimation of the dispersion has significant impact on the expected false-positive rates but does not seem to adversely affect power to the same degree, since the ranking is only slightly affected by the dispersion estimate. Other than the Wald test, the remaining tests have similar power to detect differences.
The quantile adjustment approach can be applied more generally, not simply to SAGE data. For example, qCML can be applied to any NB regression and probably outperforms the standard methods such as PL in small-sample situations. For SAGE data, we had picked the geometric mean of the library sizes in order to adjust the observed data so that CML machinery can be put to use. For general GLMs, one must create pseudodata to equate the fitted means analogously and the iterative procedure is straightforward. This pseudodata could be used only for the overdispersion estimation or for both estimation and testing. More experimentation would be required in the GLM setting.
FUNDING
National Health and Medical Research Council (406657).
Thanks are due to Alicia Oshlack for valuable discussions and to the editor and an associate editor for suggestions on making the manuscript clearer. Conflict of Interest: None declared.





