A conditional predictive p-value to compare a multinomial with an overdispersed multinomial in the analysis of T-cell populations

Immunological experiments that record primary molecular sequences of T-cell receptors produce moderate to high-dimensional categorical data, some of which may be subject to extra-multinomial variation caused by technical constraints of cell-based assays. Motivated by such experiments in melanoma research, we develop a statistical procedure for testing the equality of two discrete populations, where one population delivers multinomial data and the other is subject to a specific form of overdispersion. The procedure computes a conditional-predictive p-value by splitting the data set into two, obtaining a predictive distribution for one piece given the other, and using the observed predictive ordinate to generate a p-value. The procedure has a simple interpretation, requires fewer modeling assumptions than would be required of a fully Bayesian analysis, and has reasonable operating characteristics as evidenced empirically and by asymptotic analysis.

This supplement is organized into sections that refer back to the relevant section in the main paper. For example, the section S6 below refers to Section 6 in the main paper (Asymptotic Theory) and presents details of issues raised in that section. Reported figures and tables, both in the main paper and supplement, can be reconstructed using R's demo facility and the accompanying R package, dod (short for double overdispersion model). The available demo's are: create Figure S3 (figS3.pdf) demo(lrstatJ) approximate likelihood ratio statistic; J region demo(lrstatV) approximate likelihood ratio statistic; V region demo(power) simulate non-null p-value distribution Computations were done with R version 2.15.1 (R Development Core Team 2011). To simplify the package operation, results of the three heaviest computations (MCMC sampling for J and V, and the simulation) were precomputed and stored as data objects in a list called precooked, though the precise code to obtain the results is also provided in example scripts within dod.

S2: Study design and cell-culture protocols Study design
There was no disease-related or data-related selection that was used for the samples analyzed. For the purposes of simplifying the model development, we chose to focus in the present manuscript on one highly relevant class of samples, namely blood samples from melanoma patients. All of the patients from whom sequence data were collected were analyzed here. The melanoma patients in this study were seen at the William S. Middleton Memorial Veterans Hospital or at the University of Wisconsin Hospital and Clinics between December 2004 and July 2006 and had biopsy-confirmed diagnoses of melanoma that was either American Joint Committee on Cancer (AJCC) Stage III or IV. Melanoma patients provided blood and/or tumor samples. Blood samples were also obtained from normal control blood donors at the William S. Middleton Memorial Veterans Hospital or at the University of Wisconsin Hospital and Clinics. The study was approved by the Health Sciences Institutional Review Board that serves both the William S. Middleton Memorial Veterans Hospital and the University of Wisconsin-Madison. Written informed consent was obtained from all participants. Data from control subjects were not reanalyzed in the present manuscript; neither were data from patient tumor samples, but these were reported in the original study paper, Zuleger et al. 2011.

Mass culture (MC) and single-cell derived isolates (SC)
Single-cell derived T-cell isolates provide a means by which to associate individual CDR3 to functional attributes of a particular T cell. This provides valuable insight in terms of T-cell specificity, affinity, lineage, and maturation status. Further, additional sequence analyses with single-cell derived isolates link gene expression to a particular T cell. As example, the TCR is composed of two chains which both contribute to the T-cell's specificity. Both chains can be sequenced from a single-cell derived T-cell isolate giving a more complete description of the TCR. However, T-cell cloning is not a trivial undertaking. We thus investigated the mass culture system, which is faster, cheaper, and less time consuming than working with T-cell isolates. Mass culture also permits functional T-cell analyses while taking advantage of cell-to-cell interactions and a somewhat more physiological system (i.e., cells do not exist in isolation in vivo). Although the mass culture does not permit the association of particular CDR3's with function, the trade-off is high-throughput analysis. Further, mass culture selection can be applied to blood, to lymph nodes, and to other tissues of interest for which limited cell numbers may preclude T-cell cloning.
The MC consists of culturing T cells from the blood as a bulk population in the presence (MT) or absence (WT) of 6-TG until sufficient cell numbers are available for sequencing. Typically the WT culture is initiated with 1×10 5 total cells, whereas the MT is initiated with at least 5×10 6 total cells. The SC are obtained by plating blood cells in 96-well round-bottom plates in the presence (MT) or absence (WT) of 6-TG. WT SC are plated at limiting dilution cell concentrations (i.e., 1-, 2-, or 5-cells per well). MT SC are plated at 10,000 cells per well. Plates are cultured for 14 days at which time expanding T cells are identified and further expanded prior to sequencing. These protocols have been described in detail (Albertini et al. 2008, andO'Neill et al. 1987).

S3: Data structure and sampling model Supplementary Data
Tables 1 and 2 in the main paper summarize J-region type frequencies for MT and WT CDR3 sequence data from six melanoma patients. Tables S1 and S2 present comparable frequencies when sequence data is organized by V-region type. (Patients A-F here correspond to patients P-13, P-9, P-5, P-3, P-2, and P-1 in Zuleger et al., 2011.) In the case examined MC and SC data were measured for both MT and WT cells from all subjects. The methodology allows imbalance in these data sources and the possibility that some source (e.g. SC) is not measured on every sample. Table S3 reports the cell characteristics for all the samples reconsidered here.

Posterior sampling
Adopting standard notational simplifications and the double over dispersion model from the main paper, the target posterior distribution is p(z, θ, φ|y) where z holds a matrix (types by patients) of latent counts just prior to in vitro growth (and just after 6-TG treatment) for the MT cells, θ is the vector (over types) of type frequencies, φ is an overdispersion parameter (Gamma shape parameter), and y is the set of all MT data being conditioned upon for this phase of the analysis. We implement a standard block-updating sampler, cycling through 1. p(z|else)

p(φ|else).
The simplest of these is the θ update, since under the flat Dirichlet prior this full conditional distribution is again Dirichlet, involving total counts of both observed MT-SC data and imputed z's underneath the MT-MC data.
The univarite φ parameter is updated with a random-walk Metropolis sampler in a small uniform window around the current value. Numerical experiments lead us to settle on window size for production runs.
For the latent matrix z, we sample one patient's count vector to be modified, and we simply propose to modify each type's latent count to be a draw from that variable's marginal Poisson distribution (given θ). Then we adjust this proposal according to the Dirichlet-Multinomial likelihood component to retain detailed balance against p(z|else). We experimented with modifying fewer than all type counts at once, but found adequate mixing with the stated proposal.
Reported posterior summaries come from 10000 saved states subsampled every 500 states after substantial burn in. MCMC diagnostics supported this design ( Figure S2). In the dod package, see help(MCMC) for specific implementation.
In R version 2.15.1, and on a 2.53GHz Intel Xeon CPU, the J-region MCMC run used 43.8 CPU minutes and the V-region MCMC run used 70.4 CPU minutes (5 million post-burn-in sweeps each; user time from proc.time).

Predictive sampling
The second step in computing the conditional predictive p-value involves the evaluation of the predictive mass function for various vector inputs x * (both the observed WT data as well as simulated versions). Considering the relative complexity of the double-overdispersion model for y, it is apparent that p(θ|y) has no simple analytic form, in spite of the fact that our analysis allows us to sample it via MCMC. A computationally intensive and numerically unstable approach to evaluate 1 is to replace the integral with a Monte Carlo approximation using the MCMC output; we could do this both for x * equal the observed WT data and for the numerous predictive simulations. Like the instability observed with importance sampling, there is instability by this approach; our experiments showed that one or just a very few of the sampled θ values would carry substantial weight p(x * |θ) for any given x * . Instead, we adopted the following approximation. We assumed that p(θ|y) could be approximated by some Dirichlet distribution. We used the posterior samples (obtained by MCMC) to approximate the parameters of this Dirichlet (using the method of moments). Then we took advantage of the fact that the integral in 1 has a closed form when p(θ|y) is Dirichlet and p(x * |θ) is Multinomial. Indeed, it is precisely (and coincidentally) the the Dirichlet-Multinomial model analogous to the model used in the main paper (equation 3.2). The function cpp in the package dod gives the specific implementation: cpp <function ( # predicted data density logpred.star <-lgamma( xstar + dirtot*theta.mean ) -lgamma ( xstar+1 ) # observed data density logpred.x <-lgamma( WTcounts + dirtot*theta.mean ) -lgamma( WTcounts + 1 ) logpred.ratio <-apply( logpred.star -logpred.x , 2, sum ) pval <-mean( logpred.ratio <= 0 ) return(pval) }

Approximate likelihood ratio statistic
For comparative purposes, we sought to evaluate the likelihood ratio test for the same problem. This is complicated by the fact that maximum likelihood estimation is difficult owing to the hierarchical model structure. To approximate MLE's, we used posterior mean estimates computed using the developed MCMC code. For the null case, we had to modify the MCMC code so that all MT and WT data were used to compute the estimated θ vector (in contrast to just the MT data which were used for p cp ). Since the WT data are multinomial this is a simple modification (method=2 within the MCMC function in dod). Thus we approximatedθ 0 (MLE under null) andθ MT andθ WT (MLEs under alternative), as well as the MLEsφ 0 andφ 1 of the overdispersion parameter (under the null and alternative). With x and y denoting the observed WT and MT data sets, the LR statistic is In both terms the joint mass function factors, and p(x|θ) takes the multinomial form. The difficulty is in p(y|θ, φ), which is an expectation, over latent Z matrices, of Dirichlet-Multinomial masses p(y|Z, φ), where Z is a matrix with Poisson-distributed entries. This we approximate by forward simulation of the Poisson matrices. Finally we obtain p-values by treating the LR statistic as chisquare distributed on K − 1 degrees of freedom, as would be valid asymptotically. Code to evaluate the approximate LR test is in demo(lrstatJ) and demo(lrstatV). We observed higher Monte Carlo error in the LR p-value than in the conditional predictive p-value. For example, 20 repeated runs of cpp(precooked$fitJ) had mean 0.046 with standard deviation 0.002; while 20 repeats of demo(lrstatJ) had mean 0.038 with standard deviation 0.011. As we were fixing the MCMC sample, this extra variation came from the forward simulation component of the LR statistic calculation.

Simulation
To check the null sampling distribution of p cp we simulated the double-overdispersion model and repeatedly computed its value via posterior and conditional predictive sampling. Figure S3 shows results for one case where the parameters were as estimated for the J-region data. The function simdod in the package dod gives the specific implementation: for( isim in 1:nsim ) { # make WT data wtsc <-sapply( as.list(sizes$n.wtsc), rmultinom, n=1, prob=theta ) wtmc <-sapply( as.list(sizes$n.wtmc), rmultinom, n=1, prob=theta ) # make MT SC data mtsc <-sapply( as.list(sizes$n.mtsc), rmultinom, n=1, prob=theta ) # make MT MC data by first getting latent Z's muz <-theta %o% sizes$nsurvive zstate <-apply(muz,2,rpois, n=nrow(muz) ) # next make Dirichlet distributed prob vector uu <-apply(phi*zstate,2,rgamma,n=nrow(muz) ) uprob <-t( t(uu)/apply(uu,2,sum) ) # next make the conditionally multinomial observations uplist <-as.list( data.frame(uprob) ) ## columns entries of list mtmc <-mapply( function(x,y) rmultinom(n=1,size=x,prob=y), sizes$n.mtmc, uplist ) simcounts <-list( 'WT-SC'=wtsc, 'WT-MC'=wtmc, 'MT-SC'=mtsc, 'MT-MC'=mtmc ) ## simulated data # run MCMC to get posterior of theta given MT (only) data Y=y fit.sim <-MCMC( simcounts, sizes$nsurvive, mcmc=mcmc, method=1, print.iter=FALSE) ## get conditional predictive p-value The null simulation described above was readily modified to enable an assessment of power of p cp . Two distinct θ vectors were required: one governing the multinomial WT data and one governing the more complex MT data. We used the posterior mean of θ as the driving value for simulated MT data (recall, this posterior relies only upon the observed MT data); we used the observed WT proportions vector (MLE, under alternative) as the governing parameter for simulated WT data. As in the null simulation, we used sample sizes from the original study design. Thus the data generation matched as well as possible the estimated non-null state for J-region data in our six patients. For each simulated data set (1000), we computed the conditional predictive p-value (using 20000 post-burn in sweeps in each case) and kept track of the empirical p-value distribution. See demo(power). Table S1: Data: frequencies of V-region types, melanoma patients, blood, mass culture (MC) method. WT and MT data arise from the same underlying discrete population on the null hypothesis, but MT data are subject to extra-multinomial variation.  Table S2: Data: frequencies of V-region types, melanoma patients, blood, single-cell-derived isolates (SC) method. As in Table S1, but we are justified to treat these SC-derived data as multinomially distributed.    Tables S1, S2, similar to Table 1, which shows data for J regions. Colored bars show empirical frequencies of each V region type, from various data sources. Boxplots summarize posterior analysis of the underlying proportions conditional on the MT (not WT) data. Types are arranged from top to bottom by increasing value of the posterior median proportion. Note that the boxplots track the green bars better than the red bars, since the SC data are less variable than the MC data. The hypothesis test asks if the common (vector over types) mean of the MT data differs significantly from the mean of the WT cells (blue).  Figure S3: Simulated null distribution of conditional predictive p-value, with settings (i.e. sample sizes and parameters) matching those given and estimated for the J region patient data. 1000 data sets were simulated from the double overdispersion (dod) model (on the null). MCMC was used on each simulated data set to compute the conditional predictive p-value. 1000 states were saved after a short burn-in and subsampling every 20 complete scans. This and similar numerical experiments (not shown) demonstrate that the conditional predictive p-value has a frequency distribution that is approximately uniform

S6 Asymptotics
Proof of Lemma 6.1 From Stirling's approximation, Applying this to all factorials in the log mass function, we have Hereθ j = X n j /n, the maximum likelihood estimate of θ j , and E n accumulates error terms and is converging in probability to 0 as n → ∞ since all θ j are strictly positive. The result follows immediately upon recognizing that minus 2 times the term preceding E n is the likelihood ratio statistic, which is converging to the chi-squared limit according to standard theory (e.g., Bickel and Doksum, 1977, page 315.)

A different version of p cp
Fisher's exact test would have enabled a comparison of WT and MT populations if the counts Z mc i were observable. Our first idea was to repeatedly recompute Fisher's p-value using these simulated latent counts, and then average the resulting p-values. Formally, we constructed a statistic according to the following argument.
Let h(x, z) denote the generalized hypergeometric mass function in a K × 2 contingency table with count vector x in one column and z in another. Specifically where · indicates summation over the K distinct types t. Combine all WT counts into a vector X = {X t } over types t by adding contributions from X mc i and X sc i across all tissue samples. That is X t = i X mc i,t + X sc i,t . Similarly combine MT data, but consider doing so with imputed counts Z mc i and with any available single-cell derived counts Y sc i , and call this vector Z; In each case, we combine multinomial counts that have a common probability vector, and so we retain the multinomial form. Finally, let Y record the mass-culture MT data Y mc i from all tissue samples (these are subject to overdispersion) as well as any available singlecell derived counts Y sc i that had also been recorded in Z. (Here we do not collapse by counting contributions over i, since such summarized counts would entail undue information loss; instead Y is a collection of vectors.) To emphasize notational distinctions, X, Y , and Z refer to random elements in our actual experiment, taking possible values x, y, and z. As is sometimes done in p-value discussions, we introduce a separate notation for data obtained from a hypothetical repeat of the experiment: let X rep and Z rep denote hypothetical repeated draws of the random vectors X and Z. In these terms, we originally considered where the event C = {X rep +Z rep = x+Z}, interpreting the sums in C as coordinate-wise additions.
To investigate this probability, write the key event as Here θ is the vector of unknown probabilities over types (its randomness/uncertainty comes from the Dirichlet prior distribution) and Z is the aggregated vector of true type counts. The inner conditional probability, at any realization θ and Z = z, is By sampling, X rep , Z rep , and Y are conditionally independent given θ, and so the event Y = y may be erased from the conditioning event in (4). Further, the conditioning on C converts the conditionally independent multinomial vectors X rep and Z rep into a generalized hypergeometric table, and thus allows us to erase θ from the same conditioning event. By further simplification, (4) becomes This is precisely the Fisher-exact-test p-value that one would compute from x and z, say p Fisher (x, z), and so we have established that the p-value defined in (2) satisfies: Recognizing that C no longer affects this expression, we have finally as suggested in the opening of this section. For further analysis we consider a somewhat simpler data structure. Specifically, suppose that at the level of resolution of interest, there are K ≥ 2 distinct types of T-cell receptors, and a single vector X = (X 1 , X 2 , . . . , X K ) T counts occurrances of these types from a set of size m = K k=1 X j receptors sequenced from WT cells. Similarly, a single vector Y = (Y 1 , Y 2 , . . . , Y K ) T records n = K k=1 Y k MT mass culture counts, and a vector Z = (Z 1 , Z 2 , . . . , Z K ) T records the latent counts in MT cells immediately after selection in a mass culture experiment. As new notation, let N = K k=1 Z k count the total number of surviving MT cells. In contrast to the actual data structure, this one overlooks the breakdown of counts according to different patients, and it ignores any MT SC data.
Consider the K × 2 contingency table with columns X and Z holding WT and MT counts, respectively. The null model asserts that X and Z are independent multinomial vectors from a common probability vector θ = (θ 1 , θ 2 , . . . , θ K ) T , given totals (row sums) m and N . We suppose that all types are possible; that is θ k > 0 for all k. It is well known that if m and N are reasonably large, then Fisher's exact p-value is equivalent to the p-value computed from Pearson's chi-square statistic (e.g., Bickel and Doksum, 1977, page 323). In this K × 2 table, Pearson's statistic is Nθ k , whereθ k = (X k + Z k )/(m + N ). This can be re-expressed as a quadratic form relative to the K × K diagonal matrix D 1/θ holding 1/θ on its diagonal: where γ m = m/N is a sampling ratio andX andZ are standardized versions of X and Z: Here we index by m because we envision asymptotics in which m and N agree in order of magnitude. We assume γ m −→ γ ∈ (0, ∞) as m −→ ∞. By the central limit theorem, the vectorsX m andZ m in (8) converge in distribution to independent and identically distributed Gaussian vectors with mean zero and covariance matrix D θ − θθ T . Note that this covariance matrix is singular owing to the constraint K k=1 θ k = 1. LettingX andZ refer to the respective limits, the limiting chi-square statistic from (7) is SinceX − √ γZ is Gaussian with mean zero and covariance matrix (1 + γ) D θ − θθ T , it follows that χ 2 (X,Z) has a chi-square distribution with K − 1 degrees of freedom, which is the trace of , Bishop, Fienberg, and Holland, 1975, page 473). Introducing an independent mean zero Gaussian vector U , with covariance matrix D θ − θθ T , the would-be Fisher p-value is when realized atX =x andZ =z, since the quadratic form U T D 1/θ U also has a chi-square distribution on K − 1 degrees of freedom. Further, the realized value of the conditional predictive p-value is for some limiting standardized versionỸ of the MT mass culture counts Y . In the finite-sample case (equation 6), the expectation refers to the posterior predictive distribution that integrates both uncertainty in the parameters φ and θ as well as the missing counts Z. In the limiting case (11), the expectation is assumed to average only the latentZ. All parameter uncertainty is assumed to have vanished, as predicted by large-sample parametric Bayesian inference (e.g., Schervish, 1995, Section 7.4). Mimicking the finite-sample model structure, the standardizedỸ satisfies: where ξ > 1 reflects limiting extra-multinomial variation. This is precisely the limiting form implied by the Dirichlet-Multinomial observation model; ξ depends on the overdispersion parameter φ and the limit of the sampling ratio N/n (when the limiting sampling ratio is unity, ξ 2 = 1 + 1/φ). Therefore, the conditional expectation in (11) is relative to the distributioñ Z Ỹ =ỹ ∼ Normal ỹ 1 + ξ 2 , This follows straightforward normal-theory regression, though accounting for the singularity of the covariance matrix. There is benefit in taking a change-of-variables in (11) and (12). Introduce the random vector Conditionally uponỸ =ỹ, V is a mean zero Gaussian with covariance matrix D θ − θθ T . Indeed this is also the marginal distribution of V . Various quantities can be re-written in terms of V . For example, with a = γξ 2 /(1 + ξ 2 ) the deviation vector inside (9) is The quantity in parentheses, say W =X − √ γỸ 1+ξ 2 , holds information from the limiting observables, and V expresses limiting latent deviations. With these considerations, the limiting conditional predictive p-value (11) can be evaluated explicitly. Using w = x − √ γỹ/(1 + ξ 2 ) as a possible realization of W , where H, being a ratio of independent chi-squared variables, has a non-central F distribution on K − 1 and K − 1 degrees of freedom, and with non-centrality parameter λ = (w T D 1/θ w)/a 2 , and where c + 1 = (1 + γ)/a 2 = (1 + γ)(1 + ξ 2 )/(γξ 2 ). This somewhat striking representation of the conditional predictive p-value enables us to study its limiting null probability distribution. Evidently the p-value depends on the data only through the quadratic form inside the non-centrality parameter λ. Randomness in the data is transduced through this non-centrality parameter, which becomes distributed as a scaler multiple of a (central on the null) chi-square random variable on K −1 degrees of freedom. Specifically, taking S = (W T D 1/θ W )/[1+γ/(1+ξ 2 )], we have established the following.
Theorem 1 Let G λ (f ) denote the cumulative distribution function of a non-central F distribution on K −1 and K −1 degrees of freedom, and with non-centrality parameter λ. With c = 1+γ γ 1+ξ 2 ξ 2 −1, we have in the asymptotic null case that the conditional predictive p-value p * cp (X,Ỹ ) is equal in distribution to G cS (1 + c), where S ∼ χ 2 K−1 .
The transformation S −→ G cS (1 + c) is monotone decreasing, owing to properties of the noncentral F distribution. It follows, therefore, that if s α is a chi-square quantile such that P (S > s α ) = α for α ∈ (0, 1), then G cS (1+c) < G csα (1+c) when S > s α , and thus P [G cS (1 + c) ≤ G csα (1 + c)] = α. That is, G csα (1 + c) is the α-quantile of the distribution of the limiting p * cp (X,Ỹ ). Introducing F (p) = P [p * cp (X,Ỹ ) ≤ p] as the limiting distribution function of the conditional predictive p-value, and letting F −1 be its quantile function, we have established a corollary to Theorem 1: namely, F −1 (α) = G csα (1 + c). Figure S4 shows a few examples of the distribution function F (p), which is readily obtained by numerical inversion. Conservativeness of the limiting p-value is evident in these examples, since F (p) < p for most p. We have not found a general proof, except when considering a neighborhood of the origin and invoking the inverse function theorem. On one hand, d dα F −1 (α) = 1 f [F −1 (α)] , where f is the density function of p cp (X,Ỹ ). Using the Poisson-mixture representation of G λ , direct differentiation also gives Here q j is the probability that a central F distribution on K − 1 + 2j and K − 1 degrees of freedom is less than (1 + c)/[1 + 2j/(K − 1)]. The behavior of F −1 is determined by the first factor dsα dα , which diverges as α → 0. Coupled with the first finding about F −1 we conclude that the p-value's density f (p) converges to 0 as p → 0, which further implies the following.
Theorem 2 The conditional predictive p-value is dominated by the uniform distribution, asymptotically, in the sense that F (p) = P p * cp (X,Ỹ ) ≤ p < p for all sufficiently small p > 0.
On the surface it seems that p * cp will work fine. It has a non-degenerate asymptotic null distribution that gives conservative p-values, and knowing the limit could enable a computational adjustment to uniformity. However something quite unusual happens because the asymptotic argument above has assumed knowledge of the underlying parameter vector θ in the probability simplex. This ordinarily would not be a problem, since we have a consistent estimateθ available from the MT data, and thus we would do the statistical thing and simply plug inθ where necessary to get an empirical version of p * cp . It turns out that this plug in completely destroys the asymptotic distribution of the p-value, in contrast to most plug in efforts. The reason is that with a large sample size, any deviation between θ andθ is amplified by the (now) hyper-sensitive Fisher p-value on the large imputed data set. Computational experiments (not shown) bear this out. Fortunately, the proposed p-value p cp , which does not average Fisher p-values, is not subject to this sampling defect. (ξ = 1)(K = 2) (ξ = 5)(K = 2) (ξ = 1)(K = ∞) (ξ = 5)(K = ∞) Figure S4: Limiting distribution functions of the (original) conditional predictive p-value p * . Shown are four versions of F (p), the limiting null c.d.f.. From Theorem 6.1, F −1 (α) = G csα (1 + c) for s α an upper quantile of the chi-square distribution on K − 1 degrees of freedom, G λ the c.d.f. of a non-central F distribution on K − 1 and K − 1 degrees of freedom, with non-centrality parameter λ, and a constant c > 0 that depends on the extent of overdispersion, as measured by ξ. In all cases investigated, the c.d.f. is dominated by the uniform distribution for small p, as predicted by Theorem 2. The greater the overdispersion the larger the deviation from uniformity. The number of distinct T-cell types K has a modest effect on the distribution. An explicit formula for F (p) is available in the K → ∞ case.