Recurrent mutation in the ancestry of a rare variant

Abstract Recurrent mutation produces multiple copies of the same allele which may be co-segregating in a population. Yet, most analyses of allele-frequency or site-frequency spectra assume that all observed copies of an allele trace back to a single mutation. We develop a sampling theory for the number of latent mutations in the ancestry of a rare variant, specifically a variant observed in relatively small count in a large sample. Our results follow from the statistical independence of low-count mutations, which we show to hold for the standard neutral coalescent or diffusion model of population genetics as well as for more general coalescent trees. For populations of constant size, these counts are distributed like the number of alleles in the Ewens sampling formula. We develop a Poisson sampling model for populations of varying size and illustrate it using new results for site-frequency spectra in an exponentially growing population. We apply our model to a large data set of human SNPs and use it to explain dramatic differences in site-frequency spectra across the range of mutation rates in the human genome.

Recurrent mutation has long been recognized as an important factor of evolution (Fisher 1928;Haldane 1933;Wright 1938). This is emphasized by recent analyses of single-nucleotide polymorphism (SNP) frequencies and variation of mutation rates across the human genome (Aggarwala and Voight 2016;Harpak et al. 2016;Seplyarskiy et al. 2021) describing how patterns of variation depend on the mutation rate, particularly for rare variants. By a rare variant we mean an allele, such as an alternate base at a SNP, which is observed a relatively small number of times in a large sample. Unless the mutation rate is very small, indistinguishable copies of the same allele may descend from multiple mutations. Here, we present a sampling theory for the numbers and associated frequencies of these unobserved or latent mutations in the ancestry of a rare variant.
Humans are on the low end of polymorphism levels among species (Leffler et al. 2012). On average, multiple mutations should be rare. In the 1000 Genomes Project data, about 1 in 1300 sites differ when two (haploid) genomes are compared, and SNPs with more than two bases segregating comprise only about 0.3% of the total SNPs observed (The 1000 Genomes Project Consortium 2015). But polymorphism rates vary by two or three orders of magnitude depending on local sequence context (Aggarwala and Voight 2016;Harpak et al. 2016;Seplyarskiy et al. 2021). Recurrent mutation is an important phenomenon for fast-mutating sites. Evidence for this can be found in the haplotype structure surrounding rare mutations (Johnson et al. 2022) and in the distribution of their frequencies among sites in large samples (Harpak et al. 2016;Seplyarskiy et al. 2021).
Here we focus on the latter, in particular on the sitefrequency spectrum (Tajima 1989;Braverman et al. 1995;Fu 1995). Deviations in site-frequency spectra compared to standard predictions may be due to selection (Bustamante et al. 2001;Achaz 2009;Ferretti et al. 2017), changes in population size over time (Eldon et al. 2015;Liu and Fu 2015;Gao and Keinan 2016) or population structure (Gutenkunst et al. 2009;Städler et al. 2009;Kern and Hey 2017). But they may also be due to multiple mutations, i.e. to violations of the infinite-sites model assumption that each polymorphism is due to a unique mutation (Fisher 1930a;Kimura 1969Kimura , 1971Ewens 1974;Watterson 1975).
The standard site-frequency prediction, which holds for a wellmixed population of constant large size N and neutral mutation rate u at a locus, is that the number of SNPs where a variant is found in i copies in a sample of size n should be proportional to θ/i, where θ = 4Nu (Tajima 1989;Fu 1995). This dramatically underpredicts the abundance of rare variants in data from humans, which is largely due to our recent explosive population growth (Keinan and Clark 2012;Gazave et al. 2014;Gao and Keinan 2016), but the standard neutral model is a useful starting point for modeling recurrent mutation. Jenkins and Song (2011) studied the occurrence of one or two mutations at a single site under the standard neutral coalescent model (Kingman 1982;Hudson 1983;Tajima 1983). They showed that if two mutations occur and are non-nested (meaning that all descendants of both mutations can be observed) there will be a shift away from rare variants and toward common ones. An earlier work focusing on the nested case is Hobolth and Wiuf (2009). Bhaskar et al. (2012) used a similar approach as Jenkins and Song (2011) to obtain results for one, two or three mutations, up to leading order in the mutation parameter θ. Sargsyan (2006Sargsyan ( , 2015 considered two mutations occurring at two different sites, and Jenkins et al. (2014) assumed that two mutations are distinguishable and yield a tri-allelic polymorphism. These latter works (Sargsyan 2006(Sargsyan , 2015Jenkins et al. 2014) allowed for variable population size following the general coalescent approach of Griffiths and Tavaré (1998). None of these works considered rare variants in particular but their predictions, especially those for non-nested mutations (Jenkins and Song 2011;Bhaskar et al. 2012) are helpful for understanding recurrent mutation.
Two recent large studies of human SNPs observed this predicted shift away from rare variants and toward common ones at fast-mutating sites. Harpak et al. (2016) surveyed about 8 million SNPs in a sample of nearly 61 000 people in version 0.2 of the Exome Aggregation Consortium database (Lek et al. 2016) for which data were available from other primate species. Among these, about 93.3% of these were bi-allelic, 6.5% were tri-allelic and 0.2% were quad-allelic. Harpak et al. (2016) took the presence of identical segregating variants in different species, ranging from chimpanzees to baboons, as indicative of a higher mutation rate at a site. Consistent with the hypothesis of multiple latent mutations at fast-mutating sites, they found fewer rare variants at bi-allelic SNPs for which the minor allele was segregating in another species, and that this effect is stronger when the other species is closer to humans.
The work we present here builds upon the second of these studies. Seplyarskiy et al. (2021) looked at rare variants in two datasets, one containing about 292 million variants among nearly 43 thousand individuals in TOPMed freeze 5 (Taliun et al. 2021) and the other containing about 182 million variants among 15 thousand individuals in gnomAD version r2.0.2 (Karczewski et al. 2020). Variants were divided into 192 types: each of the 3 possible base substitutions at the middle site of all 64 possible trinucleotides. A classic example of a fast-mutating site in this context would be ACG, which readily changes to ATG via a C to T transition at the CpG dinucleotide (Bird 1980;Goldman 1993). The main goals in Seplyarskiy et al. (2021) were to quantify how the rates of each kind of mutation vary across the genome and to partition this variation into distinct components correlated with different mutational processes.
Another aim, taken up in the Supplementary Materials of Seplyarskiy et al. (2021), was to correct for multiple mutations contributing to rare variants. Recurrent mutation was modeled as a multi-type Poisson process where mutations with lower sample counts occur independently at a locus to generate the appearance of higher count mutations (Desai and Plotkin 2008). The expected counts in the absence of recurrence were taken from the site-frequency spectrum at slow-mutating sites. The loss of rare variants due to recurrent mutation at fast-mutating sites was quantified for sites with up to 70 copies of a rare variant. These were considered to have descended from up to 5 mutations. Slow-mutating sites, even with rates up to the genome average in humans, should conform fairly well to the infinite-sites assumption. Resampling from these as in Seplyarskiy et al. (2021) is a way of controlling for the myriad unknown factors affecting the site-frequency spectrum, including growth.
In this work, we present a sampling theory for latent mutations of rare variants at each given site-frequency count in a large sample. We describe a mathematical population genetic framework for the Poisson-resampling method in Seplyarskiy et al. (2021) and provide closed-form analytical expressions for several quantities of interest. In short, the distributions of latent mutations and counts of rare variants depend on the expected total length of the gene genealogy of the sample, the expected lengths of branches with few descendants in the sample, and of course the mutation rate. We obtain new large-sample results for exponential growth and use these to illustrate the theory. We apply our results to a different subset of the gnomAD data than Seplyarskiy et al. (2021), synonymous variants observed in non-Finnish European individuals in v2.1.1, containing about 834 thousand variants at about 12.3 million sites among 57 K individuals, presorted into 97 bins based on estimates of mutation rate by the method of Seplyarskiy et al. (2022).
We develop and present these results in the next three sections. In "Theory for constant-size large populations," we begin with the standard neutral coalescent or diffusion model of population genetics (Ewens 2004) and demonstrate a close connection between the Ewens sampling formula (Ewens 1972) and distributions of latent mutations. In "Theory for nonconstant populations," we extend the results to populations which have changed in size, using the Poisson-sampling models of Watterson (1974b) and . In "Theoretical example and data application," we compare predictions for constant size to those for exponential growth and show how the new theory can be applied to understand the effects of recurrent mutation on counts of rare variants across the range of human per-site mutation rates.

Theory for constant-size large populations
In this section, we begin with a description of recurrent mutation via the well known predictions for allele frequencies in a population and in a sample at stationarity. We then use conditional ancestral processes to demonstrate independence of latent mutations of rare variants in a large sample and show that their numbers are distributed like the numbers of alleles in the Ewens sampling formula.
in which Γ(·) is the Gamma function, and where necessarily x K = 1 − i<K x i (Wright 1931(Wright , 1949. Conditional on the population frequencies (X 1 , . . . , X K ), the sample counts of alleles (N 1 , . . . , N K ) are multinomially distributed. A sample of size n taken from the population contains n 1 , . . . , n K−1 copies of alleles 1 through K − 1, and necessarily n K = n − i<K n i copies of allele K, with probability for n i ∈ {0, 1, . . . , n} constrained by i n i = n and where k (r) denotes the Pochhammer function or rising factorial k(k + 1) · · · (k + r − 1) with k (0) = 1. The shorthand defined in (2) is used extensively in what follows. In applications to DNA, K = 4 and a sample at a given site would contain counts n 1 , n 2 , n 3 , n 4 of each of the four nucleotides. The assumption of parent-independent mutation which leads to the relatively simple expressions (1) and (3) is unrealistic for DNA, but its results are useful in the case of rare variants in very large samples. In this case, it is likely that the common variant, allele 4 say, represents the ancestral state of the entire sample and that rare variants (alleles 1, 2 and 3) are due to recent mutations from the common variant. Then the mutation parameter θπ i for i ∈ {1, 2, 3} captures the production of type-i rare alleles in a specific ancestral background (allele 4).
An instructive special case is K = 2, where we have for the stationary distribution of the frequency of type 1 in the population Wright (1931), and for the sampling probability, i.e. that a sample of size n contains n 1 copies of allele 1 and n 2 = n − n 1 copies of allele 2. Any two-allele mutation model can be described as a parent-independent model, but this is not so in general for K > 2. Figure 1 shows how the sample frequency distribution p(n 1 ; n) in (5) depends on the mutation rate for a pair of alleles which differ by an order of magnitude in mutation rate. Three value of θ are shown, with the small value chosen so that the mutation rate for allele 2 (θπ 2 ) is equal to the human average of about 1/1300 (The 1000 Genomes Project Consortium 2015) and the mutation rate for allele 1 (θπ 1 ) is ten times that. When θ is small, the distribution is U-shaped and nearly symmetric, given that the sample is polymorphic. When θ is around one, the distribution becomes J-shaped (or L-shaped if π 1 < π 2 ). When θ is large, the distribution has a peak around π 1 . Graphs of ϕ(x) (not shown) display these same shapes, and p(n 1 ; n) will be very close to ϕ(x) dx when n is large.

Relationship to infinite-sites frequency spectra
We use θ for the per-site mutation parameter. In a collection of L total sites at which (5) holds, the finite-sites version of the site-frequency spectrum (i.e. the expected number of sites with n 1 copies of allele 1 and n 2 copies of allele 2) is given by the product Lp(n 1 ; n). Note, these expected numbers of sites do not depend on the rate of recombination, whereas the variances among sites and covariances between sites do (Kaplan and Hudson 1985).
Infinite-sites mutation models may be obtained as limits of finite-sites models as L tends to infinity with the total mutation parameter Lθ remaining finite. So when θ is small, we expect finite-sites results to be close to the usual (infinite-sites) predictions from the diffusion model (Ewens 1979(Ewens , 2004 or the coalescent model (Fu 1995). Finite-sites models distinguish between kinds of mutations, subject to different mutation pressures, whereas infinite-sites models implicitly treat all mutations the same.
From Ewens (1979) equation (8.18) or Ewens (2004) equation (9.18)-see also Wright (1938) equation (16)-the expected number of sites segregating in the population with frequencies between x and x + dx under the infinite-sites model is proportional to 1/x. For comparison to (4) we may write for a single site (θ small) approximately under the standard infinite-sites mutation model. For comparison with (5), we have for the approximate single-site probability that there are n 1 type-1 alleles in a sample of size n. Equation (7) has the same form as the usual infinite-sites site-frequency spectrum (Fu 1995) but here it is for a specific mutant (allele 1) with a specific ancestral type (allele 2 in the two-allele model).
for n 1 ∈ {1, . . . , n − 1}. The diffusion result (4) does not admit atoms of probability at x = 0 or x = 1-see section 10.7 of Ewens (2004) for discussion-but we can interpret (8) intuitively as follows. If θ is close to zero, most of the time the population will be fixed, containing only allele 1 with probability π 1 and only allele 2 with probability π 2 . Mutants of type 2 and type 1 are introduced with rates θπ 2 and θπ 1 in these two backgrounds, respectively. Then the leading terms in (8) represent a mixture of two infinitesites models like (6) with the constants of proportionality specified. Equation (9) has an identical interpretation, as a mixture of two infinite-sites site-frequency spectra. These are the key principles of the boundary mutation model (Vogl and Clemente 2012;Vogl et al. 2020). Although no closed-form expression like (1) is available except under parent-independent mutation, Tang (2016, 2017) have shown that the stationary densities for pairs of alleles under general mutation models take forms identical to (8) when θ is small; see equation (21) in Burden and Tang (2017). See also Schrempf and Hobolth (2017). Similarly from a coalescent analysis of general K-alleles mutation, Bhaskar et al. (2012) obtained leading order terms for sampling probabilities with forms identical to (9) when θ is small and samples contain just two alleles. For K = 2, the result from Theorem 1 of Bhaskar et al. (2012) is identical to (9).

Mutation and the frequencies of rare sample variants
Our goal here is to understand how the frequency spectra of rare variants depend on θ and on the number of mutation events in the ancestry of the sample under the standard neutral coalescent or diffusion model of population genetics which assumes constant population size (Ewens 2004). We first describe an ancestral process for the sample, then focus on rare variants in a large sample to obtain predictions about latent mutations.

A conditional ancestral process for rare variants
Here, we focus on ordered samples because the calculations are more intuitively related to the familiar rates of events in the ancestral coalescent process. The results do not depend on the order and so apply equally to ordered and unordered samples. Using the subscript "o" for ordered and writing p o (n 1 , . . . , n K ) in place of p o (n 1 , . . . , n K−1 ; n) to facilitate the calculations, we have which differs from the sampling probability in (3) only by the multinomial coefficient, or the number of ways a sample containing allele counts n 1 , . . . , n K can be ordered.
Equation (10) is suggestive, as are (3) and (5), that the sampling structure of the n i copies of allele i may be related to the Ewens sampling formula (Ewens 1972). Specifically, from the fact that where |S (k i ) n i | is an (unsigned) Stirling number of the first kind, we might guess that there is a latent variable k i which is the number of mutations giving rise to the n i copies of allele i. As in the usual application of the Ewens sampling formula, in contrast to the total possible number of type-i mutations in the ancestry of the sample, these latent mutations are just those k i ∈ {1, . . . , n i } most recent ones which produced the observed alleles.
That is, based on (10) and (11), we suppose that the joint probability of the sample counts n 1 , . . . , n K and their numbers of latent mutations k 1 , . . . , k K is given by and therefore that the probability of k 1 , . . . , k K conditional on n 1 , . . . , n K is given by which applies to both ordered and unordered samples. We show that (13) is true using the ancestral-process approach of Griffiths andTavaré (1994a, 1994b). If sampling probabilities like (3) or (10) are known, this approach can be used to describe the conditional ancestral process of a sample given its allelic types (Slade 2000a(Slade , 2000bFearnhead 2001Fearnhead , 2002Stephens and Donnelly 2003;Baake and Bialowons 2008). Following our analysis of (13) for arbitrary (n 1 , . . . , n K ), we describe a large-n approximation in which allele K is the overwhelmingly common type and 1 through K − 1 are the rare variants.
The conditional ancestral process has the same total rate of mutation and coalescence as the unconditional process, n(θ + n − 1)/2. Lineages which must be of type i in the sample experience type-i mutations at rate n i θπ i /2 and type-i coalescent events at rate n i (n i − 1)/2, but with additional weights proportional to the probability of (n 1 , . . . , n K ) given each event. All other events have rates equal to zero because the sample could not be (n 1 , . . . , n K ) if they occurred. To obtain (13), we follow ancestral lineages only back to the first mutation event they experience. The probability of a type-i mutation event is and the probability of a type-i coalescent event is where we have used (10) to obtain the results on the right. Whether mutation or coalescence occurs, the number of type i lineages decreases by one: n i → n i − 1. This ancestral process continues until there are no un-mutated ancestral lineages, that is until n i = 0 for all i ∈ {1, . . . , K}.
To this we add a mutation counting process which starts with k i = 0 for all i ∈ {1, . . . , K} then has k i → k i + 1 whenever a mutation occurs on a type-i ancestral lineage. Equations (14a) and (14b) show that each event in the ancestral process includes two sub-events: a choice of the allelic type involved then a choice between mutation and coalescence. Depending on (n 1 , . . . , n K ), the n = i n i choices of allelic type will result in a random ordering of events among types. But for every ordering, the series of choices between mutation and coalescence within allelic type i depends only on n i (and θπ i ) and is independent of what happens in the ancestry of allele j ≠ i. The number of mutations of type i is the sum of n i Bernoulli random variables with success probabilities θπ i /(θπ i + j − 1) for j from n i down to 1. The number of latent mutations counted in this way will be distributed like the number of alleles in the Ewens sampling formula-see  and -with mutation parameter θπ i for allele i, and these counts will be independent among alleles as in (13).
We use this conditional ancestral process below but here note its close relationship to models of lines of descent (Griffiths 1980;Watterson 1984). In particular, (13) is included in equation (3.3) and Theorem 4 of Donnelly (1986), who extended Watterson's lines-of-descent model to the case of K-allele, parent-independent mutation. See also Donnelly and Tavaré (1987). Equation (3.3) in Donnelly (1986) in fact shows that if we were to keep track of the numbers of descendants of each latent mutation, the full Ewens sampling formula would give their distribution in the sample.
Before describing a large-n approximation for rare variants, we also note that latent mutations reckoned as in (13) include what Donnelly (1986) called 'spurious mutations to one's own type' and Baake and Bialowons (2008) called 'empty mutations'. These are a modeling artifact not only of parent-independent mutation models but of general mutation models as they are typically implemented (Jenkins and Song 2011;Bhaskar et al. 2012;Jenkins et al. 2014;Burden and Tang 2017;Burden and Griffiths 2019). Empty mutations have no empirical significance and should not be counted as mutations. To deal with them, we must keep track of the ancestral types of lineages when they experience mutations. We can do using the identity which decomposes our previously generic type-i mutations according to their ancestral types j ∈ {1, . . . , K}. A mutation is empty when j = i. In our large-n approximation, we take K to be the overwhelmingly common allelic type in the sample and 1 through K − 1 to be the rare variants. Our goal is to model latent mutations in the ancestry of the rare variants, so we use (15) only for i ∈ {1, . . . , K − 1}. For the common allele K, we instead lump (14a) and (14b) together and record both mutation and coalescence as n K → n K − 1. Making these changes to (14a) and (14b), and again using (10) to simplify ratios of sampling probabilities, the conditional ancestral process for a sample with state (n 1 , . . . , n K ) jumps to state ( . . . , n i − 1, . . . , n j + 1, . . . ) for i, j ≠ K with probability n i n to state ( . . . , n i − 1, . . . , n K + 1) for i ≠ K with probability n i n and to state ( . . . , n K − 1) with probability where we have used Kronecker's delta to accommodate empty mutations, i = j in (16a). Equation (16a) includes both empty and nonempty mutations, but only ones where the ancestral type is also rare. Nonempty mutations where the ancestral type is the common type K are in (16b). This classification of mutations by ancestral type does not change the probabilities of coalescence, so (16c) only differs from (14b) by the absence of type-K coalescent events which are now in (16d). If n K is large compared to n 1 through n K−1 , then n = i n i ≈ n K . The probabilities in (16a) will be O(1/n 2 K ), those in (16b) and (16c) will be O(1/n K ), and the one in (16d) will be O(1). Empty mutations and other mutations with rare-variant ancestors will become negligible as n K grows for fixed n 1 through n K−1 . Keeping only terms of O(1/n K ) and larger gives an approximate, large-n ancestral process with total rate n(θ + n − 1)/2 ≈ n 2 K /2 and jumps, for i ∈ {1, . . . , K − 1}, from state (n 1 , . . . , n K ) to state ( . . . , n i − 1, . . . , n K + 1) with probability to state ( . . . , n i − 1, . . . ) with probability and to state ( . . . , n K − 1) with probability This process is dominated by (17c), that is by events on lineages ancestral to the common allele K, which decrease the number of these but leave the counts of rare-allele lineages unchanged. Although we are not tracing the details of common-allele ancestry, we note that the overwhelming majority of these events will be coalescent events, since their rate is approximately equal to the total rate ∼ n 2 K /2. The next most frequent will be empty mutation events at rate O(n K ), followed by common-allele mutation events with rare-allele ancestors at rate O(1).
When one of the rarer events occurs in the ancestral process, it involves allele i with probability n i /n K , then is either a mutation event from a common allele as in (17a) or a coalescent event as in (17b). This process for the rare variants i ∈ {1, . . . , K − 1} has the same form as that found for all variants and all mutations in (14a) and (14b). Then by the same logic as before, the number of (now nonempty) latent mutations in the ancestry of the rare variants will be distributed like the number of alleles in the Ewens sampling formula, independently and with mutation parameter θπ i for allele i ∈ {1, . . . , K − 1}. In addition if we were to keep track of the counts of each mutation's descendants among the n i copies of rare variant i in the sample, then because every pair of type-i lineages is equally likely to be the one which coalesces when a type-i coalescent event occurs, the distribution of these counts should be given by the full Ewen's sampling formula (Ewens 1972;Kingman 1982;Donnelly 1986;Arratia et al. , 2016. The events involving the common allele in (17c) occur very quickly. But since only a fixed number of events involving rare alleles are required to resolve the ancestry of latent mutation and coalescence, the approximation remains accurate until all the rare-allele events have happened, if n K is large enough. In Appendix section "Time-dependent conditional ancestral process," we study the joint distribution of the times of events among the rare alleles and the numbers of common-allele ancestors when these rare-allele events occur. Focusing on the case of two alleles for simplicity, if T i is the time back to the ith event involving the rare allele 1, we have which in either case tends to zero as n 2 tends to infinity. Further, if N 2 (T i ) is the random number of type-2 ancestral lineages left at the ith event involving the rare allele 1, we have suggesting that, despite the rapid decrease of common-variant lineages, the approximation can hold until the entire ancestry of latent mutation and coalescence is resolved. Even for the largest rare-variant site-frequency count considered in Seplyarskiy et al. (2021), there will still be >1200 commonvariant lineages left on average at T 70 for the TOPMed data (n 2 ∼ 86, 000) and >400 left for the gnomAD data (n 2 ∼ 30, 000). In section "Application to human SNP data," we consider sitefrequency counts up to 40 for synonymous exonic sites in gnomAD with many fewer SNPs but a larger sample size (n 2 ∼ 114, 000) and in this case there should be about 2780 common-variant lineages left at T 40 when the entire ancestry of latent mutation and coalescence among the rare variants is resolved.
In sum, rare alleles in a large sample will quickly coalesce and mutate. Their ancestors will be common alleles. If k i ∈ {1, . . . , n i } is the number of these latent mutations in the ancestry of allele i ∈ {1, . . . , K − 1}, then from the rates of mutation and coalescence in (17a) and (17b) we have Latent mutations of different rare variants are independent and distributed like the numbers of alleles in the Ewens sampling formula, each with its own mutation parameter.

Latent mutations and sample counts of rare alleles
Our goal in this section is to understand how predictions about the counts of rare variants, and hence about their site-frequency spectra, depend on the number of latent mutations and the mutation rate. In anticipation of "Application to human SNP data," we focus on the marginal count of just one rare variant, which we arbitrarily call allele 1. From (20) we have which we note holds for any K. Here we let K = 2 for simplicity. To understand how the mutation rate influences the count of a rare variant, we apply the result for ratios of gamma functions with a common large parameter, 6.1.47 in Abramowitz and Stegun (1964) or equation (1) in Tricomi and Erdélyi (1951), to the terms involving n in (5) to obtain in which we have used n −θπ1 = e −θπ1 log (n) to make a connection with the underlying coalescent tree or gene genealogy. Specifically, 1/i is the expected number of type-1 mutations on the gene genealogy of a sample of size n, and for large n this is approximately equal to θπ 1 ( log (n) + γ) where γ = 0.5772 . . . is Euler's constant. In "Theory for nonconstant populations" we explore this connection in detail and explain the additional constants of proportionality in (22) after finding an analogous result for the general coalescent trees of Griffiths and Tavaré (1998).
Site-frequency spectra are typically defined as the proportion of segregating sites in each possible count in the sample (Braverman et al. 1995) or equivalently as the probability that a single mutation is in each possible count given that it is polymorphic in the sample (Griffiths and Tavaré 1998;Nielsen 2000). So, to understand how n 1 depends on θπ 1 , we may ignore the constants of proportionality in (22) and focus on Then using (23) together with (21), we have for the dependence of the rare-variant count, n 1 , on the number of latent mutations, k 1 , relevant to the site-frequency spectrum. Figure 2 shows site-frequency spectra computed using (23) and (24), and conditioning on the event that n 1 ∈ {1, 2, . . . , 40}. Figure 2a shows the dependence on the number of latent mutations. When all copies descend from a single mutation (k 1 = 1), the usual predictions from the infinite-sites model hold. Thus if we put |S (1) n1 | = (n 1 − 1)! in (24), then consistent with (7) we have The total number of such sites will depend on θπ 1 , and in general on the factor (θπ 1 ) k1 in (21) for larger numbers of latent mutations.
But conditional on k 1 , the site-frequency counts for a rare variant do not depend on θ, at least to leading order in the sample size n. If there are k 1 > 1 mutations in the ancestry of the rare variant, then n 1 cannot be less than k 1 . This is shown in Fig. 2a for k 1 = 2 to k 1 = 5. A key effect of recurrent mutation is to give relatively less weight to low site-frequency counts, as found previously by Jenkins and Song (2011).
Using (21) and (23) the joint distribution of n 1 and k 1 obeys which can be compared to the results of Jenkins and Song (2011). With fixed n 1 and large n in our model, all mutations in the ancestry of the rare variant will be non-nested mutations; note this also follows from (18) in Jenkins and Song (2011). Adapting the notation of Jenkins and Song (2011) in which E (1,1) 2N ,N is the event that the n 1 copies of allele 1 are due to two non-nested mutations, both from allele K = 2 to allele 1, their (21) becomes Numerical computations (not shown) using the unnumbered equation below (10) in Jenkins and Song (2011), which holds for any θ, reproduce the case of k 1 = 2 shown in Fig. 2a when n is large. This is evident in Figure 3 of Jenkins and Song (2011) for the quantity E 2N N . These computations are difficult for samples beyond the hundreds. Our results for k 1 = 3 could potentially also be compared to the O(θ 3 ) results of Bhaskar et al. (2012) using their Theorem 3 and summing appropriately. Figure 2b shows how the site-frequency counts of the rare variant depend on the mutation parameter of that variant, θπ 1 . Although Fig. 2a shows a dramatic effect of k 1 on the sitefrequency counts, Fig. 2b suggests that large values of k 1 are unlikely. This is evident from (21) and (25) in that each additional mutation results in an additional factor of θπ 1 . Note that the smallest value of θπ 1 in Fig. 2b is already more than twice the human average. From (23), we have p(n 1 ; n large, θ small) ∝ θπ 1 n 1 which is consistent with (9) in the case where allele 1 is rare in a large sample. Thus, when θπ 1 is small (0.002 and 0.02 in Fig. 2b) the site-frequency spectrum under recurrent mutation is very close to the standard infinite-sites model predictions. When θπ 1 is large (0.2 in Fig. 2b) the site-frequency spectrum under recurrent mutation is noticeably different, with a dearth of low-frequency variants and corresponding excesses at higher frequencies. Figure 2b plots site frequencies on a log scale to better illustrate differences, especially at higher frequencies.

Theory for nonconstant populations
Here we extend our analysis to populations which deviate from the standard neutral site-frequency predictions. We have in mind populations which have changed in size, although other applications may be possible. Here gene genealogies are the general coalescent trees of Griffiths and Tavaré (1998), which have the same branching structure of standard coalescent trees but may have different distributions of coalescence times. Equation (21) suggests another way to model both the number of copies (n 1 ) of a variant of interest and the corresponding count of latent mutations (k 1 ) when the variant is rare in a large sample.  proved that when the sample size tends to infinity, the numbers of alleles in small counts 1, 2, . . . , i in the Ewens distribution converge to independent Poisson random variables with expected values θ, θ/2, . . . , θ/i. Note that θ/i is the usual expected site-frequency count of mutants in i copies in the sample under the standard neutral model of a large constant-size population. A seminal result of Watterson (1974b) is that the numbers and counts of mutations in a sample from such a multi-type Poisson distribution conform to the Ewens sampling formula when conditioned on their total size. So we may interpret (21) and other findings in the previous section within this independent-Poissons sampling framework. This is exactly the approach in the Supplementary Materials of Seplyarskiy et al. (2021). Again, human SNP data strongly reject the standard neutral model with site-frequencies ∝1/i, owing largely to the great excess of singletons and other rare variants due to our recent growth (Keinan and Clark 2012;Gazave et al. 2014). So we replace 1/i with E[τ i ]/2, where τ i is the total length of branches with i descendants in the gene genealogy of a sample. For an extension of independent-Poissons sampling to variants under selection, see Desai and Plotkin (2008). Our notation is different than in Seplyarskiy et al. (2021) because here we use the coalescent or diffusion time scale.
Under the standard neutral coalescent model, E[τ i ] = 2/i. For the general coalescent trees of Griffiths and Tavaré (1998), τ i can be expressed in terms of the coalescent intervals, T k , which are the lengths of time when there were k ∈ {2, . . . , n} lineages in the ancestry of the sample. In particular, (Fu 1995;Griffiths and Tavaré 1998). Watterson (1974b) studied three models. In Model 1, using our notation, mutations arise from a constant source at rate θ, then propagate or go extinct independently according to a critical branching process, i.e. with birth rate equal to death rate as for a neutral (a) (b) Fig. 2. a) Shows the probability of observing n 1 copies of allele 1 in a large sample given these are produced by k 1 mutations. b) Shows the log 10 -probability of observing n 1 copies of allele 1 in a large sample for three different values of θπ 1 . In both panels, probabilities are normalized to sum to one, that is conditioned on the event that n 1 ∈ {1, 2, . . . , 40}.
mutation. The number of mutations in count i has expected value θμ i /i, for a constant μ > 0 which converges to 1 as the duration of the process increases. Watterson (1974b) proved that the numbers and counts of mutations follow the Ewens sampling formula when conditioned on their total size, which for Watterson (1974b) was equivalent to the population size. Models 2 and 3 are the Moran model and the Wright-Fisher model (Fisher 1930b;Wright 1931;Moran 1958Moran , 1962 and Watterson (1974b) proved that these have the same limit as Model 1 when the population size is large. Model 1 is an example of a logarithmic species distribution (Fisher 1943;Watterson 1974a;Arratia et al. 2003;Lambert 2011). Branching-processes have also been used to describe and infer the ages of rare alleles (Rannala and Slatkin 1997;Slatkin and Rannala 2000;Wiuf 2000); for recent developments and a review, see Crespo et al. (2021). Slatkin (2000) used this approach and an extension of Griffiths and Tavaré (1998) to model the ages of rare alleles in a large sample. Lambert (2012, 2013) studied the convergence of population frequencies of alleles for supercritical, subcritical or critical branching processes. All of these works assume that each allele traces back to a single mutation, as under the infinite-alleles mutation model.
Our approach to modeling recurrent mutation follows that of Watterson (1974b) to Model 1. Whereas Watterson (1974b) did not specify the source of mutations, here we take it to be the production of rare variants by mutation from a common variant on the gene genealogy of a large sample. What for Watterson (1974b) was the total population size is for us the total count of a rare variant. Allele 1 is our nominal variant of interest, but for simplicity for the moment, we use n, k and θ in place of n 1 , k 1 and θπ 1 . As a further notational convenience, we define so that θ ̅ τ i /2 is the expected number of mutations with count i in this independent-Poissons sampling model.
Let (a 1 , a 2 , . . . ) be the numbers of latent mutations of the variant of interest with counts (1, 2, . . . ). We assume that a i ∼ Poisson(θ ̅ τ i /2) and that a i and a j are independent for i ≠ j. Their joint distribution is then with a i ≥ 0. The total sample size is what would set the upper limits of the product and the sum above, but we leave these unspecified for now, only imagining that the total sample size is much larger than the sample count of the variant of interest, so we can model the latter without restriction. We are only concerned with a i for i ≤ b, where b is the largest rare-variant count. Thus, the assumption of independence in (27), which is equivalent to there being no nested mutations in the ancestry of a rare variant, will only need to be true for ̅ τ i with i ∈ (1, . . . , b). In Appendix section "Low-count branches of general coalescent trees" we prove that this holds for the trees of Griffiths and Tavaré (1998) for fixed b in the limit as the total sample size tends to infinity, and that the counts (a 1 , . . . , a b ) converge to independent Poisson random variables as with expected values (θ ̅ τ 1 /2, . . . , θ ̅ τ b /2). A condition is that the total height of the genealogy is finite, which is a mild assumption ruling out pathological situations such as a populations whose sizes increase too quickly backward in time.
The count of the variant of interest is n = i ia i and its number of latent mutations is k = i a i . Following Watterson (1974b), we consider the probability generating function of n and k, which in the present case simplifies to For the details of this derivation, see (A29) in the Appendix. The coefficient of x n (and y k ) can be found using where the sum is over Returning to our notation in which n 1 is the number of copies of a variant of interest, k 1 its number of latent mutations, θπ 1 its mutation parameter, and n is the total sample size, and further using τ to show the new dependence on the vector of expected times (̅ τ 1 , . . . , ̅ τ n−1 ), we have which is nonzero for n 1 = k 1 = 0 and n 1 ≥ k 1 ≥ 1. The sum over (i 1 , . . . , i k1−1 ) here is the same as in (28). It is equivalent to summing over partitions of the integers 1 through n 1 into k 1 subsets, where the sizes of the subsets are (i 1 , . . . , i k1 ). It is convenient to decompose (29) as follows. The number of type-1 mutations is Poisson distributed with parameter equal to the expected number of type-1 mutations on the gene genealogy of the sample. Conditional on this, the distribution of the number of times allele 1 appears in the sample is given by which depends on the relative expected branch lengths but does not depend on θ or π 1 .
In what follows, we consider k 1 ≤ 7 mutations at each site. Equation (30) suggests that this will be accurate up to about three expected mutations per site, because the probability of k 1 greater than 7 is just over 1% when (θπ 1 /2) n−1 i=1 ̅ τ i = 3. As in Fig. 2, the largest value of n 1 we consider is 40. These are not the upper limits of feasibility; it takes two minutes to evaluate (31) for all k 1 ∈ {0, . . . , 7} and n 1 ∈ {0, . . . , 40} in Mathematica version 11.2 (Wolfram Research, Inc. 2017) on a mid-2015 MacBook Pro.
Considering the first three possible values of k 1 in (31), Equation (33) says simply that if there are no type-1 mutations on the gene genealogy then no copies of allele-1 will be observed. Equation (34) is the familiar result for the site-frequency spectrum, that it is given by the proportion of branches in the tree that have n 1 descendants. Equation (35) extends this to two mutations and emphasizes that mutations in the ancestry of a rare allele will be non-nested when n is large.

Relation to K-alleles diffusion results
From a gene-genealogical point of view, (36) is the probability of seeing n 1 total copies of a rare variant when a random number of type-1 mutations occurs on the low-count branches of a standard neutral coalescent tree. However, the type of the common variant and the ancestral states of these mutations are not specified in the independent-Poissons model. Of course these should be allele K, as in "A conditional ancestral process for rare variants," but (36) does not include this event. In contrast, the sampling probabilities (3) and (5) from the equilibrium diffusion model specify the types of the entire sample. Implicitly, they average over the ancestral states of the sample. Here we focus on K = 2 and show how (36) is related to (5) when n is large, in particular to the leading order term in the expansion (22). The type of the common ancestor of the entire sample, at the root of the coalescent tree, is allele 2 with probability π 2 . If this were the case, allele 2 would be the ancestral source of the lowcount type-1 mutations. But if θ is not very small, it is possible for allele 2 to be the ancestral source of these mutations even if the common ancestor is type 1. To illustrate, dividing either (5) or (22) by (36) and letting n → ∞ gives Indeed when θ is small, (22) is close to (36) times π 2 . But the error of this, even as n tends to infinity, may be appreciable for larger values of θ. The additional probability of order θ 2 in (39) is consistent with the possibility that the root of the coalescent tree is type 1 and there are two type-2 mutations, one on each of the two branches descending from the root. A better guarantee that allele 2 is the ancestral source of lowcount mutations would be to specify it not as type of the single most recent common ancestor but rather as the type of the pair of ancestors at the first time in the past when there were two ancestral lineages. Equation (5), with sample size equal to two, gives the relevant probability. This accounts for both possible states at the root of the tree as well as for mutation during the deepest coalescent interval, T 2 in (26). Then the independent-Poissons model could be applied to the remainder of the tree, i.e. to coalescent intervals T 3 through T n .
Because latent mutations of rare variants tend to be very recent, cf. (18) and (19), we may extend this logic to the first time in the past when there were r ancestral lines of the sample, for an arbitrary r ≥ 1. The probability that these are all of type 2 is given by the diffusion result (5) with sample size r. The probability of seeing n 1 copies of the rare variant is given by an appropriately adjusted independent-Poissons model, covering coalescent intervals T r+1 through T n . By summing (26) only over j ∈ {r + 1, . . . , n} it can be shown that the total length of branches with i descendants in this more recent part of the gene genealogy differs only by 2(1 − r)/n + O(1/n 2 ) from the full result ̅ τ i = 2/i. The product of these two probabilities is which can be compared to the leading order term in (22). As expected from (39), if r = 1 (40) reduces to (36) times π 2 . Now dividing (5) or (22) by (40) and letting n → ∞ gives as a measure of how well this augmented independent-Poissions model approximates the equilibrium diffusion result, depending on r and θ. Expanding (41) around θ = 0, because we do not in fact expect the per-site mutation parameter to be large, gives where the π in π 2 is the usual constant (not our π i ). The parenthetic term in (42) tends to zero quickly as r increases. It is equal to the trigamma function ψ (1) (r) for r ∈ {1, 2, 3, . . . }; see 6.4.2 and 6.4.3 in Abramowitz and Stegun (1964). Even just taking r = 2 instead of r = 1 cuts the error by about 60%. Similar conclusions may be drawn from the large-r expansion of (41), which gives 1 + (2 − π 1 )π 1 θ 2 /(2r) + O(1/r 2 ). Again θ 2 is the largest-order effect of mutation. The event that a pair of mutations occurs on the two lineages descending from the root of the coalescent tree is non-negligible in the constant-size population model, even as n → ∞ and even for the entire population, because ancient coalescence times tend to be long. But the chance of this event will be small for most eukaryote species as θ ranges from about 10 −4 to 10 −1 with typical values around 10 −2 (Leffler et al. 2012). Based on our estimates in the next section, even the fastestmutating sites in the human genome have θ ≈ 0.02. Note that this event will even less likely in growing populations, because in this case the deepest coalescence times will be relatively short, but could be an important phenomenon for populations which were much larger in the past.

Theoretical example and data application
Here we illustrate the theoretical and empirical use of (30) and (31). First we describe the consequences of recurrent mutation in an exponentially growing population compared to those in a population of constant size. Second we explore an entirely empirical application to human SNP data, which suggests that disparate site-frequency spectra may be explained by differences in mutation rate (and thus recurrent mutation).
Note that if estimates of the expected fraction of the gene genealogy comprised of branches with i descendants, that is are available, then p(n 1 |k 1 ; n large, τ) can be computed using (31). In addition, for any estimated or supposed values of the expected number of mutations on the gene genealogy, the joint distribution of the number of latent mutations, k 1 , and their total count, n 1 , is the product of (30) and (31).

An exponentially growing population
Consider the simple model of pure exponential growth which has been the subject of a number of studies (Slatkin and Hudson 1991;Griffiths and Tavaré 1998;Polanski and Kimmel 2003;Chen and Chen 2013;Polanski et al. 2017): a population which has reached its current (haploid) size N 0 by exponential growth at rate r per generation. On the coalescent time scale of N 0 generations, looking backward in time and setting β = N 0 r, gives the population size at time t in the past. This model is unrealistic because the past population size approaches zero, but it can be taken as a rough approximation for recent dramatic growth.
For instance, a population of current size N 0 = 5 × 10 7 with a generation time of 30 years and r = 0.0064, would have β = 3.2 × 10 5 . About 40,000 years ago, it would have had size 10 5 , and using equation (7) in Slatkin and Hudson (1991) (26) if the expected coalescent intervals E[T k ] are known. We use the large-n results of Chen and Chen (2013) for E[T k ] (our notation) to obtain a simple approximation for E[τ i ]. With the time scale and notation here, equation (11) in Chen and Chen (2013) gives as a large-n approximation for the cumulative expected time for the number of ancestral lineages of the sample to decrease from n to k. Writing (46) as a continuous function of x = k/n, we approximate the expected coalescent interval as Note that while (48) is a large-n approximation, it allows that β might be of the same order of magnitude as n. Applying the same approximation to the combinatorial coefficient in (26) gives Finally, we approximate the sum in (26) with the integral which can be evaluated efficiently either as (51), in terms of the hypergeometric function, or as the integral (50). Slatkin and Hudson (1991) and others have observed that gene genealogies under very fast exponential growth are close to star trees. Using either (50) or (51) we have as β/n increases. From the log (2β/n) term in (52), we confirm the star-tree prediction that under extreme growth essentially all variants will be singletons. These results for exponentially growing populations, derived here using a coalescent approach, are identical in form to some results for "Luria-Delbrück distributions," especially in application to cancer, derived using forward-time birth-death or branching processes ( Gunnarsson et al. 2021;Poon et al. 2021). In particular, (50) has the same form as the approximation in equation (4) of Ohtsuki and Innan (2017) and as equation (33) in Gunnarsson et al. (2021). Equation (52) has the same form as the expression in Theorem 2 in Durrett (2013) if only the leading-order term is kept in (52) in the case i = 1. Figure 3 shows the same quantities as Fig. 2 but for the pure exponential growth model with n = 10 5 and β/n = 3. The value β/n = 3 was chosen to roughly reproduce the ratio of singletons to doubletons observed for low-rate sites in the gnomAD data in section "Application to human SNP data." Figure 3a is directly comparable to Fig. 2a, the only difference being whether E[τ i ] = 2/i or comes from (51). As Fig. 3a shows, recent rapid growth produces a singlemutation (k 1 = 1, blue line) site-frequncy spectrum with an excess of rare variants and a deficit of common variants. So, compared to the constant-size case in Fig. 2a, there is a diminished tendency to observe high-frequency variants when the number of latent mutations is larger, and a stronger tendency for the site-frequency count (n 1 ) to be equal to or close to the number of latent mutations.
To make Fig. 3b comparable to Fig. 2b, we used (44) with n = 10 5 and E[τ i ] = 2/i to compute the corresponding expected numbers of mutations on the gene genealogy for the three values of θπ 1 in Fig. 2 (0.002, 0.02, 0.2). The resulting expected numbers of mutations were 0.024, 0.24 and 2.4, the last being about equal to the value for the highest-rate sites in the gnomAD data in section "Application to human SNP data." We then computed p(n 1 ; n large, τ) by averaging (31) over the distribution (30). Similar to Fig. 2b, the two smaller values of the mutation rate give nearly indistinguishable results for the total count n 1 . But there is a dramatic difference for the largest mutation rate. In Fig. 2b the prediction is distinctly L-shaped and thus similar to that for the lowest mutation rate, which again is 100-fold lower. In contrast, in Fig. 3b singletons have a much lower chance of being observed. In fact, doubletons are slightly more likely than singletons. This relative excess of doubletons is due to the fact when there are two latent mutations these are highly likely to produce two copies of the variant under growth (Fig. 3a) than under constant size (Fig. 2a).
It is also of interest to know how the number of latent mutations in the ancestry of a rare variant depends on its count. Figure 4 depicts this for a series of increasing counts n 1 , from 1 to 16. Figure 4a shows the results for constant size, Fig. 4b the corresponding results for pure exponential growth. The expected number of mutations on the gene genealogy is 2.4 in both cases. Regardless of the demography, if only one copy of the variant is observed, it must be due to one mutation. Otherwise, the results differ greatly for constant size versus growth. Under constant size, a variant observed multiple times in the sample can easily be due to a single mutation. Under growth, higher variant counts are more likely due to multiple mutations.

Application to human SNP data
We also used (30) and (31) to account for latent mutations in the ancestry of rare variants in a subset of the gnomAD data (Karczewski et al. 2020). We took the approach described in the Supplementary Materials of Seplyarskiy et al. (2021), specifically obtaining estimates of relative branch lengths (43) from the data at low-rate sites, then using our new analytical result (31) to average over mutation counts. Rather than categorizing variants by trinucleotide context as in Seplyarskiy et al. (2021), we analyzed data from gnomAD version v2.1.1, presorted into 109 bins based on estimates of mutation rate by the Roulette method of Seplyarskiy et al. (2022) which incorporates information from the six flanking bases on either side of a SNP, strand asymmetry, expression level, methylation and promoter status. We did not use this information but simply assumed that variants within a bin all have the same mutation rate.
The data consist of variant counts for synonymous mutations in the exomes of about 57 K non-Finnish Europeans. Thus n ∼ 114 K although this varied by about 2% among sites because we required that sites were successfully genotyped in a minimum of 112K chromosomes. Importantly for our application, the data include monomorphic sites, i.e. sites with variant count equal to zero. The gnomAD only provides n for polymorphic sites, so we imputed n for monomorphic sites using the nearest value at a polymorphic site within 100 bp on either side of the focal site. After filtering for sequencing quality and coverage as well as removing mutation rate bins with fewer than 100 observed mutations, there are a total of 12,338,176 sites in 97 bins and 834,486 of these are polymorphic. Figure 5a shows the total numbers of sites and the numbers of monomorphic sites in each bin. The great majority of sites are in bins 1 through roughly 20. These have low mutation rates, as indicated by their nearly equal numbers of total sites and monomorphic sites. The widening gap between the total number of sites and the number of monomorphic sites reflects the fact that higher-number bins have larger mutation rates.
For each bin, the data are the numbers of sites where a variant is observed in each possible count in the sample. As in "Latent mutations and sample counts of rare alleles," these are marginal with respect to other possible variants at the site. Sites with two (resp. three) rare variants appear twice (resp. three times) in the data, once for each rare variant. These will likely be in different bins given the fine substructure of mutation rate variation (Seplyarskiy et al. 2021(Seplyarskiy et al. , 2022. Although bins contain mixtures of different sequence contexts and different nucleotide substitutions, for our purposes sites within a bin are all of the same type because they all have the same mutation rate. Let S i be the number of sites in a given bin where i copies of the variant are observed in the sample. If a bin contains L total sites, then with reference to the notation in (2) we may write Thus we use a simplified notation here, with i in place of n 1 to avoid the additional subscript when we apply the results of the previous sections. In addition we use "mutrate" to refer to the estimate of the expected number of latent mutations per site for a given bin, i.e. (θπ 1 /2) i ̅ τ i for sites in that bin, as this is the rate parameter in the Poisson distribution (30). We used (30) and the proportion of monomorphic sites, S 0 /L, to estimate this "mutrate" for each bin, specifically as − log (S 0 /L). Figure 5b plots these estimates across bins, on a log scale. They range from 0.0097 for bin 1 to 2.23 for bin 97, with a mean of 0.083, taking the proportion of sites in each bin into account. Most sites have mutation rates on the lower side: bins 1 through 5 contain about 47% of all sites, bins 1 through 19 about 95%, and bins 60 through 97 contain only about 2% of sites. Overall, rates vary 230-fold from lowest to highest. Assuming that the average estimated mutrate of 0.083 corresponds to the genome average mutation rate per site, for which the usual estimate of θ from pairwise differences is about 1/1300 ∼ 0.00077, we can infer that the expected number of mutations between a pair of (haploid) genomes is about 9 × 10 −5 for the slowest sites and about 0.02 for the fastest sites.
We compared observed and expected site-frequency counts for each bin based on an empirical fit of our model. First, we used (30) with the estimated mutrate (θπ 1 /2) i ̅ τ i for each bin to compute probabilities of k ∈ {0, 1, . . . , 7} latent mutations. Then from (34) and the fact the polymorphisms at sites with very low mutation rates likely have just one latent mutation, we used the combined data for bins 1 through 5 to estimate ̅ τ i / i ̅ τ i directly as S i /(L − S 0 ) for i ∈ {1, . . . , 40}. Our estimates of the mutrate for bins 1-5 range from 0.0097 to 0.037 with an average of 0.021, which we note is somewhat less than the smallest mutation rate in Figures 2 and 3. We assumed that this ̅ τ i / i ̅ τ i estimated from bins 1-5 holds for all bins. Finally, we computed the expectations E[S i ], for i ∈ {0, . . . , 40} in each bin, multiplying the probabilities of counts obtained using (30) and (31) by the total number of sites in the bin, cf. (53).
The upper three panels of Fig. 6 show the observed and expected variant counts, S i for i ∈ {1, . . . , 40}, for bins 9, 50 and 92, chosen to represent a low-rate bin, a middle-rate bin and a highrate bin. Figure A2 in the Appendix gives the plots for all 97 bins. In making these plots, we grouped variant counts for which E[S i ] < 1. For bin 50 for example, this was true of variant counts i ∈ [12, 40] as depicted in Fig. 6B and in the 50th panel of Fig. A2. The mutrate values in these plots are again the estimates of the expected number of mutations per site on the gene genealogy, (θπ 1 /2) i ̅ τ i , for each bin.
The broad pattern from these plots is clear. For smaller mutation rates (e.g. Fig. 6a) the site-frequency spectrum is heavily weighted toward the rarest variants. For large mutation rates (e.g. Fig. 6c), that is when multiple latent mutations are likely, the site-frequency spectrum is shifted toward higher counts. Again from Fig. 5a, the data contain fewer sites with intermediate mutation rates. In this case (e.g. Fig. 6b), the site-frequency spectrum does show the expected intermediate pattern, but subject to considerable sampling error. Across the range of mutation rates, the empirical model, which uses low-rate sites to estimate relative branch lengths ̅ τ i / i ̅ τ i and assumes these hold for all sites, fits the data well.
As can be seen in Fig. 6a and the first 20 or so panels of Fig. A2, the empirical estimates of ̅ τ i / i ̅ τ i include fluctuations due to sampling error for higher-count variants. The combined data for the first five bins have S i ranging from 71 to 38 for i ∈ [30,40]. The presence of these fluctuations helps illustrate a subtler phenomenon, namely the smoothing which occurs at larger mutation rates (e.g. Fig. 6c). For reference, the combined data for the first five bins have S i in the thousands for the low-count variants. From these, the estimated chance that a latent mutation is a singleton is about 64%, followed by 13% for doubletons and 6% for tripletons. By comparison, the chance is less than 0.1% for each variant with count i ∈ [25, 40]. The predictions E[S i ] are (a) (b) Fig. 3. Plots of the same quantities shown in Fig. 2 but for a sample of size n = 10 5 under pure exponential growth with β/n = 3. a) Probability of observing n 1 copies of allele 1 in the sample given k 1 = 1, 2, 3, 4, 5 latent mutations. b) log 10 -probability of observing n 1 copies of allele 1 in the sample for three different mutation rates, corresponding to the values of θπ 1 : 0.002, 0.02 and 0.2 in Fig. 2, but here expressed in terms of expected numbers of mutations on the gene genealogy (44) smoothed for higher-count variants at larger mutation rates because they are mixtures. For example, two latent mutations will come in counts 1 and i − 1, 2 and i − 2, or 3 and i − 3 with approximate relative proportions 64:13:6. The lower three panels of Fig. 6 show estimates of the probability that a variant in count i ∈ {1, . . . , 40} descends from k ∈ {1, . . . , 7} latent mutations, computed using (32). All singletons descend from single mutations. Variants in larger counts can have multiple latent mutations, and the probabilities of these increase very quickly then settle down to stable values. This suggestion of a limiting distribution was also seen for exponential growth in Fig. 4b, only there depicted differently. For very large counts of the variant, the distribution of k − 1 is well approximated by a Poisson with mean equal to the expected number of mutations per site on the gene genealogy, (θπ 1 /2) i ̅ τ i . This shifted-Poisson result is known already for the constant-size case (Arratia et al. 2000;Yamato 2017). In "A remark on the total number of mutations for large n 1 " in the Appendix we argue that it should hold more generally. The accuracy of this shifted-Poisson result for the gnomAD data and i = 40 is shown by the black dots on the right axes of Figs. 6d-f.
For low-rate sites (e.g. Fig. 6d) there is a relatively small chance of multiple latent mutations. But the chance of two or more latent mutations is not negligible, owing to the very large sample size. Note that the mutrate for bin 9 is less than the genome average, which is 0.083 for this sample of n ∼ 114K. Thus in a very large sample even low-rate sites are affected by recurrent mutation. For the middle-rate sites (e.g. Fig. 6e) in the trough in Fig. 5a the chance of there being only one latent mutation is still considerable. However, for high-rate sites (e.g. Fig. 6f) it can be more likely that there are two or three mutations in the ancestry of a rare variant than the single unique mutation which is typically supposed. Finally, we explored the extent to which rare variants might be observed less frequently than would be expected if there were no recurrent mutation. Figure 7a shows the expected frequency of singletons, doubletons, etc., up to variants found in five copies in the sample, across the range of mutrates in the binned gnomAD data. The standard infinite-sites prediction is that the frequency will increase linearly with the mutation rate. Figure 7a is largely consistent with this but shows marked deviations when the mutrate becomes too large. The point at which the linear prediction fails depends on the count of the rare variant. Singletons are the first to deviate, which they do as soon as there is an appreciable chance of two or more mutations at a site. For rare variants in five copies, linearity holds even close the upper limit of mutation rates in the human genome. Figure 7b shows the extent to which the infinite-sites model over-predicts the frequency of singletons across the 97 bins. The infinite-sites prediction for a bin is its mutrate (θπ 1 /2) i ̅ τ i times the proportion of singleton branches ̅ τ 1 / i ̅ τ i = 0.64 estimated from the first five bins. The corresponding independent-Poissons predictions are the same as those for i = 1 in the 97 panels of Fig. A2. The infinite-sites model makes reasonable predictions for the twenty lowest-rate bins, which contain 96% of all sites and have mutation rates less than twice the genome average. But it predicts the impossible for the seven highest-rate bins: more singletons than there are sites to mutate. For bins 21 through 97, which contain 4% of all sites, the infinite-sites model predicts a total of 269,222 singletons compared to the 83,002 which are actually observed.
We emphasize that the results in Fig. 7 depend on the sample size. The expected number of mutations at a single site, (θπ 1 /2) i ̅ τ i , is proportional to the total length of the gene genealogy, which is an increasing function of the sample size. Already for the sample size n ∼ 114K considered here, singletons start to be affected by recurrent mutation at around the genome average mutation rate (Figs. 7 and 6d). For variants in any fixed count i there will be a sample size above which the infinite-sites, linear prediction starts to fail.

Discussion
In this work, we modeled the mutational ancestry of a rare variant in a large sample. Under the standard neutral model of population genetics with K-allele parent-independent mutation, we found that co-segregating rare variants may be treated independently and that the Ewens sampling formula gives the probabilistic structure of latent mutations in their ancestries. In particular, the number of latent mutations is distributed like the number of alleles in the Ewens sampling formula. We obtained more general results, for changing population size, by modeling latent mutations as independent Poisson random variates.
Our aim was to describe how the site-frequency spectra of rare variants in large samples are affected by recurrent mutation. The key parameters for a variant in count i are its expected total rate of mutation on the gene genealogy of the sample (here denoted (θπ 1 /2) i ̅ τ i and called "mutrate" in the previous section) and the expected relative lengths of branches in the gene genealogy which have i descendants in the sample (̅ τ i / i ̅ τ i ). Under the standard neutral model ̅ τ i = 2/i. We obtained new results for ̅ τ i under exponential population growth and used these to illustrate how recurrent mutation affects the site-frequency spectrum differently than under constant size. Lastly, we showed that our general results provide a good fit to synonymous variation among a large number of (non-Finnish European) individuals in the human Genome Aggregation Database (Karczewski et al. 2020), suggesting that, whatever the causes of deviations from ̅ τ i = 2/i might be for this sample, differences in mutation rate can explain differences in site-frequency spectra among sites.
Our application was empirical. We did not fit a demographic model, but following Seplyarskiy et al. (2021) used low-mutation-rate sites to estimate relative branch lengths and assumed these hold for all sites. Site-frequency spectra are a rich source of information about population-genetic phenomena but are of somewhat limited use in disentangling their effects (Myers et al. 2008;Bhaskar and Song 2014;Terhorst and Song 2015;Lapierre et al. 2017;Rosen et al. 2018). When low-mutation-rate sites are plentiful enough to provide stable estimates of relative branch lengths, this empirical method offers a way to control for myriad factors and isolate the effects of variation in mutation rate.
We began with a K-allele model with parent-independent mutation, and used its sampling probabilities in our computations for constant-size populations. We conjecture that our findings will hold for general mutation models because conditioning on a rare variant in a large sample means that the common allele will be the ancestral source of mutations with very high probability. Then the relevant mutation rate in any model will be the rate of the production of the rare allele from the common allele. We described our general results as being for populations which may have changed in size. This is appropriate for the general coalescent model (Griffiths and Tavaré 1998) which we assumed for some proofs in the Appendix. Strictly speaking, the general coalescent does not require a generative model for the times between coalescent events. Thus our results can be applied more broadly. The case of a fixed tree with arbitrary τ i considered in the Appendix is one example. The independent-Poissons model, with results (27) to (35), does not even require interpretation in terms of coalescence times. These results hold if we replace θπ 1 ̅ τ i /2 with an arbitrary rate parameter λ i for the production of mutants in count i. Rates of production of mutants have been obtained for under a range of demographies and some types of selection (Lange and Fan 1997;Dorman et al. 2004;Lambert 2011;Kaj and Mugal 2016;Torres et al. 2020;Müller et al. 2022). Applications to selection will likely require free recombination between sites. Desai and Plotkin (2008) applied the independent-Poissons model (for all variant counts in the sample) for example under a version of the Poisson Random Field model (Sawyer and Hartl 1992).

Data availability
The data application to low-frequency synonymous polymorphisms used allele frequencies from exome sequencing data compiled in gnomAD v2.1.1, available here: https://gnomad. broadinstitute.org/downloads and basepair-resolution mutation rates (Seplyarskiy et al. 2022), available here: http://genetics.bwh. harvard.edu/downloads/Vova/Roulette/. The mutation rate model specifies the rate for all three possible alternative nucleotides, and different nucleotide mutations were counted separately when generating the site-frequency spectra. The pipeline used to compile and annotate all potential synonymous mutations in the human genome is available at: https://github.com/vseplyarskiy/Roulette. The site-frequency spectra in different mutation rate bins is available at: https://doi.org/10.6084/m9.figshare.3426251.v1.
Let P n be the probability measure for this process starting at n = (n 1 , n 2 ), and define the random times to be the times at which the first coordinate of the process decreases to n 1 − i for 1 ≤ i ≤ n 1 , with T 0 = 0. We have 0 = T 0 < T 1 < T 2 < · · · < T n1 almost surely under P n , and the process (N 1 , N 2 ) visits the following points in order (n 1 , n 2 ) → (n 1 − 1, N 2 (T 1 )) → · · · → (0, N 2 (T n1 )).
In Theorem A1 we describe the joint distribution of the hitting times (T i ) n1 i=1 and the locations (N 2 (T i )) n1 i=1 as n 2 → ∞.
Theorem A1 As n 2 → ∞, the random vector in R 2n1 + converges in distribution under P n to the random vector are independent random variables with probability density functions for z ∈ (0, ∞).
Remark A1 (Mean of T n1 ). Note that Hence for n 1 ≥ 2, Theorem A1 implies that E n [T 1 ] is of order 1/n 2 and gives the second part of (18) in the main text. In contrast, when n 1 = 1, E[Z 1 ] = ∞ and E n [T n1 ] is no longer of order 1/n 2 .
These give (18) in the main text.

Proof of Theorem A1
To explain the key idea we first establish weak convergence of (n 2 T 1 , N 2 (T 1 ) n2 ), i.e. of the marginal distribution for i = 1 in (A3). By definition, T 1 is given by where ♯ is the number of downward jumps in second coordinate of the process starting at (n 1 , n 2 ) up to the first decrease in the first coordinate. The variables {ξ i } ♯−1 i=0 are the times between these downward jumps, with ξ ♯ being the time to the final jump starting at (n 1 , n 2 − ♯). This last jump is the one which decreases the first coordinate. Observe that N 2 (T 1 ) is either n 2 − ♯ or n 2 − ♯ + 1. Given ♯, N 2 (T 1 ) is equal to which correspond to a non-empty mutation event and a coalescent event of type 1 respectively. These follow from (A1).

Lemma A1.
As n 2 → ∞, we have convergence in distribution with Z 1 and Y 1 as defined in Theorem A1.
As n 2 → ∞, N 2 (T 1 ) → ∞ in the sense of (A13). Hence the same argument that leads to Lemma 1 can be applied again, starting at the new location (n 1 − 1, N 2 (T 1 )). More precisely, by computing moment generating functions as before, and applying the strong Markov property of the random walk {(N 1 (t), N 2 (t))} t∈R+ at the stopping time T 1 , we obtain the joint convergence under P n as n 2 → ∞, where {Z 1 , Z 2 , Y 1 , Y 2 } are independent variables defined in Theorem A1. This implies the convergence in distribution n 2 T 1 , n 2 (T 2 − T 1 ) ; N 2 (T 1 ) n 2 , under P n , as n 2 → ∞. Continuing this way, by letting ♯ i be the number of downward jumps starting at (n 1 − i + 1, N 2 (T i−1 )) before hitting the vertical line {(n 1 − i, y) : y ∈ Z + } for i ≥ 1, we obtain the desired convergence in Theorem A1. □

Low-count branches of general coalescent trees
Here we prove the non-nestedness and Poisson-independence of low-count mutations, which we assumed in section "Theory for nonconstant populations." We do this first for fixed trees then for the random, general coalescent trees of Griffiths and Tavaré (1998). We also present the computation of the probability generating function, G n,k (x, y), of the count of the variant of interest and its number of latent mutations. Our definition of nested differs from some previous ones (Saunders et al. 1984;Wiuf and Donnelly 1999;Hobolth and Wiuf 2009); here nested mutations may occur on the same branch of the gene genealogy.

Nested mutation on a fixed tree
Let T n be a fixed (non random) tree with n leaves. We suppose the tree is ultrametric, that is the leaves have the same distance H n from the root. We call H n the height of T n . Consistent with the main text, we adopt the following notation for some relevant properties of T n , for the most part suppressing the dependence on n for simplicity: 1) T k is the length of the time during which there are exactly k lineages ancestral to the sample, for k ∈ {2, 3, . . . , n}. 2) τ j for j ∈ {1, . . . , n − 1}, is the total length of branches in T n that have j descendants. We suppose there are m j such branches with lengths {τ j,k } m j k=1 . Then τ j = m j k=1 τ j,k . 3) T total is the total branch length, the sum of all the branches in T n , which is equal to n k=2 k T k = n−1 j=1 τ j . 4) For a positive integer b, we define a collection {Γ of disjoint connected subtrees of the coalescent tree as follows: Each of the m b branches with b descendants in the sample (say the ith one) subtends b leaves in the coalescent tree and gives rise to a subtree Γ  We assume that mutations arise as a Poisson point process on the tree with constant rate θ/2 per unit length. Theorem A2 below holds for any fixed ultrametric tree (it can be binary or have multiple mergers, or even be a star tree).
Theorem A2 (Nested mutation on fixed trees). Let T n be a fixed ultrametric tree with n leaves. For any positive integer b and for any θ ∈ (0, ∞), the probability that nested mutation up to count b occurs is bounded above by In particular, the probability that nested mutation up to count b occurs tends to 0, as n → ∞, if θ 2 ( max 1≤k≤m j τ j,k )τ j → 0 for 1 ≤ j ≤ b.
Remark A3. There is good evidence that the upper bound θ 2 8 T total H n is actually small for humans. For the gnomAD data we analyze in the main text, the expected number of mutations per site (θT total /2) is between about 0.009 and 2.13. So θT total /2 is not big with high probability. The rest of the upper bound, bθH n /4, should be proportional to the average pairwise difference per site (very nearly equal to this for random Kingman coalescent trees and large n) and this ranges from about 9 × 10 −5 to about 0.02 for these same data. See section "Application to human SNP data." Remark A4. The simpler bound θ 2 8 bT total H n can be weaker than the other bound θ 2 8 b 3 b j=1 m j k=1 τ 2 j,k in (A14) for large n. For the Kingman coalescent, E[T total H n ] = O( log n) is larger than E[ b j=1 m j k=1 τ 2 j,k ] since the latter tends to 0 as n → ∞, by (A17). For a star tree, however, both bounds are approximately θ 2 nH 2 n (up to a multiplicative constant).
Proof. The total number M n of mutations on T n is a Poisson variable with mean c n : = θ 2 T total . Given the tree T n and M n = k, the k mutations are uniformly distributed on the tree. Hence the conditional probability that two given mutations are on the same subtree Γ (b) i for some i is equal to i | is the total branch lengths of the subtree Γ Note that |Γ i }. The subtree on the left has one mutation which is labeled 1 and has count four. The subtree on the right has nested mutations, with the mutation labeled 1 in count two and another labeled 2 also in count two. where σ jj is defined in Fu (1995, eqns. (1)- (2)). This follows from the fact that Fu's β n (j) ≈ 2 log n n as n → ∞ for each j ≥ 1 (Fu 1995, eqn. (5) p(k, j; k ′ , j) (k − 1)(k ′ − 1) + C * k<k ′ p(k, j; k ′ , j) − p(k, j)p(k ′ , j) (k − 1)(k ′ − 1) .
By Fu (1995, eqns. (29) and (22)), the first and second terms of (A28) are of order o(n) and O( log n n ), respectively, as n → ∞ for each j ≥ 1. The completes the proof of lim n→∞ Var(τ j ) = 0. The latter implies, by Chebyshev's inequality, that τ j − E[τ j ] → 0 in L 2 as n → ∞. □ Theorem A5 (Poisson approximation for counts across loci). Let {T n } n≥2 be a sequence of random coalescent trees which are the generalized coalescent trees of Griffiths and Tavaré (1998). Suppose sup t≥0 λ(t) < ∞ and assumption (A26) holds. Let a j be the number of mutations on T n with counts j. Then for any positive integer b and any θ ∈ (0, ∞), the variables {a j } b j=1 are asymptotically independent and a j ∼ Poisson( θ 2 E n [τ j ]) for 1 ≤ j ≤ b, as n → ∞.
Proof. By Theorem A4, the probability that nested mutation up to count b occurs tends to 0 as n → ∞. The result then follows from Lemma A4 and Theorem A3. □ It can be checked that exponentially growing popolations clearly satisfy sup t≥0 λ(t) < ∞ and also assumption (A26). The conclusions of Theorems A4 and A5 then hold for the generalized coalescent trees of Griffiths and Tavaré (1998) when λ(t) = e βt for t ∈ R + for some β > 0.
Equipped with Theorem A5, we write ̅ τ i = E n [τ i ] as in the main text and compute the probability generating function G n,k of the count of the variant of interest and its number of latent mutations. The count of the variant of interest is n = i ia i and its number of latent mutations is k = i a i . Hence as declared in the main text.
A remark on the total number of mutations for large n 1 The stable probabilities observed in the lower three panels of Fig. 6 and in Fig. 4b suggest that the conditional distribution of k 1 given n 1 and a very large n will approach a distribution as n 1 gets larger. That this is the case under constant population size follows from the fact, here as in "A conditional ancestral process for rare variants," that the number of alleles in the Ewens sampling formula is the sum of independent Bernoulli trials (Arratia et al. , 2000. The limiting large-n 1 distribution is Poisson, but shifted because there must be at least one mutation to produce n 1 > 0 copies, so it is k 1 − 1 that is Poisson. See Proposition 3.1 of Yamato (2017). By the following heuristic argument, we suggest that this result holds more broadly, in particular for growing populations or ones in which ̅ τ i decreases at least as fast with i as in the constant-size model, whatever the reason. In this case, when a mutation occurs it will very likely produce a low-count variant because ̅ τ i / i ̅ τ i for small i will be much greater than ̅ τ n1 / i ̅ τ i for large n 1 . A large-n 1 variant which is due for example to k 1 = 2 latent mutations will very likely have a count pattern such as (a 1 = 1, a n1−1 = 1) or (a 2 = 1, a n1−2 = 1) and very unlikely to have one such as (a n1/2−j = 1, a n1/2+j = 1) for some small j.
Then we expect the probability (31) of seeing n 1 copies given k 1 latent mutations to be close to p(n 1 |k 1 ; n large, τ) ≈ k 1 ̅ τ n1 n−1 i=1 ̅ τ i when n 1 is large, because each of the k 1 mutations has a small chance ̅ τ n1 / i ̅ τ i of producing an appropriately large number of copies, the other k 1 − 1 mutations being inconsequential to the total count. Multiplying by the Poisson distribution of k 1 in (30) and rearranging gives p(n 1 , k 1 ; n large, τ) ≈ k 1 ̅ τ n1 for large n 1 . To the left of the · is a probability like (7). To the right of the · is the shifted Poisson distribution, which implicitly averages over the (small) sizes of the k 1 − 1 additional mutations.