A Faster and More Accurate Algorithm for Calculating Population Genetics Statistics Requiring Sums of Stirling Numbers of the First Kind

Ewen’s sampling formula is a foundational theoretical result that connects probability and number theory with molecular genetics and molecular evolution; it was the analytical result required for testing the neutral theory of evolution, and has since been directly or indirectly utilized in a number of population genetics statistics. Ewen’s sampling formula, in turn, is deeply connected to Stirling numbers of the first kind. Here, we explore the cumulative distribution function of these Stirling numbers, which enables a single direct estimate of the sum, using representations in terms of the incomplete beta function. This estimator enables an improved method for calculating an asymptotic estimate for one useful statistic, Fu’s Fs. By reducing the calculation from a sum of terms involving Stirling numbers to a single estimate, we simultaneously improve accuracy and dramatically increase speed.

Ewens's sampling formula not only was a seminal result for population genetics, but also established connections with combinatorial stochastic processes, algebra, and number theory (Crane 2016). For population genetics, in particular, Ewens's sampling formula provided a key analytical result that finally enabled mathematical tests of the neutral theory of evolution (Crane 2016;Nielsen 2001). It has given rise to several classical population genetics tests for neutrality, including the Ewens-Watterson test, Slatkin's exact test, Strobeck's S, and Fu's F s (Fu 1997;Strobeck 1987). Calculation of subsets of this distribution are useful for testing deviations of observed data from a null model; such subsets often require the calculation of Stirling numbers of the first kind (hereafter referred to simply as Stirling numbers). In particular, Fu's F s has recently been shown to be potentially useful for detecting genetic loci under selection during population expansions (such as an infectious outbreak) both in theory and in practice (Wu et al. 2016). However, Stirling numbers rapidly grow large and thus explicit calculation can easily overwhelm the standard floating point range of modern computers.
In previous work, an asymptotic estimator for individual Stirling numbers was used to solve the problem of computing Fu's F s for large datasets, which are now becoming common due to rapid progress in DNA sequencing technology (Chen 2019). Without such improved numerical methods, Fu's F s calculations for data sets as small as 170 sequences can cause overflow, preventing the use of these statistics for genome-wide screens of selection. This algorithm based on estimating individual Stirling numbers solved problems of numerical overflow and underflow, maintained good accuracy, and substantially increased speed compared with other existing software packages (Chen 2019). However, there was still a need to estimate multiple Stirling numbers (up to half the total number of sequences). Here, we explore the potential for further increasing both accuracy and speed in calculating Fu's F s by using a single estimator for the entire sum, which involves multiple Stirling numbers.

General definitions and theory
We take a population of n individuals, each of which carries a particular DNA sequence D i (referred to as the allele of individual i). We define a metric, distðD i ; D j Þ to be the number of positions at which sequence D i differs from D j . Then, we denote the average pairwise nucleotide difference as u p (hereafter referred to simply as u), defined as: We also define a set of unique alleles Du i 2 fD i g which have the property of ði 6 ¼ jÞ ⇒ðdistðDu i ; Du j Þ . 0Þ. The ordinality of fDu i g is denoted m, i.e., the number of distinct alleles in the data set. Building upon on Ewens's sampling formula (Fu 1997;Ewens 1972), it has been shown that the probability that, for given n and u, at least m alleles would be found, is S9 n;m ðuÞ ¼ 1 ðuÞ n X n k¼m ð21Þ n2k S ðkÞ n u k ; u . 0; where ðuÞ n is the Pochhammer symbol, defined by S ðkÞ n is a Stirling number and is defined by: Fu's F s is then defined as: Fu's F s thus measures the probability of finding a more extreme (equal or higher) number of alleles than actually observed. It requires computing a sum of terms containing Stirling numbers, which rapidly become large and therefore impractical to calculate explicitly even with modern computers (Chen 2019). Because of the relation in (4), the statistics quantity S9 n;m ðuÞ satisfies 0 # S9 n;m ðuÞ # 1. Also, this relation and (3) show that  There is a recurrence relation n 2 nS ðkÞ n ; which easily follows from (4). For a concise overview of properties, with a summary of the uniform approximations, see (Gil et al. 2007, x11.3).
We introduce a complementary relation T9 n;m ðuÞ ¼ 1 2 S9 n;m ðuÞ ¼ 1 ðuÞ n X m21 k¼0 ð21Þ n2k S ðkÞ n u k ; leading to an alternate calculation for Fu's F s of F s ¼ ln S9 n;m ðuÞ 1 2 S9 n;m ðuÞ ¼ ln 1 2 T9 n;m ðuÞ T9 n;m ðuÞ : The recent algorithm considered in (Chen 2019) is based on asymptotic estimates of S ðmÞ n derived in (Temme 1993), which are valid for large values of n, with unrestricted values of m 2 ð0; nÞ. It avoids the use of the recursion relation given in (7).
In the present paper we derive an integral representation of S9 n;m ðuÞ and of the complementary function T9 n;m ðuÞ, for which we can use the same asymptotic approach as for the Stirling numbers without calculating the Stirling numbers themselves. From the integral representation we also obtain a representation in which the incomplete beta function occurs as the main approximant. In this way we have a convenient representation, which is available as well for many classical cumulative distribution functions. We show numerical tests based on a first-order asymptotic approximation, which includes the incomplete beta function. In a future paper we give more details on the complete asymptotic expansion of S9 n;m ðuÞ, and, in addition, we will consider an inversion problem for large n and m: to find u either from the equation S9 n;m ðuÞ ¼ s, when s 2 ð0; 1Þ is given, or from the equation F s ¼ f , when f 2 ℝ is given.
Remarks on computing S9 n;m ðuÞ When computing the quantity F s defined in (5), numerical instability may happen when S9 n;m ðuÞ is close to 1. In that case, the computation of 1 2 S9 suffers from cancellation of digits. For example, take n ¼ 100, u ¼ 39:37, m ¼ 31. Then S9 n;m ðuÞ ≐ 0:99872, and F s becomes about 6.6561 when using the first relation in (9). However, when we calculate T9 n;m ðuÞ ¼ 0:002689 and use the second relation, then we obtain the more reliable result F s ≐ 5:9160.
We conclude that, when S9 n;m ðuÞ $ 0:5, it is better to switch and obtain T9 n;m ðuÞ from the sum in (8) and F s using the second relation in (9). A simple criterion to decide about this can be based on using the saddle point z 0 (see Remark 1 below).
A second point is numerical overflow when n is large, because S ðmÞ n rapidly becomes large when m is small with respect to n. For example, Therefore, it is convenient to scale the Stirling number in the form S ðkÞ n =n!. In addition, the Pochhammer term ðuÞ n in front of the sum in (2) will also be large with n; we have ð1Þ n ¼ n!.
We can write the sum in (2) Leading to a corresponding modification in the recurrence relation in (7) for the scaled Stirling numbers: To control overflow, we can consider the ratio This function satisfies f n ðuÞ # 1 if u $ 1. For small values of n we can use recursion in the form For large values of n and all u . 0, we can use a representation based on asymptotic forms of the gamma function. It should be observed that using the recursions in (7) and (12) is a rather tedious process when n is large. For example, when we use it to obtain S  Stirling number by using the asymptotic approximation derived in (Temme 1993).

Data availability
Code implementing the new estimator for Fu's F s in R is available at https://github.com/swainechen/hfufs.

Analytical results
The new algorithm is based on the following results, which we describe in two theorems.
Theorem 1. The statistics quantity S9 nþ1;mþ1 ðuÞ introduced in (2) has the representation as an integral in the complex z-plane where n and m are positive integers, 0 # m # n, u is a real positive number, and C R is a circle at the origin with radius R . u. The symbol ðaÞ n denotes the Pochhammer symbol introduced in (3).
Observe that we have raised in S n;m 9ðuÞ the parameters n and m with unity; this is convenient in the mathematical analysis. The proof of this theorem will be given in the Appendix (Proof of Theorem 1).
Corollary 2. The complementary quantity T9 nþ1;mþ1 ðuÞ has the representation T9 nþ1;mþ1 ðuÞ ¼ I 12x ðn 2 m þ 1; mÞ 2 R9 nþ1;mþ1 ðuÞ; This follows from Theorem 1 and the complementary relation of the incomplete beta function I x ðp; qÞ ¼ 1 2 I 12x ðq; pÞ: Note also that the incomplete beta function in (17) has the representation (see (Paris, 2010, x8.17 and from the complementary relation in (21) it follows that the function in (20) has the expansion I 1 1þt ðn 2 m þ 1; mÞ ¼ ð1 þ tÞ 2n X m21 j¼0 n j t j : The representation in this theorem in terms of the probability function I x ðp; qÞ shows the characteristic role of S9 n;m ðuÞ as a cumulative distribution function of the Stirling numbers. The representation can also be viewed as an asymptotic representation in which the incomplete beta function is the main approximant.
The proof of Theorem 2 can be found in the Appendix (Proof of Theorem 2), but we give here some preliminary information about functions used in the proof to explain the definition of the parameter t in (17). It is a function of u and arises in certain transformations of the integral given in Theorem 1. For this we need the function and its derivative With the function fðzÞ we can write (15) in the form S9 nþ1;mþ1 ðuÞ ¼ e 2fðuÞ 2pi Z C R e fðzÞ dz z 2 u ; R . u: n■ Then the saddle point of the integral in (26) follows from the equation There is a positive saddle point z 0 when 0 , m , n.
Next to these functions we introduce a function for complex values of a variable t: where t 0 ¼ m n 2 m . These functions are related by fðzÞ 2 fðz 0 Þ ¼ xðtÞ 2 xðt 0 Þ; with condition signðz 2 z 0 Þ ¼ signðt 2 t 0 Þ. In this way, using this relation as a transformation of the variable z to t, we can write (26) as The parameter t in Theorem 2 is defined as the positive solution of the equation fðuÞ 2 fðz 0 Þ ¼ xðtÞ 2 xðt 0 Þ; signðu 2 z 0 Þ ¼ signðt 2 t 0 Þ: (31) In Figure 1 we show the graphs of fðzÞ 2 fðz 0 Þ ( Figure 1A) and xðtÞ 2 xðt 0 Þ ( Figure 1B) for n ¼ 100, m ¼ 38. For these values the saddle points are z 0 ≐ 22:81 and t 0 ¼ 19 31 ≐ 0:61. The sign condition signðz 2 z 0 Þ ¼ signðt 2 t 0 Þ for the relation in (29) means that the left branches of the convex curves correspond with functions values for z 2 ð0; z 0 and t 2 ð0; t 0 , and the right branches with values for z 2 ½z 0 ; NÞ and t 2 ½t 0 ; NÞ. Clearly, we have a one-to-one relation between the positive z and t-variables.
A first-order approximation of the function R9 nþ1;mþ1 ðuÞ in (17) and (20) where and the function f ðtÞ is defined in (30). The value f ðt 0 Þ follows from evaluating dz=dt (see (30)) at t 0 , by observing that both functions f 0 ðzÞ and x 0 ðtÞ vanish when t/t 0 (hence, z/z 0 ). Then, l'Hôpital's rule can be used to obtain f ðt 0 Þ. In Figure 2 we show the error curves dðF s ;F s Þ in (34) for Fu's F s (9) for u 2 ½10; 400. We show examples for n ¼ 100, m ¼ 75 ( Figure  2A) and n ¼ 500, m ¼ 275 ( Figure 2B). The solid curves are for F s when using S9 nþ1;mþ1 ðuÞ I t=ð1þtÞ ðm; n 2 m þ 1Þ, the dashed curves when using S9 nþ1;mþ1 ðuÞ I t=ð1þtÞ ðm; n 2 m þ 1Þ þ R9 nþ1;mþ1 ðuÞ with the asymptotic estimate given in (32). For ease of visualization, the error dðF s ;F s Þ has been multiplied by a factor 10 or 100 in Figure 2. We have used the following mollified error function whereF s is the approximation of F s . This mollified error is exactly the relative error unless jF s j is small. Because F s will vanish when S9 nþ1;mþ1 ðuÞ ¼ 1 2 (which also means that u is near the transition value z 0 ≐ 137:98 (in Figure 2A) and z 0 ≐ 251:58 (in Figure 2B) (see Remark 1)), we cannot use relative error for all u . 0. This explains the non-smooth curves in Figure 2.
The final estimator is based on the representations in (17) and (20) and the first order approximation in (32), which are used to calculate Fu's F s with one of the two relations in (9) depending on whether S9 n;m ðuÞ $ 0:5, decided as described above.

Implementation and numerical results
We first summarize the steps to compute Fu's F s by using (9) and the first-order approximations (see (32) and (17) or (20)) Figure 3 (A) Comparison of relative error of the estimator from (Chen 2019) and the single term asymptotic estimator in (35). Relative error for each is calculated against the arbitrary precision implementation described in (Chen 2019). In total, 10,000 calculations were performed with n randomly sampled from a uniform distribution between 50 and 500; m between 2 and n; and u between 1 and 50. A solid diagonal line is drawn at y ¼ x.
with fðzÞ defined in (24) and xðtÞ defined in (28). When u ¼ z 0 there is one solution t ¼ t 0 . When t 6 ¼ t 0 there are two positive solutions, and we take the one that satisfies the condition signðu 2 z 0 Þ ¼ signðt 2 t 0 Þ. 3 When u , z 0 , hence t , t 0 , compute the approximation of S9 nþ1;mþ1 ðuÞ by using (35), and F s from the first relation in (9). 4 When u . z 0 , hence, t . t 0 , compute the approximation of T9 nþ1;mþ1 ðuÞ by using (36), and F s from the second relation in (9).
In Table 1, we show the relative errors in the computation of F s defined in (5). The values of n, m, and u correspond with those in Table 1 of (Chen 2019). The asymptotic result is from (35). Computations were done with Maple, with Digits = 16. The "exact" values were obtained by using Maple's code for Stirling1ðn; mÞ, which computes the Stirling numbers of the first kind.
We additionally performed a comparison with the recently published algorithm in (Chen 2019). We performed 10,000 calculations with each algorithm and compared the results with an exact calculator. As expected, since the previous algorithm required estimating a Stirling number for each term of the sum, while the current asymptotic estimate directly calculates the sum, both error and compute speed were improved. Relative error for the single term estimate in (35) was well controlled at , 0:001 for nearly 99% of the calculations; for 411 calculations where the previous hybrid estimator had an error . 0:001, the estimate in (35) was more accurate in all but one case (n ¼ 157; m ¼ 4; u ¼ 43:59732; 3.08e-3 relative accuracy using (Chen 2019); 3.32e-3 relative accuracy using (35)) ( Figure 3). Further analysis of the relative error demonstrated that it peaks at intermediate values of m=n, depending on u. These correspond to parameter choices near the transition values m ¼ m 0 , where t approaches t 0 and z approaches z 0 in the calculation; notably, they remain well controlled (all values , 0:001 mollified error) regardless of u. The asymptotic behavior (lower relative error) can also be seen as both n and m increase in Figure 3B.
The fewer calculations led to a clear improvement in calculation speed (median 54.6x faster; Figure 4). The speedup also depends on the parameter choices; in general, the speed advantage is greater when the hybrid calculator requires many calculations (namely, when m is small relative to n, as the hybrid calculator performs the sum in (2)) ( Figure 4B).

CONCLUSION
The rapid growth of sequencing data has been an enormous boon to population genetics and the study of evolution. Traditional population genetics statistics are still in common use today. The statistics Fu's F s and Strobeck's S have been difficult to calculate on modern, large data sets using previous methods; we now further improve both accuracy and speed for the calculation of Fu's F s for such data sets, using the main estimator in (35). Our plan for a paper about the ability to invert the calculation provides additional future directions in understanding the performance of these statistics. Therefore, the methods used herein may be useful for the development of new statistics that more effectively capture different types of selection.

ACKNOWLEDGMENTS
The authors thank the reviewers for their helpful comments and suggestions. SLC acknowledges Shyam Prabhakar and members of the Chen lab for fruitful discussions. NMT acknowledges CWI, Amsterdam, for scientific support; he thanks Edgardo Cheb-Terrab (Maplesoft) for his help with the Maple code. SLC was supported by the National Medical Research Council, Ministry of Health, Figure 4 (A) Comparison of run times between the hybrid algorithm from (Chen 2019) and the single term asymptotic estimator in (35). 100 iterations were run, each with 10,000 calculations; the time elapsed for each set of 10,000 calculations was recorded and plotted here. The same set of parameters were used for each algorithm. The order of running the algorithms was alternated with each iteration. The dark horizontal line indicates the median, the box indicates the first and third quartiles, the whiskers are drawn at 1.5x the interquartile range, and outliers are represented by open circles. The median for the hybrid algorithm is 62.64 s; the median for the asymptotic algorithm is 1.17 s. (B) Detailed benchmarking for n ¼ 100 (open violins) or 500 (gray violins), m 2 ð0:1n; 0:2n; . . . ; 0:9nÞ, and u 2 ð10; 500Þ. Fold speedup (ratio of the time taken for the hybrid calculator to that taken for the aysmptotic estimator) is plotted on the y-axis. Each dot represents one set of parameters; the violin plots summarize the density of points on the y-axis. Times were calculated for 100 iterations of each estimator for the same parameter values.