A scaling law of multilevel evolution: how the balance between within- and among-collective evolution is determined

Abstract Numerous living systems are hierarchically organized, whereby replicating components are grouped into reproducing collectives—e.g., organelles are grouped into cells, and cells are grouped into multicellular organisms. In such systems, evolution can operate at two levels: evolution among collectives, which tends to promote selfless cooperation among components within collectives (called altruism), and evolution within collectives, which tends to promote cheating among components within collectives. The balance between within- and among-collective evolution thus exerts profound impacts on the fitness of these systems. Here, we investigate how this balance depends on the size of a collective (denoted by N) and the mutation rate of components (m) through mathematical analyses and computer simulations of multiple population genetics models. We first confirm a previous result that increasing N or m accelerates within-collective evolution relative to among-collective evolution, thus promoting the evolution of cheating. Moreover, we show that when within- and among-collective evolution exactly balance each other out, the following scaling relation generally holds: Nmα is a constant, where scaling exponent α depends on multiple parameters, such as the strength of selection and whether altruism is a binary or quantitative trait. This relation indicates that although N and m have quantitatively distinct impacts on the balance between within- and among-collective evolution, their impacts become identical if m is scaled with a proper exponent. Our results thus provide a novel insight into conditions under which cheating or altruism evolves in hierarchically organized replicating systems.


Introduction
A fundamental feature of living systems is hierarchical organization, in which replicating components are grouped into reproducing collectives [1]. For example, replicating molecules are grouped into protocells [2], organelles such as mitochondria are grouped into cells [3], cells are grouped into multicellular organisms [4], and multicellular organisms are grouped into eusocial colonies [5,6].
Such hierarchical organization hinges on altruism among replicating components [7], the selfless action that increases collective-level fitness at the cost of self-replication of individual components [8]. For example, molecules in a protocell catalyze chemical reactions to facilitate the growth of the protocell at the cost of self-replication of the molecules, a cost that arises from a trade-off between serving as catalysts and serving as templates [9,10]. Likewise, cells in a multicellular organism perform somatic functions beneficial to the whole organism, such as defence and locomotion, at the cost of cell proliferation due to different trade-offs [4,11,12].
Altruism, however, entails the risk of invasion by cheaters-selfish components that avoid altruism and instead replicate themselves to the detriment of a collective. For example, parasitic templates replicate to the detriment of a protocell [13,14], selfish organelles multiply to the detriment of a cell [3], and cancer cells proliferate to the detriment of a multicellular organism [15,16]. Since cheaters replicate faster than altruists within a collective, they can out-compete the altruists, causing the decline of collective-level fitness-within-collective evolution, for short. However, collectives containing many altruists can reproduce faster than those containing many cheaters, so that altruists can be selected through competition among collectives-among-collective evolution. Evolution thus operates at multiple levels of the biological hierarchy in conflicting directions-conflicting multilevel evolution. Whether withinor among-collective evolution predominates exerts profound impacts on the stability and evolution of hierarchically-organised replicating systems , which abound in nature [1][2][3][4][5][6].
Therefore, how the balance between within-and among-collective evolution is determined is an important question in biology.
Previously, we have demonstrated that the balance between within-and among-collective evolution involves a simple scaling relation between parameters of population dynamics [39][40][41]. These parameters are the mutation rate of components (denoted by m) and the number of replicating components per collective (denoted by N )-in general, N represents the 'size' of a collective, such as the number of replicating molecules per protocell, organelles per cell, cells per multicellular organism, and organisms per colony. As m or N increases, withincollective evolution accelerates relative to among-collective evolution (i.e., promoting the evolution of cheating), and m and N display the following scaling relation when within-and among-collective evolution exactly balance each other out (i.e., no bias towards the evolution of cheating or altruism): N m α is a constant (i.e., N 9 m´α), where scaling exponent α is approximately one half [39][40][41]. This scaling relation indicates that although m and N have quantitatively different impacts on the balance between within-and among-collective evolution, their impacts are identical if m is scaled with exponent α (e.g., doubling N and quartering m approximately cancel each other out, keeping the balance of multilevel evolution).
While the above scaling relation provides a novel insight into how the balance between within-and among-collective evolution is determined, the generality of this relation is unknown because the relation has originally been demonstrated in specific models of protocells through computer simulations [39][40][41]. To shed light on the generality of the scaling relation, here we adapt a standard model of population genetics, the Wright-Fisher model [42], to investigate the balance between within-and among-collective evolution. Combining computer simulations and mathematical analyses, we establish the following generalized scaling relation under the assumption that selection strengths are stationary in time: N 9 m´α, where α decreases to zero as selection strength s decreases to zero. To examine further the generality of the scaling relation, we analyse another simple model of multilevel evolution, which approaches the model studied by Kimura [22,23] as s Ñ 0. Interestingly, our results show that this model displays a distinct scaling relation: N 9 m´α, where α increases to one as s decreases to zero. We show that this difference stems from the fact that our first model considers a quantitative trait, whereas our second model and Kimura's consider a binary trait. Taken together, our results suggest that the existence of scaling relation N 9 m´α is a general feature of conflicting multilevel evolution, but scaling exponent α depends on multiple factors in a non-trivial manner.

Model
Our model is an extension of the Wright-Fisher model to incorporate conflicting multilevel evolution [42]. The model consists of a population of M replicators grouped into collectives, each consisting of at most N replicators ( Fig. 1 and Table 1). The number of replicators in a collective can increase or decrease, and if this number exceeds N , the collective randomly 4 divides into two.
Replicator j in collective i is assigned a heritable quantitative trait (denoted by k ij ) representing the degree of altruism it performs within collective i (e.g., k ij represents the amount of chemical catalysis a replicating molecule provides in a protocell or the amount of somatic work a cell performs in a multicellular organism). Replicators are assumed to face a trade-off between performing altruism and undergoing self-replication. Thus, the fitness of individual replicators (denoted by w ij ) decreases with individual trait k ij , whereas the collective-level fitness of replicators xw ij y increases with collective-level trait xk ij y, where xx ij y is x ij averaged over replicators in collective i (i.e., x ij is averaged over the index marked with a tilde; see also Table 1). For simplicity, we assume that the strengths of selection within and among collectives, defined as respectively, depend only very weakly on k ij and xk ij y (i.e., Bs w {Bk ij « 0 and Bs a {Bxk ij y « 0).
This assumption implies that the relative fitness of replicators and collectives is translationally invariant with respect to k ij and xk ij y, respectively-i.e., pw ij`∆ w ij q{w ij « 1`s w ∆k ij , and pxw ij y`∆xw ij yq{xw ij y « 1`s a ∆xk ij y. Owing to this assumption, our model informs only about short-term evolution. For computer simulations, we used the following fitness function: where s w and s a are constant so that Eq. (2) satisfies the above assumption. This particular form of fitness function, however, does not affect our main conclusion, as will be seen from the fact that the mathematical analysis presented below is independent of it.
The state of the model is updated in discrete time (Fig. 1). In each generation, M replicators are sampled with replacement from replicators of the previous generation with probabilities proportional to w ij , as in the Wright-Fisher process [42].
During the above sampling, a replicator inherits group identity i and trait k ij from its parental replicator with potential mutation (no migration among collectives is allowed). More precisely, the k ij value of a replicator is set to k ijp` , where k ijp is the trait of the parental replicator, and takes a value of zero with a probability of 1´m or a value sampled from a Gaussian distribution with mean zero and variance σ (determining mutation step size) with a probability of m (representing a genetic or epigenetic mutation rate). The assumption that the mean of is zero is made on the following premise: evolution is mainly driven by selection or random genetic drift, and the direction of evolution is not directly determined by mutation.
Although this premise often approximates reality, it can be wrong if a mutation rate is so high as to dictate the direction of evolution as in the error catastrophe [43], a situation that is ignored in this study. The assumption that σ is independent of k ij is made for simplicity and implies that our model informs only about short-term evolution. Although we could reduce the number of parameters by aggregating m and σ into mσ (which is the variance of ), we keep them separate so that the mutation rate as usually defined is discernible.
After the above sampling, collectives containing more than N replicators are randomly divided into two, and those with no replicators removed (Fig. 1 c w within-collective 3rd central moment of k ij : ave˜irxpk ij´x k ij yq 3 ys c a among-collective 3rd central moment of k ij : ave˜irpxk ij y´xxk˜ijyyq 3 s

Demonstration of a scaling relation by computer simulations
By simulating the above model, we measured the rate of change of xxk˜ijyy, where xxx˜ijyy is x ij averaged over all replicators, at steady states as a function of m and N , assuming s w " s a .
The result indicates the existence of two distinct parameter regions, where xxk˜ijyy either increases or decreases through evolution ( Fig. 2; Methods under 'Parameter-sweep diagram').  (Note that although the model displays an unlimited increase or decrease of xxk˜ijyy over time, the model is intended to inform about short-term evolution as described above; therefore, its result should be considered as providing information about an instantaneous rate of evolution in a steady state for given parameters.) The two parameter regions mentioned above are demarcated by scaling relation N 9 m´α, where α Ó 0 as s Ó 0 (Fig. 3a)-i.e., the evolution of xxk˜ijyy becomes increasingly independent of m as s decreases. Similar scaling relations hold also when s w " 10s a , or s w " 0.1s a (Fig. 3a). These results generalize those previously obtained with specific models of protocells [39,40].

Mathematical analysis of the scaling relation
Next, we present a theory that can account for N 9 m´α under the assumptions that s a and s w are sufficiently small. Although such a theory could in principle be built by calculating the dynamics of the frequency distribution of k ij , for simplicity, we instead calculate the dynamics of the moments of this distribution. The expected change of xxk˜ijyy per generation is expressed by Price's equation as follows [44,45] (see Supplemental Text S1 "Derivation of Eq. (3)"): where Erxs is the expected value of x after one iteration of the Wright-Fisher process, cov˜irx i , y i s is the covariance between x i and y i over collectives, cov ij rx ij , y ij s is the covariance between x ij and y ij over replicators in collective i, and ave˜irx i s is x i averaged over collectives (see Table 1 for precise definitions). Note that the right-hand side of Eq. (3) is divided by xxw˜ijyy, so that Er∆xxk˜ijyys depends on relative rather than absolute fitness (note also that relative fitness is independent of the absolute values of k ij and xk ij y, as described in Model).
Expanding xw ij y and w ij in Eq. (3) as a Taylor series around xk ij y " xxk˜ijyy and k ij " xk ij y [46], we obtain (see Supplemental Text S1 "Derivation of Eq. (4)") where v a is the variance of xk ij y among collectives, and v w is the average variance of k ij among replicators within a collective (Table 1). Equation (4) implies that if s a and s w are sufficiently small, the boundary of the parameter regions, on which Er∆xxk˜ijyys " 0, is given by the following equation: s a v a " s w v w . Since this equation is expected to imply scaling relation N 9 m´α, we need to calculate v w and v a to calculate α.
To calculate v w and v a , we first consider a neutral case where s a " s w " 0. Let the total variance be v t " v a`vw . In each generation, M replicators are randomly sampled from replicators of the previous generation with mutation. Mutation increases v t to the variance of k ij` , which is v t`m σ since k ij and are uncorrelated (the variance of is mσ). Moreover, the sampling decreases the variance by a factor of 1´M´1 (in general, sample variance of sample size M is smaller than population variance by a factor of 1´M´1). Therefore, the expected total variance of the next generation is Likewise, the expected within-collective variance of the next generation can be calculated as follows. To enable this calculation, we assume that all collectives always consist of β´1N replicators, where β is a constant (as will be described later, this approximation becomes invalid for s Á 1; however, its validity for s ! 1 is suggested by the fact that it enables us to calculate scaling exponent α correctly). Randomly sampling β´1N replicators from a collective with mutation is expected to change v w to E rv 1 w s " p1´βN´1qpv w`m σq.
Since Erv 1 a s " Erv 1 t s´Erv 1 w s, we obtain E rv 1 a s " p1´M´1qv a`p βN´1´M´1qpv w`m σq, where the first term on the right-hand side indicates a decrease due to random genetic drift, and the second term indicates an increase due to random walks of xk ij y through withincollective neutral evolution. Note that Eqs. (6) and (7) partially incorporate the collectivelevel division-removal process implicitly through the assumption of a constant collective size.
Next, we incorporate the effect of selection on v w and v a . Allowing for the fact that replicators are sampled with probabilities proportional to fitness w ij , we can use Price's equation to express the expected values of v w and v a after one iteration of the Wright-Fisher process as follows (see Supplemental Text S1 "Derivation of Eq. (8)") [47]: where c w is the average third central moments of k ij within a collective, and c a is the third central moment of xk ij y. Besides the assumption of a constant collective size, the derivation of Eq. (8) involves the additional assumption that the variance of k ij within collective i is statistically independent of xk ij y as i varies.
Given that the dimension of c w and c a is equivalent to that of v 3{2 w and v 3{2 a , we make a postulate, which we verify later by simulations, that where γ a and γ w are positive constants. An intuitive reason for postulating that c a ă 0 is due to the finiteness of M , as follows (Fig. 4). The distribution of xk ij y has a finite range since M is finite. The right tail of this distribution, the one with greater xk ij y, is exponentially amplified by selection among collectives; however, the right tail cannot be extended because its length is finite [48,49]. By contrast, the left tail is contracted by among-collective selection, and this contraction is unaffected by the finiteness of the tail length. Likewise, the finiteness of tail lengths does not affect the rightward shift of the mean of the distribution due to amongcollective selection. Consequently, asymmetry builds up such that the right tail becomes selection nothing to amplify × no skew c a = 0 Figure 4: Mechanism by which trait distribution becomes skewed owing to selection and finiteness of population. Drawing depicts frequency distributions of collective-level trait xk ij y (orange) and effect of among-collective selection (blue arrows) (for simplicity, withincollective selection is not depicted). Distribution is initially assumed to be symmetric (left), so that its third central moment c a is zero. Tails of distribution have finite lengths due to finiteness of total population size M (red arrows). Because of finite lengths, left and right tails react differently to selection depending on whether they are amplified or reduced (red cross; see also main text). Consequently, distribution gets skewed (right), and c a becomes negative. It is postulated based on dimension that c a 9´v 3{2 a at steady state, where v a is variance of xk ij y.
shorter than the left tail, hence c a ă 0. The same argument can be applied to c w , but the direction of selection is opposite, hence the opposite sign: c w ą 0.
Combining Eqs. (8) and (9), we obtain Equations (10) and (11) enable us to calculate v w and v a at a steady state if s a and s w are sufficiently small (a steady state is defined as E rv 1 w s " v w and E rv 1 a s " v a ). For illustration, let us consider extreme conditions in which the expressions of v w and v a become simple.
Moreover, Eq. (11) implies that where the term involving s a M´1 is ignored under the assumption that both s a and M´1 are sufficiently small (and the assumptions that β´1N " 1 and Equation (14) shows that v a at a steady state is independent of N if β´1N ! M , a result that might be contrary to one's intuition since by the law of large number, increasing N reduces random genetic drift within collectives and thus decelerates the growth of v a .
Indeed, the increase of v a per generation is approximately proportional to N´1v w according to the second term of Eq. (11). However, since v w 9 N m according to Eq. (12), N cancels out, so that v a is independent of N (see Supplemental Fig. S1 for simulation results). This cancellation resembles that occurring in the rate of neutral molecular evolution, which is also independent of population size [50].
To examine the validity of Eqs. (10) and (11), we measured v a , v w , c a , and c w through simulations, assuming s w " s a " s ( Fig. 5). The results show that v a 9 m for a very small value of s (viz., 10´6) in agreement with Eq. (14) (Fig. 5a). Moreover, v w 9 mN as predicted by Eq. (12) (Fig. 5b), except for cases where ∆xxk˜ijyy ă 0 (this deviation will be discussed (Fig. 5c), and c w « γ w v 3{2 w if s ! 1 and ∆xxk˜ijyy AE 0 (Supplemental Fig. S2), as postulated in Eq. (9). Taken together, these results support the validity of Eqs. (10) and (11)   Using Eqs. (10) and (11), we can calculate the scaling exponent (α) of the boundary of the parameter regions for sufficiently small s a and s w . Since Er∆xxk˜ijyys " 0 on the parameter boundary, Eq. (4) implies that v a {v w « s w {s a . Thus, for extreme parameter conditions (viz., (12) and (14) imply that For s a " pγ a ? mσM 3{2 q´1, Eqs. (12) and (13) imply that where r " s w {s a and Γ " γ a a σ{βM , and Eq. (16) implies that α increases from zero to one third as ? rs a increases from zero.
We also numerically obtained α by calculating the values of N and m (m P r10´4, 10´1s) that satisfy v a {v w " s w {s a at a steady state using Eqs. (10), (11), and the values of β, γ a , and γ w estimated from Fig. 5bc and Supplemental Fig. S2, respectively (viz., β´1 " 0.45 and γ a " 0.26 through least squares regression of Eqs. (12) and (9) for s " 10´6 and 10´2, respectively; γ w " 0.25 through least squares regression of Eq. (9) for ∆xxk˜ijyy AE 0). The results agree with the simulation results for s a ă 1 when r " 1 and 10, and for s a ă 10´3 when r " 0.1 (Fig. 3a). We do not know why the validity of analytical prediction is restricted when r is small. Overall, the above results support the validity of Eqs. (10) and (11) for sufficiently small values of s a and s w .
In addition, we note that the postulate in Eq. (9) is also supported by previous studies calculating the evolution of a quantitative trait (viz., fitness) subject to single-level selection [48,49]. These studies show that fitness increases through evolution at a rate proportional to the two-third power of the mutation rate. That result is consistent with Eqs. (10) and (11) and, hence, also with Eq. (9), as follows. Since Ref. [48] assumes single-level selection and a very large population, let us also assume that s w " 0 and M Ñ 8, respectively, in our model. Then, Eqs. (4) and (14) imply that logarithmic fitness, lnxxw˜ijyy 9 xk˜ijy, increases at a rate proportional to m 2{3 (Supplemental Fig. S1). Reversing the argument, we can also use the model of Ref. [48] to estimate the value of γ a as about 0.25 (Supplemental Text S1 under "Estimation of γ a "), which matches the value measured in our model (viz. 0.26). Moreover, the model of Ref. [48] can also be applied to estimate γ w , and the value of γ w measured in our model is about 0.25 (Supplemental Fig. S2). Taken together, these agreements corroborate the validity of Eq. (9).
Finally, to clarify why Eqs. (10) and (11)  The results indicate that the model displays a phenomenon previously described as evolutionarily stable disequilibrium (ESD, for short) [39]. Briefly, the collectives constantly oscillate between growing and shrinking phases (Fig. 5d). During the growing phase, the collectives continually grow and divide, and their xk ij y values gradually decline through within-collective evolution, a decline that eventually puts the collectives to a shrinking phase. In the shrinking phase, the collectives steadily decrease in the number of constituent replicators; however, their xk ij y values abruptly jump at the end of the shrinking phase, a transition that brings the collectives back to the growing phase. This sudden increase of xk ij y is due to random genetic drift induced by very severe within-collective population bottlenecks. Although such an increase of xk ij y is an extremely rare event, it is always observed in the lineage of common ancestors because these ancestors are the survivors of among-collective selection, which favours high xk ij y values [39,40].
ESD breaks the assumption-involved in Eqs. (10) and (11)-that all collectives always consist of β´1N replicators because ESD allows extremely small collectives to regrow and contribute significantly to v w and v a (note that the contributions of collectives to v w and v a are proportional to the number of replicators they contain, as defined in Table 1). We found that ESD occurs for s Á 1 (Fig. 5d) or for ∆xk˜ijy ă 0 (Supplemental Fig. S3). Thus, ESD might be responsible for the failure of Eqs. (10) and (11) to predict α for s Á 1 (Fig. 3) as well as the fact that c a ‰ γ a v 2{3 a for s Á 1 (Fig. 5c). In addition, ESD might also be responsible for the fact that v w is not proportional to mN when ∆xk˜ijy ă 0 (Fig. 5b).
Another potential reason for the failure of Eqs. (10) and (11) for s ě 1 is the fact that v a and c a constantly oscillate with a periodic sign change of c a (Fig. 6). This oscillation not only invalidates the assumption that c a " γ a v 2{3 a , but also makes it questionable to consider a steady-state solution of Eq. (10) and (11). Finally, we add that this oscillation is distinct from ESD, in that it is observed in terms of v a and c a , which are properties of an entire population of collectives, whereas ESD is observed in terms of the properties of common ancestors of collectives.

Comparison to a binary-trait model
To examine further the generality of the scaling relation described above, we next consider a study by Kimura [22,23]. Kimura has investigated a binary-trait (i.e., two-allele) model of multilevel evolution formulated based on a diffusion equation. Using this model, Kimura has revealed the following scaling relation that holds when within-and among-collective evolution exactly balance each other out: (the notation has been converted to ours as described in Supplemental Text S1 under "Converting Kimura's notation into ours") [22,23]. Equation (17) is derived under the assumption that the steady-state frequency of the altruistic allele is identical to that in the absence of selection, thus involving a weak-selection approximation [22,23]. Therefore, the scaling exponent in Kimura's model (α « 1) differs from that in ours (α « 0) for s a « 0 and s w « 0.
To study how α depends on s (where s " s w " s a ) if the trait is binary, we modified our model into a binary-trait model by assuming that k ij switches between zero and one at mutation rate m. By simulating the modified model, we obtained a parameter-sweep diagram, where parameter regions were defined by the sign of xxk˜ijyy´1{2 at steady states (Supplemental Fig. S4; this definition of parameter regions is essentially equivalent to that used for the quantitative-trait model, in that it can be rephrased in terms of the sign of ∆xxk˜ijyy at xxk˜ijyy " 1{2). The results show that the parameter-region boundary constitutes scaling relation N 9 m´α, where α Ò 1 as s Ó 0 (Fig. 3b)-i.e., the evolution of xxk˜ijyy becomes increasingly dependent on m as s decreases. Therefore, the way α depends on s is compatible with Eq. (17), but is opposite to that in the quantitative-trait model, where α Ó 0 as s Ó 0 ( Fig. 3a).
To pinpoint why the two models yield such distinct predictions, we re-derived Eq. (17) using the method developed in the previous section (for details, see Supplemental Text S1  (7) need to be modified to respectively, where we have assumed that the parameters are on a parameter-region boundary and, therefore, that xxk˜ijyy " 1{2. Equations (18) and (19) Imposing the condition for a parameter-region boundary, v w {v a « s a {s w , we obtain which is approximately the same as Eq. (17) if m ! 1 and M´1 ! βN´1 ! 1 as assumed by Kimura [22,23].
Equations (18) and (19) allow us to understand why the two models display different scaling exponents. These equations contain terms involving˘4mp1´mqv a , which increase v w and commensurately decrease v a . This 'transfer' of variance occurs because mutation causes xk ij y to tend towards 1{2, for which v w is maximized, in every collective. In other words, mutation directly causes the convergent evolution of xk ij y, raising the v w {v a ratio.
Consequently, the balance between within-and among-collective evolution strongly depends on m. By contrast, the quantitative-trait model assumes that mutation does not cause any directional evolutionary change in xk ij y. Moreover, mutation equally increases v w and v a according to Eqs. (10), (11), and (12). Consequently, the balance between within-and among-collective evolution does not much depend on m if selection is weak.

Discussion
The results presented above suggest that scaling relation N 9 m´α is a general feature of conflicting multilevel evolution. Scaling exponent α, however, depends in a non-trivial manner on the strength of selection and whether altruism is a quantitative or binary trait.
Although we have assumed that the parameters involved in the scaling relation-the mutation rate, selection strength, and the distinction between quantitative and binary traitsare independent of each other, these parameters are potentially correlated in reality. While such correlations are not well understood [51,52], discussing them can illustrate the utility of the findings of this study. For this illustration, we first note that whether altruism is a 19 quantitative or binary trait can be translated into the number of loci involved in altruism: a quantitative trait involves many loci, whereas a binary trait involves one. The number of loci is likely to be positively correlated with the mutation rate of the trait, and it is possibly negatively correlated with the effect size of mutation (e.g., a single locus with large effects versus many loci with small effects). The effect size of mutation, in turn, is possibly positively correlated with the strength of selection. These correlations, which we assume here for the sake of illustration, would imply a spectrum of altruism ranging from a strongly-selected, binary trait with a low mutation rate to a weakly-selected, quantitative trait with a high mutation rate (we are ignoring the possibility that mutations have highly heterogeneous effects). Such correlations would be conducive to the evolution of altruism, an inference that is enabled by the following findings of this study: binary-trait altruism is susceptible to the invasion by cheaters for a high mutation rate, but this susceptibility decreases with selection strength (α decreases with s); by contrast, quantitative-trait altruism is relatively insensitive to mutation for weak selection (α decreases to zero as s decreases to zero).
To do this, we define the relatedness of replicators as the regression coefficient of xk ij y on k ij [65,66], i.e., R " v a {pv a`vw q, and express all the results in terms of R instead of v w {v a .
Therefore, our results are compatible with the kin selection theory.
An important issue to address for future research is to test whether the scaling relation is observed in reality. Such tests could in principle be conducted through evolutionary experiments.
Capturing the essence of multilevel selection, the models and analyses presented above are likely to have broad utility. They are generally relevant for the evolution of altruism in replicators grouped into reproducing collectives, e.g., symbionts, organelles, or genetic elements grouped into cells [3], cells grouped into multicellular organisms [4], or other systems 20 that have emerged through major evolutionary transitions [1].

Parameter-sweep diagram
In Fig. 2, the value of ∆xxk˜ijyy was estimated from slopes of the least squares regression of

Ancestor tracking
Ancestor tracking is a method that provides novel information about evolutionary dynamics by tracking the genealogy of individuals backwards in time. In our study, individuals whose genealogy was tracked were collectives. Since collectives undergo binary fission, their genealogy can be pictured as a binary tree, where an event of binary fission is represented by the coalescence of two branches of the tree. As the tree is traversed from the tips to the root (i.e., from the present to the past), all branches eventually coalesce to a single branch, the stem of the tree, which represents the lineage of common ancestors of all collectives present at a particular point in time. Information about common ancestors can be visualized as 21 time-series data along their line of descent, i.e., along the stem of the tree. In Fig. 5d, n i and xk ij y of the common ancestors are plotted.

Competing interests
We have no competing financial interests.

Author Contributions
N.T. conceived the study, designed, implemented and analysed the models, and wrote the

Supplemental Information captions
Text S1 Supplemental texts. It consists of the following sections: 1. Derivation of Eq.  In this section, we derive Eq. (3) of the main text, which is redisplayed below: where the symbols are defined as follows: where M is the total number of replicators, n i is the number of replicators in collective i, L is the number of collectives, and xk ij y :" 1 In each generation, a replicator is sampled M times with replacement from replicators of the previous generation with probabilities proportional to fitness w ij , as in the Wright-Fisher process (see the main text under "Model"). To express xxk˜ijyy in the next generation, we introduce the following symbols. Let I l be the index of the collective to which the lth sampled replicator belongs (l P t1, 2,¨¨¨, M u), J l be the index of the sampled replicator within collective I l , and P pI l " i, J l " jq be the probability that replicator j in collective i is sampled. By the definition of the Wright-Fisher process, Moreover, let I l J l be the effect of mutation and P p I l J l q be its probability distribution function ( I l J l takes a value of 0 with a probability 1´m or a value sampled from a Gaussian distribution with mean 0 and variance σ with a probability m). Finally, let Erxs denote the expected value of x after one iteration of the Wright-Fisher process; e.g., Using these definitions, we can express the expected change of xxk˜ijyy per generation, denoted by Er∆xxk˜ijyys, as follows: Since I l and J l are independent and identically distributed for different values of l, we can remove the summation in the above equation to obtain Er∆xxk˜ijyys " E rk IJ` IJ s´xxk˜ijyy where we used the fact that E r IJ s " 0 and omitted subscript l. The first term on the RHS of Eq. (S6) can be calculated as follows: Substituting Eq. (S7) into Eq. (S6), we obtain Eq. (3).

Derivation of Eq. (4)
In this section, we derive Eq. (4) of the main text, which is redisplayed below: where the symbols are defined as follows: Equation (4) is obtained by expanding xw ij y and w ij in Eq. (3), i.e., ∆xxk˜ijyy " xxw˜ijyy´1cov˜i " xw ij y, xk ij y ‰`x xw˜ijyy´1ave˜i " cov ij rw ij , k ij s ‰ , as Taylor series around xk ij y " xxk˜ijyy and k ij " xk ij y, respectively.
First, we obtain the first term of Eq. (4), which stems from the first term of Eq. (3). We assume that xw ij y is an analytic function of xk ij y and that xw ij y " xxw˜ijyy for xk ij y " xxk˜ijyy. Expanding xw ij y around xk ij y " xxk˜ijyy, we obtain Dividing both sides by xxw˜ijyy, we obtain By the definition of selection strength (see the main text under "Model"), 1 xxw˜ijyy Bxw ij y Bxk ij yˇˇˇˇx k ij y"xxkĩj yy "ˆ1 xw ij y Bxw ij y Bxk ij y˙ˇˇˇˇx k ij y"xxkĩj yy :" s a (S11) By mathematical induction, it can be shown that 1 xw ij y B l`1 xw ij y Bxk ij y l`1 "ˆB Bxk ij y`1 xw ij y Bxw ij y Bxk ij y˙1 xw ij y B l xw ij y Bxk ij y l (S12) for l P t1, 2, 3,¨¨¨u. Since it is assumed that Bs a {Bxk ij y " 0 (see the main text under "Model"), the above equation implies that 1 xw ij y B l xw ij y Bxk ij y l˙ˇx k ij y"xxkĩj yy " s l a . (S13) Given the above equation, Eq. (S10) implies that xw ij y xxw˜ijyy " 1`s a`x k ij y´xxk˜ijyy˘`Ops 2 a q.
(S14) Therefore, Second, we obtain the second term of Eq. (4), which stems from the second term of Eq. (3). Using the same method as above, we can show that We assume that xw ij y and v wi are statistically uncorrelated as i varies (this is equivalent to assuming that xk ij y and v wi are uncorrelated). Under this assumption, ´ave˜i " xw ij y ‰ s w ave˜i rv wi s`Ops 2 w q "´xxw˜ijyys w v w`O ps 2 w q. (S17) Substituting Eqs. (S15) and (S17) into Eq. (3), we obtain Eq. (4).

Calculation of Erv 1 w s
To calculate Erv 1 w s, we introduce the following symbols. Let n 1 i be the number of replicators in collective i after one iteration of the Wright-Fisher process. Note that n 1 i is a random variable and can be expressed as where δ I l i is the Kronecker delta (i.e., δ I l i " 1 if I l " i, and δ I l i " 0 otherwise). Moreover, let xk iJl y and x iJl y be the sample mean of k I l J l and I l J l within collective i: which are defined to be zero when n 1 i " 0. The probability that J l " j given I l " i is where Eq. (S3) is used. Using the symbols defined above, we can express the sample variance of k ij within collective i in the next generation as v 1 wi :" 1 which is defined to be zero when n 1 i " 0. Using v 1 wi , we can express Erv 1 w s as follows: In the last equation, we can separate k iJ l and iJ l as follows: The last term of the final line of Eq. (S23) can be shown to be zero, as follows. With the Kronecker delta δ I l i , this term can be calculated as where we used the fact that k iJ l and iJ l are independent of each other. Therefore, To calculate the first term of Eq. (S25), we define within-collective conditional expectation as follows: Using Eq. (S26), we can transform the first term of Eq. (S25) as follows: ı . (S27) The first term in the last line of Eq. (S27) is calculated as follows: Using the Kronecker delta δ I l i , we can show that Likewise, we can show that Also, we can show that Using the above results, we can transform the last line of Eq. (S28) as follows: The conditional expectation in the last line of Eq. (S32) can be calculated as follows: where we used the fact that xk 2 ij y´xk ij y 2 " v wi . The last line of Eq. (S33) can be interpreted as the expected variance of k ij within collective i after one iteration of the Wright-Fisher process excluding the effect of random sampling. Thus, let us introduce the following symbol: which denotes the expected change of the variance of k ij within collective i due to withincollective selection. Combining Eqs. (S28), (S32), (S33), and (S34), we can transform the first term in the last line of Eq. (S27) as follows Next, we calculate the second term in the last line of Eq. (S27) as follows: where we used the fact that k iJ l and k iJm are independent of each other for l ‰ m in the final step. Since n 1 i can be zero, the last line of Eq. (S36) needs to interpreted as follows: 1 Thus, where the case for n 1 i ą 0 follows from a calculation similar to Eqs. (S28) and (S32). To calculate the last line of Eq. (S36), we separate the case where n 1 i ą 0 @i P t1, 2, 3,¨¨¨, Lu and the case where n 1 i " 0 for some i P t1, 2, 3,¨¨¨, Lu. Let S be a proper subset of t1, 2, 3,¨¨¨, Lu, and P pSq be the probability that n 1 i " 0 @i P S and n 1 i ą 0 @i R S. The value of P pSq can be estimated as follows: where the RHS of the first inequality is the probability that n 1 i " 0 @i P S and n 1 i ě 0 @i R S. Using these symbols and Eq. (S34), we can express the last line of Eq. (S36) as follows: where ř S is a summation over all possible S Ă t1, 2,¨¨¨, Lu, and ř S‰H is the same summation excluding the case where S is empty.
We assume that the second term of the last line of Eq. (S40) is negligible for the following reasons. If n i " 1 for some i P S, then Eq. (S39) implies that P pSq « 0. Contrariwise, if the statement that n i " 1 is false for all i P S, then the value of ř iPS pv wi`∆s v wi q is likely to be small because n i is small. Under this assumption, Eq. (S40) implies that The second term of Eq. (S25) is calculated as follows: where |S| is the number of elements in S, and we have assumed that ř S P pSq|S| ! L in the last step.
To enable the further calculation of Eq. (S43), we assume that Under this assumption, we can transform Eq. (S43) as follows where we used Eq. (S16) in the last step. Expanding w ij as a Taylor series around k ij " xk ij y, we can show that where c wi :" xpk ij´x k ij yq 3 y, and that where c w :" ave˜irc wi s. Substituting Eq. (S47) into Eq. (S45), we obtain

Calculation of E rv 1 a s
To calculate E rv 1 a s, we first calculate E rv 1 t s and then use the fact that E rv 1 a s " E rv 1 t s´E rv 1 w s. We can express E rv 1 t s as follows: where we used the fact that k I l J l and iJ l are independent of each other, as we did in Eqs. (S23), (S24), and (S25). The first term of the last line of Eq. (S49) is calculated as follows: The first term of the last line of Eq. (S50) is calculated as follows: where we used Eq. (S34) and the assumption that xw ij y and v wi`∆s v wi are statistically uncorrelated as i varies, which has already been made in Eq. (S17). Using Eq. (S16) and (S47), we can transform the last line of Eq. (S51) as follows: ave˜i rv wi`∆s v wi s " v w´sw c w`O ps 2 w q. Therefore, The second term of the last line of Eq. (S50) is calculated as follows: The first term of the last line of Eq. (S54) is calculated as follows: where we have used the following notation in the final step: The second term of the last line of Eq. (S54) is calculated as follows: Doing the same calculation as in Eq. (S55), we can transform Eq. (S57) as follows: Combining Eqs. (S54), (S55), and (S58), we obtain ı´`a ve˜i " xk ij y`∆ s xk ij y ‰˘2 " xxw˜ijyy´1 cov˜i " xw ij y,`xk ij y`∆ s xk ij y´ave˜i " xk ij y`∆ s xk ij y ‰˘2 í`x xw˜ijyy´1cov˜i " xw ij y, xk ij y`∆ s xk ij y ‰˘2 We consider each term in the last line of Eq. (S59) in terms of the order of s a and s w . We begin with the first term.
where we used Eqs. (S16) and (S17) in the last step. The last term of the last line of Eq. (S60) is zero because ∆ s xk ij y is independent of xk ij y and xw ij y, a fact that stems from the assumptions that Bs w {Bxk ij y " 0 (see the main text under "Model") and that v wi is where the value of c depends weakly on p c and is around 4 in a wide range of p c [1]. Multiplying both sides of Eq. (S68) with r or pr´xryq 2 and integrating over the whole range, we get where V , C, 1 , and 2 are defined as follows: In obtaining Eqs. (S70) and (S71), we have assumed that the surface terms go to zero as r Ñ 8; i.e., lim rÑ˘8 P pr, tq " 0, lim rÑ˘8 rP pr, tq " 0, lim rÑ˘8 r Bp Br " 0, and lim rÑ˘8 r 2 Bp Br " 0. For a travelling-wave solution of Eq. (S68) with a constant speed v and shape, Eq. (S70) implies v " V´ 1 .

Converting Kimura's notation into ours
where the symbols are as described in Table S1 (see also the next paragraph). Equation (S79) includes Eq. (17) of the main text as a special case. Equation (S79) appears as Eq. (27) of Ref. [2] or Eq. (4.8) of Ref. [3] as and is derived therein under the assumption that the steadystate frequency of the altruistic allele is identical to that in the absence of selection, an approximation that is expected to be valid in the limit of weak selection.
To convert Kimura's notation into ours, we assumed that the rate of mutation from a non-altruistic to an altruistic allele is identical to the rate of mutation from the altruistic to the non-altruistic allele, so that the direction of mutation is unbiased as in our quantitativetrait model. Moreover, we assumed that the migration rate among collectives is zero since our model does not consider migration. Finally, we took account of the fact that Kimura's model considers diploid as follows. In Kimura's model, each collective consists of N diploid individuals, i.e., 2N alleles. The number of alleles per collective can be considered as the average number of replicators per collective in our model (i.e., β´1N ) because Kimura's model assumes no dominance.

Derivation of Kimura's result through our method
In this section, we derive Eq. (17) of the main text, which gives parameter-region boundaries of the binary-trait model, using the method developed in the main text. The most important difference between the binary-trait and quantitative-trait models resides in the definition of IJ . Thus, we consider only terms involving or mσ as described below.
First, we show that the change in the definition of IJ does not affect the condition for the parameter-region boundary given by Eq. (3). In the binary-trait model, IJ " 0 with probability 1´m and IJ " 1´2k IJ with probability m (I and J are random variables taking the indices of a sampled replicator, as defined in Section 1). Thus, E r IJ s " (S80) Therefore, Eq. (4) needs to be modified as follows: E " ∆xxk˜ijyy ‰ " s a v a´sw v w`m p1´2xxk˜ijyyq`Ops 2 w`s 2 a`m s w`m s a q.
This equation, however, becomes almost identical to Eq. (4) if the parameters are on the parameter-region boundary, on which xxk˜ijyy " 1{2. Thus, the condition for the parameterregion boundary when s a , s w , and m are sufficiently small is the same as in the quantitativetrait model. Next, we consider Eq. (5) and show that the change in the definition of IJ makes a significant difference, which explains the difference between the binary-and quantitative-trait models. In Eq. (5), mσ represents the difference between v t and the variance of k IJ` IJ . In the quantitative-trait model, this difference is simply the variance of IJ because IJ and Figure S1: Rate of logarithmic fitness increase as function of mutation rate measured through simulations with no within-collective selection (s w " 0, s a " 10´3, M " 5ˆ10 5 , and σ " 10´4). Fitness is defined as w ij " e saxk ij y . Symbols have following meaning: N " 10 2 (black circles); N " 10 3 (blue triangle up); N " 10 4 (orange triangle down); N " 10 5 (green diamond). Line is ∆xxk˜ijyy{∆t 9 m 2{3 , as predicted by Eqs. (4) and (14) in main text. This figure confirms that ∆xxk˜ijyy{∆t 9 m 2{3 in agreement with Ref. [1]. Note also that ∆xxk˜ijyy is roughly independent of N if N ! M , which is consistent with prediction of Eq. (14) in main text.  Figure S2: Average third central moment of k ij within collective (c w ) measured through simulations (M " 5ˆ10 5 , σ " 10´4, and s w " s a " 10´2). (a) Ratio between effect of selection and that of random genetic drift on ∆v w as function of ∆xxk˜ijyy{∆t: ∆xxk˜ijyy{∆t ą 0 (black triangle up); ∆xxk˜ijyy{∆t ă 0 (red triangle down). (b) |c w | as function of v w . Triangles are simulation results: ∆xxk˜ijyy{∆t ą 3ˆ10´7 (black triangle up); ∆xxk˜ijyy{∆t ă 3ˆ10´7 (red triangle down). Line is |c w | 9 v 3{2 w , as postulated in Eq. (9). Least squares fitting of |c w | " γ w v 3{2 w to data for ∆xxk˜ijyy{∆t ă 3ˆ10´7 yielded γ w « 0.25.  Figure S3: Dynamics of common ancestors of collectives (m " 0.01, M " 5ˆ10 5 σ " 10´4, and s a " s w " 0.01). Plotted are number of replicators per collective (black; left coordinate) and xk ij y (orange; right coordinate). (a) N " 5623. In this case, ∆xk˜ijy ą 0, and evolutionarily stable disequilibrium is not clearly observed. (b) N " 17783. In this case, ∆xk˜ijy ă 0, and evolutionarily stable disequilibrium is clearly observed.  Figure S4: Parameter-sweep diagrams of binary-trait model (s w " s a " s and M " 5ˆ10 5 ). Symbols have following meaning: xxk˜ijyy ą 1{2 (triangle up); xxk˜ijyy ă 1{2 (triangle down). Lines are estimated parameter-region boundaries. Parameter-region boundaries were estimated as follows. Zeros of xxk˜ijyy´1{2 were estimated with linear interpolation with respect to N from two simulation points around parameter-region boundary for various m values between 10´5 and 10´1. Estimated zeros were used to obtain parameter-region boundary through least squares regression of N 9 m´α.