## Abstract

In ecology and conservation biology, the number of species counted in a biodiversity study is a key metric but is usually a biased underestimate of total species richness because many rare species are not detected. Moreover, comparing species richness among sites or samples is a statistical challenge because the observed number of species is sensitive to the number of individuals counted or the area sampled. For individual-based data, we treat a single, empirical sample of species abundances from an investigator-defined species assemblage or community as a reference point for two estimation objectives under two sampling models: estimating the expected number of species (and its unconditional variance) in a random sample of (i) a smaller number of individuals (multinomial model) or a smaller area sampled (Poisson model) and (ii) a larger number of individuals or a larger area sampled. For sample-based incidence (presence–absence) data, under a Bernoulli product model, we treat a single set of species incidence frequencies as the reference point to estimate richness for smaller and larger numbers of sampling units.

The first objective is a problem in interpolation that we address with classical rarefaction (multinomial model) and Coleman rarefaction (Poisson model) for individual-based data and with sample-based rarefaction (Bernoulli product model) for incidence frequencies. The second is a problem in extrapolation that we address with sampling-theoretic predictors for the number of species in a larger sample (multinomial model), a larger area (Poisson model) or a larger number of sampling units (Bernoulli product model), based on an estimate of asymptotic species richness. Although published methods exist for many of these objectives, we bring them together here with some new estimators under a unified statistical and notational framework. This novel integration of mathematically distinct approaches allowed us to link interpolated (rarefaction) curves and extrapolated curves to plot a unified species accumulation curve for empirical examples. We provide new, unconditional variance estimators for classical, individual-based rarefaction and for Coleman rarefaction, long missing from the toolkit of biodiversity measurement. We illustrate these methods with datasets for tropical beetles, tropical trees and tropical ants.

Surprisingly, for all datasets we examined, the interpolation (rarefaction) curve and the extrapolation curve meet smoothly at the reference sample, yielding a single curve. Moreover, curves representing 95% confidence intervals for interpolated and extrapolated richness estimates also meet smoothly, allowing rigorous statistical comparison of samples not only for rarefaction but also for extrapolated richness values. The confidence intervals widen as the extrapolation moves further beyond the reference sample, but the method gives reasonable results for extrapolations up to about double or triple the original abundance or area of the reference sample. We found that the multinomial and Poisson models produced indistinguishable results, in units of estimated species, for all estimators and datasets. For sample-based abundance data, which allows the comparison of all three models, the Bernoulli product model generally yields lower richness estimates for rarefied data than either the multinomial or the Poisson models because of the ubiquity of non-random spatial distributions in nature.

## INTRODUCTION

Exhaustive biodiversity surveys are nearly always impractical or impossible (Lawton et al. 1998), and the difficulties inherent in estimating and comparing species richness from sampling data are well known to ecologists and conservation biologists. Because species richness increases non-linearly with the number of individuals encountered, the number of samples collected or the area sampled, observed richness is inevitably a downward biased estimate of true richness. ‘Adjustment’ for differences in sampling effort by calculating simple ratios of species per individual or species per unit of sampling effort seriously distorts richness values and should never be relied upon (Chazdon et al. 1999). For assessing and comparing species accumulation curves or rarefaction curves, methods that are based on an explicit statistical sampling model provide a straightforward resolution for many applications (Gotelli and Colwell 2011).

In many biodiversity studies, the basic units are individuals, ideally sampled randomly and independently, counted and identified to species. We refer to such data as ‘individual based’. In many other studies, the sampling unit is not an individual, but a trap, net, quadrat, plot or a fixed period of survey time. It is these sampling units, and not the individual organisms, that are sampled randomly and independently. If the number of individuals for each species appearing within each sampling unit can be measured or approximated, we refer to the resulting data as ‘sample-based abundance data’. For many organisms, especially microorganisms, invertebrates or plants, only the incidence (presence or absence) of each species in each sampling unit can be accurately recorded. We refer to such a dataset as ‘sample-based incidence data’ (Gotelli and Colwell 2001).

For individual-based data, we treat a single, empirical sample of species abundances, which we refer to as the ‘abundance reference sample’ from an investigator-defined species assemblage or community as a reference point for two estimation problems: (i) estimating the expected number of species (and its variance) in a random sample of a smaller number of individuals or a smaller area sampled and (ii) estimating the number of species (and its variance) that might be expected in a larger number of individuals or a larger area sampled. The first is an ‘interpolation’ problem that is addressed with classical rarefaction and Coleman rarefaction. The second is an ‘extrapolation’ problem that we address with sampling-theoretic predictors for the number of species in a larger sample or larger area based on an estimated asymptotic species richness.

For sample-based incidence data, the statistical equivalent of the abundance reference sample is the ‘incidence reference sample’, the set of incidence frequencies among the sampling units, one frequency for each observed species over all sampling units. For interpolation and extrapolation, we treat these incidence frequencies in nearly the same way that we treat the list of species abundances in a single abundance reference sample (with appropriate statistical modifications), to estimate richness for smaller and larger numbers of sampling units. For sample-based abundance data, the abundances are either first converted to incidences (presence or absence) before applying incidence-based methods or else abundances are summed across sampling units and individual-based (abundance) methods are applied to the sums.

Both interpolation and extrapolation from an empirical reference sample can be viewed as estimating the form of the underlying species accumulation curve. This curve is a plot of species richness as a function of the number of individuals or sampling units, including both smaller and larger numbers of individuals or sampling units than in the reference sample. We model the species accumulation curve as asymptotic to an estimate of the species richness of the larger community or assemblage represented by the empirical reference sample (Fig. 1).

In this paper, for the interpolation (rarefaction) problem for individual-based (abundance) data, we present a unified statistical framework for two distinct approaches: (i) a multinomial model for classical rarefaction (Heck et al. 1975; Hurlbert 1971; Sanders 1968; Simberloff 1979; Smith and Grassle 1977) and (ii) a continuous Poisson model for Coleman's ‘random-placement’ rarefaction method (Coleman 1981; Coleman et al. 1982). For sample-based (incidence) data, we present a ‘Bernoulli product model’ for sample-based rarefaction (Shinozaki 1963, Ugland et al. 2003 and Colwell et al. 2004, with an instructive historical perspective by Chiarucci et al. 2008).

For the extrapolation problem, we present—in the same unifying statistical framework as for interpolation—non-parametric methods for projecting rarefaction curves beyond the size of the reference sample under all three models. For the multinomial model, first explored for extrapolation by Good and Toulmin (1956), we rely on published work by Solow and Polasky (1999), Shen et al. (2003) and Chao et al. (2009). For the Poisson model, we follow Chao and Shen (2004), in which the pioneering work by Good and Toulmin (1956) was discussed. Alternative approaches to the extrapolation of individual-based rarefaction curves include the little-used ‘abundification’ method of Hayek and Buzas (1997) and the mixture model of Mao (2007). Extrapolating sample-based rarefaction curves beyond the incidence reference sample has been investigated by Colwell et al. (2004), Mao et al. (2005), Mao and Colwell (2005) and Mao (2007), but those methods, although theoretically useful and flexible, are based on rather complicated mixture models. Here, we take a simpler approach to extrapolation for the Bernoulli product model (e.g. Burnham and Overton 1978), in hopes that it will be more widely applied.

The principal use of rarefaction curves has long been the comparison of species richness among empirical samples that differ in the total number of individuals (e.g. Lee et al. 2007; Sanders 1968, among many others) or among sample-based datasets that differ in the total number of sampling units (e.g. Longino and Colwell 2011; Norden et al. 2009, among many others). Rigorous comparison of rarefaction curves at a common number of individuals or a common number of sampling units requires computation of confidence intervals for these curves. However, existing variance estimators for individual-based (classical) rarefaction (Heck et al. 1975) and for Coleman rarefaction (Coleman et al. 1982) are not appropriate for this purpose because they are conditional on the reference sample.

For sample-based rarefaction, Colwell et al. (2004) derived an unconditional variance estimator, which we use as a model to develop simple, approximate expressions for the unconditional variance for both classical rarefaction and Coleman’s random-placement rarefaction, long missing from the toolkit of biodiversity measurement and estimation for individual-based data (Gotelli and Colwell 2011). These unconditional variance expressions assume that the reference sample represents a random draw from a larger (but unmeasured) community or species assemblage, so that confidence intervals for rarefaction curves remain ‘open’ at the full-sample end of the curve. In contrast, traditional variance estimators for rarefaction (e.g. Heck et al. 1975; Ugland et al. 2003) are conditional on the sample data, so that the confidence interval closes to zero at the full-sample end of the curve, making valid comparisons of curves and their confidence intervals inappropriate for inference about larger communities or species assemblages. For all three models, we also provide unconditional variance estimators for extrapolation, modeled on the estimators of Shen et al. (2003) and Chao and Shen (2004).

For individual-based methods, we illustrate interpolation, extrapolation and comparison between reference samples from different assemblages using datasets from old-growth and nearby second-growth forests in two regions of Costa Rica. One dataset, from southwestern Costa Rica, is for beetles (Janzen 1973a, 1973b) and the other, from northeastern Costa Rica, is for trees (Norden et al. 2009). We illustrate sample-based methods with biogeographical data for Costa Rican ants sampled at five elevations along an elevational transect (Longino and Colwell 2011). We use the unconditional variance formulas to construct 95% confidence intervals for both interpolated and extrapolated values. For extrapolation, we also show, for all three models, how to estimate the sample size required to reach a specified proportion of the estimated asymptotic species richness, following the approach of Chao and Shen (2004) and Chao et al. (2009).

## THE MODELS

### Individual-based (abundance) data

Consider a species assemblage consisting of *N* individuals, each belonging one of *S* different species. Species *i* has *N _{i}* individuals, representing proportion

*p*

_{i}=

*N*

_{i}/

*N*of the total, $\u2211i=1SNi=N$. A single, representative sample of

*n*individuals, the reference sample, is drawn at random from the assemblage, from an area

*A*units in size. Each individual in the reference sample is identified to species (or to some other consistently applied taxonomic rank, DNA sequence similarity or functional group assignment). The total number of species observed in the sample is

*S*

_{obs}, with the

*i*th species represented by

*X*individuals, $\u2211i=1SXi=n$ (only species with

_{i}*X*> 0 contribute to

_{i}*S*

_{obs}in the reference sample). We define the ‘abundance frequency count’

*f*as the number of species each represented by exactly

_{k}*X*

_{i}=

*k*individuals in the reference sample, 0 ≤

*k*≤

*n*. Formally, $fk=\u2211i=1SI(Xi=k)$, where

*I*(·) is an indicator function that equals 1 when true and 0 otherwise, so that $\u2211k=1nkfk=n$, $Sobs=\u2211k=1nfk$. The number of species present in the assemblage but not detected in the reference sample is thus represented as

*f*

_{0}.

For most assemblages, no sampling method is completely unbiased in its ability to detect individuals of all species (e.g. Longino and Colwell 1997). For this reason, a ‘representative’ sample is necessarily defined as one that is random within the capabilities of the sampling method in relation to the taxon sampled. We use the term ‘assemblage’ to refer to the set of all individuals that would be detected with this sampling method in a very large sample. In other words, we assume in this paper that the assemblage is the effectively infinite sampling universe from which the reference sample has been collected.

We consider two alternative sampling models for individual-based (abundance) data. In the ‘multinomial model’ for classical, individual-based rarefaction (Hurlbert 1971), the reference sample is of fixed size *n*, within which discrete and countable organisms are assumed to be distributed among species multinomially. The assemblage has *S* species, in relative abundances (proportions) *p*_{1},*p*_{2},…*p*_{S}, so that the probability distribution is

The multinomial model assumes that the sampling procedure itself does not substantially alter relative abundances of species (*p*_{1},*p*_{2},…,*p*_{S}). We assume that, in most biological applications, the biological populations in the assemblage being sampled are sufficiently large that this assumption is met. If this assumption is not met, the hypergeometric model, which describes sampling without replacement, is technically more appropriate (Heck et al. 1975), but in practice the two probability distributions differ little if sample size (*n*) is small relative to assemblage size (*N*).

In the ‘continuous Poisson model’ or Coleman rarefaction (Coleman 1981), the reference sample is defined not by *n*, the number of individuals sampled, but instead by a specified area *A* (or a specified period of time), within which the *i*th species occurs at a species-specific mean rate *A**λ*_{i}, so that the probability distribution is

Based solely on information in the reference sample of *n* individuals or the individuals from area *A*, counted and identified to species, we have these six complementary objectives for abundance-based data (Fig. 1a and b): (i) to obtain an estimator $S~ind(m)$ for the expected number of species in a random sample of *m* individuals from the assemblage (*m* < *n*) or (ii) an estimator $S~area(a)$ for the expected number of species in a random area of size *a* within the reference area of size *A* (*a* < *A*); (iii) to obtain an estimator $S~ind(n+m*)$ for the expected number of species in an augmented sample of *n* + *m** individuals from the assemblage (*m** > 0), given *S*_{obs}, or (iv) an estimator $S~area(A+a*)$ for the expected number of species in an augmented area *A* + *a** (*a** > 0), given *S*_{obs}; and (v) to find an predictor $m~g*$ for the number of additional individuals or (vi) the additional area $a~g*$ required to detect proportion *g* of the estimated assemblage richness *S*_{est}.

### Sample-based incidence data

Consider a species assemblage consisting of *S* different species, each of which may or may not be found in each of *T* independent sampling units (quadrats, plots, traps, microbial culture plates, etc.) The underlying data consist of a species-by-sampling-unit incidence matrix, in which *W*_{ij} = 1, if species *i* is detected in sampling unit *j*, and *W*_{ij} = 0 otherwise. The row sum of the incidence matrix, $Yi=\u2211j=1TWij,$ denotes the incidence-based frequency of species *i*, for *i* = 1*,*2, …, *S*. The frequencies *Y _{i}* represent the incidence reference sample to be rarefied or extrapolated. The total number of species observed in the reference sample is

*S*

_{obs}(only species with

*Y*> 0 contribute to

_{i}*S*

_{obs}). We define the ‘incidence frequency count’

*Q*as the number of species each represented exactly

_{k}*Y*times in the incidence matrix sample, 0 ≤

_{i}= k*k*≤

*T*. Formally, $Qk=\u2211i=1SI(Yi=k)$, so that $\u2211k=1TkQk=\u2211i=1SYi$, $Sobs=\u2211k=1TQk$. Thus,

*Q*

_{1}represents the number of ‘unique’ species (those that are detected in only one sample) and

*Q*

_{2}represents the number of ‘duplicate’ species (those that are detected in only two samples), in the terminology of Colwell and Coddington (1994), while

*Q*

_{0}denotes the number of species among the

*S*species in the assemblage that were not detected in any of the

*T*sampling units.

For sample-based incidence data, we consider a Bernoulli product model for an incidence reference sample arising from incidence frequencies in a fixed number *T* of replicate sampling units. Assume that the probability of detecting species *i* in any one sample is *θ*_{i}, for *i* = 1, 2, …, *S*. Here, $\u2211i=1S\theta i$ may be greater than 1. (For example, the detection probability of the first species might be 0.6 and for the second species 0.8.) We assume that each *W _{ij}* is a Bernoulli random variable (since

*W*

_{ij}= 0 or

*W*

_{ij}= 1), with probability

*θ*

_{i}that

*W*

_{ij}= 1. Thus, the probability distribution for the incidence matrix is

This model has been widely used in the context of capture–recapture models (e.g. Burnham and Overton 1978). The row sums (*Y*_{1}, *Y*_{2}, … ,*Y*_{S}) are the sufficient statistics, and our analysis is based on the incidence frequency counts *Q _{k}* defined from (

*Y*

_{1},

*Y*

_{2},…,

*Y*

_{S}).

Based solely on information in the incidence reference sample of *T* sampling units, we have these three complementary objectives for sample-based incidence data (Fig. 1c): (i) to obtain an estimator $S~sample(t)$ for the expected number of species in a random set of *t* sampling units from the *T* sampling units defining the reference sample (*t* < *T*), (ii) to obtain an estimator $S~sample(T+t*)$ for the expected number of species in an augmented set of *T + t** sampling units (*t** > 0) from the assemblage, given *S*_{obs}, and (iii) to find a predictor $t~g*$ for the number of additional sampling units required to detect proportion *g* of the estimated assemblage richness *S*_{est}.

## INDIVIDUAL-BASED INTERPOLATION (RAREFACTION)

### The multinomial model (classic rarefaction)

For the multinomial model (classical rarefaction), we need to estimate the expected number of species *S*_{ind}(*m*) in a random set of *m* individuals from the reference sample (*m* < *n*) (Fig. 1a). If we knew the true occurrence probabilities (*p*_{1},*p*_{2},…,*p*_{S}) of each of the *S* species in the assemblage, we could compute

Instead, we have only the reference sample to work from, with observed species abundances *X*_{i}*.*Smith and Grassle (1977) proved that the minimum variance unbiased estimator (MVUE) for *S*_{ind}(*m*) is

They showed that this expression is also the MVUE for the hypergeometric rarefaction model, which assumes sampling without replacement. Because the MVUE is the same for the hypergeometric and the multinomial models, we can relax our assumption about sampling effects on assemblage abundances. In terms of frequency counts *f*_{k}, the estimator becomes

Assume that the occurrence probabilities (*p*_{1},*p*_{2},…,*p*_{S}) can be treated as a random vector from a multivariate distribution with identical marginal distributions, implying that the abundance frequency counts follow approximately a multinomial distribution. If we can estimate the full richness *S* of the assemblage with an estimator *S*_{est}, then an approximate unconditional variance *σ*_{ind}^{2}(*m*) of rarefied richness $S~ind(m)$ is given by

This variance is based on an approach similar to that used by Burnham and Overton (1978) for a jackknife estimator of population size in the context of capture–recapture models. Smith and Grassle (1977) provide an unconditional variance formula of $S~ind(m)$, but their expression for the variance is difficult to compute. We postpone specification of *S*_{est} for a later section.

### The Poisson model (Coleman rarefaction)

For the Poisson model (Coleman rarefaction), we need to estimate the expected number of species *S*_{area}(*m*) in a random area of size *a* within the reference area of size *A* (*a* < *A*) (Fig. 1b). If we knew the true Poisson occurrence rate (*λ*_{1},*λ*_{2},…,*λ*_{S}) of each of the *S* species in the assemblage, we could compute

Instead, based on species abundances *X*_{i} in the reference sample*,*Coleman (1981) showed that

This estimator is the MVUE for *S*_{area}(*a*) (Lehmann and Casella 1998, 108–9). In terms of frequency counts *f*_{k}, the estimator becomes

If we can estimate the full richness *S* of the assemblage by an estimator *S*_{est}, then an expression for the unconditional variance *σ*_{area}^{2}(*a*) of rarefied richness $S~area(a)$ is given by

Coleman et al. (1982) provide an estimator for the variance of $S~area(a)$ conditional on the reference sample. We postpone specification of *S*_{est} for a later section.

### Comparing the multinomial and Poisson models for interpolation

How different are the rarefaction estimates of species richness estimators under the multinomial and the Poisson models? From Equations (4) and (6), the estimates from the two models can be compared by computing

If we assume that individuals are randomly and independently distributed in space, then *a*/*A*≈*m*/*n* and

Colwell and Coddington (1994) and Brewer and Williamson (1994) showed that, for most datasets, $\Delta S~$ is quite small because both (1 − *m*/*n*)^{k} and *α*_{km} approach zero as subsample size *m* approaches the reference sample size *n*, and the frequency count *f*_{k} also becomes small at larger *k*. Thus, $\Delta S~$ is very small except for small values of *m*. In a later section, we compare the two methods using an example from tropical beetles (Janzen 1973a, 1973b). If individuals are not randomly distributed but are aggregated intraspecifically, both methods will overestimate the number of species for a smaller number of individuals *m* or a smaller area *a* (Chazdon et al. 1998; Colwell and Coddington 1994; Colwell et al. 2004; Gotelli and Colwell 2001; Kobayashi 1982).

## INDIVIDUAL-BASED EXTRAPOLATION

### The multinomial model

For the multinomial model, the extrapolation problem is to estimate the expected number of species *S*_{ind}(*n* + *m**) in an augmented sample of *n + m** individuals from the assemblage (*m** > 0) (Fig. 1a). If we knew the true occurrence probabilities *p*_{1},*p*_{2},…,*p*_{S} of each of the *S* species in the assemblage, given *S*_{obs}, we could compute

Instead, we have only the reference sample to work from, with observed species abundances *X*_{i} and their frequency counts *f _{i}*. Based on work by Solow and Polasky (1999), Shen et al. (2003) proposed an estimator for

*S*

_{ind}(

*n*+

*m*

^{*}),

*f*

_{0}, number of species present in the assemblage, but not observed in the reference sample.

Any estimator of *f*_{0} is a function of the frequencies (*f*_{1},*f*_{2},…,*f*_{n}). Thus, $S~ind(n+m*)$ can be expressed as a function of (*f*_{1},*f*_{2},…,*f*_{n}) and $\u2202S~/\u2202fi$, the partial derivative of $S~ind(n+m*)$ with respect to the variable *f*_{i}. Based on this expression, a standard asymptotic statistical method gives a variance estimator for $S~ind(n+m*)$,

*i*=

*j*and $co^v(fi,fj)=\u2212fifj/(Sobs+\u2009f^0)$ for

*i*≠

*j*. (For simplicity, we write $S~$ for $S~ind(n\u2009+\u2009m*)$ in the right-hand side of Equation 10.) See Shen et al. (2003, their Equation 11) for details. We postpone specification of $f^0$ for a later section.

Based on the estimator in Equation (9) for $S~ind(n+m*)$, Chao et al. (2009) showed that we can estimate the number of additional individuals $m~g*$ required, beyond the reference sample, to detect proportion *g* of the estimated assemblage richness *S*_{est} as

### The Poisson model

For the Poisson model, the objective is to estimate the expected number of species $S~area(A\u2009+\u2009a*)$ in an augmented area *A* + *a** (*a** > 0) (Fig. 1b). If we knew the true Poisson occurrence rates (*λ*_{1},*λ*_{2},…,*λ*_{S}) of each of the *S* species in the assemblage, we could compute, given *S*_{obs},

Working from species abundances *X*_{i} in the reference sample, Chao and Shen (2004) proposed an estimator for $S~area(A\u2009+\u2009a*)$,

We postpone specification of $f^0$, which estimates the species present in the assemblage but not observed in the reference sample, for a later section.

Chao and Shen (2004, their Equation 2.13) also proposed a variance estimator for $S~area(A+a*)$ (we write $S~$ for $S~area(A+a*)$ in the right-hand side of the following formula),

Given this estimator for *S*_{area}(*A* + *a*^{*}), it follows from Chao and Shen (2004) that an estimator $a~g*$ for the additional area required to detect proportion *g* of the estimated assemblage richness *S*_{est} is

### Estimating the number of species present in the assemblage but not observed in the reference sample for individual-based data

Several estimators in the previous two sections require either an estimate of *f*_{0}, the number of species present in the assemblage but not observed in the reference sample or an individual-based estimate for the full richness of the assemblage, *S*_{est}. Many estimators of the form $Sest=Sobs\u2009+\u2009f^0$ are available (Chao 2005). The simplest (Chao 1984), widely known as Chao1 (Gotelli and Colwell 2011), is $Sest Chao1=Sobs\u2009+\u2009f^0 Chao1$, where

For the individual-based empirical examples in this paper, we have used the Chao1 estimator, above, which Chao (1984) proved is a minimum estimator of asymptotic species richness.

For assemblages with many rare species, the abundance-based coverage estimator (ACE) (Chao and Lee 1992; Chao et al. 2000; Chazdon et al. 1998) is often a more appropriate estimator of asymptotic richness (Chao and Shen 2004), $Sest ACE=Sobs\u2009+\u2009f^0 ACE$. ACE takes into account the frequency counts for rare species *f*_{1},*f*_{2},…,*f*_{k},…,*f*_{R}, where *R* is a cutoff frequency between rare and common species in the reference sample. Thus, $Srare=\u2211i=1SI(0<Xi\u2264R)$ with summed abundance $Xrare=\u2211i=1SXiI(Xi\u2264R)$. These counts and an estimate of sample coverage, $C^ACE=1\u2212f1/Xrare$, are used to compute a squared coefficient of variation, $\gamma ^ACE2$, and the estimator $f^0 ACE$,

The expression for $\gamma ^ACE2$, above, is for the multinomial model. For the Poisson model, the summation in the denominator should be replaced by $(\u2211k=1Rkfk)2$. Chao and Shen (2004) recommended *R* = 10 as rule of thumb, with exploration of other values suggested for samples with large coefficients of variation.

## SAMPLE-BASED INTERPOLATION (RAREFACTION)

### The Bernoulli product model

For the Bernoulli product model (sample-based rarefaction), we need to estimate the expected number of species *S*_{sample}(*t*) in a random set of *t* sampling units from among the *T* sampling units defining the incidence reference sample (*t* < *T*) (Fig. 1c). If we knew the true detection probabilities *θ*_{1},*θ*_{2}…,*θ*_{S} of each of the *S* species in the assemblage, we could compute

Instead, we have only the incidence reference sample to work from, with observed species incidence frequencies *Y _{i}*. The MVUE for

*S*

_{sample}(

*t*) is

This analytic formula was first derived by Shinozaki (1963) and rediscovered multiple times (Chiarucci et al. 2008). Colwell et al. (2004, their Equation 5) provide a mathematically equivalent equation in terms of the incidence frequency counts *Q _{k}* similar to our Equation (4). This estimator has long been called ‘Mao Tau’ in the widely used software application ‘EstimateS’ (Colwell 2011). Colwell et al. (2004, their Equation 6) developed an estimator for the unconditional variance in terms of the frequency counts

*Q*, similar to our Equation (5), that requires an incidence-based estimator

_{k}*S*

_{est}for assembly richness

*S*. We postpone specification of

*S*

_{est}for a later section.

## SAMPLE-BASED EXTRAPOLATION

### The Bernoulli product model

For the Bernoulli product model, the extrapolation problem is to estimate the expected number of species *S*_{sample}(*T* + *t*^{*}) in an augmented set of *T + t** sampling units (*t** > 0) from the assemblage (Fig. 1c). If we knew the true detection probabilities *θ*_{1},*θ*_{2}…,*θ*_{S} of each of the *S* species in the assemblage, given *S*_{obs}, we could compute

Based on a derivation by Chao et al. (2009), we have the estimator

Expressing $S~sample(T+t*)$ as a function of (*Q*_{1},*Q*_{2},…,*Q*_{T}), and using an asymptotic method, we obtain an approximate variance formula

*i*=

*j*and $co^v(Qi,Qj)=\u2212QiQj/(Sobs\u2009+\u2009Q^0)$ for

*i*≠

*j*. (For simplicity, we write $S~$ for $S~sample(T\u2009+\u2009t*)$ in the above variance formula.)

Equations (18) and (19), above, both require an estimate of *Q*_{0}, the number of species present in the assemblage but not detected in any sampling units. We postpone specification of an estimator for *Q*_{0} for the next section.

Based on the estimator in Equation (18) for $S~sample(T\u2009+\u2009t*)$, the number of additional sampling units $t~g*$ required to detect proportion *g* of the estimated assemblage richness *S*_{est} is

### Estimating the number of species present in the assemblage but not observed in the reference sample for sample-based incidence data

The extrapolation estimators for the Bernoulli product model require either an estimate of *Q*_{0}, the number of species present in the assemblage but not observed in the sampling units comprising the incidence reference sample, or a sample-based estimate for the full richness of the assemblage, *S*_{est}. Many estimators of the form $Sest=Sobs\u2009+\u2009Q^0$ are available (Chao 2005). The simplest (Chao 1987), widely known as Chao2 (Gotelli and Colwell 2011), is $S^est Chao2=Sobs\u2009+\u2009Q^0 Chao2$, where

For the sample-based incidence example in this paper, we have used the Chao2 estimator, above, which Chao (1987) showed is a minimum estimator of asymptotic species richness.

For assemblages with many rare species, the incidence-based coverage estimator (ICE; Chazdon et al. 1998; Lee and Chao 1994) is often a more appropriate estimator of asymptotic species richness (Chao and Shen 2004),

ICE takes into account the frequency counts for rare species (*Q*_{1},*Q*_{2},…,*Q*_{k},…,*Q*_{R}), where *R* is a cutoff frequency between infrequent and frequent species in the reference sample. Thus, the number of species that occur in fewer than *R* sampling units is $Sinfreq=\u2211i=1SI(0<Yi\u2264R)$ with summed incidence frequencies $Yinfreq=\u2211i=1SYiI(Yi\u2264R)$. These counts, the number of sampling units that include at least one infrequent species (*T*_{infreq}), and an estimate of sample coverage, $C^ICE=1\u2212Q1/Yinfreq$, are used to compute a squared coefficient of variation, *γ*_{ICE}^{2}, and the estimator *Q*_{0 ICE}:

We recommend *R* = 10 as rule of thumb, with exploration of other values suggested for samples with large coefficients of variation.

## EXAMPLES

### Tropical beetles: individual-based rarefaction and extrapolation (multinomial model)

Janzen (1973a, 1973b) tabulated many data sets on tropical foliage insects from sweep samples in southwestern Costa Rica. We selected two beetle data sets (‘Osa primary’ and ‘Osa secondary’) to compare beetle species richness between old-growth forest and second-growth vegetation on the Osa Peninsula. The species frequency counts appear in Table 1.

(a) Osa second growth: S_{obs} = 140, n = 976 | |||||||||||||||||||||||||

i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 | 17 | 19 | 20 | 21 | 24 | 26 | 40 | 57 | 60 | 64 | 71 | 77 |

f _{i} | 70 | 17 | 4 | 5 | 5 | 5 | 5 | 3 | 1 | 2 | 3 | 2 | 2 | 1 | 2 | 3 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 |

(a) Osa second growth: S_{obs} = 140, n = 976 | |||||||||||||||||||||||||

i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 | 17 | 19 | 20 | 21 | 24 | 26 | 40 | 57 | 60 | 64 | 71 | 77 |

f _{i} | 70 | 17 | 4 | 5 | 5 | 5 | 5 | 3 | 1 | 2 | 3 | 2 | 2 | 1 | 2 | 3 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 |

(b) Osa old growth: S_{obs} = 112, n = 237 | ||||||||||

i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 14 | 42 |

f _{i} | 84 | 10 | 4 | 3 | 5 | 1 | 2 | 1 | 1 | 1 |

(b) Osa old growth: S_{obs} = 112, n = 237 | ||||||||||

i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 14 | 42 |

f _{i} | 84 | 10 | 4 | 3 | 5 | 1 | 2 | 1 | 1 | 1 |

Janzen’s study recorded 976 individuals representing 140 species in the Osa second-growth site and 237 individuals of 112 species in the Osa old-growth site. From the unstandardized raw data (the reference samples), one might conclude that the second-growth site has more beetle species than the old-growth site (140 vs. 112; Fig. 2c, solid points). However, the sample sizes (number of individual beetles) for the two samples are quite different (976 vs. 237 individuals, Fig. 2a and b). When the sample size in the second-growth site is rarefied down to 237 individuals to match the size of the old-growth sample (Fig. 2c, open point), using the multinomial model (Equation 4), the ordering of the two sites is reversed. The interpolated species richness for 237 individuals in the second-growth site is only 70, considerably less than primary site, with 112 species. Moreover, the 95% confidence intervals do not overlap (Fig. 2c).

Individual-based rarefaction of abundance data, like the interpolation analysis above, has been carried out in this way for decades. Here, we apply individual-based rarefaction and extrapolation to the same reference sample for the first time. Applying the multinomial model (Equation 9) to the Janzen dataset to increase the sample size (number of individuals) in each site yields the extrapolated curves (broken line curves) for each site is shown in Fig. 2. Even though the mathematical derivations for interpolation and extrapolation are fundamentally different, the interpolation and extrapolation curves join smoothly at the single data point of the reference sample.

In Table 2a, using the multinomial model (classical rarefaction), we show for the Osa old-growth data (*S*_{obs} = 112, *n* = 237 in the reference sample): (i) values for the interpolated estimate $S~ind(m)$, for values of *m* from 1 up to the reference sample size of 237 individuals (Equation 4), along with the unconditional standard error (SE, Equation 5) values that are used to construct the 95% confidence intervals shown in Fig. 2a and c; (ii) the extrapolated estimate $S~ind(n+m*)$ (Equation 9), where *m** ranges from 0 to 1 000 individuals, along with the unconditional SE (Equation 10); and (iii) the number of additional individuals $m~g*$ required to detect proportion *g* of the estimated assemblage richness (Equation 11), for *g* = 0.3 to 0.9, in increments of 0.1. In Fig. 2a, we plot the multinomial rarefaction curve and extrapolation curve up to a sample size of 1 200 individuals and show the predicted number of individuals need to reach for *g* = 0.6. The corresponding values and curves for the Osa second-growth data (*S*_{obs} = 140, *n* = 976 in the reference sample) are shown in Table 2b and Fig. 2b.

Rarefaction | Extrapolation | Individuals prediction | |||||

m | $S~ind(m)$ | SE | m* | $S~ind(n+m*)$ | SE | g | $m~g*$ |

(a) Osa old-growth site, S_{obs} = 112, n = 237. The extrapolation is extended to more than five times of the reference sample size, in order to compare with the Osa second-growth curve (b); see Table 2(b). | |||||||

1 | 1.00 | 0.00 | 0 | 112 | 9.22 | 0.3 | 80.60 |

20 | 15.89 | 1.95 | 100 | 145.74 | 12.20 | 0.4 | 234.04 |

40 | 28.44 | 3.00 | 200 | 176.25 | 15.38 | 0.5 | 415.52 |

60 | 39.44 | 3.85 | 400 | 228.80 | 22.58 | 0.6 | 637.64 |

80 | 49.40 | 4.57 | 600 | 271.77 | 30.84 | 0.7 | 924.00 |

100 | 58.62 | 5.22 | 800 | 306.93 | 39.79 | 0.8 | 1327.60 |

120 | 67.29 | 5.83 | 1 000 | 335.68 | 48.96 | 0.9 | 2017.56 |

140 | 75.54 | 6.42 | |||||

160 | 83.48 | 7.00 | |||||

180 | 91.16 | 7.57 | |||||

200 | 98.63 | 8.15 | |||||

220 | 105.92 | 8.73 | |||||

237 | 112.00 | 9.22 | |||||

(b) Osa second-growth site, S_{obs} = 140, n = 976. The extrapolation is extended to double the reference sample size. | |||||||

1 | 1.00 | 0.00 | 0 | 140.00 | 8.43 | 0.5 | 28.91 |

100 | 44.30 | 4.36 | 100 | 147.00 | 8.87 | 0.6 | 477.30 |

200 | 64.43 | 5.31 | 200 | 153.66 | 9.34 | 0.7 | 1055.37 |

300 | 78.83 | 5.85 | 400 | 166.02 | 10.34 | 0.8 | 1870.12 |

400 | 90.58 | 6.25 | 600 | 177.21 | 11.46 | 0.9 | 3262.94 |

500 | 100.83 | 6.60 | 800 | 187.34 | 12.68 | ||

600 | 110.11 | 6.95 | 1 000 | 196.51 | 13.99 | ||

700 | 118.72 | 7.32 | |||||

800 | 126.80 | 7.70 | |||||

900 | 134.45 | 8.11 | |||||

976 | 140.00 | 8.43 |

Rarefaction | Extrapolation | Individuals prediction | |||||

m | $S~ind(m)$ | SE | m* | $S~ind(n+m*)$ | SE | g | $m~g*$ |

(a) Osa old-growth site, S_{obs} = 112, n = 237. The extrapolation is extended to more than five times of the reference sample size, in order to compare with the Osa second-growth curve (b); see Table 2(b). | |||||||

1 | 1.00 | 0.00 | 0 | 112 | 9.22 | 0.3 | 80.60 |

20 | 15.89 | 1.95 | 100 | 145.74 | 12.20 | 0.4 | 234.04 |

40 | 28.44 | 3.00 | 200 | 176.25 | 15.38 | 0.5 | 415.52 |

60 | 39.44 | 3.85 | 400 | 228.80 | 22.58 | 0.6 | 637.64 |

80 | 49.40 | 4.57 | 600 | 271.77 | 30.84 | 0.7 | 924.00 |

100 | 58.62 | 5.22 | 800 | 306.93 | 39.79 | 0.8 | 1327.60 |

120 | 67.29 | 5.83 | 1 000 | 335.68 | 48.96 | 0.9 | 2017.56 |

140 | 75.54 | 6.42 | |||||

160 | 83.48 | 7.00 | |||||

180 | 91.16 | 7.57 | |||||

200 | 98.63 | 8.15 | |||||

220 | 105.92 | 8.73 | |||||

237 | 112.00 | 9.22 | |||||

(b) Osa second-growth site, S_{obs} = 140, n = 976. The extrapolation is extended to double the reference sample size. | |||||||

1 | 1.00 | 0.00 | 0 | 140.00 | 8.43 | 0.5 | 28.91 |

100 | 44.30 | 4.36 | 100 | 147.00 | 8.87 | 0.6 | 477.30 |

200 | 64.43 | 5.31 | 200 | 153.66 | 9.34 | 0.7 | 1055.37 |

300 | 78.83 | 5.85 | 400 | 166.02 | 10.34 | 0.8 | 1870.12 |

400 | 90.58 | 6.25 | 600 | 177.21 | 11.46 | 0.9 | 3262.94 |

500 | 100.83 | 6.60 | 800 | 187.34 | 12.68 | ||

600 | 110.11 | 6.95 | 1 000 | 196.51 | 13.99 | ||

700 | 118.72 | 7.32 | |||||

800 | 126.80 | 7.70 | |||||

900 | 134.45 | 8.11 | |||||

976 | 140.00 | 8.43 |

For both samples, the unconditional variance, and thus the 95% confidence interval, increased with sample size. For extrapolation, the SE values are relatively small up to a doubling of the reference sample, signifying quite accurate extrapolation in this range. For the Osa old-growth site (Table 2a; Fig. 2a), the extrapolation is extended to five times of the original sample size in order to compare with the Osa second-growth curve. This long-range extrapolation (>3× the original sample size) inevitably yields very wide confidence intervals. For the Osa second-growth site (Table 2b; Fig. 2b), the extrapolation is extended only to double the reference sample size (not fully shown in Fig. 2b) yielding a quite accurate extrapolated estimate with a narrow confidence interval.

Based on Fig. 2, even though the Osa old-growth site extrapolation for large sample sizes exhibits high variance, the old-growth and second-growth confidence intervals do not overlap for any sample size considered. This implies that beetle species richness for any sample size is significantly greater in the old-growth site than that in the second-growth site for sample size up to at least 1 200 individuals.

### Tropical beetles: individual-based rarefaction and extrapolation (Poisson model)

In addition to applying estimators based on the multinomial model, we also analysed the Janzen beetle dataset with estimators based on the Poisson model, including Coleman area-based rarefaction (Equations 6 and 7), area-based extrapolation (Equations 12 and 13), and estimation of the additional area required to detect proportion *g* of the estimated assemblage richness *S*_{est} (Equation 14). The results for the Osa old-growth beetle sample appear in Table 3a and the results for the Osa second-growth beetle sample in Table 3b. Comparison of the results for the Poisson model estimators (Table 3) with the corresponding results for the multinomial model estimators (Table 2) reveals a remarkable similarity that makes sense mathematically because the distribution for the Poisson model (Equation 2), conditional on the total number of individuals, is just the multinomial model (Equation 1). Moreover, the similarity applies not only to rarefaction (as previously noted by Brewer and Williamson 1994) but also to extrapolation. Figure 3 shows just how close the results based on the two models are for this example. For interpolation and extrapolation, the difference is always less than one-tenth of one individual (assuming for the Poisson model that individuals are randomly and independently distributed in space, so that *a*/*A*≈*m*/*n*). This means that rounding to the nearest individual consistently yields precisely the same values under both models. For this reason, we do not plot the results from the Poisson model because the figure would be identical to Fig. 2, with the Poisson variables in Fig. 1b substituted for the multinomial variables in Fig. 1a. In addition, estimates of the additional area required to detect proportion *g* of the estimated assemblage richness under the Poisson model (Table 3, from Equation 14) are identical to the estimates of the additional number of individuals required to reach proportion *g* of the estimated assemblage richness under the multinomial model (Table 2, from Equation 11).

Rarefaction | Extrapolation | Area prediction | |||||

a | $S~area(a)$ | SE | a* | $S~area(A+a*)$ | SE | g | $a~g*$ |

(a) Osa old-growth site, S_{obs} = 112, A = 237. The extrapolation is extended to more than five times of the reference sample size, in order to compare with the Osa second-growth curve (b). | |||||||

1 | 0.98 | 0.00 | 0 | 112.00 | 9.22 | 0.3 | 80.60 |

20 | 15.83 | 1.93 | 100 | 145.72 | 12.20 | 0.4 | 234.04 |

40 | 28.37 | 2.99 | 200 | 176.22 | 15.38 | 0.5 | 415.52 |

60 | 39.38 | 3.84 | 400 | 228.75 | 22.58 | 0.6 | 637.64 |

80 | 49.35 | 4.56 | 600 | 271.72 | 30.84 | 0.7 | 924.00 |

100 | 58.58 | 5.21 | 800 | 306.86 | 39.78 | 0.8 | 1327.60 |

120 | 67.26 | 5.82 | 1 000 | 335.61 | 48.96 | 0.9 | 2017.56 |

140 | 75.52 | 6.41 | |||||

160 | 83.46 | 6.99 | |||||

180 | 91.15 | 7.57 | |||||

200 | 98.62 | 8.15 | |||||

220 | 105.92 | 8.73 | |||||

237 | 112.00 | 9.22 | |||||

(b) Osa second-growth site, S_{obs} = 140, A = 976. The extrapolation is extended to double the reference sample size. | |||||||

1 | 0.98 | 0.17 | 0 | 140.00 | 8.43 | 0.5 | 28.91 |

100 | 44.23 | 4.35 | 100 | 147.00 | 8.87 | 0.6 | 477.30 |

200 | 64.38 | 5.31 | 200 | 153.65 | 9.34 | 0.7 | 1055.37 |

300 | 78.81 | 5.85 | 400 | 166.01 | 10.34 | 0.8 | 1870.12 |

400 | 90.57 | 6.24 | 600 | 177.20 | 11.46 | 0.9 | 3262.94 |

500 | 100.82 | 6.60 | 800 | 187.33 | 12.68 | ||

600 | 110.10 | 6.95 | 1 000 | 196.50 | 13.99 | ||

700 | 118.71 | 7.32 | |||||

800 | 126.79 | 7.70 | |||||

900 | 134.44 | 8.11 | |||||

976 | 140.00 | 8.43 |

Rarefaction | Extrapolation | Area prediction | |||||

a | $S~area(a)$ | SE | a* | $S~area(A+a*)$ | SE | g | $a~g*$ |

(a) Osa old-growth site, S_{obs} = 112, A = 237. The extrapolation is extended to more than five times of the reference sample size, in order to compare with the Osa second-growth curve (b). | |||||||

1 | 0.98 | 0.00 | 0 | 112.00 | 9.22 | 0.3 | 80.60 |

20 | 15.83 | 1.93 | 100 | 145.72 | 12.20 | 0.4 | 234.04 |

40 | 28.37 | 2.99 | 200 | 176.22 | 15.38 | 0.5 | 415.52 |

60 | 39.38 | 3.84 | 400 | 228.75 | 22.58 | 0.6 | 637.64 |

80 | 49.35 | 4.56 | 600 | 271.72 | 30.84 | 0.7 | 924.00 |

100 | 58.58 | 5.21 | 800 | 306.86 | 39.78 | 0.8 | 1327.60 |

120 | 67.26 | 5.82 | 1 000 | 335.61 | 48.96 | 0.9 | 2017.56 |

140 | 75.52 | 6.41 | |||||

160 | 83.46 | 6.99 | |||||

180 | 91.15 | 7.57 | |||||

200 | 98.62 | 8.15 | |||||

220 | 105.92 | 8.73 | |||||

237 | 112.00 | 9.22 | |||||

(b) Osa second-growth site, S_{obs} = 140, A = 976. The extrapolation is extended to double the reference sample size. | |||||||

1 | 0.98 | 0.17 | 0 | 140.00 | 8.43 | 0.5 | 28.91 |

100 | 44.23 | 4.35 | 100 | 147.00 | 8.87 | 0.6 | 477.30 |

200 | 64.38 | 5.31 | 200 | 153.65 | 9.34 | 0.7 | 1055.37 |

300 | 78.81 | 5.85 | 400 | 166.01 | 10.34 | 0.8 | 1870.12 |

400 | 90.57 | 6.24 | 600 | 177.20 | 11.46 | 0.9 | 3262.94 |

500 | 100.82 | 6.60 | 800 | 187.33 | 12.68 | ||

600 | 110.10 | 6.95 | 1 000 | 196.50 | 13.99 | ||

700 | 118.71 | 7.32 | |||||

800 | 126.79 | 7.70 | |||||

900 | 134.44 | 8.11 | |||||

976 | 140.00 | 8.43 |

### Tropical trees: individual-based rarefaction and extrapolation (multinomial model)

Norden et al. (2009) compared species composition of trees, saplings and seedlings in six 1-ha forest plots spanning three successional stages in lowland forests of northeastern Costa Rica. We selected data for tree stems ≥5 cm diameter at breast height in three samples from this dataset, all located within La Selva Biological Station. One of the samples represents an old-growth plot (Lindero El Peje [LEP] old growth, *S*_{obs} = 152, *n* = 943) and two were from second-growth forest plots, one of them 29 years old (LEP second growth, *S*_{obs} = 104, *n* = 1 263) and the other 21 years old in 2006 (Lindero Sur, *S*_{obs} = 76, *n* = 1 020), following pasture abandonment. The species frequency counts for the three plots appear in Table 4.

(a) LEP old growth, S_{obs} = 152, n = 943 | ||||||||||||||||||||||||

i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 13 | 15 | 16 | 18 | 19 | 20 | 25 | 38 | 39 | 40 | 46 | 52 | 55 |

f _{i} | 46 | 30 | 16 | 12 | 6 | 5 | 3 | 4 | 5 | 4 | 1 | 3 | 1 | 1 | 1 | 1 | 4 | 3 | 1 | 1 | 1 | 1 | 1 | 1 |

(a) LEP old growth, S_{obs} = 152, n = 943 | ||||||||||||||||||||||||

i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 13 | 15 | 16 | 18 | 19 | 20 | 25 | 38 | 39 | 40 | 46 | 52 | 55 |

f _{i} | 46 | 30 | 16 | 12 | 6 | 5 | 3 | 4 | 5 | 4 | 1 | 3 | 1 | 1 | 1 | 1 | 4 | 3 | 1 | 1 | 1 | 1 | 1 | 1 |

(b) LEP older (29 years) second growth, S_{obs} = 104, n = 1 263 | |||||||||||||||||||||||||||

i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 20 | 22 | 39 | 45 | 57 | 72 | 88 | 132 | 133 | 178 |

f _{i} | 33 | 15 | 13 | 4 | 5 | 3 | 3 | 1 | 2 | 1 | 4 | 2 | 2 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 |

(b) LEP older (29 years) second growth, S_{obs} = 104, n = 1 263 | |||||||||||||||||||||||||||

i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 20 | 22 | 39 | 45 | 57 | 72 | 88 | 132 | 133 | 178 |

f _{i} | 33 | 15 | 13 | 4 | 5 | 3 | 3 | 1 | 2 | 1 | 4 | 2 | 2 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 |

(c) Lindero Sur younger (21 years) second growth, S_{obs} = 76, n = 1 020 | ||||||||||||||||||||||

i | 1 | 2 | 3 | 4 | 5 | 7 | 8 | 10 | 11 | 12 | 13 | 15 | 31 | 33 | 34 | 35 | 66 | 72 | 78 | 127 | 131 | 174 |

f _{i} | 29 | 13 | 5 | 2 | 3 | 4 | 1 | 2 | 2 | 1 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

(c) Lindero Sur younger (21 years) second growth, S_{obs} = 76, n = 1 020 | ||||||||||||||||||||||

i | 1 | 2 | 3 | 4 | 5 | 7 | 8 | 10 | 11 | 12 | 13 | 15 | 31 | 33 | 34 | 35 | 66 | 72 | 78 | 127 | 131 | 174 |

f _{i} | 29 | 13 | 5 | 2 | 3 | 4 | 1 | 2 | 2 | 1 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

The results for interpolation and extrapolation from these three reference samples, under the multinomial model, appear in Table 5 and Fig. 4a. For each of the three samples, Table 5 shows: (i) species richness values for the interpolated estimate $S~ind(m)$, under the multinomial model (classical rarefaction, Equation 4), for values of *m* from 1 up to the reference sample size for each sample (*n* = 943, 1 263 or 1 020 individuals), along with the unconditional SE (Equation 5) values that are used to construct the 95% confidence intervals shown in Fig. 4a, and (ii) the extrapolated estimate $S~ind(n+m*)$, where *m** ranges from 0 to 1 500, 1 200 or 1 400 individuals (for the three samples), so that all samples are extrapolated to roughly 2 400 individuals, along with the unconditional SE (Equation 10).

Rarefaction | Extrapolation | ||||

m | $S~ind(m)$ | SE | m* | $S~ind(n+m*)$ | SE |

(a) LEP old growth, S_{obs} = 152, n = 943 | |||||

1 | 1.00 | 0.00 | 0 | 152.00 | 5.35 |

20 | 16.72 | 1.82 | 100 | 156.56 | 5.55 |

40 | 28.88 | 2.72 | 200 | 160.53 | 5.79 |

60 | 38.51 | 3.24 | 300 | 163.98 | 6.08 |

80 | 46.57 | 3.59 | 400 | 166.99 | 6.42 |

100 | 53.54 | 3.84 | 500 | 169.61 | 6.79 |

200 | 79.28 | 4.46 | 600 | 171.90 | 7.18 |

300 | 96.99 | 4.69 | 700 | 173.88 | 7.59 |

400 | 110.49 | 4.81 | 800 | 175.61 | 8.00 |

500 | 121.32 | 4.88 | 900 | 177.12 | 8.40 |

600 | 130.28 | 4.96 | 1 000 | 178.43 | 8.80 |

700 | 137.83 | 5.04 | 1 100 | 179.57 | 9.18 |

800 | 144.28 | 5.15 | 1 200 | 180.57 | 9.55 |

900 | 149.84 | 5.28 | 1 300 | 181.43 | 9.89 |

943 | 152.00 | 5.35 | 1 400 | 182.19 | 10.22 |

1 500 | 182.84 | 10.52 | |||

(b) LEP older (29 years) second growth, S_{obs} = 104, n = 1 263 | |||||

1 | 1.00 | 0.00 | 0 | 104.00 | 5.19 |

20 | 12.96 | 2.17 | 100 | 106.52 | 5.33 |

40 | 20.14 | 2.72 | 200 | 108.87 | 5.49 |

60 | 25.62 | 3.02 | 300 | 111.05 | 5.66 |

80 | 30.25 | 3.24 | 400 | 113.08 | 5.86 |

100 | 34.30 | 3.43 | 500 | 114.97 | 6.08 |

200 | 49.47 | 3.99 | 600 | 116.73 | 6.31 |

300 | 59.92 | 4.25 | 700 | 118.37 | 6.56 |

400 | 67.94 | 4.40 | 800 | 119.89 | 6.83 |

500 | 74.50 | 4.50 | 900 | 121.31 | 7.11 |

600 | 80.06 | 4.58 | 1 000 | 122.63 | 7.39 |

700 | 84.88 | 4.65 | 1 100 | 123.86 | 7.69 |

800 | 89.14 | 4.73 | 1 200 | 125.00 | 7.99 |

900 | 92.93 | 4.81 | |||

1 000 | 96.35 | 4.90 | |||

1 100 | 99.46 | 5.00 | |||

1 200 | 102.32 | 5.11 | |||

1 263 | 104.00 | 5.19 | |||

(c) Lindero Sur younger (21 years) second growth, S_{obs} = 76, n = 1 020 | |||||

1 | 1.00 | 0.00 | 0 | 76.00 | 4.76 |

20 | 11.51 | 2.19 | 100 | 78.72 | 4.95 |

40 | 17.08 | 2.68 | 200 | 81.22 | 5.16 |

60 | 21.16 | 2.94 | 300 | 83.50 | 5.40 |

80 | 24.52 | 3.12 | 400 | 85.59 | 5.66 |

100 | 27.41 | 3.26 | 500 | 87.51 | 5.95 |

200 | 38.15 | 3.65 | 600 | 89.26 | 6.25 |

300 | 45.72 | 3.83 | 700 | 90.87 | 6.58 |

400 | 51.77 | 3.95 | 800 | 92.34 | 6.91 |

500 | 56.90 | 4.05 | 900 | 93.69 | 7.26 |

600 | 61.41 | 4.16 | 1 000 | 94.92 | 7.61 |

700 | 65.44 | 4.28 | 1 100 | 96.05 | 7.97 |

800 | 69.09 | 4.42 | 1 200 | 97.09 | 8.33 |

900 | 72.40 | 4.56 | 1 300 | 98.03 | 8.68 |

1 000 | 75.43 | 4.73 | 1 400 | 98.90 | 9.03 |

1 020 | 76.00 | 4.76 |

Rarefaction | Extrapolation | ||||

m | $S~ind(m)$ | SE | m* | $S~ind(n+m*)$ | SE |

(a) LEP old growth, S_{obs} = 152, n = 943 | |||||

1 | 1.00 | 0.00 | 0 | 152.00 | 5.35 |

20 | 16.72 | 1.82 | 100 | 156.56 | 5.55 |

40 | 28.88 | 2.72 | 200 | 160.53 | 5.79 |

60 | 38.51 | 3.24 | 300 | 163.98 | 6.08 |

80 | 46.57 | 3.59 | 400 | 166.99 | 6.42 |

100 | 53.54 | 3.84 | 500 | 169.61 | 6.79 |

200 | 79.28 | 4.46 | 600 | 171.90 | 7.18 |

300 | 96.99 | 4.69 | 700 | 173.88 | 7.59 |

400 | 110.49 | 4.81 | 800 | 175.61 | 8.00 |

500 | 121.32 | 4.88 | 900 | 177.12 | 8.40 |

600 | 130.28 | 4.96 | 1 000 | 178.43 | 8.80 |

700 | 137.83 | 5.04 | 1 100 | 179.57 | 9.18 |

800 | 144.28 | 5.15 | 1 200 | 180.57 | 9.55 |

900 | 149.84 | 5.28 | 1 300 | 181.43 | 9.89 |

943 | 152.00 | 5.35 | 1 400 | 182.19 | 10.22 |

1 500 | 182.84 | 10.52 | |||

(b) LEP older (29 years) second growth, S_{obs} = 104, n = 1 263 | |||||

1 | 1.00 | 0.00 | 0 | 104.00 | 5.19 |

20 | 12.96 | 2.17 | 100 | 106.52 | 5.33 |

40 | 20.14 | 2.72 | 200 | 108.87 | 5.49 |

60 | 25.62 | 3.02 | 300 | 111.05 | 5.66 |

80 | 30.25 | 3.24 | 400 | 113.08 | 5.86 |

100 | 34.30 | 3.43 | 500 | 114.97 | 6.08 |

200 | 49.47 | 3.99 | 600 | 116.73 | 6.31 |

300 | 59.92 | 4.25 | 700 | 118.37 | 6.56 |

400 | 67.94 | 4.40 | 800 | 119.89 | 6.83 |

500 | 74.50 | 4.50 | 900 | 121.31 | 7.11 |

600 | 80.06 | 4.58 | 1 000 | 122.63 | 7.39 |

700 | 84.88 | 4.65 | 1 100 | 123.86 | 7.69 |

800 | 89.14 | 4.73 | 1 200 | 125.00 | 7.99 |

900 | 92.93 | 4.81 | |||

1 000 | 96.35 | 4.90 | |||

1 100 | 99.46 | 5.00 | |||

1 200 | 102.32 | 5.11 | |||

1 263 | 104.00 | 5.19 | |||

(c) Lindero Sur younger (21 years) second growth, S_{obs} = 76, n = 1 020 | |||||

1 | 1.00 | 0.00 | 0 | 76.00 | 4.76 |

20 | 11.51 | 2.19 | 100 | 78.72 | 4.95 |

40 | 17.08 | 2.68 | 200 | 81.22 | 5.16 |

60 | 21.16 | 2.94 | 300 | 83.50 | 5.40 |

80 | 24.52 | 3.12 | 400 | 85.59 | 5.66 |

100 | 27.41 | 3.26 | 500 | 87.51 | 5.95 |

200 | 38.15 | 3.65 | 600 | 89.26 | 6.25 |

300 | 45.72 | 3.83 | 700 | 90.87 | 6.58 |

400 | 51.77 | 3.95 | 800 | 92.34 | 6.91 |

500 | 56.90 | 4.05 | 900 | 93.69 | 7.26 |

600 | 61.41 | 4.16 | 1 000 | 94.92 | 7.61 |

700 | 65.44 | 4.28 | 1 100 | 96.05 | 7.97 |

800 | 69.09 | 4.42 | 1 200 | 97.09 | 8.33 |

900 | 72.40 | 4.56 | 1 300 | 98.03 | 8.68 |

1 000 | 75.43 | 4.73 | 1 400 | 98.90 | 9.03 |

1 020 | 76.00 | 4.76 |

In Fig. 4a, we plot the multinomial rarefaction curves and extrapolation curves up to a sample size of 2 400 individuals. Clearly the number of species at any plotted sample size (beyond very small samples) is significantly greater for LEP old growth than in either of the two samples from second-growth forest. The number of species in the plot of intermediate age, LEP second growth, significantly exceeds the number of species in the youngest plot, Lindero Sur, for sample sizes between 500 and 1 600 individuals, based conservatively on non-overlapping confidence intervals. Due to the prevalence of rare species in old-growth tropical forests and widespread dispersal limitation of large-seeded animal-dispersed species, tree species richness is slow to recover during secondary succession and may require many decades to reach old-growth levels, even under conditions favorable to regeneration.

### Tropical ants: sample-based rarefaction and extrapolation for incidence data (Bernoulli product model)

Longino and Colwell (2011) sampled ants at several elevations on the Barva Transect, a 30-km continuous gradient of wet forest on Costa Rica's Atlantic slope. For this example, we use results from five sites, at 50-, 500-, 1 070-, 1 500- and 2 000-m elevation, to illustrate sample-based rarefaction and extrapolation. The sampling unit consisted of all worker ants extracted from a 1-m^{2} forest floor plot, applying a method called ‘mini-Winkler extraction’. Because ants are colonial and the colony is the unit of reproduction, scoring each sampling unit for presence or absence of each species makes more sense than using abundance data (Gotelli et al. 2011). A sample-by-species incidence matrix was therefore produced for each of the five sites. The incidence frequency counts for the five sites appear in Table 6.

(a) Elevation 50 m, S_{obs} = 227, T = 599 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 25 | 27 | 29 | 30 | 31 | 33 | 39 | 40 | 43 | 46 | 47 | 48 | 51 | 52 | 56 | 58 | 61 | 65 | 69 | 72 | 77 | 79 | 82 | 83 | 84 | 86 | 91 | 95 | 97 | 98 | 106 | 113 | 124 | 126 | 127 | 128 | 129 | 182 | 183 | 186 | 195 | 222 | 236 | 263 | 330 |

Q _{i} | 49 | 23 | 18 | 14 | 9 | 10 | 4 | 8 | 6 | 2 | 1 | 2 | 2 | 5 | 2 | 4 | 3 | 2 | 2 | 3 | 1 | 1 | 2 | 1 | 2 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

(a) Elevation 50 m, S_{obs} = 227, T = 599 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 25 | 27 | 29 | 30 | 31 | 33 | 39 | 40 | 43 | 46 | 47 | 48 | 51 | 52 | 56 | 58 | 61 | 65 | 69 | 72 | 77 | 79 | 82 | 83 | 84 | 86 | 91 | 95 | 97 | 98 | 106 | 113 | 124 | 126 | 127 | 128 | 129 | 182 | 183 | 186 | 195 | 222 | 236 | 263 | 330 |

Q _{i} | 49 | 23 | 18 | 14 | 9 | 10 | 4 | 8 | 6 | 2 | 1 | 2 | 2 | 5 | 2 | 4 | 3 | 2 | 2 | 3 | 1 | 1 | 2 | 1 | 2 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

(b) Elevation 500 m, S_{obs} = 241, T = 230 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||

i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 23 | 24 | 25 | 26 | 27 | 30 | 31 | 32 | 33 | 34 | 36 | 37 | 38 | 39 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 49 | 52 | 53 | 54 | 56 | 60 | 65 | 73 | 78 | 123 | 131 | 133 |

Q _{i} | 71 | 34 | 12 | 14 | 9 | 11 | 8 | 4 | 7 | 5 | 2 | 3 | 4 | 2 | 1 | 2 | 4 | 1 | 1 | 1 | 2 | 1 | 1 | 3 | 1 | 1 | 1 | 2 | 2 | 1 | 1 | 1 | 1 | 4 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 |

(b) Elevation 500 m, S_{obs} = 241, T = 230 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||

i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 23 | 24 | 25 | 26 | 27 | 30 | 31 | 32 | 33 | 34 | 36 | 37 | 38 | 39 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 49 | 52 | 53 | 54 | 56 | 60 | 65 | 73 | 78 | 123 | 131 | 133 |

Q _{i} | 71 | 34 | 12 | 14 | 9 | 11 | 8 | 4 | 7 | 5 | 2 | 3 | 4 | 2 | 1 | 2 | 4 | 1 | 1 | 1 | 2 | 1 | 1 | 3 | 1 | 1 | 1 | 2 | 2 | 1 | 1 | 1 | 1 | 4 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 |

(c) Elevation 1 070 m, S_{obs} = 122, T = 150 | |||||||||||||||||||||||||||||||||||||||||

i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 18 | 19 | 21 | 22 | 23 | 24 | 25 | 26 | 30 | 31 | 32 | 34 | 36 | 38 | 39 | 43 | 45 | 46 | 54 | 60 | 68 | 74 | 80 | 96 | 99 |

Q _{i} | 28 | 16 | 13 | 3 | 1 | 3 | 6 | 1 | 1 | 1 | 4 | 3 | 4 | 1 | 1 | 4 | 1 | 2 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

(c) Elevation 1 070 m, S_{obs} = 122, T = 150 | |||||||||||||||||||||||||||||||||||||||||

i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 18 | 19 | 21 | 22 | 23 | 24 | 25 | 26 | 30 | 31 | 32 | 34 | 36 | 38 | 39 | 43 | 45 | 46 | 54 | 60 | 68 | 74 | 80 | 96 | 99 |

Q _{i} | 28 | 16 | 13 | 3 | 1 | 3 | 6 | 1 | 1 | 1 | 4 | 3 | 4 | 1 | 1 | 4 | 1 | 2 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

(d) Elevation 1 500 m, S_{obs} = 56, T = 200 | |||||||||||||||||||||||||||

i | 1 | 2 | 3 | 4 | 5 | 6 | 9 | 11 | 17 | 18 | 19 | 23 | 24 | 25 | 29 | 30 | 32 | 33 | 43 | 50 | 53 | 73 | 74 | 76 | 79 | 113 | 144 |

Q _{i} | 13 | 4 | 2 | 2 | 4 | 2 | 4 | 2 | 2 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

(d) Elevation 1 500 m, S_{obs} = 56, T = 200 | |||||||||||||||||||||||||||

i | 1 | 2 | 3 | 4 | 5 | 6 | 9 | 11 | 17 | 18 | 19 | 23 | 24 | 25 | 29 | 30 | 32 | 33 | 43 | 50 | 53 | 73 | 74 | 76 | 79 | 113 | 144 |

Q _{i} | 13 | 4 | 2 | 2 | 4 | 2 | 4 | 2 | 2 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

(e) Elevation = 2 000 m, S_{obs} = 14, T = 200 | ||||||||||||

i | 1 | 2 | 3 | 4 | 8 | 13 | 15 | 19 | 23 | 34 | 59 | 80 |

Q _{i} | 1 | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

(e) Elevation = 2 000 m, S_{obs} = 14, T = 200 | ||||||||||||

i | 1 | 2 | 3 | 4 | 8 | 13 | 15 | 19 | 23 | 34 | 59 | 80 |

Q _{i} | 1 | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

The results for sample-based interpolation and extrapolation from these five sites (at five elevations), under the Bernoulli product model, appear in Table 7 and Fig. 4b. For each of the five samples, Table 7 shows: (i) values for the interpolated estimate $S~sample(t)$, under the Bernoulli product model (Equation 17), for values of *t* from 1 up to the reference sample size *T* for each elevation (*T* = 599, 230, 150, 200, 200 sampling units), along with the unconditional SE values (Colwell et al. 2004, their Equation 6) that are used to construct the 95% confidence intervals shown in Fig. 4b; and (ii) the extrapolated estimate $S~sample(T+t*)$, where *t** ranges from 401 to 800 sampling units, to extrapolate all elevations to 1 000 sampling units (Equation 18), along with the unconditional SE (Equation 19).

Rarefaction | Extrapolation | Sampling units prediction | |||||

t | $S~sample(t)$ | SE | t* | $S~sample(T+t*)$ | SE | g | $t~g*$ |

(a) Elevation 50 m, S_{obs} = 227, T = 599 | |||||||

1 | 9.98 | 1.27 | 0 | 227.00 | 6.51 | 0.82 | 23.29 |

50 | 109.64 | 6.17 | 100 | 234.57 | 6.81 | 0.86 | 183.50 |

100 | 140.09 | 6.39 | 200 | 241.03 | 7.24 | 0.90 | 398.00 |

150 | 159.30 | 6.41 | 300 | 246.56 | 7.79 | 0.94 | 723.65 |

200 | 173.30 | 6.37 | 400 | 251.29 | 8.43 | 0.98 | 1424.02 |

250 | 184.27 | 6.33 | 401 | 251.33 | 8.44 | ||

300 | 193.23 | 6.30 | |||||

350 | 200.79 | 6.28 | |||||

400 | 207.32 | 6.27 | |||||

450 | 213.06 | 6.29 | |||||

500 | 218.19 | 6.34 | |||||

550 | 222.83 | 6.41 | |||||

599 | 227.00 | 6.51 | |||||

(b) Elevation 500 m, S_{obs} = 241, T = 230 | |||||||

1 | 12.80 | 1.42 | 0 | 241.00 | 7.52 | 0.82 | 63.33 |

20 | 98.96 | 6.03 | 100 | 266.19 | 8.93 | 0.86 | 123.55 |

40 | 132.85 | 6.57 | 200 | 282.78 | 11.08 | 0.90 | 204.17 |

60 | 155.08 | 6.76 | 300 | 293.71 | 13.45 | 0.94 | 326.56 |

80 | 171.85 | 6.84 | 400 | 300.91 | 15.64 | 0.98 | 589.79 |

100 | 185.42 | 6.88 | 500 | 305.65 | 17.49 | ||

120 | 196.90 | 6.92 | 600 | 308.78 | 18.97 | ||

140 | 206.90 | 6.97 | 700 | 310.84 | 20.11 | ||

160 | 215.80 | 7.04 | 770 | 311.84 | 20.75 | ||

180 | 223.83 | 7.14 | |||||

200 | 231.15 | 7.27 | |||||

230 | 241.00 | 7.52 | |||||

(c) Elevation 1 070 m, S_{obs} = 122, T = 150 | |||||||

1 | 11.53 | 1.51 | 0 | 122.00 | 4.50 | 0.84 | 5.06 |

20 | 68.06 | 4.59 | 100 | 135.00 | 5.95 | 0.86 | 22.53 |

40 | 85.18 | 4.57 | 200 | 141.06 | 8.00 | 0.88 | 42.71 |

60 | 95.91 | 4.44 | 300 | 143.88 | 9.63 | 0.90 | 66.57 |

80 | 103.97 | 4.36 | 400 | 145.19 | 10.69 | 0.92 | 95.77 |

100 | 110.41 | 4.34 | 500 | 145.80 | 11.32 | 0.94 | 133.42 |

120 | 115.68 | 4.37 | 600 | 146.09 | 11.67 | 0.96 | 186.49 |

140 | 120.07 | 4.44 | 700 | 146.22 | 11.87 | 0.98 | 277.20 |

150 | 122.00 | 4.50 | 800 | 146.28 | 11.97 | ||

850 | 146.30 | 12.00 | |||||

(d) Elevation 1 500 m, S_{obs} = 56, T = 200 | |||||||

1 | 5.85 | 1.17 | 0 | 56.00 | 3.91 | 0.74 | 15.70 |

20 | 31.68 | 3.49 | 100 | 61.58 | 4.61 | 0.78 | 69.80 |

40 | 38.64 | 3.65 | 200 | 65.68 | 5.74 | 0.82 | 134.79 |

60 | 42.71 | 3.68 | 300 | 68.70 | 7.10 | 0.86 | 216.19 |

80 | 45.65 | 3.69 | 400 | 70.91 | 8.50 | 0.90 | 325.16 |

100 | 47.98 | 3.69 | 500 | 72.53 | 9.82 | 0.94 | 490.60 |

120 | 49.94 | 3.70 | 600 | 73.72 | 11.00 | 0.98 | 846.42 |

140 | 51.67 | 3.73 | 700 | 74.60 | 12.01 | ||

160 | 53.22 | 3.77 | 800 | 75.24 | 12.87 | ||

180 | 54.66 | 3.83 | |||||

200 | 56.00 | 3.91 | |||||

(e) Elevation 2 000 m, S_{obs} = 14, T = 200 | |||||||

1 | 1.36 | 0.43 | 0 | 14.00 | 0.49 | 0.99 | 28.00 |

20 | 8.60 | 1.25 | 100 | 14.21 | 0.63 | ||

40 | 10.59 | 1.12 | 200 | 14.24 | 0.70 | ||

60 | 11.62 | 0.95 | 300 | 14.25 | 0.72 | ||

80 | 12.31 | 0.82 | 400 | 14.25 | 0.73 | ||

100 | 12.81 | 0.70 | 500 | 14.25 | 0.73 | ||

120 | 13.19 | 0.62 | 600 | 14.25 | 0.73 | ||

140 | 13.49 | 0.56 | 700 | 14.25 | 0.73 | ||

160 | 13.71 | 0.52 | 800 | 14.25 | 0.73 | ||

180 | 13.88 | 0.50 | |||||

200 | 14.00 | 0.49 |

Rarefaction | Extrapolation | Sampling units prediction | |||||

t | $S~sample(t)$ | SE | t* | $S~sample(T+t*)$ | SE | g | $t~g*$ |

(a) Elevation 50 m, S_{obs} = 227, T = 599 | |||||||

1 | 9.98 | 1.27 | 0 | 227.00 | 6.51 | 0.82 | 23.29 |

50 | 109.64 | 6.17 | 100 | 234.57 | 6.81 | 0.86 | 183.50 |

100 | 140.09 | 6.39 | 200 | 241.03 | 7.24 | 0.90 | 398.00 |

150 | 159.30 | 6.41 | 300 | 246.56 | 7.79 | 0.94 | 723.65 |

200 | 173.30 | 6.37 | 400 | 251.29 | 8.43 | 0.98 | 1424.02 |

250 | 184.27 | 6.33 | 401 | 251.33 | 8.44 | ||

300 | 193.23 | 6.30 | |||||

350 | 200.79 | 6.28 | |||||

400 | 207.32 | 6.27 | |||||

450 | 213.06 | 6.29 | |||||

500 | 218.19 | 6.34 | |||||

550 | 222.83 | 6.41 | |||||

599 | 227.00 | 6.51 | |||||

(b) Elevation 500 m, S_{obs} = 241, T = 230 | |||||||

1 | 12.80 | 1.42 | 0 | 241.00 | 7.52 | 0.82 | 63.33 |

20 | 98.96 | 6.03 | 100 | 266.19 | 8.93 | 0.86 | 123.55 |

40 | 132.85 | 6.57 | 200 | 282.78 | 11.08 | 0.90 | 204.17 |

60 | 155.08 | 6.76 | 300 | 293.71 | 13.45 | 0.94 | 326.56 |

80 | 171.85 | 6.84 | 400 | 300.91 | 15.64 | 0.98 | 589.79 |

100 | 185.42 | 6.88 | 500 | 305.65 | 17.49 | ||

120 | 196.90 | 6.92 | 600 | 308.78 | 18.97 | ||

140 | 206.90 | 6.97 | 700 | 310.84 | 20.11 | ||

160 | 215.80 | 7.04 | 770 | 311.84 | 20.75 | ||

180 | 223.83 | 7.14 | |||||

200 | 231.15 | 7.27 | |||||

230 | 241.00 | 7.52 | |||||

(c) Elevation 1 070 m, S_{obs} = 122, T = 150 | |||||||

1 | 11.53 | 1.51 | 0 | 122.00 | 4.50 | 0.84 | 5.06 |

20 | 68.06 | 4.59 | 100 | 135.00 | 5.95 | 0.86 | 22.53 |

40 | 85.18 | 4.57 | 200 | 141.06 | 8.00 | 0.88 | 42.71 |

60 | 95.91 | 4.44 | 300 | 143.88 | 9.63 | 0.90 | 66.57 |

80 | 103.97 | 4.36 | 400 | 145.19 | 10.69 | 0.92 | 95.77 |

100 | 110.41 | 4.34 | 500 | 145.80 | 11.32 | 0.94 | 133.42 |

120 | 115.68 | 4.37 | 600 | 146.09 | 11.67 | 0.96 | 186.49 |

140 | 120.07 | 4.44 | 700 | 146.22 | 11.87 | 0.98 | 277.20 |

150 | 122.00 | 4.50 | 800 | 146.28 | 11.97 | ||

850 | 146.30 | 12.00 | |||||

(d) Elevation 1 500 m, S_{obs} = 56, T = 200 | |||||||

1 | 5.85 | 1.17 | 0 | 56.00 | 3.91 | 0.74 | 15.70 |

20 | 31.68 | 3.49 | 100 | 61.58 | 4.61 | 0.78 | 69.80 |

40 | 38.64 | 3.65 | 200 | 65.68 | 5.74 | 0.82 | 134.79 |

60 | 42.71 | 3.68 | 300 | 68.70 | 7.10 | 0.86 | 216.19 |

80 | 45.65 | 3.69 | 400 | 70.91 | 8.50 | 0.90 | 325.16 |

100 | 47.98 | 3.69 | 500 | 72.53 | 9.82 | 0.94 | 490.60 |

120 | 49.94 | 3.70 | 600 | 73.72 | 11.00 | 0.98 | 846.42 |

140 | 51.67 | 3.73 | 700 | 74.60 | 12.01 | ||

160 | 53.22 | 3.77 | 800 | 75.24 | 12.87 | ||

180 | 54.66 | 3.83 | |||||

200 | 56.00 | 3.91 | |||||

(e) Elevation 2 000 m, S_{obs} = 14, T = 200 | |||||||

1 | 1.36 | 0.43 | 0 | 14.00 | 0.49 | 0.99 | 28.00 |

20 | 8.60 | 1.25 | 100 | 14.21 | 0.63 | ||

40 | 10.59 | 1.12 | 200 | 14.24 | 0.70 | ||

60 | 11.62 | 0.95 | 300 | 14.25 | 0.72 | ||

80 | 12.31 | 0.82 | 400 | 14.25 | 0.73 | ||

100 | 12.81 | 0.70 | 500 | 14.25 | 0.73 | ||

120 | 13.19 | 0.62 | 600 | 14.25 | 0.73 | ||

140 | 13.49 | 0.56 | 700 | 14.25 | 0.73 | ||

160 | 13.71 | 0.52 | 800 | 14.25 | 0.73 | ||

180 | 13.88 | 0.50 | |||||

200 | 14.00 | 0.49 |

The extrapolation is extended to 1 000 samples for each elevation.

## DISCUSSION

In this paper, we developed a unified theoretical and notational framework for modeling and analyzing the effects on observed species richness of the number of individuals sampled or the number of sampling units examined in the context of a single, quantitative, multispecies sample (an abundance reference sample) or a single set of incidence frequencies for species among sampling units (an incidence reference sample). We compared three statistically distinct models, one based on the multinomial distribution, for counts of individuals (Fig. 1a), the second based on the Poisson distribution, for proportional areas (Fig. 1b), and the third based on a Bernoulli product distribution, for incidence frequencies among sampling units (Fig. 1c).

For interpolation to samples smaller than the reference sample, these correspond to classical rarefaction (Hurlbert 1971), Coleman rarefaction (Coleman 1981) and sample-based rarefaction (Colwell et al. 2004). For the first time, we have linked these well-known interpolation approaches with recent sampling-theoretic extrapolation approaches, under both the multinomial model (Shen et al. 2003) and the Poisson model (Chao and Shen 2004), as well as to methods for predicting the number of additional individuals (multinomial model, Chao et al. 2009) or the amount of additional area (Poisson model, Chao and Shen 2004) needed to reach a specified proportion of estimated asymptotic richness. For the Bernoulli product model, we have developed new estimators, using a similar approach, for sample-based extrapolation (Fig. 1c). The fundamental statistics for all these estimators are the abundance frequency counts *f*_{k}—the number of species each represented by exactly *X*_{i} = *k* individuals in a reference sample (e.g. Tables 1 and 4)—for individual-based models, or the incidence frequency counts *Q _{k}*—the number of species that occurred in exactly

*Y*=

_{i}*k*sampling units (e.g. Table 6)—for sample-based models.

This novel integration of mathematically distinct approaches allowed us to link interpolated (rarefaction) curves and extrapolated curves to plot a unified species accumulation curve for empirical examples (Figs 2 and 4). Perhaps the most surprising (and satisfying) result is how smoothly the interpolated and extrapolated moieties of the curve come together at the reference sample, in all examples we have investigated. The remarkable degree of concordance between multinomial and Poisson estimators (e.g. Fig. 3), not only for interpolation (as anticipated by Brewer and Williamson [1994] and Colwell and Coddington [1994]) but also for extrapolation (as first shown here), was a second surprise, although the two models are closely related, as discussed earlier. We see little reason, for individual-based data, to recommend computing estimators based on one model over the other (although Coleman curves are computationally less demanding than classical rarefaction), and no reason whatsoever to compute both.

The ability to link rarefaction curves with their corresponding extrapolated richness curves, complete with unconditional confidence intervals, helps to solve one of most frustrating limitations of traditional rarefaction: ‘throwing away’ much of the information content of larger samples, in order to standardize comparisons with the smallest sample in a group of samples being compared. The ant dataset (Fig. 4b) (Longino and Colwell 2011), which spans an elevation gradient from lowland rainforest at 50-m elevation to montane cloud forest at 2 000 m, is an excellent example. Typical of tropical mountains, ants are scarce and represent few species above ∼1 500 m on this transect. Datasets range from 200 sampling units (with only 270 incidences) at the 2 000 m site, up to 599 sampling units (with 5 346 incidences) at the 50-m site. (Each incidence is the occurrence of one species in one sampling unit.)

Because each sampling unit is a 1-m^{2} plot, in the ant study, what Fig. 4b plots on the ‘species’ axis are actually estimates of ant species density (species per area) at multiples of 1-m^{2} spatial scale. To convert the plot to approximations of species richness for the local assembly at each elevation, the curves could be rescaled from ‘samples’ to ‘incidences’ for each elevation separately and replotted together on a new graph with ‘incidences’ as the *X*-axis (Longino and Colwell 2011). Rescaling to incidences can also be useful for any organisms that, like ants, live colonially or that cannot be counted individually (e.g. multiple stems of stem-sprouting plants or cover-based vegetation data).

The same approach to approximating species richness is recommended, but with re-scaling to individuals instead of incidences, for rarefaction of sample-based abundance datasets. For these datasets, abundances can first be converted to incidences (presence or absence) before applying incidence-based rarefaction. Then, differences in density (the number of individuals per sampling unit) among datasets can be accounted for by rescaling the *X*-axis of sample-based rarefaction and extrapolation curves to individuals (Chazdon et al. 1998; Gotelli and Colwell 2001, 2011; Norden et al. 2009). With rescaling to individuals, however, strong among-sample differences in dominance can produce misleading results.

Analytical methods (classical rarefaction and Coleman rarefaction) have existed for decades for estimating the number of species in a subset of samples from an individual-based dataset. Confidence intervals for those estimates have always been based on conditional variances because unconditional variances for individual-based classical rarefaction and Coleman curves have until now remained elusive. Suppose we wish to compare two reference samples differing in number of individuals, with sample *Y* larger than sample *X*. The two samples may be drawn from either the same assemblage or from two different assemblages. The conditional variance of the larger sample *Y* is appropriate for answering the question: ‘Is the number of species recorded in the smaller sample, *X*, statistically different from the richness of a random sample of the same size drawn from the larger reference sample, *Y* ? ’ (The conditional variance of sample *X* is zero for the full sample.) In contrast, ecologists would usually prefer to answer the question, ‘Are the numbers of species recorded in samples *X* and *Y* statistically different from the richness of random samples, matching the smaller sample *X* in number of individuals, from the assemblage or assemblages they represent?’ (Simberloff 1979). The latter question requires an estimate of the unconditional variance for both samples. We present, for the first time, simple and explicit variance estimators for both fixed size (multinomial, Equation 5) and random-size (Poisson, Equation 7) individual-based rarefaction models, and we extend the potential for statistical comparison beyond the size of reference samples by extrapolation.

Even when based on unconditional variances, the use of confidence intervals to infer statistical significance (or lack of it) between samples is not straightforward. In general, lack of overlap between 95% confidence intervals (mean plus or minus 1.96 SE) does indeed guarantee significant difference in means at *P* ≤ 0.05, but this condition is overly conservative: samples from normal distributions at the *P* = 0.05 threshold have substantially overlapping 95% confidence intervals. Payton et al. (2004) show that, for samples from two normal distributions with approximately equal variances, overlap or non-overlap of 84% confidence intervals (mean plus or minus 1.41 SE) provide a more appropriate rule of thumb for inferring a difference of mean at *P* = 0.05, and this approach has been suggested by two of us for comparing unconditional confidence intervals around rarefaction curves (Gotelli and Colwell 2011). Unfortunately, the statisticians among us (A.C., C.X.M. and S.-Y.L.) doubt that this approach is likely to be accurate for the confidence intervals around rarefaction (or extrapolation) curves, so the matter of a simple method must be left for further study. Meanwhile, non-overlap of 95% confidence intervals constructed from our unconditional variance estimators can be used as a simple but conservative criterion of statistical difference. Mao and Li (2009) developed a mathematically complicated method for comparing entire rarefaction curves, but it has so far been little used.

All our examples (Tables 2, 3, 5 and 7; Figs 2 and 4) reveal that the unconditional variance increases sharply with sample size for extrapolated curves, and thus, the confidence interval expands accordingly. As with any extrapolation, the estimate becomes more uncertain the further it is extended away from the reference sample. As a consequence, confidence intervals that do not overlap at moderate sample sizes may do so at larger sample sizes, even if the extrapolated curves are not converging. An example of this phenomenon can be seen in the lower two curves of Fig. 4a. We would suggest that extrapolation is reliable, at most, only up to a tripling of the reference sample size, or more conservatively, a doubling of sample size. We have carried out simulations to investigate the performance of the unconditional variance estimators (Equations5, 7, 10 and 13). The proposed unconditional variances perform satisfactorily when sample size is relatively large because they were derived by an asymptotic approach (i.e. assuming the sample size is large). When sample size is not sufficiently large, the unconditional variances tend to overestimate and, thus, produce a conservative confidence interval. For small samples, we suggest estimating variance by non-parametric bootstrapping.

Under all three of the models we discuss, all our estimators for extrapolated richness, as well as all our unconditional variance estimators, require an estimate of asymptotic species richness for the assemblage sampled. For this reason, the accuracy of our extrapolation and variance estimators is of course dependent upon the accuracy of the asymptotic richness estimates they rely upon. However, if and when better estimators of assemblage richness become available, they can simply be plugged into our equations wherever *S*_{est}, $f^0$, or $Q^0$ appear in our equations.

Under the Poisson model, individual-based rarefaction curves and species accumulation curves, because they rely on area, assume that individuals are randomly distributed in space, within and between species. The multinomial model can be viewed as having the same assumption, or alternatively, may be viewed as assuming that species need not be randomly distributed, but that individuals have been recorded randomly without regard to their position in space. Neither assumption is realistic for a practical study of any natural assemblage, which routinely exhibit spatial aggregation within species, as well as spatial patterning in association and dissociation between species. All such violations of the assumptions of spatial randomness lead to an overestimation of richness for a given number of individuals or a given amount of accumulated space, compared with what richness would be for actual smaller or larger samples (Chazdon et al. 1998; Colwell and Coddington 1994; Kobayashi 1982).

Sample-based approaches (e.g. estimators based on the Bernoulli product model), using replicated incidence data (or sample-based abundance data converted to incidence), perform better in this regard as they retain some aspects of the spatial (or temporal) structure of assemblages (Colwell et al. 2004; Gotelli and Colwell 2001; Smith et al. 1985), although sampling designs are nonetheless critical to avoiding bias from spatial structure (Collins and Simberloff 2009; Chiarucci et al. 2009). It may at first appear paradoxical that a simple list of incidence frequencies (e.g. Table 6) retains any information on the spatial structure of the biological populations sampled. But consider two equally abundant species in the same assemblage, one with a very patchy spatial distribution and the other with all individuals distributed independently and at random. With individual-based rarefaction, the two species will be indistinguishable. In a sample-based study of the same assemblage, however, the aggregated species will generally have a lower incidence frequency (since many individuals will end up some samples and none in others) than the randomly distributed species. While accounting for within-species aggregation, however, sample-based rarefaction is blind to interspecific association or dissociation (Colwell et al. 2004, their Table 2).

When sample-based (replicate) data are not available, the individual-based methods we present here can be applied, with the understanding that spatial structure is ignored. To model species aggregation explicitly, the current models could be extended to a negative binomial model (a generalized form of our Poisson model; Kobayashi 1982, 1983) and to a multivariate negative binomial model (a generalized form of our multinomial) model. Extra parameters that describe spatial aggregation would need to be introduced in the generalized model, and thus, statistical inference would become more complicated.

We plan to implement the rarefaction and extrapolation estimators discussed in this paper in the freeware applications EstimateS (Colwell 2011) and in iNEXT (http://chao.stat.nthu.edu.tw/softwareCE.html).

## FUNDING

US National Science Foundation (DEB 0639979 and DBI 0851245 to R.K.C.; DEB-0541936 to N.J.G.; DEB-0424767 and DEB-0639393 to R.L.C.; DEB-0640015 to J.T.L.); the US Department of Energy (022821 to N.J.G.); the Taiwan National Science Council (97-2118-M007-MY3 to A.C.); and the University of Connecticut Research Foundation (to R.L.C.).

We are grateful to Fangliang He and Sun Yat-sen University for the invitation to contribute this paper to a special issue of JPE and to an anonymous reviewer for helpful comments.

*Conflict of interest statement.* None declared.

## References

*10th Annual Meeting of the Ecological Society of Japan*. Abstract, p. 5. Ecological Society of Japan, Tokyo, Japan

Science285:1459