## Abstract

Current statistical biogeographical analysis methods are limited in the ways ecology can be related to the processes of diversification and geographical range evolution, requiring conflation of geography and ecology, and/or assuming ecologies that are uniform across all lineages and invariant in time. This precludes the possibility of studying a broad class of macroevolutionary biogeographical theories that relate geographical and species histories through lineage-specific ecological and evolutionary dynamics, such as taxon cycle theory. Here we present a new model that generates phylogenies under a complex of superpositioned geographical range evolution, trait evolution, and diversification processes that can communicate with each other. We present a likelihood-free method of inference under our model using discriminant analysis of principal components of summary statistics calculated on phylogenies, with the discriminant functions trained on data generated by simulations under our model. This approach of model selection by classification of empirical data with respect to data generated under training models is shown to be efficient, robust, and performs well over a broad range of parameter space defined by the relative rates of dispersal, trait evolution, and diversification processes. We apply our method to a case study of the taxon cycle, that is testing for habitat and trophic level constraints in the dispersal regimes of the Wallacean avifaunal radiation.

Model-based historical biogeography methods have advanced remarkably in recent years, facilitating the statistical investigations of a broad range of questions that relate geographical history to the evolutionary history of lineages. The “Dispersal-Extinction-Cladogenesis” (DEC) model was a crucial contribution to the field, providing a rigorous probabilistic framework to model dispersal and vicariance processes informed by phylogenetic and geographical data ( Ree et al. 2005 ; Ree and Smith 2008 ). Under this DEC model, numerous studies into the geographical origins of groups as well as the histories of changes to the geographical ranges of lineages in those groups over time have been carried out (e.g., de Bruyn et al. 2014 ; Loader et al. 2014 ; Mitchell et al. 2014 ). Powerful and flexible as the DEC model was in answering questions relating to the geographical history of lineages, it was limited in trying to answer questions relating to the process of dispersal itself. The original DEC model treated dispersal as a process that was uniform across lineages. Matzke (2014) extended the original DEC model to incorporate founder-event “jump” dispersal, and its

A major advance in integrating ecology with historical biogeography was the geographic state speciation and extinction model (“GeoSSE”; Goldberg et al. 2011 ), which related geographic distribution to rates of diversification in a statistical probabilistic framework. By using geographic areas occupied by a species as proxies for habitat preferences, the GeoSSE approach allowed study of the relationships between ecology and diversity. Naturally, this approach was restricted by only being able to consider cases in which geographical area could, in fact, be taken as a proxy for habitat (as opposed to being able to model ecology and geography as separate, even if reciprocally interacting, classes of phenomena). But, in addition, it still treated all lineages as exchangeable with respect to the dispersal process, and thus did not allow the modeling of ecology (whether by geographical proxy or not) on dispersal, and thus the effect of lineage ecology on geographical range evolution. Furthermore, the ecologies of the lineages themselves were considered to be fixed in evolutionary time, instead of evolving along with their geographies.

This field of inquiry, that is the study of the relationship between ecology and its effects on patterns of geographic distribution of lineages, is of long-standing interest in biogeography. For example, the taxon cycle hypothesis ( Wilson 1961 ; Ricklefs and Bermingham 2002 ) proposes linkages between trait evolution, niche shifts, dispersal, and diversification dynamics. Taxon cycle theory posts a particular combination of ecological and evolutionary constraints that lead to a cyclical pattern of biogeographic expansion and contraction associated with niche shifts and diversification. Habitat-dependent dispersal is a necessary (but not sufficient) component of the taxon cycle as well as other biogeographic hypotheses. Such habitat dependency may be implied by correlations between range size/endemism and habitat type. For example, many studies have found statistical support for correlations between habitat affinity and the species range extent, implying habitat-dependent dispersal (e.g., Ricklefs and Cox 1972 , 1978 ; Dexter 2010 ; Carstensen et al. 2012 ; Economo and Sarnat 2012 ; Economo et al. 2015 ). However, such correlations are potentially misleading about habitat-driven dispersal in the context of the other processes shaping biogeographic patterns such as speciation, extinction, and niche evolution.

Our goal is to develop a statistical, process-based model choice framework to determine if we can discriminate habitat-constrained dispersal regimes from cases where dispersal is not constrained by habitat. More generally, we investigate whether we can identify cases where a particular ecological trait does indeed regulate dispersal of lineages throughout the history of a group. To address this goal, we need an approach that allows for separate geographical as well as ecological state spaces, and conditions transitions of lineages through the geographical state space (which includes range gain through dispersal as well as range loss through extirpation) on the state of the given lineage in ecological space. In this article, we first introduce a continuous-time discrete-area model that not only incorporates diversification and biogeographical processes (i.e., lineage birth, lineage death, dispersal) as submodels, but also a trait evolution submodel that informs the rates of these processes. We then describe a likelihood-free approach for carrying out inference under this model, using simulation-trained multivariate discriminant analysis classification. As a test case, we apply our approach to the Wallacean island bird radiations; a system in which taxon cycles have been proposed and, more specifically, in which habitat-constrained dispersal has been suggested as an important mechanism in regulating radiation ( Carstensen et al. 2012 ).

## M aterials and M ethods

### Archipelago: A Trait-Driven Biogeographical Phylogenesis Model

The Archipelago model generates spatially explicit phylogenies under a complex of superpositioned stochastic processes, in which the per-lineage speciation, extinction, and dispersal events each occur under separate stochastic processes. The rates of speciation, extinction, and dispersal are not (neccessarily) homogenous, but instead are determined by the states of traits that are associated with the given lineage. These trait states (which can represent any aspect of a lineage: biological, ecological, etc.) are, in turn, evolving independently along each lineage under a homogenous Poisson process. If the relationship between the trait states and the processes which they regulate were direct (e.g., the trait states were continuous variables that served as the rates for the speciation, extinction, and dispersal processes), then we would describe our model as a compound Poisson model with the rate of change for the trait states as hyperparameter(s). However, in our formulation, the trait states are categorical variables, and their relationship to the rates of speciation, dispersal, and extinction are given by arbitrary user-defined functions that are part of our model. Our model can be understood as a modification or extension of the Dispersal-Extinction-Cladogenesis model ( Ree et al. 2005 ; Ree and Smith 2008 ; Matzke 2014 ), by (i) the incorporation of the branching process that generates a phylogeny as part of the model (instead of treating the phylogeny as a parameter of the model supplied by the user); (ii) the addition of the modeling of evolution of one or more traits under separate Poisson processes on the growing phylogeny; and (iii) the addition of a set of rate functions that relate a given lineage's trait states to the speciation, extinction, and dispersal rates of that lineage.

Our model consists of the following components (described in detail following the summaries):

Geography: a set of areas and connections between them which define the spatial configuration of the system.

Traits: defines the ecology or biology of the lineages insofar as they inform other aspects of the system (in particular, the speciation, extinction, and dispersal processes).

Lineages, representing an evolutionary operational taxonomic unit, corresponding to a node on a phylogeny.

Lineage rate functions: a set of functions that provide rates or weights for the independent processes of speciation, extinction, and range evolution of each lineage.

The process of anagenetic range evolution, which is actually the superposition of two independent processes: the process of

*anagenetic range gain*, where a lineage adds an additional area to its range through dispersal and colonization of a new area; and the process of*anagenetic range loss*, where a lineage loses an area in its range through local extirpation. Following the terminology of Ree et al. (2005) , Ree and Smith (2008) , Matzke (2014) , we refer to range gain as “dispersal” and range loss as “extinction” in this article.The process of cladogenesis, which generates speciation events in the branching process of the phylogeny, corresponding to “birth” events in a classical birth–death model, where the phylogeny gains a lineage by an existing lineage splitting into two daughter lineages. This includes the generation of cladogenetic range evolution events, as each daughter lineage acquires a range based on various different kinds of speciation modes.

Termination conditions: which determines how long the process is run before the phylogeny is sampled.

#### Geography

The geography of our model can be represented as a directed graph, $G$ , where vertices represent *areas* and arcs represent differential dispersal weights between areas. An “area” is an operationally defined biologically meaningful distinct geographical subunit of the study region: islands, mountains, zones of suitable habitat in a continuous landscape, etc. We distinguish between two classes of areas: focal and supplementary. Focal areas are geographically demarcated units within the study region from which lineages will be sampled, whereas supplementary areas external to the study region which nonetheless share or contribute to the evolutionary history of the lineages within the study region. For example, in the study of an adaptive radiation of a group of birds in an island system, the areas representing individual islands in the archipelago would be focal areas, whereas the continental source with which fauna is interchanged with the islands would be a supplemental area. The distinction between focal and supplemental areas is useful, as it allows for a better approximation of many island biogeographical systems, where the lineages evolving under quite different ecologies in distant regions contribute to the diversity and dynamics of the study region following introduction by stochastic dispersal. The relative diversity of each of the areas is a model parameter, representing the relative probability of which of the areas within a particular lineage's range becomes the site of speciation in the event that the lineage splits. This allows for modeling of unequal diversity across areas, either as a reflection of area size or other factors. We use the notation $Ai$ to refer to the $i$ -th area of the model.

The relative connectivity of the areas to each other are represented by the arcs of $G$ . An arc is a directed edge connecting two vertices, going from a source or tail vertex to a destination or head vertex. In the geographical submodel, an arc connecting area $u$ to area $v$ , $\alpha \u3008Au\u21d2Av\u3009$ represents the weighting of a dispersal event from area $u$ to $v$ , relative to all other area-pairwise dispersal events. If all areas are equally connected, then all arcs will have the same weight, which normalizes to 1: $\alpha \u3008Au\u21d2Av\u3009=1,\u2200u\u2260v$ . If some areas are not accessible from other areas, then the outgoing arcs of the latter toward the former will have a weight of 0. These weights will be multiplied by the lineage-specific dispersal weight of each lineage and the global disperal rate, $\delta $ , to obtain the effective rate of dispersal of that lineage from the source area to the destination area. The relative weightings of arcs connecting areas allow for modeling of various geographical geometries, such as stepping-stone or equal-island configurations, as well as for more nuanced modeling of the dispersal of lineages from one (or more) continental sources to the areas that are the focus of the study.

#### Traits

Traits are characters associated with lineages. In the current discussion, we consider traits to represent discrete characters, though there is nothing in principle that prevents them from representing continuous characters. An arbitrary number of trait *types* can be modeled, with an arbitrary number of trait states per trait type. Although traits can be used to represent any aspect of lineages, in this discussion we find it useful to consider them as representative of lineage ecology, just as areas represent the lineage geography.

Each lineage in the model shares the same suite of trait types, though, of course, the states of each particular trait type will vary from lineage to lineage. Each trait type evolves under its own Poisson process, and thus has its own distinct Markovian transition kernel, with the state of each trait type for a lineage evolving independently on each lineage under the Markovian transition kernel for that trait type. Daughter lineages inherit the trait state set of their parents, unmodified. This is, in fact, a fairly traditional model of phylogenetic character evolution (see, e.g. Felsenstein 2004 ). We denote the set of all trait types defined in the model as $T$ , and an individual trait type will be referenced by its labeling, as $Tlabel$ ; for example, the “habitat” trait will be referred to as $Thabitat$ . We denote the state space of an arbitrary trait with label $t$ , $Tt$ , as $\sigma t$ , and the size of this state space as $|\sigma t|$ . We denote as $Qt$ the $n\xd7n$ normalized instantaneous rate of change matrix for trait with label $t$ , $Tt$ , where $n$ is the size of the state space for trait type $Tt$ , $n=|\sigma t|$ . In our model parameterization, the transition matrix $Qt$ is normalized so that the mean flux is 1, and describes the relative weights of state transitions, but to obtain the actual rate of transition this matrix is multiplied by a global trait transition rate. Each trait type has its own independent trait transition rate, and we denote the global trait transition rate for an arbitrary trait with label $t$ , $Tt$ , as $qt$ .

#### Lineages

A “lineage” in our system is an independent evolutionary unit, representing an operational taxonomic unit corresponding to a node on a phylogeny. Each lineage in our model can be characterized by two attributes: its “ecology,” which is a vector of trait states under the trait submodel (described below), and its “range,” which is the set of distinct areas (i.e., vertices in the geographical submodel, $G$ , described below) in which it occurs. We use the notation $si$ to denote the $i$ -th lineage of the system, $\xi t(si)$ to denote the current state of the trait with label $t$ for lineage $si$ , and $R(si)$ to denote the set of areas currently in the range of lineage $si$ .

The traits of a lineage change following a Poisson process as described above, with the states of each trait evolving independently on each lineage under the Markovian transition kernel specified for that particular trait. The range of a lineage changes through gain or loss of areas under two distinct process: anagenetic range evolution and cladogenetic range evolution. These are discussed in detail below.

#### Lineage process rate functions

Our model requires that three functions be defined that regulate the processes of speciation, extinction, and range evolution for each lineage:

A

*lineage birth rate*function, $\Phi \beta (\xb7)$ .A

*lineage extirpation rate*function, $\Phi \u03f5(\xb7)$ .A

*lineage dispersal weight*function, $\Phi \delta (\xb7)$ .

Each of these functions takes a particular lineage as an argument, and maps to a real value giving the weight or rate of the given process for that particular lineage at that particular point in time, based on evaluation of the properties of the lineage or some other criteria. Each of these functions can be fixed to a constant value irrespective of the lineage argument to implement a model where the given process is independent of any trait or property of the lineage. For example, we can specify that the birth rate is a fixed value of 0.01 regardless of the lineage or any other criteria by specifying:

Alternatively, the functions can be more complex, and dependent on one or more trait states of the lineage. For example, if $\xi t(si)$ returns the state value for trait $t$ of lineage $i$ , then we can construct a lineage dispersal weight function that evaluates to 0.5 if the color trait of lineage $i$ has state 0, and 1.0 if it has state 1:

The functions can also reference other attributes of the lineage such as, for example, the number of areas in their range, or even the identities of particular areas. In this article, however, we restrict our discussion to functions that only consider the traits associated with each lineage.

#### Anagenetic range evolution

The range of a lineage can expand anagenetically by adding an area through the process of colonization by dispersal. The global dispersal rate model parameter, $\delta $ , determines the base probability of range gain through dispersal across the system. This model parameter is modified by the lineage-specific dispersal weight, as returned by the lineage dispersal weight function, and the weight of the connection arcs between the source and destination areas. The rate that a lineage $si$ disperses from one particular area in its range, $Au,Au\u2208R(si)$ , to another particular area not already in its range, $Av,Av\u2209R(si)$ , is given by the product of the lineage dispersal weight, $\Phi \delta (i)$ , the weight of the arc from $Au$ to $Av$ , $\alpha \u3008Au\u21d2Av\u3009$ , and the global dispersal rate, $\delta $ :

*total*rate that a lineage $si$ gains a new area not already in its range, $Av,Av\u2209R(si)$ , is given by the sum of the rates of it dispersing from every one of the areas already in its range to the new area $Av$ :

Conversely, the range of a lineage can contract anagenetically by losing an area through the process of extirpation. The probability that a lineage $si$ loses an area already in its range is given by the lineage extirpation rate function applied to the lineage, $\Phi \u03f5(si)$ . If the lineage loses all areas in its range, that is its range is equal to the empty set, $R(si)=\u2205$ , this means that it has gone globally extinct.

#### Speciation and cladogenetic range evolution

The branching process that generates the phylogeny is driven by a birth–death process.

The lineage-specific birth rate, that is the rate that a particular lineage splits into two daughter lineages, is given by the lineage birth rate function $\Phi \beta (\xb7)$ . The following range-inheritance processes are allowed at speciation/cladogenesis events:

*single-area sympatric speciation*— both daughters inherit the entire ranges of their parent.*sympatric subset speciation*— one daughter inherits the complete range of the parent lineage, while the other inherits a single area in the ancestral range.*peripheral allopatry*— one daughter inherits a single area within the parent's range, while the other inherits all remaining areas.*multi-area vicariant speciation*— the ancestral range is partitioned unequally between the two daughter species.*founder-event “jump” dispersal*— one daughter inherits the entire range of the parent, while the other has its range set to a single new area not in the parent range.

These speciation modes correspond to modes defined in the BioGeoBEARS cladogenesis model ( Matzke 2014 ), which extends the DEC model ( Ree and Smith 2008 ) by adding a number of other modes for assigning ranges to daughter lineages, including a founder-event “jump” dispersal. At every cladogenesis event, one of these speciation modes is selected under a model-specified probability distribution to determine the ranges of the daughter lineages. In all work reported here, we placed a probability of 0 on the founder-event “jump” dispersal speciation mode, and equal probability on all the other modes. This restriction is so that in all the analyses we report on here, the cladogenesis submodel replicates the DEC ( Ree et al. 2005 ; Ree and Smith 2008 ) model, adding the widespread vicariance mode but not the founder-event jump dispersal introduced by BioGeoBEARS ( Matzke 2014 ). Our model allows the founder-event jump disperal to occur (during simulations) following a weight provided by the user. This weight will be multiplied by the lineage-specific dispersal weight to get the actual weight of the founder-event jump dispesal, relative to the other speciation modes which are assigned a weight of 1.0.

If there is a need to model different levels of the diversity in the various areas, different diversity weights can be given to each of the areas. These diversity weights determine the probability that a particular area becomes the site of a new species in speciation modes 3, 4, and 5, for example. In all the work reported here, we set the diversity levels of all areas to be equal.

The lineage-specific extirpation rate, $\Phi \u03f5(\xb7)$ , gives the rate by which a lineage loses areas from its range. If the range of a lineage is reduced to the empty set, then it is considered to have gone extinct. Thus, although there is real extinction in our model in addition to speciation, the death rate parameter is not an explicit primary parameter of our model as it is in the standard birth–death model. If there is a need to establish full correspondence between the extirpation rate as given in our model and the death rate in the standard birth–death model (for, e.g., calibration purposes), the lineage extirpation rate function can be modified to return the global death rate multiplied by the number of areas currently occupied by the lineage, as the global extinction probability is the joint probability of the lineage being exirpated from every area in its range simultaneously.

### Model Selection by Simulation-Trained Discriminant Analysis of Principal Components Classification

Carrying out full likelihood inference under this model would be challenging, both for theoretical as well as practical reasons. Depending on the various lineage-specific rate and weight functions, the rates of the various processes can vary across both lineages as well as time in a stochastic and unpredictable way. Here we present a simulation-based likelihood-free model choice procedure as an alternative to full likelihood Bayesian or frequentist approaches, where we use a multivariate discriminant analysis function to classify the target or empirical data with respect to one of a number of competing generating models.

Our procedure consists of the following steps ( Fig. 1 ):

Simulation of data (phylogenies) under regimes corresponding to hypotheses in the competing model set, with nuisance parameters estimated from target data that we wish to classify.

Calculation of summary statistics on simulated data to obtain a data set that can be used to train a classification function.

Construction of a discriminant analysis function based on principal components extracted from the training data set.

Assessment of the performance of the summary statistics by inspection of posterior prediction of the training data set.

Application of the discriminant analysis function to the original data to classify it with respect to the competing models.

#### Simulation of training data

We simulate data under the Archipelago model using the Python package

Consider, for example, determining whether or not the habitat association of lineages influences their dispersal capacity. We would define two classes of dispersal regimes: a habitat-unconstrained dispersal versus a habitat-constrained dispersal. Under the habitat-unconstrained regime, all lineages disperse between areas at the same rate, regardless of their ecology in terms of habitat association, whereas under the habitat-constrained reigme, only lineages associated with a particular habitat type are allowed to disperse ( Fig. 2 ).

Under the Archipelago model, we define a trait, “habitat,” $Thabitat$ , that had two states, “open” and “interior,” $\sigma habitat={open,interior}$ . In all the studies presented here, we assume equally weighted transition rates between all trait states in all our trait evolution models, such the rate of transitions from the “open” trait state to the “interior” is equal to the rate of transitions from the “interior” to the “open” state.

In the case of the habitat-unconstrained dispersal regime, we would define a fixed-value lineage dispersal rate function,

On the other hand, in the case of the habitat-constrained dispersal regime we would define a lineage dispersal rate function that only allows dispersal if the lineage's habitat trait was “open”:

Here, the effective dispersal rate for any lineage with the “interior” habitat trait is 0.0, so that these lineages will not gain any areas anagenetically, regardless of the connections between areas or the global dispersal rate, $\delta $ . Only lineages with the “open” habitat trait will gain areas in their range anagenetically, at an effective rate of $2.0\delta $ multiplied by the dispersal weights of all the areas in which they occur. Note that the dispersing lineages in the constrained model have their rates of dispersal weighted twice as much as the lineages in the unconstrained model, to ensure that the average dispersal over both constrained and unconstrained dispersal regimes are equal. We use a factor of 2 because, under a trait-transition model in which transition rates between all states are equal, we expect an equal number of lineages to be found associated with each of the two habitat states, and thus in the constrained model, on the average only half the lineages are dispersing in relation to the unconstrained model. This means that for the total dispersal flux to remain equal when estimated from data generated under both systems, we have to weight the constrained regime dispersal rates to compensate for the fact that only half the lineages are dispersing. We show later that this weighting does, indeed, result in correct behavior, as measured when dispersal rates are re-estimated from the simulated data under both regimes.

In all other respects, apart from the lineage dispersal weight function, the models of unconstrained and constrained dispersal regimes are identical: they are calibrated to share the same (speciation, extinction, dispersal, and trait evolution) process parameters, number of focal areas, supplemental areas, as well as termination conditions. Thus in *both* the unconstrained and constrained dispersal regime models, we set the lineage specific birth rate and extirpation rate functions to return fixed values, $\Phi \beta (\xb7)=b\u02c6$ and $\Phi \u03f5(\xb7)=e\u02c6$ , where $b\u02c6$ and $e\u02c6$ are the birth rate and local extirpation rate are estimated from the empirical data, using

We simulated 100 replicates under each of the regimes, habitat-unconstrained and habitat-constrained dispersal. Each simulation yields a phylogeny containing information not only on speciation times, but on ranges (incidence in terms of islands or areas) and habitat associations of each lineage.

#### Calculation of summary statistics on training data

The simulations described in the previous step produced a set of phylogenies under each of the model regimes. For each of the simulated phylogenies, we calculate a vector of summary statistics. The actual choice of summary statistics would, of course, vary based on application. The principal criteria for summary statistic selection here is that they capture some of the patterns generated by the processes that we are studying. In our motivating examples, as we are interested in the relationship between a particular trait (habitat association) and geographic distribution, we use statistics based on the Mean Phylogenetic Distance (MPD) and Mean Nearest Taxon Distance (MNTD) concepts from community ecology ( Webb 2000 ). We calculate two variants of each of these statistics, depending on whether we define a “community” based on area (i.e., island) or habitat (i.e., open vs. interior habitats). For the area-based community scores, we calculate the mean and discard the individual area-based community scores; this is because in the analyses discussed here, we consider the areas/islands as exchangeable as a simplifying assumption, and more complex analyses may gain from retaining the individual island scores. For the habitat-based community scores, we retain the ones defined on considering each individual habitat as a community, as well as the mean of these scores. Furthermore, for each of these statistics, we calculate both weighted as well as unweighted distances, with edge lengths being considered when calculating distances between taxa in the former and just the number of edges between taxa being used in the latter. All weighted scores are normalized with respect to total tree length, while all unweighted scores are normalized with respect to the total number of edges on tree. In addition to the actual MPD or MNTD Z-score, we also use the $p$ -value and variance of the scores, with $p$ -value calculated using a null distribution obtained from shuffling the taxon labels. The actual number of summary statistics with this scheme will, of course, vary depending on the number of trait states (i.e., habitat category types). For example, with two habitat categories as trait states, open and interior, we obtain a total of 47 individual summary statistics. All these community-ecology summary statistics were calculated using the

#### Construction, assessment, and application of the discriminant analysis of principal components Function

We construct a discriminant analysis function using the simulated data as a training data set, and the dispersal model regime as the grouping factor. We do not construct this function based on the summary statistics of the training data set directly, but rather on principal components extracted from the summary statistics, i.e., a Discriminant Analysis of Principal Components, or DAPC, approach ( Jombart et al. 2010 ). The discriminant analysis function that will be used to classify data under one of the model regimes is constructed using the

The number of principal component axes retained or used by the discriminant analysis function is an analysis design choice. Too many principal components will result in overfitting to the training data set, whereas too few will result in lack of power ( Jombart et al. 2010 ). Here we select the number of the principal component axes to be retained for the discriminant analysis function using a heuristic optimization criteria, where, we attempt to (re-classify) the training data set elements using discriminant functions constructed on a different number of principal components retained, and select the number of principal components to retain that maximizes the proportion of correct (re-)classification of the original training data elements.

The performance of the DAPC function can be assessed in terms of its success rate in re-classifying the training data: the discriminant analysis function is applied to each of the vectors of summary statistics to estimate the posterior probability of being generated under each of the competing model regimes. This performance can be measured in two ways: the mean posterior probability of the true model regime, as well as the proportion of instances in which the true model was the preferred model regime.

The application of the DAPC function to classify the empirical data is carried out with the

The entire process of ingesting a training data set, selecting the number of PC axes to retain, constructing a DAPC function, and applying it to an empirical data set to yield the model choice results is implemented in the Python package

### Assessment and Validation of Model Choice Performance

We assessed the performance of our approach with the following procedure: Simulating a “known” phylogeny/geography/trait history under a given model and parameter values, then seeing if we could accurately predict the generating model using the simulation discriminant function approach described above ( Fig. 3 ). We explored the performance of our method over a range of process parameter space, using a landsape consisting of four focal areas and one supplemental area, with the dispersal weights between all areas and the diversity weights in each area set to be equal. Simulations running until 200 lineages were generated in the focal areas. We defined our parameter space exploration in terms of birth rates, that is, we explored suites of parameters all scaled to a particular birth rate. The birth rates we used are given by the set, $b$ , where $b={0.0015,0.0030,0.0060,0.0150,0.0300,0.0600}$ . For *each* birth rate, $b,b\u2208b$ , all distinct combinations of the following parameters were explored:

extirpation rates, $e$ : 0, $13b$ , $23b$ .

dispersal rates, $d$ : $b\xd710\u22121,b\xd710\u22120.75,b\xd710\u22120.5,b\xd710\u22122.5,b\xd7100,b\xd7102.5,b\xd7100.5,b\xd7100.75,b\xd7101$ .

trait transition rates, $q$ : $b\xd710\u22121,b\xd710\u22120.75,b\xd710\u22120.5,b\xd710\u22122.5,b\xd7100,b\xd7102.5,b\xd7100.5,b\xd7100.75,b\xd7101$ .

For each distinct combination of parameters above, we then simulated 200 target data sets, with 100 target data sets under the “unconstrained” dispersal regime and 100 target data sets under the “constrained” dispersal regime.

Each of these target data sets were profiled with respect to their major process parameters. The maximum-likelihood estimates of the birth rate under a pure-birth model were computed using

For each target data set, we then generated a training data set of 200 replicates, with 100 replicates under the “constrained” dispersal regime and the other 100 replicates under the “unconstrained” dispersal regime, using the *true* model parameters (i.e., not the estimated ones recovered during the profiling in the previous step). We acknowledge that using the estimated parameters would provide a more stringent assessment of the performance of the method, as our method then has to deal with estimation error in the calibration of the process parameters when simulating the training data sets. However, we suggest that our approach is justified because, as the results of the profiling described above show, our simulations produce data that result in estimates of process parameters that are very close to the true parameters. A discriminant analysis classification function was constructed on principal components calculated on each training data set, with the number of principal components retained selected by the optimality criteria described above. The proportion of correctly classified test data sets was taken as the primary performance metric, but we also recorded the posterior probability of the true and false models to understand how these varied across parameter space.

### Sensitivity of Model Choice to Process Parameter Calibration Value Errors

In our approach as presented here, we “calibrate” the simulations that generate the training data sets to match the target data set with respect to the birth, dispersal, and trait evolution process parameters. In practice, when attempting to classify empirical data, these parameters are not known and must be estimated. We are interested in characterizing the sensitivity of our model choice approach to errors in these calibration parameters. To do this, using the same two-trait, four-focal area, and one-supplemental area system described previously, we simulate a target data set under a set of known process parameters, and then attempt to classify it with respect to the model regime using training data sets that deliberate mis-specify process parameters values. We use the following set of process parameters as baseline values for the process parameters: birth rate, $b=0.03$ ; extirpation rate $e=0.00$ ; global dispersal rate $d=0.03$ ; and trait evolution transition rate $q=0.03$ . The training data set will use the same baseline values except one of the parameters will have an error introduced. We analyze the effect of errors in the birth rate, dispersal rate, and trait trait evolution transition rate separately. In each case, the target data set will be generated under the baseline values, and the training data set will be generated with the parameter being studied changed by a factor (and all other parameters fixed to the baseline values). The factors that we use result in parameter values that range from two orders of magnitude less than, to two orders of magnitude greater than the true (baseline) values, in increments of quarter orders of magnitude: $10\u22122,10\u22121.75,10\u22120.5,10\u22120.25,100,100.25,100.5,100.75,101,101.25,100.5,100.75,102$ . We repeat each analysis 200 times, with 100 replicates in which the true model of the target data set was “unconstrained” and 100 replicates in which the true model of the target data set was “constrained.” Performance was assessed by measuring how the posterior probabilities of the true and false models varied across these different magnitudes of errors in each of the parameters.

### Sensitivity of Model Choice to Cladogenetic Range Inheritance Mode Misspecification

As noted above, for all the analyses discussed here, we restricted ourselves to a subset of cladogenetic range inheritance modes: single-area sympatric speciation, sympatric subset speciation, peripheral allopatry, and multi-area vicariant speciation. The founder-event jump dispersal cladogenesis mode, however, was excluded. Previous studies have indicated that exclusion or inclusion of this mode can affect the results of analysis when estimating or reconstructing ancestral areas, with exclusion of the mode leading to a tendency to estimate widespread or multi-area ancestral ranges while inclusion of the mode leads to a tendency to estimate single-area ancestral ranges ( Matzke 2014 ). To study the effects of inclusion and exclusion of this speciation mode on our analytical objective (determination of whether or not dispersal is constrained by habitat), we carried out the following analyses of simulated data under the following four conditions:

Both test (i.e., the simulated data, representing the “truth”) and estimation models do NOT incorporate jump dispersal. This is the baseline case for comparison purposes.

The test model incorporates jump dispersal and estimation model does not incorporate jump dispersal. This case represents a deliberate under-specification of the estimation cladogenesis model.

The test model does not incorporate jump dispersal and estimation model incorporate jump dispersal. This case represents a deliberate over-specification of the estimation cladogenesis model.

Both test and estimation models incorporate jump dispersal. This case represents the correctly specified estimation cladogensis model.

In all cases, for both the test and estimation simulations, the birth rate was fixed at $b=0.03$ , and all distinct combinations of the following process parameters were explored:

extirpation rates, $e$ : 0.0, 0.01, 0.02.

dispersal rates, $d$ : $b\xd710\u22121,b\xd710\u22120.75,b\xd710\u22120.5,b\xd710\u22122.5,b\xd7100,b\xd7102.5,b\xd7100.5,b\xd7100.75,b\xd7101$ .

trait transition rates, $q$ : $b\xd710\u22121,b\xd710\u22120.75,b\xd710\u22120.5,b\xd710\u22122.5,b\xd7100,b\xd7102.5,b\xd7100.5,b\xd7100.75,b\xd7101$ .

If the cladogenetic jump dispersal process was indicated, either in the test or estimation model, then a prior weight of 1.0 was used (i.e., equal per-event prior weight to the other cladogenetic modes). A total of 100 test data sets were simulated under the two dispersal regimes: one in which dispersal is constrained by habitat and the other in which dispersal was unconstrained (i.e., dispersal was possible regardless of habitat association). Note that under the constrained dispersal regime, the dispersal constraint was applied to the jump dispersal cladogenetic inheritance mode as well as anagenetic dispersal (i.e., cladogenetic jump dispersal received a prior weight of 0 and thus was not allowed if the habitat trait of the lineage was in the “interior” state). For each test data set, an analysis was carried out using the above-described procedure, using training data sets of 200 replicates under each of the unconstrained and constrained regimes. The proportion of correct classifications of each test data set was recorded as the performance metric.

### Analysis of Trait-Dependent Dispersal in Island Bird Radiations

We applied our model and inference method to try and identify whether dispersal is constrained by traits in radiations of birds in Wallacea ( Fig. 4 ), focusing on two traits in particular: habitat type and trophic level.

#### Geography

Each trait was analyzed separately in a series of independent analyses using the same geographical submodel. The basic geographical units or “areas” of the analyses corresponded to the “modules” of Carstensen et al. (2012 , 2013 ): Banda Sea, Lesser Sundas, Maluku, and Sulawesi ( Fig. 4 ). These areas constituted the primary focal areas of the study, that is, the areas from which lineages were sampled. However, in addition, we added supplementary areas to the model to represent the Sundaland and Sahul continental sources which contributed to the evolutionary history, dynamics, and diversity of the region, but from which lineages were not sampled for analysis. This more authentically approximated the sampling of the empirical data we were analyzing. In three independent sub-analyses for each trait, we used zero, one, and two supplementary areas respectively, followed by a fourth independent subanalysis for each trait in which the training data sets were the union of the previous three (i.e., essentially integrating over the number of supplementary areas with an equal prior on each).

#### Phylogeny

We downloaded 10000 post-burnin MCMC samples of the Jetz et al. (2012) global avian phylogeny using the Hackett backbone from http://birdtree.org/ mapped to the Jetz nomenclature. These trees were originally estimated using

#### Habitat-dependent dispersal analyses trait configuration

The original Carstensen study associated each species with one of six habitat categories:

(H1) Interior: species only occupying interior forest.

(H2) Open-forest: species occupying interior forest and/or open forest.

(H3) Coastal: species only occupying littoral and/or open habitats.

(H4) Open: species only occupying open habitats.

(H5) Generalist: species occupying all four habitat types.

(H6) Other: species that did not fit into any of the previous five categories.

In this study, we collapsed these habitats into two categories, “open” and “interior,” and model “habitat” as a two-state trait associated with, and evolving along, lineages. We categorized any species that occurs in open or disturbed habitat as a “open area” species, in the sense that they have access to open areas and thus are at least partially subject to dispersal regimes associated with open areas. On the other hand, only species that are strictly restricted to interior habitats are considered “interior area” species: they are excluded from any dispersal regimes associated with open areas. This means that we considered birds associated with habitats 2 through 5 as open area species, while birds associated with habitats 1 were categorized as interior area species. We discarded (and pruned from the phylogeny) any species associated with habitat category 6 (“Other”), as this was a category consisting of different ecological categories of different evolutionary origins, and we are modeling the habitat trait as an evolutionary character.

We define two dispersal regimes categories based on the lineage dispersal weight responses to the habitat trait. The first is the “unconstrained” dispersal regime model, where the lineage dispersal weight function is a fixed value of 1.0. The second is the “constrained” dispersal regime model, where the lineage dispersal weight function is conditional on the habitat trait state of the lineage: if the lineage habitat trait state is “open,” then the dispersal weight is 2.0, but if it is “interior,” then the dispersal rate is set 0. This latter configuration thus only permits lineages with the “open” habitat trait to disperse.

#### Trophic-level dependent dispersal trait configuration

The original Carstensen study associated each species with one of nine guilds: frugivores, granivores, herbivores, nectarivores, insectivores, invertebrates, omnivores, piscivores, and carnivores. We modeled these guilds as trophic levels using a two-state trait system, with the trait, “trophic level” being either “low” or “high.” The “low” trophic level consisted of the “frugivore,” “`granivore,” “herbivore,” “nectarivore,” “insectivore,” and “omnivore” categories, whereas the “high” trophic level consisted of the rest. As before, we define two dispersal regimes categories based on the lineage dispersal weight responses to the trophic level trait: the “unconstrained” dispersal regime model and the “constrained” dispersal regime model. In the “unconstrained” dispersal regime model, the lineage dispersal weight function is a fixed value of 1.0. In the “constrained” dispersal regime model, the lineage dispersal weight function is conditional on the trophic level trait state of the lineage: if the lineage habitat trait state is “low,” then it the dispersal weight is 2.0, but if it is “high,” then the dispersal rate is set 0. This latter configuration thus only permits lineages with the “low” trophic level trait to disperse. We pruned from the phylogeny any species for which no trophic level or guild information was available.

#### Simulation calibration

The simulation calibration process parameters were then estimated separately on each trait-specific pruned phylogeny: the birth rate, $b$ under a pure-birth model using

#### Training data generation and classification analysis

For each of the analyses, we generated an independent training data set consisting of 100 replicates under the “unconstrained” dispersal regime and another 100 under the “constrained” dispersal regime. Each simulation was set to terminate when the number of lineages generated in the focal areas were equal to the number of lineages in the pruned trait-specific phylogeny. For each training data set, we constructed a discriminant analysis function classification function and applied it to the empirical data to classify it with respect to the generating regime, “unconstrained” or “constrained,” following the approach described previously. We then conducted a fourth analyses for each trait, constructing a discriminant analysis classification function based on the union of the previous three training data sets, and applying this to the empirical data as well. In all cases, the number of principal component axes retained was selected using the heuristic we describe above.

## R esults

### Process Parameter Estimates Under Unconstrained and Constrained Dispersal Regimes

In general, the maximum likelihood estimates of the birth rate track the simulation birth rate values closely, and there are no strong differences between the dispersal regimes ( Fig. 5 a). The long tails of the violin plots are indicative that most parts of the parameter space explored included moderate to very strong extirpation processes which results in extinctions that are not accounted for by the pure-birth model.

As with the birth rates, the estimated trait transition rates generally track the true trait transition rate closely, with no strong differences between the dispersal regimes ( Fig. 5 b). As the true trait transition rate increases, however, to an order of magnitude higher than the birth rate, the variance in the estimates of the trait transition rates increases dramatically, with a very strong over-estimation bias.

The estimated rates of dispersal also generally track the true rates, and the centers of mass of the estimates under both regimes are generally co-located except in the most extreme dispersal rates, that is, with dispersal rates half or a full order of magnitude higher than the birth rates ( Fig. 5 c). However, the differences in variances between the dispersal regimes are striking. In particular, the variance in the estimates of the dispersal rate in the constrained model are extremely exaggerated. This does not affect our estimation method directly, as we do not use or attempt to estimate the dispersal rate, but rather rely on patterns resulting from the partitioning of the dispersal rates between lineages. However, this variance in effective global dispersal rate might lead to reduction of power in our method, as any differences due to how the dispersal is partitioned or distributed across various components of the system may be obscured. This variance in dispersal rates is due to the violation of the assumptions of the DEC model with respect to dispersal. As noted above, in the constrained dispersal regime we weight the dispersal of lineages associated with the open habitat trait twice the global dispersal rate, to maintain, on the average, the same dispersal flux as the unconstrained dispersal regime. As can be seen here, this has the desired effect. The noise that we are seeing is that the weighting factor of 2 is an expectation based on the stationary distribution of the habitat trait state. This expectation, however, is only an expectation, and the stochasticity of the trait transition process adds noise to the dispersal estimates, as, in any one (constrained dispersal regime) simulation a varying and unpredictable number of lineages are not dispersing. The variance of the trait transition process increases strongly as the trait transition rates get large, as seen in Figure 5 b. If we “correct” for the trait transition rate, as can be seen in Figure 5 d, the dispersal rates of both the constrained and unconstrained dispersal regimes are more concordant.

### Performance Over Parameter Space

As might be expected, the general performance of the method is dominated by the relative, rather than absolute, birth rates ( Fig. 6 a). Reducing the birth rate to a relative scaling factor and marginalizing over extirpation rates shows that there is a clear area where our method performs well and a clear area where it does not, and these areas can be defined in terms of the rates of trait evolution and dispersal relative to the birth rate as well as each other ( Fig. 6 b): the method performs best when the dispersal rate is *equal to or greater* than the trait transition rate. This becomes strikingly evident in Fig. 6 c, which shows the differences in performance when the dispersal rate is less than, equal to, or greater than the trait transition rate. The birth rate is shown on the $y$ -axis, and, as can be seen, the performance of the method is dominated by the relative values of the dispersal rate and the trait transition rate. The regionalization of parameter space in terms of performance at an even higher resolution, faceting the performance plots by the relative values of the transition rates to birth rate in columns and the dispersal rates to birth rates in rows shows that, in addition to the dispersal rates being higher than or equal to the trait transition rates, the optimum performance is gained when the birth rate, as well, is higher than or equal to the trait transition rate ( Fig. 6 d).

There are two important observations to be noted in the results performance over parameter space as measured in term of posterior probabilities found for the true versus false models ( Fig. 7 ). First, that there is no difference in response between the “constrained” and “unconstrained” cases, that is, the performance of the method is invariant with respect to the true model. The second, and perhaps for greater practical interest, is that the posterior of the false model very rarely reaches or exceeds 0.90, even in the suboptimal parts of parameter space where it is preferred over the true model. In contrast, the support for the true model generally exceeds 0.90, and often 0.99, in the optimal parts of parameter space.

### Sensitivity to Calibration Errors

The method is sensitive to errors in the birth rate parameter (shown in the left-most panel) when the true model is “unconstrained,” and somewhat less so when the true model is “constrained” ( Fig. 8 , left-most column). We suggest that, in the extreme cases, these results are an artifact of speciation occurring either too slow or too fast to generate a pattern of diversity (with distributions of traits and areas across lineages) for the summary statistics to capture. The method is also sensitive to error in the dispersal rate parameter, especially when the error is due to an underestimate in the dispersal rate and the true model is “unconstrained” ( Fig. 8 , center column). In both the “unconstrained” and “constrained” cases, the method is fairly robust to an overestimate of the dispersal rates, up to an order of magnitude in the former and two orders of magnitude in the latter. The method is least sensitive to error in the trait evolution transition rate parameter, with very little discernible reduction in performance over *two* orders of magnitude difference in rate, regardless of whether or not the true model is “unconstrained” or “constrained” ( Fig. 8 , right-most column).

### Sensitivity of Model Choice to Cladogenetic Range Inheritance Mode Misspecification

Over the range of parameter space that we explored, inclusion or exclusion of founder-event jump dispersal in either the test (true) model or the estimation model did not have a significant effect on performance. This was verified by two different ANOVA tests using, as predictor variables, the analysis condition (i.e., whether or not jump dispersal was included or excluded in either the true or estimation models for each analysis). In the first test, the proportion of correctly classified dispersal regimes was used as the response variable, and this yielded a $P$ -value of $P=0.596$ . In the second test, the posterior probability of the true model under the classification was used as the response variable, and this yielded a $P$ -value of $P=0.739$ . Inclusion or exclusion of founder event jump dispersal in the estimation model, regardless of whether or not it occurs in in the true model, is of secondary importance when compared with location in parameter space, and the rate of dispersal relative to the rate of trait transition dominates performance ( Fig. 9 ). For the analytical objective of determining whether or not a trait regulates dispersal, the cladogenetic founder-event jump dispersal is essentially a nuisance process, and even if it might be statistically significant, it is not of analytical significance as it has little effect relative to other factors on the analytical results.

### Dispersal in Island Birds

#### Habitat-dependent dispersal

The proportion of correct (re-)classifications when the discriminant analysis function constructed on the union of all three training data sets was applied to the training data ranged from 0.76 to 0.81 ( Table 1 ). More informative is that we see that there is very little bias in the method in most cases: both the correct classifications as well as the incorrect classifications are more or less equally distributed between the “constrained” and “unconstrained” models.

Supplementary areas | Retained PC axes | Proportion of correct reclassifications | Mean posterior of true model | Proportion of correct classifications when true model was “Unconstrained” | Proportion of correct classifications when true model was “Constrained” | Proportion of incorrect classifications when true model was “Unconstrained” | Proportion of incorrect classifications when true model was “Constrained” |
---|---|---|---|---|---|---|---|

0 | 32 | 0.76000 | 0.67332 | 0.50000 | 0.50000 | 0.50000 | 0.50000 |

1 | 22 | 0.81000 | 0.68190 | 0.51235 | 0.48765 | 0.44737 | 0.55264 |

2 | 20 | 0.82500 | 0.72939 | 0.49091 | 0.50909 | 0.54288 | 0.45714 |

0 + 1 + 2 | 38 | 0.69000 | 0.59101 | 0.49758 | 0.50242 | 0.50538 | 0.49462 |

Supplementary areas | Retained PC axes | Proportion of correct reclassifications | Mean posterior of true model | Proportion of correct classifications when true model was “Unconstrained” | Proportion of correct classifications when true model was “Constrained” | Proportion of incorrect classifications when true model was “Unconstrained” | Proportion of incorrect classifications when true model was “Constrained” |
---|---|---|---|---|---|---|---|

0 | 32 | 0.76000 | 0.67332 | 0.50000 | 0.50000 | 0.50000 | 0.50000 |

1 | 22 | 0.81000 | 0.68190 | 0.51235 | 0.48765 | 0.44737 | 0.55264 |

2 | 20 | 0.82500 | 0.72939 | 0.49091 | 0.50909 | 0.54288 | 0.45714 |

0 + 1 + 2 | 38 | 0.69000 | 0.59101 | 0.49758 | 0.50242 | 0.50538 | 0.49462 |

*Notes:* The proportion of correct classifications when the true model was unconstrained and constrained colums show how the correctly classified entries are distributed between unconstrained and constrained data. In these two columns, the ideal value should be 0.50, indicating that the method does not perform better in correctly classifying unconstrained cases versus constrained or vice versa. Similarly, the proportion of incorrect classifications when the true model was unconstrained and constrained show how the *incorrectly* classified entries are distributed between the unconstrained and constrained data. Again, in these two columns, the ideal value should be 0.50, that is, indicating that there is no bias toward misclassifying either unconstrained or constrained cases.

Regardless of the number of supplemental areas used, all analyses strongly supported a “constrained” dispersal regime, that is that dispersal is indeed constrained to lineages associated with the open habitats, with a posterior probability of greater than 0.99 ( Table 2 ). Even though this analysis falls in a suboptimum region of parameter space (c.f. Figure 6 ), the very high posterior probability in support for the “constrained” model is encouraging, given the results shown in Figure 7 .

Supplementary areas | Retained PC axes | Assigned dispersal regime | Posterior |
---|---|---|---|

0 | 31 | Constrained | 0.99418 |

1 | 40 | Constrained | 0.99992 |

2 | 34 | Constrained | 1.00000 |

0 + 1 + 2 | 40 | Constrained | 0.99981 |

Supplementary areas | Retained PC axes | Assigned dispersal regime | Posterior |
---|---|---|---|

0 | 31 | Constrained | 0.99418 |

1 | 40 | Constrained | 0.99992 |

2 | 34 | Constrained | 1.00000 |

0 + 1 + 2 | 40 | Constrained | 0.99981 |

#### Trophic-level dependent dispersal

The discriminant analysis function constructed on zero supplemental areas had a stronger tendency to misclassify the cases where the true model was “constrained,” and this effect presumably influenced the cases where the training data set used the union of the different supplemental area configurations ( Table 3 ). However, the analyses with one or two supplemental areas were unbiased with respect to the generating model, and the use of two supplemental areas had the peak proportion of correction (re-)classifications, with 0.91 replicates in the training data set correctly classified. In constrast to the analyses of habitat-dependent dispersal, application of the discriminant analysis function on the empirical data resulted in strong support for the *non* -constrained regimes: that is, that trophic level does *not* regulate or drive the dispersal process in Wallacean birds ( Table 4 ).

Supplementary areas | Retained PC axes | Proportion of correct reclassifications | Mean posterior of true model | Proportion of correct classifications when true model was “Unconstrained” | Proportion of correct classifications when true model was “Constrained” | Proportion of incorrect classifications when true model was “Unconstrained” | Proportion of incorrect classifications when true model was “Constrained” |
---|---|---|---|---|---|---|---|

0 | 26 | 0.84000 | 0.76132 | 0.52381 | 0.47619 | 0.37500 | 0.62500 |

1 | 26 | 0.84000 | 0.74755 | 0.50000 | 0.50000 | 0.50000 | 0.50000 |

2 | 23 | 0.91000 | 0.83198 | 0.50000 | 0.50000 | 0.50000 | 0.50000 |

0 + 1 + 2 | 27 | 0.81167 | 0.72967 | 0.50719 | 0.49281 | 0.46903 | 0.53097 |

Supplementary areas | Retained PC axes | Proportion of correct reclassifications | Mean posterior of true model | Proportion of correct classifications when true model was “Unconstrained” | Proportion of correct classifications when true model was “Constrained” | Proportion of incorrect classifications when true model was “Unconstrained” | Proportion of incorrect classifications when true model was “Constrained” |
---|---|---|---|---|---|---|---|

0 | 26 | 0.84000 | 0.76132 | 0.52381 | 0.47619 | 0.37500 | 0.62500 |

1 | 26 | 0.84000 | 0.74755 | 0.50000 | 0.50000 | 0.50000 | 0.50000 |

2 | 23 | 0.91000 | 0.83198 | 0.50000 | 0.50000 | 0.50000 | 0.50000 |

0 + 1 + 2 | 27 | 0.81167 | 0.72967 | 0.50719 | 0.49281 | 0.46903 | 0.53097 |

Supplementary areas | Retained PC axes | Assigned dispersal regime | Posterior |
---|---|---|---|

0 | 26 | Unconstrained | 0.99789 |

1 | 26 | Unconstrained | 0.99619 |

2 | 23 | Unconstrained | 0.73873 |

0 + 1 + 2 | 27 | Unconstrained | 0.98983 |

Supplementary areas | Retained PC axes | Assigned dispersal regime | Posterior |
---|---|---|---|

0 | 26 | Unconstrained | 0.99789 |

1 | 26 | Unconstrained | 0.99619 |

2 | 23 | Unconstrained | 0.73873 |

0 + 1 + 2 | 27 | Unconstrained | 0.98983 |

## D iscussion

### The Archipelago Model

The Archipelago model is an extremely flexible biogeographical phylogenesis model that not only simulataneously integrates the diversification, trait evolution, and the geographical evolution processes in a single unified superprocess, but also provides for these subprocesseses to inform each other dynamically as the phylogeny grows. This provides for the unique possibility of analyzing the influences of ecology, or any other biological trait, on *both* the diversification and geographical evolution process, in a way that is not possible using any other current biogeographical model.

In addition, unlike other existing methods, with our method, the ability to partition the geographical system into focal areas and supplemental areas, allowing the two classes of areas to interact biogeographically, yet only sampling data from the focal areas, solves a major conceptual limitation. Specifically, with community-based biogegoraphical analyses using previous approaches, the simulated data are typically generated under, for example, a birth–death model, which results in a strict and complete monophyly of all sampled lineages: the entire diversity of the region is assumed to have evolved in situ. However, if the study region exchanges fauna with external regions, then the simulated phylogenies fail to account for the fact that the empirical phylogeny is really paraphyletic with respect to the full diversity of the radiation unless there has only been a single colonization of the system. With some groups of limited scope, this latter case might indeed be true, but in most cases this is not.

### Simulation-Trained Discriminant Analysis of Principal Components Classification for Biogeographical Model Choice

We use a canonical discriminant analysis function trained by simulated data to classify a target data set, based on principal components of summary statistics calculated on both data sets. This method is extremely flexible, in that it can be used to carry out model selection with models for which likelihoods are too expensive to calculate in either an MCMC or maximum-likelihood optimization heuristc framework, or even cannot be calculated at all due to the difficulty in formulating a likelihood statement. In this respect, it is very similar to other simulation-based alternatives to full likelihood inference, such as Approximate Bayesian Computation (ABC) approaches (e.g., Beaumont et al. 2002 ; Beaumont 2010 ; Marjoram et al. 2003 ). However, the posterior probabilities of the models in the approach we use are not directly comparable to the posterior probabilities of either full likelihood Bayesian or ABC approaches for a number of reasons. For example, we develop the discriminant analysis functions based on simulations calibrated with process parameter values estimated from the original data, whereas both full likelihood Bayesian and ABC approaches require the prior be specified without using the original data (though empirical Bayesian approaches use maximum-likehood estimates of nuisance parameters). Even without this issue, however, (e.g., if we were to use an independent data set to estimate these parameters), it is still not entirely clear if the posterior probabilities can be compared, due to a number of other factors in our approach that are difficult to relate to these other approaches; for example, the selection of the number of PC axes to retain in the construction of the DA function. As such, we prefer to view our approach as more akin to a machine-learning approach, that is, model classification as opposed to model selection, which, indeed, strictly speaking it truly is. The posterior probabilities obtained by our approach might be taken to be the relative strength of support for different candidate models in the same pool (i.e., within the same single discriminant analysis), but at this point more work needs to be done to establish how to compare these values across different analysis or the posterior probabilities produced by full likelihood or ABC approaches.

We cannot use full likelihood approach to estimate parameters under all but the simplest case of our model (e.g., where rates of speciation, dispersal, and so on are not dynamically informed by the states of lineages, but are instead fixed to be true homogenous Poisson processes), but why do we not use ABC? The primary reason is that summary statistic development in ABC is very complex, requiring detailed exploration of parameter space to establish the efficacy of the summary statistics with respect to the particular configuration of model space and data ( Robert et al. 2011 ). Approximate Bayesian Computation has been shown to be a fairly robust procedure for parameter estimation, but it has been shown to have many issues with model selection, and it takes an immense amount of effort to develop summary statistics that are effective for a particular model pool, and even more effort to *demonstrate* this effectiveness and to understand and characterize the biases that might arise with any particular suite of summary statistics (e.g., Oaks et al. 2013 , 2014 ). Although we do conduct such a detailed exploration here, in principle, with the DAPC approach, this is not neccessary in practice for any particular study: re-application of the discriminant analysis classification on the training data set allows immediate evaluation of the efficacy of the summary statistics used in terms of the proportion of elements in the training data that were classified correctly (supplemented by, if preferred, the mean posterior probability of the true model), as well as the biases that might be apparent (e.g., if one model is preferred more than any other when mis-classified). Furthermore, with ABC approaches, selection of summary statistics is always a balance between not having too many summary statistics such that it becomes difficult to simulate data that falls within the acceptence threshold, or having too few such that resolving power is lost. With the DAPC approach used here in conjunction with the heuristic we developed to select the number of principal component axes to retain, this issue is largely alleviated. Another reason to use the DAPC approach over ABC is due to computational efficiency. An ABC analysis requires a large number of simulations to adequately sample parameter space, unless these are fixed or reduced by using empirically informed priors, though the latter approach needs to be used with caution ( Oaks et al. 2014 ). As shown here, effective results are available with our approach with as few as 200 simulation replicates.

Our analysis indicates that there is a “sweet spot” in which the method performs particularly well and reliably: when the dispersal rate is higher than the birth rate, and the trait transition rate is lower than the birth rate. This is, of course, related to the particularities of the summary statistics that we use, and the relative high dispersal rate and low trait transition rate allows for the geographical signal to accumulate across multiple species with the same trait. Conversely, we found that our approach is most challenged in conditions where the trait transition rate is very high relative to the dispersal rate and birth rate. This is not, as might naively be assumed, due to phylogenetic saturation of signal in the trait transition rate. Saturation in phylogenetic signal is when the rate of change is high enough relative to a branch (or, alternatively, a branch is long enough relative to the rate) that the character achieves its equilibrium frequency, making it difficult to estimate the ancestral state. However, the issue here is different: we are not attempting to estimate the rate of trait evolution or ancestral states of the trait, but to identify whether or not dispersal is constrained by the trait. When the the rate of trait transition is high relative to the speciation rate, it becomes difficult to determine this, regardless of whether or the trait transition signal is saturated or not. The reason for this is not saturation of the trait evolution signal, but rather that, if a trait flips back and forth enough times along the lifespan of a lineage, the lineage does not spend sufficient time in one state or another for the signal to accumulate under any trait state in particular, and the signal essentially is a weighted mixture of multiple traits. For example, even if dispersal is only possible while a lineage is in trait state 0 and not while in trait state 1, if the rate of trait evolution is fast enough that on the average the lineage is in trait state 0 half the time and 1 in the other, the signal of constrained dispersal would be reduced by half.

There are four major limitations with our approach to model choice via simulation-trained discriminant analysis classification, at least as presented here.

The first is that, while immensely more tractable than ABC, as with ABC, the inference framework that we describe still requires formulation and testing of summary statistics if applied to different models and questions. The summary statistics that we present here are explicitly designed to capture differences in the patterns of how areas and traits sort themselves on a phylogeny and are partitioned with respect to each other. This is suitable for the types of studies presented here, that is, the relationship between dispersal and traits, and as long as users apply the analysis to this class of questions and data, our approach may be used as-is. For the broader classes of studies that might be carried out with the Archipelago model, or, even more generally, a simulation-trained machine learning approach, users will have to at the very least validate the performance of the summary statistics and, if found inadequate, find or develop new ones that are needed. As noted above, however, this task is greatly facilitated by the ease and effectiveness of summary statistic assessment through reapplication of the result discriminant analysis function on the training data.

The second and third issues are related: the need for external estimation of parameters, and accounting for uncertainty in these estimates. As presented here, the speciation, extinction, dispersal, and trait transition rate parameters of the model need to be calibrated using estimates made on the empirical data. This is less than ideal both for theoretical as well as practical reasons. The theoretical objection is related to the same objection expressed against empirical Bayesian approaches, that is, in some sense the data are used “twice.” We assert that this is not a salient concern in our case, as we are not trying to recover, even approximately, posterior probabilities comparable to Bayesian posterior probabilities, but are rather focussing on model choice. The practical concern is that the dispersal parameter, in particular, is computationally too intensive to estimate once the number of areas gets very large using either

Furthermore, even if the parameters can be obtained readily from the empirical or target data to calibrate the training data simulations, we do not treat the uncertainty in these estimates in the approach that we present here. To some extant, it can be argued that the “fuzziness” of our approach allows for a large degree of slop in the calibration parameters, and, indeed, we demonstrated this in our results, where, depending on the parameters, errors of moderate to even large degrees of magnitude can be tolerated. One modification to our approach that allows both for treatment of uncertainty as well as avoiding the need to rely on other methods to estimate the calibration parameters is possible, albeit at an obvious and substantial increase in computational cost. Basically, the training data sets need to be constructed over a broad range of parameter space, with the parameter values of each training data element being sampled randomly from a prior distribution, thus both obviating the need for external estimates of calibration parameters as well as integrating out uncertainty in these parameter values. This brings our approach closer to ABC operationally, though theoretically both are very distinct, and operationally the two remain different in how the model support is assessed. For example, in our empirical application presented here, the fourth class of training data set, that is the one that combined zero, one, and two supplementary areas, can be seen as implementing this concept to integrate out uncertainty in the number of supplemental or external areas that might be interacting with the focal areas under a prior that is uniform over the range $U{0,2}$ .

Like all model selection approaches, full likelhood Bayesian or otherwise, the analysis is unable to choose a model that it is not offered, and thus, in a sense, any analysis result can only be as “true” as the models that are constructed by the researcher. And we assert that this is not a problem, or a limitation, but rather the basis of model-based science. Although model- *selection* or choice should be carried out through rigorous and objective statistical procedures, model *formulation* is, we suggest, the primary task of the investigator, *not* the statistical apparatus. Formulating a model is the way for investigators to express their ideas, expertise, knowledge to a statistical framework. As the first step in an analysis, it is essentially equivalent to asking the question, and although statistics and data have the capacity to answer questions, asking the question remains and should always remain the role of the investigator.

## S upplementary M aterial

Data available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.s43f8 .

## F unding

This work was supported by subsidy funding from OIST and an NSF grant (DEB-1145989) to E.P.E. and L.L.K.

## A cknowledgements

We wish to thank the Editor-in-Chief, Dr Frank (Andy) Anderson, the Associate Editor, Dr Tiffani Williams, as well as two reviewers, Dr Nick Matzke and an anonymous one, for their time and effort in providing detailed comments and suggestions to improve the article. We also wish to thank Dr Mark T. Holder at the University of Kansas for his generosity with the HPC computational resources of his lab, without which this work would not have been possible. All plots were done using the

## R eferences

*Pheidole*