-
PDF
- Split View
-
Views
-
Cite
Cite
Dean C. Adams, A METHOD FOR ASSESSING PHYLOGENETIC LEAST SQUARES MODELS FOR SHAPE AND OTHER HIGH-DIMENSIONAL MULTIVARIATE DATA, Evolution, Volume 68, Issue 9, 1 September 2014, Pages 2675–2688, https://doi.org/10.1111/evo.12463
Close - Share Icon Share
Abstract
Studies of evolutionary correlations commonly use phylogenetic regression (i.e., independent contrasts and phylogenetic generalized least squares) to assess trait covariation in a phylogenetic context. However, while this approach is appropriate for evaluating trends in one or a few traits, it is incapable of assessing patterns in highly multivariate data, as the large number of variables relative to sample size prohibits parametric test statistics from being computed. This poses serious limitations for comparative biologists, who must either simplify how they quantify phenotypic traits, or alter the biological hypotheses they wish to examine. In this article, I propose a new statistical procedure for performing ANOVA and regression models in a phylogenetic context that can accommodate high-dimensional datasets. The approach is derived from the statistical equivalency between parametric methods using covariance matrices and methods based on distance matrices. Using simulations under Brownian motion, I show that the method displays appropriate Type I error rates and statistical power, whereas standard parametric procedures have decreasing power as data dimensionality increases. As such, the new procedure provides a useful means of assessing trait covariation across a set of taxa related by a phylogeny, enabling macroevolutionary biologists to test hypotheses of adaptation, and phenotypic change in high-dimensional datasets.
In comparative biology it is generally accepted that the evaluation of evolutionary correlations among traits requires a phylogenetic perspective. Because of shared evolutionary history, species in a phylogeny are not independent (Felsenstein 1985; Harvey and Pagel 1991; Felsenstein 2004; Revell et al. 2008; Pennell and Harmon 2013). This leads to phenotypic similarities among taxa, whose expected covariance may be described in proportion to their shared ancestry under a specified model of trait evolution (Felsenstein 1985; Grafen 1989; Martins and Hansen 1997; Rohlf 2001). As a consequence, statistical procedures that assess patterns of covariation among traits from cross-species data must take the evolutionary relationships among taxa into consideration during the analysis, so that the lack of independence among species may be properly accounted for (Revell 2010). In these circumstances, when evolutionary correlations among traits are identified via phylogenetic comparative methods, such correlations provide important evidence of adaptation and inform on the evolutionary causes of phenotypic change (Garland et al. 1992; Rüber and Adams 2001; Pennell and Harmon 2013), as the potential effect of shared evolutionary history may be ruled out.
Over the past several decades, numerous analytical approaches have been developed to account for phylogenetic nonindependence while assessing patterns of trait covariation (Cheverud et al. 1985; Felsenstein 1985; Grafen 1989; Martins and Hansen 1997; Diniz-Filho et al. 1998; Garland and Ives 2000; Guill et al. 2003). Of the approaches available, phylogenetic regression (Felsenstein 1985; Grafen 1989) is most commonly used for assessing evolutionary associations in a comparative framework. Phylogenetic regression, which can be implemented either using independent contrasts (PIC: Felsenstein 1985) or generalized least squares (PGLS: Grafen 1989; Martins and Hansen 1997), evaluates trait associations as described by linear models, where one trait is treated as the dependent (response) variable, and one or more additional traits are treated as explanatory (independent) variables (Grafen 1989; Rohlf 2001; Pennell and Harmon 2013: described below). Most empirical studies of evolutionary covariation examine patterns between one or a few traits. Some classic examples include characterizing the evolutionary correlations between body mass and home range size in mammals (Garland et al. 1992), evaluating the correlation between life history traits and growth-related variables in maples (Ackerly and Donoghue 1998), assessing the relationship between limb measurements and functional performance in lizards (Losos 1990), and evaluating the association between genome size and developmental rate in salamanders (Sessions and Larson 1987).
Phylogenetic regression can also be used to evaluate evolutionary patterns in multivariate datasets. In this case, the response variable may be a set of measured traits treated simultaneously, or may be a complex multidimensional trait like shape (for instance, as quantified from geometric morphometric methods: Bookstein 1991; Adams et al. 2013). Indeed, several studies have investigated multivariate phenotypic evolution using phylogenetic regression using shape as data (e.g., Rüber and Adams 2001; McPeek et al. 2008; Monteiro and Nogueira 2011; Blankers et al. 2012; Klingenberg and Marugán-Lobón 2013; Outomuro et al. 2013a,b; Piras et al. 2013; for applications to other multivariate data types see: Lawing and Polly 2011). However, one underappreciated challenge for these approaches when evaluating highly multivariate datasets is that as the number of trait dimensions equals or exceeds the number of species in the phylogeny (p ≥ N), parametric methods such as multivariate regression and MANOVA cannot be used to assess significance. The reason for this is that when p ≥ N, the covariance matrices obtained from these models are singular. Thus, parametric test statistics that require inverting a covariance matrix or finding its determinant (Wilks’ lambda, Pillai's trace, etc.) cannot be computed, and the matrix computations for obtaining the likelihood of such models cannot be completed (see Adams 2014a). Furthermore, for a given number of species, it may be expected that the ability to detect evolutionary patterns in such data will decrease as trait dimensionality increases (a property that is demonstrated below).
Thus, while standard procedures for implementing phylogenetic regression can accommodate multivariate datasets, their utility is compromised by the complexity of the phenotypic data under consideration. For this reason, numerous authors have reduced the dimensionality of their data by representing it using a smaller set of principal component axes before implementing phylogenetic analyses (Bergmann et al. 2009; Nogueira et al. 2009; Monteiro and Nogueira 2011; Brusatte et al. 2012: see recommendations in Monteiro 2013). However, this procedure is not optimal, because one cannot assume that patterns of evolutionary covariation between multivariate phenotypes and other factors align with the major axes of variation that summarize the phenotypic traits (Monteiro 2013). Therefore, for high-dimensional phenotypic datasets, an alternative procedure is required.
In this article, I propose a new statistical procedure for evaluating linear models in a phylogenetic context, under a Brownian motion model of evolution, for high-dimensional multivariate datasets like shape. The approach is derived from the statistical equivalency between methods based on covariance matrices and those based on distance matrices. I show that for univariate data, the distance-based approach (D-PGLS) provides numerically identical estimates of evolutionary patterns to those obtained from standard implementations of phylogenetic regression, demonstrating the statistical equivalency between the two. I then demonstrate that the approach displays acceptable Type I error and high statistical power for detecting evolutionary correlations between an independent variable and high-dimensional multivariate traits like shape. Further, I show that for multivariate data, standard implementations of phylogenetic regression exhibit decreasing statistical power with increasing dimensionality of the response variable. Thus, these methods are hampered in their ability to identify evolutionary associations in a phylogenetic context for high-dimensional phenotypic data. Finally, I present a biological example demonstrating the utility of the new approach for high-dimensional datasets. Computer code written in R for implementing the procedure is also provided.
Phylogenetic Regression via Generalized Least Squares
One approach for assessing the evolutionary covariation between variables in a phylogenetic context is phylogenetic regression, or phylogenetic generalized least squares (PGLS: Grafen 1989; Martins and Hansen 1997). For regression models, PGLS is mathematically equivalent to phylogenetically independent contrasts (PIC: Garland and Ives 2000; Rohlf 2001; Blomberg et al. 2012). However, phylogenetic regression can more easily accommodate alternative statistical designs such as ANOVA and more complicated factorial designs (Pennell and Harmon 2013), and can be used with different evolutionary models (see below). With phylogenetic regression, phylogenetic relationships are taken into account while parameterizing the linear model , which is then statistically evaluated. Model parameterization is accomplished by incorporating the expected covariance due to the phylogeny into the residual error, ε (Martins and Hansen 1997). Under a Brownian motion model of evolution, the lack of independence due to shared ancestry is described by the phylogenetic covariance matrix (C), which is an N × N matrix whose diagonal elements contain the phylogenetic distance from the root of the tree to each of the N species tips, and the off-diagonals contain the phylogenetic distances from the root of the tree to the most recent common ancestor for each pair of species (Garland and Ives 2000; Rohlf 2001). In some cases, suitable branch-length transformations of C can be used to represent alternative evolutionary models, such as Ornstein–Uhlenbeck processes or an ACDC model of evolutionary change (see Butler et al. 2000; Blomberg et al. 2003).
Next, the total sums of squares (SSTot) is obtained by repeating equations 1–3 on a design matrix X containing only the column of ones. Variation explained by the regression model is then estimated as SSTot – SSResid, from which mean squares, F-ratios, and R2 may be obtained and evaluated. For more complicated statistical designs, multiple terms may be included in X, and the statistical relationship between each X-variable and Y may be determined by repeating equations 1–3 above in a sequential fashion, each time removing a term from the model to produce a new reduced model Xred (see Rencher and Christensen 2012). Finally, for multivariate Y data, the analytical procedure described here is identical, but in this case sums-of-squares and cross-product matrices (SSCP) are obtained rather than univariate sums-of-squares (Rencher and Christensen 2012).
Phylogenetic Regression and the Problem of High-Dimensional Data
Phylogenetic regression provides a flexible analytical tool for assessing the degree of evolutionary association between variables while accounting for phylogeny. Statistically, phylogenetic regression is a parametric method that summarizes information in Y based on variances and covariances. As such it is considered an R-mode technique (sensu Legendre and Legendre 1998; see also Krzanowksi 1993; Rencher and Christensen 2012). However, one complication with R-mode techniques is that for a given sample size, as the number of trait dimensions (p) of the response data increases, the ability to detect patterns in the data decreases (Rao 1966), because the number of parameters to be estimated increases with the number of dependent variables. This phenomenon is known as Rao's paradox, and its effects are well-appreciated for standard multivariate R-mode methods such as Hotelling's T2 and MANOVA (Healy 1969; Olson 1974; Stevens 1980). Importantly, Rao's paradox should also apply to phylogenetic regression, because it uses the same algebraic machinery to obtain variances and parameter estimates as is implemented in regression and ANOVA (equations 1–3 above). Thus it is expected that the statistical power to detect patterns in a phylogenetic context using standard implementations of phylogenetic regression will decrease as the dimensionality of the dependent variables (Y) increases. This relationship is demonstrated empirically below. Furthermore, when the number of trait dimensions equals or exceeds the number of species in the phylogeny (p ≥ N), significance testing using standard approaches to phylogenetic regression cannot be completed, because the covariance matrices obtained from the model are singular, and thus parametric test statistics that require inverting a covariance matrix cannot be obtained.
Unfortunately, the implications of these statistical shortcomings have not been widely appreciated by comparative biologists, though they pose severe limitations on the assessment of multivariate data in a phylogenetic context. Specifically, because of Rao's paradox, phylogenetic assessments of phenotypic data will be compromised by the power of phylogenetic regression to detect patterns in these data, because their statistical power is a function of sample size (N) relative to trait dimensionality (p). Thus, it will generally be more difficult to detect evolutionary patterns in phenotypes that are characterized multivariately (like shape), than it will be to identify patterns in single, univariate traits such as body size. Further, in the extreme case where only a few species exist in a clade, evolutionary biologists will be limited to examining single traits, or must simplify their data through dimension reduction approaches (e.g., principal components) so that the statistical analyses themselves may be completed (see discussion in Monteiro 2013). The undesirable consequence is that methodological limitations are restricting the scope of phenotypic traits that are able to be examined with standard procedures. In essence, comparative biologists are unable to fully test hypotheses of adaptation in complex phenotypic traits because they are constrained by the analytical tools currently at their disposal for evaluating patterns in such data. Therefore, to assess evolutionary patterns in high-dimensional phenotypic traits using a phylogenetic perspective, an alternative analytical framework is required.
In classical multivariate analysis, some of the challenges of analyzing high-dimensional data may be alleviated by using procedures based on the matrix of pairwise distances among specimens, rather than methods that use covariance matrices (Krzanowksi 1993; Rencher and Christensen 2012). Such distance-based (Q-mode) methods provide a complementary (dual) view of multivariate data, as they summarize the same information represented by covariance matrices but do so in a different manner (Krzanowksi 1993; Legendre and Legendre 1998). Further, for Euclidean datasets there exists a statistical equivalency between distance-based and covariance-based approaches for many procedures such that empirical results derived from both methods are numerically identical. This equivalency has been demonstrated analytically for ordination methods (principal components analysis versus principal coordinate analysis: Gower 1966), sums-of-squares from linear models (ANOVA and regression versus permutational-MANOVA: Anderson 2001; McArdle and Anderson 2001), methods for estimating rates of phenotypic evolution in a phylogenetic context (Adams 2014a), and methods for estimating phylogenetic signal (Adams 2014b). Additionally, distance-based techniques have a critical advantage over covariance-based approaches when used on high-dimensional data, such that even if the number of trait dimensions exceeds the number of observations (p ≥ N), statistical summaries from Q-mode methods may still be computed (see Anderson 2001; McArdle and Anderson 2001; Zapala and Schork 2006). Thus, for high-dimensional datasets, distance-based procedures provide a practical and useful means of assessing statistical patterns.
A Q-Mode Phylogenetic Regression Method for High-Dimensional Data
Here I propose a distance-based (Q-mode) procedure for evaluating linear models for high-dimensional data in a phylogenetic context (hereafter: D-PGLS). The method assumes that trait variation accumulates over time following a Brownian motion model of evolution (Felsenstein 1973, 1988). Under Brownian motion, phenotypic changes are assumed to be independent from time step to time step, and variation increases proportionally with time (σ2t). For multivariate data, each trait dimension may differ in the rate at which variation accumulates, and changes may also be correlated across trait dimensions (Felsenstein 1988, 2004). For the procedure developed here, both the phenotypic data (Y) and the design matrix (X) are transformed by the phylogeny, which are then used to obtain matrices that capture variation in Y relative to X as an among-specimen distance matrix. The sums of squares explained by the model is then estimated from the distances among species in the phylogeny-transformed multivariate space, and factors in the multivariate model are evaluated statistically using permutation.
Additionally, a N × 1 column vector of ones is transformed as . From the transformed data, predicted values ( ) from the linear model: Yphy ∼ Xphy are obtained. Predicted values from Yphy ∼ 1phy are also estimated ( ). Note that unlike standard linear models, the values in the rows of are not identical. This is because the phylogenetic relationships among taxa have been taken into account, thereby adjusting the expected value for each species with respect to its relationship to the other species in the phylogeny.
This N × N outer-product matrix is known as the Gower-centered distance matrix, and represents the pairwise relationships among species in the high-dimensional data space (see Gower 1966; McArdle and Anderson 2001). If the design matrix X contains more than one factor, predicted values are obtained from the terms of Xphy in a sequential fashion to obtain the contributions of each factor to the model sums-of-squares (see Anderson 2001). Residual values are obtained from the full model (Yphy ∼ Xphy) and used to estimate the SSresid. Sums-of-squares are then used to obtain F-ratios and R2 values for all terms in the model. Next, the significance of each term in the model is assessed using permutation, in which the phenotypic data (Y) are permuted across the tips of the phylogeny, and the above procedure is repeated. Finally, to account for phylogenetic uncertainty the procedure described above can be repeated on an empirical distribution of alternative phylogenetic hypotheses to obtain a distribution of test statistics, from which confidence intervals can be derived (see, e.g., de Villemereuil et al. 2012).
It is important to recognize that for univariate data, the D-PGLS procedure outlined in equations (4)(7) yields statistical estimates that are numerically identical to those obtained from parametric implementations of phylogenetic regression. A demonstration of this property is found in Appendix A. Therefore, for univariate data, the distance—covariance equivalency between phylogenetic regression and D-PGLS has been preserved, since estimates of SS, F, and R2 for these models are the same. However, when used on multivariate data, D-PGLS is not restricted by the number of trait dimensions (p), and in fact may be used even when the number of trait dimensions exceeds the number of species. Additionally, from a computational perspective the approach proposed here is more efficient than standard procedures, as calculations based on the among species distance matrix require considerably less effort than those based on the variable covariance matrix (whose size increases dramatically as the number of trait dimensions increases). This is particularly important for phylogenetic assessments of traits such as shape, where the number of trait dimensions frequently exceeds the number of taxa in a phylogeny (e.g., McPeek et al. 2008; Klingenberg and Gidaszewski 2010; Adams 2014a). As such, D-PGLS alleviates the analytical challenges found when using phylogenetic regression with high-dimensional data. Computer code written in R for implementing the approach is found in Appendix B, and may be found in the R library geomorph (Adams et al. 2014).
Statistical Performance
To evaluate the statistical performance of D-PGLS, I executed a series of computer simulations. Simulations were designed to examine the statistical association between a multivariate response variable Y and a single continuous variable X. The multivariate data used in these simulations were generated under a Brownian motion model of evolution using two different patterns of error covariance: isotropic error and nonisotropic error. Initial simulations were conducted on a balanced phylogeny containing 32 species. Simulations under a wider set of conditions are found in the Supporting Information.
For each simulation, the number of trait dimensions for Y was first selected (p = 2, 10, 15, 20, 30). Next, input covariance matrices (S) of dimension (p+1) × (p+1) were constructed, and used to generate the phenotypic data (the additional dimension was included to simulate the data for the X variable). For simulations under an isotropic error model, the diagonal elements of S were all identical (σ2 = 1.0). The off-diagonal elements of S were then varied depending upon simulation conditions. Simulations evaluating type I error rates assumed no relationship between X and Y, and thus used no initial covariation between traits ( ). By contrast, simulations evaluating statistical power assumed a positive relationship between X and Y, and thus used positive initial levels of covariation between traits. The degree of covariation between X and Y varied depending upon the desired strength of the Y∼X relationship ( ). For simulations under a model of nonisotropic error, the diagonal elements of S were drawn from a normal distribution ( ) for each dimension, and the off diagonal elements of S were drawn from a normal distribution whose mean followed those used in the isotropic simulations ( ). Values were drawn repeatedly from these distributions until S satisfied all mathematical conditions of a covariance matrix.
From each initial covariance matrix S, 1000 phenotypic datasets were obtained by evolving multidimensional traits along the phylogeny following a Brownian motion model of evolution. Then, the evolutionary relationship between the multivariate response variable (Y) and the dependent variable (X) for each dataset was statistically evaluated using D-PGLS, as well as with standard phylogenetic regression methods (e.g., PIC). The proportion of significant results (out of 1000) was then treated as an estimate of the Type I error (when ) or statistical power (when ) of both approaches.
Simulations were also performed across a wider set of conditions to evaluate the robustness of D-PGLS. These simulations examined the effect of the number of taxa in the phylogeny (N = 16, 32, 64, 128), as well as the effect of randomly generated phylogenies on statistical performance (N = 16, 32, 64, 128). Further, simulations were executed to investigate the effect of random noise on the statistical performance of D-PGLS. Here, some trait dimensions of the response variable (Y) were obtained using the Y∼X relationship as described above, while other dimensions of the response variable contained random noise. Additional implementation details and results from all simulations are found in the Supporting Information.
RESULTS
For all simulations, D-PGLS displayed slightly conservative Type I error rates, ranging between the nominal α = 0.05 and α = 0.002. For both isotropic and nonisotropic conditions, Type I error became more conservative as traits of higher dimensionality were examined (Figs. 1 A, C). With respect to statistical power, the power of tests using D-PGLS increased rapidly as the degree of covariation between X and Y increased. Statistical power also increased as the number of trait dimensions (p) increased (Figs. 1 A, C; Supporting Information). This finding runs counter to Rao's paradox, and implies that for the same number of taxa, D-PGLS is capable of detecting evolutionary trends more easily in high-dimensional phenotypic traits than in data represented by fewer dimensions. This pattern is due to the fact that as additional dimensions of data are included the distances between specimens can only increase as the total dispersion among specimens increases. Thus, any covariation between the independent and dependent variables will be more easily detected, especially if the additional dimensions covary with X. A similar finding was obtained with Q-mode methods for comparing evolutionary rates for multivariate traits on phylogenies (Adams 2014a), and for estimating phylogenetic signal in multivariate data (Adams 2014b). Further, the power of tests using D-PGLS did not decrease as dimensions of random noise were added to the response variable Y (Supporting Information). Thus, unlike parametric approaches, the distance-based procedure is insensitive to the addition of variables containing unstructured variation. Together, these findings reveal that D-PGLS is capable of detecting significant associations between variables, even when the response variable is highly dimensional.
Simulation results evaluating the Type I error and statistical power of hypothesis testing procedures evaluating evolutionary covariation between a multivariate response variable (Y) and a continuous predictor variable (X) for the proposed approach D-PGLS (A, C) and standard phylogenetic regression (B, D). Data were simulated under a Brownian motion model of evolution on randomly generated phylogenies containing 32 species under an isotropic model of input covariance (panels A & B), and a nonisotropic model of input covariance (panels C & D). Curves for increasing numbers of trait dimensions are shown.
In contrast to results using D-PGLS, and as expected, the parametric version of phylogenetic regression displayed decreasing power as trait dimensionality (p) increased (Fig. 1 B, D). Further, for traits of high-dimensionality, the power of these approaches was 0.0, as the statistical significance from standard multivariate test statistics (e.g., Pillai's trace) was not able to be computed. Similar findings were revealed for the broader set of simulations examined in the Supporting Information. Overall these results demonstrate that for high-dimensional datasets, standard phylogenetic comparative approaches such as implementation with phylogenetic regression suffer from Rao's paradox, implying that the ability of these methods to detect evolutionary associations are compromised as trait dimensionality increases. Further, in cases where p approaches N, statistical assessment of evolutionary patterns cannot be evaluated with phylogenetic regression, because the matrix computations for assessing significance with these approaches are singular. By contrast, the Q-mode approach developed here (D-PGLS) does not suffer from these shortcomings, and displays increasing power as trait dimensionality increases. As such, D-PGLS provides a useful means of detecting patterns of evolutionary association in a phylogenetic context for high-dimensional phenotypic datasets.
A Biological Example
To illustrate the utility of the approach described above, I evaluated the degree of evolutionary association between body size and head shape (i.e., allometry) in a phylogenetic context in Plethodon salamanders. Allometry is the degree of covariation between size and shape (Gould 1966; Mosimann 1970), and thus describes the consequences of size changes on patterns of shape variation (Mitteroecker et al. 2013; Voje et al. 2014). Comparisons of allometric trajectories among species are frequently used to identify evolutionary changes in the size-shape relationship across taxa (e.g., Bookstein et al. 2003; Mitteroecker et al. 2004; Adams and Nistri 2010; Piras et al. 2010; Gunz 2012). Additionally, evolutionary allometry may be inferred by characterizing the relationship between size and shape among species in a phylogenetic context (e.g., Monteiro and Nogueira 2011; Cardini and Polly 2013; Klingenberg and Marugán-Lobón 2013; Outomuro et al. 2013b). In Plethodon, body size varies widely among species (Highton 1995; Adams and Church 2008, 2011), with species in some lineages displaying relatively small body sizes (e.g., P. cinereus species complex), while species in other lineages exhibit larger body sizes (e.g., P. glutinosus species complex). In many ecological communities, body size also appears to be a key trait in determining species coexistence in the group (Adams 2007). Similarly, head shape varies within and among species, and considerable evidence suggests that competitive interactions among species may drive morphological evolution (e.g., Adams 2010). Head shape also displays a strong genetic component (Adams 2011), thereby enabling microevolution via selection. Therefore, because both head shape and body size vary widely in the genus, it is of interest to determine whether the two traits coevolve among species allometrically.
To evaluate patterns of evolutionary allometry in Plethodon I used head shape data from 691 adult salamanders from 18 species (data from Maerz et al. 2006; Adams et al. 2007; Arif et al. 2007; Myers and Adams 2008; Adams 2010; Deitloff et al. 2013). For all individuals, 11 landmarks were digitized from left-lateral images of the side of each head (Fig. 2A), and geometric morphometric methods were used to generate a set of variables representing head shape (Bookstein 1991; Adams et al. 2013). Specifically, variation in the jaw position relative to the skull was mathematically standardized among specimens (Adams 1999), and a generalized Procrustes analysis was then performed to align specimens to a common coordinate system and remove variation in their position, orientation, and size (Rohlf and Slice 1990). This resulted in a set of shape variables (Procrustes tangent coordinates) for all specimens, from which the mean head shape was estimated for all 18 species. Typical adult body sizes for all species were obtained from Adams and Church (2008, 2011).
(A) Positions of 11 anatomical landmarks used to quantify head shape in Plethodon salamanders (image from [Adams et al. 2007]). (B) Fossil-calibrated molecular phylogeny displaying the estimated phylogenetic relationships among the species of Plethodon examined here. (C) Plot of phylomorphospace viewed as the first two principal component axes of tangent space. Thin-plate spline deformation grids representing exemplar individuals from each species complex are shown (magnified by 3×).
With these data, I examined trends of evolutionary allometry by assessing the multivariate regression between head shape and body size in a phylogenetic context using D-PGLS. A time-calibrated multigene molecular phylogeny for the genus (Wiens et al. 2006: Fig. 2B) was pruned to match the species in the dataset, and provided an estimate of the evolutionary relationships among taxa. For comparison I also performed a nonphylogenetic evaluation of evolutionary allometry using standard regression based on Procrustes distances (sensu Goodall 1991). In addition, because the species studied here were members of two clades within Plethodon, I compared head shape and body size between them using Procrustes ANOVA and ANOVA, respectively. Finally, phylogenetic patterns of head shape evolution were visualized using a phylomorphospace approach (sensu Sidlauskas 2008), and thin-plate spline deformation grids of representative specimens were generated to facilitate description of the observed shape changes. All analyses were performed in R 3.0.2 (R Development Core Team 2014) using routines in the library geomorph (Adams and Otárola-Castillo 2013; Adams et al. 2014).
Results: Standard multivariate regression revealed a significant relationship between head shape and body size (F = 9.481; P = 0.001; R2 = 0.372), suggesting that evolutionary allometry was present in Plethodon. However, when this relationship was examined in a phylogenetic context, the association between head shape and body size was no longer significant, with body size explaining little of the variation in head shape (F = 1.885; P = 0.188; R2 = 0.105). Thus, in contrast to the ahistorical analysis, the phylogenetic analysis using D-PGLS strongly implied that evolutionary allometry was not present in the group. This apparent discrepancy between the approaches was resolved by examining patterns of head shape variation in phylomorphospace (Fig. 2C). Here it was found that the species examined in this study formed two distinct clusters, which perfectly corresponded with two monophyletic lineages (species complexes) within the genus. Specifically, species in the P. glutinosus complex were found to display heads that were stouter and relatively more compressed in the anterio-posterior direction (Fig. 2C), whereas species in the P. cinereus complex exhibited heads that were more elongated anterio-posteriorly, and were slightly compressed in the dorsal-ventral direction. Further, both lineages also differed in body size, where species in the P. glutinosus clade were larger than species in the P. cinereus clade (Fig. 2C). Indeed, statistical analyses confirmed that the two clades differed significantly in both head shape (F = 16.384; P = 0.001; R2 = 0.506) and body size (F = 35.612; P < 0.0001; R2 = 0.69), suggesting that patterns of variation in these phenotypic traits was distinct between lineages. Finally, a comparison of R2 values between phylogenetic regression models revealed that considerably more variation in head shape was explained by a model describing lineage-specific differences (R2 = 0.506) than by a model describing evolutionary allometry (R2 = 0.105), implying that clade-level differences provided a better overall explanation of the observed trends in shape variation when taking phylogeny into account.
In conclusion, when all patterns were taken into consideration, it was clear that the apparent evolutionary allometry observed in Plethodon was the result of two lineages within the genus evolving differences in both body size and head shape, as opposed to the alternative explanation in which more subtle changes in both traits cooccurred from species to species. Thus, this example highlights the importance of incorporating phylogeny directly into cross-species evaluations of evolutionary allometry. Without a phylogenetic perspective, the hypothesized mechanism responsible for patterns of size-shape covariation would have been misidentified, and incorrectly attributed to allometric trends related to size. However, when phylogenetic nonindependence was included in the multivariate regression, this explanation became untenable. Instead, the observed patterns appear due to the fact that two sublineages exist within the group, and these lineages have evolved differences in both body size and head shape. As such, it is lineage-specific differences and not evolutionary allometric trends that best explain patterns in head shape variation at the macroevolutionary level. Only when phylogeny was taken into consideration could the correct pattern be revealed (see also Garland et al. 1992).
Discussion
A critical component in the study of adaptation is to identify evolutionary correlations between phenotypic traits in species related by a phylogeny. To evaluate such hypotheses, phylogenetic regression is typically used, but this method is incapable of assessing evolutionary patterns in highly multivariate datasets like shape, as the large number of parameters to be estimated prohibits the analyses from being analytically completed. In this article, I developed a distance-based procedure (D-PGLS) for evaluating ANOVA and regression models for high-dimensional datasets in a phylogenetic framework under a Brownian motion model of evolution. I showed that when used on univariate data, the approach yields statistical estimates that are numerically identical to those obtained from phylogenetic regression, thereby demonstrating the equivalency between phylogenetic regression and D-PGLS for such datasets. I also illustrated that the approach exhibits appropriate type I error and high statistical power for detecting evolutionary patterns in high-dimensional datasets. By contrast, I found that standard implementations of phylogenetic regression have decreasing power as trait dimensionality increases. Together, these findings have several important implications for the evaluation of evolutionary patterns in high-dimensional data and how they may be identified phylogenetically.
First, the results presented here imply that parametric implementations of phylogenetic comparative approaches, such as independent contrasts and PGLS, may not be the most appropriate procedures for assessing patterns in high-dimensional phenotypic datasets, because the ability of these procedures for detecting patterns in multivariate data are analytically compromised by trait dimensionality. Recently, it was suggested that comparative analyses of multivariate data like shape may be accomplished simply by converting the data to their phylogenetic independent contrasts and using the appropriate statistical procedure to assess the given hypothesis (see Klingenberg and Marugán-Lobón 2013). Although this recommendation may be acceptable for large datasets where the number of species examined greatly outpaces the dimensionality of the phenotypic data (see e.g., Blankers et al. 2012; Klingenberg and Marugán-Lobón 2013), the results presented here show that employing parametric implementations of the standard phylogenetic comparative toolkit ubiquitously to all datasets (including high-dimensional data), may not be sufficient, as these methods lose statistical power as trait dimensionality increases. Thus, as comparative biologists quantify phenotypic data by more complex and comprehensive methods, standard phylogenetic regression approaches are not guaranteed to identify patterns that may be present (see Fig. 1). Further, when the number of trait dimensions equals or exceeds the number of species in the phylogeny, the parametric approach to phylogenetic regression has zero statistical power as its algebra cannot be completed, thereby limiting the datasets that can be examined with this method. Indeed, the biological example presented here represents such a scenario, as the number of phenotypic variables describing salamander head shape exceeded the number of species used in the study. By contrast, evaluating evolutionary patterns in such data via the distance-based approach developed here has no such data restrictions. Therefore, as comparative biologists continue to characterize phenotypic attributes in increasingly more complex ways (from univariate measures to sets of a few traits to highly multivariate phenotypes), so too must they alter their analytical procedures; moving from standard implementations of phylogenetic regression that evaluate multivariate test statistics such as Wilks’ lambda and Pillai's trace to the distance-based procedure proposed here (D-PGLS).
Second, the findings presented here demonstrate that the comparative biologist need not reduce the dimensionality of the phenotypic data to a few principal component axes to perform phylogenetic assessments of evolutionary correlations (e.g., Bergmann et al. 2009; Nogueira et al. 2009; Monteiro and Nogueira 2011; Brusatte et al. 2012). Indeed, the method developed here exhibits higher statistical power as trait dimensionality increases (Fig. 1), implying that more data can be leveraged in a meaningful way to evaluate evolutionary patterns. This empirical finding is in direct contrast to the suggestion that dimension reduction is a required step for evaluating patterns in high-dimensional data like shape when using phylogenetic comparative methods (Monteiro 2013). While this remains the case for covariance-based procedures that require the estimation of many parameters, by moving from the parametric implementations of PGLS to the distance-based D-PGLS procedure there is no need to simplify one's phenotypic data to force it into the standard analytical paradigm (for a related conclusion on dimension reduction see Bookstein 2013). Thus, when viewed in this manner, D-PGLS provides a useful complement to existing phylogenetic comparative approaches, that enables the evaluation of hypotheses of adaptation and phenotypic change for high-dimensional phenotypic data in a manner analogous to that provided by PIC and PGLS for univariate traits.
One current limitation with the approach developed here is that the method is only formulated under a Brownian motion model of evolution. Thus, evolutionary patterns derived from other evolutionary processes, such as an Ornstein–Uhlenbeck model (OU: Hansen and Martins 1996; Butler and King 2004) cannot be investigated. Recently, multivariate extensions of the OU models have been developed for assessing phenotypic trends in a phylogenetic context (Bartoszek et al. 2012; see also Butler and King 2004). However, like phylogenetic regression, this method requires the estimation of a large number of parameters that increases greatly with trait dimensionality (see discussion in Monteiro 2013). As such, the multivariate OU model will suffer the same limitations when used on highly multivariate data as does standard phylogenetic regression, and thus in practice it will be limited in utility to those scenarios where one evaluates a few number of traits across a phylogeny containing many species. Future work is required to derive an OU equivalent of D-PGLS to fill this void. With such an approach, evolutionary biologists could then compare the evolution of highly multivariate phenotypes as described by both Brownian motion and Ornstein–Uhlenbeck processes, providing the multivariate equivalent of evolutionary model selection approaches currently used for univariate traits (e.g., Butler and King 2004; Collar et al. 2010; Harmon et al. 2010; Beaulieu et al. 2012).
Despite this, the distance-based phylogenetic regression procedure proposed here (D-PGLS) provides an important quantitative merging of methods that characterize phenotypes using high-dimensional data (morphometrics) with those that evaluate interspecific patterns in a phylogenetic context. Recent years have seen an increased attention on the empirical intersection between morphometrics and phylogenetic comparative biology, and how data from the former may be used in the analytics of the latter (e.g., Adams et al. 2009; Klingenberg and Gidaszewski 2010; Adams et al. 2011; Klingenberg and Marugán-Lobón 2013; Monteiro 2013; Polly et al. 2013; Adams and Felice 2014; Adams 2014a,b). While in some cases the standard tools of the phylogenetic comparative toolkit may be sufficient for evaluating patterns in morphometric data (e.g., Sidlauskas 2008; Monteiro and Nogueira 2010; Klingenberg and Marugán-Lobón 2013; Piras et al. 2013), in other instances the characteristics of this high-dimensional data type dictate that alternative analytical procedures are sometimes required (see Klingenberg and Gidaszewski 2010; Adams 2014a,b). The analysis of evolutionary correlations between high-dimensional phenotypic data and other factors represents one such case, as the number of parameters that require estimation using standard approaches quickly becomes prohibitive. However, by deriving a Q-mode alternative that leverages the statistical equivalency between covariance-based and distance-based approaches, comparative biologists may circumvent this conundrum, and may evaluate evolutionary trends in high-dimensional data while accounting for phylogenetic nonindependence.
ACKNOWLEDGMENTS
I thank M. Collyer, G. Hunt, A. Kaliontzopoulou, and E. Sherratt for comments and discussion. This work was sponsored in part by NSF grants DEB-1257827 and DEB-111884.
DATA ARCHIVING
The doi for my data is 10.5061/dryad.36df0.
LITERATURE CITED
Associate Editor: P. David Polly
A Appendix
Worked Example Demonstrating the Equivalency of Phylogenetic Regression and D-PGLS for Univariate Data
In this example, five hypothetical species are related by the following phylogeny and have the following phenotypic values for X and Y:
A linear regression of picy on picx yields the following statistical summary values: SS = 0.2369; F = 0.1713; R2 = 0.054, which are identical to those found using phylogenetic generalized least squares (PGLS).
Finally, residual values are estimated from the model in a similar manner, from which SSres is obtained. From SSX and SSres one obtains F = 0.1713 and R2 = 0.054; which are identical to the values obtained using phylogenetic regression as shown above.
B Appendix
Computer Code for R
#The function below performs distance-based phylogenetic least-squares analysis. The method may
#be used to assess regression, ANOVA, and other linear models in a phylogenetic context.
#The approach is particularly useful for high-dimensional data where standard (parametric)
#phylogenetic generalized least squares analyses, or the analysis of phylogenetically
#independent contrasts, cannot be performed. The function below obtains the sums of squares
#(SS) for all factors in the linear model, and statistically evaluates them via permutation.
D.pgls<-function(f1,phy,iter = 999){
library(ape)
data = NULL
form.in<-formula(f1)
Terms<-terms(form.in,keep.order = TRUE)
Y<-as.matrix(eval(form.in[[2]],parent.frame()))
N<-length(phy$tip.label)
p<-ncol(Y)
if(is.null(rownames(Y))){
stop(“No species names with Y-data.”)}
if(length(match(rownames(Y), phy$tip.label))! = N)
stop(“Data matrix missing some taxa present on the tree.”)
if(length(match(phy$tip.label,rownames(Y)))! = N)
stop(“Tree missing some taxa in the data matrix.”)
C<-vcv.phylo(phy); C<-C[rownames(Y),rownames(Y)]
eigC <- eigen(C)
D.mat<-solve(eigC$vectors%*% diag(sqrt(eigC$values))%*% t(eigC$vectors))
Y.new<-D.mat%*% (Y)
ones.new<-D.mat%*%(array(1,N))
pred.1<- predict(lm(Y.new∼ones.new-1))
dat<-model.frame(form.in,data)
df<-df.tmp<-SS.tmp<-SS.obs<-F<-array()
for (i in 1:ncol(attr(Terms, “factors”))){
mod.mat<-model.matrix(Terms[1:i],data = dat)
x.new<-D.mat%*%mod.mat
pred.y<-predict(lm(Y.new∼x.new-1))
G<-(pred.y-pred.1)%*%t(pred.y-pred.1)
SS.tmp[i]<-sum(diag(G))
ifelse(i = = 1, SS.obs[i]<-SS.tmp[i], SS.obs[i]<-(SS.tmp[i]-SS.tmp[i-1]))
df.tmp[i]<-ifelse(ncol(mod.mat) = = 1,1,(ncol(mod.mat)-1))
ifelse(i = = 1, df[i]<-df.tmp[i], df[i]<-(df.tmp[i]-df.tmp[i-1]))
}
MS<-SS.obs/df
mod.mat<-model.matrix(Terms)
x.new<-D.mat%*%mod.mat
y.res<-residuals(lm(Y.new∼x.new-1))
SS.res<-sum(diag(y.res%*%t(y.res)))
df.res<-nrow(Y)-1-sum(df)
MS.res<-SS.res/df.res
Rsq<-SS.obs/(sum(SS.obs)+SS.res)
F<-MS/MS.res
F.r<-P.val<-array(1,dim = length(SS.obs))
for(i in 1:iter){
SS.tmp<-SS.r<-array()
Y.r<-as.matrix(Y[sample(nrow(Y)),])
row.names(Y.r)<-row.names(Y)
Y.r.new<-D.mat%*% (Y.r)
pred.1.r<- predict(lm(Y.r.new∼ones.new-1))
for (ii in 1:ncol(attr(Terms, “factors”))){
mod.mat<-model.matrix(Terms[1:ii])
x.new<-D.mat%*%mod.mat
pred.y.r<-predict(lm(Y.r.new∼x.new-1))
G.r<-(pred.y.r-pred.1.r)%*%t(pred.y.r-pred.1.r)
SS.tmp[ii]<-sum(diag(G.r))
ifelse(ii = = 1, SS.r[ii]<-SS.tmp[ii], SS.r[ii]<-(SS.tmp[ii]-SS.tmp[ii-1]))
}
MS.r<-SS.r/df
mod.mat<-model.matrix(Terms)
x.new<-D.mat%*%mod.mat
y.res.r<-residuals(lm(Y.r.new∼x.new-1))
SS.r.res<-sum(diag(y.res.r%*%t(y.res.r)))
MS.r.res<-SS.r.res/df.res
F.r<-MS.r/MS.r.res
P.val<-ifelse(F.r> = F, P.val+1,P.val)
}
P.val<-P.val/(iter+1)
anova.tab<-cbind(df,SS.obs,MS,F,P.val,Rsq)
anova.tab<-rbind(anova.tab,c(df.res,SS.res,MS.res,NA,NA,NA))
rownames(anova.tab)<-c(colnames(attr(Terms, “factors”)), “Residual”)
return(anova.tab)
}

![(A) Positions of 11 anatomical landmarks used to quantify head shape in Plethodon salamanders (image from [Adams et al. 2007]). (B) Fossil-calibrated molecular phylogeny displaying the estimated phylogenetic relationships among the species of Plethodon examined here. (C) Plot of phylomorphospace viewed as the first two principal component axes of tangent space. Thin-plate spline deformation grids representing exemplar individuals from each species complex are shown (magnified by 3×).](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/evolut/68/9/10.1111_evo.12463/5/m_evo12463-fig-0002.jpeg?Expires=1709961139&Signature=gqb6NdS6JFz-8dnDdQ~0Gxtef8ZQuSXYbg6e524yrfs4b8YOS~ZbbazElSVxcvEP7dSxPDpvl6idmKsu7nV-dCwFna2mcABQSsb0I61cpNfa1MzW7Jwp7Be8ttsVDjyxrM6zH6qwTjAYug1ygm1nI2kR3r9zvZGR1Zq-PTOgvxcHjGtBTvJDcI7seP1w3QMaumYr69~UD30zkknVLZPjrA-r76BHmzXOzajby3O0qklKpZuX029335tv6Qg3l1qBolR3em4QiDcVkuZIfkTK6NDV1BneTehTIk5ZW~wstKB0GaPJf~Ke-SP6tPTltOlK4kN4tAmJrqp1k3OIqez9PA__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)