Miguel D. Mahecha, Alfredo Martínez, Holger Lange, Markus Reichstein, Erwin Beck, Identification of characteristic plant co-occurrences in neotropical secondary montane forests, Journal of Plant Ecology, Volume 2, Issue 1, March 2009, Pages 31–41, https://doi.org/10.1093/jpe/rtp001
Abstract
Inferring environmental conditions from characteristic patterns of plant co-occurrences can be crucial for the development of conservation strategies concerning secondary neotropical forests. However, no methodological agreement has been achieved so far regarding the identification and classification of characteristic groups of vascular plant species in the tropics. This study examines botanical and, in particular, statistical aspects to be considered in such analyses. Based on these, we propose a novel data-driven approach for the identification of characteristic plant co-occurrences in neotropical secondary mountain forests.
Floristic inventory data were gathered in secondary tropical mountain forests in Ecuador. Vegetation classification was performed by coupling locally adaptive isometric feature mapping, a non-linear ordination method, with fuzzy-c-means clustering. This approach was designed for dealing with underlying non-linearities and uncertainties in the inventory data.
The results indicate that the applied non-linear mapping in combination with fuzzy classification of species occurrence allows an effective identification of characteristic groups of co-occurring species as fuzzy-defined clusters. The selected species indicated groups representing characteristic life-form distributions, as they correspond to various stages of forest regeneration. Combining the identified ‘characteristic species groups’ with meta-information derived from accompanying studies indicated that the clusters can also be related to habitat conditions.
In conclusion, we identified species groups either characteristic of different stages of forest succession after clear-cutting or of impact by fire or a landslide. We expect that the proposed data-mining method will be useful for vegetation classification where no a priori knowledge is available.
INTRODUCTION
Increasing areas of neotropical forests consist of secondary stands following natural or anthropogenic impacts (or the interference of both) that have altered the old-growth vegetation (Chazdon et al. 2007; Malhi et al. 2008; Ramankutty et al. 2007). From an ecological viewpoint, secondary tropical forests may play a role similar to their predecessors, e.g. in preventing soil erosion, regulating the carbon and water cycles, preserving biodiversity through habitat regulation and, last but not least, by providing timber and non-timber products to humans (Beck et al. 2008; Wright 2005; Malhi et al. 2008). The variety of these ecological services implies that the development of conservation efforts is also of social relevance. Linking characteristic patterns of vascular plant co-occurrences to environmental conditions can be of vital importance for the elaboration of future conservation and management decisions (DeWalt et al. 2003; Guariguata et al. 1997). However, the geobotanical monitoring of secondary tropical forests is difficult since they exhibit sequences of transitory stages manifested through highly non-stationary floristic composition (Chazdon et al. 2007). Not surprisingly, characteristic associations of vascular plants in secondary vegetation can hardly be detected by classical methods of vegetation analysis, and the question of whether characteristic species groups are identifiable in a non-stationary environment remains unresolved. It is still unknown whether (possibly recurrent) characteristics of species compositions can be related to the environmental conditions. In this context, we examine botanical aspects and, in particular, the statistical prerequisites for extracting characteristic species co-occurrences from vegetation inventories in secondary tropical forests. Building on these, we propose a novel data-mining approach that looks at various aspects of vegetation data collected in secondary tropical forests. We apply this new approach to a ‘space-for-time’ sequence sampled in a secondary mountain forest in southern Ecuador.
In the identification of characteristic plant groups, two methodological components have to be addressed: the botanical inventory and the statistical analysis. From a botanical point of view, the high diversity of vascular plants implies that one encounters many species that cannot be identified with certainty (Wright 2005). Others may be completely unknown to science or temporarily unidentifiable (especially as juveniles). Even adult trees or shrubs may be difficult to identify outside their flowering periods (Martínez et al. 2008). Without reliable identification, spotting characteristic species groups and their relations to certain habitats is difficult. The vegetation inventory data are unavoidably plagued by sampling uncertainty. Focusing on specific plant life forms, most commonly on trees, can be a means to facilitate the interpretation of an inventory (for summarizing meta-studies, cf. Finegan 1996; Guariguata and Ostertag 2001). Often, only abundant species are considered: those which show pronounced stand dominance, as indicated by the leaf area index, biomass or minimum stem diameter (e.g. Valencia et al. 2004). Yet all these approaches are based on a priori assumptions on the importance of the selected species with respect to patterns of plant coexistence, a shortcoming which this study sought to avoid.
From a statistical point of view, we have to consider further critical aspects: (i) the high dimensionality of the data (inherited from a high level of α-diversity), (ii) possible underlying non-linearities in the topology of the data space (a general issue in real-world data) and (iii) the data uncertainty (effect of the sampling uncertainty). The purpose of this study was to develop a novel analysis that incorporates these issues, thus leading to the reliable identification of characteristic plant co-occurrences.
Extracting the underlying structures of high-dimensional data is well established in ecology as ‘ordination’ or, in the present case, could be more precisely referred to as an ‘indirect gradient analysis’ (ter Braak 1995). These are blind analyses of high-dimensional data where the central step is a ‘dimensionality reduction’: the goal is to find a representation of the data points in a low-dimensional embedding space, making them accessible to further interpretations (McCune and Grace 2002). Conventionally, linear approaches (e.g. Principal Component Analysis, PCA, Classical Multidimensional Scaling, CMDS) or more advanced solutions (e.g. non-metric multidimensional scaling, NMDS) are used for ordination in vegetation ecology (Legendre and Legendre 1998). This holds especially true for the ordination in tropical vegetation analysis when characteristic species groups are to be identified (e.g. Capers et al. 2005; Flåten et al. 2007; Navarro et al. 2005; Rodriguez-Rojo et al. 2001). Advanced ordination tools, e.g. NMDS or self-organized feature maps, go beyond linear mappings to allow for more flexible dimensionality reduction. However, they are based on iterative learning schemes and the embeddings are not replicable. Today, ordinations can be realized using powerful non-linear and non-iterative techniques (Lee and Verleysen 2007). These techniques are of crucial importance, since ecological data can hardly be projected in a linear way to an ordination space (Mahecha and Schmidtlein 2008). In particular, new methods of non-linear dimensionality reduction within the framework of ‘Isometric Feature Mapping, Isomap’ (Tenenbaum et al. 2000) have a high potential for species ordinations (Mahecha et al. 2007; McCune and Grace 2002). This observation holds especially true in view of recently developed improvements of Isomap that render the method fully data-adaptive and able to cope with noisy data (e.g. Mekuz and Tsotsos 2006; Wen et al. 2008).
For identifying characteristic species associations in an ordination space, a coupling of ordination and clustering algorithms has proven to be very useful (e.g. Olano et al. 1998; Zhang 1994). Since the goal of the present study was to identify characteristic species co-occurrences, a partitioning of the data set was necessary. Equihua (1990) showed that in species assemblages, an exclusive species classification might be inappropriate. Natural phenomena follow gradients (the underlying ‘manifold’) and result from overlapping units, and the classification should account for observational uncertainties. Against this background, he proposed species group identification based on fuzzy numbers. This allows explicit use of data uncertainty in the classification effort. Considering the characteristics of tropical inventory data, fuzzy classification appears to be a very promising approach.
Here we advocate the combination of a novel non-linear ordination method, a locally adaptive Isomap, with fuzzy-c-means (FCM) clustering. We demonstrate that this approach satisfies the requirements of dealing with a high-dimensional data set in a non-linear projection under the assumption that the space-for-time sequence is under-sampled. Data uncertainty is taken into account in the subsequent step: the identification of characteristic species groups. The analyzed species inventories were made in secondary forests of different ages that resulted from various types of impacts. The species records comprise all species of vascular plants found, without disregarding rare species, certain life forms or low biomass producers. The principal goal was to relate characteristic species groups to environmental conditions.
METHODS
Study area
Seven secondary forests were investigated that resulted from various well-documented interferences with the pristine tropical mountain rain forest of the valley of the Rio San Francisco in the Cordillera Oriental of South Ecuador. The central coordinates of the research area are 3° 58′ 21″ S and 79° 4′ 33″ W. The area borders Podocarpus National Park, which is considered the region with the highest number of endemic species in Ecuador (Brummitt and Lughagha 2003; Valencia et al. 2000). A detailed description of the area is given by Beck et al. (2008). The elevation of the secondary forests is 1950 (±50) m a.s.l.; the daily temperature ranges between 6 and 29°C with an annual mean of 15.3°C. The average precipitation was determined to be 2067 mm yr−1, which is about three times the annual evaporation (Beck and Richter 2008).
Sampling
Each of the seven secondary forests is well delimited from its surroundings: primary mountain rain forest, pasture land or abandoned pastures overgrown with bracken fern and bushes (Hartig and Beck 2003). Each forest plot sampled was further subdivided into subplots (for an overview see Table 1). A detailed description of the site characteristics, including a short history of impacts, is given by Mahecha et al. (2007). Four of the plots represent a succession sequence after several clear-cuttings between 1962 and 1989 (C1–C4), which we call a ‘space-for-time’ sequence. Two patches of secondary forest were found in a ravine recovering from a fire that occurred about 15 years ago (F1 and F2). The seventh plot (L) is a 10-year-old forest that developed on a landslide area that in part borders the original forest, while on the other sides it is adjacent to pastures (see also Martínez et al. 2008). The subplots were rectangular and measured 5 m × 5 m. Due to its complex morphology, one plot was sampled on a 4 m × 4 m grid.
Table 1: overview of the sampled forest plots
Plot | Type of impact | Number of subplots | Elevation (m a.s.l.) | Exposition | Inclination (°) | Number of species |
C1 | Clear-cut | 80 | 1910 | N | 30 | 155 |
C2 | Clear-cut | 91 | 1900 | N | 45 | 145 |
C3 | Clear-cut | 65 | 1900 | NE | 50 | 206 |
C4 | Clear-cut | 60 | 1920 | N | 40 | 197 |
L | Landslide | 60 | 1950 | SE | 70 | 148 |
F1 | Fire | 75 | 2010 | O | 50 | 217 |
F2 | Fire | 60 | 2000 | O | 35 | 174 |
Complete inventories at the subplot level were carried out in several field campaigns between 2001 and 2003. Species identification was based on the Flora of Ecuador by Harling and Anderson (1973). Identification of the specimens was verified by comparing them with herbarium samples held in the ‘Herbario Reinaldo Espinosa’ of the Universidad Nacional de Loja, the ‘Herbario de la Pontifícia Universidad Católica del Ecuador, Quito’ and the ‘Herbario de la Universidad de Azuay, Cuenca’. Furthermore, the database ‘Visual Plants’ (Dalitz 2002, www.visualplants.de) was used for identification as well as to register new samples. The total inventory comprises 773 samples of vascular plants that could tentatively be addressed as different species. Of these, 140 could be identified at the species level, 358 at the genus level and 213 only at the family level; 62 could not be identified. For differentiation purposes, working names were given to those specimens that could not be assigned to the species level. This obviously entails some degree of uncertainty in the data array, which can be described as a presence–absence matrix X = {x1, …, xn}, where each vector xi represents one differentiated species (here, n = 773). In some cases, we found pairs of species with identical occurrence vectors xi, which holds true for species of ubiquitous occurrence in one of the plots and for co-occurrences of very rare plants. Repeated species vectors were treated as a single xi since their information content is fully redundant, so the final column dimensionality of X was n = 626. The row dimensionality m of the data matrix is the number of analyzed subplots (here, m = 491). Note that for the sake of comparison, plot C3 (sampled on a 4 m × 4 m grid) was resampled by overlaying a virtual 5 m × 5 m grid. Species-to-subplot assignments followed a spatial nearest-neighbor principle. Where subplots of the resample grid were equidistant, the species were assigned randomly among them. The analysis considered only those resampled subplots whose areas were covered by at least 50% of the original subplots.
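The merging of repeated species vectors described above can be illustrated with a minimal sketch. The function name, the species labels and the toy occurrence data are hypothetical and chosen only for illustration; the study itself reduced n = 773 vectors to n = 626.

```python
def merge_identical_species(occurrences):
    """Group species whose presence-absence vectors are identical.

    occurrences maps a species label to a tuple of 0/1 values, one per
    subplot. The result maps each distinct vector to the list of species
    that share it, so every key stands for one retained vector x_i.
    """
    groups = {}
    for name, vector in occurrences.items():
        groups.setdefault(tuple(vector), []).append(name)
    return groups


# Toy data (hypothetical): four species recorded in five subplots.
toy = {
    "sp_a": (1, 0, 1, 0, 0),
    "sp_b": (1, 0, 1, 0, 0),  # same occurrences as sp_a, hence redundant
    "sp_c": (0, 1, 1, 0, 1),
    "sp_d": (0, 0, 0, 1, 0),
}
groups = merge_identical_species(toy)
n_reduced = len(groups)  # number of distinct species vectors
```

Keeping the list of merged labels for each vector makes it possible to map any later cluster assignment back to all species that share that occurrence pattern.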
Non-linear ordination




The matrix of linear distances allows a selection of k–nearest neighbors (k–NN) for each data point. This sets up a connectivity structure, where each species-to-species relationship is defined as ‘connected’ if one species belongs to the k–NNs of the other and otherwise ‘unconnected’. In other words, the species are the vertices of a graph and the k–NN threshold defines its edges. The requirement is that the graph is connected, which means that any two species that are not directly linked are linked via other species. The edge weights of this graph are given by the estimated inter-point distances D(X) (for an introduction to graph theoretical concepts in ecology see, e.g. Urban and Keitt 2001). It is then possible to compute the inter-point (species-to-species) distances using the shortest path on the graph. Such ‘shortest paths’ through graphs are conventionally found by applying Dijkstra's algorithm (Dijkstra 1959). The important point is that these new species-to-species distances preserve the local linear (Sørensen-based) metric but turn into geodesic distances between non-k–NN points. These geodesic distances D(G) are subjected to the CMDS mapping described in equation (1).
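The graph construction and the geodesic distances described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the study: it uses a single global k (the conventional k-Isomap, not the locally adaptive variant), it takes the plain Sørensen dissimilarity as a stand-in for the local metric, and all function names and the toy vectors are hypothetical. The final CMDS step is omitted.

```python
import heapq
import math


def sorensen_distance(a, b):
    """Sørensen dissimilarity between two 0/1 vectors (assumed local metric)."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(a) + sum(b)
    return 0.0 if total == 0 else 1.0 - 2.0 * shared / total


def knn_graph(vectors, k):
    """Link i and j if one of them is among the k nearest neighbors of the other."""
    n = len(vectors)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        ranked = sorted(
            (sorensen_distance(vectors[i], vectors[j]), j) for j in range(n) if j != i
        )
        for d, j in ranked[:k]:
            graph[i][j] = d
            graph[j][i] = d  # edges are undirected
    return graph


def geodesic_distances(graph, source):
    """Dijkstra's algorithm: shortest-path length from source to every vertex."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0.0
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist[node]:
            continue  # outdated queue entry
        for neighbor, weight in graph[node].items():
            candidate = d + weight
            if candidate < dist[neighbor]:
                dist[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return dist


# Toy data (hypothetical): four species whose occurrences overlap like a chain.
species = [
    (1, 1, 0, 0, 0, 0),
    (0, 1, 1, 0, 0, 0),
    (0, 0, 1, 1, 0, 0),
    (0, 0, 0, 1, 1, 0),
]
graph = knn_graph(species, k=1)
from_first = geodesic_distances(graph, source=0)
connected = all(math.isfinite(d) for d in from_first.values())
direct = sorensen_distance(species[0], species[3])
```

In this toy case the first and last species share no subplot, so their direct Sørensen dissimilarity saturates at 1.0, whereas the path along the chain of intermediate species gives a larger geodesic distance of 1.5. An infinite entry in `from_first` would indicate that the graph is not connected for the chosen k.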
Despite the substantial progress of Isomap compared to conventional ordination methods, the global parameter setting might cause problems with real-world data. Examples can be found where the sampling of the underlying manifold is noisy and incomplete (Balasubramanian and Schwartz 2002), varies in density (Lafon and Lee 2006) or where the manifold curvature varies (Wen et al. 2008). In such cases, the constructed graph contains shortcuts, and an optimal mapping cannot be guaranteed.
Thus, we are seeking a method that provides locally adaptive neighborhood estimates taking into account locally varying sampling density or locally varying curvatures of the manifold. Several algorithms have been proposed which aim at identifying locally optimal k–NN parameters for each data point (e.g. Mekuz and Tsotsos 2006; Saxena et al. 2006; Wen et al. 2008). Isomap then turns into a fully data-adaptive method for non-linear dimensionality reduction. Here, we applied the method proposed by Mekuz and Tsotsos (2006), which we adapted for dealing with binary data (the full method for identifying locally optimal k–NN values is outlined in Appendix A provided as supplementary material online).
FCM clustering
For the identification of relevant patterns of plant co-occurrences, the ordination space needs to be partitioned. For this purpose, we used FCM clustering (Bezdek 1981), which classifies the data points into homogeneous clusters based on a linear distance measure. In light of the concept of geodesic distances introduced above, a direct FCM application to an ecological data set can cause severe shortcomings in the presence of topological non-linearities. However, by applying FCM to the ordination space recovered by the locally adaptive Isomap, this problem is circumvented.

Here we provide the fundamental algorithmic steps toward minimizing the FCM cost function:
I. Choose a number of clusters c and then initialize the algorithm by setting random values for ui,j and a defined exponent q. The most common choice is q = 2 (Zahid et al. 1999).
II. Calculate the cluster centroids vi, with i = 1, …, c, as the means of the data points weighted by the memberships ui,j raised to the power q.
III. Update the entries of pattern matrix Ui = {ui,j}.
IV. Halt if the cost Jq(U,V) < ϵ (the tolerance threshold has to be defined by the user and was set here to ϵ = 0.001). Otherwise go to II.
An implementation of the FCM algorithm is provided within the ‘Fuzzy Clustering and Data Analysis Toolbox’ (Janos Abonyi and Balazs Feil) available at http://www.fmt.vein.hu/softcomp/.
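Steps I to IV can be sketched in plain Python as follows. This is a minimal illustration of standard fuzzy-c-means and not the toolbox implementation cited above. The function name, the toy coordinates and the fixed random seed are hypothetical, and the stopping rule is an assumption: the loop halts when the cost changes by less than ϵ between iterations, a common variant of step IV.

```python
import random


def fcm(points, c, q=2.0, eps=1e-3, max_iter=200, seed=0):
    """Fuzzy-c-means on a list of coordinate tuples.

    Returns (U, V): U[i][j] is the membership of point j in cluster i and
    V[i] is the centroid of cluster i.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Step I: random memberships, scaled so that each point's memberships sum to 1.
    U = [[rng.random() for _ in range(n)] for _ in range(c)]
    for j in range(n):
        total = sum(U[i][j] for i in range(c))
        for i in range(c):
            U[i][j] /= total
    previous_cost = float("inf")
    V = []
    for _ in range(max_iter):
        # Step II: centroids as means of the points weighted by u**q.
        V = []
        for i in range(c):
            w = [U[i][j] ** q for j in range(n)]
            V.append(tuple(sum(w[j] * points[j][d] for j in range(n)) / sum(w)
                           for d in range(dim)))
        # Squared Euclidean distance between every centroid and every point.
        D = [[sum((points[j][d] - V[i][d]) ** 2 for d in range(dim))
              for j in range(n)] for i in range(c)]
        # Step III: update the memberships from the distance ratios.
        for j in range(n):
            zero = [i for i in range(c) if D[i][j] == 0.0]
            for i in range(c):
                if zero:  # the point coincides with at least one centroid
                    U[i][j] = 1.0 / len(zero) if i in zero else 0.0
                else:
                    U[i][j] = 1.0 / sum((D[i][j] / D[k][j]) ** (1.0 / (q - 1.0))
                                        for k in range(c))
        # Step IV (assumed variant): halt when the cost changes by less than eps.
        cost = sum(U[i][j] ** q * D[i][j] for i in range(c) for j in range(n))
        if abs(previous_cost - cost) < eps:
            break
        previous_cost = cost
    return U, V


# Toy ordination space (hypothetical): two well-separated groups of points.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
U, V = fcm(points, c=2)
best = [max(range(2), key=lambda i: U[i][j]) for j in range(len(points))]
```

Each column of `U` sums to one, so a species lying between two groups receives intermediate memberships in both instead of being forced into a single class.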

In this formula, a normalization factor is introduced so that the compactness term becomes a relative measure of compactness for the ith cluster, while the remaining term is the estimate of the fuzzy separation. By measuring minimum exponential separation of the clusters, considering the compactness, the index is suited for noisy environments. PCAES is computed for a varying number of clusters c (c = 2, …, 16 in Fig. 3). The upper limit is justified, since we are searching for a rough partitioning of the data set, whose cardinality should be much smaller than the number of observations n (Wu and Yang 2005; Zahid et al. 1999). The optimum number of clusters c* can be found as the global maximum of PCAES(c).
Here, we applied FCM to the locally adaptive Isomap ordination space, where we consecutively considered additional dimensions. Ultimately, each species has as many membership values as there are clusters. A characteristic set of plant co-occurrences was defined here heuristically as a group of species with a high degree of fuzzy classification with respect to one of the c* clusters. Choosing the species belonging to a cluster with membership ui,j ≥ 0.9, for example, is expected to identify a set of ‘characteristic species’.
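The heuristic selection of ‘characteristic species’ by a membership threshold can be sketched as follows. The function name, the species labels and the membership values are hypothetical; only the threshold of 0.9 is taken from the text.

```python
def characteristic_species(memberships, names, threshold=0.9):
    """Select the well-classified species of each cluster.

    memberships[i][j] is the membership of species j in cluster i. A species
    is assigned to the cluster of its highest membership and is kept only if
    that membership reaches the threshold.
    """
    groups = {i: [] for i in range(len(memberships))}
    for j, name in enumerate(names):
        column = [row[j] for row in memberships]
        best = max(range(len(column)), key=column.__getitem__)
        if column[best] >= threshold:
            groups[best].append(name)
    return groups


# Toy membership matrix (hypothetical): two clusters, four species.
names = ["sp_a", "sp_b", "sp_c", "sp_d"]
U = [
    [0.95, 0.50, 0.05, 0.92],
    [0.05, 0.50, 0.95, 0.08],
]
selected = characteristic_species(U, names, threshold=0.9)
```

Here `sp_b` has equal membership in both clusters and is therefore excluded from every characteristic group. Lowering the threshold admits more species at the cost of classification confidence.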
RESULTS AND DISCUSSION
Non-linear ordinations
Applying conventional k-Isomap successfully reduced the dimensionality of the vegetation data. The explained variance in the first two dimensions was maximized by setting the threshold parameter to k = 3 (see Fig. 1). At the same time, this was the minimum value for which the k–NN graph remained connected. Under this parameter setting, the first two embedding components explained 60% of the variance and the first three components 69%. The cumulative explained variance converged to >89% in the first 10 dimensions. From these findings, it is evident that Isomap is a substantial improvement compared to linear ordination efforts: the respective CMDS embedding did not recover more than 5% of the variance in the first 10 dimensions. We interpret this as empirical evidence for the existence of pronounced underlying non-linearities in the data space.

Figure 1: The cumulative ratios of explained variance recovered by the k-Isomap in the first 10 dimensions. Results are shown for a varying global k–NN parameter (indicated by the color scale) in the range of k = 3 to k = 30. Smaller k-values imply a higher degree of non-linearity in the ordination. The figure also shows the cumulative value of variance explained by the neighborhood-adaptive Isomap approach. It can be seen that the adaptive Isomap, in which the neighborhood definition varies locally (here k < 15), recovers almost the same amount of variance as the optimal standard k-Isomap embedding in all dimensions. The figure also shows a break in the slope of the cumulative explained variance at three dimensions, which is consistent with the estimate of the intrinsic dimensionality.
Yet in the present study, certain aspects are to be considered that can raise concerns regarding the suitability of the conventional k-Isomap. The variance maximization at the edge of graph connectivity stems from a high degree of non-linearity in the topology of the data space. This could also be due to the unsystematic sampling design: investigations of the successional development in secondary neotropical forest often rely on space-for-time sequences (Chazdon et al. 2007), as was the case here. The underlying manifold of the successional development of neotropical secondary forests is expected to be spanned or at least influenced by time, pedogenesis, biogeographical history and recent seed dispersal (Beck et al. 2008). The effectively observable stages of vegetation development within the seven study plots are very sparse. In view of the time span under investigation (ranging from one to several decades) and the high degree of landscape fragmentation and land use change in the region (Goerner et al. 2007), it is impossible to build up data sets that effectively sample the underlying manifold. Any analysis has to be able to cope with a lack of observations of important successional steps. The conventional Isomap sets a global threshold parameter, which in turn would require that sampling along the underlying manifold is more or less of equal density.
Hence, the less restrictive adaptive Isomap can be regarded as a better tool: it is free of assumptions on sampling along the manifold and allows, e.g. large sub-areas of the data set to be almost linearly connected. For these reasons, the application of adaptive Isomap is attractive. Indeed, the method recovered a whole range of k-values (k ≤ 15), and at the same time explained almost the same amount of variance in the first dimensions as the very best k-Isomap performance (Fig. 1). Even in higher dimensions, adaptive Isomap recovered almost the same amount of variance as the very best k-Isomap. The statistical power, in conjunction with the mentioned ecological considerations and the fact that the method does not require additional user input, was the reason for using the recovered embedding space for subsequent FCM clustering.
FCM partitioning of the ordination space
The visualization of the adaptive Isomap ordination is presented in Fig. 2. Inspection of the spaces spanned by the first three dimensions provides retrospective justification for the application of the FCM algorithm. Although the species indicate a strong clustering, the separation is not sharp. The continuous transition between the clusters is what we expected from the sampling design. The quality assessment of the achieved partitions is shown in Fig. 3 in terms of the PCAES(c) index, where more dimensions of the ordination space were consecutively considered. The index evolution (over a varying number of clusters c and Isomap dimensions) shows some interesting phenomena: when analyzing only the first dimension, PCAES did not find an optimal separation (c* = 1, see Fig. 3). When analyzing two dimensions, the species could be best partitioned into three fuzzy sets, although we also find a slight local maximum at c = 5. The behavior of the membership values in this two-dimensional ordination space is illustrated in Fig. 4, which also presents two examples of inferior partitioning performance (c = 3 and c = 7). All the same, these results have to be seen in the light of the Isomap performance. In Fig. 2, one can observe that the perception of data clustering seen in a two-dimensional space may be biased due to visually overlapping clusters, which are effectively separated in higher dimensions. Indeed, when considering more dimensions in FCM clustering, the respective global maxima of PCAES are shifted toward higher values for c* (Fig. 3). We observed c* in the range of four to seven clusters in all cases. Since the intrinsic dimensionality (which was estimated in the course of adaptive isometric feature mapping) was μ = 3 and an ‘elbow effect’ emerged in the cumulative values of explained variance in all Isomap applications (Fig. 1) after three dimensions, we base all further analysis on this three-dimensional embedding. In three dimensions, the optimal partitioning was found to be c* = 6.

Figure 2: Isomap embedding for all data points (each point represents one species) and the related neighborhood graph based on a locally adaptive choice of k–NN. (a) shows the embedding visualized for the first two dimensions, (b) shows the same for dimensions 1 and 3, (c) shows the space spanned by the second and third dimensions and (d) comprises dimensions 1–3.

Figure 3: The PCAES cluster validity index as a function of a varying number of species clusters c = 2, …, 16. We show the results for the locally adaptive Isomap embedding, considering unidimensional to 10-dimensional embeddings. For each space, the global maximum of the PCAES(c) cluster validity index indicates the optimum data set partition.

Figure 4: Fuzzy partitioning of the two-dimensional adaptive Isomap embedding space by FCM clustering. From top to bottom, the figure shows the species partitioning for a choice of c = 3, 5 and 7 clusters. The gray scale is adjusted to the highest degree of membership for the classification of each species and interpolated to space. The corresponding partition index is given in Fig. 3.
The visualizations of the ordination space (Fig. 2) reveal another interesting phenomenon: the clustering not only affects the plant arrangement but it is also inherent to the neighborhood graph. This means that species in regions of ‘dense’ clustering are also highly interconnected, whereas the connectivity between clusters is relatively weak. This ‘graph clustering’ is induced by the locally adaptive nature of our mapping, but it is not addressed in the FCM clustering setup. Indeed, the theoretical meaning of a ‘clustered’ Isomap neighborhood graph is difficult to ascertain. Future applications of non-linear dimensionality reduction might be able to make explicit use of such clustered connectivity properties. Coifman et al. (2005) developed a geometric framework in which unevenly distributed connectivities are explicitly used for defining a diffusion process on the graph. This framework also enables the performance of tasks of dimensionality reduction, which in the context of ecological data arrays still needs to be investigated in detail.
Characteristics of species clusters
The identification of species groups using data mining techniques is only the first step in an ecological assessment of plant communities. Analyzing the intrinsic properties of these groups is pivotal, especially with respect to life-form distribution (Vazquez and Givnish 1998). Figure 5 illustrates the percentages of life-form distribution in the respective clusters, taking into account species classified by decreasing FCM membership values. The figure shows that increasing classification confidence (expressed through high membership values ui,j) is associated with an increasing differentiation of life-form distribution. Cluster 1, for example, contains a higher fraction of shrubs than the overall observed distribution. By contrast, cluster 2 contains no shrubs at very high degrees of classification confidence (ui,j > 0.9), while decreasing membership values lead to increasing fractions of shrubs. Lianas are not present in this cluster at all. Another clear example is cluster 5, where 50% or more of the cluster (ui,j > 0.9) is composed of trees, while the rest are shrubs. Herbs and lianas appear only for smaller thresholds. Less distinct life-form distribution patterns are found in clusters 3 and 4.

Figure 5: The percentage of life-form occurrence in six fuzzy clusters that partition the first three dimensions of the adaptive Isomap embedding. Each subplot shows the percentage of life forms in one cluster when considering species classified by the indicated membership values. Note that species assignment to each cluster was based on the respective best classification (that is why the lowest membership value is >0.2). As reference for comparison, the life-form distribution of all species is shown at the right edge of each subplot.
Life-form distributions within all clusters (an exception is cluster 2) converge to the overall life-form distribution when species of decreasing membership values are considered. This is due to a trade-off in the interpretation of Fig. 5. On the one hand, focusing on well-classified species (high ui,j-values) implies that the ‘characteristic species group’ might consist of very few species (see also Fig. 4), and the observed life-form distribution lacks representativeness. On the other hand, taking more species into account increases the representativeness of the life-form distribution at the cost of decreasing classification accuracy. Note that we only counted each species once, corresponding to the cluster of highest membership degree.
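The trade-off described above can be made concrete with a small sketch that computes the life-form percentages of one cluster for different membership thresholds. The function name and the toy records are hypothetical; each species is counted once, in the cluster of its highest membership, as stated in the text.

```python
from collections import Counter


def life_form_percentages(records, cluster, threshold):
    """Percentage of each life form among the species of one cluster.

    records holds one (life_form, best_cluster, best_membership) tuple per
    species. Only species whose best cluster matches and whose membership
    reaches the threshold are counted.
    """
    forms = Counter(
        form for form, best, u in records if best == cluster and u >= threshold
    )
    total = sum(forms.values())
    if total == 0:
        return {}
    return {form: 100.0 * count / total for form, count in forms.items()}


# Toy records (hypothetical): life form, best cluster, membership in that cluster.
records = [
    ("tree", 0, 0.95),
    ("shrub", 0, 0.93),
    ("herb", 0, 0.60),
    ("liana", 0, 0.40),
    ("tree", 1, 0.97),
]
strict = life_form_percentages(records, cluster=0, threshold=0.9)
relaxed = life_form_percentages(records, cluster=0, threshold=0.3)
```

With the strict threshold the toy cluster consists of trees and shrubs only, while the relaxed threshold also admits the herb and the liana, so the distribution drifts toward the overall mix.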
This effect could also be interpreted as a result of overlapping units, i.e. gradients in species distributions over space and time, which will never be classified unambiguously. It is, however, clear that it cannot be said a priori which life form is dominant, and assumptions on species selections according to their life forms are inappropriate. A more in-depth botanical analysis of the succession patterns is provided elsewhere (see Martínez 2007; Martínez et al. 2008).
Ecological interpretation of the species groups
The collected meta-information regarding forest succession and impact history (Mahecha et al. 2007; Martínez et al. 2008) leads us to expect certain patterns of nutrient availability, given that the entire area has been sampled with adequate intensity (Bendix et al. 2006; Beck et al. 2008, and chapters therein; Wilcke et al. 2002, 2003). It is then possible to identify relationships between characteristic species groups and environmental conditions. Technically, we create n-dimensional presence–absence vectors of the species in each cluster and of the species in each forest plot of known impact history. Then we simply calculate the distances between the cluster vectors and the plot vectors. We used a vector of characteristic species classified by membership values ui,j > 0.8, which is a compromise between very high classification confidence and the number of considered species discussed above.
The results are presented in Fig. 6 in terms of the Sørensen distance measure as introduced above (we used the non-square-rooted version of equation (2)). The clusters 5, 1, 4 and 2 follow the succession development of the plot series C1, C2, C3 and C4 (sorted along a gradient of increasing forest maturation). Also, the fire-affected plots (F1 and F2) are shaped by a characteristic species pool (cluster 6), which is not related to the botanical path of recovery after landslides or clear cutting. It seems evident that the third cluster is associated with the subplots belonging to L, which encodes a landslide-affected region. We know from the respective biogeochemical soil analysis in the same area that this is accompanied by losses of upper soil layers and thus a considerable decrease of plant-available nutrients (Wilcke et al. 2002, 2003).

Figure 6: The Sørensen distance between the presence–absence vector of species in each cluster (defined by membership values of >0.8, derived from a three-dimensional locally adaptive Isomap embedding) and the presence–absence vectors of forest plots of known impact history.
Taken together, these plot-cluster affinities (Fig. 6) and the previously discussed life-form distributions (Fig. 5) provide further evidence that we effectively found characteristic species groups of indicative value for ecological assessments: the life-form distribution in cluster 5 shows no herbs or lianas at the high classification confidence ui,j > 0.9, which is plausible considering that this cluster is also closely associated with the oldest forest plot (C1). Over the successional gradient depicted in the clusters 2, 4, 1 and 5 (5 representing the most mature forest state), we observe an increasing dominance of trees and shrubs over time, which is in line with previous findings in the literature (Chazdon 2008).
We present these results with caution, bearing in mind that the ‘characteristic species’ are obtained from a sampling strategy that in part anticipates the results in Fig. 6. However, the species pools of the seven plots are far from being isolated. We analyzed the intersections of the species pools sampled at plot level and found that between 27 and 53% of the species in each plot are unique to it, while the rest are spread over two or more forest plots. In this context, it also has to be clarified whether the proposed data-mining approach extracted the six characteristic species groups in a fully data-adaptive manner and without a priori input. Although ‘plot labels’ at each data point can be used for supervised learning methods (Mahecha et al. 2007), here they were used only after the analysis for validating our results. Therefore, we assume that the method presented here is suitable for applications where data-point affiliation remains unknown.
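The share of unique species per plot mentioned above can be computed from plot-level species pools with simple set operations. The following is a minimal sketch with hypothetical plot labels and species names; it is not the procedure or data used in the study.

```python
def unique_species_share(plot_species):
    """For each plot, the fraction of its species found in no other plot."""
    shares = {}
    for plot, species in plot_species.items():
        pool = set(species)
        # Union of the species pools of all other plots.
        others = set().union(
            *(s for p, s in plot_species.items() if p != plot)
        )
        shares[plot] = len(pool - others) / len(pool) if pool else 0.0
    return shares


# Hypothetical species pools of three plots.
pools = {
    "P1": {"sp_a", "sp_b", "sp_c", "sp_d"},
    "P2": {"sp_c", "sp_d", "sp_e"},
    "P3": {"sp_d", "sp_f"},
}
shares = unique_species_share(pools)
```

In the toy example, half of the species in `P1` and `P3` and one third of those in `P2` occur in no other plot.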
CONCLUSION AND OUTLOOK
A novel data-mining approach was proposed to identify characteristic co-occurrence patterns of vascular plants from inventory data of secondary neotropical mountain forests. The combination of adaptive non-linear ordination and fuzzy clustering allowed the effective identification of characteristic species in the presence of underlying non-linearities and in situations where a high degree of uncertainty was expected to affect the data array. Locally adaptive Isomap proved to be particularly suitable when only parts of the underlying gradients of interest (forest maturation, spatiotemporal biogeochemical gradients, land-use history and land use) can effectively be sampled. Furthermore, the coupling of locally adaptive Isomap and FCM clustering allows data sets to be partitioned even when inaccurate botanical classifications lead to high data uncertainty. The coupled Isomap–FCM clustering approach is, in principle, applicable to data arrays where no meta-information is available for validating the indicative properties of ‘characteristic species groups’. The method needs to be tested with larger and well-sampled data sets in which independent data, e.g. accurate biogeochemical parameters, are available; however, we hold the view that the data-mining approach presented here is suitable for similar applications concerning a variety of problems in vegetation science.
In this respect, we also expect that innovations in both floristic inventory and statistical analysis might become central to the elaboration of future management and conservation guidelines. The present study demonstrated that sophisticated data-mining tools can identify characteristic species groups that contain a considerable amount of environmental information.
SUPPLEMENTARY MATERIAL
Supplementary material is available at JPE Journal online.
The study was supported by the German Research Foundation (DFG) within the scope of the Research Unit 402: ‘Functionality in a Tropical Mountain Rainforest’. The quality of the manuscript was improved following suggestions by Catherine Schloegel. M.D.M. and M.R. thank the Max Planck Society for supporting the ‘Biogeochemical Model-Data Integration Group’ as an independent junior research unit.