-
PDF
- Split View
-
Views
-
Cite
Cite
M. Bučas, U. Bergström, A-L. Downie, G. Sundblad, M. Gullström, M. von Numers, A. Šiaulys, M. Lindegarth, Empirical modelling of benthic species distribution, abundance, and diversity in the Baltic Sea: evaluating the scope for predictive mapping using different modelling approaches, ICES Journal of Marine Science, Volume 70, Issue 6, September 2013, Pages 1233–1243, https://doi.org/10.1093/icesjms/fst036
- Share Icon Share
Abstract
The predictive performance of distribution models of common benthic species in the Baltic Sea was compared using four non-linear methods: generalized additive models (GAMs), multivariate adaptive regression splines, random forest (RF), and maximum entropy modelling (MAXENT). The effects of data traits were also tested. In total, 292 occurrence models and 204 quantitative (abundance and diversity) models were assessed. The main conclusions are that (i) the spatial distribution, abundance, and diversity of benthic species in the Baltic Sea can be successfully predicted using several non-linear predictive modelling techniques; (ii) RF was the most accurate method for both models, closely followed by GAM and MAXENT; (iii) correlation coefficients of predictive performance among the modelling techniques were relatively low, suggesting that the performance of methods is related to specific responses; (iv) the differences in predictive performance among the modelling methods could only partly be explained by data traits; (v) the response prevalence was the most important explanatory variable for predictive accuracy of GAM and MAXENT on occurrence data; (vi) RF on the occurrence data was the only method sensitive to sampling density; (vii) a higher predictive accuracy of abundance models could be achieved by reducing variance in the response data and increasing the sample size.Bučas, M., Bergström, U., Downie, A-L., Sundblad, G., Gullström, M., von Numers, M., Šiaulys, A., and Lindegarth, M. 2013. Empirical modelling of benthic species distribution, abundance, and diversity in the Baltic Sea: evaluating the scope for predictive mapping using different modelling approaches. – ICES Journal of Marine Science, 70: 1233–1243.
Introduction
Knowledge of the regional and local distribution of structural and functional biodiversity in the marine environment is a vital component for spatial planning and management of coastal seas (Ehler and Douvere, 2009; Gee et al., 2011). In this context, predictive modelling based on species–environment relationships provides a potentially useful way to synthesize information from scattered samples into coherent maps of distributions of species and habitats, ecological goods, and services (Guisan and Zimmerman, 2000; Guisan and Thuiller, 2005). Thus, the prediction of species distribution and diversity can provide useful information for management and conservation of endangered species and vulnerable biological communities. The ability to effectively predict the occurrence of habitats and species has indeed become urgent because of recent rapid habitat loss and the impacts of global climate change (Clark et al., 2001). Importantly, the results of predictive modelling of communities can also enhance our understanding of how various environmental variables influence the distribution of communities (e.g. Franklin, 1995; Austin, 2002). Over the last two decades, there has been a significant increase in research focused on the modelling distributions of terrestrial biota (Robinson et al., 2011) whereas little attention has been paid to marine ecosystems, largely due to the comparative paucity of appropriate marine environmental and biological data. New marine survey techniques (e.g. multibeam sonars and georeferenced underwater video), as well as the development of international policies for conservation of marine biodiversity, are now enabling and driving research into marine ecological processes and large-scale mapping of habitats and biological communities (e.g. Greene et al., 1999; Roff et al., 2003).
A wide array of methods are now applied in species distribution modelling (e.g. Guisan and Zimmerman, 2000; Elith et al., 2006). Several studies comparing the performance of habitat models have been carried out primarily for terrestrial species (e.g. Segurado and Araujo, 2004; Elith et al., 2006; Moisen et al., 2006; Aertsen et al., 2010). Although some studies have stressed that there is no one universally good modelling technique (e.g. Aertsen et al., 2010), others have recommended using non-parametric models (e.g. Segurado and Araujo, 2004). The findings of the comprehensive comparative studies of Elith et al. (2006) and Reiss et al. (2011) further demonstrated that non-parametric models, such as boosted regression trees, maximum entropy modelling (MAXENT), and generalized dissimilarity models, performed better than parametric or semi-parametric models (e.g. generalized linear models, GLMs, and generalized additive models, GAMs), which in turn performed better than BIOCLIM, LIVES, and DOMAIN that use presence-only data. Classical techniques such as the GLM (McCullagh and Nelder, 1989) were successfully applied to model benthic invertebrates in the Baltic Sea (Gogina and Zettler, 2010); however, GLMs were found to be limited in their ability to fit the complex, non-linear relationships often occurring between species and environmental predictors. However, comparative studies of species distribution modelling methods in marine environmental science are scarce, especially for abundance data (Li and Heap, 2008; Reiss et al., 2011). Furthermore, the number of species and spatial extent of such comparisons is usually limited (Segurado and Araujo, 2004).
Because of current policy developments in general and in the Baltic Sea in particular (HELCOM, 2007), there is a growing need for powerful methods to integrate and model distributions of marine benthic assemblages. The primary aim of this study was to evaluate the predictability of benthic species occurrence, abundance, and diversity at a regional scale to arrive at recommendations for modelling practices to be used in mapping the biota of the Baltic Sea. We focused on five independent areas across the Baltic Sea, evaluating the performance of different predictive modelling methods averaged over many taxa. We tested the performance of methods for qualitative data (species occurrence) and quantitative data (species abundance and diversity), using a wide range of species and environmental variables. Our model comparison involved four fundamentally different modelling approaches that cope with non-linear empirical relationships between responses and environmental predictors: GAM, multivariate adaptive regression splines (MARSs), random forest (RF), and MAXENT. In the Baltic Sea, the use of the GAM for modelling benthic communities has increased over the last decade (e.g. Carlén et al., 2007; Sandman et al., 2008; Florin et al., 2009; Carlström et al., 2010), whereas the applications of other methods are rare (Susi et al., 2010).
Many studies have also demonstrated that the predictive ability of empirical models may be strongly dependent on the quality of data (e.g. Kadmon et al., 2003; Guisan et al., 2007; Li and Heap, 2008). Therefore, we also assessed how data traits such as response type, number of samples, sampling density, response prevalence of the occurrence data, and variance in the response abundance data affected the performance of the studied modelling methods.
Material and methods
Study areas
The Baltic Sea represents the world's largest brackish-water sea area (382 000 km2) influenced by a limited inflow of marine, fully saline water from the North Sea and a high input of freshwater from the many rivers flowing into the enclosed basin. The wide variety of coastal types (Voipio, 1981) and complex bottom topography around the Baltic Sea are being strongly shaped by wave exposure and currents. Archipelagos of various geographical extents are found mainly along the Swedish and Finnish coasts. The largest archipelago covers the area east of Stockholm in a belt extending over the Åland islands into the large Archipelago Sea of southwestern Finland and further along the north coast of the Gulf of Finland. Open sandy shores are often found in the southeastern part of the Baltic Proper from Lithuania to Denmark.
Five study areas were selected across the Baltic Sea to represent the variety of environments (Figure 1): (i) Stockholm-Turku archipelago area, (ii) Archipelago Sea, (iii) Östergötland, (iv) Vinga, and (v) Lithuanian coast. Their environments differ mainly in salinity and coastal types (Table 1) according to topographical, hydrodynamical, and biological characteristics (Voipio, 1981). Salinity in the study areas varies from 3–7 PSU in the inner parts of the Archipelago Sea, through 5–9 PSU in the Baltic Proper (Lithuanian coast), to 18–30 PSU in the Kattegat (Vinga area).
Physical environmental characteristics (area, depth, substrate, wave exposure, near bottom salinity, Secchi depth, number of ice days) of the study areas, used sampling methods, number of samples, minimum of sampling unit, and distance between samples.
Study area . | Vinga . | Östergötland . | Lithuania . | Stockholm-Turku . | Archipelago Sea . | ||
---|---|---|---|---|---|---|---|
Extent of area (km2) | 40 | 2 335 | 400–2 700 | 40 000 | 9 000 | ||
Depth range (m) | 1–20 | 0–30 | 1–52 | 0–6 | 1–140 | ||
Substrate | Soft to rocky | Soft to rocky | Soft to stony | Soft to rocky | Soft to rocky | ||
Wave exposure (Hiscock, 1990) | Very sheltered to very exposed | Extremely sheltered to very exposed | Very exposed | Ultra sheltered to very exposed | Very sheltered to very exposed | ||
Salinity range (PSU) | 18–30 | 1–7 | 5–9 | 3–7 | 4–6 | ||
Secchi depth range (m) | NA | NA | 0.1–7 | 0.2–10 | 0.35–7 | ||
Number of ice days | 30 | 10–90 | 10–30 | 10–90 | 10–120 | ||
Sampling method | Drop-video | Drop-video | Dive or drop-video | van Veen grab | Detonation | Dive or ROV transect | Drop-video |
Number of samples | 125 | 1 473 | 448 | 148 | 399–2 827 | 355 | 361 |
Minimum sampling unit (m2) | 25 | 25 | 0.04 | 0.1 | 80 | 4 | 25 |
Minimum distance between samples (m) | 50 | 50 | 4 | 40 | 30 | 30 | 50 |
Study area . | Vinga . | Östergötland . | Lithuania . | Stockholm-Turku . | Archipelago Sea . | ||
---|---|---|---|---|---|---|---|
Extent of area (km2) | 40 | 2 335 | 400–2 700 | 40 000 | 9 000 | ||
Depth range (m) | 1–20 | 0–30 | 1–52 | 0–6 | 1–140 | ||
Substrate | Soft to rocky | Soft to rocky | Soft to stony | Soft to rocky | Soft to rocky | ||
Wave exposure (Hiscock, 1990) | Very sheltered to very exposed | Extremely sheltered to very exposed | Very exposed | Ultra sheltered to very exposed | Very sheltered to very exposed | ||
Salinity range (PSU) | 18–30 | 1–7 | 5–9 | 3–7 | 4–6 | ||
Secchi depth range (m) | NA | NA | 0.1–7 | 0.2–10 | 0.35–7 | ||
Number of ice days | 30 | 10–90 | 10–30 | 10–90 | 10–120 | ||
Sampling method | Drop-video | Drop-video | Dive or drop-video | van Veen grab | Detonation | Dive or ROV transect | Drop-video |
Number of samples | 125 | 1 473 | 448 | 148 | 399–2 827 | 355 | 361 |
Minimum sampling unit (m2) | 25 | 25 | 0.04 | 0.1 | 80 | 4 | 25 |
Minimum distance between samples (m) | 50 | 50 | 4 | 40 | 30 | 30 | 50 |
NA, not available data; ROV, remote automated vehicle.
Physical environmental characteristics (area, depth, substrate, wave exposure, near bottom salinity, Secchi depth, number of ice days) of the study areas, used sampling methods, number of samples, minimum of sampling unit, and distance between samples.
Study area . | Vinga . | Östergötland . | Lithuania . | Stockholm-Turku . | Archipelago Sea . | ||
---|---|---|---|---|---|---|---|
Extent of area (km2) | 40 | 2 335 | 400–2 700 | 40 000 | 9 000 | ||
Depth range (m) | 1–20 | 0–30 | 1–52 | 0–6 | 1–140 | ||
Substrate | Soft to rocky | Soft to rocky | Soft to stony | Soft to rocky | Soft to rocky | ||
Wave exposure (Hiscock, 1990) | Very sheltered to very exposed | Extremely sheltered to very exposed | Very exposed | Ultra sheltered to very exposed | Very sheltered to very exposed | ||
Salinity range (PSU) | 18–30 | 1–7 | 5–9 | 3–7 | 4–6 | ||
Secchi depth range (m) | NA | NA | 0.1–7 | 0.2–10 | 0.35–7 | ||
Number of ice days | 30 | 10–90 | 10–30 | 10–90 | 10–120 | ||
Sampling method | Drop-video | Drop-video | Dive or drop-video | van Veen grab | Detonation | Dive or ROV transect | Drop-video |
Number of samples | 125 | 1 473 | 448 | 148 | 399–2 827 | 355 | 361 |
Minimum sampling unit (m2) | 25 | 25 | 0.04 | 0.1 | 80 | 4 | 25 |
Minimum distance between samples (m) | 50 | 50 | 4 | 40 | 30 | 30 | 50 |
Study area . | Vinga . | Östergötland . | Lithuania . | Stockholm-Turku . | Archipelago Sea . | ||
---|---|---|---|---|---|---|---|
Extent of area (km2) | 40 | 2 335 | 400–2 700 | 40 000 | 9 000 | ||
Depth range (m) | 1–20 | 0–30 | 1–52 | 0–6 | 1–140 | ||
Substrate | Soft to rocky | Soft to rocky | Soft to stony | Soft to rocky | Soft to rocky | ||
Wave exposure (Hiscock, 1990) | Very sheltered to very exposed | Extremely sheltered to very exposed | Very exposed | Ultra sheltered to very exposed | Very sheltered to very exposed | ||
Salinity range (PSU) | 18–30 | 1–7 | 5–9 | 3–7 | 4–6 | ||
Secchi depth range (m) | NA | NA | 0.1–7 | 0.2–10 | 0.35–7 | ||
Number of ice days | 30 | 10–90 | 10–30 | 10–90 | 10–120 | ||
Sampling method | Drop-video | Drop-video | Dive or drop-video | van Veen grab | Detonation | Dive or ROV transect | Drop-video |
Number of samples | 125 | 1 473 | 448 | 148 | 399–2 827 | 355 | 361 |
Minimum sampling unit (m2) | 25 | 25 | 0.04 | 0.1 | 80 | 4 | 25 |
Minimum distance between samples (m) | 50 | 50 | 4 | 40 | 30 | 30 | 50 |
NA, not available data; ROV, remote automated vehicle.

Study areas in the Baltic Sea: (1) Stockholm-Turku, (2) Archipelago Sea, (3) Östergötland, (4) Vinga, and (5) Lithuania.
Dataset
Existing biological (Supplementary Table S1) and environmental (Supplementary Table S2) data from the five study areas across the Baltic Sea were used in this study covering various benthic species and water bodies.
Species data
Vinga area
Field sampling of benthic vegetation was performed at the outermost part of the archipelago of Gothenburg during August and September 2007. The study area was stratified and divided into five different types according to depth and wave exposure: (i) exposed sites with a depth of <6 m, (ii) sheltered sites with a depth of <6 m, (iii) sites of 6–10-m depth, (iv) sites of 10–20-m depth, and (v) sites previously known to contain seagrass. For each type of location, habitat information was collected from 25 randomly chosen sampling sites using a Seaviewer underwater video camera. At each sampling site, five fields of view (each sample covered minimum 25 m2 area) were selected randomly, analysed separately for per cent cover using 100 regular points on the video screen. The coverage of different groups of macrophytes was estimated based on the photo identification to the lowest taxonomic level possible. Models were based on data on the presence/absence and the average cover from each of the 125 sites.
Östergötland area
Benthic vegetation was mapped by an underwater video in late summer during 2008 and 2009. Based on randomly stratified sampling, a total of 2104 drop video points was collected, each sample covering ∼5 × 5 m. The coverage of different groups of macrophytes and zoobenthos was visually estimated in view based on the video identification to the lowest taxonomic level possible. The maximum sampled depth was 39 m (Carlström et al., 2010).
Stockholm-Turku area
A survey of Eurasian perch (Perca fluviatilis) egg strands was conducted during the spring season in 2003 and 2007 (Snickars et al., 2011). The latter survey was performed from a boat, slowly manoeuvred along the shoreline up to 6-m depth covering 15 sites during 1–3 occasions between 24 April and 6 June. Juvenile fish were sampled in late summer 2005–2007 and 2009 using small underwater detonations (Snickars et al., 2007), with the addition of 10 g of dynamite to increase the sampling area (ca. 80 m2) and detectability. Data of young-of-the-year of Eurasian perch, northern pike (Esox lucius), pikeperch (Sander lucioperca), roach (Rutilus rutilus), and sticklebacks (Gasterosteus aculeatus and Pungitius pungitius) were included in this study. These species are vital in the coastal fish community and the first three mentioned are top predators and fished both commercially and recreationally. Distribution models of all species except sticklebacks have previously been attempted and are here updated (Sundblad et al., 2011).
Archipelago Sea area
A dataset of macrophytes and sessile fauna was compiled from multiple sources of data collected during 2004–2009. Sampled depth ranged from 1 to 25 m. The cover of sessile species was visually estimated from an underwater video or from dive transects recording per cent cover in view (minimum 4 m2 area). The video sequences were acquired using observations from a drop video (each video sample covered minimum 25 m2 area) and remotely operated vehicle (ROV). Transect data of ROV were randomly subsampled to ensure a minimum distance of 30 m between observations to minimize the unwanted effects of autocorrelation. The compiled dataset included only macroscopic sessile species that are identifiable from video recordings. In the case that identification to species was not possible, observations were assigned to groups at the higher taxonomic level (e.g. Hydroidea, phanerogams) or to a group according to their growth form (e.g. filamentous algae, sessile filter-feeders).
Lithuania area
Field sampling of macrozoobenthos was performed between 1998 and 2010 for different seabed mapping (e.g. Olenin et al., 2003) and monitoring purposes in the territorial waters. The maximum sampled depth was 52 m. A van Veen grab sampler was used to take samples from soft bottom and a 0.04 m2 frame (Kautsky, 1993) was used to take samples from hard bottom. All samples were washed through a 0.5-mm mesh-sieve and preserved with 4% formalin neutralized with NaHCO3. Further treatment of material was performed according to HELCOM (1984). Animals were identified to the species level where practicable and their biomass were determined as formalin wet weight (g m−2). Mapping surveys of benthic macrovegetation (Bučas et al., 2007, 2009) and Baltic herring (Clupea harengus membras) spawning grounds were performed in 2003–2008 and 2008–2010, respectively, in the coastal waters along the mainland coast of Lithuania (<ref. 8>Bučas, 2009). The sampling sites were stratified according to depth and substrate that are suitable for the species (Olenin et al., 1996; Bučas et al., 2009). The benthic habitats were mapped using drop-down and hand-held cameras or described by scuba divers. Occurrence and percentage cover of macroalgal species were visually estimated in view. Occurrence of Baltic herring spawning grounds was determined by scuba divers during the spring season.
The study included 78 response variables in total across the five study areas (Supplementary Table S1). They were classified into five categories: vegetation (20 macroalgae and 11 phanerogams), invertebrates (25), fish reproduction areas (7), species diversity (5), and functional groups (10)—taxa with similar ecological function, e.g. mussels and barnacles grouped to suspension filter-feeders. Out of all responses, 68 were measured quantitatively in different units (abundance, biomass, cover, or number of species) and 10 qualitatively (occurrence). The quantitative data were used for the abundance models, and after transformation to presence/absence added to the qualitative data for use in the occurrence models.
Environmental data
In all, 27 environmental predictors were available among the datasets (Supplementary Table S2). All predictors were categorized to one of five groups: location (coordinates), bottom topography (e.g. depth, bottom slope), exposure (e.g. wave exposure, orbital wave velocity), substrate (e.g. soft/hard bottoms), and hydrographic variables (e.g. Secchi depth, salinity). The location variable was included to represent any unknown physical gradients. At least one predictor from each group was included in the set of predictors used to build the models in each study area. Collinearity among predictors was quantitatively checked using variance inflation factors (VIFs), where predictors with VIF > 3 were removed from further analysis (Quinn and Keough, 2002).
Modelling techniques
Generalized additive models
GAMs are semi-parametric extensions of GLMs useful for fitting non-linear relationships without prior assumptions on the shape of the response (Wood, 2006). The “mgcv” package for R version 1.6-2 was used. Model selection was based on penalized regression splines with default gamma values and the maximum of 4 degrees of freedom for continuous predictor variables to maintain ecologically interpretable models (Wood and Augustin, 2002). Location variables (coordinates) were fitted as a smooth surface and individual main effects were included from the remaining predictor variables. Binomial error distributions were used for qualitative responses and Poisson, or if necessary quasi-Poisson, for quantitative responses.
Multivariate adaptive regression splines
MARS is a flexible data-driven non-parametric method of fitting regression models, incorporating the benefits of the recursive partitioning approach, while producing a continuous model (Friedman, 1991). The “earth” package for R version 2.11.1 was used. Models were built using the GLM approach and specified to include first-order interactions, where significant. The GLMs used binomial error distributions for qualitative responses and Poisson, or if necessary quasi-Poisson, for quantitative responses.
Random forest
RF is an ensemble method where a large number of decision trees (typically 500–1000) are built and responses are predicted based on averages (regression) or by majority rules (classification) from all trees (Breiman, 2001; Cutler et al., 2007). Compared with traditional regression and classification trees (Breiman et al., 1984), the main advantages using RF are that it produces more accurate predictions and that it is easier to use as it requires no pruning. Individual RF trees were constructed using the default values of random sample mtry of the predictors (p): p/3 for regression problems and p1/2 for classification problems (Liaw and Wiener, 2002). The package “randomForest” for R version 4.5-35 was used. Models were developed with 1000 classification trees each; exploratory graphs indicated that error rates became stable well before this number of trees was developed.
Maximum entropy (MAXENT)
MAXENT is a general-purpose method for making predictions or inferences from incomplete information (Phillips et al., 2006). When used to predict species distribution, it estimates a target's probability distribution based on a distribution of maximum entropy under the constraint that the expected value of each environmental variable under this estimated distribution matches its empirical mean. This method is equivalent to finding the maximum-likelihood distribution of a species. MAXENT is termed a presence-only technique, where instead of absences it takes a random sample of pixels from the study region used in model calibration to characterize the “background” of environments available to the species (Heumann et al., 2011). Version 3.3.3a of the “MaxEnt” software was used. In this study, we had reliable absence data, which were therefore used as background data, thus giving MaxEnt the same amount of information as the other methods. The convergence threshold was set at 10−5 and the maximum number of iterations at 1000 to allow the algorithm to get close to convergence (Phillips et al., 2006). The regularization parameter used ranged between 1 and 2 depending on the response and the level of smoothing that was required to avoid overfitting and produce ecologically interpretable models.
All four methods were used to model and predict qualitative response variables, yielding a total of 292 occurrence models (Supplementary Table S1). Since MAXENT is restricted to presence-only data, three methods (GAM, MARS, and RF) were used to model and predict a total of 204 quantitative (species abundance and diversity) response variables.
Evaluation of predictive performance of models
A split sample method was used to measure the predictive performance of the derived models (Picard and Berk, 1990); models were calibrated on a random subsample consisting of 70% of all data (internal or train data) and accuracy or error of predictions was determined on the withheld 30% of data (external or test data). Presences and absences were randomly split into equal proportions in train and test data maintaining prevalence (the proportion of sampled sites where a species is present). The performance of occurrence models was measured in terms of area under the curve (AUC) of a receiver operating characteristic plot (Fielding and Bell, 1997). According to Swets (1988), models providing AUC values above 0.9 are considered highly accurate or “excellent”, those providing AUC values in the range 0.7–0.9 moderately accurate or “good”, and those lower than 0.7 poorly accurate or “poor”. The performance of abundance models was measured based on explained variance or coefficient of determination (r2), and root mean squared error normalized to the range (NRMSE) due to right-skewed data and different abundance units of the responses.
Autocorrelation in residuals of all models was inspected visually using spline correlograms plotted with the “ncf” package for R. Autocorrelation was determined to be insignificant (based on 95% pointwise bootstrap confidence intervals) and was therefore considered as negligible in this study.
One-way analysis of variance (ANOVA) was used to test if there was a significant difference in model performance, in terms of AUC, r2, and NRMSE, between modelling methods or the type of response variable. Residuals of ANOVA models were checked for adherence to the assumption of variance equality and normality by using scatterplots of residuals vs. predictors and histogram of residuals (Zuur et al., 2007). Transformations of responses were applied for variables where the assumptions were not met, and if the unequal variances remained, different residual variation was allowed in a model by fitting the best variance structure (Zuur et al., 2009). For multiple means in ANOVA, Tukey's studentized range test was used to find the means that were significantly different from each other (Quinn and Keough, 2002).
The Pearson product–moment correlation coefficient was determined among the performance statistics of the models across the used methods to test the consistency among the different modelling methods. Variables were transformed where needed to meet the assumptions of distribution.
Assessment of factors important to the performance of models
In this study, we tested six data traits that could influence the performance of models: number of samples, sampling density, response type, for occurrence data response prevalence, and for abundance data variance in the response. The number of samples ranged from 125 to 2827 (Table 1). The sampling density, calculated as a ratio of the number of samples to the size of study area, ranged from 0.01 to 3.13 samples km−2. The response type was classified into five categories (Supplementary Table S1): macrophytes, invertebrates, fish reproduction areas, diversity (species richness), and functional groups (e.g. primary producers). The response prevalence (the proportion of sampled sites where a species is present) ranged from 0.04 to 1.00 (Supplementary Table S1). Variance in the abundance of each response variable was estimated as the coefficient of variation (ratio between standard deviation and mean), which ranged from 0.27 to 8.37.
The importance of factors (data traits) to performance of occurrence models (in terms of AUC) and abundance models (in terms of r2 and NRMSE) was assessed by regression tree analysis, due to unbalanced datasets and missing values (Zuur et al., 2007). The regression tree analysis seeks to repeatedly split the response variable (e.g. AUC) into more homogeneous subgroups, using combinations of explanatory variables (De'ath and Fabricius, 2000). The size of the final tree was selected (pruned) by tenfold cross-validation and the 1-SE rule (Breiman et al., 1984). The total variance explained by the best single tree was assessed. The tree was represented graphically, with the relative lengths of the vertical lines (branches) associated with each split representing the proportion of explained variance by each split. The regression tree analysis was performed using “rpart” package for R.
Results
Performance of predictive modelling methods
In total, 292 occurrence models for presence/absence data and 204 abundance and diversity models for quantitative data were built using four modelling techniques (GAM, MARS, RF, and MAXENT). The measures of method performance are summarized in Supplementary Table S3.
Predictive accuracy of all the occurrence models in terms of the AUC values ranged from poor to excellent level (Figure 2a) with a mean AUC of 0.85 ± 0.09 (mean ± s.d.) and a median AUC of 0.87. Overall, performance was comparable across methods. A significant difference (F = 3.562, df = 3, p = 0.015) in the performance could only be seen between RF, showing the highest mean predictive accuracy (AUC = 0.87 ± 0.08), and MARS, with the lowest mean predictive accuracy (AUC = 0.82 ± 0.08).

Performance of the four predictive modelling methods for occurrence data (a) in terms of AUC and three predictive modelling methods for abundance data (b and c) in terms of coefficient of determination (r2) and NRMSE. The performance of different methods is shown by a combination of strip charts and mean plots, where values = dots, means = squares, 95% confidence intervals = short whiskers, and standard deviation = long whiskers. Modelling methods: GAMs, MARSs, RF, and MAXENT. Dotted line = mean statistic of all the methods.
The predictive performance in terms of coefficient of determination (r2) of all abundance models ranged from almost none to 0.74 with a mean r2 of 0.24 ± 0.19 (Figure 2b). Again, RF had the best mean predictive performance (r2 = 0.28 ± 0.20). In comparisons between the models, the performance of RF was found to be higher than that of GAM (r2 = 0.21 ± 0.17; F = 3.47, df = 2, p = 0.03), whereas the predictive performance of MARS was between RF and GAM and did not significantly differ from either of the two methods. The predictive error in terms of NRMSE varied from 0.02 to 1.07 with a mean NRMSE of 0.17 ± 0.11 (Figure 2c). The mean predictive error of RF (NRMSE = 0.15 ± 0.05) was lower than that of MARS (NRMSE = 0.19 ± 0.13; F = 3.47, df = 2, p = 0.03), whereas a difference was neither found between RF and GAM nor found between MARS and GAM.
The predictive performance measures were correlated among the modelling methods for both occurrence (r = 0.60–0.79, p < 0.001, n = 72) and abundance data (r = 0.65–0.73, p < 0.001, n = 68; Table 2). As expected from the analyses of overall predictive performance, the performance of MARS was the most deviant in relation to the other methods (r = 0.60–0.72), whereas GAM, MAXENT, and RF were more strongly correlated (r = 0.70–0.79).
Correlations between the performance of the predictive modelling methods used for the occurrence data in terms of area under curve and abundance data in terms of coefficient of determination (upper triangle of correlation matrix) and NRMSE (lower triangle of correlation matrix).
Data . | Occurrence . | Abundance . | |||||
---|---|---|---|---|---|---|---|
Method . | GAM . | MARS . | RF . | MAXENT . | GAM . | MARS . | RF . |
GAM | 1.00 | 0.65 | 0.70 | 0.79 | 1.00 | 0.68 | 0.70 |
MARS | 1.00 | 0.62 | 0.60 | 0.71 | 1.00 | 0.65 | |
RF | 1.00 | 0.77 | 0.73 | 0.72 | 1.00 | ||
MAXENT | 1.00 |
Data . | Occurrence . | Abundance . | |||||
---|---|---|---|---|---|---|---|
Method . | GAM . | MARS . | RF . | MAXENT . | GAM . | MARS . | RF . |
GAM | 1.00 | 0.65 | 0.70 | 0.79 | 1.00 | 0.68 | 0.70 |
MARS | 1.00 | 0.62 | 0.60 | 0.71 | 1.00 | 0.65 | |
RF | 1.00 | 0.77 | 0.73 | 0.72 | 1.00 | ||
MAXENT | 1.00 |
GAM, generalized additive model; MARS, multivariate adaptive regression spline; RF, random forest; MAXENT, maximum entropy modelling.
Correlations between the performance of the predictive modelling methods used for the occurrence data in terms of area under curve and abundance data in terms of coefficient of determination (upper triangle of correlation matrix) and NRMSE (lower triangle of correlation matrix).
Data . | Occurrence . | Abundance . | |||||
---|---|---|---|---|---|---|---|
Method . | GAM . | MARS . | RF . | MAXENT . | GAM . | MARS . | RF . |
GAM | 1.00 | 0.65 | 0.70 | 0.79 | 1.00 | 0.68 | 0.70 |
MARS | 1.00 | 0.62 | 0.60 | 0.71 | 1.00 | 0.65 | |
RF | 1.00 | 0.77 | 0.73 | 0.72 | 1.00 | ||
MAXENT | 1.00 |
Data . | Occurrence . | Abundance . | |||||
---|---|---|---|---|---|---|---|
Method . | GAM . | MARS . | RF . | MAXENT . | GAM . | MARS . | RF . |
GAM | 1.00 | 0.65 | 0.70 | 0.79 | 1.00 | 0.68 | 0.70 |
MARS | 1.00 | 0.62 | 0.60 | 0.71 | 1.00 | 0.65 | |
RF | 1.00 | 0.77 | 0.73 | 0.72 | 1.00 | ||
MAXENT | 1.00 |
GAM, generalized additive model; MARS, multivariate adaptive regression spline; RF, random forest; MAXENT, maximum entropy modelling.
Effects of data traits
For the occurrence models, the way different attributes of the data (traits) affected the performance in terms of AUC differed among methods (Table 3 and Figure 3a). For RF, the sampling density was the most important trait followed by the effect of response prevalence. The mean AUC was higher at sampling densities ≥0.04 samples km−2 (mean AUC = 0.88 and n = 66) than at sampling densities <0.04 samples km−2 (AUC = 0.74 and n = 7). In cases of higher sampling densities (≥0.04 samples km−2), the models with a response prevalence of ≥0.28 were classified as relatively “good” (AUC = 0.86 and n = 30), and the models with a prevalence of <0.28 were classified as “excellent” (AUC = 0.90 and n = 36).
Relative importance of the effects of different data traits on the AUC by different predictive modelling methods used on occurrence data.
Trait . | RF . | GAMs . | MARSs . | Maximum entropy models . |
---|---|---|---|---|
Response type | 0 | 0 | 0 | 0 |
Number of samples | 0 | 0 | 0 | 0 |
Response prevalence | 6 | 18 | 0 | 14 |
Sampling density | 25 | 0 | 0 | 0 |
Total | 31 | 18 | 0 | 14 |
Trait . | RF . | GAMs . | MARSs . | Maximum entropy models . |
---|---|---|---|---|
Response type | 0 | 0 | 0 | 0 |
Number of samples | 0 | 0 | 0 | 0 |
Response prevalence | 6 | 18 | 0 | 14 |
Sampling density | 25 | 0 | 0 | 0 |
Total | 31 | 18 | 0 | 14 |
Importance is measured by the percentage of variance explained in regression tree analysis.
Relative importance of the effects of different data traits on the AUC by different predictive modelling methods used on occurrence data.
Trait . | RF . | GAMs . | MARSs . | Maximum entropy models . |
---|---|---|---|---|
Response type | 0 | 0 | 0 | 0 |
Number of samples | 0 | 0 | 0 | 0 |
Response prevalence | 6 | 18 | 0 | 14 |
Sampling density | 25 | 0 | 0 | 0 |
Total | 31 | 18 | 0 | 14 |
Trait . | RF . | GAMs . | MARSs . | Maximum entropy models . |
---|---|---|---|---|
Response type | 0 | 0 | 0 | 0 |
Number of samples | 0 | 0 | 0 | 0 |
Response prevalence | 6 | 18 | 0 | 14 |
Sampling density | 25 | 0 | 0 | 0 |
Total | 31 | 18 | 0 | 14 |
Importance is measured by the percentage of variance explained in regression tree analysis.

Pruned regression trees of the prediction accuracy in terms of the AUC by different predictive modelling methods (a–c) for occurrence data. Each of the splits is labelled with the important effect and its values that determine the split. Each of the terminal nodes is labelled with the mean AUC value and the number of observations in the group (n) forming the terminal node. For example, with GAMs (b), the mean rating of the AUC is lower at a response prevalence ≥0.275 (AUC = 0.8176 and n = 33) than at a response prevalence <0.275 (AUC = 0.8902 and n = 40). The length of each vertical branch is proportional to the variance explained by the splitting variable.
The response prevalence was the only important effect for GAM and MAXENT (Table 3 and Figure 3b and c). For GAM, the mean AUC was higher at a response prevalence of <0.28 (AUC = 0.89 and n = 40) than at a response prevalence of ≥0.28 (AUC = 0.82 and n = 33). For MAXENT, the mean AUC was higher at a response prevalence <0.34 (AUC = 0.88 and n = 49) than at a response prevalence ≥0.34 (AUC = 0.81 and n = 23).
The classification tree analysis produced no significant partition of the data for MARS, suggesting that none of the tested data traits influenced model performance (Table 3).
For the abundance models, the only effect of variance in the response data (variance) was found on the predictive performance in terms of r2 (Table 4 and Figure 4). Generally, the mean of r2 was higher at a variance <1.6–2.6 (mean r2 = 0.31–0.40 and n = 15–30) than at a variance ≥1.6–2.6 (r2 = 0.18–0.19 and n = 37–52). The variability in the predictive performance of RF measured as NRMSE was not affected by any of the data traits (Table 4 and Figure 4). For GAM and MARS, the number of samples was the most important trait, affecting the predictive performance in terms of NRMSE (Table 4 and Figure 4). Lower numbers of samples (<137) resulted in a higher error (mean NRMSE = 0.30–0.33 and n = 12–14) than where the number of samples was ≥137 (NRMSE = 0.15–0.16 and n = 53).
Relative importance of the effects of data traits on coefficient of determination (r2) and NRMSE for different predictive modelling methods used for abundance data.
Trait . | r2 . | NRMSE . | ||||
---|---|---|---|---|---|---|
RF . | GAMs . | MARSs . | RF . | GAMs . | MARSs . | |
Response type | 0 | 0 | 0 | 0 | 0 | 0 |
Variance in the response data | 29 | 11 | 19 | NA | NA | NA |
Number of samples | 0 | 0 | 0 | 0 | 36 | 17 |
Data density | 0 | 0 | 0 | 0 | 0 | 0 |
Total | 29 | 11 | 19 | 0 | 36 | 17 |
Trait . | r2 . | NRMSE . | ||||
---|---|---|---|---|---|---|
RF . | GAMs . | MARSs . | RF . | GAMs . | MARSs . | |
Response type | 0 | 0 | 0 | 0 | 0 | 0 |
Variance in the response data | 29 | 11 | 19 | NA | NA | NA |
Number of samples | 0 | 0 | 0 | 0 | 36 | 17 |
Data density | 0 | 0 | 0 | 0 | 0 | 0 |
Total | 29 | 11 | 19 | 0 | 36 | 17 |
Importance is measured by the percentage of variance explained in regression tree analysis. Effect of variance in the response data was considered not applicable (NA) for NRMSE, because it is normalized to the range (which is correlated with the variance).
Relative importance of the effects of data traits on coefficient of determination (r2) and NRMSE for different predictive modelling methods used for abundance data.
Trait . | r2 . | NRMSE . | ||||
---|---|---|---|---|---|---|
RF . | GAMs . | MARSs . | RF . | GAMs . | MARSs . | |
Response type | 0 | 0 | 0 | 0 | 0 | 0 |
Variance in the response data | 29 | 11 | 19 | NA | NA | NA |
Number of samples | 0 | 0 | 0 | 0 | 36 | 17 |
Data density | 0 | 0 | 0 | 0 | 0 | 0 |
Total | 29 | 11 | 19 | 0 | 36 | 17 |
Trait . | r2 . | NRMSE . | ||||
---|---|---|---|---|---|---|
RF . | GAMs . | MARSs . | RF . | GAMs . | MARSs . | |
Response type | 0 | 0 | 0 | 0 | 0 | 0 |
Variance in the response data | 29 | 11 | 19 | NA | NA | NA |
Number of samples | 0 | 0 | 0 | 0 | 36 | 17 |
Data density | 0 | 0 | 0 | 0 | 0 | 0 |
Total | 29 | 11 | 19 | 0 | 36 | 17 |
Importance is measured by the percentage of variance explained in regression tree analysis. Effect of variance in the response data was considered not applicable (NA) for NRMSE, because it is normalized to the range (which is correlated with the variance).

Pruned regression trees of the performance of abundance models by three predictive modelling methods, in terms of the coefficient of determination (a–c) and NRMSE (d and e). Each of the splits is labelled with the important effect and its values that determine the split. Each of the terminal nodes is labelled with the mean and the number of observations in the group (n). For example, for RF (a): the mean of r2 is lower at a variance of ≥2.557 (mean r2 = 0.1886 and n = 37) than at a variance of <2.557 (r2 = 0.3983 and n = 30). The length of the vertical branch is proportional to the variance explained by the splitting variable. Variance, response variance; N samples, number of samples.
Discussion
Predictive performance of the modelling methods
The analyses in this study represent a comprehensive assessment of the overall predictive potential of species–environment relationships in the Baltic Sea, and a comparison of the performance of modelling methods applied to a large variety of biological response variables. Species–environment relationships were assessed for 78 quantitative and qualitative response variables, based on taxonomy or functional traits, using four different modelling techniques.
Our results indicate that predictions based on qualitative data, i.e. occurrence data, can be sufficiently accurate to be used in conservation and coastal management planning and in other applications where estimates of the distribution of species are relevant. All methods provided useful models, with a good or excellent mean predictability (AUC > 0.7) for more than 70% of all models. However, there were some differences in performance among the modelling methods: the mean performance of RF was marginally better, whereas the performance of MARS was slightly worse than GAM and MAXENT. The relatively low correlation between these methods (r = 0.60–0.79) suggests that method-specific performance depended on the response ecological traits.
For abundance data, the majority of models had a statistically significant fit, but the mean predictability of species abundance was relatively low (r2 = 0.28 ± 0.20). On the other hand, the performance (NRMSE) showed that the mean deviation of models was ∼17 ± 11% of the relative range of abundance data. It has been suggested that a large part of the variation in abundance models can probably be explained by including the measures of biological interactions and dispersal (Elith and Leathwick, 2009). Generally, this type of data in marine environment is usually spatially and temporally limited (McArthur et al., 2009) and it was the case in this study too. However, the species and environment data in this study covered relatively large spatial scales (the extent of study areas ranged from 40 to 40 000 km2); therefore, we focused on physical environmental predictors, which have a stronger effect than biological interactions on such scale of the Baltic Sea (Kautsky and van der Maarel, 1990). According to our results, this pattern seems to correspond to the species occurrence models, but not with the abundance models. This could be explained by that species presence is primarily constrained by abiotic factors such as available substrate, light climate, and/or wave exposure, whereas species abundance is likely to be more strongly shaped by biological factors than by physical ones. Thus, future studies should address these issues, since new approaches for incorporating population dynamics, biotic interactions, and community ecology into species distribution models at multiple spatial scales are becoming increasingly available (e.g. Guisan and Thuiller, 2005; Boulangeat et al., 2012). Nevertheless, for some of the responses, e.g. total cover of vegetation and species richness, models of abundance were generally quite powerful, explaining more than 50% of the variability in the test set by the model. Again, similarly to the occurrence models, the correlations among methods suggest that the ability to correctly predict abundances differed among the responses.
Most studies show that it is difficult to create accurate models for all species regardless of the technique used (Guisan et al., 2007). According to several authors (McPherson and Jetz, 2007; Syphard and Franklin, 2010), the effect of species ecological characteristics on predictability of species distributions overrides any differences in the modelling technique. Other studies have, however, shown that the accuracy of the predictions was mostly affected by the characteristics of the data rather than species traits (Chefaoui et al., 2011). No direct comparison of model performance between species with differing traits was made in this study; instead, model accuracy was compared between different groups of organisms. The expectation was that model accuracy of fish reproduction areas and invertebrates would be relatively lower than for macrophytes, since benthic vegetation is sessile while fish and most invertebrates are more mobile in comparison. The response type at the group level did not affect model performance. Variation within the groups of many of the data traits was of greater importance.
In this study, the highest mean predictive performance among the occurrence and abundance models was found by RF. The method belongs to the machine learning techniques and includes automatically all possible interactions among explanatory variables, which could help to explain why RF performed slightly better than the other methods. Interactions are available in other methods as well, although a degree of interaction must be set manually in MARS, MAXENT, and GAM. Although second-order interactions were allowed in MARS, its predictive performance was the lowest among the other methods and it had the highest variance of predicted error (NRMSE). This indicates that MARS is not very stable for certain responses, especially for quantitative ones, and should be applied with care. Moreover, predictive performance of MARS was the most sensitive to splitting of the data into train and test datasets (Šiaulys and Bučas, 2012). Overall, these results suggest that there exist several useful methods for modelling and mapping benthic species, habitats, and to some degree ecological functions in the Baltic Sea. It has previously been recommended that averaged predictions by several techniques (usually called ensemble modelling) should be used, as it gives better accuracy by reducing the predictive uncertainty of single models (Araujo and New, 2007; Grenouillet et al., 2011, and references therein). The differences in performance observed in our study among the methods appear largely unpredictable, and therefore, the differences in predictions resulting from any of these methods represent a source of uncertainty. Consequently, our results very much support an ensemble approach, where a set of methods are applied.
Data traits affecting the performance of the modelling methods
The differences in performance among the predictive modelling methods could be explained to some extent by data traits. Response prevalence and sampling density were the most important effects in occurrence models explaining up to 31% of variance of the model accuracy. Variance in the response data and the number of samples were the most important factors for species abundance and diversity models explaining up to 36% of variance in the predicted performance.
Prevalence has been reported to affect the predictive performance of species occurrence models (Manel et al., 2001; McPherson and Jetz, 2007). This has been considered a drawback, as good performance of a distribution model can be achieved according to a number of performance measures simply because the species has a specific prevalence. Moreover, the effect of prevalence may differ across different biological classes of species (Pearce and Ferrier, 2000) and methods (Marmion et al., 2009). Santika (2011) suggested that the AUC provides information about the degree of specialization of a species along the range of predictor conditions in the study area. In our study, the mean prevalence of the responses was 0.31 ± 0.23 (median = 0.24), indicating that most of the species could be specialized, rather than commonly occurring. Previous studies have suggested that the distribution of specialist species are better modelled than those of generalist species (Segurado and Araujo, 2004; Elith et al., 2006; Franklin et al., 2009; Glockzin et al., 2010). Marmion et al. (2009) found that the GAM was the most sensitive to prevalence, followed by MARS and RF. This agrees with our results, where the effect of prevalence was the most important for the GAM and MAXENT, and less important for RF and MARS.
Sampling density plays a significant role in the performance of predictive modelling methods (Li and Heap, 2008). When the sampling density is high, most methods produced similar results (Burrough and McDonnell, 1998), whereas when sampling is sparse, the underlying assumptions about the variation among samples may differ and the choice of method and parameters may become critical (Hartkamp et al., 1999). In this study, the sampling density was relatively low (maximum 3.13 samples km−2); however, its evident effect was found only for distribution models by RF. Why the effect of sampling density was not significant for other methods, could be related to its interaction with spatial scale and prevalence. A very commonly occurring species would likely require a lower sampling density at small spatial scales, but a larger effort over large scales to accurately capture the full range, and the opposite, a rare species (low prevalence) requires higher sampling density not only in its optimal conditions but even more so to detect the range limits.
Predictive performance of abundance methods depends on the variance in the response data, and with increased variance, performance decreases (Li and Heap, 2008, and references therein). A similar pattern can be seen in our results, where the coefficient of variation negatively correlated with the coefficient of determination (r = −0.32, p < 0.01, n = 199). The effects of variance in the data on the predictive performance was considered independent of sampling density but was method-dependent (Li and Heap, 2008, and references therein). Our results were consistent with these findings, where the effect of variance in the response data was most important to RF, followed by MARS and GAM (Table 4 and Figure 4). However, the differences in relationship between the coefficient of variation and predictive performance among the models were relatively small (Figure 4).
The consensus is that sample size affects the modelling outcome; small sample sizes result in decreased predictive potential when compared with models developed with more samples (McPherson et al., 2004; Hernandez et al., 2006). However, an increase in the number of observations may even reduce model accuracy due to overfitting of smaller datasets (Mateo et al., 2010, and references therein). Nevertheless, there is a common need to obtain a minimum number of samples to reduce an expensive field sampling effort and generate robust models. Although there is no established rule to decide a minimum sample size for modelling, several studies have suggested different minimum sample sizes, where the predictive power improved greatly, e.g. 30 (Wisz et al., 2008), 70 (Jimenez-Valverde et al., 2009), 200 (Hanberry et al., 2012), and ca. 520–570 (Mateo et al., 2010). In our study, we found that more than 137 samples were needed to separate relatively “good” models from the “poor” ones. It has been shown that the effect of minimum sample size varies among predictive modelling techniques (Wisz et al., 2008) and that MAXENT is generally considered robust to variation in sample size and usable even with very small samples (Pilar et al., 2006). This is supported by our findings that the number of samples had no effect on the performance of MAXENT, whereas an effect was found on the performance of other modelling methods, especially for species abundance and diversity models by GAM and MARS (Table 4 and Figure 4). However, Newbold et al. (2009) stress that the completeness of sampling with respect to environmental gradients rather than sample size alone is the most important determinant of model accuracy, and well-sampled data with few records are considered better than biased data of any size (Bean et al., 2011).
Conclusions and recommendations
We conclude that the spatial distribution, abundance, and diversity of benthic species in the Baltic Sea can be successfully predicted using several non-linear predictive modelling techniques. Even if the performance of different methods (measured as AUC, r2, NRMSE, or any other metric) is comparable, it is clear that the predictions produced by these methods will be slightly different and there is no way to judge which of these methods is the best. To estimate the uncertainty of model predictions, in agreement with other authors (Araujo and New, 2007; Grenouillet et al., 2011, and references therein), we recommend applying more than one modelling method (ensemble approach). Different classes of modelling method should be included in the ensembles to take the advantage of the each of their strengths. From machine learning techniques, RF should be considered due to the highest mean predictive performance found in this study, especially for quantitative data.
There were no conceptual differences in modelling response types at the group level: macrophytes, invertebrates, fish reproduction areas, diversity, and functional groups. The differences in performance among the modelling methods could to some extent be explained by data traits. The response prevalence had a strong effect on the performance of GAM and MAXENT on the occurrence data, where the predictive accuracy of models was higher at a response prevalence <0.28–0.34. RFs on the occurrence data were the only models sensitive to sampling density, showing higher accuracy of predictions with increasing sampling density. For the abundance and diversity models, accuracy increased with sample size and with reduced variance in the response data. We consequently recommend that sampling design for modelling should take account of the need to produce a comprehensive dataset that encompasses the appropriate environmental gradients within meaningful spatial scales for the modelled response.
Supplementary material
Acknowledgements
The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP/2007-2013) under grant agreement no 217246 made with the joint Baltic Sea research and development programme BONUS. The following kindly contributed data: VELMU, funded by the Finnish Ministry of Environment, the NANNUT project funded by Interrreg, the FINMARINET project funded by Life+ as well as Metsähallitus and Alleco oy.
References
Author notes
Handling editor: Rochelle Seitz