Image-based seabed classification: what can we learn from terrestrial remote sensing?

ICES Journal of Marine Science, 73: 2425–2441.

Maps that depict the distribution of substrate, habitat or biotope types on the seabed are in increasing demand by marine ecologists and spatial planners, underpinning decision making in relation to marine spatial planning and marine protected area network design. Yet, the science discipline of image-based seabed mapping has not fully matured, and rapid progress is needed to improve the reliability and accuracy of maps. To speed up this process, we conducted a literature review of common practices in terrestrial image classification based on remote-sensing data, a related discipline, albeit one with a larger scientific community and a longer history. We identified the following key elements of a mapping workflow: (i) Data pre-processing, (ii) Feature extraction, (iii) Feature selection, (iv) Classification, (v) Post-classification enhancements, and (vi) Evaluation of classification performance. Insights gained from the review served as a baseline against which recent seabed mapping studies were compared. In this way, we identified knowledge gaps and propose modifications to the mapping workflow. A main concern in current seabed mapping practice is that a large number of often-correlated predictor features is extracted, creating a multidimensional feature space. Filling this space effectively with an appropriate number of training samples is likely to be impossible. Hence, it is necessary to reduce the dimensionality of the feature space via data transformation [e.g. principal component analysis (PCA)] or feature selection and to remove correlated features. We propose to make dimensionality reduction an integral part of any mapping workflow. We also suggest adopting recommendations for accuracy assessment originally drawn up for terrestrial land cover mapping.
These include the publication of two or more measures of accuracy including overall and class-specific metrics, publication of associated confidence limits and the provision of the error matrix.


Introduction
Categorical maps of seabed characteristics (e.g. substrate, habitat or biotope type) have become an essential tool for various purposes including marine spatial planning and management, design of marine protected area networks, scientific research, assessment of non-living resources, and fisheries resource management among others (Harris and Baker, 2012). As electromagnetic waves in the visible light spectrum (400–700 nm) are quickly attenuated by seawater, optical imaging of the seabed is limited to areas of shallow, clear water. Acoustic methods have therefore become the main tool for seabed mapping. Recent progress was mainly driven by improvements in multibeam echosounder technology. In conjunction with ground-truth information collected with various seabed samplers and underwater imaging techniques, it is now possible to derive highly detailed maps.
In the context of seabed mapping, the term habitat has acquired various meanings. In its original form, a habitat is a place where an organism lives (Begon et al., 2006) and species distribution modelling is the appropriate methodology to map habitats in that sense. Kostylev et al. (2001), in an influential article on benthic habitat mapping, gave a working definition of a habitat as a "spatially defined area where the physical, chemical and biological environment is distinctly different from the surrounding environment". This definition has recently been revised by Lecours et al. (2015) who define benthic habitats as "areas of seabed that are (geo)statistically significantly different from their surroundings in terms of physical, chemical and biological characteristics, when observed at particular spatial and temporal scales". However, such definitions are largely synonymous with that of a biotope, which might be defined as consisting of the physical environment (habitat) and its distinctive assemblage of conspicuous species (Olenin and Ducrotoy, 2006). It is beyond the scope of this review to settle the question of what a habitat is. For the purpose of this review we are concerned with seabed mapping, which might include any seabed characteristic that can be derived and classified from ground-truth observations and spatially predicted using suitable predictor features and modelling approaches. In that sense, it might include sediments (e.g. Diesing et al., 2014), substrates (e.g. Lucieer et al., 2013), benthoscapes (Brown et al., 2012), habitats (e.g. Che Hasan et al., 2012b), communities (e.g. Ierodiaconou et al., 2011), biotopes (Buhl-Mortensen et al., 2009), and ecosystems (Costa and Battista, 2013).
Following Brown et al. (2011), seabed mapping approaches based on high-resolution acoustic data can be distinguished into expert interpretation, signal-based classification and image-based classification. [These authors actually use the term "segmentation" instead of classification. However, we avoid this term here, as we use segmentation in a specific context related to Geographic Object-Based Image Analysis (GEOBIA) (see chapter 3.4.1).] Expert interpretation has been the method of choice in the past and is still widely used. However, this approach is highly subjective, not repeatable and often time-consuming. For this reason, we do not consider studies that are based solely on expert interpretation. Signal-based approaches, such as angular range analysis (ARA), usually utilize only backscatter data and are difficult to integrate with other data sets. Here we focus on image-based classification, whereby the "imagery" is taken to mean acoustic backscatter, bathymetry, derivatives of the aforementioned and ancillary data (e.g. oceanographic variables). In that sense, image-based seabed classification has much in common with remote-sensing image classification, including terrestrial land cover mapping. The term "land cover" has been defined as the observed bio-physical cover on the earth's surface (Di Gregorio and Jansen, 1998). Land cover can be readily mapped from images of the Earth's surface. Initial efforts of land cover mapping were based on aerial photography; however, the modern era of land cover mapping arguably started with the launch of the Landsat 1 satellite in 1972. As such, it has a significantly longer history than image-based seabed mapping and, consequently, the science of land cover mapping is much more advanced. This is reflected in the significantly higher number of published studies in the field (Figure 1).
It might therefore be beneficial to review common practice and past experience in land cover mapping with a view to improving mapping methodologies and workflows for benthic habitat mapping. Based on such a review, it should be possible to establish best practice in image-based land cover mapping, which can in turn serve as a baseline against which current practice in seabed mapping is compared. The objective of this review is to improve current processes in image-based seabed mapping based on experience gained in terrestrial mapping utilizing remote-sensing data.

Method
Initially, we carried out a review of the current literature on land cover mapping based on remotely sensed optical imagery. Due to the very large number of published papers (Figure 1), this was mainly conducted as a review of reviews with a focus on classification processes and their common elements. This approach also allowed us to identify relevant studies dealing with specific topics and those were specifically consulted for the sections on feature selection, geographic object-based image analysis (GEOBIA), and accuracy assessment.
As the literature on seabed mapping is much more limited, it was feasible to conduct a structured review of relevant studies. The review approach was adapted from Hughes et al. (2014) and is visualized in Figure 2. The review focussed on studies utilizing multibeam echosounder data to make categorical predictions of benthic substrate or habitat with (semi-)automated classification methods, i.e. excluding studies that are based solely on expert interpretation. An online literature search was carried out on [...] AND TOPIC: ("supervised classification" OR "unsupervised classification" OR classifier OR prediction OR "machine learning" OR "object based"). This yielded a total of 30 studies. Following an initial screening, a further six studies were introduced. These studies were not included in the search results, as they were either too new (in press or just published) or missed for other reasons, but were deemed important. The 36 articles were subjected to a title and abstract screening, and 7 articles were removed as they were either reviews or not on topic. The remaining 29 articles were obtained as full-text copies and, after a full-text screening, a further 9 studies were removed as they were not on topic. The final set of papers comprised 20 studies spanning the publication years 2004–2015, some of which utilized more than one classifier (comparative studies).

Land cover mapping process
A universal process for land cover mapping is unlikely to exist; however, a sequence of typical elements of the mapping process (Lu and Weng, 2007) is summarized in Figure 3 and described in the following sections.

Pre-processing
In terrestrial remote sensing, the objective of this step is to present the data in a format from which accurate land cover information can be extracted (Cihlar, 2000). Important image preprocessing steps include radiometric correction of variations in the image resulting from environmental conditions or sensor anomalies, geometric correction to compensate for the Earth's rotation and for variations in the position and attitude of the satellite, terrain correction of relief distortions with the help of digital elevation models (DEMs) and image enhancement to improve the visual interpretability of an image (Purkis and Klemas, 2011). Despite some similarities, acoustic remote-sensing is significantly different from terrestrial remote-sensing and the details of the above-mentioned approaches will not be discussed here.
Acoustic remote-sensing data, whether collected with sidescan sonar or multibeam echosounder, need to be radiometrically and geometrically corrected. Geometric corrections change the spatial position of a pixel, while radiometric corrections change the digital number or value assigned to a pixel (Chavez et al., 2002). Further pre-processing steps might include the removal of speckle, i.e. high-frequency noise in the imagery (Blondel and Murton, 1997; Chavez et al., 2002). Variations in backscatter intensity parallel to the vessel's track might be reduced by the application of 2D Fourier filtering (Wilken et al., 2012).
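As an illustrative sketch (not drawn from the reviewed studies), speckle can be suppressed with a simple median filter; the synthetic mosaic, noise level and 3 × 3 window below are assumptions:

```python
import numpy as np
from scipy.ndimage import median_filter

# Synthetic backscatter mosaic in dB with mild sensor noise and one
# isolated high-frequency speckle spike.
rng = np.random.default_rng(42)
backscatter = np.full((50, 50), -20.0) + rng.normal(0, 0.5, (50, 50))
backscatter[10, 10] = 5.0  # lone speckle pixel

# A 3 x 3 median filter replaces each pixel with the median of its
# neighbourhood, suppressing isolated spikes while preserving edges
# better than a mean filter would.
smoothed = median_filter(backscatter, size=3)
```

The window size trades noise suppression against loss of fine detail; larger windows remove more speckle but blur genuine seabed features.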

Feature extraction
The term feature is often used to describe predictor variables derived from remote-sensing data to be used for image classification. Many features such as spectral signatures, vegetation indices (e.g. Normalized Difference Vegetation Index, NDVI), transformed images, textural or contextual information, multitemporal images, multi-sensor images and ancillary data are available for land cover mapping (Lu and Weng, 2007).
For example, Lucas et al. (2011) used NDVI, band ratios, differences and products, relative difference to neighbours, seasonal difference images and other indices for their habitat map of Wales. In seabed mapping, the most widely used feature is acoustic backscatter strength . As backscatter data are typically collected at one specific frequency (a single band), several useful features successfully applied in terrestrial remote sensing, such as spectral signatures, band ratios, differences, products, and other indices, cannot be applied to image-based seabed mapping. If backscatter data are collected with MBES, then coregistered bathymetric data are available, too. Such data can be used to develop DEMs; a feature that is sometimes used as ancillary data in terrestrial remote sensing.
Because primary features are limited to backscatter and bathymetry, it is common practice to calculate secondary features or derivatives. Features derived from bathymetry can be grouped into slope, orientation, curvature/relative position, and terrain variability (Wilson et al., 2007). Several derivatives of backscatter have also been calculated and employed in seabed mapping studies, including standard deviation, roughness, texture (e.g. grey-level co-occurrence matrices; Haralick et al., 1973), spatial auto-correlation (Moran's I; Moran, 1950), Hue-Saturation-Intensity (Daily, 1983), Q-values (Preston, 2009), and features extracted from ARA (Fonseca et al., 2009). Table 1 is an attempt to summarize the secondary features employed in the peer-reviewed studies that have explicitly mentioned types of and details on features. From this, it is apparent that the most widely used secondary features are slope, bathymetric position index (BPI), rugosity and some form of curvature. Derivatives of backscatter are less frequently calculated and employed. It is also apparent that derivatives are frequently calculated from a local neighbourhood, i.e. a 3 × 3 kernel. This might be mainly related to the fact that calculating derivatives from bordering cells is typically the default setting in the applied software packages, rather than a matter of scientific reasoning and judgement. It has been argued that a multiscale terrain analysis approach is better suited to capture seabed features at different spatial scales (Wilson et al., 2007) and to account for fuzziness of landscape morphometry (Fisher et al., 2004), but from the above it appears that the possibilities of such an approach are rarely explored.
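A minimal sketch of how two of these derivatives, slope and BPI, might be computed from a bathymetry grid; the synthetic dipping-plane bathymetry, 10 m cell size and 9 × 9 BPI neighbourhood are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import uniform_filter

# Synthetic bathymetry (depths in m, negative down) on a 100 x 100 grid
# with 10 m cells: a plane dipping 1 m per cell towards the east.
cols = np.tile(np.arange(100, dtype=float), (100, 1))
bathy = -50.0 - 0.1 * cols * 10.0

cell = 10.0  # cell size in metres
dzdy, dzdx = np.gradient(bathy, cell)
slope_deg = np.degrees(np.arctan(np.hypot(dzdx, dzdy)))

# Bathymetric position index: depth minus mean depth of a neighbourhood.
# Positive values indicate crests, negative values depressions, and
# values near zero flat areas or constant slopes.
bpi = bathy - uniform_filter(bathy, size=9)
```

On a uniform slope the BPI is zero away from the edges, which illustrates why BPI and slope capture complementary aspects of terrain.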
It follows from the above that the possibilities to extract secondary features from the two primary acoustic data sets are virtually endless: Table 1, which is non-exhaustive, already contains more than thirty derivatives that could be calculated at various spatial scales. The choice of features might impact classification performance and accuracy, as using many features in the hope of capturing as much variability in the data as possible is not only computationally expensive, but may also have a negative effect on classification accuracy. It is also well known that several derivatives calculated from the same primary data are often highly correlated, e.g. slope and measures of terrain complexity or ruggedness (Sappington et al., 2007). It is therefore essential to reduce the number of features (dimensionality) and remove correlated features prior to any classification.
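The dimensionality-reduction step argued for here can be sketched with PCA on a synthetic feature stack; the feature construction and the 95% variance threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic feature stack: 500 pixels x 6 features, where two pairs of
# features are noisy copies of the same underlying signal (and hence
# highly correlated, as is typical for terrain derivatives).
rng = np.random.default_rng(0)
signal = rng.normal(size=(500, 2))
features = np.hstack([signal,
                      signal + rng.normal(0, 0.1, (500, 2)),
                      rng.normal(size=(500, 2))])

# Standardize, then retain only as many principal components as are
# needed to explain 95% of the total variance.
scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=0.95).fit(scaled)
reduced = pca.transform(scaled)
```

Because the correlated pairs collapse onto shared components, fewer dimensions than the original six suffice, at the cost of components that are harder to interpret physically.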

Feature selection
In land cover mapping, features might be selected based on trial and error, previous performance or systematic studies. However, such approaches become increasingly ineffective when dealing with a large number of features, as in the case of hyperspectral data (e.g. Demarchi et al., 2014). High dimensionality of data, i.e. a large number of features, may be an obstacle to classification because the number of training samples is often too low to fill the multidimensional space created by these features, a problem known as the curse of dimensionality. Predictive power decreases as dimensionality increases for a fixed number of training samples (Hughes effect; Hughes, 1968). Dimensionality reduction methods aim to reduce redundancy in the data without losing information content. Two main types of dimensionality reduction methods can be distinguished: Feature transformation methods transform data from the original high-dimensional feature space to a new space of reduced dimensionality; the most common technique of this type is PCA. The main drawback of such methods is that the transformed features (e.g. principal components) are often difficult to interpret. Conversely, feature selection algorithms select a subset of features from the original feature set. The objective of feature selection is threefold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data (Guyon and Elisseeff, 2003). Feature selection methods can be categorized into filter-based, wrapper, and embedded techniques. Filter-based methods assess the performance of features independently of the predictor. Filters are generally applied as a pre-processing step before training a classifier.
They are usually a fast and generic way of ranking features but do not consider feature interaction (Friedman, 2013). Removing seemingly redundant features based on individual measures of relevance can potentially lead to loss of information due to synergistic interactions between predictor variables (Guyon and Elisseeff, 2003). Wrappers (Kohavi and John, 1997) are linked to a certain classifier and identify relevant features by performing multiple runs of the classifier testing the performance of different subsets of input features. They can be distinguished into forward selection and backward elimination. In the former case, variables are progressively incorporated into larger subsets, whereas in the latter case one starts with the set of all variables and progressively eliminates the least promising ones (Guyon and Elisseeff, 2003). Wrapper methods are generally seen as computationally expensive. Embedded methods perform feature selection in the process of training and are specific to given classifiers (Friedman, 2013).
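A hedged sketch of a wrapper in the forward-selection sense described above, using scikit-learn's SequentialFeatureSelector around a random forest on synthetic data (all parameter choices are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

# Synthetic stand-in for a seabed feature stack: 300 samples with
# 10 features, of which only 3 are informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=2, random_state=1)

# Forward selection wrapped around a random forest: features are added
# one at a time, keeping whichever subset cross-validates best.
clf = RandomForestClassifier(n_estimators=50, random_state=1)
sfs = SequentialFeatureSelector(clf, n_features_to_select=3,
                                direction="forward", cv=3).fit(X, y)
selected = np.flatnonzero(sfs.get_support())
```

The repeated classifier runs make this approach computationally expensive, consistent with the general view of wrapper methods noted above.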
Removal of correlated features can be achieved via a correlation analysis, whereby only the feature with the highest importance (e.g. established with the Boruta algorithm; Kursa and Rudnicki, 2010) among a set of correlated features is retained. The Marine Geospatial Ecology Tools (MGET; Roberts et al., 2010) provide a means to create a scatterplot matrix plotting predictor features against each other, allowing the user to assess how strongly they are correlated.
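A correlation-based pruning step of the kind described can be sketched as follows; the feature names, the constructed correlation and the 0.9 threshold are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical feature table; 'slope' and 'ruggedness' are constructed
# to be strongly correlated, mimicking derivatives of the same primary data.
rng = np.random.default_rng(3)
slope = rng.uniform(0, 30, 200)
df = pd.DataFrame({
    "slope": slope,
    "ruggedness": slope * 0.9 + rng.normal(0, 0.5, 200),
    "backscatter": rng.normal(-25, 3, 200),
})

# For each pair whose absolute Pearson correlation exceeds the threshold,
# drop one member (here simply the later column of the pair).
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
pruned = df.drop(columns=to_drop)
```

In practice, which member of a correlated pair is retained would be decided by a feature-importance measure rather than column order.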

Classification
Classification approaches can be grouped in several ways, e.g. Lu and Weng (2007) distinguish between (i) supervised or unsupervised, (ii) parametric or non-parametric, (iii) hard or soft, and (iv) per-pixel, subpixel, or per-field (including image object) classification. However, the above-mentioned characteristics lead to substantial conceptual overlap; e.g. the same supervised classification might be carried out on pixels and image objects. In the hope of providing more clarity, we therefore distinguish between the unit of analysis and the classification method, similar to Tewkesbury et al. (2015), who synthesized image change detection techniques. The unit of analysis might be an image pixel, an image object (a group of contiguous pixels with similar spectral characteristics) or a vector polygon (e.g. a land parcel from a land register database). As the latter is irrelevant to marine applications, it will not be reviewed here. Classification methods can be grouped into supervised and unsupervised methods. Supervised classification assumes the presence of a priori knowledge of all classes to be mapped.

Unit of analysis
Traditionally, land cover mapping is carried out as a per-pixel approach, whereby classifiers develop a signature by combining the spectra of training-set pixels for a given feature. However, heterogeneity in complex landscapes results in high spectral variation within the same land cover class (Lu and Weng, 2007). This problem has become even more prevalent with the availability of ever higher-resolution images (Blaschke et al., 2014). Under such circumstances, pixel-based approaches often lead to noisy results, as each pixel is grouped into a certain category (Lu and Weng, 2007). Further limitations of pixel-based approaches are that pixels are not true geographical objects, that pixel topology is limited and that texture, context and shape are often neglected (Hay and Castilla, 2006). Burnett and Blaschke (2003) also conclude that the pixel-based approach is usually uni-scale in methodology while landscapes and their land cover exist across several spatial scales. Due to the limitations of pixel-based image analysis, a new field called GEOBIA (Hay and Castilla, 2008) emerged in the early 2000s. Some authors argue that this shift towards object-based approaches has been so profound that GEOBIA might be considered as a new paradigm (Blaschke et al., 2014; Hay and Castilla, 2008).
GEOBIA has been defined as a sub-discipline of geographic information science devoted to developing automated methods to partition remote-sensing imagery into meaningful image objects, and assessing their characteristics through spatial, spectral and temporal scale (Hay and Castilla, 2008). GEOBIA is a special case of OBIA applied to remote-sensing data. The term GEOBIA distinguishes it from other OBIA applications in computer vision, material sciences and biomedical imaging (Blaschke, 2010; Blaschke et al., 2014; Hay and Castilla, 2008).
GEOBIA (as well as OBIA) is a two-step approach consisting of segmentation and classification. Image segmentation has its roots in industrial image processing and was not extensively used in geospatial applications before the 2000s (Blaschke, 2010). Numerous algorithms for image segmentation exist today and are generally divided into four groups: point-based, edge-based, region-based, and combined (Schiewe, 2002). For a review of image segmentation methods see Pal and Pal (1993). Segmentation is the process of complete partitioning of a scene (e.g. remote-sensing image) into non-overlapping regions in scene space (Schiewe, 2002) on the basis of homogeneity (Blaschke et al., 2014). The non-overlapping regions are also called segments or image object primitives, which are different from objects of interest. The latter match real-world objects (e.g. buildings or agricultural land parcels) and object primitives are usually the necessary intermediate step to derive objects of interest (Benz et al., 2004). In image segmentation, regions of minimum heterogeneity given certain constraints have to be found. Criteria for heterogeneity, definition of constraints and the strategy for the sequence of aggregation will determine the final results (Benz et al., 2004).
The creation of image objects increases the number of features that might be used for the subsequent step of image classification. Additional features include image object statistics, texture, shape, topology, and semantics (Benz et al., 2004). In this way, GEOBIA offers possibilities for situations where spectral properties are not unique, but where e.g. shape or neighbourhood relations are distinct (Blaschke et al., 2014). For example, a lake and a river might have very similar spectral properties and it will be difficult to distinguish these two based on spectral properties alone. Provided the segmentation has resulted in image objects that closely resemble the real-world objects lake and river, it should be relatively straightforward to differentiate the two based on features such as object shape, e.g. length-to-width ratio or asymmetry.
GEOBIA is now widely used in terrestrial remote-sensing. Blaschke et al. (2014) estimate that more than 600 peer-reviewed journal articles have been published until September 2013. Refer to Blaschke (2010) for various application examples in terrestrial remote sensing. Conversely, published seabed mapping literature applying an object-based approach is far more limited. Studies can be sub-divided by sensor type: Several studies present results from mapping shallow subtidal habitats such as coral reefs using optical sensors (Knudby et al., 2011;Phinn et al., 2012;Roelfsema et al., 2013;Zhang et al., 2013). Studies applying object-based approaches to acoustic remote-sensing data include Che Hasan et al.  (2011) and Lucieer et al. (2013). A recent study has also applied an object-based approach to seabed stills imagery with the aim to extract metrics of seabed heterogeneity (Lacharité et al., 2015).

Classification method
In unsupervised classification, classes to be mapped are not predetermined. Instead, one attempts to find regularities in unclassified data. In remote-sensing applications, an image is classified based on natural groupings of the spectral properties of the pixels or objects. Typical unsupervised procedures are clustering techniques, e.g. ISODATA (Ball and Hall, 1965) and k-means clustering. Unsupervised classification has frequently been used when mapping land cover over large areas for which a sufficient number of ground samples does not exist (Cihlar, 2000), while the land cover classes are reasonably straightforward (Franklin and Wulder, 2002). Seabed mapping studies that employed unsupervised classification have been presented by several authors (Lathrop et al., 2006; Brown and Collier, 2008; Blondel and Gomez Sichi, 2009; McGonigle et al., 2009; Brown et al., 2012). One common problem associated with unsupervised classification is the determination of the "correct" or "optimum" number of clusters, for which a large number of criteria exist; for example, Milligan and Cooper (1985) tested 30 criteria for their ability to predict the "correct" number of clusters in a data set. Another potential drawback is that the resultant classification rarely shows a one-to-one relationship with classes derived from ground-truth data (Brown et al., 2012). If the aim is to map habitats according to a classification scheme, then cross-tabulation and aggregation need to be carried out so that the classified acoustic data reflect the classes found in the ground-truth data.
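A minimal sketch of unsupervised classification with k-means on synthetic acoustic features; the cluster centres, and the assumption that k = 2 is known, are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Pixels as rows, acoustic features as columns: synthetic two-cluster
# data standing in for e.g. backscatter (dB) and a roughness derivative.
rng = np.random.default_rng(7)
mud = rng.normal([-30.0, 2.0], 0.5, (100, 2))    # soft, low backscatter
rock = rng.normal([-12.0, 15.0], 0.5, (100, 2))  # hard, rough seabed
pixels = np.vstack([mud, rock])

# Unsupervised: k must be chosen by the analyst (the "optimum number of
# clusters" problem discussed above); here k = 2 is assumed known.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pixels)
```

The cluster labels are arbitrary integers; relating them to substrate or habitat classes still requires cross-tabulation against ground-truth data, as noted above.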
Classification in GEOBIA is typically performed by supervised classification, although it is also possible to carry out unsupervised classification, e.g. k-means clustering, on objects and their properties. Supervised classification requires class-describing information that must be as accurate, representative and complete as possible. Such information on the classes' characteristics might be acquired by training classifiers (e.g. maximum likelihood, k-nearest neighbour, classification tree, random forest, support vector machine and Bayesian network) with ground samples. Alternatively, the user might describe the class properties based on prior knowledge or knowledge acquired by data exploration. An accurate, representative and complete class description is in most cases effectively impossible. Hence, it can only be a general estimation of the desired class properties (Definiens Imaging, 2004).
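A hedged sketch of supervised classification with one of the classifiers named above (a random forest), trained on synthetic ground-truth samples; the class names and feature values are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical training set: ground-truth samples with class labels
# (0 = sand, 1 = rock) and two predictor features per sample.
rng = np.random.default_rng(5)
sand = rng.normal([-28.0, 1.0], 1.0, (80, 2))
rock = rng.normal([-14.0, 12.0], 1.0, (80, 2))
X = np.vstack([sand, rock])
y = np.array([0] * 80 + [1] * 80)

# Withhold a stratified subsample for later accuracy assessment.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Train the classifier on the ground samples; the fitted model can then
# predict a class for every pixel or image object.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

The same fitted model would be applied to the full feature stack to produce the categorical map.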

Post-classification enhancements
Classifications based on pixels typically lead to "noisy" classification results, often referred to as the "salt and pepper" effect. A majority filter might be applied to generalize the resulting map in such an instance (Lu and Weng, 2007). Further generalization procedures might include the removal of habitat polygons below a previously specified minimum mapping unit by merging them into neighbouring polygons (e.g. Costa and Battista, 2013). Ancillary and contextual data might be applied to modify a classification based on established expert rules. Such knowledge-based enhancements resolve spectral confusion, which occurs when different surface types have similar reflectance properties, and increase the accuracy of the map. The knowledge base might relate to urban context, coastal proximity, soil type, and terrain. For example, littoral sediment might be confused with arable land cover. The distance to the coastline might help resolve such a confusion (Morton et al., 2011). Previous research has indicated the importance of post-classification enhancements in improving the quality of the final map (Harris and Ventura, 1995; Murai and Omatu, 1997).
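The majority filter described above can be sketched as a moving-window operation over a classified raster; the class codes and window size are illustrative:

```python
import numpy as np
from scipy.ndimage import generic_filter

def majority(window):
    """Return the most frequent class label in the moving window."""
    values, counts = np.unique(window, return_counts=True)
    return values[np.argmax(counts)]

# Classified raster (integer class codes) with one isolated
# "salt and pepper" pixel inside a homogeneous patch.
classified = np.zeros((20, 20), dtype=int)
classified[5, 5] = 1    # lone misclassified pixel
classified[:, 10:] = 2  # a genuine second-class region

# The 3 x 3 majority filter removes the lone pixel but preserves the
# boundary between the two genuine class regions.
smoothed = generic_filter(classified, majority, size=3)
```

As with any generalization, larger windows remove more noise but risk erasing small, real habitat patches, which is why a minimum mapping unit is often specified explicitly instead.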

Evaluation of classification results
Traditionally, land cover maps were qualitatively evaluated based on expert knowledge, and estimating map quality was therefore very subjective. This process has been largely superseded by quantitative accuracy assessments to gauge the quality of a produced map. If, however, the aim is to evaluate the suitability of classification algorithms for a specific task, then other criteria such as computational resources, the stability of the algorithm and robustness to noise might be equally important (DeFries and Chan, 2000).

Accuracy assessment
In land cover mapping the term accuracy refers to how closely a map reflects the actual environment from which it was derived (Foody, 2002). Map accuracy is analysed to determine whether a predictive map is as close to reality as possible given the available resources, and to determine whether it is fit for purpose (Bennett et al., 2013). For example, accuracy assessments allow comparison between maps derived from different datasets and mapping processes (Foody, 2004;Rattray et al., 2009), and are regularly used to inform the confidence of land change maps through time (Olofsson et al., 2013).
In the land cover mapping research literature, accuracy assessments are typically performed by measuring a map's predictive performance against a reference dataset. A number of studies have identified potential improvements to the accuracy assessment process (Stehman, 1997;Foody, 2002;Pontius and Millones, 2011); however, these recommendations have been more slowly adopted in the seabed mapping literature. Due to the sampling methods commonly used by seabed mappers, there are also specific issues to consider when planning a study and during accuracy assessment.
Reference dataset to assess accuracy. Where the quality of a reference dataset is poor, the accuracy measure will only indicate the similarity of the map to the reference data, and not necessarily to reality (Foody, 2002, 2008). Therefore, selection of appropriate reference data should be carefully considered early in the project development. Ideally, the reference data should be sourced independently of the data used to construct the map (Foody, 2002). However, the costs associated with additional ground-truthing campaigns mean this is rarely an option, particularly in the marine field due to the difficulties of benthic sampling. Therefore, reference data are regularly made up of a subsample of the original dataset that has been withheld from map building.
Determining the correct survey design to develop unbiased and confident accuracy assessments is another commonly discussed issue for land cover and benthic mapping. Random sampling ensures an unbiased estimate of accuracy. However, sufficiently populating the confusion matrix may require a very large reference dataset, with a minimum of 50 sample sites per class and potentially several hundred samples in total, to accurately determine a map's accuracy through random sampling (Congalton, 1991; Carlotto, 2009). Therefore, due to the practical and budgetary constraints of benthic sampling, surveys often target areas of geographic interest or are designed to sample all anticipated habitats (Che Hasan et al., 2012a; Zavalas et al., 2014). Although a random sample of these data may be withheld as the reference dataset, it is important to keep in mind that this is not truly a random sample. Although classification models may be developed from non-randomly targeted samples, the use of these data for accuracy assessment may produce an unrealistic estimate of accuracy (Stehman and Czaplewski, 1998; Foody, 2002). Ultimately, this is a trade-off, and the spatial complexity of the habitat to be mapped may determine the appropriate sampling regime for a site to minimize bias while ensuring all seabed types are adequately sampled.
As the classification map is compared with a reference dataset, the accuracy assessment rests on the assumption that the reference data contain no error. This assumption is often optimistic, particularly in the marine environment. Rattray et al. (2014) found large variation in how the same still images of benthic habitat were classified by different trained interpreters, and by the same interpreters at separate times. The differences in interpretation were greater when images were classified into more class levels, with only 75% agreement observed for a six-class classification scheme. The differences in classification are therefore highly influenced by the classification scheme, but are likely to be interpreted as classification error in the final maps. The distinction between habitat classes in the marine environment is often unclear (Fraschetti et al., 2008), and therefore these types of errors may be more prevalent in benthic mapping than in terrestrial studies.
In determining the quality of a reference dataset, the precision of the testing data location must be considered. If ground truth samples are not accurately georeferenced they may affect the measured map accuracy, particularly in a heterogeneous environment or near habitat boundaries (Stehman and Czaplewski, 1998; Foody, 2002). These types of errors may be more prevalent in the marine realm due to difficulties in accurately positioning benthic samples, even with the use of ultra-short baseline positioning systems. Where reference data are not completely accurate, differences between the seabed map and the reference dataset may be incorrectly attributed to false classification, thereby reducing the measured accuracy (Verbyla and Hammond, 1995; Carlotto, 2009).
Accuracy reporting. There has been extensive discussion regarding the relative merit of a number of accuracy assessment measures within the remote sensing literature (Stehman, 1997; Foody, 2002; Liu et al., 2007; Pontius and Millones, 2011). However, all of the most widely recommended and utilized accuracy measures are derived from an error matrix (Foody, 2002; Liu et al., 2007), also known as a confusion matrix or contingency table. The error matrix is a simple cross-tabulation of predicted occurrence versus observed occurrence in the withheld reference dataset. As the error matrix allows all other measures of accuracy, including per-class accuracy, to be calculated, it is recommended to provide the matrix as a table alongside any other results or accuracy measures (Stehman, 1997; Foody, 2002; Olofsson et al., 2013). Providing additional measures of accuracy to summarize the error matrix is advisable; however, there is still no general consensus as to which measures to include. In addition, different measures may be more suitable depending on the objective of the map, so it is generally recommended to provide multiple measures of map accuracy (Stehman, 1997; Foody, 2002; Liu et al., 2007). It is also important to consider that habitats may change along gradational boundaries and therefore the difference between distinct habitat classes may be somewhat arbitrary (Foody, 2002). Therefore, when considering any accuracy measure it is also important to always consider the significance of the error rather than the value alone.
Table 2. Summary of pre-processing methods, extracted features and feature selection methods utilized in the reviewed benthic habitat mapping studies.

Table 3. Summary of classification methods, post-classification enhancements and accuracy assessments utilized in the reviewed benthic habitat mapping studies.

One of the simplest and most widely used performance metrics is overall accuracy: the percentage of correctly classified cells out of the total number of cells in the error matrix (Congalton, 1991). As this is an estimate of the overall accuracy of the map, it should be accompanied by confidence limits (Strahler et al., 2006). In addition to this overall metric, it is often more informative to present class-specific measures such as user's and producer's accuracy (Liu et al., 2007; Lyons et al., 2012). User's accuracy is the probability that a pixel on the map represents that category on the ground. Producer's accuracy is the probability that a reference pixel has been correctly classified. These measures are the complements of another pair of commonly reported measures, the commission error and omission error, respectively (Janssen and van der Wel, 1994). Despite their simplicity, overall accuracy and user's and producer's accuracy are generally more informative, and more widely recommended, than other regularly used metrics (Liu et al., 2007; Olofsson et al., 2013).
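All of these summary measures follow directly from the error matrix; a minimal sketch (function name and the two-class example matrix are illustrative) might look like:

```python
import numpy as np

def accuracy_metrics(cm):
    """Overall, user's and producer's accuracy from a square error matrix
    (rows = map/predicted classes, columns = reference classes)."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    overall = np.trace(cm) / total
    users = np.diag(cm) / cm.sum(axis=1)      # per map class; commission error = 1 - user's
    producers = np.diag(cm) / cm.sum(axis=0)  # per reference class; omission error = 1 - producer's
    # approximate 95% binomial confidence interval for overall accuracy
    half_width = 1.96 * np.sqrt(overall * (1 - overall) / total)
    return overall, users, producers, (overall - half_width, overall + half_width)

# illustrative two-class error matrix with 100 reference samples
overall, users, producers, ci = accuracy_metrics([[40, 5], [10, 45]])
```

Reporting the interval `ci` alongside `overall` follows the recommendation that overall accuracy be accompanied by confidence limits.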
Overall accuracy has been criticised as an overly optimistic estimate, as some cells can be correctly classified by chance. The kappa coefficient (Cohen, 1960) was therefore introduced as an accuracy metric that compensates for chance agreement inflating model accuracy. Despite its widespread use, it has regularly been criticised as an accuracy metric (Liu et al., 2007; Pontius and Millones, 2011), as comparing accuracy relative to randomness is of little interest and provides no indication of how to improve a classification. Furthermore, as a metric for comparison between different analyses it is commonly used incorrectly, as the maps have not been generated from independent samples (Foody, 2004). Pontius and Millones (2011) suggest two new measures in place of kappa, "quantity disagreement" and "allocation disagreement", which provide more relevant information to researchers by allowing them to examine the source of the map error. Despite these criticisms, and the availability of more informative metrics, kappa remains commonly used in seabed mapping (Rioja-Nieto et al., 2013; Zhang et al., 2013; Che Hasan et al., 2014). We are aware of only two studies in the marine literature (Savini et al., 2014; Calvert et al., 2015) that have used these new measures in place of kappa. If map accuracy is being assessed to compare different maps and classifications, or to identify where error is occurring and thereby improve classification, then perhaps it is time to move away from kappa in favour of more informative statistics.
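Both kappa and the Pontius and Millones (2011) disagreement measures derive from the same error matrix; a hedged sketch, with an illustrative two-class example (quantity disagreement captures mismatched class proportions, allocation disagreement the remaining spatial mismatch):

```python
import numpy as np

def kappa_and_disagreement(cm):
    """Kappa (Cohen, 1960) plus quantity and allocation disagreement
    (Pontius and Millones, 2011) from an error matrix
    (rows = map classes, columns = reference classes)."""
    p = np.asarray(cm, dtype=float)
    p /= p.sum()                         # convert counts to proportions
    po = np.trace(p)                     # observed agreement
    pe = p.sum(axis=1) @ p.sum(axis=0)   # chance agreement
    kappa = (po - pe) / (1 - pe)
    # quantity disagreement: mismatch between map and reference class proportions
    quantity = 0.5 * np.abs(p.sum(axis=1) - p.sum(axis=0)).sum()
    # total disagreement (1 - po) splits into quantity + allocation
    allocation = (1 - po) - quantity
    return kappa, quantity, allocation

kappa, qd, ad = kappa_and_disagreement([[40, 5], [10, 45]])
```

Here the two disagreement components show directly whether map error stems from predicting the wrong class proportions (`qd`) or from placing the right proportions in the wrong locations (`ad`).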
Spatial representation. Although the accuracy measures discussed may allow for comparison between maps they represent the "global" average accuracy, and reveal nothing about the spatial distribution of error. In reality, error is rarely distributed evenly across the map (Steele et al., 1998;Kyriakidis and Dungan, 2001). For example, errors are typically concentrated near class boundaries or in areas of high terrain complexity (Steele et al., 1998;Foody, 2005). This is important, as map users may be particularly interested in certain regions within the map, and therefore the overall accuracy may be of little relevance (Comber et al., 2012). Furthermore, these types of analyses may influence how a map is utilised, for instance informing the design of additional surveys. A number of terrestrial studies have developed approaches to map the distribution of classification error across the study site. These methods include geostatistical approaches (Kyriakidis and Dungan, 2001), developing locally constrained confusion matrices (Foody, 2005), and analysing the spatial variation through geographically weighted regression (Comber et al., 2012). A discussion regarding the relative merit of these different approaches is beyond the scope of this review; however, they all provide an indication of the spatial variation in accuracy and assist in the interpretation of map products.
Few studies in the seabed mapping literature have attempted a spatial representation of accuracy. Ahsan et al. (2010) apply bootstrap aggregation to map the confidence of the classification map (Breiman, 1996; Steele et al., 2003; Saatchi et al., 2007). Diesing and Stephens (2015) present a confidence map based on the degree of agreement between models in a multi-model ensemble approach. While not directly mapping accuracy, these methods produce a map of the relative spatial uncertainty of the prediction and can be highly informative. In many seabed mapping scenarios it may therefore be beneficial to provide similar figures in combination with other accuracy statistics.
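An agreement-based confidence map of this kind can be sketched as follows; this is a simplified illustration of the general idea, not the cited authors' implementations (the function name and toy prediction stack are hypothetical):

```python
import numpy as np

def ensemble_confidence(predictions):
    """Per-pixel confidence as the fraction of ensemble members agreeing
    with the majority class. `predictions` is an array of shape
    (n_models, rows, cols) holding integer class labels."""
    preds = np.asarray(predictions)
    n_models = preds.shape[0]
    n_classes = preds.max() + 1
    # tally votes per class at each pixel
    votes = np.stack([(preds == c).sum(axis=0) for c in range(n_classes)])
    majority = votes.argmax(axis=0)             # consensus class map
    confidence = votes.max(axis=0) / n_models   # agreement fraction in (0, 1]
    return majority, confidence

# toy example: 3 models classifying a 1 x 2 map
preds = np.array([[[0, 1]], [[0, 1]], [[1, 1]]])
majority, confidence = ensemble_confidence(preds)
```

Low-confidence pixels (those with split votes) highlight where additional ground truthing would be most valuable.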

Review of marine habitat mapping studies
The analysed studies and the results of the review are summarized in Tables 2 and 3. Only those studies that employed multibeam echosounder data and carried out a form of supervised or unsupervised classification to spatially predict categorical information on seabed type (sediment, substrate, habitat, biotope etc.) were included. All studies employed multibeam backscatter as a primary feature; however, two studies (Che Hasan et al., 2012a,b) did not utilize multibeam bathymetry.
Pre-processing beyond the standard radiometric and geometric corrections is not common practice in seabed mapping. Only two studies (including Diesing and Stephens, 2015) employed a form of speckle removal; a Gaussian filter with a 5 × 5 kernel was used in these cases. One study utilized a 2D Fourier filter to reduce prominent track-parallel intensity variations.
Secondary (derived) features of bathymetry are frequently employed, with seabed slope most often used (80% of studies), followed by terrain variability such as rugosity, roughness, and complexity (70%), curvature and relative position such as BPI (65%), and orientation expressed as northness and eastness (60%). About one third of the studies (35%) calculated bathymetric derivatives at more than one spatial scale. Backscatter derivatives were less frequently calculated. A quarter of the studies utilized neighbourhood statistics and one fifth employed derived HSI layers. Less frequently used were spatial auto-correlation, texture, ARA, and Q-values (all 15% of studies). Again, in the majority of cases these derivatives were calculated at a local scale (3 × 3 kernel). One study (Rattray et al., 2015) utilized an oceanographic variable (annual maximum wave orbital velocity at the seabed).
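Two of the most widely used local-scale (3 × 3) bathymetric derivatives, slope and terrain variability, can be sketched as follows. This is an illustration using SciPy; the rugosity proxy here is simply the local standard deviation of depth, one of several definitions in use:

```python
import numpy as np
from scipy.ndimage import generic_filter

def slope_and_rugosity(bathy, cell_size=1.0):
    """Slope (degrees) and a simple rugosity proxy (local standard
    deviation of depth) from a bathymetry grid, both evaluated over
    a 3 x 3 neighbourhood."""
    # central-difference depth gradients in grid units
    dz_dy, dz_dx = np.gradient(bathy, cell_size)
    slope = np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))
    rugosity = generic_filter(bathy, np.std, size=3)
    return slope, rugosity
```

Computing the same derivatives over larger neighbourhoods (e.g. 9 × 9, 21 × 21) yields the multiscale feature stacks used by about a third of the reviewed studies.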
Thirty percent of the studies employed a form of data transformation, e.g. PCA (Costa and Battista, 2013; Calvert et al., 2015), or feature selection, such as the Boruta algorithm (Diesing and Stephens, 2015), Kruskal-Wallis tests (Calvert et al., 2015), random forest variable importance (Che Hasan et al., 2014) and removal of features based on the variance inflation factor (Rattray et al., 2015). The remaining studies only provided a verbal justification for the selected features or no justification at all.
A wide variety of classification methods has been applied. The most frequently used were classification trees, such as QUEST (Quick Unbiased Efficient Statistical Tree, Loh and Shih, 1997), CRUISE (Classification Rule with Unbiased Interaction Selection and Estimation, Kim and Loh, 2001), and rpart (Therneau and Atkinson, 1997), utilized in 22% of the reviewed cases (8 out of 35 applied classifiers). Random forest and maximum likelihood classifier were employed in 17% of cases each, followed by rule-based classifications (11%). Less frequently used methods included unsupervised clustering (8%), k-nearest neighbour (6%), support vector machines (6%), QTC (6%), Bayesian decision rules (3%), neural networks (3%), and a classifier ensemble (3%). No justification for the choice of the applied classifier is given in more than half of the cases (51%). If a justification is given, past performance (37%) is more frequently cited than principal considerations (11%). Classifications are most often carried out on pixels as the unit of analysis (71%), while image objects were utilized in 31% of the cases. Note that one study (Che Hasan et al., 2014) utilized both image pixels and objects, hence the percentages do not sum to 100%. Supervised classification is the predominant approach, applied in 29 out of 35 cases (83%), as compared with 6 cases that employed unsupervised classification methods.
Post-classification enhancements have rarely been applied in the reviewed cases. Two studies (Costa and Battista, 2013;Diesing et al., 2014) carried out manual edits of the automated classifications. One study (Costa and Battista, 2013) generalized the automated classification based on a pre-defined minimum mapping unit. No study was found that applied knowledge-based rules as a scheme of post-classification enhancements.
More than three quarters (77%) of the cases had a form of accuracy assessment based on an "independent" reference dataset, i.e. one not used for model training, associated with them. The remaining studies had either no accuracy assessment carried out or did not withhold samples for an "independent" test. Occasionally, it was also difficult to judge whether samples were withheld or not. The most frequently reported metrics were overall accuracy (86%) and the kappa coefficient (71%), followed by user's and producer's accuracy (each 43%) and the balanced error rate (20%). Newly proposed statistics, such as total agreement, quantity disagreement and allocation disagreement (Pontius and Millones, 2011), were employed in one study (Calvert et al., 2015), as was Cramer's V (Cramer, 1946). In about two thirds (63%) of cases, the error matrix associated with a classification was not published.

Discussion
We have reviewed the current practice in terrestrial remote-sensing with a view to inform and improve standards for seabed mapping. The rationale was that land cover mapping, while similar to image-based seabed mapping in many respects, has a significantly longer history and larger user community and can therefore inform the development of best practices in seabed mapping. Whilst both disciplines have many similarities, significant differences with regard to data availability, temporal and spectral resolution exist between satellite-based remote sensing and acoustic datasets. Whereas the terrestrial land surface is almost continuously mapped and monitored through a wide array of satellite-based sensors at varying spatial and spectral resolutions, large parts of the seafloor remain unsurveyed at spatial resolutions similar to those achieved with satellite sensors. It has been estimated that only 5-10% of the seabed is mapped with a resolution comparable to that on land (Wright and Heyman, 2008) and that >50% of the ocean floor is more than 10 km away from a depth sounding (Sandwell et al., 2014). Global maps of bathymetry estimated from marine gravity measurements are available (Smith and Sandwell, 1997; Sandwell et al., 2014), but lack the resolution that is required for detailed benthic habitat mapping. Backscatter data, arguably the most important feature for seabed mapping, are even more restricted in availability. To complicate matters further, there exist no standards for the collection and processing of backscatter to date. This means that backscatter data collected during different surveys are, in practice, not comparable. Another significant drawback is the lack of spectral resolution of backscatter data: with only one band available, we lack the ability to define and exploit spectral signatures, band ratios and indices (such as the NDVI in land cover mapping).
The potential for improved seabed classification using an array of three multibeam echosounders with frequencies spaced about an octave apart has recently been demonstrated (Hughes-Clarke, 2015); however, no commercial multifrequency multibeam echosounders for seabed mapping exist to date. Clearly, progress on that front would benefit the seabed mapping community. Conversely, spatial resolutions achieved with sonar systems are comparable to high- and very high-resolution satellite data products in the case of continental shelf applications or when multibeam echosounders are brought close to the seabed, e.g. mounted on remotely operated vehicles or autonomous underwater vehicles. Although the limitations outlined above present real barriers to progress in seabed mapping, these differences do not mean that improvements in current image-based seabed mapping practice cannot be made by adapting common land cover mapping practice. We will therefore discuss in which areas image-based seabed mapping practice can be improved and suggest how this could be achieved.
Despite the fact that backscatter data are typically noisy, there is little evidence for the application of smoothing filters or speckle removal in the reviewed literature. Furthermore, track-parallel variations in backscatter intensity are commonplace, although less so in the case of multibeam echosounder as compared with sidescan sonar data; yet there is only one instance of the use of a 2D Fourier filter. The reluctance to use certain data pre-processing methods might stem from the fact that it is often difficult to establish how much filtering is appropriate, as such filters not only suppress noise but also affect the signal. Speckle, for example, is difficult to distinguish from real signals at the limit of the resolution of the sonar (Blondel, 2009). Studies that systematically investigate the benefits and limitations of various filtering methods across a range of settings might therefore be desirable.
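As an illustration of the filtering trade-off discussed above, a hedged sketch of two common despeckling filters (the function name, default kernel size and sigma are illustrative choices, not recommendations from the reviewed studies):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, median_filter

def despeckle(backscatter, method="median", size=5, sigma=1.0):
    """Suppress speckle in a backscatter mosaic. The kernel size / sigma
    trades noise suppression against blurring of genuine seabed features,
    so any choice should be checked against known targets."""
    if method == "median":
        # robust to isolated spikes; preserves edges better than smoothing
        return median_filter(backscatter, size=size)
    # linear smoothing, comparable in effect to a small Gaussian kernel
    return gaussian_filter(backscatter, sigma=sigma)
```

A median filter removes an isolated high-intensity spike entirely, whereas a Gaussian filter only spreads it out; which behaviour is preferable depends on whether such spikes are noise or real small targets.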
The extracted features, apart from the primary features, utilized in the classification process are typically derived from bathymetry and to a lesser extent backscatter. In only one instance (Rattray et al., 2015) has an oceanographic variable been used. The influence of environmental variables, such as hydrodynamics, temperature, salinity, light, and productivity, on habitats is widely acknowledged (e.g. Kostylev and Hannah, 2007); however, such datasets are seldom available at spatial scales comparable to multibeam echosounder data and the variables might also exhibit high temporal variability, which might explain why they are rarely used. Furthermore, the choice of the model, the hindcast period, the model resolution and the extracted statistics are likely to influence prediction performance differently, but our understanding of these effects is very limited. Systematic studies that investigate the influence of such factors on prediction performance and accuracy would therefore be an important step towards a better choice of oceanographic features.
Secondary features derived from bathymetry and backscatter are most frequently calculated at a local scale (3 × 3 kernel). The advantages of multiscale terrain analysis have been highlighted in the past (Wilson et al., 2007). It is also well known that it is not possible to select a single fixed scale (image resolution and neighbourhood size) that will perfectly capture all landscape elements of interest (e.g. MacMillan and Shary, 2009). The importance of spatial scale and geographic context in benthic habitat mapping has been reviewed elsewhere (Lecours et al., 2015) and is not repeated here.
It is frequently the case that a suite of features is utilized with little or no consideration of their relevance for the spatial predictions made. At the same time, the number of training samples is generally low, as it is often costly and logistically difficult to obtain seabed samples. This means that a formal process of either data transformation or feature selection is advisable. Yet, the literature review presented here demonstrated that this was rarely the case and no form of either data transformation or feature selection was applied in studies prior to 2013. We strongly argue in favour of a formal step of dimensionality reduction as part of the standard image-based seabed mapping workflow. A main consideration for the choice of method will be whether interpretability of predictor features is desirable and necessary. If this is not the case, then well-known and implemented methods of data transformation such as PCA might be appropriate. Otherwise, there is a choice between different classes of feature selection methods. It appears that filter-based methods, although easy to implement, computationally fast and independent of the selected predictor, run the risk of potentially losing information due to synergistic interactions between predictor variables (Guyon and Elisseeff, 2003). Their independence of predictors might make filter-based methods more suitable for comparative studies with the aim of investigating the performance of various classifiers. Otherwise, wrappers, such as the Boruta algorithm used in conjunction with the popular random forest classifier (Breiman, 2001), and embedded methods might be preferable. Feature selection should be followed by the removal of correlated features, which can be identified with a scatterplot matrix as provided by the MGET tool (Roberts et al., 2010).
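A minimal sketch of the correlated-feature removal step (the greedy keep-first rule, the 0.7 threshold, and the feature names are illustrative assumptions, not a prescription from the cited studies):

```python
import numpy as np

def drop_correlated(X, names, threshold=0.7):
    """Greedy removal of highly correlated predictors: a feature is kept
    only if its absolute Pearson correlation with every already-kept
    feature is below the threshold. X is (n_samples, n_features)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return X[:, keep], [names[j] for j in keep]

# toy feature table: the second column is an exact multiple of the first
X = np.array([[1., 2., 1.], [2., 4., -1.], [3., 6., 1.], [4., 8., -1.]])
Xr, kept = drop_correlated(X, ["slope", "slope_x2", "eastness"])
```

The reduced feature set `Xr` then forms the input to the classifier, lowering the dimensionality of the feature space that the limited training samples must populate.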
Several principal arguments have been made in favour of image objects as the unit of analysis, especially in the case of high to very-high resolution data (Burnett and Blaschke, 2003; Castilla, 2006, 2008; Blaschke et al., 2014). GEOBIA has been successfully trialled on marine acoustic datasets as well (Lucieer, 2008; Lucieer and Lamarche, 2011; Che Hasan et al., 2012b; Lucieer et al., 2013; Diesing et al., 2014; Hill et al., 2014). However, no marine study exists that has systematically compared pixels and image objects as units of analysis whilst utilizing the same input datasets and classifiers. The assumption that image objects yield better classification results must therefore be seen as plausible but unproven until the effect of the unit of analysis on classification performance has been assessed in a structured way.
The application of a certain classification scheme is often mandatory in seabed mapping, which might explain the preference for supervised classification approaches observed in the reviewed studies. However, the classes of frequently used classification schemes were often derived without consideration of what can be mapped acoustically. The relationships between the class of interest and acoustic response are often not well understood and show large class overlap (see boxplots in Lucieer et al., 2013; Diesing et al., 2014; Neves et al., 2014; Stephens and Diesing, 2014; Calvert et al., 2015). The most likely causes are limited spectral resolution, as discussed above, and insufficient class descriptions associated with shortcomings in sampling design (Clements et al., 2010) and classification of samples. The limitations in spectral resolution might be alleviated by the application of ARA (Fonseca et al., 2009). Che Hasan et al. (2014) have demonstrated how backscatter angular response and multibeam bathymetry data might be integrated for benthic habitat mapping. Unsupervised classification has frequently been used when mapping land cover over large areas for which ground samples are not available in sufficient numbers (Cihlar, 2000). Unsupervised classification for seabed mapping might therefore be an alternative when the number of samples is too small to make supervised classification viable. One important advantage of unsupervised classification is that concerns about the location and representativeness of the sample data are much reduced, because clusters are homogeneous by definition (Cihlar, 2000). It might therefore be advisable to further explore the benefits of unsupervised classification in seabed mapping. A major problem with this approach is the effect of user-specified parameters, such as the number of clusters and the allowable dispersion around a cluster mean, on the classification results. This limitation might be circumvented by "hyperclustering", i.e. producing a large number of clusters, typically 100-400, and subsequently reducing their number through merging steps based on statistical measures or iteratively by the analyst (Cihlar, 2000). Unsupervised classification of remote sensing data might also be utilized to stratify the collection of ground samples, which could then in turn be utilized in a supervised classification. Such an approach could help increase the spectral separability of classes of interest (Franklin and Wulder, 2002).
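The hyperclustering idea can be sketched as follows; the cluster counts and the Ward merging criterion are illustrative assumptions (Cihlar (2000) describes statistical or analyst-driven merging, not this specific implementation):

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.cluster.hierarchy import fcluster, linkage

def hypercluster(features, n_hyper=100, n_final=8, seed=0):
    """'Hyperclustering': generate many k-means clusters first, then
    merge their centroids hierarchically (Ward linkage here) down to
    the desired number of map classes."""
    # over-segment the feature space into many small, homogeneous clusters
    centroids, labels = kmeans2(features, n_hyper, minit="++", seed=seed)
    # merge centroids until only n_final classes remain
    tree = linkage(centroids, method="ward")
    merged = fcluster(tree, n_final, criterion="maxclust")
    return merged[labels] - 1   # final class label for each input cell
```

Because the initial clusters are deliberately over-segmented, the sensitivity of the final map to the initial choice of cluster number is much reduced; only the merging step needs tuning or analyst review.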
There is little evidence for the use of post-classification enhancement methods in the reviewed literature on image-based seabed mapping, probably because this science discipline is still maturing and has not yet become fully operational. The current focus seems to be placed on the performance of various classifiers, with an emphasis on providing objective results. Hence, any post-classification alterations based on expert knowledge, be it by manual editing or in a more formalized rule-based way, are avoided. Once seabed mapping moves towards being fully operational, it might be expected that knowledge-based enhancements will be implemented more frequently. This will, however, also require a growing knowledge base, including geodatabases containing relevant contextual information.
Although it is probably impossible to specify a single, all-purpose measure of classification accuracy (Foody, 2002), it should nevertheless be feasible to draw up minimum requirements based on recommendations made in the literature. These include: (i) the publication of two or more measures of accuracy (Stehman, 1997; Foody, 2002), including overall and class-specific metrics (Strahler et al., 2006), (ii) publication of associated confidence limits (Stehman, 1997; Strahler et al., 2006), and (iii) the provision of the error matrix (Stehman, 1997; Foody, 2002). More research into ways of visualizing spatial variability of accuracy is also encouraged, as such depictions are expected to be of great use for managers and decision makers.
This review has focussed on the processes and workflow elements of an image-based mapping approach. It can be expected that more can be learned from terrestrial remote-sensing and image classification practice that is of relevance for seabed mapping in relation to marine management. This might include how a link between abiotic variables and species is established in terrestrial studies to develop biotope maps at a scale that is relevant to management. This is likely a field where transfer of experience to seabed mapping would be fruitful.

Summary
We have reviewed the current state of terrestrial land cover mapping and identified elements of a mapping workflow, which served as a baseline against which the current practice in image-based seabed mapping was assessed. This has led to the identification of knowledge gaps, which could be addressed through the design of specific comparative studies, and to recommendations for improvements, which should be implemented in mapping workflows. This would allow thematic seabed maps to be used more effectively in marine management and decision making.

Funding
This study was funded by the Cefas Research and Development Fund.