Fishing for data and sorting the catch: assessing the data quality, completeness and fitness for use of data in marine biogeographic databases

Being able to assess the quality and level of completeness of data has become indispensable in marine biodiversity research, especially when dealing with large databases that typically compile data from a variety of sources. Very few integrated databases offer quality flags on the level of the individual record, making it hard for users to easily extract the data that are fit for their specific purposes. This article describes the different steps that were developed to analyse the quality and completeness of the distribution records within the European and international Ocean Biogeographic Information Systems (EurOBIS and OBIS). Records are checked on data format, completeness and validity of information, quality and detail of the used taxonomy and geographic indications and whether or not the record is a putative outlier. The corresponding quality control (QC) flags will not only help users with their data selection, they will also help the data management team and the data custodians to identify possible gaps and errors in the submitted data, providing scope to improve data quality. The results of these quality control procedures are as of now available on both the EurOBIS and OBIS databases. Through the Biology portal of the European Marine Observation and Data Network (EMODnet Biology), a subset of EurOBIS records—passing a specific combination of these QC steps—is offered to the users. In the future, EMODnet Biology will offer a wide range of filter options through its portal, allowing users to make specific selections themselves. Through LifeWatch, users can already upload their own data and check them against a selection of the here described quality control procedures. Database URL: www.eurobis.org (www.iobis.org; www.emodnet-biology.eu/)


Introduction
Progress in information technology has resulted in an increasing flood of data and information. Efficiently mining this sea of data and determining the quality of the data and its fitness for use has become a major challenge of many disciplines. Evaluating and documenting the quality of data has already become a standard practice in several scientific disciplines over many years, e.g. in medicine (1)(2)(3)(4), remote sensing (5-7) and gene sequencing (8)(9)(10). It is however only in the last decade that its importance-in combination with the assessment of the fitness for use-has become evident for biological sciences, more specifically for biodiversity data and data related to species occurrences (11)(12)(13)(14)(15).
Biodiversity is inextricably linked with biogeography (16), which is clear from the many papers that contain both biodiversity and biogeography in their titles, abstracts and keywords (e.g. [17][18][19][20]. And both concepts are not only essential in research hypotheses, but also in the field of conservation, management (16,21,22) and modelling (23)(24)(25).
When looking at larger patterns-e.g. on a European or global scale-data are mostly aggregated from a variety of sources. For the marine environment, data on all living marine species from different regional data centres and nodes flow towards the international Ocean Biogeographic Information System (OBIS; www.iobis.org), making marine biogeographic data freely available online. A variety of data is captured, going from data collected during research and monitoring campaigns to data from museum collections or data derived from literature. Given this very diverse nature of data, there is a strong need to be able to assess the quality of these data and provide feedback to the data providers. In addition, a system to assess the completeness of the record needed to be developed, offering specific filters to the users to be able to e.g. only query species records where complete abundance information is available.
Assessing the quality of a distribution record has thus become indispensable, as has the ability to give an indication of the completeness of that record, especially in database infrastructures such as e.g. EurOBIS, OBIS and the Global Biodiversity Information Facility (GBIF; www.gbif.org) that provide access to data from a wide range of sources (e.g. 13,14). Several actions regarding quality control and data cleaning have already been undertaken on regional or group-specific databases such as SpeciesLink (http://splink. cria.org.br) for Brazilian data collections, Fauna Europaea (26) for European land and freshwater animal species, fish collection databases in relation to FishBase (27) and the Atlas of Living Australia (ALA, http://www.ala.org.au/). However, efforts on quality control and fitness for use for marine biogeographic data were not yet globally organized, as is now presented here for OBIS.
An indication of the completeness can help the user in evaluating whether a particular record is useful for their analysis or not. A distribution record without a timestamp can e.g. be used to get insights in the general distribution of a species but will not be useful for temporal analysis. This illustrates that distribution records, although they do not share the same level of completeness, can be used for a multitude of applications, depending on the user's needs.
Over the last year, quality control (QC) tools have been developed to be able to document both the quality and completeness of each distribution record within EurOBIS. After extensive testing these QC tools have been implemented in OBIS and extended with extra quality control procedures. This article will elaborate on these recently developed automated quality control procedures and their relevance. In addition, we will demonstrate the importance and usability of these procedures with some use cases. The main goal of these QC steps is to provide a measure of fitness for use of marine biogeographic data both for the scientists and data managers, by offering several tools that help assessing the completeness and validity of distribution records. For a general description of the structure and content of the EurOBIS and OBIS database, we respectively refer to (10,28,29).

Data systems
The quality control procedures were originally developed on EurOBIS, to add quality flags to the available data. Because these data are largely limited to European seasand a number of QC steps only make sense on a global level (e.g. outlier detection)-the exercise was repeated on the OBIS database, with addition of a number of steps related to outlier analyses.
The QC procedures on EurOBIS were developed in two different ways: (1) as an automated process, to be able to assess the quality and completeness of the records already available within the database and (2) as online web services that can be used by potential data providers and researchers to assess the quality and completeness of their own data prior to use or submission. The former allows data managers to provide feedback to data providers and to check whether they can make their data more complete and correct gaps and putative errors. In addition, the results of the QC steps can be used for specific filtering on the data. The latter return a result report, listing all records that do not comply with a certain QC step. Users can immediately adapt their data and rerun the QC procedures online before analysing or submitting the data to EurOBIS.
EurOBIS is one of the many regional nodes within OBIS and is committed to a continuous support of OBIS, translated in serving its distribution data to OBIS. As the QC procedures also run on OBIS, the results of this can provide a valuable feedback to the other involved nodes and will therefore improve the quality and completeness of the online available records. Both the data providers and the separate nodes would benefit from this. From OBIS, data are sent to the Global Biodiversity Information Facility (GBIF), which would thus imply that GBIF could also only offer marine data that comply with a certain quality standard.

Quality control procedures
The quality control procedures have been developed for two main reasons. First of all, the available tools offer scientists the opportunity to quality check their data, prior to planned analyses or publishing their data through (Eur)OBIS and they help the (Eur)OBIS data management team in assessing the completeness and quality of the data when making them available online. When incomplete or possibly incorrect data are sent to (Eur)OBIS, the data management team can easily communicate with the provider on the possibly incorrect records based on the assigned quality flags. Secondly, the assigned quality flags can (i) help users in selecting data that are fit for their specific use and purpose or (ii) make it possible to filter records that comply with a certain quality standard and send those to other data systems such as e.g. the European Marine Observation and Data Network (EMODnet).
Each distribution record goes through a series of automated quality control steps, each generating a QC flag. Each QC step is a question that has a yes/no (¼ 1/0) answer and the result is stored as a bit-sequence (2 (xÀ1) ) where X represents the number of the QC flag. The results of all these QC steps are added up and stored in a single QC field in the (Eur)OBIS database, generating a unique integer value for each possible combination of positively evaluated QC steps. An overview of all the QC steps and their corresponding bit-sequence is given in Table 1. Given the different structure and scope of EurOBIS and OBIS, a number of QC steps have been specifically developed for either EurOBIS or OBIS. The majority (17) of the QC steps are, however, available for both data systems.
The strength of the quality control procedures is that they not only evaluate a dataset as a whole but also look at each record individually, giving a much more detailed view on the quality and completeness of the data and providing more opportunities to users in their data selection as one dataset may contain several useful records, which might have been rejected if the evaluation had been done solely on the dataset level. The data format check compares the general format of a dataset with the requirements of the OBIS Schema. When any of the required fields is missing or original field names are not correctly mapped to the field names used within OBIS, then these records are negatively evaluated in the QC procedures and are thus in need of an additional check. Fields that are not part of the OBIS Schema can still be shared with EurOBIS-e.g. through the DarwinCore Archive format (32)-but the corresponding data will-at this time-not be shown through the data portal. If the OBIS Schema recommends the use of certain wording or codes-e.g. in the field 'BasisOfRecord'-this is also checked. The 'BasisOfRecord' defines the kind of data: which can be actual observations (O), specimen information from museum collections (S) or distribution data derived from literature (L), which can already provide a first important data filter for the user.

Assessment of the completeness and validity of information
Besides the basic information of a distribution record (what-where-by whom), the OBIS Schema can capture a lot of other species-related information. A number of the quality checks verify the completeness and soundness of different parts of information in a record. This includes traceability information-e.g. institution code and catalogue number-checking how detailed the date information is, verifying that a given date is possible and-if relevant-if the start date is always before the end date and the minimum depth is always smaller than or equal to the maximum depth.
A number of QC steps make it possible to distinguish between records that can be used as 'presence-only' or where actual counts are available. When a count is given,  (33). Only by linking the given taxon names to a widely accepted marine taxonomic standard, such as WoRMS is it possible to rule out spelling variations and link synonyms to their currently accepted names within (Eur)OBIS. A thorough taxonomic standardization allows the grouping of distribution records in a reliable way for further analysis (12).

Geographic quality control
As EurOBIS and OBIS are biogeographic information systems, verifying the geographic content is as important as verifying the taxonomic data. The geographic checks do not only include a 2D check-latitude and longitude-but they also evaluate the third dimensiondepth-if documented in the dataset.
Several checks relate to the latitude-longitude fields within a given dataset (see Table 1). First of all, it is evaluated . Although 0-0 is a marine position in the Gulf of Guinea (Atlantic Ocean), the odds of having sampled at that exact location is relatively small; All 0-0 cases in OBIS so far were referring to unknown positions, which have been auto-filled by zeros. As both data systems are marine, it is verified whether the sampling locations are located in the marine environment, being seas or oceans. Given the fact that they both receive coastal and estuarine datasets, a land mask accommodating for a 20 km buffer from the coastline (http://www. ngdc.noaa.gov/mgg/shorelines/gshhs.html) is taken into account, hence also including most of the estuarine areas.
Although some datasets document the coordinate uncertainty or precision, this information has thus far not been taken into account in any of the quality control steps.
In nearly all cases, a dataset is accompanied by a detailed metadata description, including text information on the geographical range. Within the metadata information system used for EurOBIS, this geographical range information is coupled to Marine Regions (www.marineregions.org), a standard list of marine geo-referenced place names and areas (34). Based on the available information and shape files within Marine Regions, a comparison is made between the location of the sampling points and the general geographical coverage mentioned in the metadata. If this does not correspond, the relevant sampling locations are flagged as possibly incorrect. When no metadata is available, this check cannot be performed and the record is evaluated as being correct. This check is not yet available on the OBIS database.
Within the marine environment, the relevance of information on sampling depth cannot be underestimated. Based on depth, it is possible to distinguish between e.g. planktonic and benthic observations or coastal and deepsea observations. Given its importance, it is valuable to evaluate if the given depth-value related to the species observation is a possible value. This assessment combines the given depth-values with their geographic coordinates and compares this to the General Bathymetric Chart of the Oceans (GEBCO) (35). As not all depth values are registered with the same precision-and fluctuations exist due to e.g. tidal differences-a 100 m margin is taken into account when assigning a quality flag for this check. This margin should also largely account for the fact that the mean depth within a grid can potentially differ from the actual sampling depth, especially in topographically complex areas.

Outlier analysis
Next to the earlier documented QC steps that run both on EurOBIS and OBIS, global geographic and environmental outlier analyses were developed specifically for OBIS, generating 10 more QC flags. These additional outlier analyses use external environmental and geographical (depth) data to assess the credibility of a certain distribution record, when compared with the available distribution records within the checked dataset or within OBIS as a whole. Given the non-normal distribution of the environmental, depth and distance values of the sampling points, the following two robust outlier detection methods are used: (i) the absolute deviation from the median, with a limit at six times the median absolute deviation (MAD) (36,37) and (ii) an approach based on the Tukey box plot method, with boundaries at three times the interquartile range (IQR) (38). Although a value of three times MAD is already considered as conservative (39), setting the values for the rejection criteria is by definition a subjective decision (37). The values used for the QC flags are based on visual analysis of a subset of the OBIS database and on the fact that a point lying at 6xMAD or 3xIQR from the first or third quartile is considered an extreme outlier (38).
Six of the outlier checks are related to the environment: these checks compare the locality details of a record with depth, sea surface salinity (SSS) and sea surface temperature (SST) values extracted from the global grids of (1) GEBCO (www.gebco.net; The GEBCO_08 Grid, version 20100927), (2) ETOPO1 Global Relief Model (40) and (3) MARSPEC (Ocean Climate Layers for Marine Spatial Ecology) (41), with the earlier explained decision criteria of 6xMAD and 3xIQR. The depth layers of these three global grids are combined and the average of the two most similar depth values is used to average out inconsistencies between the three bathymetric layers. It needs to be taken into account that due to the used resolution of these depth layers-30 arc-second for GEBCO_08 and MARSPEC and 1 arc-minute for ETOPO1 Global Relief Model-the calculated bathymetric values of the positions can significantly deviate from the values at the exact sampling position due to the resolution of the depth layers. These checks help identifying observations that (possibly) occur outside of their environmental range. The four geographic outlier procedures aim (i) to compare the orthodromic or great-circle distance between the actual sampling locations and the centroid of all sampling locations within a specific dataset and (ii) to compare the distance between the sampling location of a specific species record to the centroid of all the available sampling locations of that particular species within the OBIS database. The quality flag is assigned taking into account the 3xIQR or 6xMAD boundaries. The centroid of a set of sampling points is defined as the point that minimizes the sum of squared geodesic distances between itself and each point in the set and it is calculated from all the initial records except those that have zero coordinates or coordinates that fall out of the valid boundaries for the coordinate reference system WGS84.
The outlier analyses aim to identify species documented outside of their expected ranges and to reveal possible errors in the taxonomic identification or the assigned latitude and longitude which were not identified through the record-level geographic QC steps, e.g. a missing minus sign to indicate South or West or accidental switching of latitude and longitude values.

Results
All distribution records within EurOBIS and OBIS have gone through the earlier described quality control steps. Within the OBIS database, at least 60% of the distribution records pass each individual QC step. For some QC steps, >90% of the records pass the enforced criteria ( Figure 1). A detailed look shows that the scores of the different OBIS nodes vary greatly (Figure 2), indicating that the results of these QC procedures can provide valuable feedback to the data providers-to double check their data and possibly make corrections and additions-and users, to select the desired data from the system. For an overview of all datasets available within the OBIS database, we refer to http:// www.vliz.be/en/imis?module¼dataset&dasid¼68.
The results show that-on average-85% of distribution records in OBIS and its respective nodes can be used for species or genus specific analyses ( Figure 1). All nodes-and thus implicitly OBIS-seem to struggle with capturing the corresponding time zone of the given time at which the data were collected (QC13), which is valuable information when collating data from different time zones. Time and the corresponding time zone information is, e.g. highly relevant when comparing data from different regions and analysing the diurnal vertical migration patterns of, e.g. zooplankton species.
When evaluating the records that contain actual counts (the number of observed individuals within each species) within the (Eur)OBIS database, it becomes clear that the most valuable piece of information-an indication of the sample size-is missing for a large number of records (QC16). As most counts are in essence meaningless without a sample size, this QC result shows that still a lot of work needs to be done to be able to use the count information.
Although the results of the individual QC steps can already give a lot of information on the possible usefulness of a record, it becomes even more useful when several QC steps are combined (Table 2). A selection of relevant QC steps can be made on database level, giving an indication of the distribution records within OBIS that comply to these criteria. In biodiversity research, scientists are specifically interested in geo-referenced species and/or genus data. When combining these selection criteria, almost 85% of the records would be fit for this purpose. The more stringent the criteria become, the fewer records will suit the postulated conditions. The number of suitable records diminishes significantly if one wants to make use of counts or abundance information instead of just presence information (QC16), indicating that this information is rather hard to capture and document within large integrated databases, such as e.g. OBIS.
Two different approaches are used within the outlier analyses: the IQR and the MAD methodology. These two have been selected as they are widely used in outlier analyses. In general, the results of both QC procedures are similar. When they differ, the user can combine the results of these QC steps with other QC steps to come to a consensus approach on how to evaluate a specific record. Figures 3 and 4 illustrates that the MAD and IQR approaches can differ, but that these differences are generally relatively small. If a record gets flagged as a possible outlier, some caution is still needed. Figure 3 represents the sampling locations of the dataset 'International Council for the Exploration of the Sea (ICES) Biological Community' (42), where the core of the locations is in the Baltic Sea and the other locations are indicated as geographic outliers. After consultation with the data management team at ICES, it became clear that the records in the Antarctic region were the result of a reporting problem in an old format, where positive latitudes were reported as negative. These errors are currently being fixed, and the correct data should soon be available. Possible issues with the Mediterranean, African mainland and Greenland records are not obvious and are still under investigation by ICES. Figure 4 shows all the distribution records of the Cirriped species Verruca stroemia available within OBIS and how they respond to the different geographic and environmental outlier analysis. Appendix S1 gives an overview of the OBIS datasets containing Verruca stroemia distribution records. In the 'distance outlier analysis', all distribution records along the Norwegian coast, White Sea, Barents Sea and Mediterranean Sea are considered outliers, indicating the species would not occur there. Similar results come from the SSS outlier analysis. Accepting these distribution records as true outliers should be backed up with expert knowledge, as these outliers might not be actual outliers, but e.g. the result of a skewed availability of data within the OBIS database or misidentifications in the field (see discussion).

Discussion
The assigned quality flags to each record provide an indication of the 'fitness for purpose' of a particular distribution record, helping both the user and the data provider in more objectively assessing the quality and completeness of a record and to draw conclusions from this. The majority of the quality flags do not have the intention to label a record as 'good' or 'bad', they just give an indication of the completeness and quality, helping the user in his or her decision to make use of a specific record or to reject it.
Users need to be aware of the fact that the results of the outlier analyses only provide an indication of the possible outlier character of a distribution record. Records flagged as an outlier are not necessarily true outliers: the distribution of a species can e.g. be unrelated to bathymetry, but highly dependent on temperature or salinity. A single outlier check might thus not clearly identify an outlier ( Figure 4), but combining the results of the different outlier checks can indicate with more certainty that a species observation is outside its suspected range ( Figure 5). In addition, knowledge on the actual environmental boundaries of species can help in identifying true outliers and filtering of the data. False positives in the species-based outlier detection can be the result of extremely uneven sampling such as for example data from museum collections. Some true positives on the other hand might not be actual outliers, but could be the first observations for a specific species in a geographical area where it was unknown to appear before. The latter could be the case in first observations of alien species that moved to a new area, and these records should be approached with caution. As the dataset-based outlier detection aims to flag possible errors in the geographic coordinates, this will only work well when the dataset is spatially restricted, e.g. if all samples have been taken in the same region such as the North Sea.  Table 1.
When wider geographical areas are covered within a dataset, this outlier detection is prone to giving false positives, e.g. due to a biased sampling effort in the available data. This is clearly the case for Verruca stroemia (Figure 4): expert and literature consultation have confirmed that the Mediterranean outliers are true outliers, a consequence of misidentification (43). In this case, the providers of these records will be contacted with the expert and literature information. The northern distribution records (Norwegian coast, White Sea, Barents Sea) are, however, validated by literature. In addition, the available depth values also confirmed the species occurs at a depth range from 0 to 548 m (43). Because different outlier analyses are available, it is recommended that users combine the results of these outlier QC checks with each other and with the results of the more basic geography checks. All these combined will make the interpretation of the validity and fitness for use of the records.
Use-case 1: Quality controlled data available through EMODnet As mentioned earlier, the results of the assigned quality control flags can be combined according to the required 'fitness for use' for the users, thereby generating the possibility to create specific filters on the available data within EurOBIS and OBIS. EMODnet Biology Portal (http:// www.emodnet-biology.eu/) is already making use of such a filter, to offer a specific subset of EurOBIS data to its users. EurOBIS is the data engine behind the Biology Portal of EMODnet, meaning that the data part of the Biology Portal is driven by the EurOBIS data. It was, however, agreed that only those distribution data that comply with QC steps 2-3-4-5-related to taxonomy and basic geography-are offered to the users, thereby making a useful 'pre-selection' of the data. Through the portal, users can still see how many distribution records are available in the original dataset and how many have passed the postulated QC steps and are thus available. As of November 2014, 86% or 15.9 million of all the distribution records available in EurOBIS can be consulted through the EMODnet Biology Portal.
Use-case 2: Selection of QC steps available as web services through LifeWatch As of 2012, EurOBIS is part of the central taxonomic backbone of LifeWatch, an E-Science European Infrastructure for Biodiversity and Ecosystem Research which aims at standardizing species data and integrating the distributed biodiversity repositories and operating facilities. Given the importance of standardization, interoperability and being able to assess the quality and completeness of the available data within LifeWatch, a number of the QC steps related to data format, taxonomy and geography that are currently running on the (Eur)OBIS database have been 'translated' to interactive, user-friendly web services (http://www.lifewatch.be/dataservices). By making use of these freely available data services, data providers, data managers and users are able to make a general assessment of the quality, completeness and fitness for use of their own biogeographic data by simply uploading them to the LifeWatch portal and selecting the QC steps they want to run on their data.

Future plans and possibilities
Currently, the QC steps are running automatically on both the EurOBIS and OBIS database. A selection of these QC steps is already available online through LifeWatch as a web service. The creation of a customized filter-a combination of several QC steps-is not yet available for the users. Customized filters on EurOBIS will become available through the EMODnet Data Portal, allowing users to define the necessary 'fitness for use' of the required data and to refine their search results accordingly. In the future, similar filter options will be developed on the OBIS data.
The data download will then also include the corresponding QC flags. The results of the QC procedures currently stored in the database will be used to communicate with the data providers to improve both the quality and completeness of the available data. Specifically the outlier   analyses will provide valuable information to improve the correctness of the data. Currently, newly added datasets are thoroughly analysed before they go online, and possible issues are communicated with the data provider immediately. On the other hand, a lot of data have been uploaded to the database before these QC procedures came into place. For these datasets, a communication plan will need to be worked out to discuss the quality control results with the providers, aiming for the highest possible return and improvement of the data quality and completeness. It is important to realize that for some-mostly historical-datasets, the quality status will remain 'as is', e.g. when no additional information is available anymore and the original data provider is no longer around to deal with the identified issues.
Within WoRMS, the taxonomic information is currently being expanded with species attributes, such as whether a species belongs to the benthos or plankton, if a species is coastal or deep-sea, what the feeding method, average body size and life span is etc. Once these literature and expert-based traits have been sufficiently documented, they can be incorporated in the QC steps to offer an even higher quality standard to our users. For example, if WoRMS can distinguish between coastal and open ocean species, then this trait can be used as an additional check on the species distribution information: a coastal species (presumably) observed in the open ocean could then be flagged as a possible incorrect record, drawing the attention of the users to this and letting them decide for themselves whether they want to include this record in their download or analysis or not.

Conclusion
The development and implementation of the described QC steps meets a need to be able to add quality flags to records and to filter out data based on user needs, taking into account the fitness for purpose of the available records. As an array of QC steps is available, users will be able to create specific filters on the data, answering to their specific data needs and requirements.
Although a number of the discussed QC steps are specifically designed to check data meant for EurOBIS and OBIS, a number of other checks can be used widely by the scientific community to quality control their own data before analysis, publication and data sharing. Offering these QC tools as online, user-friendly data services through LifeWatch (www.lifewatch.be) greatly enhances their overall usability for scientists worldwide and meets the needs of the (marine) scientific community to be able to standardize and quality check their data themselves.
Depending on user needs, more QC steps can be added in the future, or existing QC steps could be fine-tuned to better meet their requirements. The mining of a quality controlled, integrated database of different data sources can give insights in previously unexplored matters and offers the possibility to develop new or improved technologies related both to the quality of the data and the outcomes. It is, however, important to realize that the outlier QC results should be approached with due caution. Because the QC steps are automated, a critical analysis of these QC results might be needed to draw the right conclusions on exclusion or inclusion of these records in certain analyses.

Supplementary data
Supplementary data are available at Database Online.