Imputation of missing data from time-lapse cameras used in recreational fishing surveys

While remote camera surveys have the potential to improve the accuracy of recreational fishing estimates, missing data are common and require robust analytical techniques to impute. Time-lapse cameras are being used in Western Australia to monitor recreational boating activities, but outages have occurred. Generalized linear mixed effect models formulated in a fully conditional specification multiple imputation framework were used to reconstruct missing data, with climatic and some temporal classifications as covariates. Using a complete 12-month camera record of hourly counts of recreational powerboat retrievals, data were simulated based on ten observed camera outage patterns, with a missing proportion of between 0.06 and 0.61. Nine models were evaluated, including Poisson and negative binomial models, and their associated zero-inflated variants. The imputed values were cross-validated against actual observations using percent bias, mean absolute error, root mean square error, and skill score as performance measures. In 90% of the cases, 95% confidence intervals for the total imputed estimates from at least one of the models contained the total actual counts. Although there were no systematic trends in performance among the models, the zero-inflated Poisson model and its bootstrapping variant consistently ranked among the top 3 models and possessed the narrowest confidence intervals. The robustness and generality of the imputation framework were demonstrated using other camera datasets with distinct characteristics. The results provide reliable estimates of the number of boat retrievals for subsequent estimates of fishing effort and provide time series data on boat-based activity.


Introduction
As many recreational fisheries are of large spatial extent, diverse, and not well defined, it can be challenging and costly to obtain accurate recreational fishing information for sustainable management (Smallwood et al., 2012;Hyder et al., 2018). Remote camera surveys (also referred to as digital camera monitoring) are increasingly being used throughout Europe, North America, and Australasia to monitor recreational fishing effort in marine and freshwater fisheries (Smallwood et al., 2012;van Poorten et al., 2015;(Smallwood et al., 2012;Taylor et al., 2018), and activity occurring on the days when onsite survey staff are not present at the ramps. In addition, remote camera data can be used to test the accuracy of other recreational fishing survey methods that have more restricted sampling coverage (Lancaster et al., 2017). The integration of remote camera observations as a complementary technique in recreational fishing surveys can also be used to improve the accuracy and precision of harvest estimates (Steffe et al., 2017).
In theory, remote cameras can provide continuous recordings of boating and recreational fishing activities; however, interruptions of camera operations (herein referred to as "outages") can lead to significant gaps within the data. Camera network outages can occur as a result of technical faults, vandalism, theft, and/or weather conditions, such as temperature and humidity, lightning strikes, flooding, and other environmental factors (Blight and Smallwood, 2015;Hartill et al., 2020). Responding to missing values has been a subject of interest for researchers in many fields, where missing values require proper handling to prevent further loss of precision and reliability of estimates and indices (Kleinke and Reinecke, 2013a;van Poorten et al., 2015;Hartill et al., 2016). In remote camera surveys, missing data can potentially lead to biased estimates. Other problems that could occur include irreproducibility of estimates, and loss of statistical power, with the magnitude determined by the nature and duration of the outage (van Poorten et al., 2015;Hartill et al., 2016). Therefore, it is important to build imputation schemes that are tailored to the pattern and nature of "missingness" and the distributional characteristics of remote camera data. However, despite the rapid emergence of remote camera studies relevant to recreational fishing, relatively few studies have examined analytical approaches for dealing with data outages.
Of the recreational fishing studies that used remote cameras and reported missing data, outages have typically been assumed to be random (Smallwood et al., 2012; Taylor et al., 2018). Model-based approaches for imputing remote camera missing data have also been explored (van Poorten et al., 2015; Hartill et al., 2016; Lancaster et al., 2017). A Bayesian hierarchical model was applied to predict total angling effort for 49 lakes in Canada based on remote camera data, with missing camera data imputed from the average effort at proximate lakes (van Poorten et al., 2015). Hartill et al. (2016) used generalized linear models (GLMs) to impute missing counts of recreational boats returning to a boat ramp based on remote camera data recorded at neighbouring ramps. In both van Poorten et al. (2015) and Hartill et al. (2016), neighbouring ramps and viewpoints were used as reference points. However, these imputation approaches cannot be applied when outages occur simultaneously across nearby ramps, or when ramps with installed remote cameras are widely dispersed and show significantly different trends in boating activities. Both studies recommended incorporating covariates such as climatic and environmental factors. Soykan et al. (2014) identified temperature, rainfall, tides, winds (direction, speed, and gust), and sea surface variables as significant predictors of fishing effort, to which boating effort correlates and serves as a good proxy (Johnson et al., 2017). To the best of our knowledge, no study has evaluated the opportunities of using climatic variables to build imputation models to handle missing observations in remote camera data.
The current study sought to formulate and compare several imputation models with climatic and some temporal classifications, as explanatory variables to impute gaps of missing values in the counts of recreational powerboat retrievals from remote camera monitoring along the coastline of Western Australia (WA). Ten real camera outage patterns were applied to a "complete" remote camera monitoring dataset to artificially create missing gaps. Generalized linear mixed effect models built on the fully conditional specification multiple imputation framework were considered to reconstruct missing gaps (van Buuren, 2007;Kleinke and Reinecke, 2013b). Imputed estimates were compared with actual data recorded for the simulated periods of camera outage. The robustness and generality of the modelling scheme was illustrated on two other camera datasets (and covariates) from different locations, to establish the ability of the imputation scheme to handle both short and long outages.

Study area and camera data description
In WA, an estimated 26% of residents participate in recreational fishing at least once a year (Department of Primary Industries and Regional Development, 2019). Remote cameras have been used since 2006 to monitor trends in boating activities at 30 sites along the coast, including boat ramps, channel entrances and parts of foreshore (Hartill et al., 2020). The type of vessel launched and retrieved is recorded as either commercial, powerboat, jet-ski, kayak or others. Subsequent analysis for this paper was restricted to powerboat retrievals, as this is the common vessel type used for boat-based recreational fishing activities in WA. Counts of the number of powerboat retrievals for each ramp were recorded to the nearest minute. A technical overview of the camera monitoring scheme can be found in Blight and Smallwood (2015).
This study utilized complete data on powerboat retrievals collected between 1 March 2011 and 29 February 2012 at the Leeuwin ramp and ten outage patterns observed at eight boat ramps distributed across the coastline of WA (Figure 1, see also Supplementary Table S1). The outage patterns identified coincided with the state-wide surveys of boat-based recreational fishing in WA (see Supplementary Table S1). The choice of the ten outage patterns was based on the proportion of missingness, ranging from 0.06 to 0.61 (see Figure 2). The longest outage imputed was 80 days (approximately 1920 h) and the shortest was 1 h. The ten outage patterns were of variable lengths and uncorrelated among the ramps. The complete record consisted of 8784 hourly entries of counts of powerboats retrieved, and 54.4% of all records were zeros. In total, 12 293 powerboat retrievals were recorded.
A simulation scenario was chosen, where observed data of the complete records were turned into missing data based on the ten outage patterns (see Figure 2). This was done to enable cross-validation of the models and to establish the consistency of the models in imputing the various durations of outages. If data were missing for any portion of the hour, the observation for this hour was classified as missing to control for all possible interpretation errors that may have occurred during outages.
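The masking rule described above (an hour is treated as missing if any part of it falls within an outage) can be illustrated with a minimal Python sketch. The function names, the toy counts, and the single outage interval are hypothetical and chosen only for illustration; they are not taken from the study data.

```python
from datetime import datetime, timedelta

def hours_touched(start, end):
    """Return the hour-start timestamps overlapped by the outage [start, end)."""
    hour = start.replace(minute=0, second=0, microsecond=0)
    touched = set()
    while hour < end:
        touched.add(hour)
        hour += timedelta(hours=1)
    return touched

def apply_outages(hourly_counts, outages):
    """Set to None every hourly count that is overlapped by any outage interval."""
    missing = set()
    for start, end in outages:
        missing |= hours_touched(start, end)
    return {hour: (None if hour in missing else count)
            for hour, count in hourly_counts.items()}

# Toy "complete" record: six consecutive hours of retrieval counts (made-up values).
t0 = datetime(2011, 3, 1, 6)
complete = {t0 + timedelta(hours=h): c for h, c in enumerate([0, 2, 5, 3, 1, 0])}

# One hypothetical outage from 07:40 to 09:10. It overlaps the 07:00, 08:00
# and 09:00 hours, so all three are classified as missing.
outages = [(datetime(2011, 3, 1, 7, 40), datetime(2011, 3, 1, 9, 10))]
masked = apply_outages(complete, outages)
missing_proportion = sum(v is None for v in masked.values()) / len(masked)
```

Because the original counts are retained in `complete`, the imputed values for the masked hours can later be compared directly with the known observations.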
In addition, camera records for Mindarie in  and Monkey Mia in Ryan et al. (2017) were used. Missing data in these records were imputed for short-term camera outages applying the extrapolation method in Wise and Fletcher (2013). However, missing data for periods of extended outages were not imputed. The traffic intensities at these ramps are very different; Mindarie is a moderately busy ramp with an annual total of approximately 20 000 powerboat retrievals, while Monkey Mia is a less busy ramp with approximately 6000 powerboat retrievals. In terms of geographical location and climate type, Mindarie is in the West Coast bioregion with a hot-summer Mediterranean climate, while Monkey Mia is in the Gascoyne Coast bioregion with a hot semi-arid climate (see Figure 1). Severe outages were observed at Mindarie, as 60% of the fishing year data were missing, with no data for the months of September, April, and July. Similarly, 14% of data at Monkey Mia were missing. No data were available for the month of June.

Models and missing data assumptions
Let $Y$ be the count of powerboat retrievals observed from a remotely operated camera, with some missing values, such that $Y = (Y_{obs}, Y_{mis})$, where $Y_{obs}$ represents the observed data and $Y_{mis}$ the missing data. Data were assumed to be missing at random. The imputation models were formulated to investigate the conditional distribution:

$$P(Y_{mis} \mid Y_{obs}, X, \theta) \qquad (1)$$

where $X$ denotes the covariates and $\theta$ represents the vector of unknown parameters of the model. It was further assumed that the data generating process for $Y$ can be derived from generalized linear mixed effect models (Afrifa-Yamoah et al., 2019). Let $y$ be an $n \times 1$ vector of observed outcomes:

$$y = X\beta + Z\upsilon + \varepsilon \qquad (2)$$

where $X$ is an $n \times k$ matrix of fixed effects associated with the outcome $y$ via $\beta$, which is a $k \times 1$ vector of coefficients, and $Z$ is an $n \times r$ matrix of random effects associated with $y$ via $\upsilon$, which is an $r \times 1$ parameter vector. $\psi_{\upsilon}$ is the $r \times r$ variance-covariance matrix of the random effects and $\varepsilon$ is the $n \times 1$ error vector with covariance $R = \sigma^2 I$, where $I$ is an $n \times n$ identity matrix. Climatic variables were treated as fixed effects, whereas the temporal classifications such as season, type, and time of day were treated as random effects (see Table 1 for variable description). Missing data in the climate data were imputed using the methods in Afrifa-Yamoah et al. (2020). It is important to note that these covariates were not directly associated with the missing mechanism and did not explicitly give any information on why the camera records were missing. Although it is common in scientific studies to focus on the relative importance of predictors within statistical models, in this study the focus was to predict boating effort based on the collective contribution of all covariates, irrespective of the statistical significance of their coefficients.
Based on the distributional characteristics of the data, quasi-Poisson (denoted as QP), negative binomial (denoted as NB), zero-inflated Poisson (denoted as ZIP), and zero-inflated negative binomial (denoted as ZINB) models were considered. In the two-level models, it was assumed that the two-level processes (i.e. zero and non-zero parts) were influenced by the same set of covariates. In the modelling scheme, random intercept models were fitted, and common slopes were assumed without consideration for interaction effects. This was done to moderate the complexity of the model structure because of the number of predictors used.

Figure 1. The Leeuwin boat ramp is denoted with a larger white star because no outages occurred in the data from this camera in 2011/12. Real outages that occurred from the other remote cameras (denoted by smaller solid stars) were applied to the complete data set at Leeuwin to examine the various modelling approaches.
Predictive mean matching (PMM) as a general purpose method was also investigated. Little is known about the suitability of PMM for count data, because it was developed for imputing missing observations among continuous variables.
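To make the zero-inflated Poisson idea concrete, the following Python sketch evaluates the standard textbook ZIP probability mass function, in which a count is a structural zero with probability `pi` and otherwise follows a Poisson distribution with rate `lam`. This is an illustration only: the constant values of `lam` and `pi` are arbitrary assumptions, whereas in the models described above both parts depend on covariates and random effects.

```python
import math

def zip_pmf(k, lam, pi):
    """P(Y = k) for a zero-inflated Poisson with rate lam and structural-zero probability pi."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        # Zeros arise either structurally or from the Poisson part.
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson

def zip_mean(lam, pi):
    """Expected count under the zero-inflated Poisson."""
    return (1.0 - pi) * lam

# Arbitrary illustrative parameters (not estimates from the study).
lam, pi = 2.5, 0.5
p_zero = zip_pmf(0, lam, pi)                         # share of zero counts implied by the model
total = sum(zip_pmf(k, lam, pi) for k in range(60))  # should be very close to 1
```

The extra mass at zero is what allows such a model to accommodate records in which a large share of hourly counts are zero, which a plain Poisson model with the same mean would under-represent.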

Fully conditional specification multiple imputation
The fully conditional specification multiple imputation framework (van Buuren and Groothuis-Oudshoorn, 2011) was used to specify conditional models of the partially observed outcome variable given the covariates, to obtain a posterior predictive distribution defined as:

$$p(y_{mis} \mid y_{obs}, X) = \int p(y_{mis} \mid y_{obs}, X, \theta)\, p(\theta \mid y_{obs}, X)\, d\theta \qquad (3)$$

where $\theta = (\beta, \psi_{\upsilon}, \sigma)$ is the vector of parameters in (1) and $p(\theta \mid y_{obs}, X)$ is the observed data posterior density of $\theta$. From (3), estimates of the model parameters, $\theta^*$, were obtained using a Gibbs sampler (see van Buuren and Groothuis-Oudshoorn, 2011; Kleinke and Reinecke, 2013a). From $\theta^*$, we generated a chain of equations, $y^* = x\beta^* + z\upsilon^* + \varepsilon^*$, for the observed data and missing observations. A random draw was made from the $k$ values of $Y_{obs}$ with $y^*$ in the closest neighbourhood to that of the missing observation being imputed. This was done to introduce between-imputation variability. The procedure was repeated, generating $M$ plausible complete datasets accounting for the uncertainty in the missing data (Rubin, 1987; Sterne et al., 2009). More detail about the imputation scheme is presented in the Supplementary material.

Figure 2. Distribution of the ten outage patterns applied to the Leeuwin dataset and their missing proportion. The horizontal axis represents the length of the camera data partitioned into 100 blocks. The vertical axis represents the proportion of missing data in the partitioned block or otherwise. The black bands represent the periods of camera outages, and the grey bands represent the observed data.
However, the normality assumption on $\hat{\theta}$ may sometimes be implausible (Rubin, 1987). A variant implementation of the imputation scheme was therefore carried out, where the draws of $\theta^*$ were estimated through bootstrapping (Efron, 1994). The parameter vector $\theta^*$ was estimated by fitting model (1) to a bootstrap sample consisting of $\hat{n} - s$ observations (where $s$ is the number of missing observations) drawn from $(Y_{obs}, X)$ (van Buuren and Groothuis-Oudshoorn, 2011; Kleinke and Reinecke, 2013b). This scheme resulted in imputations generated as before, from the following models: quasi-Poisson (denoted as QP.boot), negative binomial (denoted as NB.boot), zero-inflated Poisson (ZIP.boot), and zero-inflated negative binomial (ZINB.boot). PMM was also implemented in the multiple imputation framework. The scheme used the ordinary multiple linear regression model to formulate the posterior distribution of $\theta$. The imputation schemes replaced missing observations with observed values and thereby preserved the distribution of the observed data (Yu et al., 2007).
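The donor-selection step described in this section (a random draw among the observed counts whose predicted values lie closest to the prediction for the missing hour) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the predicted values are placeholders standing in for the model-based predictions, the function name `pmm_draw` and the donor pool size `k` are hypothetical, and the parameter-drawing step (Gibbs sampler or bootstrap) is not shown.

```python
import random

def pmm_draw(pred_missing, pred_observed, y_observed, k=5, rng=random):
    """Impute one value by sampling among the k observed counts whose
    predictions are closest to the prediction for the missing case."""
    ranked = sorted(range(len(y_observed)),
                    key=lambda i: abs(pred_observed[i] - pred_missing))
    donors = ranked[:k]                     # indices of the k nearest donors
    return y_observed[rng.choice(donors)]   # an actually observed count

# Placeholder predictions and observed counts (made-up values).
pred_observed = [0.2, 0.4, 1.1, 1.9, 3.2, 4.8]
y_observed = [0, 0, 1, 3, 2, 6]
pred_missing = 2.0

rng = random.Random(1)  # fixed seed so the sketch is repeatable
# Five draws mimic M = 5 imputations of the same missing hour.
imputations = [pmm_draw(pred_missing, pred_observed, y_observed, k=3, rng=rng)
               for _ in range(5)]
```

Because every imputed value is copied from an observed record, the imputations are always non-negative integers that actually occurred in the data, and the random choice among donors is the source of between-imputation variability.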

Pooling analysis
where $M$ denotes the total number of estimates for $Y_{mis}$ in the imputation scheme and $\sum_{m=1}^{M} \left( \hat{Y}^{mis}_{i,m} - \hat{Y}^{mis}_{i} \right)^2$ reflects the missing values estimation uncertainties (Rubin, 1987).
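As an illustration of pooling across imputations, the sketch below applies the standard form of Rubin's rules (Rubin, 1987) to a quantity estimated from each of the $M$ completed datasets. It is not necessarily the exact formulation used in this study: the function name `pool_rubin`, the example totals, and their within-imputation variances are all hypothetical.

```python
from statistics import mean, variance

def pool_rubin(estimates, within_variances):
    """Combine M per-imputation estimates and their variances with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    w = mean(within_variances)     # average within-imputation variance
    b = variance(estimates)        # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b        # total variance of the pooled estimate
    return q_bar, t

# Made-up totals from M = 5 imputed datasets and their made-up variances.
estimates = [980, 1010, 995, 1005, 1010]
within_variances = [400, 420, 390, 410, 380]
pooled_total, total_variance = pool_rubin(estimates, within_variances)
```

The between-imputation term is built from the squared deviations of the individual imputed estimates from their mean, which is the quantity the text above describes as reflecting the uncertainty in estimating the missing values.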

Model performance evaluation
Missing values were imputed five times ($M = 5$), as this number is considered to provide an appropriate balance of the bias-variance trade-off in the model evaluation (van Buuren and Groothuis-Oudshoorn, 2011; Allison, 2015). The estimation accuracy of the imputed values was assessed with the following performance indicators: percent bias, mean absolute error (MAE), root mean square error (RMSE), and skill score (SS) based on the mean square error. Percent bias measures the average tendency of imputed values to be larger or smaller than the associated observed values. A positive score indicates overestimation, whereas a negative score indicates underestimation. The optimal value is 0, with low values indicating plausible imputed values. The MAE and RMSE are widely reported imputation modelling performance indicators. MAE and RMSE have the same units as the variables measured. They are non-negative and unbounded above, with lower values indicating high levels of agreement between observed and estimated values. The SS measures the accuracy of a forecast relative to a standard reference. The values of SS are bounded above by 1 and unbounded below. A perfect forecast is observed when a score of 1 is obtained.
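The four indicators can be computed as in the following Python sketch, which uses common textbook definitions. Two points are assumptions rather than details confirmed by the text: percent bias is expressed relative to the observed total, and the skill score uses the mean of the observed values as the reference forecast. The example counts are made up.

```python
import math

def percent_bias(imputed, observed):
    """100 * sum(imputed - observed) / sum(observed); positive means overestimation."""
    return 100.0 * sum(i - o for i, o in zip(imputed, observed)) / sum(observed)

def mae(imputed, observed):
    """Mean absolute error, in the units of the counts."""
    return sum(abs(i - o) for i, o in zip(imputed, observed)) / len(observed)

def mse(imputed, observed):
    """Mean square error."""
    return sum((i - o) ** 2 for i, o in zip(imputed, observed)) / len(observed)

def rmse(imputed, observed):
    """Root mean square error, in the units of the counts."""
    return math.sqrt(mse(imputed, observed))

def skill_score(imputed, observed, reference):
    """1 - MSE(imputed) / MSE(reference); equals 1 for a perfect imputation."""
    return 1.0 - mse(imputed, observed) / mse(reference, observed)

# Made-up hourly counts for a simulated outage and their imputed values.
observed = [0, 2, 4, 1, 3]
imputed = [1, 2, 3, 1, 4]
# Assumed reference forecast: the mean of the observed counts for every hour.
reference = [sum(observed) / len(observed)] * len(observed)

pb = percent_bias(imputed, observed)
err_mae = mae(imputed, observed)
err_rmse = rmse(imputed, observed)
ss = skill_score(imputed, observed, reference)
```

With a different reference forecast the skill score changes while the other three indicators do not, so the choice of reference should be stated whenever SS values are compared.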

Application
To further evaluate the ability of the method to impute plausible values with a distribution comparable to the observed periods, the model with the best overall performance in the cross-validation study was determined. This model (with covariates unique to the locations) was then applied to impute missing data in the camera datasets at Mindarie and Monkey Mia. For Mindarie, there were 3 months with complete camera outages, as well as shorter duration outages in the other months; for Monkey Mia, there was a single longer duration outage in addition to intermittent outages.

Results
Case study: ten outage patterns applied to a complete dataset
The percentage of zero counts in the dataset with simulated missing data scenarios ranged from 21.2% to 51.8%. For these simulated periods, the average observed hourly counts ranged from 1.0 to 1.5 (SD 2.1 to 3.0), while the average imputed hourly counts ranged from 1.0 to 1.8 (SD 1.8 to 2.7). Total imputed estimates obtained from the nine models agreed with the actual totals in most cases (Table 2). For outage patterns 5 and 9, the 95% confidence intervals of the imputed total number of powerboat retrievals from the nine models overlapped with the observed total number of powerboat retrievals. For the ZIP and ZIP.boot models, the 95% confidence intervals of the imputed totals for nine of the outage patterns contained the observed total number of powerboat retrievals. For outage pattern 1, the 95% confidence intervals of the imputed totals did not contain the observed total for any of the nine models (see Figure 3). With respect to temporal strata such as months, the 95% confidence intervals of the imputed counts of powerboat retrievals for most of the models often contained the total observed counts (see Supplementary Table S2).
In terms of percent bias, models were ranked differently, but ZIP models were often among the top ranked models. The direction of the bias also varied among the outage patterns. For example, the bias was positive for all the models for outage pattern 10, indicating overestimation of the total counts, but for outage pattern 1, all the models recorded negative bias, with underestimated total counts. In terms of MAE and RMSE, the indicators agreed on the top ranked models for all the outage patterns apart from outage pattern 6. The ZIP models were top ranked most frequently (see Table 2). The relatively low values of MAE and RMSE suggest close agreement between imputed and observed data. The percentage differences in MAE and RMSE values between the two best models ranged from 0.1% to 4.7% and 0.04% to 7.3%, respectively. Nominal differences in MAE and RMSE among the three best models were relatively small and did not appear to be important. SS values, however, revealed some level of disparity in the performance of models and in most cases were distinctive in the choice of the best ranked model. Except for outage patterns 1 and 6, the SS consistently ranked the ZIP and its bootstrap variant as the best models, notably in missing patterns of very long duration (e.g. outage patterns 7 and 8). The percentage difference in the SS values between the two best models (models with larger SS scores) for the ten outage patterns ranged from 0.6% to 35.3%, with the magnitude of errors between −0.158 and 0.312. Although there was no clear systematic trend in the performance of the models with respect to the pattern, the proportion of missing data, and the proportion of zeros in the dataset, ZIP models were generally ranked best.

Application
For Mindarie, distributions across hours of the day adequately depicted the nature of boating activities, particularly for the 3 months where records were missing in their entirety (see Figure 4). This was inferred from the similarities that exist between the distributions for the imputed values and observed months. In addition, for the other months with some missing data, there were some differences in the shapes of the distributions of powerboat retrievals obtained with the imputations from the outlined method compared to the results in Ryan et al. (2017). For instance, the distributions for the months of January, May, and June were more regular in shape compared to results in Ryan et al. (2017). The variations in the imputations expressed in terms of the monthly total powerboat retrievals were also lower.
Similar patterns were observed in the analysis of Monkey Mia dataset. The distribution for the imputed values for May (with complete outage) adequately reflected the general patterns of the distributions across the observed months (see Supplementary Figure S2). The variation in the estimates of total monthly powerboat retrievals was lower than that for the method applied in .
To further understand the short-term behaviour and the consistency of the imputations, detailed daily distributions of the imputations, particularly for the months with no data, have been provided (see Supplementary Figures S1, S3, and S4). The daily distributions of imputed values for April at Mindarie and May at Monkey Mia adequately depicted the nature of traffic intensities at the two boat ramps.

Table 2. The table displays the observed total counts and the imputed total powerboat retrievals (with 95% confidence intervals), the percentage bias, the skill score, the mean absolute error, and the root mean square error from the fitted models in relation to the ten missing patterns. The best models have bold scores with respect to the performance indicators.

Discussion
Generalized linear mixed models built on a fully conditional specification multiple imputation framework were found to reconstruct plausible values of counts of powerboat retrievals for the durations of outages studied. The modelling framework has demonstrated suitability for the imputation of missing data in count data sets. Generally, the choice and type of model will depend on the nature and characteristics of the data set and the missing patterns. However, the ZIP model in the multiple imputation scheme (with its "self-correcting properties") is likely to perform well for count data with many zeros and possible overdispersion. This is because, for such data, zero-inflated models provide a rigorous analytical approach. In addition, the overdispersed nature of such data will not impact on the results, since it uses the dependencies within the dataset to hierarchically model the variance structure. We recommend further simulation studies to assess varying modelling conditions and missing mechanisms of various types including missing completely at random (MCAR), where it is assumed that there is no relationship between missingness of the data and any values, observed or missing. Robust imputation models with the ability to uncover the relationship between variables to "fill-in" gaps of missing data with values that will fit the distribution of the powerboat retrievals are ideal. In the framework outlined, the predicted values from the chain of equations formulated with the covariates were used to guide the random draws from the observations to impute missing data. To obtain plausible estimates using regression modelling requires the use of as much information as available (Kaiser and Tracy, 1988;van Buuren and Groothuis-Oudshoorn, 2011). Conceptually, it is difficult to determine all variables related to boating activity, as many other factors not considered in this study may be important. 
The covariates in the imputation modelling phase do not completely capture all the variability in the powerboat retrievals data. However, inclusion of more variables might lead to collinearity with the response and among control variables. Perturbation analysis (see Hendrickx, 2018) can be applied to mitigate the impact of collinearity on the response. If collinear control variables do not covary with the response variable(s) (which was the case in this study), there will be no effect on coefficient estimation or model performance (Allison, 2012).
The results varied for lower-level temporal stratification and there were instances of under- and overestimation, notably in the PMM and the negative binomial models for time of day. The shortfalls of the PMM in imputing non-continuous variables are apparent (Allison, 2015). The approach performed well in imputing estimates for the large scale (e.g. the 12-month total number of powerboat retrievals at the Leeuwin ramp) but struggled at a finer scale (e.g. time of the day). This was because PMM used an ordinary linear regression model in the estimation process and did not capture the clustering effects, especially for temporal variables with several levels. Conversely, the estimation processes for the negative binomial models were more cumbersome and sometimes models had to be run for long periods of time before convergence. This was mostly a consequence of the variance being a quadratic function of the mean, which affected the iteratively weighted least squares algorithm. Ver Hoef and Boveng (2007) found the quasi-Poisson regression to be superior to the negative binomial regression in estimating the overall abundance of harbour seals (Phoca vitulina) with overdispersed count data, as the negative binomial regression tends to assign more weight in the parameter estimation process.

Figure 3. Total estimates of powerboat retrievals (with 95% confidence intervals) obtained from the nine fitted models for the ten missing patterns studied. The horizontal dashed lines represent the true observed total counts of powerboat retrievals at the Leeuwin boat ramp from the missing periods.
The opportunities that remote camera surveys provide for complementary and corroborative purposes in recreational fishing research are evident (Smallwood et al., 2012;Hartill et al., 2016;Lancaster et al., 2017;Steffe et al., 2017;Askey et al, 2018). Although some of the challenges of missing data in remote camera studies can be mitigated with measures such as regular maintenance schedules, back-up power supplies for cameras, and installing cameras in proximate locations to assist data sharing (van Poorten et al., 2015;Hartill et al., 2016), missing data cannot be completely eliminated. Our study has demonstrated that there is a need to explore, using known response data for the variable of interest, the extent to which imputation models successfully describe the distribution of that variable and adequately impute plausible values for missing periods. The current method is suitable for imputing reasonably long outage periods, but in an instance where an entire season is missing, more assumptions would be required. For instance, major outages (e.g. 9-month in a year) will compromise the quality of imputed estimates and the level of acceptance of the results. Alternatively, the nearest neighbourhood concept used in Hartill et al. (2016) could be applied. In addition, we propose that for ramps where continuous camera data have been collected, inferences could be made from estimates from the preceding years.
The outcomes of the imputation modelling can assist in dealing with outages in remote camera studies elsewhere. Within WA, the detailed analysis undertaken at the boat ramp at Leeuwin will form the basis of imputing missing counts of powerboat retrievals for camera outages for the other locations where remote cameras have been installed. This will enable the monitoring of long-term trends in boating and recreational fishing activity. The current study has a wide area of application in fisheries, ecological, and related studies involving remotely operated cameras and automatic traffic counters. For instance, remote camera monitoring data has been used to estimate nocturnal shore-based recreational fishing effort (Taylor et al., 2018), to estimate angling effort (van Poorten et al., 2015;Askey et al, 2018;Stahr and Knudsen, 2018), and to monitor human use of artificial reefs and areas of the coast (Wood et al., 2016;Flynn et al., 2018). As the number of fisheries and ecological studies using remote cameras is likely to increase, the need to consider accurate approaches for imputing missing data resulting from outages will become increasingly important to guide key management decisions.

Supplementary data
Supplementary material is available at the ICESJMS online version of the manuscript.

Figure 4. For the other months with missing data, differences can be observed in shapes compared to the results in Ryan et al. (2017). Right: monthly distribution of the total number of powerboat retrievals, with 95% confidence intervals where data imputations were required. The grey bars represent the months with complete camera outage.