Mapping wheel-ruts from timber harvesting operations using deep learning techniques in drone imagery

Wheel ruts, i.e., soil deformations caused by harvesting machines, are especially harmful because they can impact the growth conditions of future forest generations and should therefore be avoided or ameliorated. However, the mapping of wheel ruts that would be required for monitoring harvesting operations and planning amelioration measures is a tedious and time-consuming task. We used a deep-learning image-segmentation method (ResNet50 + UNet architecture) trained on drone imagery acquired shortly after harvests in Norway, where more than 160 km of wheel ruts were manually digitized. Leave-one-out cross-validation across 20 harvested sites resulted in F1 scores of 0.45-0.83, with an average of 0.67. The highest accuracy was obtained for severe wheel ruts (average user's accuracy (UA) = 0.74) and the lowest for light wheel ruts (average UA = 0.62). Besides rut severity, the accuracy was also affected by the spatial resolution and the noise present at a site. In combination with the ubiquitous availability of drones, our method enables the automated mapping of wheel ruts and thus has the potential to greatly reduce the environmental impact of final felling operations.


Introduction
Mechanized harvesting of forests is an integral part of sustainable forest management and is required to supply society with timber through efficient and safe forest operations. However, mechanized harvesting operations can have considerable environmental impacts. Besides the inevitable but temporary loss of habitat for forest-dwelling animal and plant species, harvesting operations can result in soil damage (Ampoorter et al. 2010). Wheel ruts are a form of soil displacement caused by the wheels or crawler tracks of forest machines that compress and shear the upper soil layers. Despite considerable technological developments to reduce the impact on soils, wheel ruts can occur when forest operations cannot be conducted under suitably low soil moisture conditions. The porosity of soils affected by wheel ruts may be reduced, resulting in anaerobic processes and reducing the stability and growth of future forests. Therefore, wheel ruts should be avoided as far as possible, and forest certification schemes and regulations include thresholds for acceptable levels of wheel rutting. Hence, it is important to have efficient ways to monitor the extent of wheel rutting following harvests.
Because of the large areas involved, rugged terrain, and presence of harvest residues, it is challenging to map wheel ruts in the field, and the use of remotely sensed data may present a feasible alternative. Drones or unmanned aerial vehicles (UAVs) have become popular for capturing images in many forest-related applications (Puliti et al. 2015, Iglhaut et al. 2019, Kentsch et al. 2020, Banu et al. 2016). In the field of forest operations, drone images and derived 3D products provide a useful source of information to assess the environmental performance of harvesting operations. In particular, drones have been used to measure wheel rut depth (Pierzchala, Talbot and Astrup 2014, Haas et al. 2016, Talbot et al. 2018, Marra et al. 2021). These studies provided insights into the accuracy obtainable for rut depth measurements from drone imagery. Nevertheless, all of the above studies required manual intervention in either identifying the trail network or localizing specific measurement points or profiles for further analysis. In an effort to automate such measurements, Nevalainen et al. (2017) developed a method to delineate wheel ruts and measure their depth based on UAV imagery. While providing a certain degree of automation, this method is limited because it relies heavily on the presence of trees around the wheel ruts, which are used to mask out the track area. Such conditions are rarely met in clear-cut areas, which are characterized by a mix of open ground and harvest residues.
A more widespread operational deployment of UAV-based post-harvest assessment would require the partial or full automation of wheel rut detection and measurement. Popular techniques for the analysis of drone data include support vector machines (SVM) (Boser et al. 1992) and random forest (Breiman 2001), but these require human intervention for feature extraction and therefore rely heavily on domain knowledge (Liu & Lang 2019). Deep learning (DL) algorithms automatically learn features from the data, enabling automation for broad applications, and often outperform traditional algorithms (Zou et al. 2015; Wurm et al. 2019; Bhatnagar et al. 2020). This is also aided by increasingly sophisticated DL libraries and complementary hardware to process the data. For remote sensing applications, convolutional neural networks (CNNs) have become a popular choice (Ma et al. 2019). CNNs are mainly utilized for classification in two ways: scene annotation and semantic segmentation. In scene annotation, the output of the CNN is a 1-dimensional (1D) vector defining the probability of the image belonging to a particular label, whereas in semantic segmentation every pixel is labelled, i.e., the output is not a 1D probability but a 2-dimensional (2D) score map. Therefore, in semantic segmentation, the fully connected layers are replaced by fully convolutional layers. Multiple new and off-the-shelf architectures exist that have already been successfully applied to image classification in remote sensing (Shin et al. 2016).
While we are not aware of studies on the automated detection of wheel ruts using DL, CNNs have been applied to urban mapping (Audebert et al. 2018), including the identification of roads (Bayoudh et al. 2021), cracks in surfaces (Kim et al. 2021; Ali et al. 2021), tracks (Giben et al. 2015), and pavements (Ma et al. 2021). In addition, studies such as Kanakaraddi et al. (2021), Patil & Jadhav (2021), and Sofia et al. (2021) have demonstrated the use of CNNs to detect roads in satellite imagery. Zhang et al. (2018) describe the benefit of combining ResNet with UNet to extract roads from aerial images.
This study aims to automate the detection and segmentation of wheel ruts caused by cut-to-length harvesters and forwarders in drone images of previously tree-covered sites acquired shortly after final harvests. We use CNN models to segment wheel ruts and cross-validate our results using 20 independent harvested sites with areas between 0.5 and 21.5 ha in south-eastern Norway.

Material and methods
The processing workflow (Figure 1) consisted of five steps:
1) Capturing the drone imagery.
2) Manually annotating the drone images (wheel ruts digitized as polyline vectors).
3) Pre-processing the images, which includes rasterization of the wheel rut vectors and splitting the imagery for feeding into the DL system.
4) Semantic segmentation to detect wheel rut and non-wheel rut areas in all images per site.
5) Post-processing the predicted results, including mosaicking the images and applying morphological operations to the mosaicked images.
Steps 4 and 5 were repeated for all sites following the leave-one-out cross-validation scheme.

Study sites and drone data
A total of 20 study sites were available, each imaged from a drone after clear cutting (Figure 2a). The flights were conducted over a span of four years (2016-2019) as part of a long-term effort to monitor the environmental performance of modern harvesting practices. A full description of the manually annotated data is provided by Heppelmann et al. (2021). All sites were productive forest areas in south-eastern Norway.
The availability of the sites was determined in cooperation with forest owners commissioning harvests in several research projects for which the drone data were collected.
The drone data acquisitions varied with respect to several parameters, including the camera used for the image acquisition, flight altitude, season, and date and time of day (Table 1).
The sites captured during the initial part of the acquisition period were imaged using a DJI Phantom 2 UAV fitted with a GoPro Hero 4 12-megapixel camera (p2GoPro) (DJI 2013), which was later replaced by a DJI Phantom 4 Pro UAV with DJI's factory-fitted 20-megapixel camera (p4pro) (DJI 2020). Survey flights with given altitudes and overlaps were conducted using DJI's Ground Station Pro software (DJI 2021) in most cases, although UgCS software (www.ugcs.com) was used on steeper sites to reduce variation in ground sampling distance (GSD) within the same model. UgCS allows flight paths to follow the terrain form from a given digital terrain model, resulting in a constant flight height above ground. For the lower-resolution p2GoPro, a flight altitude of approximately 50 m above the ground was targeted, while this was increased to between 60 and 100 m for the p4pro, in both cases depending on terrain and obstacles. Although only low-to-ground objects were of interest, a high forward overlap of 80% and lateral overlap of 70% was targeted in the flight plan. On each site, 5-7 ground control points (GCPs) were installed before image capture. The GCP positions were recorded at centimetre accuracy using a Topcon GR-5 real-time kinematic (RTK) GNSS receiver with live correction via the GSM network.
Agisoft Photoscan was used to process all UAV-based image data (Agisoft 2019). This process includes feature detection, image alignment, depth mapping, the generation of sparse and dense point clouds and a textured mesh, and finally the creation of a digital elevation model (DEM) and ortho-mosaic. Depth mapping is carried out using structure-from-motion (SfM), as described by Iglhaut et al. (2019). The GCPs, which were clearly visible in the imagery, were used to optimize camera pose estimates during the SfM procedure in Agisoft, resulting in DEMs with an RMSE typically < 5 cm on average. The finest common resolution on each site was used to generate the ortho-mosaics and DEMs, which was typically the same as the GSD of 1-3 cm.
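To illustrate the photogrammetric steps, the following is a hedged sketch of a comparable SfM pipeline scripted with the Agisoft Metashape Python API (the study itself used Agisoft Photoscan; the function names and arguments shown here are assumptions and vary between API versions):

```python
# Hedged SfM sketch with the Agisoft Metashape Python API (not the authors'
# workflow, which used the Photoscan GUI); call names vary between versions.
import Metashape

doc = Metashape.Document()
chunk = doc.addChunk()
chunk.addPhotos(["IMG_0001.JPG", "IMG_0002.JPG"])  # placeholder image paths

chunk.matchPhotos(downscale=1)   # feature detection and matching
chunk.alignCameras()             # sparse point cloud and camera poses
# GCP markers would be placed here to optimize camera poses before densification.
chunk.buildDepthMaps(downscale=2)
chunk.buildPointCloud()          # dense point cloud (buildDenseCloud() in 1.x)
chunk.buildDem()                 # digital elevation model
chunk.buildOrthomosaic()         # ortho-mosaic at the finest common GSD
chunk.exportRaster("ortho_rgb.tif", source_data=Metashape.OrthomosaicData)
```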

Data annotation
The ortho-mosaics for all 20 sites were manually annotated in a GIS environment to register the location, frequency, and severity of wheel ruts (Figure 1, step 2). Because most trees were removed during the harvesting operations (i.e., clear cut), the tracks were clearly visible in the RGB ortho-mosaics. In addition to the RGB information, the UAV DEMs were used to aid the annotation (Heppelmann et al. 2021). The tracks were digitized as polylines, and each polyline was classified into one of the following three severity classes (Figure 2c):
• Light: visible tracks with no identifiable soil displacement or rut-formation.
• Moderate: tracks that showed rutting with minor soil displacement and deeper indentations but no visible loss of water drainage functions.
• Severe: all tracks with either substantial soil displacement, deep indentations, loss of water drainage functions, or a combination of these factors.

Preparation of the annotated data for deep learning
The aim of the study was the segmentation of wheel ruts independent of their severity category (Figure 1, step 3). The reference observations for modelling were thus binary annotated images with wheel ruts as one class and unaffected area as the other. Such an image was generated by applying a three-metre buffer around the annotated line shapefile, dissolving the results, and converting the resulting polygon into a binary raster with a value of one for tracks and zero for non-tracks (Figure 3b,c). The three-metre buffer around the annotated line shapefile was selected after a visual assessment of the affected area. Additionally, any non-forest area was masked out from further analysis.
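The buffering and rasterization step can be illustrated with a minimal Python sketch using geopandas and rasterio (the file names and the projected, metre-based CRS are assumptions; this is not the authors' code):

```python
# Minimal sketch: buffer annotated polylines by 3 m, dissolve, and rasterize
# to a binary mask on the ortho-mosaic grid. File names are placeholders.
import geopandas as gpd
import rasterio
from rasterio import features

lines = gpd.read_file("wheel_ruts.shp")   # annotated polylines (projected CRS)
buffered = lines.buffer(3.0)              # 3 m buffer around each line
dissolved = buffered.unary_union          # merge overlapping buffers

with rasterio.open("ortho_rgb.tif") as src:  # ortho-mosaic as grid template
    meta = src.meta
    mask = features.rasterize(
        [(dissolved, 1)],                    # value 1 for tracks
        out_shape=(src.height, src.width),
        transform=src.transform,
        fill=0,                              # value 0 for non-tracks
        dtype="uint8",
    )

meta.update(count=1, dtype="uint8")
with rasterio.open("rut_mask.tif", "w", **meta) as dst:
    dst.write(mask, 1)
```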
The input to the deep learning model (next section) is RGB images in portable network graphics (png) format with a size of 210 × 210 pixels. Therefore, the RGB GeoTIFF raster and the annotated data were split into tiles of 210 × 210 pixels and converted to png format. The metadata containing the geotags of the tiles were stored and later used for mosaicking the predictions. All images were manually checked for quality before being fed into the model, and completely blurred or distorted images were removed. That is, of the total of 1909 images, 12 were removed, leaving 1897 images available for model training. For testing the model, however, all images were used, resulting in a total of 1909 predicted images; this allowed a continuous mosaic to be formed without gaps. Depending on the size of the sites used for training and testing, the amount of training data varied: the smallest site (site A) consisted of 96 images, with 1801 images available for training, and the largest site (site T) consisted of 423 images, with 1474 images available for training (Table 1, Section 3.2).
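The tiling and geotag bookkeeping could look as follows (a minimal sketch with assumed file names; edge tiles smaller than 210 × 210 pixels are skipped here for brevity):

```python
# Minimal tiling sketch: split the RGB GeoTIFF into 210 x 210 px png tiles
# and store each tile's geotransform so predictions can be mosaicked later.
import json
import numpy as np
import rasterio
from rasterio.windows import Window
from PIL import Image

TILE = 210
geotags = {}

with rasterio.open("ortho_rgb.tif") as src:          # placeholder file name
    for row in range(0, src.height - TILE + 1, TILE):
        for col in range(0, src.width - TILE + 1, TILE):
            window = Window(col, row, TILE, TILE)
            rgb = src.read([1, 2, 3], window=window)  # shape (3, 210, 210)
            img = np.moveaxis(rgb, 0, -1)             # shape (210, 210, 3)
            name = f"tile_{row}_{col}.png"
            Image.fromarray(img.astype(np.uint8)).save(name)
            # GDAL-style geotransform of this tile, for later mosaicking
            geotags[name] = list(src.window_transform(window).to_gdal())

with open("geotags.json", "w") as f:
    json.dump(geotags, f)
```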

Semantic segmentation using CNN
For semantic segmentation (Figure 1, step 4), the choice of architecture is generally application-specific, and each architecture has advantages and disadvantages in accuracy, memory consumption, operation counts, inference time, and parameter utilization (Canziani et al. 2016). A preliminary analysis on a subset of this study's data revealed that, among several combinations of network architectures, the combination of ResNet50 and UNet provided the best result, and we thus selected this combination for further analysis. Figure 4 shows the architecture used in this study. The ResNet50 architecture is resilient to overfitting due to its residual learning concept (Yang et al. 2020), in which each layer feeds not only into the next layer but also, via a skip connection, directly into a layer further downstream. The layers are organized into residual blocks to facilitate the training of the network (Ardakani et al. 2020). For decoding the information from ResNet50, the UNet architecture is used. The UNet retains information while upsampling, propagating context from lower- to higher-resolution layers (Alam et al. 2021). Due to the limited amount of data available, pre-trained weights from ImageNet were transferred to the ResNet50 encoder, while the UNet decoder was trained from scratch. Image augmentation (flipping, rotation) was also used to increase the number of input images.
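As an illustration, the encoder-decoder combination can be instantiated in a few lines with the segmentation_models_pytorch library (an assumption; the paper does not state which implementation was used):

```python
# Sketch of the ResNet50 encoder + UNet decoder combination using the
# segmentation_models_pytorch library (an assumed implementation choice).
import segmentation_models_pytorch as smp

model = smp.Unet(
    encoder_name="resnet50",      # 50-layer residual encoder
    encoder_weights="imagenet",   # transfer learning for the encoder only
    in_channels=3,                # RGB input tiles
    classes=1,                    # binary output: rut vs. non-rut
)
```

Only the encoder weights are pre-trained here, matching the transfer-learning setup described above; the decoder starts from random initialization.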
ResNet50 is a deep network with 50 layers, including batch normalization layers, which normalize the nodes before they are passed into the following activation function. The architecture uses skip connections to pass information between layers. The convolutional layers perform matrix multiplications over the images using a filter of size 3 × 3 with stride = 2 (Figure 4).
The activation function (A) is used to introduce nonlinearity into the input images, which makes the model more expressive and sensitive to minute features. The Rectified Linear Unit (ReLU) activation function used in this study is one of the most widely used activation functions due to its high computational effectiveness and speed (Lu et al. 2017, Bircanoğlu & Arıca 2018). ReLU removes all negative parts from the input (image f), as described in Equation (1):

A(f) = max(0, f)  (1)

The model was trained for 20 epochs; an increase in the number of epochs did not markedly improve the precision (see Section 3).
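A hedged training sketch, continuing the model definition above, is shown below. The loss function, optimizer, learning rate, and batch size are assumptions, as the paper only reports the number of epochs; random tensors stand in for the real tiles, padded from 210 to 224 pixels so the dimensions are divisible by the encoder's downsampling factor of 32 (the paper does not describe how this was handled):

```python
# Hedged training sketch for the smp.Unet model defined above. Loss,
# optimizer, learning rate, and batch size are assumptions; stand-in
# tensors (224 x 224, a padded tile size) make the snippet self-contained.
import torch
from torch.utils.data import DataLoader, TensorDataset

images = torch.rand(16, 3, 224, 224)                     # stand-in RGB tiles
masks = torch.randint(0, 2, (16, 1, 224, 224)).float()   # stand-in binary masks
loader = DataLoader(TensorDataset(images, masks), batch_size=8, shuffle=True)

loss_fn = torch.nn.BCEWithLogitsLoss()                   # binary rut/non-rut target
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

model.train()
for epoch in range(20):                                  # 20 epochs, as in the study
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)                      # per-pixel score map
        loss.backward()
        optimizer.step()
```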

Post-processing
The predictions were made for all images from all sites, including the images discarded from model training, such that an ortho-mosaic could be calculated for the entire area. The predictions were further enhanced to locate the wheel ruts using multiple morphological operations. In the post-processing, noise elements were removed without distorting the original results. First, a binary opening was performed. In mathematical morphology, opening is defined as an erosion followed by a dilation using the same structuring element (SE) or kernel on the image f, as shown in Equation (2):

F = (f ⊖ SE1) ⊕ SE1  (2)

where F is the opened binary image, ⊖ denotes erosion, and ⊕ denotes dilation (Serra 1979). SE1 was a circular disk with a radius of 2 m. The aim was to enhance the detected wheel ruts; as a result, all spurious regions smaller than ~5 m² were removed.
Second, a grayscale erosion was performed after the binary opening. This step replaces each pixel with the local minimum within the defined SE (SE2) around the pixel. This was done to delineate the wheel ruts accurately and to remove overestimation along the boundaries of the ruts. The circular disk used for the erosion in this study was set to 20 cm; for example, at a spatial resolution of 1 cm, SE2 was a matrix of size 10 × 10.
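The two morphological steps can be reproduced with scikit-image as follows (a sketch, not the authors' code; the pixel radii assume a 2 cm ground sampling distance, and treating the 20 cm value as the disk radius is an interpretation):

```python
# Post-processing sketch mirroring Equation (2): binary opening with a 2 m
# disk (SE1), then grayscale erosion with a smaller disk (SE2). Pixel radii
# assume a 2 cm ground sampling distance.
import numpy as np
from skimage.morphology import disk, binary_opening, erosion

pred = np.random.rand(500, 500) > 0.5            # stand-in binary prediction mosaic

se1 = disk(100)                                  # SE1: 2 m radius at 2 cm/px
opened = binary_opening(pred, se1)               # erosion followed by dilation

se2 = disk(10)                                   # SE2: 20 cm at 2 cm/px (assumed radius)
refined = erosion(opened.astype(np.uint8), se2)  # grayscale erosion
```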

Validation
We used leave-one-out cross-validation, and in each iteration the following steps were implemented:
(1) Splitting the data into training data (number of sites minus one = 19 sites) and testing data (one site).
(2) Training the model on the training data.
(3) Applying the model to the testing site to classify wheel ruts.
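Schematically, the cross-validation loop can be expressed as follows (train_on and predict_site are hypothetical placeholders for the training and inference routines described above, not the authors' code):

```python
# Schematic leave-one-out cross-validation over the 20 sites (A..T).
# train_on() and predict_site() are hypothetical helper functions.
site_ids = [chr(ord("A") + i) for i in range(20)]            # sites A..T

predictions = {}
for test_site in site_ids:                                   # step (1): hold one site out
    train_sites = [s for s in site_ids if s != test_site]    # 19 training sites
    model = train_on(train_sites)                            # step (2): train
    predictions[test_site] = predict_site(model, test_site)  # step (3): classify ruts
```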

Results and discussion
A comprehensive post-harvest assessment of soil disturbance for compliance with management objectives is a resource-demanding exercise, hardly justifiable under current economic or regulatory conditions. Talbot et al. (2018) showed that manual in-field measurement would require between 10 and 20 transects of 50 m each per hectare (depending on rut prevalence) to estimate rut lengths with a mean error below 10%.
Therefore, the use of UAV-derived data is an attractive option for forest regions applying clear-cutting or shelterwood regimes where the ground is partially or fully unobscured, including most of the managed boreal forest and all forms of plantation forestry. However, none of the UAV-based studies to date has presented possibilities for fully automating the data analysis process. Hence, we performed a study using deep learning to automatically detect wheel ruts in harvested forests. For this, the ResNet50+UNet architecture was used, and the approach was validated on 20 forest sites in Norway. For every site, the training overall accuracy (OA), testing OA, and other results achieved at the end of the 20th epoch are presented in Table 2.
On average, the cross-validation testing OA was 73%, with an F1 score of 0.67 for the wheel ruts. Morphological operations, namely area opening and erosion, were performed on the predicted ortho-mosaics, leading to more refined wheel rut maps.
The post-processed imagery was compared against the original labels. Table 3 gives an overall confusion matrix (containing data from all pixels in all 20 sites), and the confusion matrices for the individual sites are shown in the supplementary data (S.1). Compared to the original CNN predictions, the testing OA increased by an average of 1.2% after post-processing. Due to the area-opening technique, spurious regions were removed, increasing the number of false negatives (FN) for wheel ruts. The next step, erosion of the wheel ruts, made the wheel ruts more defined, increasing the true positives (TP) for the wheel ruts and leading to an increase in testing OA.
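For reference, the reported metrics follow directly from the entries of a binary confusion matrix, as the following sketch with placeholder counts shows:

```python
# Metrics from a binary confusion matrix; counts are placeholders,
# not the paper's pixel totals. Wheel rut is the positive class.
tp, fp, fn, tn = 900, 300, 400, 8400

oa = (tp + tn) / (tp + fp + fn + tn)   # overall accuracy (OA)
ua = tp / (tp + fp)                    # user's accuracy (precision)
pa = tp / (tp + fn)                    # producer's accuracy (recall)
f1 = 2 * ua * pa / (ua + pa)           # F1 score for the rut class
print(round(oa, 2), round(ua, 2), round(pa, 2), round(f1, 2))
```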
The severe wheel ruts were detected best, with an average UA of ~74% (Figure 5). Since the predictions distinguished only wheel rut and non-wheel rut, and severity was not predicted, the producer's accuracy (PA) is not provided. The moderate and light wheel ruts were often confused with the background and were detected with ~67% and ~62% average UA, respectively. The light wheel ruts, representing 63.6% of the annotated tracks, were mostly shallow and spectrally similar to the background, resulting in poorer detection accuracy. This category poses little harm to the soil, and therefore a lower detection probability is not alarming. The severe wheel ruts generally have reduced water drainage functions and considerable soil displacement, making them unique and identifiable. We found that, although the severe class represented only 11.8% of the total length of the annotated tracks, it was detected with the highest accuracy. Even though severe ruts often represent a small portion of the area in a clear cut, they are key in determining the environmental performance of harvest operations for certification purposes. In this sense, our results are encouraging, as they show that the most important class is the one that is best predicted.
The average spatial resolution of the images was ~2 cm, and for sites with finer or coarser resolution, the detection was poorer. For example, only 20% of the wheel ruts were detected on site N, which had a spatial resolution of 7.1 cm. For the sites with a spatial resolution of approximately 2 cm (7 of 20 sites), the wheel ruts were detected with an average accuracy of 71%, whereas for the sites with a spatial resolution of 1 cm (5 of 20 sites) it was 66%. Apart from the actual wheel ruts, noise in the form of residual logs, branches, harvest residues, and shadows can also increase the false positives (FP), leading to an overestimation of wheel ruts. For example, sites I and E had low testing OA (66% and 54%, respectively), mostly due to the noise present on those sites.
Two types of sensors were used to capture the drone images. Although the images from the p4pro and p2GoPro differed in technical specifications (e.g., shutter mechanism, sensor size, focal length) and in their share of the training data (75% p4pro, 15% p2GoPro), there was no notable difference in the accuracy of detecting wheel ruts between the sites from either sensor (p4pro testing OA ~64%, p2GoPro testing OA ~73%). The proposed model was thus robust and applicable across the different sensors regardless of their different RGB colour profiles (S.3) and sensor specifications.
To check whether there was any bias due to the imbalance of data between cameras, a separate model was trained only on images from the p4pro. Figure 7 depicts the accuracy metrics for the CNN model trained using only p4pro images and tested on site P. The testing site was chosen randomly, and the total number of training images used was ~900. This test was run for 50 epochs to examine the effect of increasing the number of epochs on the testing OA (Figure 7).
At epoch 20, compared to site P trained using all data, site P trained using only p4pro data had a slightly better testing accuracy but a lower wheel rut F1 score (Table 2). At the end of the 20th and 50th epochs, the training accuracy was 99.3% and 99.8%, respectively. This process took ten days to run. Compared to the 20th epoch, there was a 2% rise in training accuracy and an increase of 0.04 in the F1 score for predicting wheel ruts. This means that increasing the number of epochs yields only a slight improvement in the model's performance. However, given the time constraint, running the model for 20 epochs was deemed acceptable.
Lastly, removing the data from the p2GoPro made no notable difference in the model's performance. This also implies that, owing to the use of transfer learning, the proposed model works well even with a smaller amount of training data. This is particularly helpful when working in a new area with limited images, which is often the case in practical applications.

Conclusions
In this study, the ResNet50+UNet architecture was deployed to identify wheel ruts in drone images of harvested forest sites. The research was conducted on 20 sites, and an average testing OA of 73% was achieved, with the highest testing OA being 80%.
Based on the results obtained, the following conclusions were drawn:
(1) Drone-based photogrammetry and deep learning are valuable techniques for detecting the presence of wheel ruts.
(2) Severe wheel ruts were identified most accurately owing to their prominent appearance, in contrast to light wheel ruts, which blend in with their surroundings.
(3) The proposed model is robust and can be used to identify wheel ruts in imagery from multiple sensors captured at different times.

S.1 Confusion matrices - all sites
The columns in the confusion matrices are the Reference (original) annotated labels, and the rows are the Predicted labels. The values in the confusion matrices are the number of pixels of the ortho-mosaic of that site. The thumbnails depict the actual reference image (training) and the prediction made without using the training image (cross-validated prediction).

S.2 S, M, L, Non-wheel rut - Confusion matrix
The columns in the confusion matrices are the Reference (original) annotated labels, and the rows are the Predicted labels. The values in the confusion matrices are the number of pixels of the ortho-mosaic of that site.