Evaluation of various deformable image registration algorithms for thoracic images

We evaluated the accuracy of one commercially available and three publicly available deformable image registration (DIR) algorithms for thoracic four-dimensional (4D) computed tomography (CT) images. Five patients with esophagus cancer were studied. Datasets of the five patients were provided by DIR-lab (dir-lab.com) and consisted of thoracic 4D CT images and a coordinate list of anatomical landmarks that had been manually identified. Expert landmark correspondence was used for evaluating DIR spatial accuracy. First, the manually measured displacement vector field (mDVF) was obtained from the coordinate list of anatomical landmarks. Then the automatically calculated displacement vector field (aDVF) was calculated by using the following four DIR algorithms: B-spine implemented in Velocity AI (Velocity Medical, Atlanta, GA, USA), free-form deformation (FFD), Horn–Schunk optical flow (OF) and Demons in DIRART of MATLAB software. Registration error is defined as the difference between mDVF and aDVF. The mean 3D registration errors were 2.7 ± 0.8 mm for B-spline, 3.6 ± 1.0 mm for FFD, 2.4 ± 0.9 mm for OF and 2.4 ± 1.2 mm for Demons. The results showed that reasonable accuracy was achieved in B-spline, OF and Demons, and that these algorithms have the potential to be used for 4D dose calculation, automatic image segmentation and 4D CT ventilation imaging in patients with thoracic cancer. However, for all algorithms, the accuracy might be improved by using the optimized parameter setting. Furthermore, for B-spline in Velocity AI, the 3D registration error was small with displacements of less than ∼10 mm, indicating that this software may be useful in this range of displacements.


INTRODUCTION
There are several commercially or publicly available deformable image registration (DIR) algorithms that have been applied to multimodality image fusion, automatic image segmentation, four-dimensional (4D) image-guided radiotherapy and lung functional (ventilation) imaging [1][2][3][4][5][6]. However, there have been limited studies on geometric accuracy, and there has been no study using thoracic images with commercially available DIR algorithms. Recently, fully automatic DIR software (Velocity AI, Velocity Medical Solutions, Atlanta, GA, USA) has become commercially available [7].
It is necessary to verify the accuracy of commercially available automatic DIR software for use in a clinical setting. A number of reference standards have been utilized for validation of DIR software, including synthetically deformed images, high-contrast phantoms and expert-delineated control points [8][9]. While synthetic images and phantoms might provide useful qualitative evaluation of DIR performance characteristics, they lack sufficient realism to provide credible validation of registration spatial accuracy for 4D dose calculation, automatic segmentation and 4D computed tomography (CT) ventilation imaging in patients with thoracic cancer [10]. Therefore, the best reference standard is one derived from actual patient image data. The 4D CT images provided by DIR-lab are images of actual patient data including a large range of reference sample sizes, with equally varying spatial distributions. DIR-lab images have been proposed for characterization of DIR spatial accuracy performance [11].
Thus, in this study, we evaluated the accuracy of one commercially available and three publicly available DIR algorithms using thoracic 4D CT images and anatomical landmark sets.

Commercially available DIR software (Velocity AI)
Velocity AI automatic DIR software was evaluated. B-spline and Demons algorithms were implemented in the software. Demons algorithms were first implemented in Velocity AI (ver. 2.7.0) but in the latest version (ver. 2.8.1) B-spline was only implemented. Therefore, the B-spline algorithm was used in this study. The Velocity's B-spline model was based on Mattes formulation of mutual information with some proprietary information that the vendor cannot divulge. This similarity metric was chosen for its simplicity and efficiency [12][13]. The B-spline model defines the deformation only on a sparse lattice of nodes overlaid on the image, and the displacement at any voxel is obtained by interpolation from the closet lattice nodes. This B-spline model is based on the open source ITK toolkit [13]. The prototype of Velocity's B-spline used the setting of 50 histogram bins and 10 percentage samples for Mattes mutual information metric. The uniform B-spline at one resolution was used and the L-BFGS (limited-memory Broyden-Fletcher-Goldfarb-Shanno) optimizer was used to find the optimal node value, with a maximum of 100 iterations and a maximum of 20 corrections used as termination conditions for the optimization algorithm [14][15]. Information on the parameter setting in the current version of this software, which was used in this study, was not made available by the vendor; the parameter setting cannot be changed by the user. Two DIR strategies (Deformable and Deformable multipass) were implemented in Velocity AI. Deformable is an approach to deform one image in a single stage, the resolution of which is determined by the user. Deformable multipass is an approach to perform DIR sequentially from low resolution to high resolution. After registration has been completed in one image resolution stage, the result is used as the initial condition for the next image registration stage. These resolutions for each stage are automatically determined. Since Deformable multipass is recommended by the vendor for use in a clinical setting and also for the purpose of reducing data complexity during DIR, this approach was used in this study.

Publicly available DIR algorithms (DIRART)
To compare the results of Velocity AI with results obtained by other DIR algorithms, free-form deformation (FFD), Horn-Schunk optical flow (OF) and Demons in DIRART of MATLAB software were used [16][17][18][19]. FFD in DIRART is fast free-form deformation, which minimalizes an energy functional that combines both similarity and smoothness measures. By using calculus of variations, the minimization problem was represented as a set of non-linear elliptic partial differential equations (PDEs) [16]. A Gauss-Seidel finite difference scheme was used to iteratively solve the PDE [14]. OF and Demons are similar algorithms for intensity-based registration with the sum of square of intensity differences metric, and thus would not normally be considered good choices for multimodality image registration [17][18]. In this study, modified Demons included in DIRART was used. This algorithm is modified in the way that the gradient of the moving image is used of the gradient of the fixed image. In DIRART, multigrid and multiple pass approaches are used. Multigrid is almost equal to Deformable multipass in Velocity AI. The default setting in this software is a fourstage setting and was used in this study. The iteration numbers for each stage were 30 (Stage 1), 30 (Stage 2), 30 (Stage 3) and 40 (Stage 4). Stage 4 was the highest resolution. A multiple pass approach made it possible to perform registration in multiple passes, and for each pass, the registration was computed with a small number of iterations. Multiple passes with a small number of iterations for each pass would generate better results than one pass with a larger number of iterations. The numbers of passes for Stage 1, Stage 2, Stage 3 and Stage 4 were 2, 3, 4 and 5, respectively. The stop condition for iteration was set to 0.002. Iteration will stop if adjustment of the motion field is less than this condition. The stop condition for the multiple passing was set to 0.001. Pass will stop if the adjustment of motion field is less than this condition. These values were in units of voxel in 3D. After each pass the deformation vector field computed by the pass could be smoothed by a Gaussian low-pass fitter. The sigma of the Gaussian low-pass filter was set to 0.5 in voxel size. If the value was 1, then no smoothing was performed. The α 2 parameter in the OF and the Gaussian low-pass filter window size in Demons were designed to control the smoothing operation in the iteration. The two values of α 2 and the Gaussian low-pass filter window size were set to 0.2 and 3 in voxel size, respectively. To ensure variation in DIR spatial accuracy performance, no attempts were made to optimize individual case registrations.

CT data and landmark point sets
Five patients with esophagus cancer were studied. Datasets of the five patients were provided by DIR-lab (www. DIR-lab.com) and consisted of thoracic 4D CT images acquired as part of the standard treatment planning process at 2.5-mm slice spacing with a General Electric Discovery ST PET/CT scanner (GE Medical Systems, Waukesha, WI) and a coordinate list of anatomical landmarks that had been manually identified and registered by an expert in thoracic imaging [11]. Extreme inhale and exhale phases of the 4D CT image sets were utilized in this study. Table 1 summarizes the CT image and reference landmark characteristics. Each image was cropped to include the entire rib cage and content subsampled 256 × 256 voxels. Final in-plane voxel dimensions ranged from 0.97 × 0.97 to 1.16 × 1.16 mm. For all cases, the final image slice thickness was 2.5 mm. Regarding the landmark point sets, the original landmark points used in the paper by Castillo contained more than 1000 landmarks. The data of DIR-lab were 300 landmarks from original landmarks [20]. Expert landmark correspondence was used for evaluating DIR spatial accuracy. Figure 1 shows inhale and exhale images with the 300 landmark pairs.

Evaluation of DIR
The goal of DIR is to find a point-to-point correspondence from one image set to another. First, the manually measured displacement vector field (mDVF) was calculated by using the coordinate list. Then the automatically calculated displacement vector field (aDVF) was calculated by using DIR software. Registration errors between mDVF and aDVF were calculated. Mean registration errors were also determined over the combined set of landmarks for all cases. Additionally, 3D registration error was quantified via the distance between the target and the source landmark.

Statistical methods
The registration errors were quantified as the differences between the calculated and manually measured landmark displacements.
Tukey's honestly significant difference test was used to compare the mean registration errors between the two DIR algorithms selected from four DIR algorithms and across all five cases. All tests were two-sided with P-values < 0.05 considered significant. Statistical analysis was performed with JMP version 9.0.2 (SAS Institute, Cary, NC).     15)), and the error for B-spline was greater than that for OF and Demons. Figure 2 shows the difference images for registration of inhalation to exhalation 4D CT volumes for Case 1 and Case 5. For both cases, there are a few large differences in images by B-spline and FFD but only a few differences in images by OF and Demons. Figure 3 shows the results of 3D registration error versus landmark displacement magnitude over the set of reference point pairs for the four algorithms. Displacement magnitudes were binned into 2.5-mm increments.  78) for Demons, respectively. The figure shows that the behavior of OF and Demons 3D registration errors over the range of displacement magnitudes was more consistent than that of B-spline and FFD.

DISCUSSION
Expert-determined sets of anatomical landmark feature pairs have become a common utility for evaluating DIR spatial accuracy, particularly in the context of clinically acquired thoracic images [21][22][23]. However, variability among reference  datasets, particularly with regard to both the quantity and spatial uniformity of selected landmark features, can potentially yield numerical results that are misrepresentations of the true spatial accuracy, from which erroneous conclusions can be drawn with regard to the relative performance characteristics of multiple algorithms or implementations. Thus, in this study, we evaluated DIR spatial accuracy using expertdetermined landmark features that are publically available on the Internet at the DIR-lab website (http://www.dir-lab.com). This dataset consists of 4D CT image sets and a large number of corresponding manually identified landmark points (300 pairs/case) between maximum phases.
Quantitative evaluations for the five cases are summarized in Table 2. Brock et al. assessed the accuracy of DIR for 12 groups for lung 4D CT and showed that all algorithms performed well for lung 4D CT, with mean absolute errors of ≤ 2.5 mm (slice thickness of the image set) in each direction [24].
The mean registration error in each direction in all cases, except for FFD (mean registration error in SI direction = 2.84), was <2.5 mm, which was the slice thickness of CT images. Gu et al. obtained the 3D registration error using the same five sets of thoracic images with various Demons algorithms and showed that the mean 3D registration error (standard deviation) for all cases was 1.57 mm (1.54 mm) [25]. Castillo et al. reported that the mean registration error for Cases 1, 2 and 5 by OF was 1.68 mm, and their data are comparable to our data calculated by 3D registration errors for Case 1, Case 2 and Case 5 in Table 2. (1.91 mm) [21]. Detailed analysis of Case 4 and Case 5 showed that the 3D registration errors for all algorithms were greater than those for other cases. The reason for this was that Case 4 and Case 5 had a higher percentage of large displacement magnitude of landmarks than did other cases, as shown in Fig. 4. That is, as shown in Fig. 3, a large displacement of landmarks gave rise to a large registration error. In terms of reasonable comparison of the algorithms, the present study has several limitations. The DIR parameters in Velocity AI cannot be changed by the user, whereas those in DIRART can be changed by the user. On the other hand, Velocity AI is commercial software for clinical use, and its setting of parameters can be determined to achieve both speed and accuracy. Thus, the different parameter settings for each software may have caused a difference in the accuracy of DIR. Moreover, Demons and OF are monomodel registrations, considering images from one modality (i.e. CT-CT). B-spline with a similarity metric of mutual information is a multimodel registration, considering images from multiple modalities (i.e. CT-positron emission tomography). It was reasonable to use the sum of squared differences as a similarity metric in B-spline for comparison of Demons or OF with B-spline. Furthermore, results in this study may or may not indicate the maximum performance of each DIR algorithm. We should optimize the parameters for each DIR algorithm, if we know the maximum performance of each DIR algorithm. However, it is difficult and it takes a long time to determine the best parameters for each algorithm. Further research is thus needed to clarify this issue.
The fundamental framework for image registration generally requires four steps: namely, it requires a similarity metric (such as Mutual information), an interpolator (which defines how voxels get sampled during the registration process), a transformation (which specifies how a volume can change during the various steps in the optimization process, such as rigidly, affinely, deformably), and lastly the optimizer. The optimizer strives to find the best possible solution that registers the two volumes by marching over a small subset of the solution space. It does this by comparing the answers given by the similarity metric for the evaluated transformed spaces. Kashani et al. showed that different implementations, different users, and different parameter settings for the same type of registration can result in different accuracies, suggesting the need for careful assessment of each implementation as well as standards for user-defined parameters or automation of the registration process [26].
Next, as for the results of 3D registration error versus landmark displacement magnitude, the behavior of OF and Demons 3D registration errors over the range of displacement magnitudes was relatively consistent. As mentioned before, these two algorithms are monomodel registration algorithms. OF and Demons may have high DIR accuracy, even if displacement magnitudes are large. In B-Spline in Velocity AI, we tested only Deformable multipass. We may be able to obtain a higher level of accuracy by using Deformable several times with optimized parameters or by using Deformable with manually set high resolution, because the resolution of Deformable can be set by the user and optimized for each patient. Whether performing Deformable several times with optimized parameters manually improves DIR accuracy remains unclear. Further research is thus needed to clarify this issue. In B-spline in Velocity AI, the registration error was small with displacements of less than~10 mm. This result indicated that the parameter setting in Velocity AI might be optimized to obtain a high level of accuracy in the range of displacements < 10 mm, which is the range of displacements in usual clinical situations. However, Seppenwoolde et al. reported that the average amplitude of the tumor was greatest (12 mm) in the cranial-caudal direction for tumors situated in lower lobes, and that maximum amplitude of the tumor was 24.6 mm in the cranial-caudal direction [27]. That is, the registration error may be larger in lower lobes than in middle or upper lobes when DIR is performed for the whole lung to calculate 4D radiation doses [28][29]. The grid resolution of B-spline has a significant effect on registration error. A denser grid would result in improved registration accuracy at the cost of increased computational expense. Thus, if higher grid density can be used, the registration accuracy of B-spline may be better.

CONCLUSION
We evaluated DIR algorithms including the DIR algorithm implemented in the commercial auto DIR system using patient CT images with a large number of anatomical landmark sets. The results showed that reasonable accuracy was achieved in B-spline, OF and Demons, and that these algorithms have the potential to be used in a clinical setting. However, for all algorithms, the accuracy might be improved by using the optimized parameter setting. Furthermore, for B-spline in Velocity AI, the registration error was small with displacements of less than~10 mm, indicating that this software may be useful in this range of displacements.