Effect of synthetic CT on dose-derived toxicity predictors for MR-only prostate radiotherapy

Abstract Objectives Toxicity-driven adaptive radiotherapy (RT) is enhanced by the superior soft tissue contrast of magnetic resonance (MR) imaging compared with conventional computed tomography (CT). However, in an MR-only RT pathway synthetic CTs (sCT) are required for dose calculation. This study evaluates 3 sCT approaches for accurate rectal toxicity prediction in prostate RT. Methods Thirty-six patients had MR (T2-weighted acquisition optimized for anatomical delineation, and T1-Dixon) with same day standard-of-care planning CT for prostate RT. Multiple sCT were created per patient using bulk density (BD), tissue stratification (TS, from T1-Dixon) and deep-learning (DL) artificial intelligence (AI) (from T2-weighted) approaches for dose distribution calculation and creation of rectal dose volume histograms (DVH) and dose surface maps (DSM) to assess grade-2 (G2) rectal bleeding risk. Results Maximum absolute errors using sCT for DVH-based G2 rectal bleeding risk (risk range 1.6% to 6.1%) were 0.6% (BD), 0.3% (TS) and 0.1% (DL). DSM-derived risk prediction errors followed a similar pattern. DL sCT has voxel-wise density generated from T2-weighted MR and improved accuracy for both risk-prediction methods. Conclusions DL improves dosimetric and predicted risk calculation accuracy. Both TS and DL methods are clinically suitable for sCT generation in toxicity-guided RT, however, DL offers increased accuracy and offers efficiencies by removing the need for T1-Dixon MR. Advances in knowledge This study demonstrates novel insights regarding the effect of sCT on predictive toxicity metrics, demonstrating clear accuracy improvement with increased sCT resolution. Accuracy of toxicity calculation in MR-only RT should be assessed for all treatment sites where dose to critical structures will guide adaptive-RT strategies. Clinical trial registration number Patient data were taken from an ethically approved (UK Health Research Authority) clinical trial run at Guy’s and St Thomas’ NHS Foundation Trust. Study Name: MR-simulation in Radiotherapy for Prostate Cancer. ClinicalTrials.gov Identifier: NCT03238170.


Introduction
Successful radiotherapy (RT) is a balance between irradiating tumour cells adequately for tumour control and limiting dose to healthy tissues.2][3] The advent of MRguided treatment machines 4 promises to further widen the therapeutic window, 5-7 by facilitating daily adaptive treatment regimens [8][9][10][11] with increased accuracy afforded by enhanced soft tissue contrast.
The concept of isotoxic adaptive RT has been proposed, 9,12,13 whereby dose prescription and distribution are optimized at each fraction for TCP by maintaining predicted toxicity at a predetermined level.The clinical benefit of adaptive MR-guided treatment delivery is being validated using NTCP models, 14 and the concept of toxicity-guided adaptive RT has been proposed.For prostate RT in particular, a recent study highlighted the use of toxicity risk models to estimate the benefit of MR-guided adaptive RT 15 and clinical trials including the MIRAGE trial (NCT04384770) will evaluate MR-guidedagainst conventional CT-guided RT in terms of gastrointestinal (GI) and genitourinary (GU) toxicity.
Crucial to the success of isotoxic and toxicity-driven RT protocols is accurate visualization of anatomy and accurate dose calculation.By integrating MR imaging into RT treatment machines, MR linear accelerators (MR-linacs) 4 allow MR imaging at the time of treatment delivery and offer improved anatomical visualization over CT for some tissues.
However, in a truly MR-only environment, accurate dose calculation is dependent on correct representation of electron density through the use of synthetic CTs (sCT). 16Methods of sCT generation must be accurate enough to maintain dose calculation accuracy to critical structures for decision-making using accurate toxicity estimation.
This first evaluation of the effect of sCT methodology on toxicity risk prediction concentrates on rectal toxicity from prostate RT.In the United Kingdom, 10% of patients receiving prostate RT experience gastrointestinal toxicity within the first 2 years post RT. 17 The most established method of dose-derived rectal toxicity risk estimation uses Lyman-Kutcher-Burman (LKB) models 18 where dose volume histogram (DVH) data from the rectum are used.These models have been used as the basis for prostate escalation trials. 19In recent years risk modelling has been performed by extracting dose surface maps (DSM) from the rectal wall.Buettner et al 20 created and parameterized binary DSMs for the rectum and reported associations with rectal toxicity in patients recruited to the RT01 trial [ISRCTN  47772397].The most statistically significant associations with rectal bleeding were the relative area of the 51 Gy DSM > 0.374 (P ¼ .03),and the relative lateral extent of the 61 Gy DSM > 0.591 (P < .001).
The aim of this work was to evaluate the effect of sCT methodologies on methods of rectal toxicity prediction using an LKB model based on rectal DVH and a model based on rectal DSM.

Patient datasets and scan acquisition
Thirty-six patients were recruited into a local Research and Development and ethical-review-board approved clinical study [NCT03238170] registered on clinicaltrials.gov,giving written informed consent.Patients underwent large field of view (LFOV) diagnostic quality T2-weighted and T1 Dixon MR scans within 3 h of their standard-of-care (SoC) RT planning CT scan for prostate RT.All CT scans were acquired with 2.5 mm slice thickness with in-plane pixel dimensions of 0.98 mm.MR scans were acquired on a 1.5 T Aera scanner (Siemens AG, Munich, Germany).The T2 turbo spin echo (T2tse) MR was optimized to provide image quality suitable for accurate delineation of all prostate target volumes and relevant organs at risk (OAR), with TE ¼ 104 ms, TR ¼ 14710 ms, and voxel size 0.4 mm × 0.4 mm × 3 mm.A T1 Dixon sequence was acquired using Siemens standard protocol.Bladder and rectal preparation were replicated for the MR and patients were scanned in the same treatment position using a combination of immobilization equipment supplied by Q-fix and equipment manufactured in-house.
For the DL approach, the 36 patient data sets were split into 24 patients for training and 12 for testing.Treatment plans were created on their planning CT, delivering 60 Gy in 20 fractions according to results from the CHHiP clinical trial (ISRCTN97182923). 21The volumetric modulated arc therapy (VMAT) treatment plans were calculated in Eclipse (Varian Oncology Ltd, CA, United States) using Acuros algorithm, calculating dose to medium.For each plan, planning target volume (PTV) doses were prescribed such that dose to 95% (D95%) of PTV60 was ≥57 Gy, with median dose ¼ 60 Gy, and PTV48 D95% ≥ 45.6 Gy.The most recent rectal dose constraints 22 were applied and met in all cases, and all OAR tolerances were met.

Synthetic CT datasets
For each test patient, 4 sCTs were generated from the MR data.These are described below.

Bulk density approach (sCT_BDw, sCT_BDp)
Bulk density (BD) sCTs were created by the authors by assigning all pixels within the T2tse body surface to Hounsfield Unit (HU) ¼ 0 to generate sCT_BDw, and assigning all body pixels to HU ¼ 11 (population average HU density over the 24 non-test patients) to generate sCT_BDp.

Tissue stratification [using siemens synthetic CT product] (sCT_TS)
The commercially available Siemens CE-marked syngo.viaVB30 (Siemens AG, Munich, Germany) application for tissue-stratified synthetic CT generation 23 was used to create sCT_TS from the T1 Dixon images, resulting in a dataset with 4 internal tissue classes (fat, muscle, spongy bone, compact bone) and inside air (e.g.rectal gas).Non-rigid multimodality mutual-information registration to the planning CT was performed in MIM (MIM Software Inc, OH, United States) to mitigate against the anatomical variation arising from the time delay between CT and MR and the process of repeat patient setup on the MR.
To assess the effect on calculated dose from residual anatomical differences after the non-rigid-registration, we created an extra dataset whereby the ground truth planning CT (rather than the T1 Dixon MRI) was stratified into tissue types to mimic the Siemens tissue stratification (TS) method.This dataset is referred to throughout as sCT_TS(CT).Here, the 4 internal tissue classes (fat, muscle, spongy bone, and compact bone) and air inside the rectum were auto-segmented using a threshold method 24 on the planning CT and assigned HU according to nominal Siemens sCT_TS values (fat ¼ −75HU, muscle/ bladder ¼ 0HU, spongy bone ¼ 204HU, cortical bone ¼ 1067HU, inside air ¼ −500HU).Calcifications and implanted fiducial markers within the clinical target volume (CTV) were assigned as muscle.Areas thresholded as fat or muscle within the spongy bone structures were assigned spongy bone.

Deep learning artificial intelligence approach (sCT_AI)
A deep learning (DL) network was trained, by the authors, to generate the fourth sCT: sCT_AI.The 24 non-test patients formed the training dataset and were used to train a convolutional neural network (CNN) to generate sCT from the T2tse MR data.Data preparation for the CNN training and test data was performed within MIM whereby a multi-modality mutual-information non-rigid image registration algorithm deformed the LFOV T2tse MR scans to the patient's corresponding planning CT scan.The deformed MR data was resampled to CT-resolution in all planes.MR pixel data was normalized from 0 to 1, relative to the near maximum (99.995 percentile) value in the MR.Normalization of the planning CT applied a cut-off HU of 2000, and subsequent normalization using eq (1).
The 2D MR slices now represent the "image" data, with the corresponding 2D CT slices representing the "labels."Network design and hyper-parameter tuning of number of network levels, kernel size, dropout rate, and number of

Assessment of sCT dose calculations on rectal toxicity risk prediction
The treatment plans originally optimized and calculated on the planning CT were transferred to each sCT dataset and recalculated using identical algorithm and monitor units (MU).Resulting dose distributions were compared against the ground truth CT dose distribution using gamma evaluation 29 in Verisoft v6.1 (PTW, Freiburg, Germany) software.LKB model parameters (TD 50 ¼ 97.7 Gy, m ¼ 0.27, n ¼ 0.085, a/b ¼ 3 Gy) were used to predict risk of grade 2 rectal bleeding and late faecal incontinence (LFI) (TD 50 ¼ 105.0 Gy, m ¼ 0.43, n ¼ 1.0, a/ b ¼ 3 Gy) from DVH data as adopted in the methodology of Onjukka et al. 19 The LKB methodology is described in the literature 30 and has been incorporated in commercially available software which allows use of DVHs from the calculated treatment plans for LKB prediction of toxicity risk. 31Risk predictions from each sCT were compared against those calculated from the ground truth planning CT.
DSMs were extracted for each of the dose distributions following the methodology outlined by Buettner et al. 20 On each axial slice, the posterior-most point of the rectal contour was located and, moving clockwise around the contour, the dose to each pixel on the rectal surface input into an array.This was repeated for each axial slice of the rectum.The Buettner paper proceeded to interpolate this array to a 21 × 21 array to compare DSMs between patients.Buettner et al worked with 5 mm CT slice separation, whereas data from this study has 2.5 mm separation, therefore, interpolation within this work is to a 42 × 42 array, with 42 elements in the cranio-caudal direction and 42 in the circumferential direction to maintain a square DSM.The DSMs were converted to equivalent dose in 2 Gray fractions (EQD2) using an alpha-beta ratio of 3. From each DSM, binary maps were created for 51 and 61 Gy.The area of the 51 Gy region relative to the full DSM area was found.An ellipse was fitted to the largest cluster of voxels within each 61 Gy binary DSM.Ellipse parameters were exported and length of principal lateral axis of the ellipse recorded.All DSM parameters from sCT dose calculations were evaluated against those from the ground truth planning CT.

Synthetic CT datasets
Four sCT datasets were generated for each patient using the 3 strategies of BD, TS, and DL.The increasing level of HU intensity resolution is illustrated in Figure 2. The sCT_AI datasets have voxel-wise HU resolution that delivers increased accuracy and detail in the bony anatomy, at the prostate/ bladder interface and within the rectum compared to sCT_TS.
Training of the CNN on 24 patients reached a plateau after 100 epochs and took 5 h.At inference, generation of sCT_AI for each test dataset took less than 3 s.The mean error (ME) and mean absolute error (MAE) in HU within the body volume over all patients was −2.8 ± 2.4HU (one SD) and 36.8 ± 2.4HU respectively (Table 1).Mean SD of HU error over the body volume was 87.7 ± 7.1HU.

LKB-derived late faecal incontinence risk
Ground truth predicted late faecal incontinence risk ranged between 2.1% and 4.0%, with median risk 3.2%.The ground truth range, alongside a box plot of absolute errors introduced for each sCT method, is shown in Figure S1.Despite each BD and TS sCT resulting in statistically significant errors in risk calculation compared to ground truth BJR|Open, 2024, Volume 6, Issue 1 (Table S1), the maximum error in LFI risk prediction for all sCT methodologies was a clinically insignificant 0.1%.sCT_AI resulted in accurate LFI risk calculation in all cases.

LKB-derived G2 rectal bleeding risk
Ground truth predicted G2 rectal bleeding risk ranged between 1.6% and 6.1%, with median risk 4.0%.Figure 3 shows this ground truth range alongside a box plot of absolute errors introduced for each sCT method.
Dose distributions calculated on sCT_BDw gave the largest deviations to DVH-based LKB-derived risk, with lower and upper 95% agreement limits of 0.04% and 0.71% respectively (Table 2).sCT_BDp addressed some of this bias by assigning a more appropriate population-based uniform density.sCT_TS reduced the mean error and 95% agreement limits further, albeit still with statistically significant errors compared with the gold standard.Simulating the TS approach on the planning CT (sCT_TS(CT)) highlighted the effect of unresolved registration errors with the T1-Dixon, bringing the mean error down by 0.13%.sCT_AI dose calculations delivered statistically similar LKB risk predictions to dose calculation on the gold standard CT, with maximum absolute errors of 0.1% and 95% agreement limits of ±0.12%.

DSM-derived G2 rectal bleeding risk
Figure 4 demonstrates the DSM methodology for a representative patient, highlighting how the rectal surface dose distribution (4a) maps to a rectal DSM (4b) and displays the resulting binary maps (4c and 4d).
Ground truth relative 51 Gy area of DSM ranged between 6.6% and 25.5% with median 17.0%.All ground truth plans resulted in values less than the 37.4% tolerance.Figure 5 shows this ground truth range alongside a box plot of absolute errors introduced for each sCT method.
Ground truth relative lateral extent of 61 Gy in the DSMs ranged between 9.1% and 52.5% with median 34.3%.All ground truth plans resulted in values less than the 59.1% tolerance.Figure 6 shows this ground truth range alongside a box plot of absolute errors introduced for each sCT method.

Discussion
sCT accuracy for prostate RT planning in an MR-only pathway has been reported, 16 with many authors establishing deep learning methods for voxel-wise sCT generation from MR. 32 Analysis of errors has focussed primarily on treatment planning parameters, reporting errors in target and OAR doses.Our work evaluates the effect of sCT inaccuracies on conventional treatment planning and importantly on the risk prediction of normal tissue toxicity, in this case G2 rectal bleeding.As RT strategies move to toxicity-driven adaptive regimes, it is important to assess the effect of sCT on the toxicity predictions that direct them.Our DL model generated sCTs with mean HU error (−3 ± 2HU) comparable to Maspero et al 33 who reported a mean error of 1 ± 6HU using a similar 2D technique.The mean absolute error for our 12 test patients of 36.8 ± 2.4HU also compares favourably with similar published machine-learning methods, with MAE of 43.3 ± 2.9 HU reported by Bragman et al, 34 40.5 ± 5.4HU by Jie Fu et al, 35 65 ± 10HU by Maspero et al, 33 and MAE of 42.4HU in a recent literature review of DL sCT approaches. 32Even greater accuracy and robustness may be achieved by increasing the number of training cases, including larger variation in patient size and internal anatomy characteristics, performing data augmentation during training and utilization of more recent CNN architectures.Model performance may also improve with the use of alternative deformable registration algorithms that resolve more of the anatomical differences between CT and MR in the training dataset.Model performance stated above is reliant on high image quality CT images acquired for RT planning, and only for patients scanned head-first supine and on a flat couch-top.Acceptable accuracy for radiotherapy treatment planning is often considered to be within 2%, 36 and thus the high gamma pass rates assessed at 2%/2 mm indicate that all sCT methods performed well globally.Global assessment of HU and dose distribution as described above do not, however, assess local errors and the effect on rectal dose calculations and rectal toxicity predictions in particular.In terms of rectal dose parameters used for treatment planning, sCT_BDw produced a noticeable bias in the dosimetric results and in accordance with the study by Kim et al 37 this bias was reduced when using a uniform density derived from the patient population (sCT_BDp).Errors in PTV and rectal dose constraints agreed closely with the 1% dose uncertainties quoted for pelvic regions in a recent literature review of recent DL sCT approaches 32 and 1% uncertainties cited in a literature review of atlas and non-DL voxel-based approaches. 16n relation to rectal toxicity risk prediction, the DVHbased LKB results demonstrated higher inaccuracies for LRB toxicities than LFI toxicities, indicating that sCT methodology primarily affects the higher doses in the rectal DVH rather than the whole DVH.The voxel-based sCT_AI had the lowest deviation and lowest bias of all methodologies for rectal toxicity risk prediction calculated from rectal DVH,  and errors were statistically insignificant when compared with ground truth calculations performed on gold standard planning CT.Bulk density approaches and the TS approach showed statistically significant differences in both DVH-and DSM-derived toxicity metrics.With DSM-based toxicity predictions sCT_AI had superior accuracy compared with the BD and TS techniques, but in this test cohort all sCT dose distributions gave DSM metrics that maintained their position below the cut-off threshold.Despite larger, and statistically significant, errors in rectal toxicity risk calculation for sCT_TS compared to sCT_AI, the level of error for both could be considered clinically appropriate to accurately resolve G2 LRB risk using these risk models.There are numerous other factors that contribute to dose-derived toxicity prediction (such as the accuracy of NTCP models, anatomical delineation and treatment delivery), but it is encouraging that TS-and DL-based approaches both maintain sufficient accuracy for sCT generation in the treatment planning component of the MR-only RT pathway.Dose calculations performed on sCT_TS and sCT_AI include inherent uncertainty arising from the time between CT and MR acquisition and repeat patient preparation and positioning.Despite best efforts to replicate patient setup and bowel/bladder preparation for the MR scan, changes in internal anatomy could be seen in a few patients which were not completely resolved using deformable image registration.The DL method was robust to these small unresolved registration uncertainties, giving maximum error of 0.1% in DVH-based toxicity.The creation of sCT_TS(CT) allowed inaccuracies originating from residual internal anatomy differences post deformable registration to be quantified for the TS method.The mean error was approximately halved for all toxicity

6
BJR|Open, 2024, Volume 6, Issue 1 metrics analysed and whilst the errors still remained statistically significant for this idealized TS sCT they would be clinically insignificant for RT purposes.Despite the limitation of a small test dataset, the study size was sufficient to show clear difference in accuracy between sCT methods.For investigating the use of sCT in non-pelvic sites, where tissue inhomogeneity plays a larger role, larger test datasets would be warranted.Another limitation is the use of DSM tolerances based on conformal treatment planning in the RT01 clinical trial rather than current IMRT planning.IMRT planning within our study meant that all rectal DSM parameters analysed were well within tolerance.However, we detected only  very minor DSM differences with the commercial TS and DL approaches, with errors unlikely to affect any future DSM tolerances published based on IMRT delivery.
As well as the increased accuracy for rectal toxicity risk calculation, the voxel-based DL technique has several additional advantages over a stratification method including increased HU resolution, which may be useful for image verification using digitally reconstructed radiographs or cone-beam CT.The DL process requires only the T2 image, meaning a single MR scan can be acquired and utilized for both delineation of target and healthy tissues as well as dose calculation via accurate sCT.There is no requirement for additional T1 Dixon sequence as for the TS process.This provides a crucial time saving in terms of patient throughput.

Conclusion
Within the RT community there is an increased interest in isotoxic or toxicity-guided regimes, where an assessment of the dose-derived toxicity risk for the patient is used to adapt RT prescription or delivery.It is important to maintain accuracy of toxicity risk estimation when moving to MR-only RT and this work evaluated this with a variety of sCT methodologies.Bulk density approaches introduced significant error in both DVHand DSM-derived toxicity risk estimation.The more detailed TS and deep learning sCT approaches reduced errors to a clinically appropriate level.It was noticeable that despite the DL voxel-based approach delivering increased HU resolution it did not, in the pelvic region at least, offer clinically significant improvements in DSM-or DVH-based toxicity estimation beyond a tissue-stratification approach.The DL approach was, however, beneficial in other ways.The DL approach required only the T2-weighted sequence for target and OAR delineation, meaning all the treatment planning processes could be carried out using a single MR acquisition, rather than requiring an additional T1 Dixon scan for TS sCT approaches.The speed of DL GPU processing was encouraging, with sCT generated from MR data in seconds.
The assessment of sCT-derived errors should be expanded to other treatment sites where dose to critical structures could guide dose escalation and adaptive radiotherapy strategies.Examples include advanced non-small-cell lung cancer isotoxic escalation, head and neck (H&N) isotoxic escalation and abdominal adaptive stereotactic treatments.sCT accuracy in H&N, thorax, and abdominal regions are less well developed and will be prone to more errors on account of increased tissue heterogeneity.An understanding of the errors introduced into toxicity prediction is necessary before proceeding with MR-only toxicity-guided RT.
epochs was performed using 12 of the 24 training patients in a leave one-out cross-validation approach, with 11/12 patients comprising the training set for each fold.Simple data augmentation (translation, flipping, and rotations) provided no improvement in accuracy at this stage.The final architecture consisted of a 2D U-Net 25 of 32 layers within 4 levels, utilizing dropout regularization rate of 0.2, rectified linear unit (ReLU) activation functions and linear final activation function.The whole batch of training data was input per epoch using the Adam optimizer,26 with learning rate 1e-4, mean squared error as loss function, and trained for 100 epochs.The U-Net architecture (Figure1) was written and refined in Python, within the Keras environment 27 using Tensorflow 28 as the backend.Processing was performed on a 12 Gb Titan Xp graphics processing unit (GPU).Training input data consisted of 2 (512 × 512 × 1671) arrays: one containing 1671 MR slices from the 24 patients and the other containing the corresponding CT slices as training labels.

Figure 3 .
Figure 3. Left: Range of ground truth G2 LRB risk predictions for the 12 test patients, Right: Error in risk prediction due to sCT method.Boxes represent the interquartile range (IQR).Whiskers represent the largest value within 1.5 times the IQR above and below the box.Mean value is shown as a cross.Statistical significance (paired t-test, P < .05)indicated by asterisk, after normal distribution was confirmed by Shapiro-Wilk test.

Figure 4 .
Figure 4. DSM methodology and results for representative patient.(A) Surface dose wash displayed on 3D render of rectum [grey (online version only)] adjacent to clinical target volumes [yellow (online version only)] and bladder [blue (online version only)], (B) EQD2-corrected rectal DSM, (C) binary map of EQD2 dose > 51 Gy, (D) binary map of EQD2 dose > 61 Gy with superimposed ellipse [green (online version only)].

Figure 5 .
Figure 5. Left: test cohort range of 51 Gy relative area on DSM with 37.4% tolerance indicated as thick horizontal line.Right: absolute errors introduced by sCT.Boxes represent the IQR.Whiskers represent the largest value within 1.5 times the IQR above and below the box.Mean (cross) and median (line) are also shown.Statistical significance (paired t-test, P < .05)indicated by asterisk, after normal distribution was confirmed by Shapiro-Wilk test.

Figure 6 .
Figure 6.Left: test cohort range of 61 Gy relative lateral extent on DSM with 59.1% tolerance indicated as thick horizontal line.Right: absolute errors introduced by sCT.Boxes represent the IQR.Whiskers represent the largest value within 1.5 times the IQR above and below the box.Mean (cross) and median (line) are also shown.Statistical significance (paired t-test, P < .05)indicated by asterisk, after normal distribution was confirmed by Shapiro-Wilk test.

Table 1 .
sCT_AI HU analysis accuracy against planning CT scan for 12 test patients.

Table 2 .
Error statistics for G2 LRB risk predictions derived from each sCT.