Abstract

Progress in neuro-oncology is increasingly recognized to be obstructed by the marked heterogeneity—genetic, pathological, and clinical—of brain tumours. If the treatment susceptibilities and outcomes of individual patients differ widely, determined by the interactions of many multimodal characteristics, then large-scale, fully-inclusive, richly phenotyped data—including imaging—will be needed to predict them at the individual level. Such data can realistically be acquired only in the routine clinical stream, where its quality is inevitably degraded by the constraints of real-world clinical care. Although contemporary machine learning could theoretically provide a solution to this task, especially in the domain of imaging, its ability to cope with realistic, incomplete, low-quality data is yet to be determined. In the largest and most comprehensive study of its kind, applying state-of-the-art brain tumour segmentation models to large-scale, multi-site MRI data of 1251 individuals, here we quantify the comparative fidelity of automated segmentations derived from MR data replicating the various levels of completeness observed in real life. We demonstrate that models trained on incomplete data can segment lesions very well, often equivalently to those trained on the full complement of images, exhibiting Dice coefficients of 0.907 (single sequence) to 0.945 (complete set) for whole tumours and 0.701 (single sequence) to 0.891 (complete set) for component tissue types. This finding opens the door both to the application of segmentation models to large-scale historical data, for the purpose of building treatment and outcome predictive models, and to their application in real-world clinical care. We further ascertain that segmentation models can accurately detect enhancing tumour in the absence of contrast-enhanced imaging, quantifying the burden of enhancing tumour with an R2 > 0.97, varying negligibly with lesion morphology. Such models can quantify enhancing tumour without the administration of intravenous contrast, inviting a revision of the notion of tumour enhancement if the same information can be extracted without contrast-enhanced imaging. Our analysis includes validation on a heterogeneous, real-world sample of 50 patients with brain tumour imaging acquired over the last 15 years at our tertiary centre, demonstrating maintained accuracy even on non-isotropic MRI acquisitions and on complex post-operative imaging with tumour recurrence. This work substantially extends the translational opportunity for quantitative analysis to clinical situations where the full complement of sequences is not available and potentially enables the characterization of contrast-enhanced regions where contrast administration is infeasible or undesirable.

Introduction

Progress in neuro-oncology is increasingly recognized to be obstructed by the marked heterogeneity—genetic, pathological, and clinical—of brain tumours. If the treatment susceptibilities and outcomes of individual patients differ widely, determined by the interactions of many multimodal characteristics,1 then large-scale, fully-inclusive, richly phenotyped data—including imaging—will be needed to predict them at the individual level. Such data can realistically be acquired only in the routine clinical stream, where its quality is inevitably degraded by the constraints of real-world clinical care. Although contemporary machine learning could theoretically provide a solution to this task, especially in the domain of imaging, its ability to cope with realistic, incomplete, low-quality data is yet to be determined.

Over the last few decades, lesion segmentation has formed a cornerstone of innovation across the domains of neuro-oncology,2-4 medical imaging,5,6 biomedical engineering,7 and machine and deep learning.8 The ability to segment an anatomical or pathological lesion in 3D confers the ability to evaluate it quantitatively—moving beyond visual qualitative assessment—with greater richness and fidelity than conventional 2D measurements, which have repeatedly been shown to be spurious and inconsistent between radiologists,9-11 and with greater sensitivity to the heterogeneity of the underlying pathological patterns.12 Radiological image segmentation opens a wide array of possibilities for downstream innovation in neuro-oncological healthcare and research, including clinical stratification, outcome prediction, response assessment, treatment allocation, and risk quantification, many of which have already shown great promise. The underlying goal is to enhance the individual fidelity of data-driven decision-making, facilitating better patient-centred care,13-15 a remit especially warranted in neuro-oncology.

The segmentation of brain tumours remains a particularly challenging task owing to the marked heterogeneity of their imaging appearances: spatial distribution, morphology, signal characteristics, and impact on adjacent healthy anatomical structures.16-18 Its difficulty has even inspired an international competition for cutting-edge deep learning groups to create the best segmentation model. Known as the brain tumour segmentation challenge (BraTS), it is attracting increasing attention as well as support from both the Radiological Society of North America and the American Society of Neuroradiology, providing large-scale data with multimodal MRI—fluid-attenuated inversion recovery (FLAIR), T1, T2, and contrast-enhanced T1 (T1CE) sequences—as well as the labelled ground-truths of oedema, non-enhancing, and enhancing tumour.8,19,20

But while benchmark tasks have unquestionably aided the advancement of lesion segmentation—indeed of computer vision generally—they have compelled a research focus on developing uniformly multimodal models trained on sequence-complete acquisition sets, often rare in real-world clinical practice. The causes of incomplete data are legion, but common examples include patient contraindications to contrast, corruption by image artefacts, and image acquisition constraints such as those imposed in pre-operative stealth studies. Taking just one of many possible causes for image degradation, the prevalence of motion artefact has been reported as 7.5% of outpatient and 29.4% of inpatient MRI studies, with an estimated economic impact of $115 000 per scanner, per year.21

The real-world utility of tumour segmentation must lie within the clinical domain, such as for treatment planning and monitoring across neuro-oncology. Yet, the ability to undertake segmentation in these real-world clinical situations, where complete—‘perfect’—data is scarce, remains completely unknown. How well do contemporary segmentation modelling architectures perform when trained on sequence-incomplete data, and what features of the lesion are correctly identifiable under such circumstances?

Here, we aimed to systematically quantify and answer these questions with the largest and most comprehensive study of its kind based on the application of state-of-the-art deep learning tumour segmentation models to large-scale MRI of brain tumours. We hypothesized that the decrement in segmentation performance with the loss of sequences would be modest, rendering good quality segmentations feasible with incomplete data.

Materials and methods

Data

The study was approved by the local ethics committee. We received ethical permission for the consentless analysis of irrevocably anonymized data collected during routine clinical care.

We used the BraTS 2021 challenge data for all model training. This dataset is described in detail by its curators elsewhere.20,22,23 In brief, it includes a large retrospective sample of multi-institutional brain tumour MRI scans, with heterogeneous equipment, protocols, and image quality. The following sequences are included: T1-weighted, T2-weighted, FLAIR, and T1CE, with a pre-processing pipeline consisting of image co-registration, resampling to a 1 mm3 isotropic space, and brain extraction. Lesions were segmented with an ensemble of previous top-ranking BraTS algorithms, with subsequent manual refinement and checking by a panel of board-certified attending neuroradiologists with more than 15 years of clinical experience in neuro-oncology.24 We used the training set of 1251 individuals of the BraTS 2021 challenge data—comprising 5004 separate images—as this group included all ground-truth labels for model cross-validation.

Having trained and evaluated a set of models on the BraTS 2021 challenge data, we sought to separately evaluate their performance on an additional held-out population from our own centre, providing a further robust safeguard of model performance with international, external validation. Specifically, we acquired retrospective imaging for a random sample of 50 individuals who underwent gadolinium-enhanced MRI head studies between 2006 and 2021 for a known glioblastoma as part of their routine clinical care at our centre. Random sampling across years further instilled heterogeneity in our sample: data were acquired on one of 11 possible MRI scanners, of both 1.5 T (n = 5) and 3 T (n = 6) field strengths, from multiple manufacturers, and over a 15-year period. Moreover, of our 50 participants, we chose to include 10 with post-operative imaging and evident tumour recurrence. This choice increased the difficulty of the task, for a model would need to recognize the post-operative resection/surgical bed as separate from the subsequent disease recurrence, as well as capturing the instrumental heterogeneity of different MRI machines distributed in time and place.

Most of our sample did not include volumetric imaging, a reflection of local clinical practice at the time of acquisition. To improve harmonization, we therefore employed super-resolution in the processing pipeline.25,26 The pipeline yielded data in a similar format to the BraTS challenge data:20 1 mm3 isotropic, brain-extracted, multi-sequence data. Lesions were hand-labelled with ITK-SNAP by a neuroradiology fellow with 3 years of experience working with brain tumour imaging, aided by ITK-SNAP's semi-automated segmentation tools, namely random-forest-based classifiers, with subsequent manual refinement.27
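
For orientation only, the target geometry of this harmonization step can be illustrated with plain spline resampling in nibabel. Note this is not the model-based multi-channel total-variation super-resolution actually used in our pipeline,25 and the file paths are hypothetical:

```python
import nibabel as nib
from nibabel.processing import resample_to_output

# Illustrative only: naive spline resampling to the target 1 mm isotropic grid.
# The real pipeline uses multi-channel total-variation super-resolution (ref. 25),
# which pools information across sequences rather than interpolating one image.
img = nib.load("flair_coronal.nii.gz")                 # hypothetical input path
iso = resample_to_output(img, voxel_sizes=(1.0, 1.0, 1.0))
nib.save(iso, "flair_1mm_iso.nii.gz")
```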

Tumour annotations conform to established tissue class labels comprising gadolinium-enhancing tumour, peritumoural oedema/invaded tissue, and non-enhancing tumour/necrotic tumour core.19 The detailed description of these components is beyond the scope of this article and is discussed elsewhere.19,20 In brief, enhancing tumour refers to regions with visible enhancement on a T1CE sequence after gadolinium administration. Non-enhancing tumour/necrotic tumour core refers to the part of the tumour that does not enhance after gadolinium, typically deep to the enhancement, while oedema/invaded tissue refers to the peritumoural oedematous and/or infiltrated brain parenchyma, typified by hyperintensity on T2 and FLAIR sequences.

Algorithm

Our task was not to propose a new architecture superior to those already evidenced by the BraTS 2021 challenge. Rather, we sought to characterize, evaluate, and quantify the variation in model performance with increasingly incomplete data, as a proxy index of translational potential across the variety of clinical situations where complete datasets rarely occur. We chose the nnU-Net ('no new net') self-configuring deep learning biomedical image segmentation architecture,28 which notably won both the medical segmentation decathlon and the 2020 BraTS challenge.29,30 In brief, this segmentation method automatically configures its pre-processing, architecture, training, and post-processing for any given task, and has been shown to be a superior methodology across a range of public datasets and tasks, including brain tumour segmentation.28 Our choice was guided by its excellent performance and its simple, largely automated processing and training cycle, which made development across many models at scale feasible.

Each nnU-Net28 is a self-configuring U-Net,31 incorporating the standard encoder-decoder architecture with skip connections, instance normalization, and leaky rectified linear units. We used the high-resolution 3D architectural formulation in all experiments. The nnU-Net approach employs a polynomially decaying learning rate, initially set to 0.01, with stochastic gradient descent optimization. The loss function is a weighted sum of the Sørensen–Dice coefficient and cross-entropy. Training data are augmented on the fly, including with rotations, scaling, Gaussian noise and blur, brightness and contrast shifting, and gamma correction. Patch and batch size are also self-configured. Model training runs for 1000 epochs, with foreground oversampling to mitigate the impact of class imbalance. We used 5-fold cross-validation for each experiment and its evaluation with the BraTS 2021 challenge data, as well as additional external/international out-of-sample evaluation of models with the data from our own centre as detailed above. A schematic of the model architecture is shown in Supplementary Fig. 1.
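
As a concrete illustration, the two training components most characteristic of this setup, the polynomial learning-rate decay and the combined Dice plus cross-entropy loss, can be sketched minimally in PyTorch. This is an illustrative approximation rather than the exact nnU-Net implementation; the function names and the equal loss weighting are our own assumptions:

```python
import torch
import torch.nn.functional as F

def poly_lr(initial_lr: float, epoch: int, max_epochs: int, exponent: float = 0.9) -> float:
    # Polynomially decaying learning rate (here initial_lr = 0.01, max_epochs = 1000).
    return initial_lr * (1 - epoch / max_epochs) ** exponent

def dice_ce_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # logits: (B, C, X, Y, Z) network output; target: (B, X, Y, Z) integer label map.
    ce = F.cross_entropy(logits, target)
    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 4, 1, 2, 3).float()
    dims = (0, 2, 3, 4)  # sum over batch and spatial dimensions, per class
    intersection = (probs * one_hot).sum(dims)
    denominator = probs.sum(dims) + one_hot.sum(dims)
    soft_dice = ((2 * intersection + eps) / (denominator + eps)).mean()
    return ce + (1 - soft_dice)  # equal weighting assumed for illustration
```

An SGD optimizer with momentum would then have its learning rate set to poly_lr(0.01, epoch, 1000) at the start of each epoch.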

Statistical analysis and performance evaluation

We trained all possible combinations of the MRI sequences T1, T2, FLAIR, and T1CE as separate models. This included all models using only a single sequence, two sequences, three sequences, and finally a complete four-sequence model. We also trained separate models for abnormality detection (i.e. a binary lesion mask to detect and segment the whole tumour) as well as tumour segmentation with the tissue classes of oedema, enhancing and non-enhancing tumour. This approach comprised 30 different models in total.
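
For clarity, the experimental grid can be enumerated programmatically; the following Python sketch (variable names ours) reproduces the count of 30 models:

```python
from itertools import combinations

SEQUENCES = ["T1", "T2", "FLAIR", "T1CE"]

# All non-empty subsets of the four sequences: 4 + 6 + 4 + 1 = 15 input combinations.
combos = [c for r in range(1, 5) for c in combinations(SEQUENCES, r)]

# Two tasks per combination (binary whole-tumour detection, and tissue-class
# segmentation into oedema, enhancing, and non-enhancing tumour) yields 30 models.
models = [(task, combo) for task in ("whole_tumour", "tissue_classes") for combo in combos]
assert len(models) == 30
```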

Performance was principally quantified by the out-of-sample Sørensen–Dice coefficient between ground truth and inferred labels,32,33 in accordance with typical research practice.3,8 This metric quantifies the overlap between the model prediction and the labelled ground-truth. The Sørensen–Dice coefficient, or Dice coefficient, is given as:

$$\mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN}$$

where TP is the number of true positive, FP the number of false positive, and FN the number of false negative voxels.
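
Equivalently, in code, the metric can be computed directly from a pair of binary masks; a minimal sketch (function name ours):

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, truth: np.ndarray) -> float:
    # Sørensen–Dice coefficient between two binary masks of identical shape.
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()    # true positives
    fp = np.logical_and(pred, ~truth).sum()   # false positives
    fn = np.logical_and(~pred, truth).sum()   # false negatives
    if tp + fp + fn == 0:
        return 1.0  # both masks empty: perfect agreement by convention
    return 2 * tp / (2 * tp + fp + fn)
```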

We also quantified overall model accuracy, false discovery rate, false negative rate, false omission rate, false positive rate, negative predictive value, precision, and recall, ensuring a broad range of possible performance metrics.34 All listed metrics were derived for whole tumour and the separate tissue constituents of oedema, enhancing and non-enhancing tumour, including with 95% confidence intervals (CIs), and are provided in detail throughout the Supplementary material (Supplementary Tables 1–4).

We constructed regression models between ground truth tumour volumes and model predictions, reporting the R2. We obtained the acquisition times of each imaging sequence from contemporaneous imaging protocols at our centre, allowing the gain in model performance to be weighed against the time taken to acquire the corresponding data. Lastly, we applied t-distributed stochastic neighbour embedding (t-SNE)35—a nonlinear dimensionality reduction technique—to the contrast-enhancing components of all lesions in the BraTS dataset to create a 2D representation of the lesions, projecting their high-dimensional similarities and differences into a readily surveyable space. We overlaid lesion volume and the Sørensen–Dice coefficient of lesion segmentations to display any variation in these indices with the morphology of the lesion.
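
For the embedding step, a minimal sketch of the kind of call involved is shown below, assuming one flattened feature vector per lesion's enhancing component; the input array is a placeholder and the hyperparameters shown are scikit-learn defaults rather than our exact settings:

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder: one flattened, spatially normalized enhancing-component mask per lesion.
lesion_features = np.random.rand(1251, 4096)

embedding = TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(lesion_features)
# embedding has shape (1251, 2): scatter these coordinates, sizing points by
# enhancing tumour volume and colouring them by Dice coefficient (cf. Fig. 6).
```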

Data and code availability

All trained model weights, source code, and usage documentation are publicly available at https://github.com/high-dimensional/tumour-seg. All BraTS 2021 challenge data are readily available from the challenge website at http://braintumorsegmentation.org.8,20 Original nnU-Net source code is available at https://github.com/MIC-DKFZ/nnUNet.28 Patient imaging data from our external validation site are not available for dissemination under the ethical framework that governs their use.

Compute

All models were trained on an NVIDIA DGX-1 with eight 16 GB Tesla P100 GPUs. With approximately 3.5 days to train a single model, the 30 models amounted to roughly 105 model-days of training, requiring just over 13 days of utilization of all eight cards in parallel.

Results

Incremental performance with sequence addition

All models performed well on whole tumour segmentation qualitatively, despite varying degrees of sequence-completeness, with quantitative performance ranging from a Dice coefficient of 0.907 (95% CI 0.904–0.910) (single sequence) to 0.945 (95% CI 0.943–0.947) (complete sequence set) (Fig. 1). Results for segmentation of the oedema, enhancing, and non-enhancing components were more variable, with Dice coefficients ranging from 0.701 (95% CI 0.689–0.713) [single sequence (FLAIR) segmenting non-enhancing tumour] to 0.891 (95% CI 0.886–0.896) [complete sequence set (T1 + T2 + FLAIR + T1CE) segmenting oedema]. Of note, the poorest performing models typically struggled with segmentation of the non-enhancing tumour component, particularly single-sequence models of T1, T2, or FLAIR, and two- and three-sequence models employing combinations of these (i.e. omitting contrast). There was no evidence of model over-fitting when reviewing the training/validation curves. We provide the full breakdown of Dice coefficients for all models in Fig. 1. Example image segmentations across the range of all models are provided in Figs. 2 and 3, which visually illustrate excellent coverage of the lesion by the models, with relatively little error. We additionally detail model accuracy, false discovery rate, false negative rate, false omission rate, false positive rate, negative predictive value, precision, and recall (all with 95% CIs), for whole tumour and the separate tissue constituents of oedema, enhancing and non-enhancing tumour, all of which is provided within the Supplementary material (Supplementary Tables 1–4).

Figure 1

Performance of all model combinations. (A) Heatmap illustrates the validation Dice coefficient across all models, for both whole tumour and the individual components. Models are partitioned into those which utilized just one sequence, two, three, and finally the complete four-sequence model. A brighter orange/white box depicts a better performing model as per the Dice coefficient. (B) Second heatmap depicts the relative acquisition time (TA) (in minutes) for the sequences used for a given model, with a more green/yellow box illustrating a longer acquisition time. (C) Third heatmap illustrates the performance gain in Dice coefficient per minute of acquisition time. The mathematical derivation of the Dice coefficient is given in the methods. Colour keys are given at the right of the plot.

Figure 2

Example segmentation results. (A) Left upper panel illustrates stacked axial slices of a given lesion for all imaging sequences, with (B) corresponding radiologist-labelled ground-truth in the left lower panel. (C) Right panel illustrates the tumour segmentation predictions across all model formulations, aligned to the number of sequences supplied.

Figure 3

Minimal error in example segmentation results. (A) Left upper panel illustrates stacked axial slices of a given lesion for all imaging sequences, with (B) corresponding radiologist-labelled ground-truth in the left lower panel. (C) Right panel illustrates the tissue-specific error in tumour segmentation predictions across all model formulations, aligned to the number of sequences supplied.

Trade-off between acquisition time and segmentation fidelity

We aligned the acquisition times of all possible combinations of sequences using contemporaneous scanner protocol data at our centre, from which we determined the gain in model fidelity in Dice per scanning minute (Fig. 1). This demonstrated that certain combinations of sequences offered greater gains in segmentation performance than others, offering an insight into the efficiency of data acquisition in this clinical context. For instance, whilst a single volumetric T1CE acquisition (a proxy for a contrast-enhanced MRI stealth study for neurosurgical planning) took 3.1 minutes, achieving a whole tumour Dice coefficient of 0.908 and reasonable performance on individual components (Fig. 1), the addition of FLAIR raised total scanning time to only 4.9 minutes while improving whole tumour Dice to 0.943, just below the best performing model with all four sequences (Dice coefficient 0.945). Similarly, the three-sequence acquisition of FLAIR + T1CE + T2 (i.e. neglecting the pre-contrast T1) achieved Dice coefficients for whole tumour segmentation essentially equivalent to those of the complete four sequences, and reduced scanning time by 33%, from 9.48 to 6.38 minutes. We note, of course, that the omission of a pre-contrast T1 brings its own issues in delineating contrast from, for example, haemorrhage, but this is nonetheless a striking illustration of how models with incomplete data still achieved comparable performance.
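
To make the efficiency metric concrete, the Dice-per-minute gain for the T1CE plus FLAIR example above works out as follows (values taken from Fig. 1):

```python
# Dice gain per extra minute of scanning when adding FLAIR to a T1CE-only study.
t1ce_dice, t1ce_minutes = 0.908, 3.1
with_flair_dice, with_flair_minutes = 0.943, 4.9

gain_per_minute = (with_flair_dice - t1ce_dice) / (with_flair_minutes - t1ce_minutes)
print(f"{gain_per_minute:.3f} Dice per additional minute")  # ~0.019
```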

Segmenting enhancing tumour without contrast-enhanced imaging

Interestingly, we discovered that models without contrast-enhanced imaging could still delineate tumours relatively well (Figs. 4 and 5). Models without contrast imaging segmented whole tumour lesions with Dice coefficients ranging from 0.907 (95% CI 0.904–0.910) (single sequence—T1) to 0.942 (95% CI 0.940–0.945) (three sequences—FLAIR + T1 + T2). Of note, this latter performance was only just shy of the best performing full four-sequence model, with a Dice of 0.945 (95% CI 0.943–0.947). Furthermore, models without the contrast-enhanced T1 sequence could still identify the enhancing tumour component well, with Dice coefficients ranging from 0.756 (95% CI 0.748–0.765) (single sequence—T1) to 0.790 (95% CI 0.782–0.798) (three sequences—FLAIR + T1 + T2) (Figs. 4 and 5). This included the models' ability to identify and segment lesions where the focus of enhancing tumour was less than 7 mm in diameter (Fig. 5). The volume of enhancing tumour was strongly correlated with that of all model predictions, even though contrast-enhanced imaging was not provided. The relationships between actual enhancing tumour volume and the model predictions with the following inputs were as follows: FLAIR alone (R2 0.964); T1 alone (R2 0.953); T2 alone (R2 0.966); FLAIR + T1 (R2 0.973); FLAIR + T2 (R2 0.976); T1 + T2 (R2 0.962); FLAIR + T1 + T2 (R2 0.972). Furthermore, inspection of the t-SNE-derived low-dimensional representation of the lesions did not reveal any clear relation between lesion anatomy and segmentation performance across models lacking contrast-enhanced sequences (Fig. 6), other than the expected relation with lesion size,36 suggesting broad invariance to spatially defined anatomical features.
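
The volumetric agreement above was quantified by linear regression; a minimal sketch of the computation, with placeholder arrays standing in for the per-case ground-truth and predicted enhancing tumour volumes:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholders: per-case enhancing tumour volumes, ground truth vs. prediction
# from a model given non-contrast sequences only.
true_vol = np.random.rand(1251, 1) * 100
pred_vol = 0.98 * true_vol + np.random.randn(1251, 1)

fit = LinearRegression().fit(true_vol, pred_vol)
r2 = fit.score(true_vol, pred_vol)  # coefficient of determination, R2
```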

Figure 4

Segmenting enhancing tumour without contrast. (A) Top panel illustrates axial slices of the lesion across the four sequences. (B) Second panel illustrates the radiologist hand-labelled ground truth for the three tissue classes—of note red depicts enhancing tumour. (C) Third panel illustrates predictions of enhancing tumour segmentation for four models with the following input data: (i) FLAIR alone; (ii) FLAIR and T1; (iii) FLAIR, T1, and T2; and (iv) FLAIR, T1, T2, and T1CE. Of note, only the final model is exposed to contrast-enhanced imaging, although the other three models still reasonably identify the location of the enhancing component. (D) Fourth panel illustrates the component of enhancing tumour that is missed by the model.

Figure 5

Further examples of segmenting enhancing tumour without contrast. (A-C) Left two columns and rows of each panel illustrate the anatomical imaging for three randomly selected cases, whilst the third column of each panel illustrates the hand-labelled ground truth shown with the overlaid T1CE image, and finally the model prediction where contrast imaging was not provided. Of note, the case in (B) comprised a tumour with only a 7 mm diameter enhancing component. (D) The volume of enhancing tumour is strongly correlated with that of all model predictions, even when contrast-enhanced imaging is not provided (quantified by linear regression).

Figure 6

Enhancing tumour segmentation is invariant to lesion morphology. 2D t-SNE embeddings of the enhancing component of all lesions. Each panel illustrates the different set of non-contrast sequences used. Point size is proportional to enhancing tumour volume, whereas colour is proportional to Dice score. Point and colour keys are given at the bottom right of the plot. Note the expected lower scores for smaller segments, but no other obvious systematic variation across the latent space. The mathematical derivation of the Dice coefficient and an overview of t-SNE are given in the methods.

International clinical validation

Whole tumour segmentation

We evaluated the performance of all trained models on an out-of-sample cohort of 50 patients from our own centre in which lesions were hand-labelled, with scans acquired on both 1.5 and 3 T scanners, and with a mixture of pre- and post-operative imaging. The cross-validation performances of all models from the BraTS data reproduced well on our own data, with Dice coefficients for all models significantly correlated (r = 0.97, P < 0.0001) (Fig. 7). This was despite the multiple steps taken to deliberately make the data more heterogeneous and liable to error. As expected, models with single imaging modalities, such as T1 or T2 sequences alone, performed worst, with incremental gains in performance with alternative and supplementary modalities.
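
The correlation reported here is a standard Pearson correlation across the paired per-model Dice coefficients; a minimal sketch of the computation (the values shown are placeholders, not our results):

```python
from scipy.stats import pearsonr

# Placeholders: one Dice coefficient per model (30 in total), computed on BraTS
# cross-validation and on the external 50-patient cohort respectively.
brats_dice = [0.907, 0.921, 0.938, 0.945]
external_dice = [0.902, 0.918, 0.930, 0.941]

r, p = pearsonr(brats_dice, external_dice)  # reported in the text as r = 0.97, P < 0.0001
```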

Figure 7

International validation. (A) Scatterplot illustrates the strong relationship between radiologist-labelled lesions from a disparate international centre and model predictions. Only the relationship between whole tumour hand segmentations and model predictions is shown here, as it transpired that the complete four-sequence model delineated tissue classes more accurately than hand-labelling. (B) Scatterplot illustrates the strong relationship between model performances from the validation set and when re-evaluated on our own data. For this plot, the complete four-sequence model was utilized as the ground truth for the tissue subclasses of the international validation data. The mathematical derivation of the Dice coefficient is given in the methods.

Tissue class segmentation

We manually reviewed the tissue segmentations of our own data predicted by the complete four-sequence model and determined that its classification of tumour into the subclasses of non-enhancing tumour, enhancing tumour, and oedema was qualitatively more accurate than our semi-automated hand-segmentation. Akin to the method employed by the BraTS 2021 challenge,20 we therefore adopted these complete-imaging-set model predictions as our new ground-truth, with subsequent manual checking and refinement where required. We then compared the performance of all other models, i.e. those without four sequences, to this revised ground-truth. Model performances were again highly reproducible between the BraTS 2021 challenge data and our own external sample, with Dice coefficients significantly correlated (r = 0.95, P < 0.0001) (Fig. 7). As is usually the case in brain tumour segmentation models, segmentations of the non-enhancing tumour component fared worst—especially those with single imaging modalities—whilst prediction of enhancing tumour or oedema fared much better.

We applied our segmentation pipeline to a single patient from our own centre with variable quality (and availability) of imaging during their routine clinical care between 2010 and 2015. We also used this to quantitatively demonstrate lesion volumetry across this time, showing treatment response in early years, followed by stability, and later disease progression (Fig. 8).

Figure 8

Single case example with longitudinal imaging between 2010 and 2015. Line-plot shows time on the x-axis (as days since earliest available imaging) and lesion compartmental volume determined from the segmentation model on the y-axis. Below in (A-I) are FLAIR (first row), T1CE (second row), and T1CE with predicted segmentation overlaid (third row) for each available scanning session. T1 and T2 images are not shown as they were not available for all imaging sessions. Per the colour key, red depicts enhancing tumour, blue non-enhancing tumour, and green oedema. (A) Mid-2010 imaging shows the FLAIR (originally a coronal acquisition, here reconstructed into the axial plane with our super-resolution pipeline for visualization) does not fully cover the posterior margin of the lesion (white arrow). (B) Early 2013 imaging shows the T1CE in some planes of acquisition did not fully cover the cortical surface (note the perfectly vertical line on either side of the brain cortical surface), which is recovered by super-resolution using the other sequences (white arrow). (E) Early 2014 FLAIR image demonstrates suboptimal image quality, yet the segmentation model still delineates the tissue components subjectively well. (H-I) T1CE images undertaken during late 2014 and 2015, respectively, show radical differences in image contrast, but the segmentation model still performs subjectively well. Moreover, in (I), the model still recognizes that the surgical cavity is not lesion, despite never having been trained with post-operative imaging.

Discussion

We have systematically surveyed the ability of state-of-the-art tumour segmentation models to delineate and quantify brain tumour components in real-world clinical situations of incomplete and/or low-quality data. We reveal surprisingly little variation in whole tumour segmentation performance with the number of imaging sequences modelled. Greater variation is observed when segmenting tumour components: a clear pattern of incremental improvement with the addition of further sequences emerges. These findings open the door both to the application of segmentation models to large-scale historical data, for the purpose of building treatment and outcome predictive models, and to their deployment in real-world clinical care.

Strikingly, we find that segmentation models trained without contrast-enhanced imaging still characterize the anatomy of enhancing tumour components remarkably well. This includes quantification of the volumetric burden of enhancing tumour with high accuracy. Out-of-sample validation illustrates strong generalizability of these findings, including across super-resolved non-isotropic acquisitions, varying MRI field strengths, and tumour recurrences on complex postoperative imaging of limited quality. Our analyses show that current segmentation models generalize surprisingly well to real-world clinical imaging varying in quality and sequence completeness. We also use a case-based example (Fig. 8) to demonstrate how this might factor into the clinical workflow, achieved here using a Docker container with Python, the software requirements detailed in the methods, and the trained tumour segmentation model weights (all openly available at https://github.com/high-dimensional/tumour-seg).
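
For readers wishing to reproduce such a workflow, inference with trained nnU-Net (v1) models is typically invoked as below. This is a minimal sketch: the paths and task identifier are illustrative rather than those of our released models, and the repository documentation describes the supported interface.

```python
import subprocess

# Run nnU-Net (v1) inference on a folder of pre-processed (co-registered,
# 1 mm isotropic, skull-stripped) NIfTI volumes. Paths and task ID are illustrative.
subprocess.run(
    [
        "nnUNet_predict",
        "-i", "/data/input_scans",       # folder of e.g. case_0000.nii.gz inputs
        "-o", "/data/segmentations",     # destination for predicted label maps
        "-t", "Task500_BrainTumour",     # hypothetical task identifier
        "-m", "3d_fullres",              # the high-resolution 3D configuration used here
    ],
    check=True,
)
```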

Additive value of multiple sequences

Model fidelity unsurprisingly rose with the number of modelled sequences. What is, however, surprising is the ability of models based on limited data to delineate lesions very well. This is particularly striking in the segmentation of the whole tumour, where only marginal differences in Dice coefficient were seen across the range of sequence combinations. We can conclude that even single sequences may be sufficient for segmenting brain tumours with fidelity adequate for many downstream tasks.

The segmentation of tumour compartments—oedema, enhancing, and non-enhancing tumour—however presents a more complex picture. Single-sequence models of oedema and enhancing tumour perform best with FLAIR and T1CE sequences, respectively. But models of two or three sequences exhibit less intuitive behaviour. Adding FLAIR to T1CE achieves whole tumour performance very close to that of the complete, four-sequence model, despite receiving only half the data. To that end, single T1CE MRI studies (such as in stealth imaging) may benefit from the addition of a FLAIR sequence to enable better visualization of the entire lesion and so aid pre-operative planning. A two-sequence model of T1 and T1CE can delineate oedema well without the T2 or FLAIR typically used to identify it. Overall, these findings illustrate the ability of contemporary computer vision models to extract information from multiple sequences with greater efficiency than intuitive perception may suggest.37,38

Segmenting enhancing tumour without contrast-enhanced imaging

Strikingly, we found that models without the contrast-enhanced sequence (T1CE) can still segment well what experienced neuroradiologists, working with full imaging datasets, have hand-labelled as the enhancing component of the tumour, with performance largely invariant to the size, shape, and neuroanatomical location of the enhancing component. This introduces the possibility—across both research and clinical practice—of making approximate inferences about the anatomy of enhancing components without the use of contrast. Moreover, that a model can identify what has been termed the 'enhancing' tumour,19,38 without any information about its enhancing properties, reveals the presence of non-intuitive imaging features that could render the enhancing component quantifiable without the use of contrast. This challenges the current dogma of 'enhancing tumour', given that a machine can identify it without the administration of the intravenous agents ordinarily required to reveal it. Further investigation of this possibility is warranted, including the detectability of the presence of any degree of enhancement. These findings also illustrate a clinically important opportunity in oncological imaging when contrast-enhanced imaging cannot be acquired, not least in situations of repeated follow-up where the over-use of contrast should ideally be limited, for example to minimize gadolinium retention in paediatric patients. We note recent research on completing image sets synthetically may be fruitful in this domain,39-42 as well as a wider body of literature aiming to reduce the requirement for contrast.43

Limitations

In our systematic evaluation of the ability of deep learning models to identify brain tumours with varying degrees of sequence-incomplete data, we opted to use one single self-configuring architecture—nnU-Net.28,29 The use of this software is well justified given its validated performance across many domains of medical imaging.30 But segmentation models are a rapidly evolving field, and so it is possible that other architectures might perform differently, perhaps even superiorly, to that used here. It is however important to note that our aim was not to identify the definitive ‘best’ tumour segmentation model but to quantify the impact of sequence completeness. Our aim is to determine how such models could perform in real-world clinical situations where ‘perfect’ data rarely exists, quantifying their appropriateness for translation to the clinical frontline. Furthermore, BraTS training data includes only preoperative imaging, yet it is plausible that much of the value in segmentation models may lie in longitudinal follow-up including that of postoperative resection appearances. Whilst we included a selection of postoperative imaging in our additional external validation, a more dedicated evaluation in the postoperative setting should form an area for future investigation.

Conclusion

Automated segmentation models can characterize tumours remarkably well in real-world clinical situations of incomplete imaging data. Such models are also able to identify enhancing tumour without the use of contrast-enhanced imaging, potentially providing clinical guidance in circumstances where contrast administration is contraindicated or where its repeated use should be minimized. This opens the way to quantifying enhancing components without the administration of intravenous agents, and invites a revision of the notion of tumour enhancement if the same information can be extracted without contrast. Its applicability includes not just prospective scenarios where a full study may not be possible, such as in patients unable to receive intravenous contrast, but also historical datasets where certain sequences might not have been acquired. Out-of-sample validation illustrates strong generalizability, across non-isotropic acquisitions and even on complex postoperative imaging where tumours have recurred. Translation of such models to the clinical frontline for response assessment—where complete data is a rarity—may be easier than hitherto believed.

Supplementary material

Supplementary material is available at Brain Communications online.

Funding

J.K.R. is supported by the Guarantors of Brain, the National Health Service Topol Digital Fellowship, and the Medical Research Council (MR/X00046X/1). P.N. is supported by the Wellcome Trust (213038/Z/18/Z) and the University College London Hospital National Institute for Health Research Biomedical Research Centre. H.H. is supported by the University College London Hospital National Institute for Health Research Biomedical Research Centre.

Competing interests

The authors report no conflicts of interest.

References

1. Ruffle JK, Mohinta S, Pombo G, et al. Brain tumour genetic network signatures of survival. arXiv. 2023.
2. Peng J, Kim DD, Patel JB, et al. Corrigendum to: Deep learning-based automatic tumor burden assessment of pediatric high-grade gliomas, medulloblastomas, and other leptomeningeal seeding tumors. Neuro Oncol. 2021;23(12):2124.
3. Xue J, Wang B, Ming Y, et al. Deep learning–based detection and segmentation-assisted management of brain metastases. Neuro Oncol. 2020;22(4):505-514.
4. Lu S-L, Xiao F-R, Cheng JC, et al. Randomized multi-reader evaluation of automated detection and segmentation of brain tumors in stereotactic radiosurgery with deep neural networks. Neuro Oncol. 2021;23(9):1560-1568.
5. Lenchik L, Heacock L, Weaver AA, et al. Automated segmentation of tissues using CT and MRI: A systematic review. Acad Radiol. 2019;26(12):1695-1706.
6. Suetens P, Bellon E, Vandermeulen D, et al. Image segmentation: Methods and applications in diagnostic radiology and nuclear medicine. Eur J Radiol. 1993;17(1):14-21.
7. Ashburner J, Friston KJ. Unified segmentation. Neuroimage. 2005;26(3):839-851.
8. Menze BH, Jakab A, Bauer S, et al. The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Trans Med Imaging. 2015;34(10):1993-2024.
9. Zhao B, James LP, Moskowitz CS, et al. Evaluating variability in tumor measurements from same-day repeat CT scans of patients with non-small cell lung cancer. Radiology. 2009;252(1):263-272.
10. McNitt-Gray MF, Kim GH, Zhao B, et al. Determining the variability of lesion size measurements from CT patient data sets acquired under "no change" conditions. Transl Oncol. 2015;8(1):55-64.
11. Dempsey MF, Condon BR, Hadley DM. Measurement of tumor "size" in recurrent malignant glioma: 1D, 2D, or 3D? AJNR Am J Neuroradiol. 2005;26(4):770-776.
12. Mandal AS, Romero-Garcia R, Hart MG, Suckling J. Genetic, cellular, and connectomic characterization of the brain regions commonly plagued by glioma. Brain. 2020;143(11):3294-3307.
13. Topol E. The Topol Review: Preparing the Healthcare Workforce to Deliver the Digital Future. 2019. https://topol.hee.nhs.uk/wp-content/uploads/HEE-Topol-Review-2019.pdf. Accessed 1 December 2022.
14. Rajpurkar P, Chen E, Banerjee O, Topol EJ. AI in health and medicine. Nat Med. 2022;28(1):31-38.
15. Ruffle JK, Farmer AD, Aziz Q. Artificial intelligence assisted gastroenterology—Promises and pitfalls. Am J Gastroenterol. 2019;114(3):422-428.
16. Chow D, Chang P, Weinberg BD, Bota DA, Grinband J, Filippi CG. Imaging genetic heterogeneity in glioblastoma and other glial tumors: Review of current methods and future directions. Am J Roentgenol. 2018;210(1):30-38.
17. Molina D, Perez-Beteta J, Luque B, et al. Tumour heterogeneity in glioblastoma assessed by MRI texture analysis: A potential marker of survival. Br J Radiol. 2016;89(1064):20160242.
18. Louis DN, Perry A, Wesseling P, et al. The 2021 WHO classification of tumors of the central nervous system: A summary. Neuro Oncol. 2021;23(8):1231-1251.
19. TCGA. VASARI Research Project. https://wiki.cancerimagingarchive.net/display/Public/VASARI+Research+Project. Accessed 25 January 2022.
20. Baid U, Ghodasara S, Bilello M, et al. The RSNA-ASNR-MICCAI BraTS 2021 benchmark on brain tumor segmentation and radiogenomic classification. arXiv. 2021.
21. Andre JB, Bresnahan BW, Mossa-Basha M, et al. Toward quantifying the prevalence, severity, and cost associated with patient motion during clinical MR examinations. J Am Coll Radiol. 2015;12(7):689-695.
22. Bakas S, Akbari H, Sotiras A, et al. Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features. Sci Data. 2017;4:170117.
23. Bakas S, Reyes M, Jakab A, et al. Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BRATS challenge. arXiv. 2018.
24. Baid U, Ghodasara S, Bilello M, et al. The RSNA-ASNR-MICCAI BraTS 2021 benchmark on brain tumor segmentation and radiogenomic classification. arXiv. 2021.
25. Brudfors M, Balbastre Y, Nachev P, Ashburner J. MRI super-resolution using multi-channel total variation. Springer International Publishing; 2018. p. 217-228.
26. Nan Y, Ser JD, Walsh S, et al. Data harmonisation for information fusion in digital healthcare: A state-of-the-art systematic review, meta-analysis and future research directions. Inf Fusion. 2022;82:99-122.
27. Yushkevich PA, Gao Y, Gerig G. ITK-SNAP: An interactive tool for semi-automatic segmentation of multi-modality biomedical images. Annu Int Conf IEEE Eng Med Biol Soc. 2016:3342-3345.
28. Isensee F, Jaeger PF, Kohl SAA, Petersen J, Maier-Hein KH. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nat Methods. 2021;18(2):203-211.
29. Isensee F, Jaeger PF, Full PM, Vollmuth P, Maier-Hein K. nnU-Net for brain tumor segmentation. arXiv. 2020.
30. Antonelli M, Reinke A, Bakas S, et al. The medical segmentation decathlon. Nat Commun. 2022;13(1):4128.
31. Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation. Springer International Publishing; 2015. p. 234-241.
32. Sorensen J. A method of establishing group of equal amplitude in plant sociobiology based on similarity of species content and its application to analyses of the vegetation on Danish commons. 1948.
33. Dice LR. Measures of the amount of ecologic association between species. Ecology. 1945;26(3):297-302.
34. Maier-Hein L, Reinke A, Christodoulou E, et al. Metrics reloaded: Pitfalls and recommendations for image analysis validation. arXiv. 2022.
35. van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9(86):2579-2605.
36. Taha AA, Hanbury A. Metrics for evaluating 3D medical image segmentation: Analysis, selection, and tool. BMC Med Imaging. 2015;15:29.
37. Bink A, Benner J, Reinhardt J, et al. Structured reporting in neuroradiology: Intracranial tumors. Front Neurol. 2018;9:32.
38. Wen PY, Macdonald DR, Reardon DA, et al. Updated response assessment criteria for high-grade gliomas: Response assessment in neuro-oncology working group. J Clin Oncol. 2010;28(11):1963-1972.
39. Conte GM, Weston AD, Vogelsang DC, et al. Generative adversarial networks to synthesize missing T1 and FLAIR MRI sequences for use in a multisequence brain tumor segmentation model. Radiology. 2021;299(2):313-323.
40. Calabrese E, Rudie JD, Rauschecker AM, Villanueva-Meyer JE, Cha S. Feasibility of simulated postcontrast MRI of glioblastomas and lower-grade gliomas by using three-dimensional fully convolutional neural networks. Radiol Artif Intell. 2021;3(5):e200276.
41. Jayachandran Preetha C, Meredig H, Brugnara G, et al. Deep-learning-based synthesis of post-contrast T1-weighted MRI for tumour response assessment in neuro-oncology: A multicentre, retrospective cohort study. Lancet Digit Health. 2021;3(12):e784-e794.
42. Wang G, Gong E, Banerjee S, et al. Synthesize high-quality multi-contrast magnetic resonance imaging from multi-echo acquisition using multi-task deep generative model. IEEE Trans Med Imaging. 2020;39(10):3089-3099.
43. Zhang Q, Burrage MK, Shanmuganathan M, et al. Artificial intelligence for contrast-free MRI: Scar assessment in myocardial infarction using deep learning-based virtual native enhancement. Circulation. 2022;146(20):1492-1503.

Abbreviations

MR = magnetic resonance
MRI = magnetic resonance imaging
T1CE = contrast-enhanced T1 image
TA = acquisition time

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
