Deep Learning for Multi-Tissue Segmentation and Fully Automatic Personalized Biomechanical Models from BACPAC Clinical Lumbar Spine MRI

Abstract Study Design In vivo retrospective study of fully automatic quantitative imaging feature extraction from clinically acquired lumbar spine magnetic resonance imaging (MRI). Objective To demonstrate the feasibility of substituting automatic for human-demarcated segmentation of major anatomic structures in clinical lumbar spine MRI to generate quantitative image-based features and biomechanical models. Setting Previous studies have demonstrated the viability of automatic segmentation applied to medical images; however, the feasibility of these networks to segment clinically acquired images has not yet been demonstrated, as they largely rely on specialized sequences or strict quality of imaging data to achieve good performance. Methods Convolutional neural networks were trained to demarcate vertebral bodies, intervertebral disc, and paraspinous muscles from sagittal and axial T1-weighted MRIs. Intervertebral disc height, muscle cross-sectional area, and subject-specific musculoskeletal models of tissue loading in the lumbar spine were then computed from these segmentations and compared against those computed from human-demarcated masks. Results Segmentation masks, as well as the morphological metrics and biomechanical models computed from those masks, were highly similar between human- and computer-generated methods. Segmentations were similar, with Dice similarity coefficients of 0.77 or greater across networks, and morphological metrics and biomechanical models were similar, with Pearson R correlation coefficients of 0.69 or greater when significant. Conclusions This study demonstrates the feasibility of substituting computer-generated for human-generated segmentations of major anatomic structures in lumbar spine MRI to compute quantitative image-based morphological metrics and subject-specific musculoskeletal models of tissue loading quickly, efficiently, and at scale without interrupting routine clinical care.


Introduction
Chronic lower back pain (cLBP) is a leading cause of disability in the United States [1] and is estimated to affect 540 million people worldwide [2]. Within the United States, cLBP is responsible for an estimated loss of 150 million workdays annually [3,4] and is associated with an estimated annual cost of $100 to $200 billion [5,6]. cLBP is one of the most common drivers of visits to a physician, but despite rapidly increasing treatment costs, patient outcomes have not improved substantially over time [7,8]. cLBP is nonspecific in 62.2% of cases [9], making patient-specific treatment interventions difficult to develop. The etiology of cLBP is multifactorial and includes physical, psychological, environmental, and socioeconomic factors [10]. Challenges in isolating the causes of pain can lead to overuse of imaging, opioids, and surgical treatment options.
Magnetic resonance imaging (MRI) and other medical imaging techniques are often used to design and monitor treatment strategies for patients with cLBP. Imaging alone is a weak predictor of pain presence and pain drivers because of a lack of consistent associations between imaging studies and clinical symptoms [11,12]. There is debate in the literature about the relationships between different image-based morphological metrics and spine-related health indicators [11][12][13][14][15][16][17][18][19][20][21][22], in part because of limited sample sizes and inconsistent measurement of imaging features across readers and institutions.
Segmentation of medical images is essential to unlocking scalable and reliable quantitative image-based markers of cLBP from spine morphology, but it can be prohibitively costly to perform. Manual segmentation is expensive and time intensive, requiring experienced readers between 15 minutes and several hours to annotate each exam, depending on the number of slices and structures of interest. Furthermore, despite the time and expertise investment required, annotations are subject to human error and bias across readers, which makes them difficult to generate and assess consistently in large-scale studies. Segmentation of the vertebral bodies, intervertebral discs, and paraspinous muscles in MRI can be valuable for diagnosing and characterizing spine degeneration and various pathologies related to cLBP, including stenosis, scoliosis, and osteoporosis [23]. Estimation of internal tissue loading demands is a promising potential indicator for evaluating subject-specific risks for back pain and injury prevention [24]. Tissue loading cannot be directly measured, but biomechanical musculoskeletal models based on medical images can be used to provide valid estimates of tissue loading [25]. Segmentation of images of the spine is an essential step to creating subject-specific musculoskeletal models, which improve on generic models by incorporating measurements like spine curvature and muscle morphology [26]. If segmentation can be performed in a fast, low-cost manner, personalized biomechanical models of patients could be automatically created as part of a clinical workflow. These models could then be used to evaluate measures of tissue loading as a component of back pain and to design patient-specific interventions.
Deep learning-based segmentation methods offer improvements in standardization and scalability when compared with human segmentation and can be leveraged to improve data collection in cLBP research. Combining computational methods with MRI presents opportunities for improved diagnosis, with potential for multifactorial analysis, quantitative analysis, and increased sample sizes.
We propose a framework for automatic segmentation of vertebral bodies, intervertebral discs, and paraspinous muscles in clinically acquired sagittal and axial T1-weighted MRIs of the lumbar spine and assess its efficacy as a substitute for manual segmentation in calculating intervertebral disc height, muscle cross-sectional area, and metrics of lumbar spine loading with subject-specific biomechanical models.

Methods
This study was approved by the University of California, San Francisco (UCSF), Institutional Review Board. A retrospective clinical dataset of 206 MRI exams with both axial and sagittal T1-weighted acquisitions was aggregated as part of UCSF's Back Pain Consortium. All exams were randomly selected from clinical scans at UCSF between 2008 and 2018. Of that set, 24/27/45 (vertebral body / disc / muscle) exams were annotated with the respective anatomic structure. Cases with bone fractures, extensive implants, primary tumors, and wide-spread metastatic disease to the spine were excluded. All volumes included were drawn from clinical exams obtained on GE scanners and followed the image acquisition parameters detailed in Table 1. Images were captured with subjects in head-first or feet-first supine position.
Manual vertebral body segmentations were performed on all slices in 24 exam volumes by one reader (GI). Intervertebral disc segmentations were similarly annotated on all slices in 27 exams by two readers (GI, MH). Paraspinal muscle segmentations were annotated on one to three slices at each lumbar disc level in 45 exams by one reader (GI). Both readers were trained by a board-certified radiologist (UB). All annotations were performed with MD.ai annotation software (MD.ai, New York, NY) (see Figure 1). When annotating the vertebral bodies and intervertebral discs, readers were instructed to segment all structures visible on every slice of a volume, excluding 1) any bodies and discs that were not completely pictured in the field of view and 2) structures inferior to the S1 vertebral body and L5-S1 intervertebral disc. All readers were trained to identify respective anatomy by a board-certified radiologist.
Separate 2D V-Nets [27] were trained to segment each anatomic structure on each sequence. The V-Net architecture and associated Dice-based loss function have previously been demonstrated to have improved performance and convergence time on medical image segmentation tasks over other popular network architectures. A 2D instead of 3D approach was selected to maximize use of available data, as the wide variation in number of annotated slices per exam would have required substantial slice-dimension coercion, which could impart bias through duplication or interpolation and loss of information through cropping. Because slice thickness in clinical images is greater than the in-plane resolution, segmentation of these sequences was a good candidate for a twodimensional method. Two versions of each convolutional neural network were trained to segment the vertebral bodies,
intervertebral discs, and paraspinous muscles, respectively, from T1-weighted MRI volumes. Two-version splitting was required to maximize both training performance and unbiased testing of biomechanical models; the standard split was constructed to demonstrate expected network performance under typical conditions, while the shared split was constructed to maximize samples in a hold-out test set to assess downstream performance without data leakage. The first version of each of the three models was computed with a standard random split of approximately 75% train, 15% validation, and 10% test to demonstrate optimal model performance on segmentation of each anatomic structure; we call this version the standard split. To accommodate limitations in data quantity, the networks were then retrained with new splits, which reserved a shared set of 10 patients across models for testing to assess the performance of this segmentation pipeline on automatic morphological metric and biomechanical model generation; we call this version the shared split. All networks were trained according to the hyperparameters listed in Table 2.
Morphological metrics and biomechanical models were evaluated only for the networks in the shared split. Intervertebral disc, vertebral body, and paraspinous muscle segmentations were inferred for each patient in the shared hold-out test set of 10 exams, through the use of the respective V-Net. Postprocessing was applied to all segmentations to smooth edges, fill holes, and isolate the largest connected components. Intervertebral disc height (IVDH), muscle crosssectional area (CSA), and 2D and 3D centroid positions for each anatomic structure in patient-based space were then extracted from the segmentations. These quantitative features were then leveraged as inputs to construct subject-specific biomechanical models of the lumbar spine to extract measures of compressive loading on the vertebral bodies.
To calculate IVDH, a 3D centroid was computed on each segmented disc to identify its most central slice; a minimum bounding rectangle was then constructed around the segmentation on the center slice to extract the shortest side length as a final height. Muscle CSA was constructed by calculating the sum of foreground pixels for each muscle and then multiplying by pixel spacing to yield area in square centimeters. Finally, a center of mass was computed in 3D to identify volume-wise centroids on each vertebral body and intervertebral disc and in 2D to identify slice-wise centroids on each muscle (see Figure 2). Each centroid point was then mapped to the patient-based coordinate system with an affine matrix transformation by using the source exam's metadata, yielding results measured in distance from the scanner reference point.
Subject-specific musculoskeletal models of the trunk were created with vertebral body centroids, muscle centroids, and muscle CSA in OpenSim (SimTK) [28]. Starting with an appropriate sex-specific base model, models were scaled to the subject's height and weight. Vertebral centroid locations and muscle morphology parameters were then incorporated into the model via custom Matlab 2019b (The MathWorks Inc., Natick, MA) [29] scripts to build the final model with subject-specific spine curvature and trunk muscle properties [26]. Lumbar spine compressive loading was evaluated with these models for a forward flexion activity (60 degrees of trunk flexion with 5-kg weights in each hand) at L1-L5 for each patient in the shared test set.
A volumetric Sørensen-Dice similarity coefficient [30] was calculated to assess the overall performance of each of the three networks. Each of the morphological metrics and biomechanical model outcomes was calculated on both the manual and inferred segmentations for each patient, creating results that could be directly compared to assess the effectiveness of substituting manual for automatic annotation. Bland-Altman plots and Pearson R correlation coefficients were computed to assess the relationships among IVDH, CSA, and loading metrics generated from manually vs automatically segmented data.
Visual inspection for quality control after model training revealed several errors. Errors on the vertebral body annotations included segmentation of the S2 vertebral body or lower, exclusion of lateral slices in which bone was visible, and segmentation of both the vertebral body and bony pedicles in some of exams. One exam was annotated with paraspinous muscles on all slices. Major annotation errors were identified on one exam after training; the exam had been incompletely annotated and included fewer than half of the intervertebral discs, and it was consequently excluded from the hold-out test set on the shared split model.

Results
Volumetric Dice coefficients for each segmentation method are summarized in Tables 3, 4, and 5. The three segmentation networks with standard splits performed with Dice coefficients of 0.87 for the intervertebral disc network, 0.88 for the vertebral body network, and 0.81/0.95/0.84/0.92 (mul/psoas/ QL/ES) for the muscle network on each respective test set. After re-splitting and retraining to construct a set of the shared split networks, Dice coefficients decreased to 0.86 on   Figure 3. The vertebral body network performed with level-wise volumetric Dice coefficients greater than or equal to its overall performance of 0.76 on all levels. Level-wise performance on the intervertebral disc network dropped below its overall performance of 0.86 on all disc levels. Results on the S2 vertebral body and S1S2 intervertebral disc are not reported, as annotators were instructed to exclude the S2 vertebral body and S1S2 intervertebral disc. Stratified volumetric Dice coefficients could not be computed because of the absence of objective markers with which to partition axial slices into level-based groups, given that the annotators were not required to segment each slice. Instead, two-dimensional slice-wise Dice coefficients were used to measure stratified network performance for each muscle group in the paraspinous muscle network (Figure 4).
Pearson R correlation coefficients and mean absolute error were used to measure the relationship between morphological metrics calculated from manually vs automatically generated segmentations (Tables 6, 7, and 8). IVDH calculated from network-generated segmentations vs manually generated segmentations showed a correlation coefficient of 0.26 and a mean absolute error of 1.98 mm between the two methods. Muscle CSA from automatic segmentation was correlated with that of manual segmentations with coefficients of 0.74/0.850.70/0.37 and mean absolute errors of 1.21/1.36/0.82/3.55 cm 2 (mul/ psoas/QL/ES). Centroid locations differed, with mean absolute errors of 3.03/3.64/7.23/3.58 mm in Euclidean distance between ground truth and inference. Vertebral body compressive loading computed between inferred and manually generated input data was correlated with coefficients 0.93/0.90/0.78/0.68/0.52 (L1/ L2/L3/L4/L5). Bland-Altman and scatter plots for manually vs automatically generated intervertebral disc height and muscle CSA show correlation and agreement between the two methods (see Figures 5 and 6).

Discussion
Previous studies have demonstrated the viability of automatic segmentation applied to medical images but have left notable Figure 2. Visualization of centroid construction. T1 axial and T1 sagittal MRI slices were input into each respective V-Net to generate inferred segmentation masks of the vertebral bodies, intervertebral discs, and paraspinous muscles. After postprocessing, centers of mass were computed on each segmentation mask to calculate the position of volume-wise centroids for each vertebral body and intervertebral disc and slice-wise centroids for each paraspinous muscle. These centroids were then converted to patient-based space, yielding a 3D atlas of the lumbar spine for further biomechanical modeling.  Automatic Segmentation MRI Biomechanics S143 gaps in the problem space in designing methods that integrate seamlessly into clinical workflow and create pathways to apply the technology at scale to biomechanical research. Demonstration of the feasibility of these networks to segment clinically acquired images is limited, instead requiring a nonroutine protocol or strict quality controls [23,[31][32][33].
Published networks do not apply this technology across planes and views to segment multiple anatomic structures or construct a fully subject-specific 3D atlas of the lumbar spine [34]. Additionally, published networks are specialized to downstream anomaly detection, not biomechanical modeling [35]. As a result, clinical translation of these methods is unfeasible on a large scale, as adding specialized imaging sequences to existing clinical protocols is costly and time intensive to institutions and patients. The generalizability of these methods to imaging studies routinely acquired in clinical settings is not yet proven. We demonstrate the feasibility of substituting manual with automatic segmentation of the vertebral bodies, intervertebral discs, and paraspinous muscles by using networks trained on a small amount of clinical data. Overall segmentation network performance indicates that manual and automatic segmentation methods perform similarly, and morphological metric calculation can largely be outsourced to neural networks. Although stratified performance results indicate value in human oversight of network performance and morphological metric generation, there are clear efficiency gains within an acceptable margin of error to be found by implementing fully automatic assistance when delineating vertebral bodies, intervertebral discs, and paraspinous muscle in T1-weighted MRI. The proposed segmentation pipeline and the associated quantitative feature generation methods have applications in both a clinical and a research context, as they will enable researchers to analyze larger datasets of potential biomarkers of cLBP and quickly provide those same features to clinicians to improve disease characterization and treatment in real time.
Performance for both the vertebral body and intervertebral disc segmentation networks is strongest on the central-most lumbar levels, which was consistent variation in training data and anatomic boundaries. Because of natural variation in patient anatomy, exams in model training exhibited a range of numbers of vertebral bodies and intervertebral discs in the lumbar spine. Annotators were instructed to exclude vertebral bodies that were not completely pictured in the field of view, as well as to exclude the S2 vertebral body or the S1S2 intervertebral disc, but features of vertebral body and intervertebral disc boundaries are similar across the lumbar spine. This led to a pattern in which trained networks inconsistently segmented the most superior and most inferior vertebral bodies and intervertebral discs in the field of view, as networks were arbitrarily penalized for correctly identifying these bodies in training. This phenomenon is reflected in the network's relatively poorer performance when segmenting the S2 vertebral body and T11T12 and S1S2 intervertebral discs ( Table 3). Segmentation of the multifidus, psoas, and erector spinae performance variance has no demonstrated correlation with slice level in most exams. Segmentation performance on the quadratus lumborum tends to drop on the most inferior and superior slices, consistent with anatomic expectation above L1 and below L5 (Figure 4).
Error trends in automatic calculation of intervertebral disc height reflected trends in model error on the disc segmentation network, with statistical significance occurring only at the T12L1/L1L2 level. Results of all levels except L2 and L3 were inconclusive in the lumbar loading comparison. The decline in correlation from L1 to L5 could suggest greater sensitivity of lower lumbar loads to the subject-specific model inputs, as models generate less reliable results on slices at the boundaries of anatomy. Automatic segmentation as an input might have tended to underestimate the height of each disc and load per vertebral body, but conclusions cannot be drawn given the small sample size. We present a highly scalable, fully automatic framework to generate quantitative measures of spine morphology and subject-specific biomechanical models from lumbar spine MRI. We demonstrate that results generated by this pipeline are highly correlated and agree with those generated by human readers, without human-in-the-loop correction. This work indicates that computer-generated segmentations could successfully substitute for human-demarcated masks, which are time intensive and costly to obtain, in the quantification of metrics of lumbar spine morphology and biomechanical models to quantify tissue loading. The networks used to create the predicted segmentations were trained on clinical  Figure 5. Correlation (left) and agreement (right) between manually and automatically generated segmentations for each biomarker. Correlation between disc height from manual vs inferred disc segmentations is displayed on a scatter plot, where the line x¼y is indicated in grey. Agreement is displayed with Bland-Altman plots for disc height on each disc.
exams with standard diagnostic sequences, which suggests strong generalizability with no extra costs associated with exam acquisition. A human-in-the-loop system to catch failures but improve time to acquire each segmentation could be of value to account for variation in performance with scan quality. A fully automatic, quantitative method for generating image-based features of spine morphology and validated estimates of tissue loading from clinically acquired MR exams combined with biomechanical modeling, like this one, would provide a scalable approach with which to evaluate drivers of cLBP across patients, institutions, and imaging archives without interrupting routine care.

Supplementary Data
Supplementary Data may be found online at http://painmedicine.oxfordjournals.org.

Funding
Funding was received from the U.S. National Institutes of Health (UH2AR076724, R01-AR073019).

Conflicts of interest:
All authors have no conflicts of interest to disclose.

Supplement sponsorship
This article appears as part of the supplement entitled "Back Pain Consortium (BACPAC) Research Program" supported by the National Institutes of Health through the NIH HEAL Initiative under award number AR076730-01. Figure 6. Stratified loading performance. Difference (left) and correlation (right) between loading metrics generated from human-vs computergenerated segmentation masks, stratified by vertebral body level.