Cirrus: An Automated Mammography-Based Measure of Breast Cancer Risk Based on Textural Features

Abstract
Background: We applied machine learning to find a novel breast cancer predictor based on information in a mammogram.
Methods: Using image-processing techniques, we automatically processed 46 158 analog mammograms for 1345 cases and 4235 controls from a cohort and case–control study of Australian women, and a cohort study of Japanese American women, extracting 20 textural features not based on pixel brightness threshold. We used Bayesian lasso regression to create individual- and mammogram-specific measures of breast cancer risk, Cirrus. We trained and tested measures across studies. We fitted Cirrus with conventional mammographic density measures using logistic regression, and computed odds ratios (OR) per standard deviation adjusted for age and body mass index.
Results: Combining studies, almost all textural features were associated with case–control status. The ORs for Cirrus measures trained on one study and tested on another study ranged from 1.56 to 1.78 (all P < 10⁻⁶). For the Cirrus measure derived from combining studies, the OR was 1.90 (95% confidence interval [CI] = 1.73 to 2.09), equivalent to a fourfold interquartile risk ratio, and was little attenuated after adjusting for conventional measures. In contrast, the OR for the conventional measure was 1.34 (95% CI = 1.25 to 1.43), and after adjusting for Cirrus it became 1.16 (95% CI = 1.08 to 1.24; P = 4 × 10⁻⁵).
Conclusions: A fully automated personal risk measure created from combining textural image features performs better at predicting breast cancer risk than conventional mammographic density risk measures, capturing half the risk-predicting ability of the latter measures. In terms of differentiating affected and unaffected women on a population basis, Cirrus could be one of the strongest known risk factors for breast cancer.

It is well established that there is information in a mammogram that predicts a woman's risk of a future breast cancer. Mammographic density has conventionally been defined as the white or bright regions on a mammographic image. Considerable research has shown that, after adjusting for age and body mass index (BMI), the residuals of the absolute and percentage values of conventional mammographic density are highly correlated with one another, and both sets of residuals have been found by many studies to be associated with breast cancer risk (1). Residuals are the appropriate way to consider mammographic density as a risk factor for breast cancer because, across the age range relevant to most mammographic density studies, age and BMI are negatively associated with conventional mammographic density measures (2) but positively associated with breast cancer risk. These residuals are also highly correlated over time (3,4).
For the conventional measures of mammographic density, once adjusted for age and BMI, the risk increases by about 1.4-fold per adjusted standard deviation (3,5,6), equivalent to an approximately twofold interquartile risk ratio (IQRR) (7). In comparison, the risk gradient for the current best polygenic breast cancer risk score based on common variants (single-nucleotide polymorphisms; SNPs) is about 1.6-fold per standard deviation, or a threefold IQRR (8,9).
The benchmark for measuring mammographic density has been a computer-assisted thresholding technique using Cumulus software (Sunnybrook Health Sciences Centre, Toronto, Canada) (10). A digital version of the mammographic image is divided into two segments based on a pixel brightness threshold chosen by the measurer; one segment represents what the measurer considers to be the white or bright regions (ie, the dense tissues) and the remainder of the breast is considered to be nondense.
Although highly repeatable across trained measurers, the semiautomated Cumulus measurements involve subjective judgment and are too labor intensive for clinical use. Automated measures of conventionally defined mammographic density, such as AutoDensity (11), LIBRA (12), and Volpara (13) (which aims to estimate the volume of dense tissue), have been developed. We have also found, using Cumulus software, but in effect defining mammographic density at higher pixel brightness thresholds, that stronger risk gradients can be obtained (5,6,14). This raises the possibility that better risk-predicting measures could be found by considering characteristics of a mammogram other than the conventional concept of mammographic density.
All the risk-predicting measures above summarize a mammographic image by a single quantity: an estimate of the area or volume of dense tissue. This is equivalent to counting the number of pixels above a brightness threshold, and uses first-order statistical information only. Even the volumetric Volpara measure reduces to the same process, in that it defines mammographic density as the number of voxels (3D pixels) deemed to be dense.
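The pixel-counting idea described above can be sketched in a few lines. This is only an illustration in Python with NumPy, not the Cumulus algorithm: the function name `dense_area_measures`, the breast mask input, and the threshold value are all hypothetical, and in practice the threshold is chosen by a trained measurer for each image.

```python
import numpy as np

def dense_area_measures(image, breast_mask, threshold):
    """Count pixels at or above a brightness threshold inside the breast region.

    Returns (dense_area_in_pixels, percent_density). Hypothetical helper for
    illustration; a first-order summary that ignores where the bright pixels are.
    """
    img = np.asarray(image, dtype=float)
    mask = np.asarray(breast_mask, dtype=bool)
    dense = (img >= threshold) & mask          # bright pixels within the breast
    dense_area = int(dense.sum())
    breast_area = int(mask.sum())
    percent = 100.0 * dense_area / breast_area if breast_area else float("nan")
    return dense_area, percent

# Toy 2 x 3 "image": the last column lies outside the breast mask.
toy_image = np.array([[10, 200, 250],
                      [30, 180, 240]])
toy_mask = np.array([[True, True, False],
                     [True, True, False]])
toy_dense_area, toy_percent = dense_area_measures(toy_image, toy_mask, threshold=150)
```

Because only the count of bright pixels is used, any rearrangement of the same pixels within the breast region gives the same value, which is the limitation that motivates textural features.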
In this article, we describe an alternative and agnostic approach to discovering risk-predicting information in a mammographic image. We used machine learning to capture the combination of textural and spatial information, not necessarily observable by a human, that best predicts breast cancer risk. In contrast to conventional approaches, we used second-order statistical information (ie, measures of variability and interrelations between pixels).
We tested the risk-predicting capability of each measure, trained on a specific study, on the other studies. We then combined these features and studies to produce a fully automated mammography-based risk measure that we named Cirrus. Although we are not the first to take such an approach [see Gastounioti et al. (15) for a summary of texture-based approaches], we consider that our large sample sizes, the validation across populations and designs, and the way that we handle the highly correlated textural features are strengths.

Subjects
We analyzed digitized film mammograms of women with breast cancer (cases) and women without a diagnosis of breast cancer (controls) from three studies: an Australian case–control study, an Australian cohort study, and a study of Japanese American women living in Hawaii [see (5), (16), and (17), respectively].
All film mammograms were digitized. Because our software automatically handled removal of pectoral muscles from craniocaudal (CC) views, but not from mediolateral oblique (MLO) mammograms, analysis was restricted to CC view mammograms. (Our pilot study of around 600 MLO views found that the risk prediction was no different between CC and MLO views.) We used both left and right mammograms at the same visits, and multiple visits when available. The median number of mammograms per woman was 10 (Australian cohort study), 2 (Australian case–control study), and 5 (Japanese American study).

Quality Control
All datasets underwent quality control to remove inappropriate mammograms (eg, negative images, MLOs misclassified as CCs, damaged film mammograms, and case patient mammograms used for their diagnosis) as well as mammograms that failed the automatic preprocessing stage by being incorrectly segmented (eg, when the range of contrast was abnormally low). After quality control, less than 3% of mammograms in any dataset were removed.

Image Analysis
We developed and applied an automatic preprocessing algorithm that uses image-processing techniques to segment the breast from the background noise, and to remove artefacts and labels before feature extraction (Supplementary Methods, available online).

Feature Extraction
We applied algorithms to extract features of potential interest. We required features to be invariant to rotation and translation, so that a mammogram will yield the same features irrespective of the positioning and orientation of the breast. In the image-processing literature, texture refers to the relationship between pixels in a neighborhood. Because it carries second-order information, texture essentially describes the types of patterns present in an image; eg, whether areas of the image are smooth or rough, or whether the rough and smooth areas are scattered across an image or clustered together. We chose the gray-level co-occurrence matrix (GLCM) class of features (18–20), based on the statistical properties of neighboring pixels. Many GLCM textures act as analogues of quantities found in the physical sciences. For example, homogeneity measures the degree of "scatteredness" of the texture within an image, so images with large areas of similar intensity pixels have a higher degree of homogeneity than those composed of a large number of small dissimilar regions.
A total of 20 GLCM features common in the literature were extracted from all mammograms (see Table 1). Importantly, because the characteristics and behavior of digitizers vary by manufacturer and study, we modified the standard GLCM feature algorithm so that our features would be resistant to digitizer settings. All image analysis was performed using the MATLAB computing platform. More details on the GLCM feature extraction can be found in the Supplementary Methods (available online).
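To make the GLCM idea concrete, the following is a minimal textbook sketch in Python with NumPy. It is not the modified, digitizer-resistant algorithm used in the study (which was implemented in MATLAB and is described in the Supplementary Methods); the function names are hypothetical, only one pixel offset is used, and homogeneity is computed with the 1 + (i − j)² weighting, which is one of several definitions in the literature.

```python
import numpy as np

def glcm(image, levels, offset=(0, 1)):
    """Symmetric, normalized gray-level co-occurrence matrix for one offset.

    `image` holds integer gray levels in [0, levels); `offset` is a
    (row, column) displacement with non-negative components.
    """
    img = np.asarray(image, dtype=np.intp)
    dr, dc = offset
    rows, cols = img.shape
    ref = img[: rows - dr, : cols - dc]   # reference pixels
    nbr = img[dr:, dc:]                   # their neighbors at the offset
    counts = np.zeros((levels, levels), dtype=float)
    np.add.at(counts, (ref.ravel(), nbr.ravel()), 1.0)
    counts = counts + counts.T            # count each pair in both directions
    return counts / counts.sum()

def contrast(p):
    """Sum of p(i, j) * (i - j)^2: large when neighboring pixels differ."""
    i, j = np.indices(p.shape)
    return float(np.sum(p * (i - j) ** 2))

def homogeneity(p):
    """Sum of p(i, j) / (1 + (i - j)^2): large when neighbors are similar."""
    i, j = np.indices(p.shape)
    return float(np.sum(p / (1.0 + (i - j) ** 2)))

# Two toy images with identical brightness histograms but different texture.
block = np.array([[0, 0, 1, 1]] * 4)            # two large uniform regions
checker = np.indices((4, 4)).sum(axis=0) % 2    # finely scattered pattern
p_block = glcm(block, levels=2)
p_checker = glcm(checker, levels=2)
```

Both toy images have exactly half of their pixels bright, so a threshold-based density measure cannot tell them apart, whereas homogeneity is higher for the image made of large uniform regions. Averaging over several offsets and directions would be one way to approach the rotation invariance the text requires.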

Statistical Analysis
For each feature, the information for a given woman was taken to be the median of her features across all her mammograms because this increases the amount of information for risk prediction and produces a more stable predictor. Marginal (one-at-a-time) estimation of the association between the 20 GLCM features and breast cancer risk, unadjusted for covariates, was first performed using logistic regression and presented as the odds ratio (OR) per standard deviation of the unadjusted cross-sectional measure.
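The per-woman summary described above amounts to a grouped median. A minimal sketch in Python with NumPy follows; the function name `per_woman_median` and the toy identifiers are hypothetical, and the study's own implementation (in MATLAB) is not shown in the text.

```python
import numpy as np

def per_woman_median(woman_ids, features):
    """Collapse mammogram-level features to one row per woman.

    `woman_ids` has one entry per mammogram; `features` is an
    (n_mammograms, n_features) array. Returns (unique_ids, medians).
    """
    ids = np.asarray(woman_ids)
    x = np.asarray(features, dtype=float)
    unique_ids = np.unique(ids)
    medians = np.vstack([np.median(x[ids == w], axis=0) for w in unique_ids])
    return unique_ids, medians

# Woman "a" has three mammograms, woman "b" has one; two features each.
toy_ids = ["a", "a", "a", "b"]
toy_features = [[1.0, 10.0], [3.0, 30.0], [2.0, 20.0], [5.0, 50.0]]
women, woman_level = per_woman_median(toy_ids, toy_features)
```

Each column of the woman-level matrix would then be divided by its standard deviation before fitting a one-feature logistic regression, so that the resulting OR is expressed per standard deviation.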
A risk measure, Cirrus, was computed from features by applying the Bayesian lasso regression procedure (21). This procedure is based on a Bayesian interpretation of the lasso (22) penalty; it automatically estimates the regularization parameter to avoid statistical instability due to collinearity, and makes best use of all the available features [see (23)]. The logistic regression models were estimated by drawing 10 000 samples from the posterior distribution, with the first 1000 discarded as burn-in samples, using the Bayesreg Bayesian regression software (24) in MATLAB. The measure of breast cancer risk was a linear combination of the estimated coefficients and the image features.
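The final step, turning posterior samples into a single risk score, can be sketched as follows. This Python/NumPy fragment does not perform the Bayesian lasso sampling itself (that was done with Bayesreg in MATLAB); it assumes the coefficient samples are already available as an array, and it assumes the posterior mean is used as the point estimate, which the text does not state. The function name `risk_score` is hypothetical.

```python
import numpy as np

def risk_score(features, coef_samples, n_burn_in=1000):
    """Linear combination of features and estimated coefficients.

    `coef_samples` is an (n_samples, n_features) array of posterior draws.
    Assumption: the point estimate is the posterior mean after burn-in.
    The intercept is omitted because it does not change the ranking of women.
    """
    samples = np.asarray(coef_samples, dtype=float)
    kept = samples[n_burn_in:]              # discard burn-in draws
    beta_hat = kept.mean(axis=0)            # assumed point estimate
    return np.asarray(features, dtype=float) @ beta_hat

# Toy chain: 1000 poor burn-in draws followed by 9000 draws at (0.5, -0.2).
toy_samples = np.vstack([np.full((1000, 2), 99.0),
                         np.tile([0.5, -0.2], (9000, 1))])
toy_x = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]])
toy_scores = risk_score(toy_x, toy_samples)
```

The features passed in are assumed to be on the same scale as those used when the model was fitted.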
A separate Cirrus measure was constructed from each dataset. Each trained measure was tested on the given dataset, to assess its maximum risk prediction, and then tested on each of the other two independent datasets, to see the extent to which the discovery process could be externally validated.
We assessed risk-predicting performance using logistic regression and adjusting for age and BMI. Risk gradients were presented as the change in the age- and BMI-adjusted OR per unit change in the standard deviation of the residual of the measure after adjusting for age and BMI using the controls, following the OPERA concept (7). The nominal P values were determined using the Wald test. As shown in the Supplementary Methods (available online), for a continuous risk factor satisfying a normality assumption and a relatively rare disease, the IQRR can be predicted from the OR per adjusted standard deviation by an expression involving Φ, the cumulative distribution function of the standard normal distribution, and the quartile points a = Φ⁻¹(0.25) ≈ −0.6745 and b = Φ⁻¹(0.75) ≈ 0.6745. The relationship between log(OR) and the area under the receiver operator curve (AUC) is also shown in the Supplementary Methods (available online).
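The adjustment step described above, taking residuals after regressing the measure on age and BMI in the controls and scaling by their standard deviation, can be sketched as below. This is a Python/NumPy illustration under the assumption of a simple linear adjustment with main effects only; the function name `standardize_using_controls` is hypothetical and the exact model used in the study is not given in the text.

```python
import numpy as np

def standardize_using_controls(measure, age, bmi, is_control):
    """Residual of `measure` after adjusting for age and BMI, in control SD units.

    The regression is fitted on controls only and applied to everyone, so a
    value of 1.0 means one adjusted standard deviation above what is expected
    for a control of the same age and BMI.
    """
    y = np.asarray(measure, dtype=float)
    ctrl = np.asarray(is_control, dtype=bool)
    design = np.column_stack([np.ones_like(y),
                              np.asarray(age, dtype=float),
                              np.asarray(bmi, dtype=float)])
    coef, *_ = np.linalg.lstsq(design[ctrl], y[ctrl], rcond=None)
    residuals = y - design @ coef
    return residuals / residuals[ctrl].std(ddof=1)

# Simulated example: the raw measure depends on age and BMI plus noise.
rng = np.random.default_rng(0)
sim_age = rng.uniform(40, 70, 200)
sim_bmi = rng.uniform(18, 35, 200)
sim_control = np.arange(200) % 4 != 0
sim_measure = 2.0 + 0.1 * sim_age - 0.2 * sim_bmi + rng.normal(size=200)
sim_z = standardize_using_controls(sim_measure, sim_age, sim_bmi, sim_control)
```

Entering the standardized residual as the predictor in a logistic regression that also includes age and BMI then gives an OR per adjusted standard deviation.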
We also created a Cirrus measure trained on the combined data, and fitted it to the combined data using logistic regression adjusting for age and BMI, with and without the conventional risk measures of absolute and percentage mammographic density, power transformed and adjusted for age and BMI. Based on the Box-Cox transformation, we used the fourth root of absolute density and the cube root of percentage density. We used the conventional mammographic density measures, created using the semiautomated computer software Cumulus (10), from the published cohort (16) and case–control (5,17) studies, and the subjects for whom BMI data were available.

Results
Table 1 shows that, for 11 of the 20 GLCM image features (contrast, correlation, dissimilarity, homogeneity, difference variance, difference entropy, entropy, information correlation 1, information correlation 2, normalized inverse difference, and moment normalized inverse difference), the directions and magnitudes of their risk associations were similar across all studies. This indicates that there is a relationship of features to breast cancer risk that is robust to study variation.
Table 2 shows that the ORs per adjusted standard deviation of the Cirrus measures trained on each study were highest when tested on that study itself, and in the range of 1.72 to 1.92. Most importantly, the cross-study replication associations were also high, ranging from 1.56 to 1.78 (all P < 10⁻⁶). All the replication log(OR)s were within 20% of their in-sample training log(OR)s. The strongest cross-validation was for the Cirrus measure trained on the Australian cohort study and replicated on the study of Japanese American women living in Hawaii.
Table 3 shows that, when all three datasets were combined, nearly all the GLCM features were associated with case–control status. Figure 1 shows that the 11 GLCM features that were consistently associated across studies were highly correlated with each other (all r > 0.9). These features also had similar absolute log(OR)s in Table 2. Figure 1 shows that the other nine GLCM features were also strongly correlated with each other, and Table 3 suggests that they had similar but lower absolute log(OR)s. Most pairs of features from the two different sets of 11 and 9 GLCM features were weakly correlated (absolute r < 0.4). When we repeated the analyses using measures based on a single mammogram (the earliest), we found that the general findings of Table 3 were repeatable across studies, although with greater variation across studies, justifying our use of the measures based on averaging over all mammograms.
Table 4 shows the skewness and excess kurtosis of each of the 20 features in the combined dataset, along with the posterior standard deviation and standardized weight used to create the final Cirrus measure (see Supplementary Methods, available online, for more details). This Cirrus measure was independent of age and weakly negatively associated with BMI (r = −0.1).
For the Cirrus risk measure constructed from combining all datasets and all features and adjusted for age and BMI (see Figure 2), the OR per adjusted standard deviation was 1.90 (95% CI = 1.73 to 2.09) (Table 5). This is close to the value of 1.86 predicted from the standardized difference in means between cases and controls shown in Figure 2 (0.622) and the theory explained in the Supplementary Methods (available online). In comparison, the ORs per adjusted standard deviation for the risk factors based on absolute and percentage density measures were 1.34 and 1.38, respectively. Table 5 also shows that the log(OR) for the Cirrus measure was reduced by less than 10% after adjusting for the conventional measures, and the predicted IQRR was about 4.2-fold (95% CI = 3.3 to 5.4).
In contrast, the OR for the conventional measure was 1.34 (95% CI = 1.25 to 1.43), and after adjusting for Cirrus it became 1.16 (95% CI = 1.08 to 1.24) (P = 4 × 10⁻⁵). This was nearly a halving in log(OR), and a similar result applied to percentage density. The correlations between the risk estimates for Cirrus and the risk estimates for the absolute and percentage measures of conventional mammographic density were −0.3 and −0.4, respectively. When the two conventional mammographic density measures were modeled together, the percentage measure was not significant, consistent with risk being best captured by the absolute measure. Figure 3 shows that, from considering the receiver operator curves, the Cirrus measure gives better risk discrimination (AUC = 0.662; 95% CI = 0.635 to 0.690) than does a Cirrus measure based on homogeneity alone (AUC = 0.642; 95% CI = 0.615 to 0.670), which gives better risk discrimination than does the percentage mammographic density measure (AUC = 0.620; 95% CI = 0.593 to 0.648).
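The links between a standardized difference in means, the OR per standard deviation, the IQRR, and the AUC can be illustrated with standard results for normally distributed risk factors. The Supplementary Methods are not reproduced in this text, so the expressions below are assumptions rather than the authors' formulas: equal-variance normal distributions in cases and controls, a rare disease with risk proportional to exp(βx), and the IQRR taken as the average risk in the top quarter of the population divided by that in the bottom quarter. All function names are hypothetical.

```python
import math

A = -0.6745  # 25th percentile of the standard normal distribution
B = 0.6745   # 75th percentile of the standard normal distribution

def std_normal_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def or_from_standardized_difference(d):
    """OR per SD when case and control means differ by d standard deviations."""
    return math.exp(d)

def iqrr_from_or(or_per_sd, a=A, b=B):
    """Assumed IQRR: mean risk above b divided by mean risk below a.

    For x ~ N(0, 1) and risk proportional to exp(beta * x), this ratio is
    [1 - Phi(b - beta)] / Phi(a - beta), with beta = log(OR per SD).
    """
    beta = math.log(or_per_sd)
    return (1.0 - std_normal_cdf(b - beta)) / std_normal_cdf(a - beta)

def auc_from_standardized_difference(d):
    """AUC for two equal-variance normal distributions d SDs apart."""
    return std_normal_cdf(d / math.sqrt(2.0))

predicted_or = or_from_standardized_difference(0.622)
predicted_auc = auc_from_standardized_difference(0.622)
iqrr_for_conventional = iqrr_from_or(1.4)
```

With a standardized difference of 0.622, the first function returns about 1.86, the value quoted in the text. Under these assumptions an OR of 1.4 per standard deviation corresponds to an IQRR somewhat above twofold, and the predicted AUC for a standardized difference of 0.622 is close to, but not identical with, the reported 0.662.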

Discussion
From applying machine learning to textural features of film mammograms, we have found novel information that predicts breast cancer risk. Analyses suggested that the textural features that differed in mean between the Caucasian and Japanese women were those that did not, of themselves, predict breast cancer risk (data not shown).
Our new mammography-based risk measure, Cirrus, is fully automated. It was better at predicting breast cancer risk than the conventional mammographic density risk measure and captured half of that measure's risk-predicting ability. In terms of differentiating affected and unaffected women of the same age on a population basis, Cirrus is one of the strongest risk factors for breast cancer, with an IQRR that could be as high as fourfold.
The risk discrimination of our Cirrus risk measures trained on a given dataset and tested on the other two was similar across all three datasets. The largest decrease in risk discrimination from training data to test data was only 20% on the log(OR) scale, and this was from training on the smallest sample (Japanese American women living in Hawaii) and testing on Australian women. The strong consistency across studies suggests that the identified risk-predicting features and their risk gradients are reliable predictors of breast cancer. It also suggests that our modification to the GLCM technique to make it resistant to regularity and collinearity effects across manufacturers and digitizers was successful.
An intriguing aspect of our findings is that the texture features used by the Cirrus measure are not based on absolute brightness thresholds, as are conventional mammographic density measures and the newer ones, Altocumulus and Cirrocumulus, defined at higher pixel brightness thresholds. Cirrus uses only relative brightness between pixels, yet performs better than the conventional measure across all studies, and has risk-predicting performance similar to that of Altocumulus and Cirrocumulus for the Australian case–control study (5).
Our new study, and recent studies of Altocumulus and Cirrocumulus (5,6,14), raise the question as to whether the amount of what has conventionally been considered to be dense tissue should continue to be viewed as the gold standard of mammography-based risk estimation. The strong risk association we found with second-order textural information suggests that it may be a combination of the quantity and the spatial configuration of specific types of tissue that underlies the biological mechanisms determining breast cancer risk.
Almost all of the 20 GLCM features we studied were associated with breast cancer risk. Our statistical method helped us extract most of the information from the feature set even though the features were correlated, and had some similarities to that used by Yaghian et al. (25) and Wang et al. (26) to find specific textural features predictive of masking and risk. There was a high correlation between some textural features, meaning that many capture the same textural information. This high level of collinearity likely explains the apparent lack of concordance in specific findings across studies using GLCM-type features [eg, Huo et al. (27), Wang et al. (26)]; when features are so highly correlated, it is largely at random which features will be ranked as the "best" from analysis of any given dataset. Our aim was to build the best predictor, not to find the best individual predictor(s), so we used the Bayesian shrinkage procedure because it is known to be a better way to achieve our aim (23). This distinguishes our work from the previous studies (26,27).
For example, using a standard lasso procedure, Wang et al. (26) selected sum average, which tends to identify dispersed patterns of density, as the best predictor of risk. We, however, found no evidence that this feature was a major predictor of disease. They also noted that "the features that were not selected by lasso are not necessarily non-predictive of risk." On the other hand, Wang et al. commented that "it was slightly surprising that . . . some previously reported texture features such as . . . contrast . . . were not selected, although contrast was significantly and negatively associated with risk in both the training and validation studies, in line with Huo et al." We too found that contrast was negatively associated with risk, and this was highly significant in our study. A comparison of our risk predictors and statistical approaches with those of Wang et al. on the same dataset would be instructive. Given that our aim was to build the best predictor, use of the Bayesian lasso procedure mitigates the instability problem that arises from procedures that try to first select features by estimating associations for all features. Notwithstanding, homogeneity was one of the most strongly associated features, almost as good as the combination in predicting risk, and performed better than the conventional mammographic density measures (data not shown). However, that feature alone gave poorer internal and external risk prediction than the combined measure, Cirrus, as illustrated in Figure 3.
Supplementary Figure 1 (available online) shows mammograms from women with extreme (high and low) Cirrus risk measures, but whose conventional mammographic density risk measures were average (almost all between the 30th and 70th percentiles). The low-risk Cirrus images appear to be slightly darker overall.
Supplementary Figure 2 (available online) shows the same mammograms after processing and quantizing by the GLCM algorithm (see the Supplementary Methods, available online). The differences between the high- and low-risk Cirrus mammograms are clearer. The low-risk Cirrus mammograms have a scattered pattern with thin spiderwebs of brightness cutting through the darker regions. In contrast, the high-risk Cirrus mammograms appear to be composed of several large, well-defined, homogeneous connected regions. This would appear to reflect the homogeneity textural feature that we identified to be an important predictor of risk. Given its simple interpretation, homogeneity has the potential to be a useful new biomarker for breast cancer risk.
Some strengths of our study are our consistent findings related to textural features and strengths of association within and between study findings despite the variation in 1) ethnic origin of women, 2) the machines used to produce the mammograms, and 3) the digitization of mammograms. The strong cross-predictive performance suggests that we selected important aspects of a mammogram that are robust to differences in image acquisition. Cirrus appears to be essentially uncorrelated with age, and only weakly correlated with BMI and with both the absolute and percentage measures of conventional density. Further analyses (data not shown) suggest that Cirrus is not highly correlated with family history and is weakly correlated with number of live births and menopause status, similar to Cumulus, and we are conducting detailed analyses of these issues, as we did for Cumulus (2), for future publication. There appears to be room for improvement by, eg, extending the measure to incorporate more conventional mammographic risk features, such as dense area-like quantities defined by different pixel brightness thresholds (5,6).
The two main weaknesses of our study are 1) we combined interval and screen-detected cancers, which could weaken the ability of our measure to predict specifically risk or specifically masking and 2) our study was composed entirely of digitized film mammograms. Although the GLCM features account for some differences between images and machines, there is no guarantee that our current Cirrus measure will perform as well when ported to digital mammography; it could require further modification. It might also be that different features, or different weightings of features, provide better predictors of risk from digital mammograms. Not being able to readily identify and "see" the Cirrus measure and the individual features that underlie this phenomenon might be considered another weakness, and we have tried to address this. However, given that the consequence of applying machine learning is an automated mammography-based risk measure that does not require visual measurement, this might not be an impediment to future clinical use of measures like Cirrus when based on machine learning applied to digital mammograms.
In conclusion, we have used machine learning to create a new and fully automated mammography-based risk measure that has reliability within and across studies. Note that although we averaged mammograms to discover the predictor, we have created a risk predictor applicable to a single mammogram, thus allowing future studies of changes over time for a particular woman. One of the powerful features of using machine learning to create an automated measure of risk is that it is not necessary to have a human interpretation of the features; the human perception of the features will not be used on its own in practice. Our risk measure performs better at predicting breast cancer risk than the conventional mammographic density risk measures and captures half the risk-predicting ability of those measures. In terms of differentiating affected and unaffected women on a population basis, Cirrus could be one of the strongest known risk factors for breast cancer.