Risk Factors and Incidence of Colorectal Cancer According to Major Molecular Subtypes

Abstract
Background: Colorectal cancer (CRC) is a heterogeneous disease that can develop via 3 major pathways: conventional, serrated, and alternate. We aimed to examine whether risk factor profiles differ according to pathway-related molecular subtypes.
Methods: Using data from 2 large US cohorts, we examined the association of 24 risk factors with 4 CRC molecular subtypes defined by the combined status of microsatellite instability (MSI), CpG island methylator phenotype (CIMP), and BRAF and KRAS mutations. We used inverse probability weighted duplication-method Cox proportional hazards regression to evaluate differential associations across subtypes.
Results: We documented 1175 CRC patients with molecular subtype data: subtype 1 (n = 498; conventional pathway; non-MSI-high, CIMP-low or negative, BRAF-wild-type, KRAS-wild-type), subtype 2 (n = 138; serrated pathway; any MSI status, CIMP-high, BRAF-mutated, KRAS-wild-type), subtype 3 (n = 367; alternate pathway; non-MSI-high, CIMP-low or negative, BRAF-wild-type, KRAS-mutated), and subtype 4 (n = 172; other marker combinations). Statistically significant heterogeneity in associations with CRC subtypes was found for age, sex, and smoking, with a higher hazard ratio (HR) observed for subtype 2 (HR per 10 years of age = 2.64, 95% CI = 2.13 to 3.26; HR for female = 2.65, 95% CI = 1.60 to 4.39; HR per 20-pack-year of smoking = 1.29, 95% CI = 1.14 to 1.45) than for other CRC subtypes (all P for heterogeneity < .005). Adiposity measures were more strongly associated with subtype 1 CRC in men and subtype 3 CRC in women, and several dietary factors were more strongly associated with subtype 1 CRC, although these differences did not achieve statistical significance at an α level of .005.
Conclusions: Risk factor profiles may differ for CRC arising from different molecular pathways.


Assessment of risk factors
For smoking, we assessed pack-years of smoking to date and pack-years smoked before age 30 to account for the life-course effect. In addition to current body weight assessed on the biennial questionnaires, we asked participants to recall their body weight at age 18 in 1980 in the NHS, and at age 21 in 1986 in the HPFS. We used these data along with height reported at baseline to calculate BMI at young adulthood. In the NHS, women were asked to report their waist circumference, measured with a tape measure, in 1986, 1996 and 2000. In the HPFS, we enclosed a tape measure in an optional questionnaire and directed participants to measure their waist at the umbilicus in 1987 and 1996. Every 4 years, we assessed the type of, and time spent on, specific physical activities and calculated total physical activity as the summed metabolic equivalent of task (MET) hours per week.
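As a minimal sketch of the two derived measures described above (BMI at young adulthood and total MET-hours per week), the following Python illustrates the arithmetic. The function names, the metric units, and the example MET values are assumptions made for illustration; they are not taken from the cohort questionnaires or the study code.

```python
def bmi(weight_kg, height_m):
    """Body mass index in kg/m^2 (metric units assumed for illustration)."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def total_met_hours(activities):
    """Sum of (MET value x hours per week) over all reported activities.

    activities: dict mapping activity name -> (met_value, hours_per_week).
    """
    return sum(met * hours for met, hours in activities.values())


# Hypothetical participant: recalled weight at age 18 and baseline height.
young_adult_bmi = bmi(weight_kg=58.0, height_m=1.65)

# Hypothetical MET values and weekly hours, chosen only for the example.
weekly = {"walking": (3.0, 4.0), "running": (7.0, 1.5)}
met_hours = total_met_hours(weekly)
```

Here `young_adult_bmi` combines recalled weight with baseline height, mirroring the description above, and `met_hours` is the product sum over activities.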

Assessment of family history of CRC
We utilized family history data prospectively collected from questionnaires before a participant developed colorectal cancer, to avoid recall bias. If a participant did not develop colorectal cancer, we utilized questionnaire data throughout the follow-up period or until death. In our cohort studies, we classified participants as having a family history of CRC if they reported at least 1 first-degree relative (parent, sibling, or child) with CRC.

Assessment of the use of endoscopic exams
In both cohorts, beginning in 1988 and continuing through 2002, participants were asked biennially whether they had undergone lower gastrointestinal endoscopy and, if so, the reason for the endoscopy. In 2004, we additionally inquired whether the previously reported endoscopies were sigmoidoscopies or colonoscopies. Every cycle thereafter, responses for sigmoidoscopy and colonoscopy were recorded separately. For our study, we considered only endoscopies performed for screening.

Molecular pathological epidemiology (MPE) database and inverse probability weighting (IPW)
The molecular pathological epidemiology (MPE) database of colorectal cancer is based on two U.S. nationwide prospective cohort studies (NHS and HPFS) and integrates various data about colorectal cancer patients, including clinicopathological and molecular features, long-term survival data, and lifestyle information. This comprehensive database allowed us to examine the relationship between a specific environmental exposure and CRC subtypes while adjusting for a variety of potential confounders. The MPE database can provide etiologic and pathogenic insights, potentially contributing to precision medicine for personalized prevention and treatment.
To account for missing data on tumor molecular markers, we performed the IPW analysis in the Cox proportional hazards regression. The method details have been described in our previous publication. 3 Briefly, we first modeled the availability of molecular subtype data (based on a combinatorial status of microsatellite instability (MSI), CpG island methylator phenotype (CIMP), and BRAF and KRAS mutations) using logistic regression with the subtype availability status (subtype data available vs. missing) as the outcome variable, and covariates that may influence the success of tissue collection as predictor variables, including disease stage (stage I, stage II, stage III, or stage IV), tumor location (proximal colon, distal colon, or rectum), tumor differentiation (well differentiated, moderately differentiated, or poorly differentiated), age at diagnosis (continuous), age at diagnosis squared (continuous), diagnostic year (continuous), and family history of colorectal cancer (yes or no). We then calculated the weight as the inverse of the probability of molecular subtype data availability estimated from the logistic model.
Extremely small and large weights were truncated at the 5th and 95th percentiles to reduce outlier effects. We then incorporated this inverse probability weight into our Cox proportional hazards regression model. The weights were accounted for in variance estimation using the robust sandwich variance estimate. 4
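The weighting step can be sketched as follows. This is an illustrative Python outline, not the SAS code used in the study: it assumes the availability probabilities have already been estimated by the logistic model, and the helper names (`percentile`, `truncated_ipw`), the linear-interpolation percentile rule, and the example probabilities are all hypothetical.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def truncated_ipw(probabilities, lower_q=5.0, upper_q=95.0):
    """Inverse probability weights, truncated at the given percentiles."""
    if any(not 0.0 < p <= 1.0 for p in probabilities):
        raise ValueError("probabilities must lie in (0, 1]")
    raw = [1.0 / p for p in probabilities]          # weight = 1 / P(data available)
    lo, hi = percentile(raw, lower_q), percentile(raw, upper_q)
    return [min(max(w, lo), hi) for w in raw]       # clip extreme weights


# Hypothetical fitted probabilities of having subtype data, one per CRC case.
probs = [0.9, 0.8, 0.5, 0.5, 0.4, 0.05]
weights = truncated_ipw(probs)
```

In this toy input the case with a 0.05 probability would receive a raw weight of 20, which truncation pulls down to the 95th percentile of the raw weights; cases with mid-range probabilities keep their raw weights.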

Statistical analysis
We excluded participants who had missing data on any of the studied risk factors at baseline. For missing data that occurred in the follow-up questionnaires, we carried forward the most recent available information from prior questionnaires. To reduce intra-individual variation and capture long-term exposures, we used the cumulative average for relevant variables, which was the mean of all available data prior to each of the questionnaire cycles. To control for confounding by age, calendar time, and any possible two-way interactions between these two time scales, we stratified the analysis jointly by age and calendar time of the current questionnaire cycle.
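The two exposure-handling rules above (carrying forward the last available value and cumulative averaging) can be illustrated with a short Python sketch. The function names and example values are hypothetical, and the sketch assumes that the cumulative average at a given cycle includes that cycle's own questionnaire; the source does not spell out this boundary.

```python
def carry_forward(values):
    """Replace each missing value (None) with the most recent earlier observation."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def cumulative_average(values):
    """Mean of all observed (non-missing) values up to and including each cycle."""
    out, total, count = [], 0.0, 0
    for v in values:
        if v is not None:
            total += v
            count += 1
        out.append(total / count if count else None)
    return out


# Hypothetical alcohol intake (g/day) over four questionnaire cycles;
# the second questionnaire is missing.
intake = [10.0, None, 20.0, 30.0]
filled = carry_forward(intake)
averaged = cumulative_average(intake)
```

With this input, `filled` repeats the first report in the missing cycle, while `averaged` moves gradually toward later reports, which is the smoothing effect the cumulative average is meant to provide.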
The proportional hazards assumption was tested by including a product term between the exposure variable and time in the model and testing the significance of the product term using the Wald test. No deviation from the proportional hazards assumption was detected. All models were adjusted for race (white or non-white), height (continuous), family history of colorectal cancer (yes or no), history of endoscopic exams (yes or no), body mass index (continuous), pack-years of smoking (continuous), physical activity (continuous), alcohol intake (continuous), and regular aspirin use (yes or no).
The basic model for our analysis was the proportional hazards model, $I(t \mid x, U, i) = I_{0i}(t)\exp\{\beta_1 x(t) + \beta_2 U(t)\}$, where $\beta_1$ is the $\log_e$ incidence rate ratio describing the increase or decrease in the baseline incidence rate at time $t$ due to a one-unit increase in exposure $x(t)$ measured at time $t$, $U(t)$ is a vector of covariates at age $t$, $\beta_2$ is the vector of $\log_e$ incidence rate ratios describing the increase or decrease in the baseline incidence rate due to a one-unit increase in these a priori covariates, $t$ is the age at which the cancer of interest is diagnosed, and $I_{0i}(t)$ is the baseline cancer incidence rate at age $t$ in stratum $i$.
Person-years of follow-up were counted from the age in months at the date the baseline questionnaire was returned until the age in months at the date of diagnosis of CRC, age at date of death, or age at end of follow-up, whichever came first. SAS PROC PHREG was used for all analyses, and the Andersen-Gill data structure was used to handle time-varying covariates efficiently: a new data record is created for every questionnaire cycle at which a participant was at risk, with covariates set to their values at the time that the questionnaire was returned.
To control as finely as possible for confounding by age, calendar time, and any possible two-way interactions between these two time scales, we stratified the analysis jointly by age in months at start of follow-up and calendar year of the current questionnaire cycle. The time scale for the analysis was then measured as months since the start of the current questionnaire cycle, which is equivalent to age in months because of the way we structured the data and formulated the model for analysis.

a Data are information for each CRC at diagnosis. All variables are adjusted for age and sex except for age and sex themselves. Mean (standard deviation) is presented for continuous variables.
b A standard tablet contains 325 mg aspirin, and regular users were defined as those who used at least two standard tablets per week.
c Physical activity is calculated by the product sum of the MET of each specific recreational activity and hours spent on that activity per week. For physical activity, the follow-up started in 1986 in NHS.
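The counting-process (Andersen-Gill) data layout described under Statistical analysis, with one record per questionnaire cycle at risk, can be sketched in Python as follows. This only illustrates the record structure and is not the SAS PROC PHREG code used in the study; the `Record` class, the function name, and the example ages and covariate values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Record:
    participant_id: int
    start_month: int   # age in months when the questionnaire was returned
    stop_month: int    # age in months at the end of this interval
    event: int         # 1 if CRC was diagnosed at stop_month, else 0
    covariates: dict   # covariate values as of the questionnaire return


def counting_process_records(pid, cycles, end_month, had_event):
    """Split follow-up into one record per questionnaire cycle at risk.

    cycles: list of (age_in_months_at_return, covariates), sorted by age.
    end_month: age in months at diagnosis, death, or end of follow-up.
    """
    records = []
    for i, (start, covs) in enumerate(cycles):
        if start >= end_month:
            break  # no longer at risk when this questionnaire was returned
        next_start = cycles[i + 1][0] if i + 1 < len(cycles) else end_month
        stop = min(next_start, end_month)
        is_last = stop == end_month
        records.append(Record(pid, start, stop, int(had_event and is_last), dict(covs)))
    return records


# Hypothetical participant diagnosed at age 640 months, between two cycles.
cycles = [(600, {"bmi": 24.0}), (624, {"bmi": 25.1}), (648, {"bmi": 25.4})]
rows = counting_process_records(pid=1, cycles=cycles, end_month=640, had_event=True)
```

In this example the participant contributes two records, (600, 624] with no event and (624, 640] with the event, and the third questionnaire is never used because follow-up has already ended.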