Identification of Novel Genetic Markers of Breast Cancer Survival

Background: Survival after a diagnosis of breast cancer varies considerably between patients, and some of this variation may be due to germline genetic variation. We aimed to identify genetic markers associated with breast cancer–specific survival. Methods: We conducted a large meta-analysis of studies in populations of European ancestry, including 37 954 patients with 2900 deaths from breast cancer. Each study had been genotyped for between 200 000 and 900 000 single nucleotide polymorphisms (SNPs) across the genome; genotypes for nine million common variants were imputed using a common reference panel from the 1000 Genomes Project. We also carried out subtype-specific analyses based on 6881 estrogen receptor (ER)–negative patients (920 events) and 23 059 ER-positive patients (1333 events). All statistical tests were two-sided. Results: We identified one new locus (rs2059614 at 11q24.2) associated with survival in ER-negative breast cancer cases (hazard ratio [HR] = 1.95, 95% confidence interval [CI] = 1.55 to 2.47, P = 1.91 × 10^-8). Genotyping a subset of 2113 case patients, of whom 300 were ER negative, provided supporting evidence for the quality of the imputation. The association in this set of case patients was stronger for the observed genotypes than for the imputed genotypes. A second locus (rs148760487 at 2q24.2) was associated at genome-wide statistical significance in initial analyses; the association was similar in ER-positive and ER-negative case patients. Here the results of genotyping suggested that the finding was less robust. Conclusions: This is currently the largest study investigating genetic variation associated with breast cancer survival. Our results have potential clinical implications, as they confirm that germline genotype can provide prognostic information in addition to standard tumor prognostic factors.


Survival after a diagnosis of breast cancer varies considerably between patients. Many factors influence outcome in an individual patient, including inherited genetic variation. This hypothesis is supported by several lines of evidence. It has been shown that first-degree relatives with breast cancer have a correlated likelihood of dying from the disease (1-3). Additionally, mouse strain is a determinant of metastatic progression in in vivo models (4). There are many mechanisms through which germline genetic variation might affect prognosis. Some known disease susceptibility alleles confer differential risks of different tumor subtypes that are associated with different outcomes; for example, deleterious alleles of BRCA1 are associated with estrogen receptor (ER)-negative disease, and several common germline genetic variants that are associated with susceptibility to breast cancer have different risks of ER-positive and ER-negative disease (5,6). Germline genotype could also affect the efficacy of adjuvant drug therapies or might influence tumor-host interactions, such as those involving the stroma surrounding a tumor or the host's immune response (7). The host genotype might also influence the propensity of a tumor to seed and grow at metastatic sites.
The association between common germline genetic variation and breast cancer-specific survival has been examined in many candidate gene studies (8-16). These studies have identified numerous single nucleotide polymorphisms (SNPs) possibly associated with outcome, but none has been conclusively replicated in further studies. Genome-wide association studies (GWAS) have been very successful at identifying susceptibility alleles for a wide range of normal and disease phenotypes (17). However, GWAS of breast cancer survival published to date have had modest sample sizes and have not identified any confirmed associations (7,18). The success of other GWAS has clearly depended on large sample sizes, so large studies of survival time are likely to be required if alleles associated with prognosis in breast cancer are to be identified. We therefore pooled genotype data from multiple breast cancer GWAS discovery and replication efforts and linked these data to available survival time data for the case patients in order to maximize statistical power to detect associations.

Breast Cancer Patient Samples
We pooled data from multiple breast cancer case cohorts in populations of European ancestry with existing high-density SNP genotyping (Supplementary Table 1 (26), and UK2 (27)). Each study had been genotyped for 200 000 to 900 000 SNPs across the genome using a variety of genotyping arrays. SASBAC and HEBCS are single-case cohorts, and all others have multiple constituent studies. A summary of the studies in COGS contributing data to our analysis is shown in Supplementary Table 3 (available online). ER status was obtained mostly from medical records followed by immunohistochemistry performed on tumor tissue microarrays or whole-section tumor slides. All studies were approved by the relevant institutional review boards, and all participants provided written informed consent.

Genotyping Quality Control
The genotype and sample quality control (QC) procedures have been previously described for COGS (5), CGEMS (20), HEBCS (16), SASBAC (26), UK2 (27), PG-SNPs (22), and BPC3 (19). QC procedures have not been described previously for the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) germline genotype data: SNPs were excluded 1) if the genotype frequencies deviated from those expected under Hardy-Weinberg equilibrium at P values of less than 1 × 10^-5, 2) if they had a minor allele frequency (MAF) of less than 1%, 3) if the MAF was between 1% and 5% and the call rate was under 99%, and 4) if the MAF was greater than 5% and the call rate was under 95%. A summary of the number of genotyped SNPs and the number of SNPs passing QC is shown in Supplementary Table 4 (available online). All individuals with low call rates (< 95%) or high or low heterozygosity (P < 1 × 10^-5) were excluded from subsequent analyses. All analyses were restricted to subjects of European ancestry, as determined from genotype data. The methods and criteria for exclusion of non-European samples have been described previously for all studies apart from METABRIC, for which we used a set of unlinked SNPs and the program Local Ancestry in adMixed Populations (28) to assign intercontinental ancestry based on the HapMap release 22 genotype frequency data for European, African, and Asian populations. Subjects with less than 90% European ancestry were excluded.
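The METABRIC exclusion rules above can be written as a pair of small predicates. This is only an illustrative sketch: the function names are hypothetical, and treating the 1% and 5% MAF boundaries as inclusive of the middle band is an assumption, since the text does not say how the boundaries were handled.

```python
def snp_passes_qc(hwe_p: float, maf: float, call_rate: float) -> bool:
    """Return True if a SNP survives the four SNP-level exclusion rules."""
    if hwe_p < 1e-5:           # rule 1: deviation from Hardy-Weinberg equilibrium
        return False
    if maf < 0.01:             # rule 2: minor allele frequency below 1%
        return False
    if maf <= 0.05:            # rule 3: MAF 1% to 5% requires call rate >= 99%
        return call_rate >= 0.99
    return call_rate >= 0.95   # rule 4: MAF above 5% requires call rate >= 95%


def sample_passes_qc(call_rate: float, heterozygosity_p: float) -> bool:
    """Samples with call rate < 95% or extreme heterozygosity (P < 1e-5) fail."""
    return call_rate >= 0.95 and heterozygosity_p >= 1e-5
```

A common SNP with a 96% call rate passes, whereas a SNP with MAF = 3% and the same call rate would be dropped under rule 3.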

Imputation
Genotypes for common variants across the genome were imputed using a reference panel from the 1000 Genomes Project in order to increase genome coverage. Genotype imputation for PG-SNPs, METABRIC, UK2, SASBAC, HEBCS, and COGS was performed using IMPUTE2 (29) after prephasing with SHAPEIT (30). This was done in chunks of 5 Mb with default parameters for both programs. The imputation reference set consisted of 2184 phased haplotypes from the full 1000 Genomes Project data set (March 2012). All genomic locations are given in NCBI Build 37/UCSC hg19 coordinates. Imputation for CGEMS and BPC3 was performed using the program MaCH (31). SNPs with imputation r^2 < 0.3 were excluded on a study-by-study basis. All SNPs with a MAF of less than 1% were excluded.

Statistical Analysis
The primary end point was breast cancer-specific survival. Time-to-event was calculated from the date of diagnosis. However, case patients were recruited at variable times before or after diagnosis; therefore, time under observation was calculated from date of recruitment (left censoring) in order to prevent the bias that could result from the inclusion of prevalent cases. Follow-up was right censored on the date of death if death was from something other than breast cancer, the date last known alive if death did not occur, or at 10 years after diagnosis, whichever came first. We fitted univariate Cox proportional hazards models to assess the association of genotype with breast cancer-specific mortality. We also ran analyses for ER-negative and ER-positive breast cancer. Each data set, including the three component case cohorts in BPC3, was analyzed separately. The Cox models were stratified by study for the COGS data set. We controlled for cryptic population substructure by including a variable number of principal components as covariates for each data set. The Cox proportional hazards assumption was tested for each significant SNP of interest analytically using Schoenfeld residuals. There was no evidence of nonproportional hazards. For the statistically significantly associated SNPs, we ran multivariable Cox models adjusting for age, nodal status, tumor size, tumor grade, and adjuvant treatment using the COGS data. We used an in-house program written in C++ for the analysis of COGS, HEBCS, METABRIC, PG-SNPs, SASBAC, and UK2. Analysis of CGEMS and BPC3 data was conducted using ProbABEL (32). We excluded SNPs with MAFs under 1% because of extreme values of the test statistics. Overall statistical significance tests for each SNP were performed by combining the results for each data set using a fixed-effects meta-analysis. All statistical tests were two-sided.
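The follow-up rules above (time measured from diagnosis, observation starting at recruitment, and censoring at 10 years) can be illustrated with a small helper that builds the at-risk interval for one patient. This is a sketch under stated assumptions, not the authors' in-house C++ program: the function name and arguments are hypothetical, a year is approximated as 365.25 days, and patients with no time at risk inside the 10-year window are assumed to be dropped.

```python
from datetime import date
from typing import Optional, Tuple

DAYS_PER_YEAR = 365.25        # assumption: average year length
MAX_FOLLOW_UP_YEARS = 10.0    # follow-up is censored 10 years after diagnosis


def follow_up_window(
    diagnosis: date,
    recruitment: date,
    last_known: date,             # date of death, or date last known alive
    died_of_breast_cancer: bool,  # False for other deaths and for survivors
) -> Optional[Tuple[float, float, bool]]:
    """Return (entry, exit, event) in years since diagnosis, or None."""
    # Observation starts at recruitment (delayed entry), never before diagnosis.
    entry = max(0.0, (recruitment - diagnosis).days / DAYS_PER_YEAR)
    exit_time = (last_known - diagnosis).days / DAYS_PER_YEAR
    event = died_of_breast_cancer
    if exit_time > MAX_FOLLOW_UP_YEARS:
        # Anything after 10 years is ignored, so the patient is censored there.
        exit_time = MAX_FOLLOW_UP_YEARS
        event = False
    if exit_time <= entry:
        return None  # no observed time at risk within the window
    return entry, exit_time, event
```

The resulting (entry, exit, event) triples are the counting-process input that a Cox model with delayed entry expects.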
Inflation of the test statistics (λ) was estimated by dividing the 45th percentile of the test statistic by 0.357 (the 45th percentile for a χ^2 distribution on 1 degree of freedom). Heterogeneity between studies was measured using the I^2 statistic (33,34). Correlation between SNPs was calculated using the Pearson correlation coefficient. Associations were regarded as statistically significant at a nominal P value of less than 5 × 10^-8 (genome-wide significance).

eQTL
Expression quantitative trait locus (eQTL) analyses were performed for all genes in the 1 Mb region spanning the associated SNPs using probe-level gene expression data for breast epithelium samples taken from normal tissue adjacent to the tumor of 135 breast cancer patients of European ancestry from the METABRIC study (21). These were assayed using the Illumina HT12 platform. We also analyzed eQTL data of 387 breast tumors from the Cancer Genome Atlas (TCGA) (303 ER-positive, 81 ER-negative, three unknown) assayed using the Agilent G4502A-07-3 array (35). Germline SNP genotypes were available for normal and tumor samples from the Affymetrix SNP 6.0 platform imputed into 1000 Genomes Project data (March 2012) for the three SNPs of interest: rs2059614 at 11q24.2, and rs148760487 and rs114860916 at 2q24.2 (see Results section). Association between genotype and expression was tested by linear regression with false discovery rate control.
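The text does not state which false discovery rate procedure was applied to the eQTL regression P values. As an illustration only, the sketch below assumes the widely used Benjamini-Hochberg step-up adjustment; the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    n = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotone adjusted values.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted
```

Each gene or probe in the region would contribute one regression P value, and probes whose adjusted value falls below the chosen FDR threshold would be reported as eQTLs.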

Results
The overall results were based on 37 954 case patients with 2900 deaths from breast cancer (Supplementary Table 2, available online). The results of the subtype-specific analyses were based on 23 059 ER-positive case patients (1333 deaths) from five studies and 6881 ER-negative case patients (920 deaths) from eight studies.
In the overall analysis, we identified 28 SNPs associated with breast cancer-specific survival at P values of less than 5 × 10^-8 (Table 1; Supplementary Figures 1 and 2, available online). All 28 SNPs were located in the same region on chromosome 2 and had been imputed in all eight data sets. The strongest association was for rs148760487 (hazard ratio [HR] = 1.88, 95% confidence interval [CI] = 1.51 to 2.34, P = 1.5 × 10^-8; risk allele frequency = 0.01). This SNP was associated with breast cancer-specific survival in both ER-positive (HR = 2.07, 95% CI = 1.47 to 2.91, P = 3.1 × 10^-5) and ER-negative case patients (HR = 1.87, 95% CI = 1.27 to 2.75, P = .002). The imputation efficiency for these SNPs varied between an r^2 of 0.69 and 0.997 for the eight data sets. The inflation factor λ for the overall analysis was 1.01.
A single imputed SNP, rs2059614, located on chromosome 11, was associated with breast cancer-specific mortality at genome-wide statistical significance in patients with ER-negative disease (HR = 1.90, 95% CI = 1.54 to 2.33, P = 1.3 × 10^-9; risk allele frequency = 0.06) (Table 1; Figures 1 and 2). The imputation r^2 ranged from 0.75 to 0.82 across eight studies with ER-negative cases. The inflation factor λ for analysis based on ER-negative cases was 1.03. No SNP reached nominal genome-wide statistical significance in the analysis of case patients with ER-positive disease (Supplementary Figures 3 and 4, available online), for which the strongest association was for rs7149859 on chromosome 14 (HR = 1.22, 95% CI = 1.13 to 1.33, P = 7.0 × 10^-7). There was very little between-study heterogeneity for the overall analysis for rs148760487 (I^2 = 0%, P = .59) (Supplementary Figure 5, available online) or the ER-negative analysis for rs2059614 (I^2 = 0%, P = .50) (Supplementary Figure 6, available online).
We conducted follow-up imputation on the two regions around rs148760487 and rs2059614 using the IMPUTE2 Markov chain Monte Carlo algorithm with 80 iterations without prephasing, as omitting the prephasing step should maximize imputation accuracy (29). We reimputed all SNPs in the genomic regions 500 kb on either side of the two SNPs of interest. The association for rs148760487 was somewhat weaker (HR = 1.75, 95% CI = 1.39 to 2.20, P = 1.44 × 10^-6). A highly correlated SNP, rs114860916 (r^2 = 0.97), was now the most strongly associated SNP in the region (HR = 1.74, 95% CI = 1.39 to 2.18, P = 1.16 × 10^-6). In contrast, rs2059614 remained the SNP most strongly associated with survival of ER-negative disease (HR = 1.95, 95% CI = 1.55 to 2.47, P = 1.91 × 10^-8). Again there was no evidence of heterogeneity in the meta-analysis of these SNPs (data not shown).
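As a consistency check on reported estimates such as these, a two-sided P value can be approximately recovered from a hazard ratio and its 95% confidence interval, assuming the interval is a Wald interval that is symmetric on the log scale. The helper below is a hypothetical illustration of that arithmetic, not part of the original analysis.

```python
import math


def p_from_hr_ci(hr, ci_low, ci_high, z_crit=1.959964):
    """Approximate z score and two-sided P value from an HR and its 95% CI."""
    log_hr = math.log(hr)
    # A 95% Wald interval spans 2 * z_crit standard errors on the log scale.
    std_err = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = log_hr / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Reimputed estimate for rs2059614 in ER-negative case patients.
z, p = p_from_hr_ci(1.95, 1.55, 2.47)
print(f"z = {z:.2f}, P = {p:.2e}")
```

For rs2059614 this gives a P value of the same order as the reported 1.91 × 10^-8; small differences are expected because the published HR and interval are rounded to two decimal places.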
We genotyped rs148760487 and rs2059614 in 2113 breast cancer case patients from the Studies of Epidemiology and Risk Factors in Cancer Heredity (SEARCH) in order to confirm the quality of the imputation. The correlation between the imputed and observed genotypes was 0.63 for rs148760487 and 0.68 for rs2059614. This compares with an estimated imputation r^2 of 0.76 and 0.79 for the genotypes imputed with prephasing using genotype data from the COGS custom array. We then compared the results of association analyses for the SEARCH data set using the imputed and observed genotypes. For rs148760487, there were 133 breast cancer deaths. In this subset the association based on genotyped data was weaker than the association based on the imputed data (HR = 1.66, 95% CI = 0.75 to 3.69, P = .21 and HR = 2.06, 95% CI = 0.84 to 5.04, P = .11, respectively), but this difference was not statistically significant (P = .72). For rs2059614, there were genotyped and imputed data for 300 ER-negative samples with 45 deaths. The association with genotyped data was stronger than that for imputed data (HR = 1.80, 95% CI = 0.99 to 3.25, P = .05 and HR = 1.44, 95% CI = 0.51 to 4.12, P = .49, respectively), as would be expected for a true positive association. Again, this difference was not statistically significant (P = .72). We also conducted multivariable analysis for these two SNPs using the pooled data within the COGS data set, stratified by study and adjusting for principal components, age, lymph node status, tumor size, stage, grade, ER status (where applicable), and adjuvant treatment; the results were similar to the main findings (data not shown). Finally, we compared the hazard ratios for rs148760487 in all case patients and for rs2059614 in ER-negative case patients between premenopausal women (defined as age at diagnosis younger than 45 years) and postmenopausal women (age at diagnosis of 55 years or older). There was no statistically significant difference (P = .96 and .24, respectively).
The risk allele of rs2059614 was associated with increased expression of EI24 and CHEK1 in normal breast epithelium adjacent to tumor from the METABRIC study (P = .002 and .007, respectively) (Supplementary Figure 7, available online). EI24 is a tumor suppressor gene involved in TP53-dependent apoptosis. CHEK1 is required for checkpoint-mediated cell cycle arrest in response to DNA damage. Other SNPs in the region were more strongly associated with both EI24 and CHEK1 expression but were not associated with prognosis. There were no statistically significant eQTLs for rs148760487 and rs114860916 in normal breast epithelium. None of the three SNPs had statistically significant eQTLs in tumors from the TCGA study. We also explored the association between gene expression for all genes in the 1 Mb region spanning the associated SNPs and breast cancer-specific mortality using KM plotter (36). Data were available for 575 ER-negative breast cancer patients. CHEK1 expression was not associated with relapse-free survival in ER-negative case patients (HR = 0.86, 95% CI = 0.65 to 1.12, P = .25) (Supplementary Figure 8, available online) but had a statistically significant association in ER-positive case patients (HR = 1.59, 95% CI = 1.31 to 1.91, P = 1.2 × 10^-6). EI24 expression was associated with relapse-free survival in both ER-positive case patients (HR = 0.75, 95% CI = 0.63 to 0.90, P = .002) and ER-negative case patients (HR = 1.38, 95% CI = 1.07 to 1.77, P = .01). It is interesting that the direction of association is consistent: The risk allele G of rs2059614 is associated both with poor breast cancer-specific survival in ER-negative case patients and with higher levels of EI24 expression in normal breast epithelium, which in turn is associated with poorer relapse-free survival in breast cancer. Expression of neither of the genes near rs148760487 was associated with relapse. Both top SNPs lie in putative enhancer sequences for which promoter interactions have been predicted (Supplementary Figure 9, available online) (37,38). IFIH or FAP might be the target of rs148760487, and EI24 might be the target of rs2059614 because the SNP is in an enhancer in endothelial cells that is predicted to regulate EI24.

Figure 2.
Quantile-quantile (Q-Q) plot for the combined GWAS and COGS analyses for estrogen receptor (ER)-negative cases. The y-axis represents the observed -log10 P value, and the x-axis represents the expected -log10 P value. The red line represents the expected distribution under the null hypothesis of no association. All statistical tests were two-sided.

Discussion
This is the largest genetic association study of breast cancer prognosis to date. We identified one new locus (rs148760487 at 2q24.2) associated with breast cancer-specific survival in all breast cancer and one new locus (rs2059614 at 11q24.2) associated with breast cancer survival in ER-negative case patients at genome-wide levels of statistical significance. However, both these associations were based on imputed genotype data. Genotyping a subset of the case patients confirmed that the quality of the imputation was reasonable, but for one SNP (rs148760487), the association in the subset of samples with both genotyped and imputed data was weaker for the genotyped data. Thus we are less confident that this represents a true positive. On the other hand, as would be expected for a true positive, the association of rs2059614 was stronger for the genotyped data than for the imputed data, suggesting that this is a robust association.
Two genes lie within the 1 Mb region on chromosome 2 spanning rs148760487: KCNH7 and BC042876. KCNH7 encodes a voltage-gated potassium channel with diverse functions and has no obvious role in cancer. BC042876 is a noncoding RNA gene with no known function. Another SNP in the same region, rs1424760, has been reported to be associated with serum phospholipid levels, but this SNP is only weakly correlated with rs148760487 (r^2 = 0.11).
There are 18 genes in the genomic region 500 kb on either side of rs2059614 on chromosome 11 (Supplementary Table 5, available online). Several of these are known to be involved in processes relevant to cancer, such as cell death and DNA damage responses. Of particular interest are EI24 and CHEK1, as expression of both of these in normal breast epithelium is associated with rs2059614 genotype. Additionally, expression of EI24 is associated with relapse-free survival in both ER-positive case patients and ER-negative case patients. Furthermore, this genomic region, 11q24, is frequently altered in cancers.
Genome-wide association studies with large-scale replication have been extremely successful in identifying multiple variants associated with many different phenotypes. For example, more than 70 common variants are known to be associated with an altered risk of breast cancer (5,6). In contrast, this study of breast cancer prognosis has identified just two variants associated at genome-wide statistical significance. There are several possible reasons for this difference. Despite the large sample size used in these analyses, the power to detect association with breast cancer-specific survival is only modest (see Supplementary Figure 10, available online). All of the common alleles associated with disease susceptibility confer relative risks of less than 1.2, and most are associated with relative risks of less than 1.1. Alleles such as these can be detected using case-control studies with a total sample size of approximately 100 000 (5). However, our analyses, based on 2900 breast cancer deaths, had limited power to detect alleles conferring hazard ratios of less than 1.2. Power to detect an allele with a hazard ratio greater than 1.5 was good (60% power if the MAF = 0.05, 100% power if the MAF > 0.1), suggesting that few such alleles are likely to exist.
Another issue affecting our ability to detect associations with prognosis is the heterogeneity of the phenotype. A wide variety of factors influence survival time after diagnosis, including tumor biology and treatment. Breast cancer is a heterogeneous disease, and different disease subtypes have different clinical outcomes (39-41). Restricting the analyses to specific subtypes in addition to ER status would reduce this heterogeneity, but the sample size would also be greatly reduced as subtype-specific information is not available for all case patients in these analyses, and some subtypes are relatively uncommon.
Our findings provide support for the hypothesis that germline genetic variation influences outcome after a diagnosis of breast cancer. Identification of novel germline genetic markers of breast cancer prognosis may help to elucidate molecular mechanisms of tumor progression and metastasis. Ultimately this may lead to the identification of new targets for therapeutic interventions. It may also lead to insights into mechanisms driving the differential response to adjuvant therapies and thereby enable improved targeting of therapy. In the clinical setting, germline markers of prognosis could be used to enhance risk stratification and provide patients with information about their prognosis in order to identify those patients most likely to benefit from adjuvant therapy. However, even studies larger than ours will be required in order to meet the challenge of identifying additional loci. Genotyping samples from clinical trials may prove to be particularly useful, but it is clear that data from multiple studies will need to be combined if there are to be further successes in this field.

Funding of constituent studies (these are listed by funding agency, with each grant number in parentheses): Academy