Identification of a gene expression signature associated with breast cancer survival and risk that improves clinical genomic platforms

Abstract Motivation Modern genomic technologies allow us to perform genome-wide analysis to find gene markers associated with the risk and survival in cancer patients. Accurate risk prediction and patient stratification based on robust gene signatures is a key path forward in personalized treatment and precision medicine. Several authors have proposed the identification of gene signatures to assign risk in patients with breast cancer (BRCA), and some of these signatures have been implemented within commercial platforms in the clinic, such as Oncotype and Prosigna. However, these platforms are black boxes in which the influence of selected genes as survival markers is unclear and where the risk scores provided cannot be clearly related to the standard clinicopathological tumor markers obtained by immunohistochemistry (IHC), which guide clinical and therapeutic decisions in breast cancer. Results Here, we present a framework to discover a robust list of gene expression markers associated with survival that can be biologically interpreted in terms of the three main biomolecular factors (IHC clinical markers: ER, PR and HER2) that define clinical outcome in BRCA. To test and ensure the reproducibility of the results, we compiled and analyzed two independent datasets with a large number of tumor samples (1024 and 879) that include full genome-wide expression profiles and survival data. Using these two cohorts, we obtained a robust subset of gene survival markers that correlate well with the major IHC clinical markers used in breast cancer. The geneset of survival markers that we identify (which includes 34 genes) significantly improves the risk prediction provided by the genesets included in the commercial platforms: Oncotype (16 genes) and Prosigna (50 genes, i.e. PAM50). Furthermore, some of the genes identified have recently been proposed in the literature as new prognostic markers and may deserve more attention in current clinical trials to improve breast cancer risk prediction. Availability and implementation All data integrated and analyzed in this research will be available on GitHub (https://github.com/jdelasrivas-lab/breastcancersurvsign), including the R scripts and protocols used for the analyses. Supplementary information Supplementary data are available at Bioinformatics Advances online.


Introduction
Breast cancer (BRCA) is the most prevalent type of tumor in women in Europe with more than 500 000 cases diagnosed per year. The accurate diagnoses and treatment are critical to improve the disease survival. BRCA treatment is determined by a standard categorization in four groups of tumors considering three main clinical biomarkers (Saini et al., 2011): the estrogen receptor (ER; protein encoded by gene ESR1); the progesterone receptor (PR; protein encoded by gene PGR); and the human epidermal growth factor receptor-2 (HER2; protein encoded by gene ERBB2). These three biomarkers are regularly analyzed in hospital clinicopathology units using immunohistochemistry (IHC). These biomolecular factors are also commonly used to facilitate categorization into four standard groups of breast cancer: Luminal A, Luminal B, HER2-enriched and triple negative. Some proliferation markers, such as AURKA or MKI67, are also usually measured in the pathology units to improve risk estimation and treatment. However, the determination of these clinical markers by IHC suffers from frequent errors or inaccuracies (Li et al., 2010) and the groups provided are frequently too heterogeneous or intermediate in terms of categorical expression of ER, PR or HER2 (Bartlett et al., 2016;Mertins et al., 2016;Venet et al., 2011).
Modern genomic medicine based on omics technologies provides a new approach to the study of cancer by detecting the state of thousands of genes simultaneously. Moreover, it renders a way to interrogate the role of any gene or gene product as a potential biomarker associated with patient risk, prognosis or outcome. Several authors have proposed gene signatures related to the prognosis and survival of breast cancer (Cun and Frö hlich, 2012;Manjang et al., 2021). However, as cancer is a complex and heterogeneous disease, the consensus among them is quite small and they are frequently sample dependent (Mertins et al., 2016). In this regard, some commercial platforms have been developed based on gene expression detection for the prognostic analysis and prediction of breast cancer risk, such as Oncotype (https://www.oncotypeiq.com/) and Prosigna (https:// www.prosigna.com/), which are already available in the clinic. Despite this advance, the overlap between the gene signatures and the agreement between the risk predictions for each patient, provided by both platforms, is often very different or uneven (Bartlett et al., 2016;Mertins et al., 2016); and they function as a black box where decisions cannot be interpreted in terms of clinical phenotypes and do not show the correlation with defined biomolecular markers (Manjang et al., 2021). Therefore, the discovery and validation of robust and reproducible survival markers associated with the tumor risk is a critical step to achieve better disease prognosis and treatment.
Several authors have addressed the instability and irreproducibility of gene expression-derived prognostic signatures in breast cancer (Bartlett et al., 2016;Cun and Frö hlich, 2012;Ein-Dor et al., 2005). Within this problem, some integrative approaches attempted to combine protein interactome network information with gene expression data to more accurately predict prognosis (Das et al., 2015). Others are based on ensemble methods (Abeel et al., 2010;Gong et al., 2018), that try to reduce the variance of the estimator considering different feature selection algorithms or different preprocessing methods.
In this work, we follow the approach of ensemble methods, but focusing on the use of bootstrap techniques that help to reduce the dependency of the gene signature on a particular sample set. A relevant objective in this approach was that the survival markers obtained were biologically interpretable (i.e. were specific genes identified as biomarkers of prognosis), and were measurable in their association with the clinical factors and categories that characterize each tumor sample. Therefore, we propose a robust framework based on a bootstrap strategy to discover a set of reproducible survival markers that best correlate with the IHC clinical factors ER, PR and HER2 (that are determined in routine clinical practice in breast cancer).
The specific computational strategy and algorithms used are described in Section 2, and applied to two independent breast cancer datasets that integrate a large number of primary tumor samples (1024 samples and 879 samples). As a main result, we report that the proposed set of survival markers improves the risk prediction and the stratification of breast tumors provided by the genes included in the genomic platforms Oncotype and Prosigna.

Datasets description
Survival markers are often obtained from studies with small sample size. This favors the instability of the gene signatures derived. In this research, we have built two independent large datasets that integrate gene expression, survival and clinical data coming from several studies and public repositories. The first integrated dataset corresponds to a collection of data series of breast cancer tumor samples that include genome-wide expression data (obtained with Affymetrix highdensity oligo microarrays) and survival data with clinical information about the tumors. All the studies integrated in this dataset have been retrieved from Gene Expression Omnibus (GEO; https://www. ncbi.nlm.nih.gov/geo/) and they include the following information: (i) Gene expression platform used: GeneChips from Affymetrix Human Genome U133A and U133Plus2.0; (ii) Overall survival (OS) data available: event time t i and status, censored d ¼ 1 or not censored d ¼ 0; (iii) In at least a large part of the samples of the cohort, we had the values for the standard BRCA clinical markers determined by IHC: ER, PR and HER2. Table 1 shows the survival studies retrieved from GEO that have been integrated in our first dataset. This dataset, after normalization and data control, included 1024 BRCA tumor samples: 380 with information about the protein expression values determined by IHC of the 3 clinical breast cancer markers (ER, PR and HER2); and the rest, 644 samples, with only genome-wide expression data and survival data.
As we are integrating samples obtained under different protocols and conditions, the gene expression signal is not directly comparable due to the batch effect. The batch effect arises when different samples cluster together just because they belong to the same study. To avoid this problem, we applied the frozen RMA (fRMA) normalization algorithm proposed by Irizarry's group (McCall et al., 2010). We found that although this algorithm outperforms other approaches it should be adapted for our particular problem. The fRMA algorithm considers an empirical vector that gives different weight to each batch adjusting the expression to be comparable for different datasets. Weight vectors provided by Irizarry's group cannot be applied to our problem, because it does not meet the same requirements. To overcome this problem, we designed different approaches, changing the number of subsets for each series, and the number of samples for each subset. This problem is explained in more detail in a study previously published by our group (Martinez-Romero et al., 2018). After an extensive exploration of the data and quality control of the gene expression signal in the different batches, we selected a randomized sample size of 5 for each mini-subset. The number of mini-subsets for each series varied depending on the number of samples as explained in Supplementary Table S1.
As indicated above, we also used for this study a second independent dataset of BRCA tumor samples with full genome-wide expression data (obtained by RNA-seq technology) and survival data. This set was used in the validation of the gene signature (i.e. the list of selected genes) that we identified as good markers of survival with our first dataset. This dataset was made up selecting samples from The Cancer Genome Atlas (GDC Data Portal, https://portal. gdc.cancer.gov/). After cleaning and filtering the data, we selected and built a normalized set of 879 samples.

Identification of survival markers associated with IHC clinical markers of BRCA
In this section, we introduce a robust framework to discover a small but very significant subset of IHC-related genes that are also good markers of risk and survival in breast cancer. Figure 1 describes our approach. First, a robust differential expression algorithm (SAM) is applied to select the most variable genes and remove noisy features. Next, for each IHC marker (i.e. ER, PR and HER2, standard clinical markers in BRCA) a multivariate feature selection strategy is implemented based on an ensemble of linear classifiers (see Algorithm 1). The genes selected are those associated to the clinical variables with higher stability. Next, the gene markers for each clinical variable are ranked according to their ability to predict the risk and survival. To this aim, we have developed bootstrapped versions of univariate and multivariate feature selection algorithms based on Cox regression models. For each list, the genes more correlated to survival are selected giving rise to a subset of good survival markers associated to the standard clinical variables. Finally, the risk score is predicted based on the robust gene signature derived. An optimization algorithm is also introduced to stratify the patients in two risk groups.
The following paragraphs describe each step of the framework in detail.
The base of the multivariate feature selection strategy is the elastic-net algorithm proposed by (Friedman et al., 2010). Although this algorithm is quite robust, some researchers (Das et al., 2015) have pointed out that the genes selected in cancer transcriptomic studies are often quite unstable and sample dependent. To avoid this problem, we improve the robustness of our algorithm using a bagging strategy (Mbogning and Broët, 2016). An ensemble of linear classifiers is trained for feature selection and diversity among them is induced by bootstrap resampling. Several authors have applied ensemble methods in feature selection to improve the robustness of the gene markers found (Abeel et al., 2010;Gong et al., 2018;Zhao et al., 2011). The ensemble of lists is combined by selecting the genes associated to the IHC variables with highest stability.
The elastic-net method is applied to the feature selection. The elastic-net is a linear logistic regression model with a regularization term that is between Lasso and Ridge Regression (Friedman et al., 2010). This kind of predictor will shrink most of the coefficients to zero keeping only genes associated to the clinical variable. For simplicity, let G be the clinical variable that takes values in G ¼ f1; 2g.
The model can be extended easily for more than two categories. The logistic regression assumes: where PrðG ¼ ijxÞ is the a posteriori probability of class i and b is the vector of linear coefficients associated with the predictors. The model is adjusted by maximizing the regularized binomial likehood: (2) P a ðbÞ is a regularization term that plays a critical role to avoid overfitting. For the elastic-net algorithm this term is defined as: The a parameter determines the type of regularization considered and has a strong impact on the solution. Thus, for a ¼ 0, the model is equivalent to Ridge regression. The solution is dense and a large number of b i will become small but not equal to zero. For a ¼ 1, the model reduces to Lasso. The solution is sparse and most of the b i will become zero. The not null coefficients correspond to the predictor variables associated to the class label. When a subset of genes related to the clinical variable is slightly correlated, only one is chosen randomly. When a is between zero and one, most of the relevant genes related to the class variable are kept, although they may be slightly correlated among them. This improves the robustness of the subset of genes selected. Finally, k is a regularization parameter that should be determined by nested cross-validation to avoid overfitting. Nested cross-validation implements a double loop of cross-validation. In the inner loop, the optimal regularization parameter is chosen to maximize the partial likehood deviation in logscale. In the outer loop, the area under the receiver operating characteristic curve is estimated.
We have proposed a methodology to identify and select a set of genes related to the standard clinical variables in BRCA. Now, a relevant question for clinicians is which ones are significantly associated to the prognosis and survival of the patient. To address this, two complementary strategies are proposed. The first one is univariate and based on a robust version of the univariate Cox regression approach (Spirko-Burns and Devarajan, 2020). This strategy is quite robust to the particular training set of patients considered, but does not take into account the cooperative work among the genes, that is frequent to carry out a specific biological function. The second one is based on a multivariate Cox regression approach (https://rdrr.io/cran/uniCox/) proposed by (Tibshirani, 2009) and considers additive cooperation among the genes.
The univariate Cox proportional hazard regression algorithm is a linear technique that models the relative risk of patients. It does not assume a particular distribution for the hazard function and is less sensitive to the small sample size problem than other nonparametric approaches. Although the univariate Cox regression has been widely applied to gene selection and risk prediction in the literature, the stability and reproducibility of the list of genes obtained should be improved. To this aim, we have developed a bootstrapped version of the original Cox regression, following the same approach of Algorithm 1. In particular, an ensemble of univariate Cox regression models is trained for each gene. Diversity among the risk predictors is induced by bootstrapping. For each bootstrap sample B, a ranked list of genes is obtained considering a given metric f ðBÞ i for gene i. Different relevance metrics may be considered, such as the regression coefficient b i , the log-rank statistic or the Wald statistic for the b i . Finally, a summary statistic is computed along the ensemble of lists for each gene, based for instance on the median f i ¼ medff Abeel et al., 2010). Regarding the multivariate approach, we have considered the Cox proportional hazard regression model with L 1 norm penalty proposed by Tibshirani (2009). Let, X 1 ; X 2 ; . . . ; X p be the expression levels of the p genes. For each sample i, let ðt i ; d i Þ be the survival time and the censoring indicator for patient i respectively. The hazard of death at time t given the observed values of the gene expression ðkðtjxÞÞ can be modeled using a multivariate Cox regression model as: where k 0 ðtÞ is an unspecified baseline hazard function, ðb 1 ; b 2 ; . . . ; b p Þ are the regression coefficients and ðX 1 ; X 2 ; . . . ; X p Þ are the gene expression levels. f ðXÞ ¼ b T X is the linear risk score for the corresponding patient. The b j coefficients are estimated by maximizing the partial loglikelihood with L 1 norm penalty: where R k is the set of patients at risk for time t k and k is a regularization parameter that is estimated by 10-fold cross-validation. This parameter allows us to shrink most of the b j coefficients to zero selecting relevant risk markers.
To improve the robustness of the multivariate Cox regression and the stability of the risk markers selected, a double-nested cross-validation is applied. The inner loop estimates the optimal k regularization parameter for each predictor by 10-fold cross-validation and using a grid search strategy. In the outer loop, 10-fold cross-validation predicts the risk score considering nonoverlapping training and test sets. This strategy helps to avoid the overfitting, improving the risk prediction and the stability of the gene markers selected.

Accurate risk prediction and patient stratification
The risk prediction is carried out based on a small and robust subset of risk markers selected that correlate well with clinical variables.
The risk curve is estimated for the training set using double-nested cross-validation and the multivariate Cox regression introduced earlier. Next, the patients in the training set can be stratified in two risk groups (low and high). The usual method for the risk stratification is based in a heuristic threshold such as the median. This strategy does not perform well when the groups of risk are unbalanced. In this article, the optimal threshold is estimated by maximization of the logrank statistics. This method splits the patients in two risk groups that maximize the separability between the corresponding Kaplan-Meier curves. Below we provide a brief pseudocode outline that includes the steps of Algorithm 2 used for risk stratification.
Once the Multivariate Cox Regression is trained, it can be applied to predict the risk score for a new sample to test. To do this, we need to provide for this new query sample the gene expression profile for the subset of gene markers selected. Following this protocol, a new sample to test (from a new patient) will be assigned to the low-risk group if the risk score is smaller than the optimal threshold estimated based on the training set. Conversely, it will assigned to the high-risk group if the risk score is larger than the optimal threshold.

Results
Survival studies in cancer rely frequently on tumor samples of small size. Therefore, the biological findings are sample dependent and cannot be generalized well. To avoid this problem, in this work the list of survival markers is obtained using a first dataset of 1024 tumor samples (described in Section 2.1 and collected and compiled for this research). We also use a second independent dataset of 879 tumor samples to validate the results. Survival data are available for all the samples in both datasets, and the clinical markers (ER, PR and HER2) have been determined by inmunohistochemistry for a subset of 380 patients of the first set. This subset has been used as the initial training set to identify the list of survival genes related to the clinical markers and to train a multivariate risk predictor. The rest of the 644 samples of the first set are used to validate the list of genes and to evaluate the risk prediction. Therefore, after selecting the candidate survival marker genes, we use the second independent cohort of 879 tumor samples for validation of the prognostic power of the genes (selected as survival markers) and for the final prediction of the risk assigned to each tumor. We have also performed done a comparison of the gene signature discovered in this work with the gene sets included in Oncotype and Prosigna/PAM50 clinical genomic platforms. Notice that Prosigna and PAM50 include the same list of 50 genes in their respective platforms (Wallden et al., Algorithm 2 Optimal risk stratification. 1: Initialize: Let N be the number of patients and R j the risk score for patient j. 2: Rank patients according to the risk score. 3: for j ¼ 1 to N do 4: Define threshold: h j ¼ R j . Let G be the group variable defined as: 5: Compute the P-value using the log-rank statistic and the risk groups induced by G. 6: end for 7: Build the log-rank P-value curve versus the risk threshold value. The lowest P-value defines the optimal threshold c j . 8: Return risk threshold c j and classify the patients in two groups 9: End 2015). PAM50 is a well-known set of 50 genes that were found and identified by analyzing many genome-wide expression profiles of breast cancer patients (mainly by Charles M. Perou and collaborators), and included in a classifier of breast cancer tumors (i.e. PAM50) that defined five gene expression-based intrinsic subtypes: Luminal A, Luminal B, HER2-enriched, Basal-and Normal-like (Parker et al., 2009). PAM50 was successfully used to assess both prognostic and predictive value to adjuvant hormonal therapy in breast cancer patients (Chia et al., 2012). In Supplementary Table  S2, we present the list of genes included in the commercial platforms that are currently used in therapeutic decisions in breast cancer: Prosigna/PAM50 (50 genes) and Oncotype (16 genes). Thus, throughout this article, we will refer to Prosigna and PAM50 indifferently. We also include in Supplementary Table S2 the list of genes developed in this work and used in our analysis (i.e. the signature of 34 genes). In our comparative study, when indicating Prosigna or PAM50, we used the expression of 49 encoding genes present in these platforms, excluding one gene (KNTC2, that encodes a kinetochore protein) because we had not signal for this gene in the expression data matrix of the two sets used in our study (microarrays or RNA-seq). In any case, we tested that the experimental results did not change by discarding one gene. As a briefing or summary of the results of this study, we tried to answer three relevant questions: (i) Is there a small subset of genes that display their expression level associated with breast cancer prognosis in the same way as each of the three proteins regularly tested in the clinic by IHC?; (ii) Is there a robust and stable subset of gene survival markers that are able to improve the risk prediction and stratification provided by the gene sets considered in the commercial genomic platforms regularly used in the clinic to breast tumors?; (iii) Could some of these genes be considered novel prognostic markers because they better support the three IHC protein markers (ER, PR and HER2) that are regularly tested in the clinic to aid in therapeutic decisions?

Finding robust survival markers associated with the three standard IHC clinical markers used in BRCA
We identified a list of genes correlated with the three IHC standard tumor markers (ER, PR and HER2) most commonly used in the clinic to categorize breast cancer. To achieve this, we developed and applied Algorithm 1 based on an ensemble of elastic-net classifiers. In this algorithm, the number of bootstrap iterations N B were ¼ 100 and for the regularization term the a was ¼ 0:8. We selected a list of genes significantly associated with each of the three main clinical markers with stability index >0.10. In a further step, we carried out univariate and multivariate survival analysis of the genes using the methodology presented in Section 2.2. Finally, we selected for each clinical marker the genes more correlated with survival according to the log-rank statistics that were 16 genes for ER; 10 genes for PR (8 of them the same as ER); and 14 genes for HER2. The gene markers for HER2 did not show a highly significant association with survival (considering log-rank statistics), probably because most patients are treated with anti-HER2 drugs after diagnosis. Tables 3-5 show the genes selected, that in total are 32 distinct unique genes associated with ER, PR and HER2 with good stability. The genes, in alphabetical order by their gene symbols, are AGR3, CA12, CCDC170 (labeled C6orf97 in Tables 3 and 4), CDK12 (labeled CRKRS in Table 5), CNKSR1, CWC25, DNALI1, ERBB2, ESR1, GATA3, GFRA1, GRB7, KLC4, KMO, MED1, MIEN1 (labeled C17orf37 in Table 5), NANOS1, NAT1, NME3, PGAP3, PGR, PNMT, PSMD3, SIRT3, SLC15A2, SLC39A6, SOX11, STARD3, SUSD3, TBC1D9, TFF1 and ZNF552. Two extra genes that are standard indicators of cell proliferation were added to this list: AURKA and MKI67. In this way, we set up a final gene pool that includes 34 selected genes. Next, we checked whether the list of 34 selected genes was able to improve the prediction of IHC marker status made with individual genes (i.e. if the 34 genes could be better predicting whether the IHC markers are positive or negative in a given queried tumor). Table 2 shows the accuracy for the prediction of the ER, PR and HER2 status based on the signal of each single gene or based on the signal of the 34 genes. The first row corresponds to the 34-gene signature proposed in this study. The next rows show the accuracy based only on the gene that best represents the clinical marker: ESR1 gene for the ER, PGR gene for the PR and ERBB2 gene for the HER2. The  95% confidence intervals are provided to assess the significance of the differences among the prediction accuracy. These results show that the approach proposed in this article outperforms the predictors based on a single gene for all the three clinical markers considered. This suggests that the effect of these variables is better modeled by the cooperative work of several genes than by each single gene. Once we have a list of 34 gene markers associated with the IHC clinical status, we performed a full analysis of these genes with respect to tumor survival and risk prediction, using the univariate or multivariate methods presented in Section 2.2. Tables 3-5 provide the genes selected associated with the status of each one of the three IHC markers: ER, PR and HER2. The tables summarize the results using univariate and multivariate analysis. The Log-rank P-value is the P-value for the nonparametric log-rank test. The optimization procedure employed is similar to the Algorithm 2. This is an univariate test that evaluates the ability of a single gene to stratify a set of patients by the expression level in two groups of significantly different survival. The b UniCox is the estimation for the expected value of the regression coefficient in the univariate Cox regression model introduced in Section 2.2. The P-value UniCox is the Wald statistic P-value that evaluates if the regression coefficient is significantly different from zero. Significant P-values mean that there is an association between the gene expression and the risk. Therefore, the genes can be ranked according to their individual effect on the patient risk and survival considering the univariate metrics P-value UniCox or Log-rank P-value. Notice that as we have mentioned in Section 2.2 all the metrics have been computed using a bootstrap strategy to increase the robustness of the results.
Regarding the multivariate approach, the b MultiCox is the estimation for the expected regression coefficient in the multivariate Cox regression model with L 1 norm penalization (introduced in Section 2.2). Larger absolute values correspond to genes that have strong impact on the patient risk. This metric takes into account the cooperative work between genes. The r b MultiCox is the standard deviation of the regression coefficient. The standard deviation will help to determine if the jb i j are significantly greater than zero.
From the results presented in Table 3, we observed that the P-value UniCox for the ESR1 is significant and the b UniCox < 0 (negative), indicating that the over-expression of this gene reduces the risk in breast cancer. This is expected since the oncologists know that breast tumors that are ER positive have a better prognosis than tumors that are ER negative. Furthermore, the results in this table show that the expression of genes TBC1D9, SUSD3, CCDC170(C6orf97) and NME3 is also associated with low risk, and they show a stronger association with good survival than gene ESR1. Therefore, they can be proposed as new complementary markers to the ER clinical status. Finally, the genes found for HER2 did not show a highly significant association with survival, which may be due to the fact that these patients are usually treated with anti-HER2 drugs. However, the b i are often positive which is consistent with the empirical evidence that overexpression of HER2 increases patient risk.
3.2 Risk prediction of BRCA tumors based on gene expression signatures: 34 genes here proposed, Oncotype genes and Prosigna genes A relevant application of the gene expression signatures measured in the breast tumors is to achieve an accurate prediction of the risk expected for each patient, and a consequent stratification of the tumors. To evaluate this, we have applied the multivariate Cox regression (as described in Section 2.2). We have compared the risk prediction based on the 34 genes signature proposed and the risk prediction and survival curves obtained using the same algorithm with the lists of genes that are included in two commercial genomic platforms that are already used in the clinic: Oncotype (including 16 genes) and Prosigna/PAM50 (including 49 genes). The lists indicating the names of all these genes are included in Supplementary Table  S2. To check the generalization ability of these three gene signatures we tested them using an independent breast cancer dataset of 879 samples. Figure 2 shows the Kaplan-Meier survival curves obtained after the splitting of the 879 tumors in two groups: one of low risk (that included 519 samples) and another of high risk (that included 360 samples). These two risk groups were obtained considering the expression values in the tumors of the 34 genes signature. The same analysis was done using the 16 genes signature of Oncotype and the 49 genes signature of Prosigna/PAM50. The corresponding Kaplan-Meier survival curves are included in Supplementary Figure S2a and b. We remark that the 34-gene signature proposed in this work generalizes very well over the validation dataset of 879 RNA-seq  samples, that is completely independent to the first data set. Moreover, as indicated before, these three gene signatures were also compared with the 644 test samples of the first dataset and the corresponding Kaplan-Meier survival curves are presented in Supplementary Figure S1a-c. Table 6 provides a summary of the results indicating the logrank statistic P-value and the hazard ratio (HR) for the risk groups obtained based on each of the three signatures tested: the 34 genes signature and the ones including the genes used by Oncotype or by Prosigna/PAM50 platforms. The table shows that the signature of 34 genes performs significantly better than Prosigna/PAM50 and Oncotype, with the log-rank statistic and the HR significantly improved. This means that the risk prediction and the stratification of patients in two prognostic groups are significantly better considering the 34 genes signature. Since the results are obtained using an independent dataset, different from the one used to produce the signature, we can suggest that the gene expression signature found is quite stable and reproducible.
Finally, several authors have studied the relationship among some survival markers discovered in this research and breast cancer.  Althobiti et al., 2021) have been linked to breast cancer in different studies and several of these genes have also been reported to interact or be associated with ER expression and function. Furthermore, as can be seen from the included references, several of these genes have been correlated with patient survival and prognosis in breast cancer studies. In some cases, upregulated expression of these genes was associated with prolonged survival and thus good prognosis, mainly in luminal breast cancer subtypes. This is the case of genes AGR3, CCDC170 and NDPKC. CA12 has also been correlated with a chemoresistance phenotype in cancer; in particular, silencing this gene has been shown to reverse paclitaxel sensitivity in drug-resistant breast cancer cells (Huang et al., 2021). In this respect, the expression of multidrug resistance gene 1 (MDR1, also called ABCB1) in breast tumors has been associated with treatment and with a poor response to chemotherapy (Trock et al., 1997).
With respect to genes CDK12 (CRKRS), CWC25 (CCDC49), GRB7, MIEN1 (C17orf37), PNMT and SIRT3, they have been shown to be linked to HER2 receptor and cancer (Oh et al., 1999). Several of these genes have also been included in gene prognostic signatures published for HER2-positive breast cancer (Knauer et al., 2010;Staaf et al., 2010). In addition, the relationship of these genes with HER2 in breast cancer has already been demonstrated for several of them. For example, CDK12 positivity was correlated with HER2 positivity (Naidoo et al., 2018); and MIEN1 overexpression facilitates migration, invasion and metastasis in breast cancer (Zhao et al., 2017) and it is negatively correlated with disease free survival (Kushwaha et al., 2019) mainly in HER2-positive subtypes.

Discussion
Accurate prediction of tumor risk based on easy-to-determine gene expression signatures associated with survival may be a critical step for the advance in breast cancer prognosis and treatment. Many gene survival signatures reported in the literature are unstable (Gong et al., 2018;Haury et al., 2011) and it is frequent that risk predictors provided by companies to physicians and oncologists lack transparent biological and biomolecular explanation or interpretation.
To improve the robustness of the survival gene signatures some authors have integrated gene expression data with gene interaction networks, protein-protein interaction networks or pathways information (Cun and Frö hlich, 2012;Das et al., 2015). However, the integration of uncertain biological knowledge a priori can introduce a negative bias in the final model. Our approach is more related with ensemble methods (Gong et al., 2018;Zhao et al., 2011) that combine several predictors to improve the robustness of the selected signature by aggregating multiple gene lists. However, in some cases, the signatures proposed by multiple approaches do not allow biological interpretation in terms of the molecular markers that determine clinical outcomes (Manjang et al., 2021). In this regard, genomic platforms such as Oncotype, Prosigna/PAM50 incorporate heuristic clinical knowledge to derive a more effective list of markers for therapeutic decisions. However, they work as black boxes because they do not reveal the clinical parameters considered, and therefore the correlation between the included individual gene markers and risk is unclear. Despite this, they have been successfully applied to predict risk and survival, providing a stratification of patients that helps the medical decision-making.
Our framework in this study identifies a set of survival markers that have a clear biological interpretation in terms of the three main biomolecular factors that determine the clinical prognosis in breast cancer: ER, PR and HER2. To ensure the stability and reproducibility of our results, we analyzed in this work two independent datasets with large number of samples. Furthermore, a robust feature selection strategy is developed based on bootstrap and ensemble methods. In this way, a 34 genes signature is proposed, improving significantly the risk prediction and patient stratification of the list of genes present in the commercial genomic platforms: Oncotype (16 genes) and Prosigna/PAM50 (49 genes).

Conclusion
The discovery and validation of survival biomarkers associated with a given phenotype or to a specific clinical variable is a critical step to achieve better disease prognosis. However, gene signatures proposed in the literature suffer frequently from the instability and irreproducibility problem and cannot be correlated well with clinical phenotypes.
In this article, we have developed a bioinformatic framework for the identification and validation of a robust and reproducible subset of gene survival markers that are related to clinical phenotypes in breast cancer. Algorithms are also introduced for the accurate risk prediction and optimal stratification of patients into risk groups. Combining several resources, we have integrated and analyzed two large breast cancer datasets that include genome-wide expression and survival data, as well as clinical pathological and phenotipic information. Furthermore, we have applied the framework in this dataset to discover a robust list of survival markers that correlate well with clinical IHC markers used in breast cancer. The results suggest that our signature of 34 genes improves the risk prediction and the patient stratification obtained using the genes included in two widely used genomic platforms.
Several of the gene markers found have been studied in the literature and are related to the prognosis of breast cancer or to some of the characteristics used in the categorization of breast tumors. Therefore, we report that some of the discovered survival-associated genes exhibit properties worthy of attention as novel alternative risk markers to standard IHC variables used in the clinic.