Baolin Wu, Cancer outlier differential gene expression detection, Biostatistics, Volume 8, Issue 3, July 2007, Pages 566–575, https://doi.org/10.1093/biostatistics/kxl029
Abstract
We study statistical methods to detect cancer genes that are over- or down-expressed in some but not all samples in a disease group. This has proven useful in cancer studies where oncogenes are activated only in a small subset of samples. We propose the outlier robust t-statistic (ORT), which is intuitively motivated by the t-statistic, the most commonly used differential gene expression detection method. Using real data and simulation studies, we compare the ORT to the recently proposed cancer outlier profile analysis (Tomlins and others, 2005) and the outlier sum statistic of Tibshirani and Hastie (2006). The proposed method often has more detection power and smaller false discovery rates. Supplementary information can be found at http://www.biostat.umn.edu/~baolin/research/ort.html.
1. INTRODUCTION
Recently, Tomlins and others (2005) proposed the “cancer outlier profile analysis” (COPA) method for detecting cancer genes that show increased expression in a subset of disease samples. They argue that in the majority of cancer types, oncogenes have heterogeneous activation patterns; traditional analytical methods, for example, the t-statistic, which search for common activation of genes across a class of cancer samples, will fail to find such oncogene expression profiles. Instead, we should search for overexpression only in a subset of cases. Through applications to public cancer microarray data sets, they have shown that the proposed COPA can perform better than the commonly used t-statistic.
More recently, Tibshirani and Hastie (2006) proposed the outlier sum (OS) statistic to detect cancer gene outlier expressions. The OS and COPA are similarly defined using robust location and scale estimates of the gene expression values (more details in Section 2). Through simulation studies and applications, they have shown that the OS can perform better than the COPA, for example, having smaller false discovery rates (Benjamini and Hochberg, 1995).
In this paper, we consider statistical methods to detect cancer genes with a subset of over- or down-expressed outlier disease samples. Many methods have been proposed to detect differentially expressed genes (see, e.g., Dudoit and others, 2002; Troyanskaya and others, 2002). Among them, the t-statistic is the most commonly used. We will discuss several problems associated with the t-statistic for cancer gene outlier expression detection, which will motivate the development of the outlier robust t-statistic (ORT). We will further establish the connection of the OS, COPA, and ORT statistics to the t-statistic from a robustness consideration. Through simulation studies and applications to a public breast cancer microarray data set, we empirically evaluate and compare the different outlier detection statistics.
2. STATISTICAL METHODS
Consider 2-class microarray data, for example, cancer versus normal tissues. Let $x_{ij}$ be the observed expression values for samples $i = 1, \ldots, n$ and genes $j = 1, \ldots, p$. Without loss of generality, assume that the first $n_1$ samples are from the normal group and the last $n_2$ samples are from the cancer group, where $n_1 + n_2 = n$. In the following discussion, we assume that the outlier disease samples are overexpressed. Similar arguments will carry through to detect genes with down-expressed outlier disease samples.


The t-statistic is based on the assumption that all disease samples are overexpressed. In cancer gene outlier analysis, however, only a subset of the disease samples is assumed to be overexpressed. Intuitively, we want to make inference using only those overexpressed samples (outliers).
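The loss of power when only a few disease samples are shifted can be illustrated with a small numerical sketch. It assumes the usual pooled-variance two-sample form of the t-statistic; the function name `two_sample_t` and the toy expression values are hypothetical and chosen only for illustration.

```python
import math
import statistics


def two_sample_t(normal, disease):
    """Pooled-variance two-sample t-statistic (disease mean minus normal mean).

    Assumption: the standard equal-variance form; values are for one gene.
    """
    n1, n2 = len(normal), len(disease)
    mean1 = statistics.fmean(normal)
    mean2 = statistics.fmean(disease)
    ss1 = sum((x - mean1) ** 2 for x in normal)
    ss2 = sum((y - mean2) ** 2 for y in disease)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    return (mean2 - mean1) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


# Toy data for a single gene (hypothetical values).
normal = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2]
all_shifted = [x + 2.0 for x in normal]          # every disease sample overexpressed
one_outlier = normal[:-1] + [normal[-1] + 2.0]   # only one disease sample overexpressed

t_all = two_sample_t(normal, all_shifted)
t_one = two_sample_t(normal, one_outlier)
print(f"all shifted: t = {t_all:.2f}; one outlier: t = {t_one:.2f}")
```

With a single outlier, the mean difference shrinks and the pooled variance is inflated by the outlier itself, so the statistic falls sharply even though the outlier is as extreme as before.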
In the following, we first study the recently proposed COPA method (Tomlins and others, 2005) and the OS statistic (Tibshirani and Hastie, 2006) for detecting cancer gene outliers. We will make some intuitive connections between these 2 outlier detection statistics and the t-statistic. The t-statistic (2.1) will be studied from a robustness (against outlier) perspective, which shows its dependence on all disease samples and the inappropriate variance estimate. We then propose an ORT to remove the “all disease samples” dependence and appropriately reduce the outlier effects on the variance.
2.1 t-statistic, COPA and OS








2.2 Outlier robust t-statistic
Besides the inefficiency of the COPA statistic owing to its use of a fixed sample percentile, a second problem is that the median over all samples is not quite the right statistic to replace the normal sample mean. It might overestimate the normal group mean owing to contamination by disease samples if a majority of them have outlier expressions. A more intuitive and appropriate quantity might be, for example, the normal sample median.
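The contrast between the three outlier statistics can be sketched for a single gene as follows. This is an illustrative reading of the descriptions in the text and of the commonly cited definitions, not a reproduction of the paper's formulas: the percentile `r = 0.90`, the outlier cutoff of the 75th percentile plus the interquartile range, the 1.4826 scaling of the median absolute deviation, the linear-interpolation `quantile` helper, and all function names are assumptions.

```python
import statistics


def quantile(values, q):
    """Sample quantile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def mad(residuals):
    """Scaled median absolute deviation of already-centered residuals."""
    return 1.4826 * statistics.median([abs(r) for r in residuals])


def copa(normal, disease, r=0.90):
    """COPA-style statistic: a fixed percentile of the disease samples,
    centered and scaled by the median and MAD over ALL samples."""
    pooled = normal + disease
    med = statistics.median(pooled)
    return (quantile(disease, r) - med) / mad([x - med for x in pooled])


def outlier_sum(normal, disease):
    """OS-style statistic: sum over disease samples above an overall cutoff,
    again centered and scaled by the overall median and MAD."""
    pooled = normal + disease
    med = statistics.median(pooled)
    q25, q75 = quantile(pooled, 0.25), quantile(pooled, 0.75)
    cutoff = q75 + (q75 - q25)
    return sum(x - med for x in disease if x > cutoff) / mad([x - med for x in pooled])


def ort(normal, disease):
    """ORT-style statistic: outliers are judged against the NORMAL group only,
    centered at the normal median; the MAD centers each group at its own median."""
    med_n = statistics.median(normal)
    med_d = statistics.median(disease)
    scale = mad([x - med_n for x in normal] + [y - med_d for y in disease])
    q25, q75 = quantile(normal, 0.25), quantile(normal, 0.75)
    cutoff = q75 + (q75 - q25)
    return sum(y - med_n for y in disease if y > cutoff) / scale


# Toy gene: three of eight disease samples are overexpressed (hypothetical values).
normal = [-1.0, -0.5, -0.2, 0.0, 0.1, 0.4, 0.8, 1.1]
disease = [-0.9, -0.3, 0.0, 0.2, 0.5, 3.0, 3.4, 3.8]
scores = {"COPA": copa(normal, disease),
          "OS": outlier_sum(normal, disease),
          "ORT": ort(normal, disease)}
print(scores)
```

The only differences between `outlier_sum` and `ort` in this sketch are which samples define the center, the cutoff, and the scale, which is the point the paragraph above makes about contamination of the overall median.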






In the following, we use simulation studies and applications to a public breast cancer microarray data set to empirically evaluate and compare the detection power of the 4 previously discussed methods: the t-statistic, COPA, OS, and the proposed ORT.
3. SIMULATION STUDIES
Simulation studies are conducted to evaluate the power of various outlier detection statistics. We also compare their false discovery rates (Benjamini and Hochberg, 1995).
Suppose we have normal and disease samples. There are in total genes with their expression values simulated from the standard normal distribution. The first gene contains outlier disease samples with their expression values being added constant . For each simulated data, we can calculate the P-value for the first gene, which is the proportion of the other (null) genes with the absolute test statistics bigger than the first gene. The P-values from the simulations can be used to estimate the true/false-positive rates, that is, the sensitivity and 1 − specificity, which are then used to construct the receiver operating characteristic curve for power comparison.
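The P-value construction described above can be sketched as a small simulation. The group size `n`, gene count `p`, number of outliers `k`, shift `mu`, number of repetitions, and the use of the t-statistic as the test statistic are all placeholder assumptions for illustration, not the settings used in the study; the function names are hypothetical as well.

```python
import math
import random
import statistics


def two_sample_t(normal, disease):
    """Pooled-variance two-sample t-statistic (disease minus normal)."""
    n1, n2 = len(normal), len(disease)
    m1, m2 = statistics.fmean(normal), statistics.fmean(disease)
    ss = sum((x - m1) ** 2 for x in normal) + sum((y - m2) ** 2 for y in disease)
    return (m2 - m1) / math.sqrt(ss / (n1 + n2 - 2) * (1 / n1 + 1 / n2))


def simulate_p_value(rng, n=10, p=200, k=10, mu=4.0, statistic=two_sample_t):
    """One simulated data set: gene 0 carries k shifted disease samples.

    Returns the proportion of the p - 1 null genes whose absolute statistic
    exceeds that of gene 0.
    """
    abs_stats = []
    for j in range(p):
        normal = [rng.gauss(0.0, 1.0) for _ in range(n)]
        disease = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if j == 0:
            for i in range(k):
                disease[i] += mu  # add the constant to the outlier samples
        abs_stats.append(abs(statistic(normal, disease)))
    null_stats = abs_stats[1:]
    return sum(s > abs_stats[0] for s in null_stats) / len(null_stats)


def true_positive_rate(p_values, alpha):
    """Fraction of simulations in which gene 0 is called at threshold alpha."""
    return sum(pv <= alpha for pv in p_values) / len(p_values)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
p_values = [simulate_p_value(rng) for _ in range(20)]
tpr = true_positive_rate(p_values, alpha=0.05)
print(f"estimated true-positive rate at 0.05: {tpr:.2f}")
```

Sweeping `alpha` over a grid and plotting `true_positive_rate` against it gives one curve per statistic; swapping in another function for `statistic` compares methods on identical simulated data.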
Figure 1 shows the estimated true/false-positive rates based on 1000 simulations. In the extreme situation with only one outlier disease sample ($k = 1$), the OS statistic performs the best, the ORT has performance comparable to the OS, and the t-statistic and COPA have almost no detection power. When increasing to outlier disease samples, the ORT, OS, and COPA have similar power, all better than the t-statistic. For outlier disease samples, the ORT performs the best. The detection power of both the ORT and t-statistic increases with more outlier disease samples, while the performance of the COPA and OS decreases slightly when the outlier disease samples approach the full set (). Overall, the ORT performs the best. It seems to be able to adapt automatically to the unknown number of outlier samples and to combine the strengths of both the OS and t-statistic.
Figure 1. Detection power estimation based on 1000 simulations. There are disease and normal samples, and 999 null genes with their expression values simulated from the standard normal distribution. The first gene contains a subset of k outlier disease samples with their expression values added constant .
Next we evaluate and compare the false discovery rates of the 4 methods based on the simulation. We set of the genes as differentially expressed with outlier disease samples with their expression values being added constant . Figure 2 shows the estimated false discovery rates based on 1000 simulations for differentially expressed genes. Similar patterns as the true/false-positive rates estimation (see Figure 1) are observed. The ORT has the overall best performance with the smallest false discovery rates.
Figure 2. False discovery rate estimation based on 1000 simulations. There are disease and normal samples, and genes with their expression values simulated from the standard normal distribution. The first genes contain a subset of k outlier disease samples with their expression values being added constant . The x-axis is the positive rate: the proportion of genes called significant.
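In a simulation the truly differentially expressed genes are known, so the false discovery rate at each positive rate can be computed directly. The following is a minimal sketch of that bookkeeping for one simulated data set; the function name `fdr_curve` and the toy statistics are hypothetical, and averaging the curve over repeated simulations is left out.

```python
def fdr_curve(stats, is_true):
    """Empirical false discovery rate as a function of the positive rate.

    Genes are ranked by absolute test statistic; calling the top m genes
    significant gives positive rate m / p and FDR = (false positives) / m.
    Returns a list of (positive_rate, fdr) pairs for m = 1, ..., p.
    """
    p = len(stats)
    order = sorted(range(p), key=lambda j: abs(stats[j]), reverse=True)
    curve = []
    false_positives = 0
    for m, j in enumerate(order, start=1):
        if not is_true[j]:
            false_positives += 1
        curve.append((m / p, false_positives / m))
    return curve


# Toy example: 6 genes, 3 of them truly differentially expressed (hypothetical).
stats = [5.0, 4.0, 0.5, 3.0, -4.5, 1.0]
is_true = [True, True, False, True, False, False]
curve = fdr_curve(stats, is_true)
for positive_rate, fdr in curve:
    print(f"positive rate {positive_rate:.2f}: FDR {fdr:.2f}")
```

A method whose statistic ranks the true genes ahead of the null genes keeps this curve near zero until the positive rate reaches the true proportion of differentially expressed genes.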
Very similar patterns have been observed for . We also did the simulation studies for ; or ; and . We consistently observe that the ORT has the overall best performance. Complete simulation results are available at the supplementary web site (http://www.biostat.umn.edu/~baolin/research/ort.html).
In Section 4, we apply the 4 cancer gene outlier detection statistics to a public breast cancer microarray data set and empirically compare their performance.
4. APPLICATION TO THE BREAST CANCER MICROARRAY DATA
The breast cancer microarray data reported by West and others (2001) contain the expression levels of 7129 genes from 49 breast tumor samples. Each sample has a binary outcome describing the status of lymph node involvement in breast cancer: 25 tumor samples had no positive lymph nodes discovered and 24 tumor samples had identifiably positive nodes. The gene expressions, obtained from the Affymetrix human HuGeneFL GeneChip, can be downloaded from http://data.cgt.duke.edu/west.php. We normalize the data using quantile normalization (Bolstad and others, 2003) and then log transform the intensities for follow-up statistical analysis.
In the cancer gene outlier detection, we treat the lymph node–negative group as the normal class. We applied the t-statistic, COPA, OS, and the proposed ORT to detect genes with overexpressed disease samples and ranked the genes based on each test statistic. For the top 25 genes identified by each method, we mapped their Affymetrix identifiers to UniGene cluster identifiers using the Bioconductor (Gentleman and others, 2004) annotation package hu6800, and used these identifiers to search for relevant literature in PubMed. In total, 13 of the identified genes have been studied previously and shown to be related to breast cancer.
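The preprocessing step can be sketched as follows. This is a minimal illustration of quantile normalization followed by a log transform, not the Bioconductor implementation: the function name is hypothetical, ties are broken by position rather than averaged, base 2 is an assumed choice for the logarithm, and the intensities are made up.

```python
import math


def quantile_normalize(samples):
    """Force every sample (array) to share the same empirical distribution.

    `samples` is a list of equal-length lists of intensities, one list per
    array. Each value is replaced by the mean, across arrays, of the values
    holding the same rank. Assumption: ties are broken by position.
    """
    n_genes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    rank_means = [
        sum(s[r] for s in sorted_samples) / len(samples) for r in range(n_genes)
    ]
    normalized = []
    for s in samples:
        order = sorted(range(n_genes), key=lambda g: s[g])  # gene indices by rank
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = rank_means[rank]
        normalized.append(out)
    return normalized


# Three arrays, four genes each (made-up positive intensities).
samples = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(samples)
log_expr = [[math.log2(v) for v in s] for s in normalized]
print(normalized[0])
```

After normalization every array contains the same set of values, only assigned to different genes, so between-array differences in overall intensity no longer drive the per-gene statistics.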
Table 1 lists the confirmed breast cancer–related genes ranked in the top 25 for each outlier detection statistic. The ORT identified 8 genes, 5 of which were not selected by the other statistics. Another 5 genes were missed by the ORT but identified by the others. Also listed in the table is the ranking of each gene by the 4 test statistics. The genes identified by the OS were generally ranked high by the ORT. Among the genes identified by the ORT, some were ranked low by the OS but relatively higher by the t-statistic, for example, ATM and ERBB4, while several others were ranked low by the t-statistic but relatively higher by the OS, for example, AGTR1 and CASC3. It seems likely that the proposed ORT could combine the strengths of both the OS and t-statistic (see also Figures 1 and 2 in Section 3). Overall, the ORT had the best detection power.
Table 1. Genes ranked in top 25 by the outlier detection statistics and confirmed to be associated with breast cancer in previous studies. The last 4 columns also list the ranking of each gene by the 4 methods.
| Methods | Rank | UniGene ID | Gene name | t | COPA | OS | ORT |
|---|---|---|---|---|---|---|---|
| t | 18 | Hs.435561 | ATM | | 819 | 4296 | 7 |
| | 23 | Hs.338207 | FRAP1 | | 4507 | 4296 | 4376 |
| | 24 | Hs.487046 | SOD2 | | 3670 | 4296 | 401 |
| COPA | 17 | Hs.512234 | IL6 | 3447 | | 5 | 126 |
| | 21 | Hs.204238 | LCN2 | 4744 | | 4296 | 4375 |
| OS | 5 | Hs.512234 | IL6 | 3447 | 17 | | 126 |
| | 14 | Hs.477887 | AGTR1 | 2191 | 98 | | 21 |
| | 15 | Hs.435714 | PAK1 | 4744 | 125 | | 32 |
| | 16 | Hs.350229 | CASC3 | 731 | 105 | | 22 |
| ORT | 7 | Hs.435561 | ATM | 18 | 819 | 4296 | |
| | 9 | Hs.390729 | ERBB4 | 82 | 1842 | 1203 | |
| | 17 | Hs.724 | THRA | 817 | 121 | 69 | |
| | 18 | Hs.327527 | SMARCA4 | 84 | 196 | 55 | |
| | 19 | Hs.460996 | TRADD | 380 | 483 | 415 | |
| | 20 | Hs.534310 | CTAG1B | 1883 | 292 | 176 | |
| | 21 | Hs.477887 | AGTR1 | 3291 | 98 | 14 | |
| | 22 | Hs.350229 | CASC3 | 731 | 105 | 16 | |
Figure 3 shows the expression profiles of the 8 genes that were identified by the ORT and confirmed to be associated with breast cancer in previous studies. Figure 4 shows the expression profiles of the other 5 confirmed breast cancer–related genes that were missed by the ORT but identified by the other 3 methods. We have added some jittering to the horizontal positions to distinguish among close points. The title lists the gene names. Within the parentheses are those outlier statistics that have ranked the gene in the top 25.
Figure 3. Cancer gene outlier detection for breast cancer microarray data: plotted are the 8 top-ranking genes that were identified by the ORT and confirmed to be associated with breast cancer in the literature. The lymph node–negative samples (LN−) serve as the normal group, and we look for outlier samples in the lymph node–positive (LN+) group. We have added some jittering to the horizontal positions to distinguish among close points. The title lists the gene names. Within the parentheses are those outlier statistics that have ranked the gene in the top 25.
Figure 4. Cancer gene outlier detection for breast cancer microarray data: plotted are the 5 top-ranking genes that were missed by the ORT but identified by the other 3 methods and confirmed to be related to breast cancer in the literature. The lymph node–negative samples (LN−) serve as the normal group, and we look for outlier samples in the lymph node–positive (LN+) group. We have added some jittering to the horizontal positions to distinguish among close points. The title lists the gene names. Within the parentheses are those outlier statistics that have ranked the gene in the top 25.
5. DISCUSSION


The proposed ORT is intuitively motivated by the widely used t-statistic, with robustness taken into consideration. Compared to the COPA and OS, the ORT more appropriately takes into account the difference between the normal and disease groups, for example, through the proper estimation of the median absolute deviation (2.7) and the use of the normal group median instead of the overall median (2.8). Through simulation studies and application to public cancer microarray data, we have illustrated the competitive performance of the proposed ORT. In this paper, we have focused on comparing 2 groups. The study of multigroup comparisons will be reported in the future.
This research was partially supported by a University of Minnesota artistry and research grant and a research grant from the Minnesota Medical Foundation. Conflict of Interest: None declared.



