Abstract

We apply the “fused lasso” regression method of (TSRZ2004) to the problem of “hot- spot detection”, in particular, detection of regions of gain or loss in comparative genomic hybridization (CGH) data. The fused lasso criterion leads to a convex optimization problem, and we provide a fast algorithm for its solution. Estimates of false-discovery rate are also provided. Our studies show that the new method generally outperforms competing methods for calling gains and losses in CGH data.

INTRODUCTION

In this paper, we apply the fused lasso method (Tibshirani and others, 2004) to the “hot-spot” detection problem in comparative genomic hybridization (CGH) data. The CGH signal is approximated by a piecewise function that has relatively sparse areas with nonzero values. Hence, the method is useful for determining which areas of the signal are likely to be nonzero.

CGH is a technique for measuring DNA copy numbers of selected genes on the genome. In cells with cancer, mutations can cause a gene to be either deleted from the chromosome or amplified, that is, there are extra DNA copies of the gene. The CGH array experiments return the log2 ratio between the number of DNA copies of the gene in the tumor cells and the number of DNA copies in the reference cells. A value greater than zero indicates a possible gain in DNA copies of that gene, while a value less than zero suggests possible losses.

The results of a CGH experiment are often interpreted by a biologist, but this is time consuming and not necessarily very accurate. In recent years, a number of algorithms have been developed for automatic interpretation, including Fridlyand and others (2004), Olshen & Venkatraman (2004), Myers and others (2004), Wang and others (2005), Lipson and others (2005), and Picard and others (2005); a comprehensive comparison is given by Lai and others (2005). Approaches include successive top–down splitting, bottom–up agglomeration along the chromosome, and hidden Markov models.

The left panel of Figure 1 shows an example.

Fig. 1.

Fused lasso applied to some GBM data. The data are shown in the left panel, and the solid line in the right panel represents the inferred copy number β^ from the fused lasso. The grey line is for y = 0. 1

Fig. 1.

Fused lasso applied to some GBM data. The data are shown in the left panel, and the solid line in the right panel represents the inferred copy number β^ from the fused lasso. The grey line is for y = 0. 1

The data represent CGH measurements from 2 glioblastoma multiforme (GBM) tumors (see Section 4.2 for details).

In this paper, we apply the fused lasso method of Tibshirani and others (2004) to spatial smoothing and the CGH detection problem. The solid line in the right panel of Figure 1 shows the result of the fused lasso method applied to these data. The method has successfully detected the narrow regions of gain and the wide regions of loss.

THE FUSED LASSO AND THE PROPOSED METHOD FOR HOT-SPOT DETECTION

The “fused lasso” (Tibshirani and others, 2004) is a generalization of the lasso (Tibshirani, 1996) and is defined by 

(2.1)
graphic
The fused lasso is developed for the situations when xi1,xi2…,xip have some kind of natural ordering. The additional fused constraint encourages the flatness of the coefficient profiles βj as a function of j.

Here, we apply the fused lasso in the special case in which {xij} is the identity matrix. Hence, we are smoothing the sequence y1,y2,…,yn along the 1-dimensional index i.

Denote the log2 ratio measurement of a chromosome (or chromosome arm) as Y = {yi}i = 1n; we are interested in finding coefficients β^=(β^1,β^2,,β^n) satisfying 

(2.2)
graphic
and β^j is the inferred DNA copy number for gene j. Here, s1 controls the overall DNA copy number alteration amount of the target chromosome (or chromosome arm), while s2 controls the frequency of the alterations in the target region.

We investigated 2 different methods for estimating s1,s2 and calling gains and losses.

Method 1

The first method estimates s1 and s2 from pre-smoothed version of the data, applies the fused lasso with these values, and then thresholds the estimate to determine regions of gains or losses.

Since the overall DNA copy number alteration amount is contributed mainly by the large gain/loss regions, we derive the value of s1 by using a heavily smoothed version of Y:

  • 1) Apply “lowess” to Y with fraction parameter f = max(1/50,50/p), where p is the length of Y. (The lowess window will be at least 50.)

  • 2) Denote the lowess result by Y˜. Define s^1=i|y˜i|.

On the other hand, s2 controls the frequency of the alterations in the target region. Thus, we use moderately smoothed Y to infer s2:

  • 1) Apply lowess to Y with fraction parameter f = min(1/20,10/p). (The lowess window will be at most 10.)

  • 2) Denote the result as forumla and calculate the median absolute deviation of {di}: 

    graphic
    where μd = median({di}i).

  • 3) It is reasonable to assume that di with absolute values greater than 4δ corresponds to copy number alteration break points on the genome. Thus, we define s^2=2δ+i|di|·I(|di|>4δ).

After we obtain s^1 and s^2, we apply the fused lasso to estimate the underlying signals β^i,i=1,2,,n. Finally, we threshold |β^i| by a value θ to obtain the final regions of gain or loss. The choice of θ is discussed in Section 2.3, where it is used to control the false-discovery rate (FDR).

Method 2

Here, we vary both s1 and s2 in a systematic way. Define S1(X) = ∑i|xi|. We calculate s1 = S1(lowess(Y,f = w/p)) and s2 = 2s1/w, where w represents the average length of alteration regions. The value of w is varied and then the fused lasso is computed for the resulting values of s1 and s2. Figure 2 shows an example with some simulated data from a normal and tumor array. As w increases, the estimate does a better job of isolating the gain regions in the tumor sample. However, there is still a large region of low but nonzero values that would require further thresholding before gains and losses can be called. Hence, this approach seems too complicated, and we therefore settle on method 1.

Fig. 2.

Simulated data: the effect of parameter w (the average length of alterations) on the fused lasso solution. The panels of the left column represent a chromosome with no copy number alteration; the panels of the right column represent a chromosome with an amplification region of genes 40–60. The solid lines illustrate the estimated copy numbers of fused lasso; a horizontal line is drawn at y = 0.

Fig. 2.

Simulated data: the effect of parameter w (the average length of alterations) on the fused lasso solution. The panels of the left column represent a chromosome with no copy number alteration; the panels of the right column represent a chromosome with an amplification region of genes 40–60. The solid lines illustrate the estimated copy numbers of fused lasso; a horizontal line is drawn at y = 0.

Estimation of FDR

For 1 array, when we are trying to decide whether the ith gene/clone has significant DNA copy number alteration, we are actually doing a hypothesis testing with

 
graphic

Here, rises the issue of multiple hypothesis testing, for, in each array, tens of thousands of genes/clones need to be considered simultaneously.

Although we do not have independent Hi for each gene in our problem, we can still use 

(2.3)
graphic
as a rough estimator for (FDR) (Benjamini & Hochberg, 1995), (Chu and others, 2002), (Storey, 2002), (Efron & Tibshirani, 2002). We assume that the denominator is greater than 0 with probability 1 (i.e. the event that no significant genes are selected is rare).

We estimate the FDR in 2 different ways according to the availability of normal reference arrays.

Estimation of FDR when normal reference arrays are available.

Suppose there are M normal reference arrays {Rm}m = 1M, where Rm = {rim}i = 1n. We first scale the normal reference arrays according to the target tumor array and then use the copy number inferred from the reference arrays to approximate the null distribution of {β^i}i.

  • 1) The contiguous genes/clones with the same β^i are defined as 1 segment. Suppose there are K segments resulting from {β^i} and denote them as Sk = {ik − 1 + 1,…,ik}. We then calculate the within-segment standard deviation of the target array 

    graphic
    where μk = mean(yi:iSk).

  • 2) Normalize each reference array according to σ^: 

    graphic

  • 3) Estimate s1 and s2 for each forumla and then apply fused lasso on it. Denote the resulting coefficients by forumla.

  • 4) For a given threshold θ, FDR of the procedure is estimated as a function of θ: 

    (2.4)
    graphic
    Usually, the data analyst pre-chooses a target FDR and vary θ over a range to seek the solution with FDR^(θ) closest to the target.

Estimation of FDR when normal reference arrays are not available.

In real experiments, there are often situations where appropriate normal reference arrays are not available due to sample/lab limitations. Thus, there is a need to control the FDR in the absence of reference arrays.

We approach this by considering segments as the hypothesis unit first. For 1 segment Sk, the hypothesis of interest can be stated as 

graphic
This makes it natural to examine statistics forumla where nk is the size of Sk. To take advantage of the inferred copy number forumla, we further define forumla.

We approximate the null distribution of forumla with standard normal and define pk = P(Z > |forumla|), where ZN(0,1).

If we assume forumla are independent from each other, then, for a given q∈(0,1), a conservative estimator of the genome-wide FDR in (2.3) is 

(2.5)
graphic
Here, genes/clones in the segments with pkq are considered to have copy number alterations.

Again, data analyst can vary q over a range and choose the solution with FDR^(q) closest to the preselected value.

Computational considerations

The optimization problem for the fused lasso is a quadratic programming problem. Criterion (2.1) leads to a quadratic programming problem. For large n, the problem is difficult to solve and special care must be taken to avoid the use of p2 storage elements. We use the approach for the general fused lasso problem outlined in Tibshirani and others (2004): the 2-phase active set algorithm “sqopt” of Gill and others (1999), which is designed for quadratic programming problems with sparse linear constraints.

Figure 3 shows a log-log plot of the number of seconds required for the computation as a function of the number of clones (genes), on a Linux computer. We simulated 10 data sets of each size, and the figure shows the mean plus or minus standard error; the computation took on average about 30 s for 4000 clones. Beyond n = 1000 clones, the plot is roughly linear with a slope of about 2. This suggests that the computation is of order n2. The estimated computation time for a data set with 100000 clones is about 135 min.

Fig. 3.

Number of CPU seconds required for the fused lasso algorithm as a function of the number of clones.

Fig. 3.

Number of CPU seconds required for the fused lasso algorithm as a function of the number of clones.

While this may be sufficiently fast for practical use, clearly it would be preferable to reduce the computation time for large data sets. We are currently working with Jerome Friedman and Trevor Hastie on specialized algorithms for the fused lasso, and the initial results show considerable promise.

COMPARISON TO OTHER SMOOTHING METHODS

General smoothing methods are not typically used for analyzing CGH data because their results can be difficult to interpret (Lai and others, 2005). This is illustrated in Figure 5, where 2 popular smoothing methods—lowess (Becker and others, 1988) and penalized smoothing splines (Ruppert and others, 2003)—are applied to the example data (see Section 4.2 for details). We used R function lowess (smooth window = 10) and “spm” (default parameters) to compute the results. As we can see from the figure, the 2 smoothing methods do not provide direct calls for copy number gains/losses, and thus require additional thresholding for identifying regions with significant alterations. In addition, the smoothing curves do not catch the piecewise constant shape of copy number changes, which raises additional challenges for controlling the FDR. Moreover, copy number alterations can be both large chromosome segmentation gain/loss and also abrupt local amplification/deletion. Therefore, different degrees of smoothness are needed for different chromosome regions, which adds another layer of complexity to the kernel- and spline-based approaches.

In the fused lasso, the fused term in the loss function can be viewed as the first derivative of the coefficient profiles, and thus the method can also be deemed as a “smoothing” approach. However, the use of L1-norm on the fused penalty term enables the method to capture both the piecewise flatness patterns and the abrupt local jumps at the same time (see Figure 5). In addition, the control on the overall sparsity of the coefficient solution helps to screen away the “cold”-spot regions. These make the fused lasso a more attractive approach for analyzing CGH data.

SOME RESULTS

4.1. Simulated data

We apply the proposed method on artificial chromosomes simulated by Lai and others (2005) (downloaded from http://www.chip.org/∼ppark/Supplements/Bioinformatics05b.html). We consider the most challenging situation where the signal-to-noise ratio is equal to 1. For each of the 4 different aberration widths (5, 10, 20, and 40 probes), there are 100 independently simulated chromosomes with 100 probes in total.

We first compare the inferred copy number {β^i}i by fused lasso with the estimated copy numbers from 3 other popular programs “CGHseg” (Picard and others, 2005), “CBS” (Olshen & Venkatraman, 2004), and “CLAC” (Wang and others, 2005). The receiver operating curves of different methods shown in Figure 4 suggest that fused lasso better captures the true DNA copy number alterations than the other 3 methods, especially when the aberration width is small.

Fig. 4.

TPR = (the number of probes within the aberration width that is above a threshold)/(the total number of probes within the aberration width); FPR = (the number of probes outside the aberration width that is above a threshold)/(the total number of probes outside the aberration width).

Fig. 4.

TPR = (the number of probes within the aberration width that is above a threshold)/(the total number of probes within the aberration width); FPR = (the number of probes outside the aberration width that is above a threshold)/(the total number of probes outside the aberration width).

We then investigate the 2 FDR estimators proposed in Section 2.3. FDR^(θ) is computed based on 10 simulated reference arrays with no copy number alterations (i.i.d. from N(0,1)). We then control FDR^(θ) and FDR^(q) at the level of 0.01 and calculate the true FDR for the selected solutions. The result is summarized in Table 1. Overall, when we control FDR^(θ)=0.01, 80.75% simulated cases have true FDR less than 0.01; when we control FDR^(q)=0.01, 84.5% simulated cases have true FDR less than 0.01. Generally, FDR^(q) is more conservative than FDR^(θ), for the former is derived on the assumption that segments are independent from each other.

Table 1

The median, 75% quantile, and 90% quantile of the true FDR values

Window size FDR^(θ)=0.01 FDR^(q)=0.01 
 Median 75% 90% Median 75% 90% 
05 00.37 0.499 0.37 0.37 
10 00.37 00.37 0.37 0.37 
20 00.37 0.108 0.37 0.09 
40 0.037 0.133 0.03 0.29 
Window size FDR^(θ)=0.01 FDR^(q)=0.01 
 Median 75% 90% Median 75% 90% 
05 00.37 0.499 0.37 0.37 
10 00.37 00.37 0.37 0.37 
20 00.37 0.108 0.37 0.09 
40 0.037 0.133 0.03 0.29 

GBM data

The glioma data from Bredel and others (2005) contain samples representing primary GBMs, a particular malignant type of brain tumor. We investigate the performance of various methods on the array CGH profiles of the GBM samples examined in Lai and others (2005). To generate a more challenging situation where both local amplification and large region loss exit in the same chromosome, we paste together the following 2 array regions: (1) chromosome 7 in GBM29 from 40 to 65 Mb and (2) chromosome 13 in GBM31. The performance of different methods on this pseudo chromosome is illustrated in Figure 5. We can see that the proposed method using fused lasso successfully identified both the local amplification and the big chunk of copy number loss.

Fig. 5.

Chromosome 7 and chromosome 13 from 2 GBM tumors.

Fig. 5.

Chromosome 7 and chromosome 13 from 2 GBM tumors.

Breast tumor data

In the study conducted by Pollack and others (2002), cDNA microarray CGH was profiled across 6691 mapped human genes in 44 breast tumor samples and 10 breast cancer cell lines. The scanned raw data were downloaded from the Stanford Microarray Database (http://smd.stanford.edu). We pick the breast cancer cell line (MDA157) as an example, which has a large degree of copy number alterations, and apply our proposed method as well as the other 3 methods on it to estimate the underlying copy number changes. From the results shown in Figure 6, we can see that fused lasso successfully recognized varies copy number alterations.

Fig. 6.

Array CGH profile of 10 chromosomes of breast cancer cell line MDA157. Panels in the same row are for the same chromosome. The integers at the beginning of each row are chromosome indexes. The dark black line in each panel represents the estimated copy number of a particular method, whose name is shown on the top of each column. The black horizontal line in each panel represents y = 0.

Fig. 6.

Array CGH profile of 10 chromosomes of breast cancer cell line MDA157. Panels in the same row are for the same chromosome. The integers at the beginning of each row are chromosome indexes. The dark black line in each panel represents the estimated copy number of a particular method, whose name is shown on the top of each column. The black horizontal line in each panel represents y = 0.

CGHseg appears to be very sensitive to outlier measurements and thus will be more suitable for detecting single-gene copy number mutations of high-quality arrays. CLAC is conservative in handling outliers with opposite signs in 1 alteration region and therefore tends to break large alteration segments into small blocks. CBS provides clean solutions for segmentations but has the limitation to detect break points whose alteration signals are weak (e.g. chromosome 7 and 15 of the selected cell line).

DISCUSSION AND FUTURE WORK

The fused lasso can be generalized to other analysis besides the calling of gains and losses in CGH data. For example, biologists have great interest in understanding the interactions between copy number alterations and mRNA expression levels. To study this, a commonly used method is to identify gene pairs, of which one gene's copy number and the other gene's expression level are significantly correlated. However, genome-wide screening for such pairs is quite challenging: (1) the large dimensionality of the problem raises the requirement of controlling the sparsity in the solutions; (2) the strong spacial correlations of gene copy number changes need to be taken into account. The fused lasso can be used again to tackle these challenges.

Denote the expression measurements of a target gene as Y = (y1,y2,…,yJ), where J represents the total number of samples. Denote the CGH measurement of the ith gene across the J samples as Xi = (x1i,…,xJi). Suppose Y and {Xi} have all been normalized to mean 0 and variance 1. cor(Y,Xi) is then exactly the least-squares coefficient of the linear model YβXi. Thus, we can infer the correlation coefficients by solving 

(5.1)
graphic
where a nonzero βj suggests that Y and Xi are significantly correlated.

In addition, for CGH data, our paper focuses on the calling of gains and losses from a single array. Typically, the researcher collects data on a few dozen arrays, often divided into normal and diseased groups. Ideally, one would like to carry out a joint analysis of the data set, finding regions of gains and losses that occur commonly among the patients and also ones that occur differentially in the normal and diseased groups. One interesting proposal along these lines is the STAC procedure of Diskin and others (2006). This method projects the calls of gains and losses for individual arrays along chromosome, measuring the frequency and “footprint” of calls in each local region. The STAC procedure could be applied to the gains and losses called by the fused lasso. However, a more integrated approach might be possible. For example, given data on arrays j = 1,2,…,J, one could first apply the fused lasso to the data from each array. Suppose that we parse the resulting coefficient vectors into a collection of nonzero basis functions β^ik, k = 1,2,…,K. Each function is nonzero over a single interval of the real line. Then, we can do some kind of joint regression of the data from all arrays on this collection, allowing the arrays to share the basis functions as efficiently as possible. This would reveal basis functions (areas of the chromosome) that are common among many or all arrays and also those that are unique to only a few. This is a topic for further study.

An R language package for the fused lasso will be freely available at http://www.stat.stanford.edu/∼tibs/cghFLasso.

We would like to thank Michael Saunders for his expert advice on computational issues and would like to thank the referees and editors for helpful comments that led to improvements in this manuscript. Tibshirani was partially supported by National Science Foundation Grant DMS-9971405 and National Institutes of Health Contract N01-HV-28183. Conflict of Interest: None declared.

References

Becker
RA
Chambers
JM
Wilks
AR
The New S Language
 , 
1988
Pacific Grove, CA
Wadsworth Brooks Cole
Benjamini
Y
Hochberg
Y
Controlling the false discovery rate: a practical and powerful approach to multiple testing
Journal of the Royal Statistical Society, Series B
 , 
1995
, vol. 
85
 (pg. 
289
-
300
)
Bredel
M
Bredel
C
Juric
D
Harsh
GR
Vogel
H
Recht
LD
Sikic
BI
High-resolution genome-wide mapping of genetic alterations in human glial brain tumors
Cancer Research
 , 
2005
, vol. 
65
 (pg. 
4088
-
4096
)
Chu
G
Narasimhan
B
Tibshirani
R
Tusher
V
Significance Analysis of Microarrays (SAM) Software
2002
 
Diskin
SJ
Eck
T
Greshock
J
Mosse
YP
Naylor
T
Stoeckert
CJ
Jr
Weber
BL
Maris
JM
Grant
GR
Stac: a method for testing the significance of DNA copy number aberrations across multiple array-CGH experiments
Genome Research
 , 
2006
, vol. 
16
 (pg. 
1149
-
1158
)
Efron
B
Tibshirani
R
Microarrays, empirical Bayes methods, and false discovery rates
Genetic Epidemiology
 , 
2002
, vol. 
1
 (pg. 
70
-
86
)
Fridlyand
J
Snijders
AM
Pinkel
D
Albertson
DG
Jain
AN
Hidden Markov models approach to the analysis of array CGH data
Journal of Multivariate Analysis
 , 
2004
, vol. 
90
 (pg. 
132
-
153
)
Gill
P
Murray
W
Saunders
M
Users guide for SQOPT 5.3: a Fortran package for large-scale linear and quadratic programming
Technical Report
 , 
1999
Palo Alto, CA
Stanford University
Lai
WR
Johnson
MD
Kucherlapati
R
Park
PJ
Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data
Biostatistics
 , 
2005
, vol. 
21
 (pg. 
3763
-
3770
)
Lipson
D
Aumann
Y
Ben-Dor
A
Linial
N
Yakhini
Z
Efficient calculation of interval scores for DNA copy number data analysis
Journal of Computational Biology
 , 
2005
, vol. 
13
 (pg. 
215
-
228
)
Myers
CL
Dunham
MJ
Kung
SY
Troyanskaya
OG
Accurate detection of aneuploidies in array CGH and gene expression microarray data
Bioinformatics
 , 
2004
, vol. 
20
 (pg. 
3533
-
3543
)
Olshen
A
Venkatraman
E
A faster circular binary segmentation algorithm for the analysis of array CGH data
Biostatistics
 , 
2004
, vol. 
5
 (pg. 
557
-
572
)
Picard
F
Robin
S
Lavielle
M
Vaisse
C
Daudin
J
A statistical approach for array CGH data analysis
BMC Bioinformatics
 , 
2005
, vol. 
11
 (pg. 
6
-
27
)
Pollack
JR
Sórlie
T
Perou
CM
Rees
CA
Jeffrey
SS
Lonning
PE
Tibshirani
R
Botstein
D
Bórresen-Dale
A-L
Brown
PO
Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors
Proceedings of the National Academy of Sciences USA
 , 
2002
, vol. 
99
 (pg. 
12963
-
12968
)
Ruppert
D
Wand
MP
Carroll
R
Semiparametric Regression
 , 
2003
New York
Cambridge University Press
Storey
JD
A direct approach to false discovery rates
Journal of the Royal Statistical Society, Series B
 , 
2002
, vol. 
64
 (pg. 
479
-
498
)
Tibshirani
R
Regression shrinkage and selection via the lasso
Journal of the Royal Statistical Society, Series B
 , 
1996
, vol. 
58
 (pg. 
267
-
288
)
Tibshirani
R
Saunders
M
Rosset
S
Zhu
J
Knight
K
Sparsity and smoothness via the fused lasso
Journal of the Royal Statistical Society, Series B
 , 
2005
, vol. 
67
 (pg. 
91
-
108
)
Wang
P
Kim
Y
Pollack
J
Narasimhan
B
Tibshirani
R
A method for calling gains and losses in array CGH data
Biostatistics
 , 
2005
, vol. 
6
 (pg. 
45
-
58
)