## Abstract

We apply the “fused lasso” regression method of (TSRZ2004) to the problem of “hot- spot detection”, in particular, detection of regions of gain or loss in comparative genomic hybridization (CGH) data. The fused lasso criterion leads to a convex optimization problem, and we provide a fast algorithm for its solution. Estimates of false-discovery rate are also provided. Our studies show that the new method generally outperforms competing methods for calling gains and losses in CGH data.

## INTRODUCTION

In this paper, we apply the fused lasso method (Tibshirani *and others*, 2004) to the “hot-spot” detection problem in comparative genomic hybridization (CGH) data. The CGH signal is approximated by a piecewise function that has relatively sparse areas with nonzero values. Hence, the method is useful for determining which areas of the signal are likely to be nonzero.

CGH is a technique for measuring DNA copy numbers of selected genes on the genome. In cells with cancer, mutations can cause a gene to be either deleted from the chromosome or amplified, that is, there are extra DNA copies of the gene. The CGH array experiments return the log_{2} ratio between the number of DNA copies of the gene in the tumor cells and the number of DNA copies in the reference cells. A value greater than zero indicates a possible gain in DNA copies of that gene, while a value less than zero suggests possible losses.

The results of a CGH experiment are often interpreted by a biologist, but this is time consuming and not necessarily very accurate. In recent years, a number of algorithms have been developed for automatic interpretation, including Fridlyand *and others* (2004), Olshen & Venkatraman (2004), Myers *and others* (2004), Wang *and others* (2005), Lipson *and others* (2005), and Picard *and others* (2005); a comprehensive comparison is given by Lai *and others* (2005). Approaches include successive top–down splitting, bottom–up agglomeration along the chromosome, and hidden Markov models.

The left panel of Figure 1 shows an example.

The data represent CGH measurements from 2 glioblastoma multiforme (GBM) tumors (see Section 4.2 for details).

In this paper, we apply the fused lasso method of Tibshirani *and others* (2004) to spatial smoothing and the CGH detection problem. The solid line in the right panel of Figure 1 shows the result of the fused lasso method applied to these data. The method has successfully detected the narrow regions of gain and the wide regions of loss.

## THE FUSED LASSO AND THE PROPOSED METHOD FOR HOT-SPOT DETECTION

The “fused lasso” (Tibshirani *and others*, 2004) is a generalization of the lasso (Tibshirani, 1996) and is defined by

*x*

_{i1},

*x*

_{i2}…,

*x*

_{ip}have some kind of natural ordering. The additional fused constraint encourages the flatness of the coefficient profiles

*β*

_{j}as a function of

*j*.

Here, we apply the fused lasso in the special case in which {*x*_{ij}} is the identity matrix. Hence, we are smoothing the sequence *y*_{1},*y*_{2},…,*y*_{n} along the 1-dimensional index *i*.

Denote the log_{2} ratio measurement of a chromosome (or chromosome arm) as *Y* = {*y*_{i}}_{i = 1}^{n}; we are interested in finding coefficients $\beta ^=(\beta ^1,\beta ^2,\u2026,\beta ^n)$ satisfying

*j*. Here,

*s*

_{1}controls the overall DNA copy number alteration amount of the target chromosome (or chromosome arm), while

*s*

_{2}controls the frequency of the alterations in the target region.

We investigated 2 different methods for estimating *s*_{1},*s*_{2} and calling gains and losses.

### Method 1

The first method estimates *s*_{1} and *s*_{2} from pre-smoothed version of the data, applies the fused lasso with these values, and then thresholds the estimate to determine regions of gains or losses.

Since the overall DNA copy number alteration amount is contributed mainly by the large gain/loss regions, we derive the value of *s*_{1} by using a heavily smoothed version of *Y*:

1) Apply “lowess” to

*Y*with fraction parameter*f*= max(1/50,50/*p*), where*p*is the length of*Y*. (The lowess window will be at least 50.)2) Denote the lowess result by $Y\u02dc$. Define $s^1=\u2211i|y\u02dci|$.

On the other hand, *s*_{2} controls the frequency of the alterations in the target region. Thus, we use moderately smoothed *Y* to infer *s*_{2}:

1) Apply lowess to

*Y*with fraction parameter*f*= min(1/20,10/*p*). (The lowess window will be at most 10.)2) Denote the result as and calculate the median absolute deviation of {

*d*_{i}}:where*μ*_{d}= median({*d*_{i}}_{i}).3) It is reasonable to assume that

*d*_{i}with absolute values greater than 4*δ*corresponds to copy number alteration break points on the genome. Thus, we define $s^2=2\delta +\u2211i|di|\xb7I(|di|>4\delta )$.

After we obtain $s^1$ and $s^2$, we apply the fused lasso to estimate the underlying signals $\beta ^i,i=1,2,\u2026,n.$ Finally, we threshold $|\beta ^i|$ by a value *θ* to obtain the final regions of gain or loss. The choice of *θ* is discussed in Section 2.3, where it is used to control the false-discovery rate (FDR).

### Method 2

Here, we vary both *s*_{1} and *s*_{2} in a systematic way. Define *S*_{1}(*X*) = ∑_{i}|*x*_{i}|. We calculate *s*_{1} = *S*_{1}(lowess(*Y*,*f* = *w*/*p*)) and *s*_{2} = 2*s*_{1}/*w*, where *w* represents the average length of alteration regions. The value of *w* is varied and then the fused lasso is computed for the resulting values of *s*_{1} and *s*_{2}. Figure 2 shows an example with some simulated data from a normal and tumor array. As *w* increases, the estimate does a better job of isolating the gain regions in the tumor sample. However, there is still a large region of low but nonzero values that would require further thresholding before gains and losses can be called. Hence, this approach seems too complicated, and we therefore settle on method 1.

### Estimation of FDR

For 1 array, when we are trying to decide whether the *i*th gene/clone has significant DNA copy number alteration, we are actually doing a hypothesis testing with

Here, rises the issue of multiple hypothesis testing, for, in each array, tens of thousands of genes/clones need to be considered simultaneously.

Although we do not have independent *H*_{i} for each gene in our problem, we can still use

*and others*, 2002), (Storey, 2002), (Efron & Tibshirani, 2002). We assume that the denominator is greater than 0 with probability 1 (i.e. the event that no significant genes are selected is rare).

We estimate the FDR in 2 different ways according to the availability of normal reference arrays.

##### Estimation of FDR when normal reference arrays are available.

Suppose there are *M* normal reference arrays {*R*^{m}}_{m = 1}^{M}, where *R*^{m} = {*r*_{i}^{m}}_{i = 1}^{n}. We first scale the normal reference arrays according to the target tumor array and then use the copy number inferred from the reference arrays to approximate the null distribution of ${\beta ^i}i$.

1) The contiguous genes/clones with the same $\beta ^i$ are defined as 1 segment. Suppose there are

*K*segments resulting from ${\beta ^i}$ and denote them as*S*_{k}= {*i*_{k − 1}+ 1,…,*i*_{k}}. We then calculate the within-segment standard deviation of the target arraywhere*μ*_{k}= mean(*y*_{i}:*i*∈*S*_{k}).2) Normalize each reference array according to $\sigma ^$:

3) Estimate

*s*_{1}and*s*_{2}for each and then apply fused lasso on it. Denote the resulting coefficients by .4) For a given threshold

*θ*, FDR of the procedure is estimated as a function of*θ*:Usually, the data analyst pre-chooses a target FDR and vary(2.4)*θ*over a range to seek the solution with $FDR^(\theta )$ closest to the target.

##### Estimation of FDR when normal reference arrays are not available.

In real experiments, there are often situations where appropriate normal reference arrays are not available due to sample/lab limitations. Thus, there is a need to control the FDR in the absence of reference arrays.

We approach this by considering segments as the hypothesis unit first. For 1 segment *S*_{k}, the hypothesis of interest can be stated as

*n*

_{k}is the size of

*S*

_{k}. To take advantage of the inferred copy number , we further define .

We approximate the null distribution of with standard normal and define *pk* = *P*(*Z* > ||), where *Z*∼*N*(0,1).

If we assume are independent from each other, then, for a given *q*∈(0,1), a conservative estimator of the genome-wide FDR in (2.3) is

*p*

_{k}≤

*q*are considered to have copy number alterations.

Again, data analyst can vary *q* over a range and choose the solution with $FDR^(q)$ closest to the preselected value.

## Computational considerations

The optimization problem for the fused lasso is a quadratic programming problem. Criterion (2.1) leads to a quadratic programming problem. For large *n*, the problem is difficult to solve and special care must be taken to avoid the use of *p*^{2} storage elements. We use the approach for the general fused lasso problem outlined in Tibshirani *and others* (2004): the 2-phase active set algorithm “sqopt” of Gill *and others* (1999), which is designed for quadratic programming problems with sparse linear constraints.

Figure 3 shows a log-log plot of the number of seconds required for the computation as a function of the number of clones (genes), on a Linux computer. We simulated 10 data sets of each size, and the figure shows the mean plus or minus standard error; the computation took on average about 30 s for 4000 clones. Beyond *n* = 1000 clones, the plot is roughly linear with a slope of about 2. This suggests that the computation is of order *n*^{2}. The estimated computation time for a data set with 100000 clones is about 135 min.

While this may be sufficiently fast for practical use, clearly it would be preferable to reduce the computation time for large data sets. We are currently working with Jerome Friedman and Trevor Hastie on specialized algorithms for the fused lasso, and the initial results show considerable promise.

## COMPARISON TO OTHER SMOOTHING METHODS

General smoothing methods are not typically used for analyzing CGH data because their results can be difficult to interpret (Lai *and others*, 2005). This is illustrated in Figure 5, where 2 popular smoothing methods—lowess (Becker *and others*, 1988) and penalized smoothing splines (Ruppert *and others*, 2003)—are applied to the example data (see Section 4.2 for details). We used R function lowess (smooth window = 10) and “spm” (default parameters) to compute the results. As we can see from the figure, the 2 smoothing methods do not provide direct calls for copy number gains/losses, and thus require additional thresholding for identifying regions with significant alterations. In addition, the smoothing curves do not catch the piecewise constant shape of copy number changes, which raises additional challenges for controlling the FDR. Moreover, copy number alterations can be both large chromosome segmentation gain/loss and also abrupt local amplification/deletion. Therefore, different degrees of smoothness are needed for different chromosome regions, which adds another layer of complexity to the kernel- and spline-based approaches.

In the fused lasso, the fused term in the loss function can be viewed as the first derivative of the coefficient profiles, and thus the method can also be deemed as a “smoothing” approach. However, the use of *L*_{1}-norm on the fused penalty term enables the method to capture both the piecewise flatness patterns and the abrupt local jumps at the same time (see Figure 5). In addition, the control on the overall sparsity of the coefficient solution helps to screen away the “cold”-spot regions. These make the fused lasso a more attractive approach for analyzing CGH data.

## SOME RESULTS

### 4.1. Simulated data

We apply the proposed method on artificial chromosomes simulated by Lai *and others* (2005) (downloaded from http://www.chip.org/∼ppark/Supplements/Bioinformatics05b.html). We consider the most challenging situation where the signal-to-noise ratio is equal to 1. For each of the 4 different aberration widths (5, 10, 20, and 40 probes), there are 100 independently simulated chromosomes with 100 probes in total.

We first compare the inferred copy number ${\beta ^i}i$ by fused lasso with the estimated copy numbers from 3 other popular programs “CGHseg” (Picard *and others*, 2005), “CBS” (Olshen & Venkatraman, 2004), and “CLAC” (Wang *and others*, 2005). The receiver operating curves of different methods shown in Figure 4 suggest that fused lasso better captures the true DNA copy number alterations than the other 3 methods, especially when the aberration width is small.

We then investigate the 2 FDR estimators proposed in Section 2.3. $FDR^(\theta )$ is computed based on 10 simulated reference arrays with no copy number alterations (i.i.d. from *N*(0,1)). We then control $FDR^(\theta )$ and $FDR^(q)$ at the level of 0.01 and calculate the true FDR for the selected solutions. The result is summarized in Table 1. Overall, when we control $FDR^(\theta )=0.01$, 80.75% simulated cases have true FDR less than 0.01; when we control $FDR^(q)=0.01$, 84.5% simulated cases have true FDR less than 0.01. Generally, $FDR^(q)$ is more conservative than $FDR^(\theta )$, for the former is derived on the assumption that segments are independent from each other.

Window size | $FDR^(\theta )=0.01$ | $FDR^(q)=0.01$ | ||||

Median | 75% | 90% | Median | 75% | 90% | |

05 | 0 | 00.37 | 0.499 | 0 | 0.37 | 0.37 |

10 | 0 | 00.37 | 00.37 | 0 | 0.37 | 0.37 |

20 | 0 | 00.37 | 0.108 | 0 | 0.37 | 0.09 |

40 | 0 | 0.037 | 0.133 | 0 | 0.03 | 0.29 |

Window size | $FDR^(\theta )=0.01$ | $FDR^(q)=0.01$ | ||||

Median | 75% | 90% | Median | 75% | 90% | |

05 | 0 | 00.37 | 0.499 | 0 | 0.37 | 0.37 |

10 | 0 | 00.37 | 00.37 | 0 | 0.37 | 0.37 |

20 | 0 | 00.37 | 0.108 | 0 | 0.37 | 0.09 |

40 | 0 | 0.037 | 0.133 | 0 | 0.03 | 0.29 |

### GBM data

The glioma data from Bredel *and others* (2005) contain samples representing primary GBMs, a particular malignant type of brain tumor. We investigate the performance of various methods on the array CGH profiles of the GBM samples examined in Lai *and others* (2005). To generate a more challenging situation where both local amplification and large region loss exit in the same chromosome, we paste together the following 2 array regions: (1) chromosome 7 in GBM29 from 40 to 65 Mb and (2) chromosome 13 in GBM31. The performance of different methods on this pseudo chromosome is illustrated in Figure 5. We can see that the proposed method using fused lasso successfully identified both the local amplification and the big chunk of copy number loss.

### Breast tumor data

In the study conducted by Pollack *and others* (2002), cDNA microarray CGH was profiled across 6691 mapped human genes in 44 breast tumor samples and 10 breast cancer cell lines. The scanned raw data were downloaded from the Stanford Microarray Database (http://smd.stanford.edu). We pick the breast cancer cell line (MDA157) as an example, which has a large degree of copy number alterations, and apply our proposed method as well as the other 3 methods on it to estimate the underlying copy number changes. From the results shown in Figure 6, we can see that fused lasso successfully recognized varies copy number alterations.

CGHseg appears to be very sensitive to outlier measurements and thus will be more suitable for detecting single-gene copy number mutations of high-quality arrays. CLAC is conservative in handling outliers with opposite signs in 1 alteration region and therefore tends to break large alteration segments into small blocks. CBS provides clean solutions for segmentations but has the limitation to detect break points whose alteration signals are weak (e.g. chromosome 7 and 15 of the selected cell line).

## DISCUSSION AND FUTURE WORK

The fused lasso can be generalized to other analysis besides the calling of gains and losses in CGH data. For example, biologists have great interest in understanding the interactions between copy number alterations and mRNA expression levels. To study this, a commonly used method is to identify gene pairs, of which one gene's copy number and the other gene's expression level are significantly correlated. However, genome-wide screening for such pairs is quite challenging: (1) the large dimensionality of the problem raises the requirement of controlling the sparsity in the solutions; (2) the strong spacial correlations of gene copy number changes need to be taken into account. The fused lasso can be used again to tackle these challenges.

Denote the expression measurements of a target gene as *Y* = (*y*_{1},*y*_{2},…,*y*_{J}), where *J* represents the total number of samples. Denote the CGH measurement of the *i*th gene across the *J* samples as *X*^{i} = (*x*_{1}^{i},…,*x*_{J}^{i}). Suppose *Y* and {*X*^{i}} have all been normalized to mean 0 and variance 1. cor(*Y*,*X*^{i}) is then exactly the least-squares coefficient of the linear model *Y*∼*β**X*^{i}. Thus, we can infer the correlation coefficients by solving

*β*

_{j}suggests that

*Y*and

*X*

^{i}are significantly correlated.

In addition, for CGH data, our paper focuses on the calling of gains and losses from a single array. Typically, the researcher collects data on a few dozen arrays, often divided into normal and diseased groups. Ideally, one would like to carry out a joint analysis of the data set, finding regions of gains and losses that occur commonly among the patients and also ones that occur differentially in the normal and diseased groups. One interesting proposal along these lines is the STAC procedure of Diskin *and others* (2006). This method projects the calls of gains and losses for individual arrays along chromosome, measuring the frequency and “footprint” of calls in each local region. The STAC procedure could be applied to the gains and losses called by the fused lasso. However, a more integrated approach might be possible. For example, given data on arrays *j* = 1,2,…,*J*, one could first apply the fused lasso to the data from each array. Suppose that we parse the resulting coefficient vectors into a collection of nonzero basis functions $\beta ^ik$, *k* = 1,2,…,*K*. Each function is nonzero over a single interval of the real line. Then, we can do some kind of joint regression of the data from all arrays on this collection, allowing the arrays to share the basis functions as efficiently as possible. This would reveal basis functions (areas of the chromosome) that are common among many or all arrays and also those that are unique to only a few. This is a topic for further study.

An R language package for the fused lasso will be freely available at http://www.stat.stanford.edu/∼tibs/cghFLasso.

We would like to thank Michael Saunders for his expert advice on computational issues and would like to thank the referees and editors for helpful comments that led to improvements in this manuscript. Tibshirani was partially supported by National Science Foundation Grant DMS-9971405 and National Institutes of Health Contract N01-HV-28183. *Conflict of Interest:* None declared.