Abstract
Summary: Non-negative matrix factorization (NMF) is an increasingly used algorithm for the analysis of complex high-dimensional data. BRB-ArrayTools is a widely used software system for the analysis of gene expression data with almost 9000 registered users in over 65 countries. We have developed a NMF analysis plug-in in BRB-ArrayTools for unsupervised sample clustering of microarray gene expression data. Our analysis tool also incorporates an algorithm for Semi-NMF which can handle both positive and negative elements for log-ratio data. Output includes a heat map of sample clusters and differentially expressed genes with extensive biological annotation. For comparison, output also includes the results of K-means clustering.
Availability: The NMF analysis plug-in is freely available in BRB-ArrayTools for non-commercial users. BRB-ArrayTools can be downloaded at http://linus.nci.nih.gov/BRB-ArrayTools.html. The algorithms used for NMF and Semi-NMF are available at ftp://linus.nci.nih.gov/pub/NMF.
Contact: rsimon@mail.nih.gov
1 INTRODUCTION
Non-negative matrix factorization (NMF) was introduced as a tool to reduce the dimensionality of large image bit maps and identify key meta-features (Lee and Seung, 1999). It has been widely applied in text mining, image processing and document clustering (Devarajan, 2008). More recently, NMF has been reported to be a powerful tool for gene expression data (Brunet et al., 2004; Kim and Tidor, 2003; Li and Ding, 2006). Brunet applied it to discover distinct molecular patterns and indicated that NMF clustered samples of leukemia, medulloblastoma and central nervous system tumors datasets in a manner that preserved separation of known phenotypes better than hierarchical clustering and self-organizing map algorithms (Brunet et al., 2004). BioNMF was developed as a web-based tool for NMF analysis (Pascual-Montano et al., 2006).
NMF algorithms require both the input matrix and the two resultant factor matrices to be non-negative. This is appropriate for image bitmaps and for single channel expression data but not for log-ratio data. Therefore, we have also implemented a Semi-NMF algorithm (Li and Ding, 2006) for unrestricted input data. The input is factored into an unrestricted matrix of cluster centroids times a non-negative matrix of weights used for sample clustering.
The NMF and Semi-NMF analysis plug-in is developed in R and integrated into BRB-ArrayTools, which is a comprehensive package for the visualization and statistical analysis of DNA microarray gene expression data (Simon et al., 2006). BRB-ArrayTools contains a plug-in facility that enables users and developers to easily add modules for processing gene expression data for visualization, multidimensional scaling, clustering of genes and samples and predictive classification of samples. BRB-ArrayTools is highly computationally efficient, using C and FORTRAN components for computationally intensive analyses. It uses Excel as user interface, but expression data are stored in binary files so that analysis projects of over 1000 arrays and 50 000 genes can be easily handled. By incorporating a wide variety of analysis tools in BRB-ArrayTools, users without programming experience are empowered to try new methods on their own data and evaluate whether claims of the developers are justified.
2 METHODOLOGY
The standard NMF algorithm can be written as
where X+ is a p (genes) by n (samples) matrix, F+ is p by k and G+ is n by k. k is the number of clusters and is smaller than p and n. Matrices X+, F+ and G+ must be non-negative. The columns of X+ represent the expression profiles of the samples. The columns of F+ represent the expression centroids of the sample clusters. The rows of G+ provide weights for representing columns of X+ as non-negative linear combinations of the columns of F+.Li and Ding (2006) introduced Semi-NMF to handle clustering for gene expression data containing negative values. It can be represented as
where X is p by n, F is p by k and G+ is n by k. With Semi-NMF, X and F are not restricted to be non-negative. G+ is still restricted to be non-negative so that the weights are interpretable for sample clustering.
The algorithms used for NMF and Semi-NMF are iterative optimization algorithms for minimizing a cost function. The algorithms are summarized in the Table 1 as Supplement.
3 IMPLEMENTATION AND APPLICATION FEATURES
The NMF analysis plug-in was developed in R and integrated into BRB-ArrayTools. The BRB-ArrayTools plug-in facility contains a dialog box wizard that enables the plug-in developer to provide users a dialog box for entering control parameters (Fig. 1a). The user's expression data are passed to the plug-in from the BRB-ArrayTools software transparently to the user. The output of the analysis is presented to the user in a HTML file.
(a) NMF analysis plug-in input interface. The output consists of three parts: (b) a table that contains sample identifiers and clustering results for both NMF and K-means clustering. It shows clustering results of leukemia dataset of Brunet et al.; (c) a table indicating the genes differentially expressed among clusters with gene identifiers hyperlinked to biological annotations. Genes in the table are those differentially expressed among clusters with biological annotations of leukemia dataset; (d) heat maps for NMF (left) and K-means (right) clustering of leukemia dataset.
(a) NMF analysis plug-in input interface. The output consists of three parts: (b) a table that contains sample identifiers and clustering results for both NMF and K-means clustering. It shows clustering results of leukemia dataset of Brunet et al.; (c) a table indicating the genes differentially expressed among clusters with gene identifiers hyperlinked to biological annotations. Genes in the table are those differentially expressed among clusters with biological annotations of leukemia dataset; (d) heat maps for NMF (left) and K-means (right) clustering of leukemia dataset.
3.1 Data Preprocessing
The NMF analysis plug-in automatically detects whether input data contains negative values. For datasets that contain negative values, the Semi-NMF algorithm is used, while the standard NMF algorithm is used for non-negative data. Missing values are imputed to 0 for log-ratio input data or to row minimums for non-negative input data.
3.2 Analytical methods
The iterative algorithm is initiated with random initial values for the F and G matrices. The initial seed for the random number generator can be specified in the user dialog. Multiple iterative searches, as specified in the initial dialog, are run and the final factorization used is that with the smallest error. The error of a run is defined as the maximum element of matrix |X−FGT|.
Each sample i is placed into cluster j where Gij is the largest element in the i-th row of G as suggested by Brunet et al. (2004). A heat map showing the sample clustering and the genes differentially expressed among the clusters is given as output. The genes are identified by performing F-tests across clusters for all genes in the dataset. This is for descriptive purpose only as the P-values are not interpretable as tests of pre-stated null hypotheses. The differentially expressed genes are identified with gene symbols and hyperlinked to numerous biological annotations. K-means clustering, using the R function K-means, is included in the output for comparison purposes.
We have not incorporated any method for optimally selecting the number k of clusters because there has been no consensus on the criteria that should be used. Many methods for selecting the number of clusters have been proposed and could be applied to clusters generated by non-negative or semi-negative matrix factorization (Dudoit and Fridlyand, 2002; Gordon, 1999; Milligan and Cooper, 1985; Tibshirani et al., 2001). Brunet et al. suggest selecting a value of k based on a ‘connectivity matrix’ which indicates whether each pair of samples are contained in the same or different clusters. They suggest selecting k to make the connectivity matrix stable over multiple runs with different initial randomization seeds. This is similar to earlier proposals (Kerr and Churchill, 2001; McShane et al., 2002) to use stability of clustering under perturbations. The output of the plug-in provides some information to guide selection of the number of clusters. The element in the j-th column of row i of the sample assignment matrix G indicates the weight of the centroid for the j-th cluster in approximating the expression profile for sample i. If each of those row vectors is dominated by a single entry, each cluster is tight. This can be used as a partial guide for deciding whether to consider a larger number of clusters, but if used uncritically it can lead to selecting too large a value for k.
4 CASE STUDY
Application of the NMF analysis plug-in to the leukemia dataset of Brunet et al. (2004) is shown in Figure 1b. The seed for the random number generator for initial values of F and G matrices was specified in the input dialog and the maximum number of iterations was set to 2000. Standard NMF provided two-group clustering that accurately separated the acute myelogenous leukemia specimens from the acute lymphoblastic leukemia specimens. Only one sample was mis-assigned in this unsupervised clustering (Fig. 1b). One hundred and fifty significantly expressed genes were identified by the two sample t-test, with nominal threshold of significance set to 0.0001. The results were robust to the number of runs specified, but could vary with different randomization seeds.
Conflict of Interest: none declared.




Comments