ctcRbase: the gene expression database of circulating tumor cells and microemboli.

Circulating tumor cells/microemboli (CTCs/CTMs) are malignant cells that detach from cancerous lesions and are shed into the bloodstream. Analysis of CTCs allows tumor cell biomarker expression to be investigated from a non-invasive liquid biopsy. High-throughput technologies have become a powerful tool for providing a genome-wide view of the transcriptomic changes associated with CTCs/CTMs. These data provide much information for understanding tumor heterogeneity and the molecular mechanisms underlying tumor metastasis. Unfortunately, these data have been deposited in various repositories, and a uniform resource for cancer metastasis is still unavailable. To this end, we integrated previously published transcriptome datasets of CTCs/CTMs and constructed a web-accessible database. The first release of ctcRbase contains 526 CTC/CTM samples across seven cancer types. Expression data for 14 631 mRNAs and 3642 long non-coding RNAs in CTCs/CTMs are included, together with experimental validations from the published literature. Since CTCs/CTMs are considered to be precursors of metastases, ctcRbase also collects the expression data of primary tumors and metastases, which allows users to discover a unique 'circulating tumor cell gene signature' that is distinct from those of primary tumors and metastases. An easy-to-use database was constructed to query and browse CTC/CTM genes. ctcRbase is freely accessible at http://www.origin-gene.cn/database/ctcRbase/.


Introduction
Circulating tumor cells (CTCs) are tumor cells that originate from either primary or metastatic tumors and travel through the systemic circulation to distant organs, where they can initiate metastatic lesions (1-4). Circulating tumor microemboli (CTMs) are clusters of CTCs that can play an important role in the metastatic cascade (5). Because of intratumor heterogeneity, traditional surgical biopsies taken from one part of a tumor can miss information contained in other active regions. In contrast, CTCs/CTMs can consist of a mixture of cells shed from multiple active tumor regions, potentially providing a better representation of the invasive clones (1,6). Because isolating CTCs/CTMs is minimally invasive, it is a more practical approach for repeatedly monitoring disease progression (7). Analysis of CTCs allows cancer cell biomarker expression to be investigated from a non-invasive liquid biopsy, and it has emerged as one of the most active fields in cancer research (8-11). CTCs have shown potential for tumor diagnosis, treatment and monitoring.
Over the past years, great efforts have been made to detect and separate CTCs. The classical positive-selection method uses epithelial cell adhesion molecule (EpCAM), which is consistently expressed by epithelial-derived tumor cells and absent from normal leukocytes (12). Microfluidic CTC isolation technologies can effectively deplete leukocytes without manipulating CTCs (13). This approach preserves cell viability and, as far as possible, high-quality RNA content. High-throughput technology has become a powerful tool for providing a genome-wide view of the transcriptomic changes associated with CTCs/CTMs (14-18). These data provide much information for understanding tumor heterogeneity and the underlying mechanisms of tumor metastasis from a single-cell perspective. Single-cell sequencing provides a new method to identify etiology at the whole-genome level (19,20). After CTC isolation, single-cell sequencing can be applied to identify the genomic and transcriptomic characteristics of CTCs.
Unfortunately, these data have been deposited in various repositories, and a uniform resource for cancer metastasis is still unavailable. To this end, we integrated previously published transcriptome datasets of CTCs/CTMs and constructed a web-accessible database. An easy-to-use interface was constructed to query and browse CTC/CTM genes. The current version of ctcRbase is not only a comprehensive, up-to-date resource of CTC/CTM gene expression datasets, but also allows CTC/CTM expression to be compared with tumor/WBC expression for the seven cancers included. In addition, all CTC/CTM expression data contained in ctcRbase can be downloaded and reprocessed by users, which enhances the utility of ctcRbase. This knowledge should help researchers to better understand the molecular mechanisms underlying tumor metastasis, relapse and chemoresistance, and might eventually aid the development of new targeted cancer therapies.
ctcRbase also provides the expression profiles of primary tumors and metastasis sites. The expression data (FPKM values) of primary tumors were downloaded from The Cancer Genome Atlas (TCGA). We manually curated the TCGA clinical data and removed all metastasis samples. Tumor metastasis expression data were collected from the Human Cancer Metastasis Database (HCMDB) (21). For tumor metastasis, a total of 556 samples from 18 metastasis sites were downloaded.
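The removal of metastasis samples from the TCGA data was done by manual curation of the clinical data. As a rough programmatic complement, the sketch below assumes that each sample is identified by a TCGA barcode, whose fourth field begins with a two-digit sample type code (01 for primary solid tumor, 06 and 07 for metastatic, 11 for solid tissue normal). The function names and the example barcodes are hypothetical and serve only as an illustration; this is not the curation procedure used for ctcRbase.

```python
from typing import Iterable, List

# TCGA sample type codes that denote metastatic material.
METASTATIC_CODES = {"06", "07"}


def sample_type_code(barcode: str) -> str:
    """Return the two-digit sample type code from a TCGA barcode.

    A barcode looks like TCGA-<TSS>-<participant>-<sample+vial>-...;
    the first two characters of the fourth field are the sample type.
    """
    parts = barcode.split("-")
    if len(parts) < 4 or parts[0] != "TCGA" or len(parts[3]) < 2:
        raise ValueError(f"not a sample-level TCGA barcode: {barcode!r}")
    return parts[3][:2]


def drop_metastatic(barcodes: Iterable[str]) -> List[str]:
    """Keep every barcode whose sample type is not metastatic."""
    return [b for b in barcodes if sample_type_code(b) not in METASTATIC_CODES]


# Hypothetical barcodes, for illustration only.
example = ["TCGA-XX-0001-01A", "TCGA-XX-0002-06A", "TCGA-XX-0003-01B"]
print(drop_metastatic(example))
```

A barcode-based filter of this kind only catches samples that are labeled as metastatic at the sample level, so it would complement rather than replace inspection of the clinical annotations.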

Analysis of RNA-seq data
The raw sequencing data were first trimmed to remove adapters using TrimGalore. All CTC/CTM data were then processed through a consistent pipeline (Figure 1B). Sequencing reads were mapped to the human reference genome (GRCh38) with Bowtie (22), and RSEM (23) was then used to calculate read counts and FPKM (fragments per kilobase of transcript per million mapped reads) values. In total, 14 631 mRNAs and 3642 lncRNAs were collected in ctcRbase. The functions of some genes in CTCs/CTMs have been studied in previous work. To better annotate the genes expressed in CTCs/CTMs, we searched the PubMed database using the keywords 'CTC', 'CTCs', 'Circulating Tumor Cells', 'CTM', 'Circulating Tumor Microemboli' or 'CTC cluster'. After all the papers were downloaded, we filtered out those that did not describe the detailed functions of specific genes. We then screened the remaining papers and extracted the descriptions of conclusions about the genes expressed in CTCs/CTMs. Finally, we
To understand the complex RNA crosstalk in CTCs/CTMs, competing endogenous RNA networks were predicted. The mRNA-lncRNA cis-regulatory relationships were defined as pairs of genes located within a genomic window of 100 kb. miRNA targets were predicted by RNAhybrid (26), miRanda (27) and PITA (28). Genes that were identified by at least two of these tools were regarded as miRNA targets. Cytoscape was used to visualize the network.
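The two selection rules used for the competing endogenous RNA network (mRNA-lncRNA pairs within 100 kb, and miRNA targets supported by at least two prediction tools) can be illustrated with a minimal Python sketch. It assumes that "within a genomic window of 100 kb" means the gap between two gene bodies on the same chromosome is at most 100 000 bp, and that each tool's output has already been parsed into a set of (miRNA, target) pairs. All gene names, coordinates and predictions below are made up for illustration, and the functions are hypothetical helpers rather than part of the ctcRbase pipeline.

```python
from collections import Counter
from typing import Dict, List, Optional, Set, Tuple

CIS_WINDOW = 100_000  # 100 kb window stated in the text

Gene = Tuple[str, int, int]  # (chromosome, start, end)
Pair = Tuple[str, str]


def genomic_gap(a: Gene, b: Gene) -> Optional[int]:
    """Gap in bp between two genes (0 if they overlap); None if chromosomes differ."""
    if a[0] != b[0]:
        return None
    return max(0, max(a[1], b[1]) - min(a[2], b[2]))


def cis_pairs(mrnas: Dict[str, Gene], lncrnas: Dict[str, Gene],
              window: int = CIS_WINDOW) -> List[Pair]:
    """Return sorted (mRNA, lncRNA) pairs whose gap is at most `window` bp."""
    pairs = []
    for m_name, m_gene in mrnas.items():
        for l_name, l_gene in lncrnas.items():
            gap = genomic_gap(m_gene, l_gene)
            if gap is not None and gap <= window:
                pairs.append((m_name, l_name))
    return sorted(pairs)


def consensus_targets(predictions: Dict[str, Set[Pair]],
                      min_tools: int = 2) -> Set[Pair]:
    """Keep (miRNA, target) pairs predicted by at least `min_tools` tools."""
    votes: Counter = Counter()
    for tool_pairs in predictions.values():
        votes.update(tool_pairs)  # each tool contributes one vote per pair
    return {pair for pair, n in votes.items() if n >= min_tools}


# Made-up coordinates and predictions, for illustration only.
mrnas = {"GENE_A": ("chr1", 1_000_000, 1_020_000),
         "GENE_B": ("chr2", 500_000, 510_000)}
lncrnas = {"LNC_X": ("chr1", 1_090_000, 1_095_000),
           "LNC_Y": ("chr1", 2_000_000, 2_005_000)}
predictions = {
    "RNAhybrid": {("miR-1", "GENE_A"), ("miR-2", "GENE_B")},
    "miRanda": {("miR-1", "GENE_A")},
    "PITA": {("miR-2", "LNC_X")},
}
print(cis_pairs(mrnas, lncrnas))       # GENE_A and LNC_X are 70 kb apart
print(consensus_targets(predictions))  # only the pair with two votes survives
```

The all-against-all comparison in `cis_pairs` is quadratic in the number of genes; for genome-scale input an interval index or a sorted sweep would be the more practical choice.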

Database query and search platform
A user-friendly web interface was built to present ctcRbase, and several ways of querying the database are provided. First, the 'Search' page offers a search engine that accepts gene names (Gene Symbol, Ensembl ID or Entrez ID). Users can enter their genes of interest (mRNA or lncRNA) in the text box, and all database entries that contain the query genes are returned. The result page of the search function lists the database ID, gene symbol, category, cancer type and CTC/CTM isolation method. Additionally, users can select the mRNA or lncRNA category from the 'Browse' page (Figure 1C).
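The identifier search described here can be sketched as a single parameterized query. The example below is only an illustration: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the MySQL back end, and the table name `gene_record`, its columns and the demo rows are all hypothetical, not the actual ctcRbase schema or data.

```python
import sqlite3
from typing import List, Optional, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with placeholder rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE gene_record (
               database_id TEXT PRIMARY KEY,
               gene_symbol TEXT, ensembl_id TEXT, entrez_id TEXT,
               category TEXT, cancer_type TEXT, isolation_method TEXT)"""
    )
    rows = [
        ("DEMO0001", "GENE1", "ENSG_DEMO_1", "1001", "mRNA", "cancer type A", "method X"),
        ("DEMO0002", "GENE1", "ENSG_DEMO_1", "1001", "mRNA", "cancer type B", "method Y"),
        ("DEMO0003", "LNC1", "ENSG_DEMO_2", "2002", "lncRNA", "cancer type A", "method X"),
    ]
    conn.executemany("INSERT INTO gene_record VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    return conn


def search_gene(conn: sqlite3.Connection, query: str,
                category: Optional[str] = None) -> List[Tuple]:
    """Match the query against symbol, Ensembl ID or Entrez ID (case-insensitive)."""
    sql = (
        "SELECT database_id, gene_symbol, category, cancer_type, isolation_method "
        "FROM gene_record "
        "WHERE (UPPER(gene_symbol) = UPPER(?) OR UPPER(ensembl_id) = UPPER(?) "
        "OR entrez_id = ?)"
    )
    params = [query, query, query]
    if category is not None:  # optional mRNA / lncRNA filter, as on a browse page
        sql += " AND category = ?"
        params.append(category)
    sql += " ORDER BY database_id"
    return conn.execute(sql, params).fetchall()


conn = build_demo_db()
for row in search_gene(conn, "gene1"):
    print(row)
```

Passing the user input through `?` placeholders rather than string concatenation keeps the query safe against SQL injection, which matters for any search box exposed on the web.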

Database implementation
A user-friendly web interface was developed to present ctcRbase. All data were deposited in a MySQL database. The web interface for searching and browsing was implemented in JavaScript and PHP.

The result page for gene search
The result page for a single-gene search has five major sections: (i) general information; (ii) gene information; (iii) literature description; (iv) gene expression; and (v) competing endogenous RNA network (Figure 1C). General information provides the database ID, cancer type, dataset and CTC/CTM isolation method. Gene information shows the gene annotation, including gene symbol, full name, category, synonyms, location and gene summary. In the literature description section, ctcRbase provides descriptions curated from PubMed of the functional implications of the specific gene in CTCs/CTMs. In the gene expression section, ctcRbase shows the expression levels in primary cancer, CTCs/CTMs and whole blood cells (WBC). Moreover, ctcRbase allows users to compare gene expression among primary tumors, CTCs/CTMs and metastases. The last section provides the competing endogenous RNA network, where users can visualize the miRNA-lncRNA-mRNA regulatory network online.
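For users who download the expression data, the comparison among primary tumors, CTCs/CTMs and metastases can be reproduced offline. The following minimal sketch assumes per-sample FPKM values for one gene grouped by sample source and summarizes them as the median of log2(FPKM + 1); this transform, the function names and the FPKM values are illustrative assumptions, not the method or data behind the ctcRbase plots.

```python
import math
from statistics import median
from typing import Dict, List


def log2_fpkm(values: List[float]) -> List[float]:
    """Apply log2(FPKM + 1), a common variance-stabilizing transform."""
    return [math.log2(v + 1.0) for v in values]


def summarize(expression: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Per-group sample count and median log2(FPKM + 1); empty groups are skipped."""
    summary = {}
    for group, values in expression.items():
        if not values:
            continue
        summary[group] = {"n": len(values),
                          "median_log2": median(log2_fpkm(values))}
    return summary


def log2_fold_change(summary: Dict[str, Dict[str, float]],
                     group: str, reference: str) -> float:
    """Difference of group medians on the log2 scale."""
    return summary[group]["median_log2"] - summary[reference]["median_log2"]


# Made-up FPKM values for a single gene, for illustration only.
expression = {
    "primary tumor": [1.0, 3.0, 7.0],
    "CTC/CTM": [15.0, 31.0, 63.0],
    "metastasis": [3.0, 7.0, 15.0],
}
summary = summarize(expression)
for group, stats in summary.items():
    print(group, stats)
print("CTC/CTM vs primary tumor:",
      log2_fold_change(summary, "CTC/CTM", "primary tumor"))
```

Medians are used here because single-cell and low-input expression values tend to be skewed; a formal comparison would additionally need a statistical test and correction for batch effects between datasets.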

Perspectives and concluding remarks
Attention is now turning to non-invasive liquid biopsies, which enable the analysis of CTCs/CTMs in bodily fluids. Analysis of CTCs/CTMs has proved to be a promising approach in cancer diagnostics, with applications ranging from early tumor detection to treatment selection (1). Compared with other tumor-derived components, such as circulating tumor DNA (ctDNA), CTCs/CTMs provide a source of intact genomic and transcriptomic information for lineage-based analysis. Recently, single-cell and bulk-cell sequencing data of CTCs/CTMs have become available, which offers great opportunities to explore the transcriptomic landscape of CTCs/CTMs and can provide RNA-based signatures that enable highly specific detection of CTCs/CTMs (29,30). Furthermore, CTCs have been regarded as the seeds of subsequent metastases in distant organs (31). These data provide us with a comprehensive understanding of the metastatic cascade.
Although many single-cell RNA sequencing datasets have been published, these data are hard for researchers to use, and no accessible database has been available. ctcRbase is the first database for searching and visualizing the expression patterns of CTCs/CTMs together with those of their primary tumors, WBC and metastasis sites. It provides a better way for researchers to understand gene expression patterns in tumor metastasis. The data can reveal the gene regulation processes involved in metastasis and can facilitate our understanding of the hallmarks of metastasis. ctcRbase will be continuously updated to provide more information, including: (i) newly published CTC/CTM transcriptome data; (ii) DNA methylation data of CTCs/CTMs and their paired primary tumors; and (iii) more comprehensive CTC/CTM genetic and genomic data, including copy number variation and mutation data. We expect that ctcRbase will contribute to researchers' understanding of the mechanisms of tumor metastasis and may eventually help in identifying therapeutic approaches.