Mingyang Li, Tianxiu Zhou, Mingfei Han, Hongke Wang, Pengfei Bao, Yuhuan Tao, Xiaoqing Chen, Guansheng Wu, Tianyou Liu, Xiaojuan Wang, Qian Lu, Yunping Zhu, Zhi John Lu, cfOmics: a cell-free multi-Omics database for diseases, Nucleic Acids Research, Volume 52, Issue D1, 5 January 2024, Pages D607–D621, https://doi.org/10.1093/nar/gkad777
Abstract
Liquid biopsy has emerged as a promising non-invasive approach for detecting and monitoring diseases and predicting their recurrence. However, the effective utilization of liquid biopsy data to identify reliable biomarkers for various cancers and other diseases requires further exploration. Here, we present cfOmics, a web-accessible database (https://cfomics.ncRNAlab.org/) that integrates comprehensive multi-omics liquid biopsy data, including cfDNA and cfRNA data based on next-generation sequencing, and proteome and metabolome data based on mass spectrometry. As the first multi-omics database in the field, cfOmics encompasses a total of 17 distinct data types and 13 specimen variations across 69 disease conditions, with a collection of 11 345 samples. Moreover, cfOmics includes reported potential biomarkers for reference. To facilitate effective analysis and visualization of multi-omics data, cfOmics offers browsing, profile visualization, the Integrative Genomic Viewer, and correlation analysis, all centered around genes, microbes, or end-motifs. The primary objective of cfOmics is to assist researchers in the field of liquid biopsy by providing comprehensive multi-omics data. This enables them to explore cell-free data and extract insights that can significantly impact disease diagnosis, treatment monitoring, and management.

Introduction
In the last decades, molecular profiling has played a pivotal role in the diagnosis and monitoring of cancers and other diseases, providing clinicians with indispensable information to guide personalized treatment strategies (1). Traditionally, these profiling methods relied on invasive techniques involving the acquisition of resected tumor samples through surgical procedures. However, this approach is time-consuming, associated with increased risks, and poses challenges for continuous sampling for prognosis (2). In contrast, liquid biopsy presents an alternative paradigm by leveraging the molecular profiles present in body fluid samples from patients, obviating the limitations of conventional biopsy methods (3). Various fluid samples, including peripheral blood, plasma, urine, saliva, serum, and cerebrospinal fluid, offer greater accessibility with reduced temporal and financial costs, facilitating continuous and extensive sampling. As a result, liquid biopsy has emerged as an increasingly promising avenue in precision and clinical oncology, enabling the non-invasive diagnosis (4), monitoring (5) and prediction of recurrence rates (6) for cancers. Consequently, this non-invasive approach has gradually become an important supplement to traditional invasive methods, and its important position is increasingly reflected in precision oncology.
However, the selection of optimal molecular biomarkers for liquid biopsy remains a formidable challenge due to the heterogeneity and limited quantity of molecules present in body fluids. Plenty of studies have extensively investigated the detection of biomarkers for various cancers (7–17), encompassing the examination of cell-free DNA (cfDNA), cell-free RNA (cfRNA), protein, and metabolite biomarkers in bodily fluids. Furthermore, the advancement of next-generation sequencing technologies and bioinformatics tools has unlocked the potential for research on diverse molecular data types. For instance, in the case of cfDNA, the utilization of data types such as methylation (9), hydroxymethylation (7), fragmentomics (15), end-motif (13), microbiome (18), etc. has been explored. Similarly, numerous data types of cfRNA are being explored, including expression (abundance) (19), microbiome (20), chimeric RNA (21), etc. Recently, the amalgamation of cell-free multi-omics data, encompassing diverse omics and data types, has emerged as a promising avenue for liquid biopsy, surpassing the efficacy of single-omic methods (22,23). The integration of data from distinct omics can more comprehensively elucidate the variations in fluid molecules induced by cancers, and can effectively address the challenges arising from cancer heterogeneity (24). In the realm of multi-omics diagnostic models, noteworthy examples include CancerSEEK (10), which simultaneously combines protein and cfDNA markers to enable early detection of various common cancers; and HIFI, a multi-omics model involving methylation, end-motif, fragmentomics, and bincounts, which has successfully achieved early-stage lung cancer diagnosis (25,26).
The investigation of biomarkers within the realm of liquid biopsy necessitates a substantial amount of data support and resources, thereby emphasizing the pressing need for the establishment of pertinent databases. At present, two primary types of databases have emerged in this field: data-driven databases and knowledge-based databases, also known as knowledgebases. Data-driven databases primarily focus on housing original or processed research data. Such databases include notable examples like exoRBase 2.0 (27), CancerMIRNome (28), LiqDB (29), BBCancer (30), which serve as repositories for cfRNA data, and CFEA (31), FinaleDB (32), which are dedicated to the exploration of cfDNA data. On the other hand, knowledge-based databases are characterized by their collection of literature-reported information, encompassing various aspects such as biomarkers, experiments, and more. Among these databases are miRandola (33), Vesiclepedia (34), ExoBCD (35), ExoCarta (36), EV-ADD (37), Plasma Proteome Database (38) and the Urinary Exosome Protein Database (https://esbl.nhlbi.nih.gov/UrinaryExosomes/), etc. However, despite the availability of these databases, none of the existing data-driven ones comprehensively address all four omics types (cfDNA, cfRNA, proteome, metabolome). Furthermore, crucial data types like microbe abundance, chimeric RNA, and end-motifs remain lacking in the current liquid biopsy databases. Thus, there is a clear research gap that needs to be addressed in this domain.
In this study, we developed a novel, primarily data-driven database called the cell-free multi-Omics (cfOmics) database. We aimed to compile a comprehensive collection of molecular data from various body fluids, encompassing all four omics types and including data types that are absent from existing databases. cfOmics also provides functions for the integration, browsing, analysis and visualization of multi-omics data. Furthermore, cfOmics distinguishes itself by its comprehensiveness and inclusiveness, incorporating a total of 17 distinct data types, 11 345 samples and 69 disease conditions across 13 specimen variations. Importantly, cfOmics provides unrestricted access to all its data and information, allowing users to freely download them. Therefore, as far as we know, cfOmics stands as the first multi-omics database in this field, characterized by its comprehensive integration of multi-omics data, diverse processed feature types, extensive spectrum of body fluid specimens, user-friendly interface, and incorporation of literature-based biomarkers. These attributes empower researchers in the field of liquid biopsy, enabling them to explore cell-free multi-omics data and extract insights that can significantly impact disease diagnosis, treatment monitoring, and management.
Materials and methods
Data collection and curation
The cfOmics database encompasses 11 345 samples from various public databases, namely GEO (39) (https://www.ncbi.nlm.nih.gov/geo/), iProX (40) (https://www.iprox.cn/), PRIDE (41) (https://www.ebi.ac.uk/pride/), and GNPS (42) (https://gnps.ucsd.edu/ProteoSAFe/static/gnps-splash.jsp), involving 13 specimen types and 69 disease conditions (Figure 1). These samples cover four omics types: cfDNA, cfRNA, proteome and metabolome.

Figure 1. Framework of cfOmics. Datasets in cfOmics were curated from 4 public databases, GEO, iProX, GNPS and PRIDE (top left). It encompasses 69 disease conditions (including 28 cancers and 41 non-cancer disease conditions), 13 specimens, a total of 17 distinct data types and a collection of 11 345 samples (top right), providing intuitive and clear functions for browsing (middle left) and analyzing data inside (bottom).
To consolidate data features and gene information, we integrated the data using Ensembl's (43) annotation information about the gene and promoter of the human genome hg38. This integration encompassed data types that were not documented within a single gene unit, such as SNP ratio, editing ratio and alternative promoter. After this processing approach, we were able to present and analyze the data landscape encompassing the aforementioned feature data based on a gene-centric perspective, as well as the interrelationships between these features stored in the database. Furthermore, data features that lacked associations with genes, such as microbial abundance and end-motifs of fragments, were subjected to separate processing and subsequently displayed within the database. The microbial data was derived from kraken2 calculations, and the specific taxonomy information can be retrieved from the NCBI Taxonomy database (44) (https://www.ncbi.nlm.nih.gov/taxonomy).
Data processing
cfOmics encompasses diverse biological data, including cfDNA, cfRNA, proteins and metabolites. Each type of data has distinct significance and is presented in its own format, so we applied different processing techniques to each. Our data processing pipelines are summarized in Figure 2.

Figure 2. Pipelines of processing multi-omics data. Data of cfDNA and cfRNA are based on next-generation sequencing, whereas data of proteome and metabolome are based on mass-spectrometry.
For the cfDNA datasets, we calculated data features of cfDNA methylation (including non-enrichment-based methylation data and enrichment-based methylation data), nucleosome occupancy, end-motif abundance, fragment size, and microbial abundance.
To analyze non-enriched methylation data, we first cleaned the raw reads and aligned the clean data to the hg38 human genome with Bismark (45) (v0.15.0). The methylation details of each cytosine site within the genome were then extracted with bismark_methylation_extractor and examined for the presence of CpG sites. For each sample, we recorded the methylation levels (beta values) of gene bodies, promoters, and CpG islands (CGIs) in the annotation file of the human genome based on the following process: For a single cytosine site, we have
|${\rm{\beta }} = {\rm{M}}/({\rm{M}} + {\rm{U}})$|
Here, M signifies the number of methylated cytosines observed at the site, while U represents the number of unmethylated cytosines. The beta value of a region is then the mean of the beta values of all cytosines within that region.
These beta value records serve as quantitative indicators for this feature.
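The region-level calculation can be sketched in a few lines of Python. This is an illustrative sketch only, not the pipeline code: it assumes the conventional per-site definition of the beta value, M / (M + U), which matches the symbols M and U defined above, and it assumes that cytosines without coverage are skipped. The function names and the example counts are hypothetical.

```python
from statistics import mean
from typing import List, Optional, Tuple


def site_beta(methylated: int, unmethylated: int) -> Optional[float]:
    """Beta value of one cytosine, M / (M + U); None when the site has no reads."""
    total = methylated + unmethylated
    if total == 0:
        return None  # assumption: uncovered sites are ignored
    return methylated / total


def region_beta(sites: List[Tuple[int, int]]) -> Optional[float]:
    """Mean beta value over all covered cytosines of a region.

    `sites` holds one (M, U) pair per cytosine in the region
    (gene body, promoter or CGI).
    """
    betas = []
    for methylated, unmethylated in sites:
        beta = site_beta(methylated, unmethylated)
        if beta is not None:
            betas.append(beta)
    return mean(betas) if betas else None


# Hypothetical promoter with four cytosines, one of them uncovered.
promoter_sites = [(8, 2), (5, 5), (0, 10), (0, 0)]
print(region_beta(promoter_sites))
```

Averaging per-site beta values gives every covered cytosine the same weight regardless of its read depth, which is what the description above implies.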
To analyze methylation data through enrichment methods, we employed DIP-seq data as a representative example. Initially, we aligned the processed raw data to the genome using Bowtie2 (46) (v2.5.1). Subsequently, we computed the read counts for gene bodies, promoters, and CGIs. Ultimately, we standardized these counts into transcripts per million (TPM) values, serving as quantitative indicators for this feature.
To determine the fragment size, we first extracted alignment files for long fragments (151–220 nt) and short fragments (100–150 nt) from the alignment results using samtools (47) (v1.6) and awk (v4.0.2). Next, bedtools (48) (v2.31.0) coverage was used to calculate the read counts of long and short fragments over gene bodies and genomic 100 kb bins (segments of genomic DNA 100 kb in length). bedtools map was then used to calculate the average coverage level for each region. The data within each region were normalized by dividing by the total across all regions and then log2-transformed (long and short fragment data were processed separately). Lastly, the normalized and log-transformed data were used to calculate the fragment size ratio for each region, using the ratio calculation formula presented below, as the quantitative indicator of this feature (15).
To ascertain the end-motifs, which are short nucleotide sequences located at the termini of cfDNA fragments (49), we first used the pysam package (v0.21.0) to obtain the count and frequency of each 4-mer end-motif in the alignment output (13). The data were then normalized by converting the frequencies into proportions. In addition, the plot on the browse page offers researchers an option to convert these motifs into 2-mers and 3-mers, which facilitates customized visualization.
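As a rough illustration of the counting and normalization steps, the sketch below works on plain strings rather than on alignment files. It assumes that the 5′ end sequence of each fragment has already been extracted (for example with pysam), and it assumes that a 4-mer is collapsed into a 2-mer or 3-mer by keeping its first two or three bases; the source does not specify how the conversion is done. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def motif_proportions(fragment_ends: Iterable[str], k: int = 4) -> Dict[str, float]:
    """Proportion of each k-mer end-motif among all fragment ends."""
    counts = Counter(seq[:k].upper() for seq in fragment_ends if len(seq) >= k)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {motif: n / total for motif, n in counts.items()}


def collapse_motifs(proportions: Dict[str, float], k: int) -> Dict[str, float]:
    """Merge 4-mer proportions into shorter k-mers by their leading k bases.

    Assumption: the shorter motif is the prefix of the 4-mer.
    """
    collapsed: Counter = Counter()
    for motif, proportion in proportions.items():
        collapsed[motif[:k]] += proportion
    return dict(collapsed)


# Hypothetical 5' end sequences of four fragments.
ends = ["CCCAGT", "CCCATT", "CCTGAA", "AAATCG"]
four_mers = motif_proportions(ends)
print(four_mers)
print(collapse_motifs(four_mers, 2))
```

Because the collapsed values are sums of proportions, they still add up to 1 within a sample.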
The calculation of nucleosome occupancy was conducted by focusing on two distinctive regions associated with each gene in the database: a 150 bp span upstream and a 50 bp span downstream of the gene's transcription start site (TSS), denoted as ‘15t5’, as well as a 300 bp span upstream and a 100 bp span upstream of the first exon, denoted as ‘31e1’. In the computation of these characteristics, corresponding control regions for the feature areas were also employed. Specifically, for the 15t5 region, the control area encompassed a range of 2000 bp upstream to 1000 bp upstream of the TSS, as well as a range of 1000 bp downstream to 2000 bp downstream of the TSS. As for the 31e1 region, the control area covered a range of 2000–1000 bp upstream of Exon1, and a range of 1000–2000 bp downstream. Initially, the bedtools coverage tool was employed to determine the coverage of the alignment output within these regions. Subsequently, the nucleosome occupancy ratio of both 15t5 and 31e1 was calculated using the following formula (14), providing researchers with a quantitative indicator for this feature.
The microbial taxonomy information within the cfDNA and cfRNA datasets was acquired by aligning non-human genome-matching reads with the kraken2 (50) (v2.1.2) database. Subsequently, potential contamination from microorganisms in the standardized matrix was eliminated (20). The residual microbial taxonomic abundance contained in these reads was subsequently computed. Following this, abundance matrices were constructed for all microbial classifications and their respective abundances. To obtain standardized data (relative abundance), the matrix was then normalized by dividing the abundance of each microbial taxon in every sample by the sum of the abundances of all microbial taxa within the sample.
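The final normalization step can be written compactly. The sketch below assumes the abundance matrix is held as a nested dictionary (sample to taxon to abundance) after contaminant removal; the data layout, function name and example values are illustrative assumptions, not the database's internal format.

```python
from typing import Dict


def relative_abundance(
    abundance: Dict[str, Dict[str, float]]
) -> Dict[str, Dict[str, float]]:
    """Divide each taxon's abundance by the sample's total microbial abundance."""
    normalized = {}
    for sample, taxa in abundance.items():
        total = sum(taxa.values())
        normalized[sample] = {
            taxon: (value / total if total else 0.0)
            for taxon, value in taxa.items()
        }
    return normalized


# Hypothetical read counts per taxon for two samples.
matrix = {
    "sample_1": {"Escherichia": 30, "Staphylococcus": 10},
    "sample_2": {"Escherichia": 5, "Staphylococcus": 15},
}
print(relative_abundance(matrix))
```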
For the cfRNA dataset, we calculated multiple data features, including alternative promoters, gene expression, RNA editing, RNA SNP, alternative splicing, chimeric RNA, alternative polyadenylation, and microbial abundance. The calculation of microbial abundance is described above.
Upon acquisition of raw high-throughput sequencing data, initial preprocessing involved the utilization of cutadapt (v3.4) (51) for cleansing. After this, mitigation of GGG/CCC induced by template switching was executed. Subsequently, the processed data underwent alignment against spike-in sequences using STAR (52), with STAR being consistently employed for all subsequent alignments. Unmapped reads resulting from this alignment were then subjected to alignment against the UniVec Database (https://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/) to effectively mitigate potential vector DNA contamination. Following this, the pool of unmapped reads underwent alignment against the rRNAs of the human genome. Reads that did not align with rRNAs were further aligned against the hg38 human genome assembly and the non-coding RNAs sourced from MiTranscriptome (53). Subsequent to these steps, unmapped reads were mapped to the comprehensive set of circRNAs sourced from circBase (54). The outcomes of all alignment processes were leveraged for the computation of distinct RNA data features.
To determine gene expression (or RNA abundance), we first used the alignment data to construct an expression matrix with featureCounts (55) (v2.0.1). The expression matrix was then normalized to transcripts per million (TPM). The normalized TPM data serve as the quantitative measure of gene expression.
To determine the alternative promoter, we employed the salmon (56) (v0.8.1) tool to quantify the abundance of transcript isoforms, which was subsequently normalized to transcripts per million (TPM). By consolidating the TPM values of isoforms whose transcript start sites were within a 10-bp range (thus sharing the same promoter), we derived a measure of promoter activity.
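The consolidation of isoforms into promoters can be illustrated as follows. This is a sketch under stated assumptions: isoforms of one gene are sorted by transcript start site, and an isoform joins the current promoter if its start lies within 10 bp of the first start site in that group. The source does not state how the 10-bp range is anchored, so this grouping rule, the function name and the example values are hypothetical.

```python
from typing import List, Tuple


def promoter_activity(
    isoforms: List[Tuple[int, float]], window: int = 10
) -> List[Tuple[int, float]]:
    """Sum isoform TPMs per promoter for one gene.

    `isoforms` holds (transcript_start_site, tpm) pairs. Isoforms whose start
    sites lie within `window` bp of the first start site of a group are
    treated as sharing one promoter (assumed grouping rule).
    Returns (anchor_start_site, summed_tpm) pairs.
    """
    promoters: List[List[float]] = []
    for tss, tpm in sorted(isoforms):
        if promoters and tss - promoters[-1][0] <= window:
            promoters[-1][1] += tpm  # same promoter: accumulate activity
        else:
            promoters.append([tss, tpm])  # start a new promoter group
    return [(int(tss), tpm) for tss, tpm in promoters]


# Hypothetical gene with four isoforms and two promoters.
isoforms = [(1000, 5.0), (1004, 3.0), (1500, 2.0), (1009, 1.0)]
print(promoter_activity(isoforms))
```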
To determine chimeric RNA, we employed STAR-Fusion (52) (v1.10.0) to realign unaligned reads to chimeric junctions, detect chimeric RNA, and quantify its expression level.
During the computation of RNA editing, we utilized GATK (57) (v4.1.9.0) ASEReadCounter to identify editing sites derived from REDIportal (58). Subsequently, the read counts for allele reads and reference reads were calculated. The editing ratio was defined as the quotient of allele count divided by the total count.
During the computation of RNA single nucleotide polymorphisms (SNPs), the GATK SplitNCigarReads tool was employed to split intron-spanning reads, which ensured reliable SNP identification at the RNA level. Subsequently, GATK HaplotypeCaller was used to detect variants, which were then filtered using GATK VariantFiltration. This filtration process incorporated four specific criteria: (i) strand bias, determined through a Phred-scaled P-value obtained from Fisher's exact test (FS), with a threshold of <20; (ii) variant confidence (QUAL) divided by the unfiltered depth (QD), with a requirement of >2; (iii) a minimum read count at the variant site (DP) of >10 and (iv) an SNP quality (QUAL) exceeding 20. To quantify the SNP prevalence, the SNP ratio was calculated by dividing the allele count by the total count, which encompassed both the reference count and the allele count.
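The four criteria and the ratio can be expressed as two small functions. This sketch is not the GATK invocation used in the pipeline; it only restates the thresholds above in code, reading the FS threshold as "keep variants with FS < 20", which is an interpretation of the text. Function and parameter names are hypothetical. The same quotient (allele count over total count) is the one described above for the RNA editing ratio.

```python
from typing import Optional


def passes_filters(fs: float, qd: float, dp: int, qual: float) -> bool:
    """Return True when a variant satisfies all four criteria listed above.

    Assumption: a variant is kept when FS < 20, QD > 2, DP > 10 and QUAL > 20.
    """
    return fs < 20 and qd > 2 and dp > 10 and qual > 20


def snp_ratio(ref_count: int, alt_count: int) -> Optional[float]:
    """Allele count divided by total (reference + allele) count."""
    total = ref_count + alt_count
    return alt_count / total if total else None


# Hypothetical variant: low strand bias, adequate depth and quality.
print(passes_filters(fs=5.0, qd=12.5, dp=40, qual=300.0))
print(snp_ratio(ref_count=30, alt_count=10))
```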
During the process of RNA splicing analysis, we employed rMATS (59) (v4.1.2) software to compute the IncLevel of each gene based on the obtained reads. IncLevel, also known as exon inclusion level, signifies the proportion of gene transcripts encompassing a specific exon, which is the quantitative indicator for this feature.
|${\rm{IncLevel}} = ({\rm{I}}/{\rm{lI}})/({\rm{I}}/{\rm{lI}} + {\rm{S}}/{\rm{lS}})$|
|${\rm{I}}$|: reads number mapped to exon inclusion isoform; |${\rm{lI}}$|: effective length of exon inclusion isoform. |${\rm{S}}$|: reads number mapped to exon skipping isoform; |${\rm{lS}}$|: effective length of exon skipping isoform, where the effective length in rMATS refers specifically to the length of the coding region plus any included UTRs.
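For a single splicing event, the inclusion level can be computed as below. The sketch assumes the standard rMATS definition, in which read counts are first normalized by the effective length of their isoform: IncLevel = (I/lI) / (I/lI + S/lS). The function name and the example numbers are hypothetical.

```python
from typing import Optional


def inc_level(
    inclusion_reads: int,
    skipping_reads: int,
    inclusion_length: float,
    skipping_length: float,
) -> Optional[float]:
    """Exon inclusion level from length-normalized read counts.

    Returns None when neither isoform has supporting reads.
    """
    inclusion = inclusion_reads / inclusion_length
    skipping = skipping_reads / skipping_length
    if inclusion + skipping == 0:
        return None
    return inclusion / (inclusion + skipping)


# Hypothetical event: 30 inclusion reads over an effective length of 300,
# 10 skipping reads over an effective length of 100.
print(inc_level(30, 10, 300, 100))
```

Normalizing by effective length matters because the inclusion isoform offers more positions to which a read can map than the skipping isoform; raw counts alone would overstate inclusion.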
When assessing alternative polyadenylation, ΔPDUI (distal polyA site usage index) values were computed based on cfRNA alignment data. ΔPDUI serves as a metric for evaluating the relative utilization of a specific polyadenylation (polyA) site, which is determined by comparing the usage of a given polyA site with that of the most prevalent polyA site (or the distal polyA site, referring to the termination point of the longest 3′ UTR among all samples) within the same gene. A ΔPDUI value of 100% indicates the preponderance of the polyA site under examination (or the distal site), whereas a value of 0% signifies its complete non-utilization.
The following formula is used to calculate it:
|${\rm{PDUI}}^{\rm{i}} = {\rm{w}}_{\rm{L}}^{{\rm{i*}}}/({\rm{w}}_{\rm{L}}^{{\rm{i*}}} + {\rm{w}}_{\rm{S}}^{{\rm{i*}}})$|
where |${\rm{w}}_{\rm{L}}^{{\rm{i*}}}$| and |${\rm{w}}_{\rm{S}}^{{\rm{i*}}}$| are the estimated expression levels of transcripts with distal and proximal (or a particular polyA site) polyA sites for sample |${\rm{i}}$|.
For the calculation of the intensity of proteins, Mascot version 2.8 (60) was applied to process the raw mass spectrometry (MS) data with the following parameter settings: false discovery rate (FDR) of 0.05, precursor mass tolerance of 20 ppm, fragment tolerance of 0.05 Da, number of tryptic termini (NTT) of 2, maximum missed cleavage of 2, and fixed modification of carbamidomethyl on Cysteine. The MS/MS spectra were searched against the UniProt human protein database (version of 3 November 2022), which contained 20401 protein entries. PANDA (61) was applied to calculate protein intensities based on label-free or labeled quantification.
Label-free quantification
The label-free quantification method includes four steps: (i) three types of information, the mass-to-charge ratio (m/z), retention time (RT), and isotope intensity are obtained from extracted ion chromatograms (XICs); (ii) retention time (RT) alignment, cross-search, normalization and peptide identification; (iii) the XIC peak area is used for peptide quantification; (iv) we assume that peptides from the same protein have different weights. Therefore, protein abundance is calculated as the weighted average intensity of all peptides using the one-step Tukey's biweight algorithm. The weight of each peptide is defined as the distance between the intensity of that peptide and the median intensity of all peptides.
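Step (iv) can be illustrated with a generic one-step Tukey's biweight estimator. This is a sketch of the textbook algorithm, not PANDA's implementation: the tuning constant c = 5 and the small epsilon are commonly used defaults and are assumptions here, as are the function name and the example intensities. Each peptide's weight is derived from its distance to the median intensity, scaled by the median absolute deviation, so peptides far from the median contribute little or nothing.

```python
from statistics import median
from typing import List


def tukey_biweight(values: List[float], c: float = 5.0, epsilon: float = 1e-4) -> float:
    """One-step Tukey's biweight estimate of the center of `values`.

    c and epsilon are assumed defaults, not parameters taken from the source.
    """
    center = median(values)
    mad = median(abs(v - center) for v in values)  # median absolute deviation
    weights = []
    for v in values:
        u = (v - center) / (c * mad + epsilon)  # scaled distance to the median
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical peptide intensities of one protein, with one outlier.
peptides = [10.0, 11.0, 9.0, 10.0, 1000.0]
print(tukey_biweight(peptides))
```

In this example the outlying peptide receives a weight of zero, so the protein-level estimate stays close to the bulk of the peptide intensities.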
Labeled quantification
Tandem mass tag (TMT) labeling is applied for labeled quantification. Different from label-free quantification, TMT-based quantification uses reporter ion intensities to estimate peptide quantification values, which has been shown to afford higher precision than XIC-based quantitation.
The TMT-based quantitative proteomic analysis includes four steps: (i) secondary spectrum preprocessing, (ii) reporter ion extraction and correction, (iii) normalization and (iv) peptide quantification. Finally, the one-step Tukey's biweight algorithm is used to calculate protein abundance, similar to label-free quantification.
Regarding metabolome, the acquired raw data were converted to mzML format using MSConvert (62) and then processed using MS-DIAL version 5.10 (63), which includes data collection, peak detection, compound identification, and peak alignment. Data collection was performed with the following parameter settings: MS1 tolerance = 0.01 Da, MS2 tolerance = 0.025 Da, Retention time = 0–100 min, MS1 mass range = 0–2000 Da.
In MS-DIAL, the base peak chromatogram is extracted for each mass slice of 0.1 m/z with a step size of 0.05 m/z. MS-DIAL uses smoothing methods (the linearly weighted smoothing average as default), differential calculus, and noise estimations to detect peak tops and two edges from the base peak chromatograms. The peak intensity is then measured by peak height or area. The detected peak tops are shown as ‘spots’ in a spot plot with retention time (min) and MS1 data (m/z) axes. The retention time and base peak m/z of each peak spot are used for metabolite identification, while the peak intensity is used to represent the intensity of metabolites in the database.
Metabolite identification is carried out by searching the MS-DIAL metabolomics MSP spectral kits (All public MS/MS libraries) to match the obtained mass spectra with reference spectra of compounds. Four scores, namely RT similarity, MS1 similarity, isotope ratio similarity, and MS/MS similarity, were calculated based on retention time, accurate mass, isotope ratio, and MS/MS spectrum information. Each score was standardized to a range from 0 to 1, meaning no similarity and a perfect match, respectively. A weighted average of the four scores is used for compound identification.
The peak alignment algorithm in MS-DIAL is derived from the Joint Aligner implemented in MZmine (64). It consists of four major steps: (i) making a reference table, (ii) fitting each sample peak table to the reference peak table, (iii) filtering aligned peaks and (iv) interpolating missing values.
Correlation analysis
The main page of each gene offers a function for multi-omics correlation analysis. Users select two data types, together with their corresponding entities where applicable, and a specimen. The database server then searches for the disease conditions available for the selected data types and specimen.
For each disease condition found, two vectors are extracted, holding the values of the first and second data types, respectively, for the chosen specimen, entity, and gene. Each element of a vector is the feature value of one sample. If the two vectors differ in length, the longer one is truncated to the length of the shorter one. The resulting bi-omics sample-sample pairs are then used for the scatter plots and the computation of correlation coefficients.
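The pairing step can be sketched as follows. The source does not say which correlation coefficient is reported, so the Pearson coefficient is used here purely as an assumption; the function names and example values are hypothetical.

```python
from math import sqrt
from typing import List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors.

    Raises ZeroDivisionError if either vector is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def paired_correlation(
    first: List[float], second: List[float]
) -> Tuple[List[Tuple[float, float]], float]:
    """Truncate the longer vector, pair the samples and correlate them."""
    n = min(len(first), len(second))
    x, y = first[:n], second[:n]
    return list(zip(x, y)), pearson(x, y)


# Hypothetical feature values of one gene in one disease condition:
# five samples for the first data type, four for the second.
pairs, r = paired_correlation([1.0, 2.0, 3.0, 4.0, 99.0], [2.0, 4.0, 6.0, 8.0])
print(pairs, r)
```

Note that truncation pairs samples by position only; the pairs are meaningful as matched observations only if the two vectors are ordered consistently.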
Website settings
All processed data are stored in a MySQL 8.0 database (InnoDB engine). The cfOmics website is a single-page application (SPA) based on React.js (v18.0), styled with the Bootstrap 5 framework. All tables, graphs, and analysis utilities are generated by our Django (v4.1.7) backend, while the genome browser functions are powered by the igv.js (65) (v2.15.5) project. Specifically, the graphs are generated using a combination of the matplotlib (v3.7.1) and plotly (v5.9.0) Python libraries. For further technical details, please refer to our GitHub repository (https://github.com/choutianxius/cfomics.git) where the source code of both the backend and the frontend is available.
Results
Data summary
The cfOmics database presently encompasses a total of 11345 samples derived from four distinct omics categories, namely cfDNA, cfRNA, proteome, and metabolome. These samples encompass 69 disease conditions, including 28 distinct cancer types, such as liver cancer (LIHC), colorectal cancer (CRC), lung cancer (LUCA), breast cancer (BRCA), gastric cancer (GC), acute myeloid leukemia (LAML), melanoma (MEL), cholangiocarcinoma (CCC), chronic lymphocytic leukemia (CLL), diffuse large B-cell lymphoma (DLBC), head and neck cancer (HNC), and others. Furthermore, 41 non-cancerous diseases or disease conditions are included, such as atherosclerosis, Crohn's disease, HBV cirrhosis, hydrocephalus, non-ST-elevation myocardial infarction, epilepsy, stable angina pectoris, liver cirrhosis, and others (Figure 3A). Moreover, the specimens in cfOmics are sourced from 13 different specimen types, encompassing plasma, serum, whole blood, platelet, extracellular vesicles (EVs), urine, cerebrospinal fluid, circulating epithelial cells, circulating tumor cells, peripheral blood mononuclear cells, and red blood cells (Figure 3B). Across all the data, a comprehensive analysis has been conducted on a total of 17 distinct data types or features. Specifically, the analysis includes methylation and hydroxymethylation profiles, nucleosome occupancy, end-motifs, fragment sizes, and microbe abundance concerning cfDNA. Additionally, expression levels (RNA abundance), alternative promoters, alternative poly-adenylation, chimeric RNA, RNA SNPs, RNA editing, RNA splicing, and microbe abundance were examined for cfRNA. The intensity of proteins and metabolites in the corresponding specimens was also calculated (Figure 3C). Furthermore, a compilation of 878 biomarkers reported in the literature has been collated, encompassing 587 RNA biomarkers, 149 DNA biomarkers, 104 protein biomarkers, and 38 metabolite biomarkers. These biomarkers can be found on the corresponding gene sites.

Figure 3. Summary of data in cfOmics. Number of samples: (A) per disease condition, (B) per type of specimen and (C) per data type.
Browse cfOmics
We have developed four browsing modules corresponding to the four omics. Within each module, users select items or options, prompting the website to display data tables containing detailed information. The tables are equipped with search and download functions. At the top of the website, users can switch between different omics (Figure 4B, top). The ‘Feature Type’ option allows users to narrow their search to a specific type of data, with options provided in the format of ‘data type – value type’. Users can further refine their search by selecting a specific entity through the ‘Genetic Element’ option, such as genes, bins, or promoters, depending on the available data types. The term ‘entity’ here refers to the organization of the table. Additionally, users can select a particular specimen of interest using the ‘Specimen’ option (Figure 4B). Once these options are selected, the database displays the relevant datasets together with the associated disease conditions. Most diseases or cancers are represented by abbreviations. To obtain the complete names, users can click on the ‘Disease Details’ button. Similarly, a button labeled ‘Dataset Details’ provides additional information about the datasets. The data table is displayed below, with data points highlighted in green and fundamental entity information indicated in cyan. Users can download the table in .csv or .json format for customized analysis. Additionally, users can search for a gene using its HGNC symbol (66). Gene names displayed in blue and underlined can be clicked to access the gene main site for further analysis. Clicking on a column name sorts the table by that column (Figure 4C).
In the case where users select the data type of microbe or end-motifs, a stacked bar plot illustrating the proportion of each microbe taxonomy or end-motif across all disease conditions will be displayed below the data table. Based on this plot, users can also combine 4-mer end-motifs into 2-mer or 3-mer motifs. The relevant buttons for performing this action are provided (Figure 4D).

Figure 4. cfOmics browse module. (A) Enter the browse module through the top navigation bar or buttons on the main page. (B) Select specific omics, feature types, genetic elements and specimens, and then select a dataset to browse the data table. (C) The table result, with searching, sorting and downloading functions and hyperlinks to genes’ main sites. (D) For features like microbe and end-motifs, the browse module provides visualization for the proportion of taxonomies and end-motifs.
Search cfOmics and analyze the data
The cfOmics database facilitates the analysis of all genes, encompassing both coding and non-coding sequences, as well as 4-mer end-motifs and microbial taxonomies spanning from domains to species. Each of these entities has its own main site, where users can perform diverse analyses centered on that entity. The main site of a gene can be accessed in three ways: first, by searching on the main page (Figure 5B); second, by clicking the ‘Search’ button in the top navigation bar and entering either the HGNC symbol (e.g. TP53) or the Ensembl ID (e.g. ENSG00000141510) of the gene on the search page (Figure 5A and C), an approach that also applies to microbial taxonomies and end-motifs; and third, by following the hyperlinks to the main pages of genes, microbial taxonomies and end-motifs in the browse module's data tables (Figure 5D).

cfOmics search module, examples of searching for genes, microbe taxonomies and end-motifs. (A) Top navigation bar of cfOmics website. (B) Search for genes on the main page by HGNC symbol or Ensembl gene ID. (C) Search for genes, microbe taxonomies and end-motifs on the search page. (D) Search on the browse module.
Many valuable analyses and visualizations can be accessed on the main site. Let's consider the main sites dedicated to genes as an illustration. At the top of the page, essential information about the gene is presented, including its HGNC symbol, Ensembl gene ID, genomic location, and biotype. Hyperlinks to the NCBI (67,68) and Ensembl (43) databases for the gene are also provided (Figure 6A). To visualize the profile, users can choose an omics category and then select a feature type. Subsequently, users can select relevant datasets (collections) and specify the appropriate specimen. In cases where certain feature types are associated with different genetic elements relative to the gene, such as promoters for methylation, additional options will be provided to select the corresponding elements for such features (Figure 6B). Once all the parameters have been specified, users can view the data profile on the website. For features that have a single record per gene (e.g. methylation, expression), the database will generate a box plot that encompasses all disease conditions (Figure 6C). Conversely, for features with multiple records per gene (e.g. alternative promoter, RNA SNP, chimeric RNA), a table, bar chart, and stacked box plot will be provided to offer a comprehensive overview of the profile (Figure 6C, F). Furthermore, a hierarchically clustered heatmap is available to visualize the profile (Figure 6D). Users can also perform a comparative analysis using the Mann-Whitney test between two disease conditions to determine if a significant difference exists (Figure 6E). Additionally, this page displays biomarkers associated with the gene, as reported in the literature, providing information such as literature references, journal details, PMID on PubMed (https://pubmed.ncbi.nlm.nih.gov/), and the molecular type of the biomarker (Figure 6G). 
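To make the comparison step concrete, the following standard-library sketch implements a two-sided Mann-Whitney U test using the normal approximation with midranks for ties. It is not the website's implementation, which may use an exact test or a continuity correction; the function name and the toy values are assumptions for illustration.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U statistic for x, approximate two-sided p-value).
    Ties receive midranks and the variance is tie-corrected;
    no continuity correction is applied.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups need at least one value")
    # Pool the values, remembering which group each came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                      # size of this tie group
        midrank = (i + 1 + j) / 2.0    # average of ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # all values identical: no evidence of a difference
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return u_x, p_value


# Toy values for two hypothetical disease conditions.
condition_a = [1.0, 2.0, 3.0]
condition_b = [4.0, 5.0, 6.0]
u_stat, p = mann_whitney_u(condition_a, condition_b)
print(f"U = {u_stat}, p = {p:.4f}")
```

For small groups the normal approximation is rough, so an exact test from a statistics library would be preferable in practice.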
For multi-omics visualization and analysis, users can explore the landscape of read counts, methylation levels, fragment size ratios, nucleosome occupancy ratios, chimeric RNAs, SNPs, editing sites and splicing events using the Integrative Genomics Viewer (IGV) (69) (Figure 6H). Moreover, users can conduct correlation analysis between two types of features in a specific specimen type by selecting the two features, their corresponding entities, and the specimen. Based on the selected options, linear regression using ordinary least squares (OLS) is performed for each disease condition. Users can hover over the plot to view the linear equation and the R-squared value (Figure 6I). On the main pages dedicated to microbes, users can access detailed taxonomy information from the NCBI taxonomy database through the provided hyperlink. Statistical analysis of differences is also available on the main page of end-motifs.
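The per-condition regression can be sketched as follows. This is a minimal standard-library illustration of ordinary least squares with one predictor, fitted separately for each disease condition; the function names, condition labels and data are hypothetical, and the website's own implementation is not described in the source.

```python
def ols_fit(x, y):
    """Fit y = slope * x + intercept by ordinary least squares.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y need the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has zero variance")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return slope, intercept, r_squared


def fit_per_condition(samples):
    """samples: iterable of (condition, x, y) triples.

    Returns {condition: (slope, intercept, r_squared)}.
    """
    grouped = {}
    for cond, xv, yv in samples:
        xs, ys = grouped.setdefault(cond, ([], []))
        xs.append(xv)
        ys.append(yv)
    return {cond: ols_fit(xs, ys) for cond, (xs, ys) in grouped.items()}


# Toy paired feature values for two hypothetical conditions.
toy = [("A", 1.0, 3.0), ("A", 2.0, 5.0), ("A", 3.0, 7.0),
       ("B", 1.0, 4.0), ("B", 2.0, 2.0), ("B", 3.0, 3.0)]
for cond, (slope, intercept, r2) in fit_per_condition(toy).items():
    print(f"{cond}: y = {slope:.2f}x + {intercept:.2f}, R^2 = {r2:.2f}")
```

Fitting each condition separately, rather than pooling all samples, is what allows the sign and strength of the relationship to differ between conditions.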

Examples of analysis functions of cfOmics. (A) Basic information provided on the main site. (B) Options in the analysis module. (C) Profile visualization of data as a box plot. (D) Profile visualization of data as a clustered heatmap. (E) Comparison analysis function. (F) Profile visualization of data as a table and bar chart. (G) Related literature reporting the gene as a biomarker. (H) Integrative Genomics Viewer (IGV) function for multi-omics visualization. (I) Correlation analysis function.
Download data from cfOmics
We have developed a website designed to present comprehensive information on the cfOmics datasets, encompassing details such as sample and specimen quantities, library types, disease classifications, as well as pertinent publication and journal data. Moreover, our database provides access to all processed data via a structured questionnaire. Following the submission of the questionnaire, users will receive an email containing the website link for downloading the requested data.
An example application of cfOmics
The cfOmics database constitutes a valuable repository for exploring biomarker profiles, assessing their potential, and identifying the data features that most effectively distinguish cancer from related diseases or healthy individuals. To begin, users can access the Browse Page and make selections in the ‘Feature Type’, ‘Genetic Element’ and ‘Specimen’ categories to investigate the methylation profiles of genes in plasma samples. If the user is particularly interested in the performance of the gene VIM, encoding vimentin, in colorectal cancer (CRC) in plasma samples, they may choose a pertinent dataset containing CRC plasma samples, such as GSE124600, where the gene VIM was identified as a CRC biomarker (70). A data table encompassing both coding and non-coding genes is then presented. From this table, the user can see that the mean methylation level (beta value) of VIM in CRC samples (57.24) exceeds that in control samples (45.17) (Figure 7A). Clicking on the gene name redirects the user to its main page, which includes hyperlinks to databases such as NCBI and Ensembl that provide comprehensive information on VIM. The main page also offers access to methylation profiles and comparison functions. Visualizing the profiles shows that the overall beta value of VIM in CRC samples is higher than that in control samples (Figure 7B). This difference is also statistically significant, as indicated by the result of the comparison function (Figure 7C), corroborating previous literature findings (70).

Example applications of studying a gene using cfOmics. (A) Data table showing the beta value of a gene, VIM, in colorectal cancer (CRC). (B) Profile visualization showing the beta value of VIM in CRC and control samples. (C) Comparison between CRC and control in terms of the beta value of VIM. (D) Comparison between CRC and control in terms of fragment size of VIM. (E) Comparison between CRC and control in terms of nucleosome occupancy of VIM. (F) Heatmap showing expression of VIM using different promoters among seven disease conditions. (G) Comparison between CRC and control in terms of alternative promoters of VIM. (H) Correlation analysis of VIM in terms of expression and alternative poly-adenylation in three disease conditions.
In addition to methylation, users can also explore the potential of other data types in discriminating CRC from controls. For instance, by adjusting appropriate options in the analysis module, users can ascertain that cfDNA fragment size of VIM does not exhibit a significant difference between CRC and healthy samples (Figure 7D), nor does nucleosome occupancy (Figure 7E). For data types of cfRNA, such as alternative promoters, users can identify the promoter that contributes the highest level of abundance in the fluid. In this case, the promoter ‘ensg000026025.15|vim|protein_coding|1868|17229278.+’ exhibits the greatest abundance in both control samples and cancer samples (Figure 7F), and its abundance in CRC samples is significantly lower than that in healthy samples in tumor-educated platelets (TEP) (Figure 7G). Additionally, the gene's prospective application in various specimens can be explored, extending beyond plasma. Within the context of multi-omics analysis, users can quantitatively investigate the relationships between multiple data types using the ‘Correlation Analysis’ module, which presents regression results. For example, the RNA abundance of the VIM gene is inversely correlated with the total polyA site usage of the VIM RNA in healthy extracellular vesicle samples, whereas it is positively correlated in diseased samples. The positive correlation is more pronounced in PDAC samples than in chronic pancreatitis samples (Figure 7H).
Discussion
With the development of NGS library construction techniques and bioinformatics tools, liquid biopsy is becoming an increasingly popular and powerful method for cancer diagnosis, monitoring, and prediction of recurrence, compared with tissue biopsy. Research that focuses on individual omics data types and research that uses multi-omics methods are both promising avenues for biomarker discovery. cfOmics is the first and only liquid biopsy database that comprehensively covers all four omics types, as well as some data types not available in other databases. Users can browse the cfOmics database and conduct related analysis and visualization of 17 omics data types across 13 specimens. The database also covers 69 disease conditions for users to explore. Multi-omics data can be integrated with the tools provided in cfOmics. Thus, researchers can attain a more holistic understanding of the molecular alterations implicated in a given disease with the help of cfOmics.
Unlike databases such as Vesiclepedia and ExoCarta, which primarily focus on recorded biomarkers, cfOmics focuses on multi-omics and comprehensive high-throughput data, as well as the resulting features processed by our standardized bioinformatics pipelines. Meanwhile, recorded biomarkers serve as an auxiliary function that can be used to enhance users' understanding of genes based on the data we provide.
Since not all pairs of data types share the same specimen and disease condition, it is not surprising to see the ‘No data suitable’ warning when conducting correlation analysis. We are working to increase the volume of data to solve the problem in the next version.
We will continuously update cfOmics, including uploading more liquid biopsy data, incorporating more disease conditions and specimens, and adding more useful and powerful analysis and visualization functions. We are confident that cfOmics will provide a more comprehensive liquid biopsy data profile and continue to excel in this field.
Data availability
All the data and visualizations described are freely available at https://cfomics.ncRNAlab.org/.
Acknowledgements
The authors thank the contributors for providing cfDNA, cfRNA, proteome and metabolome datasets generously.
Author contributions: Mingyang Li: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Visualization. Tianxiu Zhou: Conceptualization, Methodology, Software, Visualization. Hongke Wang: Conceptualization, Project administration, Formal analysis, Resources. Yuhuan Tao: Conceptualization, Methodology, Formal analysis, Resources. Pengfei Bao: Conceptualization, Methodology, Formal analysis, Resources. Mingfei Han: Methodology, Formal analysis. Xiaoqing Chen: Methodology, Formal analysis. Guansheng Wu, Tianyou Liu: Software. Qian Lu, Yunping Zhu, Zhi John Lu: Resource, Supervision, Project administration, Conceptualization. All authors contributed to the paper writing.
Funding
Tsinghua University Spring Breeze Fund [2021Z99CFY022]; National Natural Science Foundation of China [81972798, 32170671]; Tsinghua University Initiative Scientific Research Program of Precision Medicine [2022ZLA003]; Tsinghua University Guoqiang Institute Grant [2021GQG1020]; Bioinformatics Platform of National Center for Protein Sciences (Beijing) [2021- NCPSB-005]; National Key Research Program of China [2021YFA1301603]; Bayer Micro-funding; Beijing Advanced Innovation Center for Structural Biology; Bio-Computing Platform of Tsinghua University Branch of China National Center for Protein Sciences.
Conflict of interest statement. None declared.
References
Author notes
The authors wish it to be known that, in their opinion, the first three authors should be regarded as Joint First Authors.