A probabilistic multi-omics data matching method for detecting sample errors in integrative analysis

Abstract Background Data errors, including sample swapping and mis-labeling, are inevitable in the process of large-scale omics data generation. Data errors need to be identified and corrected before integrative data analyses where different types of data are merged on the basis of the annotated labels. Data with labeling errors dampen true biological signals. More importantly, data analysis with sample errors could lead to wrong scientific conclusions. We developed a robust probabilistic multi-omics data matching procedure, proMODMatcher, to curate data and identify and correct data annotation and errors in large databases. Results Application to simulated datasets suggests that proMODMatcher achieved robust statistical power even when the number of cis-associations was small and/or the number of samples was large. Application of our proMODMatcher to multi-omics datasets in The Cancer Genome Atlas and International Cancer Genome Consortium identified sample errors in multiple cancer datasets. Our procedure was not only able to identify sample-labeling errors but also to unambiguously identify the source of the errors. Our results demonstrate that these errors should be identified and corrected before integrative analysis. Conclusions Our results indicate that sample-labeling errors were common in large multi-omics datasets. These errors should be corrected before integrative analysis.


Background
With advances in high-throughput technologies in the past 2 decades, diverse types of omics data at multiple layers of regulation have been generated to survey complex human diseases [1][2][3], which arise from dysregulations of interplays among these multiple layers of regulations including genetics, epigenetics, transcriptomics, metabolomics, glycomics, and proteomics. Therefore, integration of multi-omics data at multiple layers of regulation is essential to derive a holistic view of molec-ular mechanisms underlying complex human disease. Previous studies have shown that simultaneously considering diverse types of biological data results in more complete understandings of biological systems [4][5][6].
Recently, many large projects, such as The Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium (ICGC), have generated diverse types of omics data for public use. However, data errors, including sample swapping, mis-labeling, and improper data entry, are almost inevitable in the process of large-scale data generation and management. Westra et al. [7] found ∼20% of mis-matched samples between genotype and gene expression data. Yoo et al. [8] demonstrated that samplelabeling errors occurred in almost every database examined. Also, there are studies to identify cross-individual contamination in next-generation sequencing data from TCGA samples [9,10].
Identifying and ultimately correcting these sample errors are critical for statistical data analysis, especially for integrative analysis. Data errors need to be identified and corrected before extensive efforts are devoted to data analysis. Analyzing data with sample errors is a waste of limited public resources. More importantly, data analysis with sample errors could lead to wrong scientific conclusions. Furthermore, sample errors have more significant effects on integrative data analysis where different types of data are merged on the basis of the annotated labels. Some types of sample errors can be detected during data quality control on each individual type of data, whereas such sample errors as sample swapping or mis-labeling are difficult to detect by means of data quality control on that individual type of data alone.
Previously, we developed a sample-mapping procedure called MODMatcher (Multi-Omics Data matcher) [8], which is not only able to identify mis-matched omics profile pairs but also to properly map them to correct samples based on other omics data. The main idea is first to identify "biological cisassociations" between 2 types of omics data and then to use these biological cis-associations as intrinsic barcodes to match different types of omics data. The types of biological cisassociations are different when different pairs of omics data are mapped, but they all reflect general biological regulations. For example, when mapping genotype and gene expression data, the method is based on cis-genetic regulation of expression traits (or expression quantitative trait loci-cis-eQTLs), where a genetic polymorphism at a gene's promoter or regulatory region affects the binding of transcription factors or co-factors, which in turn affects the abundance of the gene's transcript [11]. Similarly, when mapping methylation and gene expression data, the method leverages on cis-methylation regulation of expression traits (or cis-methyls), where a high DNA methylation level of CpGs at a gene's promoter or regulatory region hinders the binding of transcription factors or co-factors, which in turn represses the gene's transcription [12]. More details on biological cis-associations are provided in the Methods section.
We demonstrated that the statistical power to identify biological signals increases after database cleaning by applying the MODMatcher procedure to multiple large-scale public multiomics datasets from the Lung Genomic Research Consortium and TCGA. The power of MODMatcher depends on the number of intrinsic biological cis-associations that can be identified. The power of MODMatcher decreases when the number of cisassociations between 2 omics profiles is small. However, in some cases (a few examples are detailed in the Results), the number of possible intrinsic biological cis-associations is small, and new methods are needed for these types of applications.
In this study, we extended MODMatcher and developed a robust probabilistic multi-omics data-matching procedure, pro-MODMatcher, to curate data and identify and unambiguously correct data annotation and metadata attribute errors in large databases. First, we applied the proMODMatcher to simulated datasets to assess the statistical power of our procedure. The results suggest that proMODMatcher achieved robust statistical power even when the number of cis-associations was small and/or the number of samples was large. Next, we applied the proMODMatcher procedure to multiple large-scale publicly available multi-omics datasets from TCGA and, in particular, focused on the omics profiles that have small numbers of intrinsic cis-associations including microRNA (miRNA) expression and reverse phase protein array (RPPA). Additionally, we applied proMODMatcher to large-scale publicly available multi-omics datasets in ICGC. Our results indicate that sample-labeling errors were common in large multi-omics datasets. These errors should be corrected before integrative analysis is performed.

TCGA datasets
For the TCGA breast invasive carcinoma (BRCA) dataset, level 3 data of gene expression, DNA methylation, miRNA expression, and copy number variation (CNV) was downloaded from the Genomic Data Commons data portal [13]. For gene expression profiles, the IlluminaHiSeq RNASeqV2 and AgilentG4502A platform were used. Illumina HumanMethylation27 (HM27) and Human-Methylation450 (HM450) Beadchip were used for DNA methylation bisulfide sequencing. The IlluminaHiSeq miRNASeq and Il-luminaGA miRNASeq platforms were used to profile miRNA expression. Affymetrix Genome-Wide Human SNP Array 6.0 was used for CNV. The protein expression levels were measured in RPPA and downloaded. Each of the level 3 profiles was reformatted into a matrix with genes (or probes) as rows and barcodes of samples as columns. For methylation profiles and CNV, the probes or segments were mapped to hg19 gene symbols. Different profiles were initially matched according to their barcodes. The mapping files of HM450, RPPA, and miRNA are available in the source code.
For other types of cancers in TCGA, we downloaded gene expression, miRNA expression, CNV, DNA methylation, and RPPA data from the Firehose database [14]. For RPPA data, we filtered genes with >25% of samples with not-assigned measurements.

ICGC datasets
For the ICGC datasets, the pre-processed data were downloaded from the ICGC data portal [15]. We selected datasets with >1 available types of omics data including mesenger RNA (mRNA) expression profiles (i.e., RNA sequencing [RNAseq] and Array), DNA methylation profiles based on Illumina HM450, miRNA expression profiles, and copy number somatic mutation profiles. Each of the profiles was reformatted into a matrix with genes (or probes) as rows and barcodes of samples as columns. The gene and miRNA expression profiles were log 2 transformed and normalized by quantile normalization [16]. For copy number somatic mutation profiles, the segments were mapped to hg19 gene symbols. Some datasets contain very sparse segment information for copy number somatic mutation profiles such as CLLE-ES. We excluded these copy number profiles from further analysis. For methylation profiles, the probes were mapped to hg19 gene symbols.

Simulation study
Simulated datasets for testing alignment between a pair of omics profiles were generated. Given a set of N cis-associations, each of correlation coefficient r n , we can simulate omics profiles ϒ based on omics profiles X for M samples as follows: where is a standard normal distribution, N(0, 1). For each N and M combination, we simulated N significant sets with r n drawn from a truncated normal distribution with a cut-off value corresponding to correlation coefficients q-value < 0.05, as well as 2,000 sets of random r n drawn from a normal distribution. We considered N significant cis-associations from 75 through 1000, and M samples from 100 through 1,000. The simulated data with label error were generated by permuting the labels of 1 type of data. We considered 0, 2, ..., 10% label error rates. We measured sensitivity (i.e., recall) = number of truly aligned pairs divided by number of simulated pairs, specificity (i.e., precision) = number of truly aligned pairs divided by number of aligned pairs, falsepositive rate (FPR) = 1 − specificity, and F measures (= 2 × [(precision × recall)/(precision + recall)]) for assessment. Additionally, because a pair of omics profiles mostly has unbalanced samples, we mimic this by adding 10% of M samples for Type A and Type B omics profiles.

Analyses
Overview of proMODMatcher procedure proMODMatcher followed the general framework of multi-omics data matching of the previous study [8]. Two types of data (or profiles) (i.e., Type A and Type B in Fig. 1) were matched based on their cis-associations. Samples were initially matched based on annotated sample ID and potential cis-associations (Fig. 1A). The significant cis-associations from 2 different data types were identified by the Spearman correlations (Fig. 1B). The data for each cis-association were normal rank-transformed (Fig. 1B). The profile similarity between the 2 types of data S(A i , B j ) is defined as the correlation between profile i of Type A and profile j of Type B (Fig. 1C). The probability of a match between profile i of Type A and profile j of Type B is estimated by evaluating a similarity score in a bivariate normal distribution (Fig. 1D). Based on probability of a match, proMODMatcher determines self-or cross-alignments for each match. First, profile pairs matched by annotated sample IDs were checked to determine whether their similarity scores were high (Fig. 1D), in which case they would be annotated as "self-aligned." If not, additional steps were applied to find any potential matches among other unmatched profiles (Fig. 1E). The matched profile pairs were then used to update significant cis-associations. We iteratively refined profile alignment, and rounds of alignments were repeated until there were no further updates (Fig. 1F).

Simulation studies
Numbers of significant cis-associations and samples are 2 important deterministic factors of similarity scores, as well as the accuracy of omics profile alignment results. To investigate the effect of numbers of samples and cis-associations, we simulated datasets with different numbers of samples and significant cis-associations and applied MODMatcher and proMODMatcher to the simulated datasets. For MODMatcher, when the number of cis-associations was >200, almost all profile pairs could be aligned at high accuracy (FPR vs sensitivity) (Fig. 2). The similarity scores of matched pairs based on a low number of cisassociations were more variable, resulting in lower accuracies ( Supplementary Fig. S1). This result indicates that the MOD-Matcher can be applied to align omics profile pairs with >200 cisassociations, such as methylation-mRNA profiles with >7,000 intrinsic cis-associations and mRNA-CNV profiles with >10,000 intrinsic cis-associations [8]. On the other hand, when the number of cis-associations was ∼200 or less, the accuracy of sample alignments decreased as the number of samples increased ( Fig. 2). When aligning gene expression profiles with miRNA or RPPA profiles, the number of candidate intrinsic cis-associations was small (detailed below). Thus, MODMatcher was not powered to accurately align these types of profile pairs. The proMODMatcher was applied to the same simulated datasets and was able to achieve high sensitivities and low FPRs across a wide range of numbers of cis-associations and samples (Fig. 3A). When compared with MODMatcher's results, proMOD-Matcher resulted in better accuracies (F measure in Fig. 3B), similar sensitivities (Fig. 3C), and better specificities (Fig. 3D).
We further investigated their performances when there were labeling errors. Datasets with sample-labeling errors (i.e., 4% and 6%) were simulated by randomly assigning some samples' labels; then proMODMatcher and MODMatcher were applied to identify aligned profile pairs. As expected, when a larger number of cis-associations was available, proMODMatcher achieved a high sensitivity and low FPR (Fig. 3A). Across all tested combinations of numbers of cis-associations and samples, proMOD-Matcher resulted in >99% accuracy with 4-6% input labeling error rates, consistently outperforming MODMatcher (Fig. 3B). The top goal of MODMatcher and proMODMatcher is to identify sample-labeling errors without introducing any errors. Thus, we optimized the specificity of proMODMatcher over its sensitivity. In terms of the contribution of sensitivity and specificity to F scores, proMODMatcher achieved a similar sensitivity as MOD-Matcher ( Fig. 3C) but better specificities in all cases (Fig. 3D). These simulation results suggest that proMODMatcher is applicable for identifying and correcting labeling errors even when the number of cis-associations is small such as pairing mRNA-miRNA or mRNA-RPPA profiles.

Application to TCGA breast cancer dataset: mRNA and miRNA profiles
Multiple omics data, including profiles of mRNA, miRNA, protein, DNA methylation, and CNV, were available in TCGA. The proMODMatcher was applied to align methylation and/or CNV profiles to mRNA profiles, similar to what we did previously [8]. Here we focused on alignment of miRNA expression profiles to mRNA expression data because the number of candidate intrinsic cis-associations between miRNA and mRNA profiles was small. We used the TCGA breast cancer (BRCA) dataset as an example to illustrate the profile alignment results in detail. There were mRNA expression profiles based on 2 different platforms, Agilent microarray and RNAseq technology. There were 519 tumor samples with both mRNA expression measured in Agilent microarray and miRNA expression measured by small-RNA sequencing method, and 1,041 tumor samples with both mRNA expression measured in RNAseq and miRNA measured by small-RNA sequencing method. A small portion of miRNAs are embedded in gene regions (i.e., host genes) and frequently cotranscribed with host genes [17,18] (Fig. 4A); embedded miRNAhost gene pairs were candidate intrinsic cis-associations. A total of 1,222 miRNAs were profiled, and 227 and 271 of them were mapped to host genes, for Agilent microarray and RNAseq data, respectively. Among them, 138 of 227 and 175 of 271 miRNA-host gene pairs were significantly associated with each other at qvalue < 0.05, for Agilent microarray and RNAseq data, respectively. For example, in the case of miR-452 located in the gene body of GABRE, its expression was highly associated with mRNA expression of GABRE (Fig. 4B). On the basis of these intrinsic cis-associations between expression levels of miRNAs and host genes, we aligned the 2 types of omics data.  .

S(A i ,B j )= corr(RT(Exp i A ), RT(Exp
. The significant cis-associations from 2 different data types were identified by the Spearman correlation. The data for each cis relationship were normal rank-transformed.

RT(Exp T(Exp
(C) The sample similarity score between the 2 types of data S(Ai , B j ) is defined as the Spearman correlation between normal rank-transformed profiles. (D) The proMODMatcher evaluated the similarity score of a match, S(Ai , B j ), by calculating the probability of a match estimated on the basis of a score distribution of (S(Ai , Bn), S(An, B j )), where An and Bn represent Type A and Type B profile of the n th matched profile pairs. (E) In the determine self-aligned vs cross-aligned step, profile pairs matched by sample IDs were checked to determine whether their similarity scores were high, in which case they would be annotated as "self-aligned." If not, additional steps were applied to find any potential matches among other unmatched profiles. The matched profile pairs were used to update significant cisassociations.

Aligning gene expression profiles by RNAseq and miRNAseq data
The similarity scores of self-aligned gene expression-miRNA expression profiles were much higher than other possible pairings in general ( Fig. 4C): 898 of 1,041 (86.3%) similarity scores for self-self RNAseq-miRNAseq profiles were ranked in the top 2%. For example, the similarity score for the self-aligned profiles of TCGA−D8−A1JH-01 was top ranked among other possible pairings (Fig. 4D). A total of 143 miRNA profiles that were not matched to the corresponding mRNA profiles of the same sample names based on MODMatcher (e.g., TCGA−B6−A0X7-01 shown in Fig. 4E). Among profile pairs that were not selfaligned, 5 RNAseq profiles were cross-aligned to other samples' miRNA profiles (Supplementary Table S1). The rate of alignment was low compared to alignments of other types of profile pairs. For example, >99% of the profile pairs of DNA methylation and mRNA expression profiles were aligned for the TCGA BRCA dataset.
When proMODMatcher was applied to TCGA BRCA RNAseq-miRNAseq datasets, the probabilities of similarity scores (before multiplying prior probability) for self-aligned RNAseq-miRNA profiles were much higher than for other possible pairs in general (Fig. 4F). An example of similarity scores of a self-aligned RNAseq-miRNA profile pair and other possible pairs is shown in Fig. 4G. There were multiple self-self pairs with low probabilities for self-alignment ( Fig. 4F and H), suggesting potential labeling errors in RNAseq and/or miRNA profiles. Overall, 989 of 1,041 candidate matching pairs (95.0%) ( Table 1) were selfaligned compared to 86.3% for MODMatcher. Among profiles that were not self-aligned, 1 profile pair (i.e., TCGA-BH-A0BZ-01 and TCGA-E2-A15K-01) was cross-aligned to each other (Table 1).
Comparing MODMatcher and proMODMatcher, the proMOD-Matcher identified an additional 91 self-aligned profile pairs that were missed by MODMatcher. For example, the similarity score of self-alignment for TCGA-AO-A0JF-01 was among the highest when the miRNA profile was compared to RNAseq profiles of  other samples (y-axis in Fig. 5A). However, the RNAseq profile of TCGA-AO-A0JF-01 was highly similar to multiple miRNA profiles of other samples (x-axis in Fig. 5A). As a result, the rank-based MODMatcher rejected the self-alignment, but proMODMatcher identified self-alignment for TCGA-AO-A0JF-01 with P-value of 7.3 × 10 −6 . One cross-aligned pair, RNAseq of TCGA-BH-A0BZ-01 and miRNA of TCGA-E2-A15K-01, was identified by both proMOD-Matcher and MODMatcher. The similarity score of the crossaligned pair is shown in Fig. 5B. The similarity scores of selfself alignments were low (red dots in Fig. 5B); on the other hand, the similarity score of the cross-aligned pair was significantly higher compared to other similarity scores (Fig. 5B), indicating high confidence of cross-alignment. On the other hand, the cross-aligned pairs detected only by MODMatcher showed relatively marginal similarity scores even though the similarity scores of cross-aligned pairs were the highest (Supplementary Fig. S2). Furthermore, we compared significance levels of cis-associations based on profile pairs aligned by MODMatcher and proMODMatcher. They were comparable in general, with a few highly significant cis-associations that were more significant based on proMODMatcher compared to MODMatcher (Fig. 5C).

Aligning gene expression profiles by Agilent microarray and miR-NAseq data
MODMatcher and proMODMatcher were also applied to align mRNA expression profiles based on Agilent microarray and miRNA profiles. There were 138 cis-associations identified on the    basis of Agilent microarray data and miRNAseq data. On the basis of these cis-associations, 87.1% of candidate profile pairs were identified as self-aligned by MODMatcher (Supplementary Table  S1) while 89.8% of candidate profile pairs were self-aligned by proMODMatcher (Table 1). Among profiles that were not self-aligned, 9 cross-aligned profile pairs were identified by proMODMatcher (Table 1, Sup-plementary Fig. S3B), and 8 of the 9 pairs were also detected by MODMatcher (Table 1). MODMatcher detected additional crossaligned pairs including several questionable cross-aligned pairs (i.e., TCGA−E2−A153−01 and TCGA−E9−A1NG−01, TCGA-AR-A1AL−01 and TCGA−AR−A1AN−01 in Supplementary Fig. S4). The cross-aligned pairs identified by proMODMatcher included a possible swap between TCGA-BH-A18K-01 and TCGA-BH-A18T-  -A8-A07U-01  Y  TCGA-A2-A3XY-01  Y  TCGA-BH-A0H9-01  Y  TCGA-EW-A423-01  No  TCGA-AO-A128-01  Y  TCGA-BH-A18V-06  Y  TCGA-A1-A0SD-01  No: TCGA-BH-A0EI-01  TCGA-BH-A0EI-01  Y  TCGA-BH-A18K-01  No: TCGA-BH-A18T-01  TCGA-BH-A18T-01  Y  TCGA-BH-A18T-01 No Boldface indicates cross-alignments supported by other data, and underscore indicates sample swaps. 1 The number of common samples with both Type 1 and Type 2 profiles. 2 The number of significant cis-pairs at q-value < 0.05 at final iteration and the number of cis-pairs investigated. 3 Indicates whether the RNA samples of cross-aligned pairs were self-aligned in alignment between RNA profile (Agilent array or RNAseq) and CNV profile. The aligned pairs were also shown if there was a cross-aligned sample. 4 Indicates whether the cross-aligned pairs were cross-aligned by MODMatcher.  Table 1). To determine the source of labeling errors (due to mRNA Agilent profiles or miRNA profiles) other omics profiles were compared with each other and the results were summarized into a patient-centric view (Fig. 6B). For patient/sample TCGA-BH-A18K, the RNAseq and miRNAseq profiles were self-aligned and the RNAseq and CNV profiles were self-aligned as well (Fig. 6B). Similarly, for patient/sample TCGA-BH-A18T, the RNAseq profile was self-aligned to the miRNA, CNV, and DNA methylation profiles as well as the RPPA profile (detailed below) (Fig. 6B). The cross-alignments of TCGA-BH-A18K-01 and TCGA-BH-A18T-01 mRNA Agilent profiles with their miRNA profiles (Fig. 6B) indicate that sample swapping occurred in the mRNA Agilent array profiles. After swapping the corresponding mRNA Agilent array profiles, multiple-omics profiles of TCGA-BH-A18K and TCGA-BH-A18T were aligned to each other consistently (Fig. 6C). Our previous study based on pairwise profile alignments of gene expression, DNA methylation, and CNV also identified the sample swaps in the mRNA Agilent array profiles of TCGA-BH-A18K-01 and TCGA-BH-A18T-01 [8] (Fig. 6B and C). In addition, proMODMatcher identified a crossalignment of the mRNA Agilent array profile of TCGA-A1-A0SD-01 and the miRNA profile of TCGA-BH-A0EI-01 (Table 1, Fig. 6D), consistent with potential sample swaps of mRNA Agilent array profiles of TCGA-A1-A0SD-01 and TCGA-BH-A0EI-01 when alignments of other omics profiles were included. Similarly, the cross-alignment between the Agilent array profile of TCGA-BH-A0BS-01 and the miRNA profile of TCGA-BH-A0BT-01 was likely a result of a swap between the Agilent array profiles of the 2 samples when all available omics data were added into the comparison (Fig. 6E). The proMODMatcher identified a cross-aligned pair between the mRNA Agilent array profile of TCGA-BH-A0BZ-01 and the miRNA profile of TCGA-E2-A15K-01 (See Table 1, Fig. 6F). The miRNA profile of TCGA-E2-A15K-01 was also cross-aligned to the mRNAseq profile of TCGA-BH-A0BZ-01 (Table 1, Fig. 5B). When alignments of other omics profiles were included in a patientcentric view (Fig. 6F), the result suggests that there was a labeling error of the miRNA profile of TCGA-E2-A15K-01.
These results together suggest that proMODMatcher with 138 cis-associations can accurately identify sample-labeling errors and unambiguously correct labeling errors.

Application to TCGA breast cancer dataset: mRNA and RPPA profiles
There were 424 tumor samples with both mRNA expression measured in Agilent microarray and RPPA data, and 856 tumor samples with both mRNA expression measured in RNAseq and RPPA data. A total of 145 proteins were mapped to unique mRNA transcripts, and 97 and 104 of the protein-mRNA pairs whose protein abundance was significantly correlated (q < 0.05) with the corresponding mRNA's expression level were defined as significant cis-associations based on Agilent microarray and RNAseq data, respectively ( Fig. 7A and Table 2). And 84.9% and 80.3% of candidate profile pairs were identified as self-aligned by proMODMatcher ( Table 2). Examples of similarity scores of a self-aligned RNAseq-miRNA profile pair (Fig. 7B) and a crossalignment (Fig. 7C, Supplementary Fig. S5) comparing with other possible pairs are shown. The cross-aligned pair of the mRNA Agilent microarray profile TCGA-AR-A1AV-01 and the RPPA profile of TCGA-AR-A1AW-01 data was identified (Fig. 7D), consistent with labeling errors in the mRNA Agilent array data (Fig. 7D). However, this pair was not identified by MODMatcher ( Table 2). The potential cross-alignment between the mRNA Agilent microarray profile TCGA-AR-A1AW-01 and the RPPA profile of TCGA-AR-A1AV-01 data was not identified (Fig. 7D), suggesting that proMODMatcher's sensitivity is limited when the number of cis-associations is ∼100. A large number of non-random missing data in RPPA data ( Supplementary Fig. S6) may also contribute to low sensitivity of the method.

Application to TCGA pan-cancer datasets
The proMODMatcher was also applied to pan-cancer datasets (a total of 22 different types of cancers) in TCGA to align miRNA (Table 3) and RPPA profiles (Table 4) with mRNA profiles. When aligning RNAseq and miRNAseq profiles, >95% of candidate profile pairs were identified as self-aligned for most cancer datasets (Fig. 8A). The self-alignment rates for sarcoma (SARC), lymphoid neoplasm diffuse large B-cell lymphoma (DLBC), and cervical and endocervical cancers (CESC) were 100%, suggesting high data quality for the datasets (Fig. 8A, Table 3). On the other hand, miRNA expression profiles were aligned to mRNA expression profiles (i.e., Agilent, HG-U133, or RNAseq) at a low selfalignment rate for the glioblastoma multiforme (GBM) dataset (Fig. 8A), suggesting low quality of the TCGA GBM miRNA profiles.
For alignments between mRNA and RPPA profiles, the selfalignment rates were lower than alignments between mRNA and miRNA (Fig. 8B) for most datasets due to lower numbers of cis-associations between mRNA and RPPA profiles. The selfalignment rates for DLBC (97.0%) and SARC (97.8%) were higher compared to other datasets (Fig. 8AB), again suggesting high data qualities of the datasets. This observation indicates that some datasets in TCGA showed consistently high confidence for sample quality and low data labeling errors.
Even in datasets of high quality, sample-labeling errors were detected. For example, the self-alignment rate for mRNA-miRNA profiles of the TCGA UCEC dataset was 98.1%. Four crossalignments were identified ( Table 3). Two of them were likely due to a swap of miRNA profiles of TCGA-AX-A1C4-01 and TCGA-AX-A1CI-01 after considering other types of omics data (Fig. 8C). Similarly, the self-alignment rate for mRNA-miRNA profiles of the TCGA OV dataset was 96.9%. Five cross-alignments were identified ( Table 3). Two of them were likely due to a swap of miRNA profiles of TCGA-24-2261-01 and TCGA-31-1953-01 (Fig. 8D).

Application to ICGC datasets
We applied proMODMatcher to 8 cancer datasets that were generated by institutes in the USA, Spain, UK, Germany, Australia, Canada, and France. Each dataset contains >1 types of omics data including mRNA expression profiles (i.e., RNAseq and Ar-ray), DNA methylation profiles based on Illumina HM450, miRNA expression profiles, and copy number somatic mutation profiles. The ICGC datasets used and the associated alignment results are summarized in Table 5. In some of the datasets such as PAEN-AU and PRAD-FR, all profiles were matched to other corresponding profiles of the same sample names (Table 5). On the other hand, several sample errors were identified in some datasets. For example, mapping between gene expression Array and CNV profiles in the NBL-US dataset resulted in 170 selfself aligned sample pairs, 10 non-self-self aligned samples, and 12 cross-mapped pairs of profiles (examples shown in Fig. 9A). Mapping gene expression profiles by RNAseq and Array in the CLLE-ES dataset yielded 5 non-self-self aligned samples and 2 cross-mapped pairs of samples. The 2 cross-mapped pairs of samples were likely due to a swap of either RNAseq profile or Array profile (Fig. 9B). Similarly, proMODMatcher identified 3 cross-alignments between RNAseq and DNA methylation profiles in the PRAD-CA dataset, which were also involved in crossmappings when mapping Array and DNA methylation profiles: 2 of them were likely due to a swap of DNA methylation (HM450) profiles of DO229525 and DO51109 (Fig. 9CD), and 1 of them was likely due to sample-labeling errors in DNA methylation array (HM450) (Fig. 9CD).

Discussion
We developed a sample alignment method, proMODMatcher, for detecting and correcting sample-labeling errors by aligning omics profiles. The proMODMatcher extended our previous method MODMatcher by estimating probabilities of potential matches rather than using ranks of similarity scores. Applied to simulated datasets, proMODMatcher outperformed MODMatcher when aligning the omics data profiles with a relatively small number of cis-associations. We showed that the number of candidate intrinsic cis-associations between mRNA-miRNA profiles or mRNA-RPPA profiles was low. Application of our proMODMatcher to alignment between mRNA-miRNA profile pairings and mRNA-RPPA profile pairings from 22 different cancer datasets in TCGA demonstrated that sample-labeling errors occurred even in datasets of high quality and our procedure was not only able to identify sample-labeling errors but also to unambiguously identify the source of the errors.
Integrating multi-omics data into comprehensive network models is essential to elucidate complex molecular mechanisms of cancers. After correcting sample-labeling errors, associations between different profiles were stronger. For example, mislabeled samples were outliers when comparing significant pairs between mRNA and miRNA expression levels in the TCGA BRCA dataset (Fig. 10A, red dots were mis-labeled samples). Spearman correlations between expression levels of miRNAs and their host genes were improved for most pairs of miRNA-host genes after curating sample-labeling errors (Fig. 10B).
We showed that some potential cross-aligned profile pairs in the TCGA BRCA dataset were missed by proMODMatcher. The sensitivity and accuracy of multi-omics profile-matching methods need further improvement. Integrating >2 types of profiles in probability estimation may yield more robust sensitivity and specificity when the number of cis-associations is small.

Non-self: DO46197
Similarity Score: S(i,n),DO46197 Similarity Score: S(n,i),DO46197 D. PRAD-CA Figure 9. Application to ICGC datasets (A) An example of self-self aligned, non-self-self aligned, and cross-aligned pairs of samples based on alignment between Array and CNV profiles in the NBL-US dataset. (B) An example of sample-labeling errors. In alignment of Array and DNA methylation profiles, DO7484 and DO7472 were cross-aligned to each other. The similarity scores of each cross-alignment are shown. The similarity score of the cross-aligned pair is shown in blue, and the similarity scores of self-self alignments are shown in red. Omics profiles of DO7484 and DO7472 were compared with each other and results were summarized into a patient-centric view. The red line indicates self-aligned, the and blue line indicates cross-aligned. (C) An example of possible sample swaps and sample-labeling errors. DO229525 and DO51109 were cross-aligned to each other in alignment of RNAseq and DNA methylation profiles as well as Array and DNA methylation profiles. Additionally, the RNAseq and Array profiles of DO51105 were cross-aligned to the DNA methylation profile of DO51091. (D) Other omics profiles of these pairs were compared with each other and results were summarized into a patient-centric view. After swapping the corresponding DNA methylation profiles, multiple-omics profiles of DO229525 and DO51109 were aligned to each other consistently.  2)  TCGA-CM-4744-01  Y  TCGA-AA-3558-01  TCGA-QL-A97D-01  Y  TCGA-AA-A00W-01  TCGA-A6-A567-01  Y  TCGA-AA-3693-01  TCGA-5M-AATA-01  Y  TCGA-AA-3529-

Potential implications
Our results demonstrated that sample-labeling errors were common in large multi-omics datasets. Our method has improved statistical accuracy to identify and curate these errors over the previous method and is generally applicable to other datasets. Application of our general framework for automated curation of public databases and properly merging omics data would be the fundamental basis for the development of effective integrative approaches.

A general framework of multi-omics data matching: Pairwise alignments based on cis-associations
We followed the general framework of multi-omics data matching of the previous study [8]. Two types of data (or profiles) (i.e., Type A and Type B in Fig. 1) were matched on the basis of their cis-associations. Probes in different types of data were matched by intrinsic biological relationships. For example, probes in methylation, miRNA, and CNV profiles were mapped to a close transcript based on hg19 reference genome. Samples were initially matched on the basis of annotated sample ID and potential cisassociations (Fig. 1A). The significant cis-associations from 2 different data types were identified by the Spearman correlations at Benjamini-Hochberg (BH) adjusted q-value < 0.05 (Fig. 1B). The data for each cis-association was normal rank-transformed as RT(A n,i ) and RT(B n,i ), where A n,i and B n,i represent the measurements of sample i and nth cis-related probes for Type A and B profiles, respectively (Fig. 1B). For simplicity, we omitted all normal rank transformation in the rest of the notations. The profile similarity between the 2 types of data S(A i , B j ) is defined as (Fig. 1C):

Correlation of cis-associated mRNA and miRNA before and after correction of labeling errors
To assess improvement of signals after labeling error correction, we calculated the Spearman correlation between miRNA expression and its host genes with initially matched pairs based on sample ID and with aligned sample pairs. To avoid bias due to different number of samples, we matched the number of samples of initially matched pairs to the number of aligned pairs. We randomly selected the samples with the same number of aligned pairs, and calculated the Spearman correlation. We performed random selection 100 times and calculated the mean of correlation.

Availability of supporting data and materials
Data supporting the results of this article are publicly available at the Firehose database, TCGA data portal, and ICGC data portal (see Data Description). Data further supporting this work and snapshots of our code are available in the GigaScience repository, GigaDB [21].