MEpurity: estimating tumor purity using DNA methylation data

Liu, Bowen; Yang, Xiaofei; Wang, Tingjie; Lin, Jiadong; Kang, Yongyong; Jia, Peng; Ye, Kai

doi:10.1093/bioinformatics/btz555

Abstract

Motivation

Tumor purity is a fundamental property of each cancer sample and affects downstream investigations. Current tumor purity estimation methods either require a matched normal sample or report moderately high tumor purity even on normal samples. It is therefore critical to develop a computational approach that estimates tumor purity with sufficient precision from a tumor-only sample.

Results

In this study, we developed MEpurity, a beta mixture model-based algorithm that estimates tumor purity from tumor-only Illumina Infinium 450k methylation microarray data. We applied MEpurity to both The Cancer Genome Atlas (TCGA) cancer data and cancer cell line data, demonstrating that MEpurity reports low tumor purity on normal samples and results comparable to other state-of-the-art methods on tumor samples.

Availability and implementation

MEpurity is a C++ program which is available at https://github.com/xjtu-omics/MEpurity.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Tumor purity is a critical feature for tumor data analysis. Imprecise estimation of tumor purity complicates downstream analysis and often leads to incorrect interpretation of oncogenesis. For instance, a homozygous deletion at 50% tumor purity might be mistaken for a heterozygous deletion at 100% tumor purity.
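
The deletion example above follows from simple mixing arithmetic. As a minimal sketch, assuming a diploid (two-copy) normal background and a linear mixture of tumor and normal cells, the average copy number observed in a bulk sample is the purity-weighted mean of the two. The helper `expected_copy_number` is hypothetical and is not part of MEpurity.

```python
def expected_copy_number(tumor_cn, purity, normal_cn=2.0):
    """Average copy number in a bulk sample (assumes a simple linear mixture)."""
    return purity * tumor_cn + (1.0 - purity) * normal_cn

# Homozygous deletion (0 copies in tumor cells) at 50% purity.
homozygous_half_pure = expected_copy_number(tumor_cn=0, purity=0.5)
# Heterozygous deletion (1 copy in tumor cells) at 100% purity.
heterozygous_pure = expected_copy_number(tumor_cn=1, purity=1.0)

print(homozygous_half_pure, heterozygous_pure)  # both equal 1.0
```

Both scenarios give the same bulk signal, so the two cannot be told apart without an independent purity estimate.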

Currently, methods like PurBayes (Larson and Fridley, 2013) and ABSOLUTE (Carter et al., 2012) require matched normal and tumor samples to calculate tumor purity. However, including normal tissues is inconvenient and costly in clinical practice. LUMP (Aran et al., 2015) and ESTIMATE (Yoshihara et al., 2013) examine immune and stromal cells to estimate tumor purity, but they often yield imprecise results because they ignore other cell types in tumor samples. More recently, Infiniumpurify (Zheng et al., 2017) and PAMES (Benelli et al., 2018) take tumor-only Illumina Infinium Human Methylation 450K (450k) data as input. However, they rely on a set of tumor samples and report moderately high tumor purity on normal samples (Supplementary Fig. S1), which limits their clinical application.

For clinical tumor purity estimation, an algorithm is needed that accurately estimates tumor purity from each single tumor sample. Here, we propose MEpurity, a beta mixture model (BMM)-based algorithm (Ma and Leijon, 2011), to estimate tumor purity using 450k data from a single tumor sample. It has been shown that alteration of DNA methylation during tumorigenesis reflects the clonal architecture of tumors (Brocks et al., 2014). Based on this, and by analogy to the hierarchical accumulation of somatic SNVs during tumorigenesis (Ding et al., 2012), we hypothesize that cells in the tumor founding clone acquire methylation changes relative to normal cells, and that additional methylation alterations emerge when each subclone diverges from its parent clone (Supplementary Fig. S2). The methylation changes acquired in the founding clone therefore indicate the tumor purity.

2 Materials and methods

2.1 Methods

In MEpurity, we first use a set of normal samples, which are independent of the tumor samples, to select stable CpG sites. Then, we detect differentially methylated CpG sites (DMCs) for each tumor sample and calculate the alpha value (defined in Section 2.1.3) of each DMC. We use a BMM to cluster these alpha values, and the cluster with the largest mean alpha value represents the founding clone. The tumor purity is estimated as the mean alpha value of the founding clone cluster. The workflow of MEpurity is depicted in Figure 1A.

Fig. 1. MEpurity workflow and performance comparison. (A) The workflow of MEpurity; (B) Comparisons of MEpurity with ABSOLUTE, Infiniumpurify, PAMES and LUMP on tumor samples; (C) Comparisons of MEpurity with ABSOLUTE, Infiniumpurify, PAMES and LUMP on cancer cell line samples; (D) Comparison of tumor purity estimation by different methods on normal samples; *** denotes P-value < 0.0001.

2.1.1 Selection of the most stable CpG sites

DNA methylation heterogeneity is present in normal samples because they contain different cell types (Houseman et al., 2016). To reduce this heterogeneity, we select the CpG sites with the most stable methylation level across normal samples. For 450k data, the methylation status of each CpG site follows a beta distribution, in which the beta value represents the fraction of methylated alleles. We calculate the mean $μ_{i}$ and standard deviation $σ_{i}$ of the beta values at each CpG site i in a pool of normal samples and select the top n CpG sites with the smallest $σ_{i}$ as the most stable CpG sites (n is a user-defined parameter with a default value of 70000).
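
The selection step can be sketched as follows. This is an illustration under stated assumptions, not the MEpurity C++ implementation: the function name `select_stable_cpgs`, the dictionary layout and the toy values are hypothetical, and the sample standard deviation is used because the text does not say which estimator MEpurity applies.

```python
import statistics

def select_stable_cpgs(normal_betas, n=70000):
    """Return {cpg_id: (mean, sd)} for the n CpG sites with the smallest sd.

    normal_betas: dict mapping a CpG id to its beta values across normal samples.
    """
    stats = {
        cpg: (statistics.mean(values), statistics.stdev(values))
        for cpg, values in normal_betas.items()
        if len(values) >= 2  # a standard deviation needs at least two samples
    }
    # Rank sites by standard deviation, smallest (most stable) first.
    ranked = sorted(stats, key=lambda cpg: stats[cpg][1])
    return {cpg: stats[cpg] for cpg in ranked[:n]}

# Toy input: three CpG sites measured in four normal samples (made-up values).
toy_normals = {
    "cg_a": [0.10, 0.11, 0.09, 0.10],
    "cg_b": [0.20, 0.60, 0.40, 0.80],
    "cg_c": [0.90, 0.88, 0.92, 0.90],
}
stable = select_stable_cpgs(toy_normals, n=2)
print(sorted(stable))  # the two least variable sites: cg_a and cg_c
```

The returned means and standard deviations are kept because the later z-score step needs them for every retained site.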

2.1.2 Tumor sample specific DMC detection

We define sample-specific DMCs as those of the selected stable CpG sites that show significant DNA methylation changes in a given tumor sample. We assume that, for the most stable CpG sites, the beta values approximately follow a normal distribution across normal samples. Although the beta value distribution at certain sites may deviate from the normal distribution, the assumption holds closely enough in practice to be used heuristically. We compare the beta value $β_{i}$ of the ith stable CpG site in the tumor sample with its beta value distribution in the pool of normal samples by calculating a z-score ($z = | β_{i} - μ_{i} | / σ_{i}$). We detect DMCs as stable CpG sites with z > k, where k is a user-defined parameter (default is 20).
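
A minimal sketch of this thresholding step is given here, assuming the per-site mean and standard deviation from the normal pool are already available. The function name `detect_dmcs`, the dictionary layout and the numbers are hypothetical; only the z-score formula and the default k = 20 come from the text.

```python
def detect_dmcs(tumor_betas, stable_stats, k=20.0):
    """Return {cpg_id: z} for stable sites whose z-score exceeds k.

    tumor_betas: dict cpg_id -> beta value in one tumor sample.
    stable_stats: dict cpg_id -> (mean, sd) from the pool of normal samples.
    """
    dmcs = {}
    for cpg, (mu, sigma) in stable_stats.items():
        if cpg not in tumor_betas or sigma <= 0:
            continue  # skip missing probes and zero-variance sites
        z = abs(tumor_betas[cpg] - mu) / sigma
        if z > k:
            dmcs[cpg] = z
    return dmcs

stable_stats = {"cg_a": (0.10, 0.01), "cg_c": (0.90, 0.02)}  # made-up values
tumor = {"cg_a": 0.45, "cg_c": 0.85}
dmcs = detect_dmcs(tumor, stable_stats)
print(dmcs)  # only cg_a passes: |0.45 - 0.10| / 0.01 is about 35, cg_c is about 2.5
```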

2.1.3 Calculation of alpha value

Let $β_{i 0}$ represent the beta value of cells with the same methylation level as normal cells at the ith DMC, $β_{i 1}$ represent the beta value of tumor cells with an altered methylation level at the ith DMC, and $α_{i}$ represent the fraction of cells in the sample that are tumor cells with altered methylation status at this site. Thus the beta value in the mixed (tumor) sample at the ith DMC is $β_{i}^{'} = β_{i 0} (1 - α_{i}) + β_{i 1} α_{i}$. For the ith DMC, we calculate its alpha value $α_{i}$ based on the estimated $β_{i 0}$ and $β_{i 1}$ and the observed $β_{i}^{'}$ (Supplementary Materials). Here we emphasize that $α_{i}$ measures the DNA methylation change at the ith DMC and is bounded between 0 and 1. We demonstrate that the alpha value is a reliable indicator of tumor purity (Supplementary Fig. S3).
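
Rearranging the mixture equation gives $α_{i} = (β_{i}^{'} - β_{i 0}) / (β_{i 1} - β_{i 0})$. The sketch below solves it numerically. How MEpurity estimates $β_{i 0}$ and $β_{i 1}$ is described only in the Supplementary Materials, so the second helper rests on an assumption made purely for illustration: altered cells are taken to be fully methylated (1) or fully unmethylated (0). Both function names are hypothetical.

```python
def alpha_value(beta_mixed, beta_normal, beta_altered):
    """Solve beta' = beta0 * (1 - alpha) + beta1 * alpha for alpha, clipped to [0, 1]."""
    denom = beta_altered - beta_normal
    if denom == 0:
        raise ValueError("beta_altered must differ from beta_normal")
    alpha = (beta_mixed - beta_normal) / denom
    return min(1.0, max(0.0, alpha))

def alpha_with_extreme_endpoint(beta_mixed, beta_normal):
    """ASSUMPTION (not from the paper): altered cells sit at beta = 1 or beta = 0."""
    beta_altered = 1.0 if beta_mixed > beta_normal else 0.0
    return alpha_value(beta_mixed, beta_normal, beta_altered)

# Hypermethylated site: normal level 0.10, observed 0.55 -> 0.45 / 0.90
print(alpha_with_extreme_endpoint(0.55, 0.10))
# Hypomethylated site: normal level 0.90, observed 0.30 -> 0.60 / 0.90
print(alpha_with_extreme_endpoint(0.30, 0.90))
```

Clipping keeps the result inside the bounds stated in the text even when measurement noise pushes the observed beta value slightly outside the range spanned by the two endpoints.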

2.1.4 Clustering of alpha values and calculating tumor purity

We next cluster the alpha values to find the cluster with the largest mean alpha value, which is used for tumor purity estimation. First, we adopt a multivariate beta distribution to fit the alpha values and then apply a BMM (Ma and Leijon, 2011) to detect clusters. We calculate the mean alpha value for each cluster and use the largest one to represent the tumor purity (details are described in the Supplementary Materials).
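
To make the clustering step concrete, the sketch below fits a one-dimensional beta mixture and reports the largest component mean. It is a simplified stand-in, not the method used by MEpurity: it swaps the variational Bayesian inference of Ma and Leijon (2011) for an EM-style loop with a moment-matching update, and it fixes the number of clusters in advance. All function names, the cluster count and the simulated data are assumptions for illustration.

```python
import math
import random

EPS = 1e-6

def beta_logpdf(x, a, b):
    """Log density of Beta(a, b) at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x)

def moment_match(values, weights):
    """Beta(a, b) whose mean and variance match the weighted sample."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    mean = min(1.0 - EPS, max(EPS, mean))
    var = max(var, 1e-8)
    common = max(mean * (1.0 - mean) / var - 1.0, 1e-3)
    return mean * common, (1.0 - mean) * common

def fit_beta_mixture(alphas, n_clusters=2, n_iter=100):
    """Approximate EM fit; returns a list of (weight, a, b) per component."""
    if len(alphas) < 2 * n_clusters:
        raise ValueError("too few alpha values for the requested clusters")
    xs = sorted(min(1.0 - EPS, max(EPS, x)) for x in alphas)
    size = len(xs) // n_clusters
    comps = []
    for k in range(n_clusters):
        # Initialise each component from one contiguous block of sorted values.
        end = len(xs) if k == n_clusters - 1 else (k + 1) * size
        block = xs[k * size:end]
        a, b = moment_match(block, [1.0] * len(block))
        comps.append((len(block) / len(xs), a, b))
    for _ in range(n_iter):
        # E-step: responsibilities, normalised in log space for stability.
        resp = []
        for x in xs:
            logs = [math.log(w) + beta_logpdf(x, a, b) for w, a, b in comps]
            top = max(logs)
            probs = [math.exp(value - top) for value in logs]
            norm = sum(probs)
            resp.append([p / norm for p in probs])
        # M-step: update weights and moment-matched shape parameters.
        updated = []
        for k, old in enumerate(comps):
            column = [r[k] for r in resp]
            mass = sum(column)
            if mass < 1e-9:
                updated.append(old)  # keep an emptied component unchanged
                continue
            a, b = moment_match(xs, column)
            updated.append((mass / len(xs), a, b))
        comps = updated
    return comps

def estimate_purity(alphas, n_clusters=2):
    """Purity estimate = largest component mean a / (a + b)."""
    return max(a / (a + b) for _, a, b in fit_beta_mixture(alphas, n_clusters))

random.seed(0)
toy = [random.betavariate(30, 70) for _ in range(200)]   # subclone-like, mean 0.3
toy += [random.betavariate(70, 30) for _ in range(200)]  # founding-clone-like, mean 0.7
purity = estimate_purity(toy)
print(round(purity, 2))
```

On this simulated input the higher-mean component plays the role of the founding clone, so the estimate should land close to 0.7. A production version would also need a rule for choosing the number of components.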

2.2 Datasets

We downloaded the 450k data of 722 normal samples and 3693 tumor samples from TCGA (Supplementary Tables S1 and S2), together with tumor purity estimated by ABSOLUTE (Aran et al., 2015). In addition, we downloaded 450k data of 374 human cancer cell lines (Iorio et al., 2016), used in the study by Benelli et al. (2018), from the Gene Expression Omnibus (GEO) portal (GSE68379) to validate MEpurity and to compare it with PAMES (Benelli et al., 2018), Infiniumpurify (Zheng et al., 2017) and LUMP (Aran et al., 2015).

3 Results

3.1 Runtime and memory

MEpurity is implemented in C++. Processing one sample takes about 9 s and 150 MB of memory on a single core.

3.2 Comparison with other tools

3.2.1 Tumor purity estimation on tumor samples and cancer cell lines

We applied MEpurity to 3693 TCGA tumor samples and compared the results with Infiniumpurify, ABSOLUTE, PAMES and LUMP (Fig. 1B). The tumor purity values of ABSOLUTE and LUMP were obtained from a previous study (Aran et al., 2015). A correlation analysis indicates high consistency of tumor purity estimation between MEpurity and the state-of-the-art methods. In addition, we benchmarked the above tools on data from 374 human cancer cell lines, whose tumor purity is known to equal 1 (Fig. 1C). We found that MEpurity (0.953 ± 0.035) reports the highest mean and the smallest standard deviation among the compared methods (Infiniumpurify: 0.902 ± 0.150; LUMP: 0.931 ± 0.084; PAMES: 0.921 ± 0.052), indicating that MEpurity performs well on high-purity samples. More detailed comparison results are in Supplementary Tables S1 and S2.

3.2.2 Tumor purity estimation on normal samples

We compared the results of MEpurity, Infiniumpurify, PAMES and LUMP on TCGA normal samples (Fig. 1D). We found that the purity estimated by Infiniumpurify, PAMES and LUMP in normal samples is significantly higher than that estimated by MEpurity (P-value < 0.0001), indicating that MEpurity performs well on low-purity samples. More detailed comparison results are in Supplementary Table S3.

Funding

K.Y. and X.Y. are supported by the National Science Foundation of China (31671372 and 61702406), the National Key R&D Program of China (2018YFC0910400 and 2017YFC0907500) and the National Science and Technology Major Project of China (grant no. 2018ZX10302205). X.Y. is supported by the General Financial Grant from the China Postdoctoral Science Foundation (2017M623178).

Conflict of Interest: none declared.

References

Aran D. et al. (2015) Systematic pan-cancer analysis of tumour purity. Nat. Commun., 6, 8971.

Benelli M. et al. (2018) Tumor purity quantification by clonal DNA methylation signatures. Bioinformatics, 34, 1642–1649.

Brocks D. et al. (2014) Intratumor DNA methylation heterogeneity reflects clonal evolution in aggressive prostate cancer. Cell Rep., 8, 798–806.

Carter S.L. et al. (2012) Absolute quantification of somatic DNA alterations in human cancer. Nat. Biotechnol., 30, 413–421.

Ding L. et al. (2012) Clonal evolution in relapsed acute myeloid leukaemia revealed by whole-genome sequencing. Nature, 481, 506–510.

Houseman E.A. et al. (2016) Reference-free deconvolution of DNA methylation data and mediation by cell composition effects. BMC Bioinformatics, 17, 1.

Iorio F. et al. (2016) A landscape of pharmacogenomics interactions in cancer. Cell, 166, 740–754.

Larson N.B., Fridley B.L. (2013) PurBayes: estimating tumor cellularity and subclonality in next-generation sequencing data. Bioinformatics, 29, 1888–1889.

Ma Z., Leijon A. (2011) Bayesian estimation of beta mixture models with variational inference. IEEE Trans. Pattern Anal. Mach. Intell., 33, 2160.

Yoshihara K. et al. (2013) Inferring tumour purity and stromal and immune cell admixture from expression data. Nat. Commun., 4, 2612.

Zheng X. et al. (2017) Estimating and accounting for tumor purity in the analysis of DNA methylation data from cancer studies. Genome Biol., 18, 17.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
