Abstract

Summary

Methods for quantifying the imbalance in CpG methylation between alleles genome-wide have been described, but their algorithmic time complexity is quadratic and their practical use requires painstaking attention to infrastructure choice, implementation and execution. To solve this problem, we developed CloudASM, a scalable, ultra-efficient, turn-key, portable pipeline on Google Cloud Platform (GCP) that uses a novel pipeline manager and GCP's serverless enterprise data warehouse.

Availability and implementation

CloudASM is freely available in the GitHub repository https://github.com/TyckoLab/CloudASM and a sample dataset and its results are also freely available at https://console.cloud.google.com/storage/browser/cloudasm.

1 Introduction

The importance of allele-specific methylation (ASM) at CpG sites was recently illustrated by linking haplotype-dependent ASM to genetic variants (single-nucleotide polymorphisms, SNPs) that disrupt transcription factor binding sites (Do et al., 2016, 2017, 2019; Onuchic et al., 2018). Correctly mapping ASM in whole methylomes (e.g. data from whole-genome bisulfite sequencing) requires the joint analysis of three gigabyte-scale datasets: the aligned reads, the variant calls on these reads and the CpG methylation calls on these reads. Implementing such a pipeline at scale is therefore a demanding bioinformatics endeavor involving hard-to-reverse decisions on infrastructure, implementation and execution. Current approaches to call ASM (Do et al., 2016, 2019; Onuchic et al., 2018; Orjuela et al., 2019) are similar in that they run intertwined loops over the three datasets of aligned reads, variant calls and CpG methylation calls, which leads to several operations with quadratic algorithmic time complexity. For instance, finding the genotype of each variant in each read has a quadratic time complexity of O(N×M), where N is the number of reads and M the number of variants. Such operations therefore require access to a large number of CPUs when these loops are parallelized over smaller datasets, preventing researchers from systematically mapping ASM in bisulfite-converted genomes.
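The quadratic step can be made concrete with a minimal sketch (illustrative only, not CloudASM code): genotyping every variant in every read by brute force pairs the two datasets exhaustively.

```python
# Illustrative sketch of the quadratic genotyping step: every aligned read
# is compared against every variant, giving O(N*M) work for N reads and
# M variants.

def genotype_naive(reads, variants):
    """reads: list of (start, end, read_name); variants: list of positions.
    Returns {variant_pos: [names of reads overlapping it]} by brute force."""
    hits = {v: [] for v in variants}
    for start, end, name in reads:      # N iterations
        for v in variants:              # times M iterations each
            if start <= v < end:
                hits[v].append(name)
    return hits
```

At the scale reported in Section 3.2 (1.5 billion reads, 1.48 million variants), this exhaustive pairing is intractable without massive parallelism.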

2 Presentation of CloudASM

2.1 Overview

To address the issues described in Section 1, we designed CloudASM, a cloud-based, scalable, fully integrated and highly efficient pipeline to call ASM from fastq files resulting from bisulfite sequencing. CloudASM enables anyone to map ASM on any number of samples in the same amount of time and from any machine, provided they have a Google Cloud Platform account with the appropriate authorizations and quotas (see our GitHub repository for more details). CloudASM uses Verily's pipeline manager 'dsub', which handles workloads and job submission similarly to Slurm or Grid Engine (Kohlhoff, 2019). Also, like recent pipeline managers (Lee et al., 2019), dsub accepts reproducible and portable pipeline standards (e.g. Docker) and cloud configuration is automatically handled with a serverless scheduling approach.

2.2 Optimization of ASM-specific steps

As described in Section 1, the steps specific to mapping ASM exhibit quadratic complexity, an expensive endeavor in terms of CPU hours. To break with this approach, we used GCP's 'BigQuery' module for the ASM-specific steps. Created internally by Google to manipulate its own large datasets through a serverless and scalable data warehouse, and made commercially available in 2010, BigQuery is ideally suited to elastically manipulate and analyze large genomic datasets. Specifically, we designed the ASM-specific steps around BigQuery's join function. Although BigQuery's join algorithm is not publicly available, peer-reviewed join algorithms have been shown to have linear or quasilinear time complexity (Chen et al., 2014), a significant improvement over the quadratic time complexity found in current ASM scripts. We further minimized computation time by selectively using 'Broadcast' and 'Hash' join queries (Lakshmanan and Tigani, 2019). After implementing the ASM-specific steps on BigQuery, we measured a 25× improvement in CPU hours compared to the internal pipeline that we used for previous publications (Do et al., 2016, 2017, 2019), which uses the algorithmic approach of intertwined loops over the aligned reads, variant calls and CpG methylation calls. See Section 3.2 for details regarding the comparison.
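Since BigQuery's internals are not public, the following is only a generic sketch of the hash-join idea that delivers quasilinear behavior: build a hash table over the smaller dataset once, then stream the larger one through it. The bucket width `WINDOW` is an illustrative choice, not a CloudASM parameter.

```python
from collections import defaultdict

WINDOW = 1000  # bucket width in bp; an illustrative choice

def genotype_hash_join(reads, variants):
    """Same result as the naive O(N*M) nested loop, in roughly O(N + M):
    an O(M) build phase over variants, then an O(N) probe phase over reads."""
    buckets = defaultdict(list)
    for v in variants:                              # build phase
        buckets[v // WINDOW].append(v)
    hits = {v: [] for v in variants}
    for start, end, name in reads:                  # probe phase
        for b in range(start // WINDOW, end // WINDOW + 1):
            for v in buckets.get(b, ()):
                if start <= v < end:
                    hits[v].append(name)
    return hits
```

Because read lengths are bounded (a few hundred bp), each read probes a constant number of buckets, so total work grows linearly with N + M rather than with their product.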

2.3 Adjustable parameters

CloudASM can be customized using 10 adjustable parameters covering alignment, variant calling and the stringency of ASM regions. The user can choose which reference genome is used for aligning the fastq files (hg19 or GRCh38); the script automatically downloads the selected reference genome and its most recent associated variant database, and converts the reference genome into a bisulfite genome. The user can also choose which database of variants is used to remove CpG sites that may overlap with variants: the unfiltered or filtered variants derived from the variant call on the aligned reads, or the database of common variants derived from the most recent variant database associated with the selected reference genome (with the option of selecting the frequency threshold for common variants, 0.05 by default). When identifying the reads that include a biallelic variant, the user can select the minimum recalibrated quality score required for the variant call (zero by default). To map ASM regions, CloudASM computes ASM both at the CpG level and across the reads overlapping the variant, ensuring that the ASM regions it identifies are not artifacts of misalignment or of methylation/variant call errors.

First, the user can define the minimum effect size for the difference in fractional methylation levels between the two alleles across reads (20% by default), the minimum coverage per CpG per allele across reads (5× by default), the minimum number of CpG sites 'near' a variant required to test for ASM (3 by default) and the minimum number of CpG sites with a significant methylation difference in the same direction between the two alleles (3 by default). The user can also select the minimum number of consecutive CpG sites with a significant methylation difference (2 by default), which we found to be the most stringent parameter when calling ASM. When calculating the effect size of an ASM region and its P-value with a Wilcoxon test, the user can select the P-value threshold for significance (0.05 by default) and the Benjamini–Hochberg threshold (0.05 by default) used when correcting the P-value for multiple testing. Finally, when calculating ASM at the CpG level and its significance with a Fisher test, the user can select the P-value threshold for significance (0.05 by default) (Fig. 1).
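The Benjamini–Hochberg correction mentioned above can be sketched in a few lines of pure Python (a minimal version of the standard procedure, not the CloudASM implementation itself):

```python
# Minimal sketch of the Benjamini-Hochberg step-up procedure used to
# correct a list of P-values for multiple testing (alpha = 0.05 by default).

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a boolean list: True where the null hypothesis is rejected."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k with p_(k) <= k * alpha / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            k_max = rank
    # Reject all hypotheses with rank <= k_max.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject
```

For example, with P-values [0.001, 0.04, 0.2] only the first survives correction, since 0.04 exceeds its rank threshold of 2 × 0.05/3 ≈ 0.033.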

Fig. 1. Definition of ASM in CloudASM. All parameters are customizable by the user. CloudASM uses a two-dimensional (by CpG and across CpGs) and a two-step statistical approach to define ASM.

3 Results

3.1 How to use CloudASM

Our pipeline can be run from any machine that is equipped with a command line terminal and Python (nearly all PC, Mac and Linux machines). To employ our pipeline, a user must install the pipeline manager 'dsub' and have access to BigQuery, Compute Engine and Cloud Storage in a GCP project. The user does not need to install any bioinformatics package because CloudASM uses our publicly available Docker images, which come with all the packages required by CloudASM. If the user is new to GCP, we provide additional information on our GitHub repository.

3.2 Execution example

We prepared a small dataset (∼24 MB of zipped fastq files) for anyone to rapidly test CloudASM without incurring large costs, prior to applying it to whole methylomes. The dataset was prepared by recombining bisulfite-converted reads overlapping the positions from 9 000 000 to 10 000 000 on chromosome 1, using the module bamtofastq within the package BEDTools (Quinlan et al., 2010) on the lymphoblastoid cell line GM12878 (identifier: ENCSR890UQO) made available by the ENCODE consortium (Davis et al., 2018). The zipped fastq files are freely accessible on our GCP’s storage space at https://console.cloud.google.com/storage/browser/cloudasm. Using this dataset, CloudASM assessed 456 SNPs and found 13 ASM regions. All the data generated by CloudASM is stored in the same location in our GCP storage space.

Finally, we ran both CloudASM and a traditional cloud-powered cluster using a cluster manager (Elasticluster) and a job scheduler (Sun Grid Engine, SGE) on all fastq files for GM12878 and measured the number of CPU hours in both cases. Both pipelines evaluated 1.5 billion reads, 1.1 billion CpG methylation calls and 1.48 million variants, and found 8045 ASM regions. We measured a 25× improvement in the number of CPU hours for CloudASM over the cloud-powered cluster we used in Do et al. (2019) (Table 1).

Table 1. Comparison between CloudASM and the pipeline used in Do et al. (2019) to call ASM on ENCODE's lymphoblastoid cell line GM12878 (identifier: ENCSR890UQO on the ENCODE website)

ASM pipeline            CloudASM    SGE-powered pipeline
Number of CPU hours     77          1944

3.3 Scalability of CloudASM

Because CloudASM uses the pipeline manager 'dsub', it does not rely on a cluster (i.e. a manager node and a set of compute nodes communicating through an NFS server) and its scalability is therefore not limited by common cluster issues such as NFS I/O bottlenecks and cluster management. In clusters (on-site or cloud-powered), compute nodes share a common drive where they perform operations and eventually, collectively, reach the disk's maximum writing speed. We observed this issue both with Elasticluster/SGE and with HTCondor on GCP. Finally, clusters need to be created and maintained, and their nodes need to be disposed of when they are no longer used; this servicing is both time and resource consuming. CloudASM, through dsub, automatically creates, maintains and deletes virtual machines with their own disks, eliminating the scalability issues encountered with cluster management. To scale over any number of samples, the user must make sure that their project has been granted the GCP quotas required by CloudASM (see our GitHub repository for more information). We have used CloudASM to process 10 whole methylomes in parallel, and the duration of the analysis was not noticeably different from the time required to analyze a single whole methylome. In contrast, the duration to process 10 whole methylomes on a cloud-powered SGE cluster increased by approximately 50% because of the issues described above, increasing the number of CPU hours by 50% as well.

3.4 The use of pre-emptible machines

GCP offers pre-emptible virtual instances, which are about 80% cheaper than regular virtual instances. However, pre-emptible instances live for at most 24 h and are subject to random interruptions. On average, we have observed a 15% failure rate in a given batch of jobs relying on these instances. CloudASM lets the user choose between regular and pre-emptible virtual instances, and we also provide the code to re-run the failed jobs in our GitHub repository.
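The resubmission logic can be sketched as a simple retry loop (the `run_job` callable and job names below are hypothetical stand-ins; the actual re-run code is in our GitHub repository):

```python
# Hedged sketch of re-running pre-empted jobs: submit every job, then on
# each round resubmit only the ones that failed, up to a round limit.

def rerun_failed(job_names, run_job, max_rounds=5):
    """run_job(name) should return True on success, False on pre-emption.
    Returns the list of jobs still failing after max_rounds."""
    pending = list(job_names)
    for _ in range(max_rounds):
        if not pending:
            break
        pending = [name for name in pending if not run_job(name)]
    return pending
```

With an independent ~15% failure rate per round, two rounds already bring the expected residual failure rate near 2%, so a small round limit suffices in practice.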

4 Conclusion

CloudASM is a cloud-based, highly efficient pipeline to call ASM. We believe that the unprecedented efficiency of our pipeline will give researchers easy access to a useful regulatory epigenome mark and that its flexibility will be useful in exploring other allele-specific epigenomes or transcriptomes.

Acknowledgements

We are grateful to Peter Kaplan for helpful discussions. We also thank Adil Alaoui and Subha Madhavan at Georgetown University for setting up a Google Cloud Platform account on our behalf. Finally, we thank Jesus Gomez and Keith Binder at Google for their advice.

Funding

This work was supported by the National Institutes of Health [R21 AI133140].

Conflict of Interest: none declared.

References

Chen, M. et al. (2014) Block nested join and sort merge join algorithms: an empirical evaluation. In: Luo, X. et al. (eds.) Advanced Data Mining and Applications, vol. 8933. Springer International Publishing, New York, NY, pp. 705–715.

Davis, C.A. et al. (2018) The Encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Res., 46, D794–D801.

Do, C. et al. (2016) Mechanisms and disease associations of haplotype-dependent allele-specific DNA methylation. Am. J. Hum. Genet., 98, 934–955.

Do, C. et al. (2017) Genetic–epigenetic interactions in cis: a major focus in the post-GWAS era. Genome Biol., 18, 120.

Do, C. et al. (2019) Allele-specific DNA methylation is increased in cancers and its dense mapping in normal plus neoplastic cells increases the yield of disease-associated regulatory SNPs. bioRxiv.

Kohlhoff, K.J. (2019) Google-accelerated biomolecular simulations. Methods Mol. Biol., 2022, 291–309.

Lakshmanan, V., Tigani, J. (2019) Google BigQuery: The Definitive Guide: Data Warehousing, Analytics, and Machine Learning at Scale. O'Reilly Media, Inc., Sebastopol, CA.

Lee, S. et al. (2019) Tibanna: software for scalable execution of portable pipelines on the cloud. Bioinformatics, 35, 4424–4426.

Onuchic, V. et al. (2018) Allele-specific epigenome maps reveal sequence-dependent stochastic switching at regulatory loci. Science, 361, eaar3146.

Orjuela, S. et al. (2019) DAMEfinder: a method to detect differential allele-specific methylation. bioRxiv.

Quinlan, A.R. et al. (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26, 841–842.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: Peter Robinson