esATAC: an easy-to-use systematic pipeline for ATAC-seq data analysis

Wei, Zheng; Zhang, Wei; Fang, Huan; Li, Yanda; Wang, Xiaowo

doi:10.1093/bioinformatics/bty141

Abstract

Summary

ATAC-seq is rapidly emerging as one of the major experimental approaches to probe chromatin accessibility genome-wide. Here, we present ‘esATAC’, a highly integrated, easy-to-use R/Bioconductor package for systematic ATAC-seq data analysis. It covers the essential steps of a full analysis procedure, including raw data processing, quality control and downstream statistical analysis such as peak calling, enrichment analysis and transcription factor footprinting. esATAC supports one-command execution of preset pipelines and provides flexible interfaces for building customized pipelines.

Availability and implementation

The esATAC package is open source under the GPL-3.0 license. It is implemented in R and C++. Source code and binaries for Linux, Mac OS X and Windows are available through Bioconductor (https://www.bioconductor.org/packages/release/bioc/html/esATAC.html).

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Assay for transposase accessible chromatin with high-throughput sequencing (ATAC-seq) is a sensitive method to probe chromatin accessibility genome-wide (Buenrostro et al., 2013). The library preparation is fast, easy to perform and requires only a small amount of biological sample. These advantages have made ATAC-seq a popular way to study open chromatin, nucleosome positioning and transcription factor (TF) footprinting in cell lines or primary tissues, and a rapidly growing number of laboratories use it.

In contrast to the easy-to-perform experiment, ATAC-seq data analysis may take much more time and effort. Highly integrated cross-platform software for processing ATAC-seq data is still lacking. Researchers need to set up their own local pipelines and use multiple tools, each of which provides only part of the data analysis workflow. Installing these tools from diverse sources, learning their manuals, testing their functions and integrating them is tedious and time-consuming.

To fill this gap, we developed an easy-to-use R/Bioconductor package named ‘esATAC’. esATAC systematically integrates state-of-the-art software for the full procedure of ATAC-seq data analysis, covering raw data processing, downstream statistical analysis and multiple quality control (QC) functions. For ease of use, esATAC provides preset pipelines that can be executed with one command in the R/Bioconductor environment on different platforms. Advanced users can easily create customized pipelines through the flexible interfaces in esATAC. Multi-core and memory control mechanisms have been implemented to optimize hardware utilization.

2 Design and implementation

The flowchart of esATAC is shown in Figure 1a.

Fig. 1.

(a) esATAC workflow. The esATAC pipeline is divided into two main parts: raw data processing and statistical analysis. QC functions at multiple levels are provided, including sequencing QC, library QC and functional annotation QC. (b) and (c) Examples of analyzing ATAC-seq data (GEO accession number GSE47753, see Supplementary Material). (b) CTCF footprinting. (c) Fragment length distribution. The periodicity of approximately 200 base pairs (bp) for nucleosome protection and 10.4 bp for the pitch of the DNA helix is shown by fast Fourier transform in the upper right corner.

2.1 Data analysis workflow

The workflow is divided into two main parts: raw data processing and statistical analysis.

In the raw data processing part, esATAC directly handles ATAC-seq raw data in FASTQ format. It wraps AdapterRemoval (Schubert et al., 2016) for adapter trimming and Bowtie2 (Langmead et al., 2012) for read alignment. esATAC then sorts the mapped reads, removes duplicates, shifts reads to account for Tn5 insertion (Buenrostro et al., 2013) and generates an intensity profile in BigWig format for genome browser visualization.
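The read-shifting step can be illustrated with a minimal Python sketch. It is not esATAC's implementation. It assumes the offsets commonly attributed to Buenrostro et al. (2013), +4 bp for plus-strand reads and -5 bp for minus-strand reads, and it assumes that the whole read interval is moved. The `Read` type and the `shift_for_tn5` function are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class Read(NamedTuple):
    """A mapped read on a 0-based, half-open interval (hypothetical type)."""
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"


# Assumed offsets, following the convention described by Buenrostro et al. (2013).
PLUS_OFFSET = 4
MINUS_OFFSET = -5


def shift_for_tn5(read: Read) -> Read:
    """Return a copy of the read moved by the strand-specific offset."""
    if read.strand == "+":
        offset = PLUS_OFFSET
    elif read.strand == "-":
        offset = MINUS_OFFSET
    else:
        raise ValueError(f"unknown strand: {read.strand!r}")
    # Both ends move together, so the read length is unchanged.
    return read._replace(start=read.start + offset, end=read.end + offset)


reads = [Read("chr1", 100, 150, "+"), Read("chr1", 200, 250, "-")]
shifted = [shift_for_tn5(r) for r in reads]
for original, moved in zip(reads, shifted):
    print(original, "->", moved)
```

A production implementation would also need to handle reads near chromosome ends, where a negative offset could push a coordinate below zero.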

In the statistical analysis part, esATAC provides a comprehensive analysis procedure for mapped ATAC-seq reads. It identifies open chromatin peak regions using F-seq (Boyle et al., 2008), a peak caller reported to detect open chromatin regions genome-wide with high sensitivity (Koohy et al., 2014). The peaks are annotated, and related gene ontology terms are reported (see Supplementary Material). esATAC integrates known TF motifs from the JASPAR database (Mathelier et al., 2016) to find potential TF binding sites in the peak regions and to generate TF footprinting plots (Fig. 1b).
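As a rough illustration of how a motif matrix can be used to locate candidate binding sites in a peak sequence, the following Python sketch scores every window of a sequence against a log-odds position weight matrix. It is a generic technique and not necessarily the scoring scheme used by esATAC. The count matrix is a toy example, not a real JASPAR profile, and the pseudocount, uniform background, threshold and forward-strand-only scan are all simplifying assumptions.

```python
import math

# Toy count matrix for a 4-bp motif (hypothetical, NOT taken from JASPAR).
TOY_COUNTS = {
    "A": [8, 0, 0, 1],
    "C": [0, 9, 0, 1],
    "G": [1, 0, 9, 0],
    "T": [0, 0, 0, 7],
}


def log_odds_matrix(counts, pseudocount=0.5, background=0.25):
    """Convert per-position base counts into log2 odds against a uniform background."""
    width = len(counts["A"])
    matrix = []
    for pos in range(width):
        total = sum(counts[base][pos] for base in "ACGT") + 4 * pseudocount
        matrix.append({
            base: math.log2(((counts[base][pos] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return matrix


def scan(sequence, matrix, threshold):
    """Yield (offset, score) for forward-strand windows scoring at least `threshold`."""
    width = len(matrix)
    sequence = sequence.upper()
    for offset in range(len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            yield offset, score


pwm = log_odds_matrix(TOY_COUNTS)
hits = list(scan("TTACGTTT", pwm, threshold=4.0))
print(hits)
```

A real scan would also score the reverse complement and would choose the threshold per motif, for example from an empirical score distribution.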

2.2 Quality control

esATAC provides QC functions at multiple levels. A quality report for the raw sequencing reads is generated (Gaidatzis et al., 2015). esATAC also performs fragment length QC analysis, given that a typical ATAC-seq fragment length distribution has a clear periodicity caused by nucleosome protection and the pitch of the DNA helix (Fig. 1c). Other QC methods adopted by the ENCODE consortium have been integrated (see Supplementary Material), and concordance between replicates can be reported.
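The idea behind the periodicity check can be sketched with a discrete Fourier transform in plain Python. This is an illustration on synthetic data, not esATAC's code. The signal below is an idealized fragment length histogram built from two cosine components with periods of 200 bp and 10 bp (10 rather than 10.4 so that both periods fit the toy window exactly). Real distributions also decay with fragment length, which this sketch ignores. The function name `dominant_period` is hypothetical.

```python
import cmath
import math


def dominant_period(signal):
    """Return the period, in samples, of the strongest non-constant DFT component."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the constant (zero-frequency) term
    best_k, best_power = 1, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k


# Synthetic histogram: counts per fragment length for lengths 0..599 (assumed data).
lengths = range(600)
counts = [
    100
    + 50 * math.cos(2 * math.pi * t / 200)  # nucleosome-scale component
    + 10 * math.cos(2 * math.pi * t / 10)   # helix-pitch-scale component
    for t in lengths
]

period = dominant_period(counts)
print(f"dominant period: {period:.1f} bp")
```

The naive transform costs O(n^2) operations, which is acceptable for a histogram of a few hundred bins. A fast Fourier transform gives the same spectrum more efficiently.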

2.3 Implementation

For user convenience, we provide preset pipelines for analyzing a single sample or paired case-control samples from human or mouse. Users only need to provide the raw sequencing files and can execute the entire pipeline with one command in R. Dependent data such as annotation files and the Bowtie2 index can be downloaded and built automatically. An HTML summary report covering QC and statistical analysis is generated.

The package is organized as a dataflow graph, so users can easily understand and trace the processing modules of the pipeline (see Supplementary Material). Mechanisms such as input validity checking allow advanced users to customize the pipeline or integrate other tools at any intermediate stage. esATAC provides memory control and parallel computing options to improve computing efficiency. Breakpoint detection ensures that users do not have to rerun finished steps if the program is interrupted.
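One common way to let an interrupted pipeline resume is to skip any step whose output file already exists. The Python sketch below shows that generic pattern. It is not a description of how esATAC implements breakpoint detection, and the step names, file names and helper functions are all hypothetical.

```python
import os
import tempfile


def run_step(name, output_path, action, log):
    """Run `action` unless its output already exists, and record what happened."""
    if os.path.exists(output_path):
        log.append((name, "skipped"))
        return
    tmp_path = output_path + ".tmp"
    action(tmp_path)
    # Rename only after the step succeeds, so a partial file is never
    # mistaken for a finished result on the next run.
    os.replace(tmp_path, output_path)
    log.append((name, "ran"))


def write_text(text):
    """Build a placeholder action that writes `text` to the given path."""
    def action(path):
        with open(path, "w") as handle:
            handle.write(text)
    return action


def run_pipeline(workdir):
    """Run three placeholder steps in order and return the execution log."""
    log = []
    steps = [("trim", "trimmed.txt"), ("align", "aligned.txt"), ("peaks", "peaks.txt")]
    for name, filename in steps:
        run_step(name, os.path.join(workdir, filename), write_text(name), log)
    return log


with tempfile.TemporaryDirectory() as workdir:
    first = run_pipeline(workdir)
    # Simulate an interruption that happened before the last step finished.
    os.remove(os.path.join(workdir, "peaks.txt"))
    second = run_pipeline(workdir)

print(first)
print(second)
```

Checking only for file existence is the simplest possible criterion. A more careful design would also record input parameters or checksums so that a step is rerun when its inputs change.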

3 Conclusion

We developed esATAC to make ATAC-seq data analysis easy for a wide range of users. esATAC covers the whole procedure of ATAC-seq data processing. It can be installed on different platforms and performs ‘one command line for result’ analysis. Users without sophisticated programming skills can get started easily. At the same time, all the sub-functions are componentized, making it a flexible platform for advanced users to build pipelines for specialized applications.

Funding

This work was supported by the National Science Foundation of China [grant nos. 31371341, 61773230 and 61721003], Tsinghua University Initiative Scientific Research Program [no. 20141081175] and the Open Research Fund of State Key Laboratory of Bioelectronics, Southeast University.

Conflict of Interest: none declared.

References

Boyle,A.P. et al. (2008) F-Seq: a feature density estimator for high-throughput sequence tags. Bioinformatics, 24, 2537–2538.

Buenrostro,J.D. et al. (2013) Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat. Methods, 10, 1213–1218.

Gaidatzis,D. et al. (2015) QuasR: quantification and annotation of short reads in R. Bioinformatics, 31, 1130–1132.

Koohy,H. et al. (2014) A comparison of peak callers used for DNase-Seq data. PLoS One, 9, e96303.

Langmead,B. et al. (2012) Fast gapped-read alignment with Bowtie 2. Nat. Methods, 9, 357–359.

Mathelier,A. et al. (2016) JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Res., 44, D110–D115.

Schubert,M. et al. (2016) AdapterRemoval v2: rapid adapter trimming, identification, and read merging. BMC Res. Notes, 9, 88.

Author notes

The authors wish it to be known that, in their opinion, Zheng Wei and Wei Zhang should be regarded as Joint First Authors.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
