Classifying cells with Scasat - a tool to analyse single-cell ATAC-seq

Motivation The assay for transposase-accessible chromatin using sequencing (ATAC-seq) reveals the landscape and principles of DNA regulatory mechanisms by identifying the accessible genome of mammalian cells. When done at single-cell resolution, it provides an insight into the cell-to-cell variability that emerges from identical DNA sequences by identifying the variability in the genomic location of open chromatin sites in each of the cells. Processing of single-cell ATAC-seq requires a number of steps and a simple pipeline to processes and analyse single-cell ATAC-seq is not yet available. Results This paper presents ScAsAT (single-cell ATAC-seq analysis tool), a complete pipeline to process scATAC-seq data with simple steps. The pipeline is developed in a Jupyter notebook environment that holds the executable code along with the necessary description and results. For the initial sequence processing steps, the pipeline uses a number of well-known tools which it executes from a python environment for each of the fastq files. While functions for the data analysis part are mostly written in R, it is robust, flexible, interactive and easy to extend. The pipeline was applied to a single-cell ATAC-seq dataset in order to identify different cell-types from a complex cell mixture. The results from Scasat showed that open chromatin location corresponding to potential regulatory elements can account for cellular heterogeneity and can identify regulatory regions that separates cells from a complex population. Availability The jupyter notebook with the complete pipeline applied to the dataset published with this paper are publicly available on the Github (https://github.com/ManchesterBioinference/Scasat). An additional notebook is also provided for analysis of a publicly available dataset. The fastq files are submitted at ArrayExpress database at EMBL-EBI (www.ebi.ac.uk/arrayexpress) under accession number E-MTAB-6116. Contact syed.murtuzabaker@manchester.ac.uk and magnus.rattray@manchester.ac.uk Supplementary information Supplementary data are available at bioRxiv online.

In this paper we introduce a bioinformatics pipeline to conduct the following tasks: • Processing: Adapter trimming, quality control, mapping, removing blacklisted reads, removing PCR dupicates and calling peaks.
• Peak accessibility for each cell: We first merge all the single-cell BAM files to create a reference set of peaks. An accessibility matrix is then generated using this reference set with the accessibility information for these peaks in each of the cells.
• Quality Control: Lower quality cells and peaks are removed. Also, peaks that would cause stronger batch effects are removed.
• Downstream analysis: Clustering the cells, deconvoluting cell types from a mixed cell population or identifying differentially accessible peaks between two groups of cells. Figure (1) describes the complete work-flow of Scasat. It starts with the processing steps followed by the downstream analysis of the single-cell ATAC-seq. In the following we provide details of the workflow and demonstrate its utility by applying this to deconvoluting a mixture of three oesophageal tissue (ie one is "normal" and two are "tumours") derived cell lines.

Method
Here we describe, Scasat, Single-cell ATAC-seq Analysis Tools, processing and analysing singlecell ATAC-seq data. The Scasat workflow typically consists of four steps, 1. Data processing; 2. Feature extraction; 3. Heterogeneity analysis of the cells; 4. Differential accessibility analysis of the peaks between two cluster of cells. To make it convenient for the user, we have introduced two notebooks for this analysis. The first notebook does the processing of the data and the second notebook does the downstream statistical analysis. Below we discuss the workflow of both these notebooks.

Notebook environment
The complete pipeline was developed using the jupyter notebook environment (Pérez and Granger, 2007). The jupyter notebook is a web-based notebook that can execute code, produce figures and put all the necessary explanations in the same place. As all the function definitions of the tool are open to the user, it can be easily extended to integrate new tools or approaches. Although the current version of the pipeline enables only the serial processing of the cells at the processing step, it can be easily extended for parallel processing. The notebook makes it easy to document, understand and share the code with non-technical users (Shen, 2014). Scasat uses many of the most widely used software tools at the processing step. The parameters and paths of these tools are set in the python environment. For the downstream analysis Scasat uses the R programming languages for statistical computing and graphical visualization of the results. The use of R magic cells in the notebook variables makes the pipeline more robust and allows both programming languages to be used in the same workflow.

Sequence data processing
The processing step starts with first configuring the folders and setting the paths of the software.
The user configures the inputFolder to the foldername where all the fastq files are. The outputFolder is configured to store all the processed files. Experiments using sequencing applications (ATACseq, Chip-seq) generate artificial high signals in some genomic regions due to inherent properties of some elements. In this pipeline we removed these regions from our alignment files using a list of comprehensive empirical blacklisted regions identified by the ENCODE and modENCODE consortia (Encode, 2012). The location of the reference genome is set through the parameter ref genome. This folder contains the index file for the bowtie2 aligner. A brief description of the tools that we have used in this processing notebook are given below • Trimmomatic v0.36 (Bolger et al., 2014) is used to trim the illumina adapters as well as to remove the lower quality reads.
• Bowtie v2.2.3 (Langmead et al., 2009) is used to map paired end reads. We used the parameter -X 2000 to allow fragments of up to 2kb to align. We set the parameter -dovetail to consider dovetail fragments as concordant. The user can modify these parameters depending on experimental design.
• Samtools (Li et al., 2009) is used to filter out the bad quality mapping. Reads with a mapping quality ¿ q30 are only retained. Samtools is also used to sort, index and to generate the log of mapping quality.
• Bedtools intersect (Quinlan and Hall, 2010) is used to find the overlapping reads with the blacklisted regions and then remove these regions from BAM file.
• Picards (Picard, 2017) MarkDuplicate is used to mark and remove the duplicates from the alignment.
• MACS2 (Zhang et al., 2008) is used with the parameters -nomodel, -nolambda, -keep-dup all -call-summits to call the peaks associated with ATAC-seq. During the callpeak we set the p-value to 0.0001. This is due to the fact that otherwise MACS2 will not call the peaks having a single read mapped to it as it would consider those reads to be background noise.
• Bdg2bw is used to generate the Bigwig files for the UCSC genome browser visualization.
• QC: A final quality control is performed based on the library size of the BAM file. We filter out the cells for which the library size estimated by Picard tool is less than a user-defined threshold. The default value of the LIBRARY SIZE THRESHOLD is set to 10000. We consider any cell having a library size lower than this threshold to not be a valid cell as those reads may come from debris free material or from dead cells.

Bulk vs aggregate
If a Bulk measurement is available for the same cell-type or sample, then the pipeline can calculate Number of peaks vs. precision for the aggregated single-cell data against its population-based Bulk data. This demonstrates how closely the single-cell data recapitulates its Bulk counterpart. We define peakA's list all the peaks in the population based on Bulk data and peakB's list the peaks in aggregated single-cells sorted on q-values. peakA is considered to be the gold standard for this calculation. We start with the top 100 peaks in the sorted peak list of peakB. We assume that all these 100 peaks are positive peaks overlapping with the peaks in peakA. Now, all the peaks within this 100 list that actually overlap with the peaks in peakA are considered the True Positive and the ones that do not overlap are the False Positive ones. Now, we take all the remaining peaks in peakB (peaks that we get after removing the 100 previously selected ones) as the negative ones. If any of the peaks in this negative set overlaps with peaks in peakA then we denote them as False Negative. Otherwise, they are called as True negative. We then calculate the precision as

Calculate Jaccard distance
Once the accessibility matrix is generated, we are interested in a dissimilarity measure that quantifies the degree to which two cells vary in their peak accessibility. In our pipeline we use the Jaccard distance (Jaccard, 1901) as a dissimilarity measure. The Jaccard distance is the ratio between the number of peaks that are unique to a cell against all the peaks that are open in two cells.

Dimensionality reduction
Our peak accessibility matrix represents a very high-dimensional dataset of open chromatin regions for each single-cell. Dimensionality reduction for this high-dimensional dataset is essential for easy visualization and other downstream analysis. In our pipeline we applied the multidimensional scaling (MDS) and the t-distributed stochastic neighbour embedding (t-SNE). MDS provides a visual representation of similarity (or dissimilarity) between two objects. It takes as input the distances between any pair of objects and then minimizes a loss function called strain (Borg and Groenen, 2003) so that the between object distances are preserved as much as possible. t-SNE is a non-linear dimensionality reduction technique that maps multidimensional data to two or more dimensions for easy visualization. t-SNE converts the similarity between the data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of high dimensional data and low-dimensional embedding of this data (van der Maaten and Hinton, 2008). To reduce the noise and computational speed it is recommended to use a lower dimensional representation of the data as input to t-SNE.

Clustering
In this pipeline we used the k-medoids algorithm to cluster the cells into different groups. The k-medoids algorithm breaks the dataset into different partitions and attempts to minimize the distance between points assigned in a cluster and a point designated as the center of that cluster.
In our pipeline we used Jaccard distance as the dissimilarity matrix for this algorithm.

Differential accessibility (DA) analysis
One of the key features of interest in single-cell ATAC-seq analysis is to perform a statistical analysis to discover whether quantitative changes in accessibility of chromatin locations between two groups of cells are statistically significant. The pipeline implements two methods, Fisher exact test and Information gain to conduct the differential accessibility analysis between any two groups of cells.

Information gain
Based on the expected reduction in entropy (homogeneity measure), information gain measures the attribute that provides the best prediction of the target attribute where entropy is reduced.
In differential accessibility analysis, information gain lists the peaks that best splits the cells into different groups. Defining Entropy as c i=1 −p i log 2 p i , the Information gain is calculated as Here, P is the collection of all data, P G1 represents the cells in G 1 and P G2 represents cells in G 2 .
We calculate the information gain of each of the peaks given the cells are divided into two groups.
The peaks are then sorted based on the informatin gain and the user can choose the cutoff value for selecting the DA peaks.

Fisher exact test
The Fisher exact test looks at a 2 × 2 contingency table that shows how different groups/conditions have produced different outcomes. Its null hypothesis states that the outcome is not affected by groups or conditions. We run this test on a peak-by-peak basis by organizing the open and closed (1's and 0's) for each peak in a 2 × 2 contingency table. The p-values are then corrected using Bonferroni correction for multiple comparisons (Bonferroni, 1936). Differentially accessible peaks with statistical significance are then selected based on a user-defined cutoff value (default: q-value ¡ 0.01).

Results
To demonstrate Scasat we generated scATAC-seq data from a mixture of three different cell types and the objective was to identify these three cell types from this mixture. We applied Scasat to characterize biologically relevant chromatin variability associated with each cell-type.

Experimental design
To create the mixture we took two classic oesophageal adenocarcinoma (OAC) cell lines, OE19, OE33 and one non-neoplastic HET1A cell line. We mixed the three cell types in equal proportions to create a heterogeneous population. Two samples from this mixture were taken to make two technical replicates. ATAC-seq was then performed on those two replicates by loading on two separate C1 fluidigm chips using a 96 well plate integrated fluidic circuit (IFC) and sequenced on an Illumina NextSeq (Figure (2)). As ATAC-seq reports on the accessible regions of the chromatin which are considered to be active (Buenrostro et al., 2015a), these three cell lines are expected to have different accessibility at regulatory regions. The analysis attempted to disentangle these cells based on the presence of these active sites.

Sample preparation
We plated 1 × 10 6 of each cell type onto 10 cm plates and incubated for 24 hours at 37 • C. Cells were detached using 0.1% trypsin. After detaching cells, each individual cell type was placed into a separate tube, centrifuged and re suspended in 1 ml of 1X phosphate-buffered saline (PBS).
Each cell type suspension was quantified using a haemocytometer and this gives us a concentration in cells per ml. We then put the same number of each cell type into the same tube (to get the mixed population), which was then centrifuged again and re-suspended in 100 µl of 1X P BS.

Data processing and analysis
Two batches were processed separately making it easier to keep track of the processing steps and to troubleshoot any problem that might arise due to batch effects. The parameters for the tools are explained in the tool description section. We trim the adapter sequences using Trimmomatic and used Bowtie2 to map reads to the genome. After removing the Blacklisted regions, we use Picard's Markduplicate, to remove the duplicates. We then remove the chrY (as the three cells lines are a mixture of male and female) and chrM. The peak calling is done in two stages. In the first stage macs2 is used to call peaks in each individual cell by setting the p-value parameter to 0.0001 to ensure that peaks associated with low mapping reads are also called. We then filter the cells that fail to cross the LIBRARY SIZE THRESHOLD set to 10000. After this QC, Batch-1 has 84 and Batch-2 had 89 cells for downstream analysis. In the second stage of peak calling, we aggregated all the BAM (both batches) files by calling getAggregatedPeak() module and use macs2 to call peaks on this aggregated BAM file with qvalue of 0.2. This gave us a reference set of peaks. We then use the mergePeaks() module to merge the overlapping peaks in this reference set and sort them based on the q-value giving us a total of 236, 580 peaks as reference set. Finally we call peakAccessibility() module to calculate the accessibility of these reference peaks for each single-cell and generated the peak accessibility matrix.

Peak selection
We used the clean.count.peaks() with the default parameters to remove the lower quality cells and peaks. When setting min.cell.peaks.obs=10 (a valid peak has to be observed in at least 5% cells across both batches), a significant number of peaks were removed. These peaks are filtered out because we believe they do not contain enough information for reliable statistical inference. One caveat of this approach is that peaks associated with more abundant cell types are potentially kept and we might fail to detect rare sub-populations. In such cases the threshold for number of cells with a peak can be reduced.

Dimensionality reduction
We now employed the plotMDS() module that subsequently calls the getJaccardDist() module to calculate Jaccard distances between the cells, applies multidimensional scaling on this Jaccard

Additional filtering
Applying conventional batch correction methods like linear regression as in Limma or other approaches outlined in SVA or Combat are not applicable here as all of them would convert the binary data into a real number. So we had to apply a different approach. Careful analysis of the data found that the peaks in Batch-1 had higher number of zeros associated with aggregated peaks, indicating that less information is attained from Batch-1 cells compared to Batch-2 (FIG S2). So we applied an additional filtering. In that filtering, if a peak is observed in less than three cells in any of the batches we discarded that peak. This further reduced the number of peaks to 17, 255 peaks. Applying MDS to these peaks the cells no longer show a strong batch effect (Figure 3(B)).

Clustering
We used the Partitioning Around Medoids (PAM) algorithm which is the most common realisation

Identifying cell types
To relate the different groups to the input cell types, we compared it with the Bulk data of HET1A, OE19 and OE33 and also with single-cell datasets for OE19 (done in two batches, B1 and B2) and HET1A (Batch B1) by aggregating the single cells into its corresponding cell type. For this comparison, we merged the mapped reads for each cell in each of the clusters to create three aggregated BAM files, one for each of Cluster1, Cluster2 and Cluster3 identified in Figure 3(C).
For the other two single-cell dataset we did the same. We now extend the column of peak accessibility matrix by calculating the peak accessibility for the three aggregated single cell data (OE19 B1, B2 and HET1A B1) and the Bulk datasets (HET1A, OE33, OE19) using the same 17, 255 peaks.
Finally, we calculate the Pearson correlation coefficient for each of these dataset against each other which is shown in a correlation plot ( Figure 4) and cluster them using hierarchical clustering. The correlation plot assigns each of the single cell clusters to their respective cell type based on the high correlation coefficient with the known cell types. This identifies Cluster1 as OE33 cell, Cluster2 as OE19 cell and Cluster3 as HET1A cell (Figure 4). If the Bulk data is not available or any other input cell types are not known, GO based analysis eg. GREAT (McLean et al., 2010) can be used to assign cells to a particular cell type .

Evaluating clustering solutions with golden truth
We used the Bulk data to identify the Ground truth on this data by calculating the correlation for each single-cell against the three Bulk data and assigning each cell to its corresponding type based on the largest correlation coefficient. These assignments are then projected onto MDS plot in FIG. S4. This Ground Truth shows very similar cell assignment as with our k-medoid algorithm.
The adjusted rand index (ARI) that measures the similarity between two clusters gives a value of 0.75 for the cell assignment by k-medoid clusters and the ground truth where 1 signifies complete overlap between the two assignments.

UCSC Genome tracks
We created a genome browser view of the aggregated single cells in the clusters and compared this to the data from Bulk ATAC-seq experiments performed on known cell populations. We ran our Differential Accessibility analysis with the module getDiffAccessInformationGain() which uses the entropy and information gain to identify the differentially accessible peaks between Cluster1 vs Cluster2 and Cluster2 vs Cluster3. We annotated these peaks by finding the overlapping genes with a maximum distance of 1000 bases. Figure

GO based functionality
To confirm these cluster assignments further, we looked at the disease ontology associated with the cells in the clusters. We used the peaks that we identified during differential accessibility analysis of Cluster2 and Cluster3 and took the peaks that are differentially accessible with with more than two log2 fold change in Cluster2 compared to Clsuter3 and identified the Disease ontologies associated with these peaks through GREAT. GREAT first associate the peaks with potential target genes and then find out the disease ontologies associated with these open chromatin locations. The same process is repeated for Cluster3. We then identify the 15 topmost diseases associated with Cluster2 and maps the statistical significance for these disease in Cluster3. Although esophageal carcinoma did not make into the topmost 15 disease in Cluster2 (It came in 33 th position) we added it into this list of topmost genes. The -log10(Binom p-value) for these diseases are shown in FIG. S5. Esophageal carcinoma is picked up as one of the diseases associated with Cluster2 with very high confidence whereas the statistical significance is almost not present for Cluster3 cells. Several other significant disease associations are seen with adenocarcinomas and with general cancer. In almost all of these disease ontologies Cluster2 shows very high statistical significance whereas Cluster3 shows almost no association with very low p-value. This further confirms our accurate identification of HET1A cells and cancer derived subtypes supporting the conclusion of Corces et al. (2016) that scATAC-seq in addition to scRNA-seq can accurately identify cell types in complex cellular populations.
In summary, we showed that peak accessibility information from scATAC-seq can be used to deconvolute single-cells from a mixed cell population. We showed that an unsupervised clustering algorithm clusters cells according to their respective cell type. Similar methods could be applied to separate malignant cells from normal cell in a cancerous tissue and then investigate the malignant cells in detail.

Conclusions
As single-cell ATAC-seq experiments are gaining momentum, we report Scasat, a pipeline to process and analyse single-cell ATAC-seq. Scasat offers two major utilities, the initial processing of scATACseq data and its downstream analysis. Scasat is implemented in jupyter notebooks making it simple, robust, scalable and easy to extend. Results from our data showed that an unsupervised clustering of the cells based on accessible chromatin regions can group cells into their corresponding cell type. This suggests that regulatory elements can define cell identity quite precisely. The successful implementation of this tool helped us to understand further the epigenetic mechanisms at the single-cell level and opens opportunities for easier and better analysis of single-cell ATAC-seq data.