Zsolt Balázs, Todor Gitchev, Ivna Ivanković, Michael Krauthammer, Fragmentstein—facilitating data reuse for cell-free DNA fragment analysis, Bioinformatics, Volume 40, Issue 1, January 2024, btae017, https://doi.org/10.1093/bioinformatics/btae017
Abstract
Method development for the analysis of cell-free DNA (cfDNA) sequencing data is impeded by limited data sharing due to the strict control of sensitive genomic data. An existing solution for facilitating data sharing removes nucleotide-level information from raw cfDNA sequencing data, keeping alignment coordinates only. This simplified format can be publicly shared and would, theoretically, suffice for common functional analyses of cfDNA data. However, current bioinformatics software requires nucleotide-level information and cannot process the simplified format. We present Fragmentstein, a command-line tool for converting non-sensitive cfDNA-fragmentation data into alignment mapping (BAM) files. Fragmentstein complements fragment coordinates with sequence information from a reference genome to reconstruct BAM files. We demonstrate the utility of Fragmentstein by showing the feasibility of copy number variant (CNV), nucleosome occupancy, and fragment length analyses from non-sensitive fragmentation data.
Implemented in bash, Fragmentstein is available at https://github.com/uzh-dqbm-cmi/fragmentstein, licensed under GNU GPLv3.
Introduction
Cell-free DNA (cfDNA) sequencing is revolutionizing non-invasive approaches to prenatal testing, cancer detection, and transplant monitoring (Norwitz and Levy 2013, Oellerich et al. 2020, Cisneros-Villanueva et al. 2022). Clinically relevant features obtained from cfDNA include point mutations, mutational signatures (Sanmamed et al. 2015), copy number variation, fragment lengths (Mouliere et al. 2018), end motifs (Moldovan et al. 2021), fragmentation patterns (Cristiano et al. 2019), and nucleosome footprints (Snyder et al. 2016, Sun et al. 2019, Peneder et al. 2021). However, sharing of human cfDNA sequencing data is limited by the highly sensitive nature of genome sequence information, which hampers the development of bioinformatics software for data analysis. Some data are not shared at all, either because participants did not consent to sharing their genomic information with other researchers or because certain countries limit access to the genetic data of their citizens (e.g. Denmark). Other genomic data are available upon request through restricted-access repositories such as dbGaP (Mailman et al. 2007), the European Genome-Phenome Archive (Freeberg et al. 2022), or the Japanese Genotype-Phenotype Archive (Kodama et al. 2015); however, obtaining access through these repositories is slow and cumbersome. While sequence data are sensitive, much of the data of interest to cfDNA research can be separated from sequence information. Analyses that do not use point-mutation data can be performed using only cfDNA fragment coordinates, and such fragmentation data can be publicly shared. FinaleDB (Zheng et al. 2021) is a dedicated database providing open access to de-identified fragment coordinates in a tabulated file format; however, non-sensitive fragmentation data could potentially be shared through any open repository.
Currently available cfDNA analysis software is written to process only alignment files, even when the analysis would be possible without the sensitive sequence information. To fill this gap, we developed Fragmentstein, a command-line tool that converts fragmentation data into sequence alignment files that can be processed by most contemporary cfDNA analysis software.
Usage
Fragmentstein is implemented as a bash script. It converts a tabulated file (BED, BEDPE, or TSV) containing fragment coordinates into a paired-end alignment file using the sequence of the specified reference genome. For a graphical overview, see Supplementary Fig. S1. Although Fragmentstein was developed to facilitate the reuse of cfDNA sequence data, it can also be used to create paired-end alignment files from any BED or similarly formatted tabular file.
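To make the conversion concrete, the following Python sketch shows one way a single fragment coordinate could be turned into a pair of SAM records by copying bases from a reference sequence. It is not Fragmentstein's implementation (which is a bash script); the function name `fragment_to_sam_pair`, the in-memory `reference` dictionary, the assumption of 0-based half-open (BED-style) coordinates, the read length, the constant mapping quality, and the placeholder base qualities are all hypothetical choices made for illustration.

```python
def fragment_to_sam_pair(chrom, start, end, reference,
                         read_len=50, mapq=60, name="frag"):
    """Build two SAM lines (forward and reverse mate) for one fragment.

    Assumptions (illustrative only): `start`/`end` are 0-based, half-open
    coordinates; `reference` maps chromosome names to sequence strings;
    base qualities and MAPQ are placeholders because the de-identified
    input carries no such information.
    """
    frag_len = end - start
    if frag_len <= 0:
        raise ValueError("fragment end must be greater than start")

    # Reads cannot be longer than the fragment itself.
    n = min(read_len, frag_len)

    # SAM stores SEQ in forward-reference orientation for both mates,
    # so both reads are plain slices of the reference.
    seq1 = reference[chrom][start:start + n]   # leftmost n bases
    seq2 = reference[chrom][end - n:end]       # rightmost n bases

    cigar = f"{n}M"        # every base matches the reference by construction
    qual = "I" * n         # placeholder qualities
    pos1 = start + 1       # SAM positions are 1-based
    pos2 = end - n + 1

    # Flag 99: paired, proper pair, mate reverse, first in pair.
    # Flag 147: paired, proper pair, read reverse, second in pair.
    read1 = [name, 99, chrom, pos1, mapq, cigar, "=", pos2, frag_len, seq1, qual]
    read2 = [name, 147, chrom, pos2, mapq, cigar, "=", pos1, -frag_len, seq2, qual]
    return ("\t".join(map(str, read1)), "\t".join(map(str, read2)))


if __name__ == "__main__":
    toy_reference = {"chr1": "ACGTACGTAC"}
    for line in fragment_to_sam_pair("chr1", 2, 8, toy_reference, read_len=4):
        print(line)
```

Because the template length field records the full fragment length for both mates, downstream tools that rely on fragment coordinates can recover them from such records even though the reads contain only reference bases.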
Application
We evaluated the utility of Fragmentstein by analysing a commonly used cfDNA sequencing dataset (Snyder et al. 2016) in its original BAM format, containing nucleotide-level information, as well as in a non-sensitive TSV format from the FinaleDB database (Zheng et al. 2021), containing sequence coordinates only. To demonstrate the feasibility of analyses on a range of different cfDNA data, we included paired-end sequencing data from both ssDNA and dsDNA libraries from the Snyder dataset. We analysed whole-genome cfDNA sequencing data of three healthy, four lupus erythematosus, and seven cancer samples sequenced to depths ranging from 10 to 60×. To evaluate the feasibility of different types of analyses with the outputs of our tool, we performed fragment length distribution analysis, copy number analysis (Adalsteinsson et al. 2017), and nucleosome profiling (Peneder et al. 2021) on both the original BAM files and the non-sensitive sequencing data processed by Fragmentstein (Fig. 1A). The BAM files were processed using the same filtering settings (minimum mapping quality: 30). However, our pipeline for processing the “original” BAM files obtained from the publication by Snyder et al. (2016) differed from the pipeline used by FinaleDB in three respects. The FinaleDB pipeline used trimmomatic (Bolger et al. 2014) for read trimming and samblaster (Faust and Hall 2014) for marking duplicates, whereas our pipeline used skewer (Jiang et al. 2014) and picard (http://broadinstitute.github.io/picard/), respectively. In addition, the FinaleDB pipeline calculates GC bias but does not correct for it, whereas our pipeline does (see the Supplementary Methods for more details).

Figure 1. (A) Overview of the application test case. Alignment files and publicly accessible fragment coordinate data of the same samples were downloaded. Fragmentstein creates alignment files for each sample using only non-sensitive information. The original alignment files and the alignment files generated by Fragmentstein were subjected to fragment length, copy number, and nucleosome occupancy analysis. (B) Heatmap representation of fragment length distributions. The log2 ratio of fragments in each sample is depicted, with red showing fragment sizes that are more frequent and blue showing fragment sizes that are less frequent in a given sample. (C) Tumour fraction estimates output by ichorCNA based on copy number analysis. (D) Cell-type-specific nucleosome occupancy estimated by LIQUORICE. Signatures are defined as z-scores (compared to healthy samples) of coverage dip depths at cell-type-specific DHSs.
Fragment length distributions were very similar when analysing the original BAM files and the BAM files output by Fragmentstein (Fig. 1B). It has been observed that in cancer and several inflammatory diseases, cfDNA fragments are shorter than in healthy individuals. Therefore, we compared the ratio of short (shorter than 150 bp) fragments in healthy individuals, lupus erythematosus patients, and cancer patients, and obtained nearly identical results from the original BAM files and the Fragmentstein outputs (Supplementary Fig. S2). The slight differences, most apparent in the ssDNA libraries, can be attributed to differences in the alignment filtering (see Supplementary Methods).
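As a minimal sketch of the short-fragment comparison, the proportion of fragments shorter than 150 bp can be computed directly from fragment lengths (end minus start for each coordinate pair). The function name `short_fragment_fraction` and the choice to express the ratio as short fragments over all fragments are assumptions for illustration; the exact definition used in the study is given in its Supplementary Methods.

```python
def short_fragment_fraction(fragment_lengths, threshold=150):
    """Return the fraction of fragments strictly shorter than `threshold` bp.

    Assumption (illustrative): the ratio is short fragments divided by all
    fragments in the sample.
    """
    lengths = list(fragment_lengths)
    if not lengths:
        raise ValueError("no fragments supplied")
    n_short = sum(1 for length in lengths if length < threshold)
    return n_short / len(lengths)


if __name__ == "__main__":
    # Toy fragment lengths in bp; real input would come from
    # end - start of each fragment coordinate.
    toy_lengths = [100, 149, 150, 167, 320]
    print(short_fragment_fraction(toy_lengths))
```

The same function can be applied to lengths derived from the original alignments and from the reconstructed ones to check that both yield comparable values.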
Using the ichorCNA package (Adalsteinsson et al. 2017), we detected the same copy number variants with similar tumour fraction estimates in both the original BAM files and the files processed with Fragmentstein (Fig. 1C and Supplementary Fig. S3).
We performed nucleosome footprint analysis as implemented in the LIQUORICE package (Peneder et al. 2021) to identify differences in cell-type signatures between samples. Cell-type signatures were defined as a drop in coverage at cell-type-specific DNase hypersensitivity sites (DHSs). We observed similar levels of contribution of the analysed cell types (haematopoietic, hepatocyte, lung epithelium, and pancreas epithelium) in the original BAM files and the ones created by Fragmentstein (Fig. 1D). While key observations such as increased haematopoietic signatures in systemic lupus erythematosus (SLE) samples and decreased haematopoietic signatures in cancer samples were consistent between the two processing methods, the intensities of the hepatocyte and pancreas epithelium signatures differed. This observation highlights that some derived measures such as nucleosome footprints may be sensitive to preprocessing choices such as different alignment filtering or GC bias correction options.
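The Figure 1 caption describes these signatures as z-scores of coverage dip depths relative to healthy samples. The sketch below illustrates that standardization in plain Python; it does not call LIQUORICE, and the function name `dip_depth_z_score` and the use of the sample standard deviation are assumptions rather than a description of how the package computes its scores.

```python
from statistics import mean, stdev


def dip_depth_z_score(sample_dip_depth, healthy_dip_depths):
    """Standardize one sample's coverage dip depth against healthy controls.

    Assumption (illustrative): the reference spread is the sample standard
    deviation of the healthy controls, which requires at least two controls.
    """
    controls = list(healthy_dip_depths)
    if len(controls) < 2:
        raise ValueError("need at least two healthy control values")
    spread = stdev(controls)
    if spread == 0:
        raise ValueError("healthy controls have zero variance")
    return (sample_dip_depth - mean(controls)) / spread


if __name__ == "__main__":
    # Toy dip depths for three healthy controls and one test sample.
    print(dip_depth_z_score(4.0, [1.0, 2.0, 3.0]))
```

With only a few healthy controls, the estimated spread is itself uncertain, which is one reason such scores can shift when preprocessing changes.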
Discussion
Fragmentstein provides a simple and flexible solution for converting fragment coordinate information from non-sensitive cfDNA data into alignment (BAM) files, which can be processed by typical bioinformatics software. Although mutational information and mapping quality data present in the original BAM files are lost during data de-identification, the recovered alignment files are suitable for most DNA fragment-based analyses. The script is openly accessible, easy to use, and can be customized to fit the user’s specific needs.
Conflict of interest
None declared.
Funding
Z.B. received funding from the Forschungskredit of the University of Zurich (FK-20–103) for his work on facilitating the sharing and reuse of cfDNA sequencing data.
Data availability
Sequencing data used in the study were acquired from FinaleDB (http://finaledb.research.cchmc.org/; https://kircherlab.bihealth.org/download/cfDNA/) and from the Gene Expression Omnibus using the accession number GSE71378 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE71378).