Nicola Casiraghi, Francesco Orlando, Yari Ciani, Jenny Xiang, Andrea Sboner, Olivier Elemento, Gerhardt Attard, Himisha Beltran, Francesca Demichelis, Alessandro Romanel, ABEMUS: platform-specific and data-informed detection of somatic SNVs in cfDNA, Bioinformatics, Volume 36, Issue 9, May 2020, Pages 2665–2674, https://doi.org/10.1093/bioinformatics/btaa016
Abstract
The use of liquid biopsies for cancer patients enables the non-invasive tracking of treatment response and tumor dynamics through single or serial blood draws. Next-generation sequencing assays allow for the simultaneous interrogation of extended sets of somatic single-nucleotide variants (SNVs) in circulating cell-free DNA (cfDNA), a mixture of DNA molecules originating both from normal and tumor tissue cells. However, low circulating tumor DNA (ctDNA) fractions together with sequencing background noise and potential tumor heterogeneity challenge the ability to confidently call SNVs.
We present a computational methodology, called Adaptive Base Error Model in Ultra-deep Sequencing data (ABEMUS), which combines platform-specific genetic knowledge and empirical signal to readily detect and quantify somatic SNVs in cfDNA. We tested the capability of our method to analyze data generated using different platforms with distinct sequencing error properties and compared ABEMUS performance with other popular SNV callers on both synthetic and real cancer patient sequencing data. Results show that ABEMUS performs better in most of the tested conditions, proving its reliability in calling somatic SNVs with low variant allele frequencies in plasma samples with low ctDNA levels.
ABEMUS is cross-platform and can be installed as an R package. The source code is maintained on GitHub at http://github.com/cibiobcg/abemus, and it is also available on CRAN, the official R repository.
Supplementary data are available at Bioinformatics online.
1 Introduction
Liquid biopsy provides an exceptional source of information for the identification and measurement of biomarkers relevant to precision oncology, from diagnosis and prognosis to treatment selection and monitoring of treatment response (Heitzer et al., 2019). Circulating cell-free DNA (cfDNA) carries the genomic characteristics of tumor cell material shed into the bloodstream. In the presence of metastatic disease and/or of multifocal tumors, where single tissue biopsies would fall short in allowing heterogeneity assessment, cfDNA represents an ideal alternative to capture the disease genomic features. Several studies already demonstrated the prognostic value of circulating tumor DNA (ctDNA; the fraction of free DNA released from tumor cells as opposed to normal cells) and the ability to track tumor dynamics through the analysis of genomic lesions detected in the circulation of cancer patients (Annala et al., 2018; Bettegowda et al., 2014; Dawson et al., 2013; Sclafani et al., 2018; Siravegna et al., 2015; Thierry et al., 2014; Tie et al., 2016; Vietsch et al., 2017). One outstanding example of the use of liquid biopsy for the detection of relevant single-nucleotide variants (SNVs) is the FDA-approved test for the EGFR exon 21 L858R substitution mutation in metastatic non-small-cell lung cancer patients (Kwapisz, 2017), approved on June 1, 2016. While highly sensitive technologies such as digital PCR can be used for the investigation of SNVs in cfDNA, only next-generation sequencing (NGS) approaches allow for the simultaneous interrogation of large sets of genomic loci and for the discovery of mutations, even with restricted amounts of DNA (10–50 ng). In NGS-based cfDNA testing, the trade-off between SNV detection performance and sequencing depth is key. Specifically, low ctDNA fractions together with potential tumor heterogeneity challenge the ability to confidently call SNVs, also because of the sequencing background noise.
We therefore recognized the need for a benchmarked widely applicable computational method that combines individual’s genetic knowledge and empirical signal to readily detect and quantify somatic SNVs in cfDNA also in the presence of low tumor fractions. We set up a computational methodology named Adaptive Base Error Model in Ultra-deep Sequencing data (ABEMUS) to discriminate between true SNVs and artefactual signals by learning locus-specific and data-driven variant-allelic fraction (AF) thresholds while leveraging platform-specific single base resolution information from sequencing assays (Fig. 1). Performance and results were compared across an array of in silico and real liquid biopsy data (including in silico dilutions) against SNV detection methods commonly used in tumor tissue-based studies (Cibulskis et al., 2013; Kim et al., 2018; Koboldt et al., 2012; Larson et al., 2012) or specifically proposed for cfDNA data (Kockan et al., 2017).
2 Materials and methods
2.1 Plasma and germline sequencing data from cancer patients
To build different ABEMUS platform-specific sequencing error reference models and study their properties, we collected germline samples sequencing data profiled using five platforms (here intended as the combination of library preparation kit and sequencing machine/chemistry). Specifically, we used both (i) whole-exome sequencing (WES) data from 40 normal samples sequenced both with NimbleGen (Roche NimbleGen SeqCap Exome v3, 64 Mb covered) (Beltran et al., 2017) and with HaloPlex (Agilent HaloPlex Exome, 36 Mb covered) kits (Beltran et al., 2016), and (ii) custom-targeted panel data from three sets of normal samples (N = 20, 113 and 3) sequenced via Roche NimbleGen N250 targeted panel, Ion AmpliSeq Targeted Custom Amplicon Panel (Carreira et al., 2014; Romanel et al., 2015) or Illumina True Seq Custom Amplicon and covering 3.2 Mb, 40 kb and 106 kb, respectively (see Supplementary Table S1). Additionally, we queried 118 plasma samples from 17 metastatic prostate cancer patients (median number of plasma samples per patient is 5) profiled on an Ion AmpliSeq Targeted Custom Amplicon Panel. The case samples have been previously annotated by tumor content (ctDNA) using CLONET (Prandi et al., 2014) and by manually curated SNVs calls (Carreira et al., 2014).
2.2 Data pre-processing for ABEMUS computations
Pileup data (PILEUP files) were generated using PaCBAM (Valentini et al., 2019) to obtain depth of coverage and allele-specific statistics at each considered locus. Genomic positions with variant AF greater than zero are available in *.pabs PaCBAM output files. Sequencing reads with read and base qualities ≥20 were retained in the pileup computation.
2.3 Global and local estimations of sequencing errors
2.4 ABEMUS single-nucleotide variants calls
This function returns the maximum AF observed among 100 000 experiments modeled as binomial distributions with success probability corresponding to the locus per-base error measure (pbem) and number of trials corresponding to the locus coverage. This value is then rescaled by a scaling factor R, which maximizes ABEMUS precision and recall in plasma samples with a given global mean coverage and target size.
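The simulation described above can be illustrated with a minimal Python sketch. It is not the ABEMUS implementation (ABEMUS is an R package); the function names `binomial_cdf`, `max_simulated_af` and `locus_af_threshold` are hypothetical, and the sketch assumes that the binomial success probability is the per-base error estimate of the locus and that the scaling factor is applied as a plain multiplication.

```python
import bisect
import random


def binomial_cdf(n_trials, p_success):
    """Return [P(X <= k) for k = 0..n_trials] with X ~ Binomial(n_trials, p_success).

    Intended for the small error rates typical of sequencing noise; for very
    large n_trials * p_success the first term can underflow to zero.
    """
    if not 0.0 < p_success < 1.0:
        raise ValueError("p_success must lie strictly between 0 and 1")
    odds = p_success / (1.0 - p_success)
    pmf = (1.0 - p_success) ** n_trials  # P(X = 0)
    cumulative, running = [], 0.0
    for k in range(n_trials + 1):
        running += pmf
        cumulative.append(running)
        # Recurrence: P(X = k + 1) = P(X = k) * (n - k) / (k + 1) * odds
        pmf *= (n_trials - k) / (k + 1) * odds
    return cumulative


def max_simulated_af(error_rate, coverage, n_experiments=100_000, seed=0):
    """Maximum AF seen across simulated binomial experiments at one locus."""
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    if error_rate <= 0.0:
        return 0.0  # no background error: no alternative reads are expected
    rng = random.Random(seed)
    cdf = binomial_cdf(coverage, error_rate)
    max_alt_reads = 0
    for _ in range(n_experiments):
        # Inverse-CDF sampling: smallest k such that P(X <= k) >= u
        alt_reads = min(bisect.bisect_left(cdf, rng.random()), coverage)
        max_alt_reads = max(max_alt_reads, alt_reads)
    return max_alt_reads / coverage


def locus_af_threshold(error_rate, coverage, scaling_factor=1.0, **kwargs):
    """Rescale the simulated maximum AF (assumed here to be a multiplication)."""
    return scaling_factor * max_simulated_af(error_rate, coverage, **kwargs)
```

For example, `locus_af_threshold(0.002, 2000, scaling_factor=1.1)` simulates a locus covered by 2000 reads with a background error of 0.002; the values are illustrative only.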
Further filtering criteria on minimal locus coverage and minimal AF in plasma sample can be applied to reflect a priori user-specific requirements. Additionally, when matched germline sample data are available, filters on minimal locus coverage and maximal AF in matched germline sample can be applied. At each computation step, the list of genomic loci to be processed is reduced (intermediate and final lists are saved). The final list includes the set of putative somatic SNVs for the plasma sample.
2.5 Synthetic BAM files generation, preserving real data features, coverage and sequencing error
To test ABEMUS performance, synthetic BAM files were generated using summary statistics from a collection of human germline samples. Specifically, we considered 50 germline BAM files profiled with Agilent HaloPlex Exome kit (36 Mb covered) at approximately 200× mean depth of coverage (Beltran et al., 2015). Coverage and allele-specific statistics across all captured genomic regions were computed and characterized at both region and base-specific level. In particular, we computed per-base error statistics and a start-position probability distribution which, for each position in the panel, measures the probability of observing a mapped read starting at that position. Synthetic BAM files were obtained from synthetic FASTQ files aligned to the human hg19 reference genome using BWA aligner (Li and Durbin, 2009) and were finally processed with SAMtools (Li et al., 2009). Given a required number of reads of fixed length and a set of heterozygous SNPs derived from randomly selected European individuals from the 1000 Genomes Project, synthetic FASTQ files were created by generating synthetic reads using the following procedure: (i) select a start alignment position using the start-position probability distribution; (ii) build the read sequence considering the genomic coordinates in the human hg19 reference genome and select an allele with probability 0.5; (iii) introduce an error at each read position with a probability reflecting the background error, where 0.002 is the average background error computed from the original germline data; (iv) introduce the alternative base of a SNP at a read position if that position corresponds to the genomic position of the SNP. If the synthetic data are intended to represent a case sample, a heterozygous SNV from a set of pre-selected heterozygous SNVs is introduced in a read at a given position if that position corresponds to the genomic position of the SNV; the SNV is introduced with a probability that depends on the simulated level of ctDNA. Base quality values in FASTQ files are all set to a pre-defined value.
Using this procedure, we generated two large datasets of synthetic data, one to optimize ABEMUS performance and one to run comparative performance study with other tools.
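The read-generation steps of Section 2.5 can be sketched in Python on a toy reference sequence. This is an illustration under stated assumptions, not the code used in the study: the function `simulate_read` and its parameters are hypothetical, a single uniform error rate stands in for the position-specific error probability, the SNV probability is passed in directly rather than derived from the ctDNA level, and errors are applied after germline and somatic alleles as a design choice of the sketch.

```python
import random

BASES = "ACGT"


def simulate_read(reference, start_weights, read_len, snps, error_rate, rng,
                  snvs=None, snv_prob=0.0, base_quality=20):
    """Generate one synthetic read as (start, sequence, quality string).

    snps and snvs map 0-based reference positions to alternative bases.
    """
    # (i) draw the start position from the empirical start-position weights
    n_starts = len(reference) - read_len + 1
    start = rng.choices(range(n_starts), weights=start_weights, k=1)[0]
    # (ii) copy the reference bases and pick one of the two alleles
    seq = list(reference[start:start + read_len])
    carries_alt_allele = rng.random() < 0.5
    for offset in range(read_len):
        pos = start + offset
        # (iv) heterozygous germline SNP on the selected allele
        if carries_alt_allele and pos in snps:
            seq[offset] = snps[pos]
        # case samples only: somatic SNV with a ctDNA-dependent probability
        if snvs and pos in snvs and rng.random() < snv_prob:
            seq[offset] = snvs[pos]
        # (iii) background sequencing error (uniform rate assumed here)
        if rng.random() < error_rate:
            seq[offset] = rng.choice([b for b in BASES if b != seq[offset]])
    # All base qualities set to one pre-defined value (Phred+33 encoding)
    quality = chr(33 + base_quality) * read_len
    return start, "".join(seq), quality
```

Calling `simulate_read` repeatedly with a shared `random.Random` instance yields the records of a synthetic FASTQ file, which would then be aligned and processed as described above.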
2.6 Generation of synthetic data to optimize ABEMUS performance
Using the previously described procedure, we generated a set of 50 synthetic germline BAM files and a set of 9 plasma-germline synthetic BAM file pairs covering 36 Mb (100% of HaloPlex target) at mean coverage of 2000×. Plasma BAM files were generated introducing in each sample a different set of 200 clonal heterozygous SNVs and mimicking a range of ctDNA values, namely 80%, 40%, 20%, 15%, 12.5%, 10%, 7.5%, 5% and 2.5%. PILEUP data for these samples were calculated with PaCBAM and used to generate synthetic input data for ABEMUS covering different scenarios of depth of coverage, target size and admixture level. Specifically, starting from those PILEUP data and adopting a sub-sampling procedure, we generated synthetic input data to represent assays with smaller genomic targets (75%, 50%, 25%, 12.5%, 6%, 3%, 1%, 0.5% and 0.1% corresponding to 26.6, 17.7, 8.9, 4.4, 2.1, 1.1, 0.4, 0.2 and 0.04 Mb of the 36 Mb HaloPlex target, respectively), each at multiple mean coverages (50%, 25%, 10% of the original coverage corresponding to 1000×, 500× and 200× mean coverage, respectively). Combinations of targets (N = 10) and coverage levels (N = 4) resulted in an extended collection of 2000 synthetic germline input data grouped in 40 target-coverage classes and 360 synthetic plasma-germline input data also grouped in 40 target-coverage classes across 9 different levels of ctDNA. Case tumor BAM files were generated introducing in each case a different set of 200 clonal heterozygous SNVs, except for BAM files covering 0.2 and 0.04 Mb, in which sets of 100 clonal heterozygous SNVs were introduced. For all synthetic samples, base qualities were set to 20. The length of generated synthetic reads was set to 101 bp. This dataset is referred to as Synthetic Dataset #1.
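The size of Synthetic Dataset #1 follows directly from the design grid. The short Python sketch below enumerates the target-coverage classes from the values listed above; the variable names are illustrative.

```python
from itertools import product

# Target sizes as fractions of the 36 Mb HaloPlex target (100% plus 9 sub-samples)
target_fractions = [1.0, 0.75, 0.5, 0.25, 0.125, 0.06, 0.03, 0.01, 0.005, 0.001]
# Mean coverages: the original 2000x plus the three sub-sampled levels
mean_coverages = [2000, 1000, 500, 200]
# Simulated ctDNA levels of the plasma samples
ctdna_levels = [0.80, 0.40, 0.20, 0.15, 0.125, 0.10, 0.075, 0.05, 0.025]
germline_per_class = 50

classes = list(product(target_fractions, mean_coverages))
n_germline_inputs = len(classes) * germline_per_class  # 40 classes x 50 samples
n_pair_inputs = len(classes) * len(ctdna_levels)       # 40 classes x 9 ctDNA levels

print(len(classes), n_germline_inputs, n_pair_inputs)
```

The three printed counts correspond to the 40 target-coverage classes, the 2000 germline inputs and the 360 plasma-germline inputs reported in the text.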
2.7 Generation of synthetic data for comparative analyses with published tools
A second set of plasma and matched germline synthetic BAM files was generated to compare ABEMUS performance against published SNV detection tools. Three combinations of depth of coverage and target size were considered: (i) 2000× mean depth of coverage across 1% of HaloPlex target; (ii) 1000× mean depth of coverage across 12.5% of HaloPlex target and (iii) 200× mean depth of coverage across 100% of HaloPlex target. For each scenario, we generated 50 synthetic germline BAM files and a set of 9 synthetic plasma-germline sample pairs spanning a range of ctDNA values (80%, 40%, 20%, 15%, 12.5%, 10%, 7.5%, 5% and 2.5%). Case plasma BAM files were generated introducing in each sample a different set of 200 clonal heterozygous SNVs. For each plasma sample, two synthetic BAM files were generated, with base qualities set to 20 and 30, respectively. The length of generated synthetic reads was set to 101 bp. This dataset is referred to as Synthetic Dataset #2.
2.8 In silico dilutions from real cfDNA data for comparative analyses with published tools
By applying this procedure, a final set of 291 synthetically diluted samples covering a wide range of ctDNA levels (80%, 40%, 20%, 15%, 12.5%, 10%, 7.5%, 5% and 2.5%) was generated. This dataset, which is hence built using a sub-sampling procedure that mixes sequencing reads from real cfDNA and matched control samples, is referred to as Synthetic Dataset #3.
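The arithmetic behind an in silico dilution can be sketched as follows. This is an assumption-laden illustration rather than the procedure used in the study: it assumes that control reads carry no tumor signal, so that a mixture drawing a fraction f of its reads from a plasma sample with ctDNA level s has an expected ctDNA level of f × s. The functions `dilution_read_counts` and `dilute` are hypothetical.

```python
import random


def dilution_read_counts(source_ctdna, target_ctdna, total_reads):
    """Number of plasma and control reads needed to reach the target ctDNA level."""
    if not 0.0 < target_ctdna <= source_ctdna <= 1.0:
        raise ValueError("require 0 < target_ctdna <= source_ctdna <= 1")
    plasma_fraction = target_ctdna / source_ctdna
    n_plasma = round(total_reads * plasma_fraction)
    return n_plasma, total_reads - n_plasma


def dilute(plasma_reads, control_reads, source_ctdna, target_ctdna,
           total_reads, seed=0):
    """Mix randomly sub-sampled plasma and control reads into one diluted sample."""
    rng = random.Random(seed)
    n_plasma, n_control = dilution_read_counts(source_ctdna, target_ctdna, total_reads)
    mixed = rng.sample(plasma_reads, n_plasma) + rng.sample(control_reads, n_control)
    rng.shuffle(mixed)
    return mixed
```

For instance, diluting a plasma sample with 40% ctDNA down to 10% takes one quarter of the reads from plasma and three quarters from the matched control.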
2.9 ABEMUS parameters used in study experiments and data availability
ABEMUS parameters applied in study experiments are listed in Supplementary Tables S2 and S3. The reference error models of the platforms investigated in this study are available at http://github.com/cibiobcg/abemus_models.
3 Results
3.1 ABEMUS summary overview
ABEMUS is a tool specifically designed to detect somatic SNVs from cfDNA data and is implemented as a package in the R environment. The identification of somatic SNVs from a plasma sample is performed by ABEMUS using locus-specific and data-driven filters that are calculated exploiting pre-computed reference error models (Fig. 1). For each experimental platform, here intended as the combination of library preparation kit and sequencing machine/chemistry, reference error models that estimate both global and local sequencing error background are built by ABEMUS from a set of germline sample data generated with the same platform. Of note, ABEMUS provides pre-computed reference error models for several experimental platforms. When matched germline sample data are available for a plasma sample, additional filters can be used by ABEMUS to refine the identification of somatic SNVs by further considering private SNPs (e.g. singletons).
As a result, ABEMUS nominates a list of putative somatic SNVs in a format compatible with external tools providing also functional annotations [i.e. Oncotator (Ramos et al., 2015), SnpEff (Cingolani et al., 2012)] together with additional information like the locus strand bias and the genomic context, which altogether can be further used to rank or prioritize the identified SNVs.
3.2 Pbem is a sequencing platform-dependent feature
We tested the hypothesis that sequencing errors, quantified using pbem, depend on the experimental platform. To test this hypothesis, we collected a series of data of germline samples profiled using different platforms as reported in Supplementary Table S1. We first exploited the 113 germline samples from the 40 kb IonTorrent PGM sequencing series and assessed pbem for two disjoint subsets of samples (56 and 57) across all targeted genomic loci. The resulting distributions of pbem and coverage were comparable, and the pbem correlation (Pearson’s product-moment correlation, r = 0.72) indicated agreement between the two sets of base-level measures (Fig. 2A, S1 versus S2). Similarly, two subsets of the 36 Mb WES assay (Agilent HaloPlex Exome) sequencing series of 10 germline samples each (Fig. 2A, S3 versus S4) and two subsets of the 3.2 Mb Roche NimbleGen N250 targeted panel sequencing series (Fig. 2A, S5 versus S6) demonstrated comparable results. On the contrary, the same procedure applied to data generated by two platforms (Ion AmpliSeq Targeted Custom Amplicon Panel on IonTorrent PGM and Illumina True Seq Custom Amplicon on Illumina MiSeq) from the same set of normal samples resulted in non-correlated pbem series (r = −0.02) on the 7201 shared bp (Fig. 2B, S7 versus S8). The same result was obtained from WES data of 40 normal samples generated using two kits, the Roche NimbleGen SeqCap Exome v3 and the Agilent HaloPlex Exome (r = 0.03), with 31 Mb of shared positions. These experiments suggest that the background noise of sequencing experiments is locus and platform specific (Fig. 2C). Indeed, ∼50% of targeted positions show evidence of errors (pbem > 0) only when data are derived from one platform.
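The split-and-correlate comparison can be sketched in Python. The sketch assumes that the per-base error measure of a locus is the pooled fraction of reads supporting a non-reference base across the samples of a set; this definition is an assumption of the example, and the helper names are hypothetical.

```python
import math


def pbem(alt_counts, coverages):
    """Pooled fraction of non-reference reads at one locus (assumed definition)."""
    total_coverage = sum(coverages)
    return sum(alt_counts) / total_coverage if total_coverage else 0.0


def pbem_per_locus(samples):
    """samples: one list per sample of (alt_reads, coverage) tuples, one per locus."""
    n_loci = len(samples[0])
    return [
        pbem([s[i][0] for s in samples], [s[i][1] for s in samples])
        for i in range(n_loci)
    ]


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (spread_x * spread_y)
```

Computing `pbem_per_locus` on two disjoint subsets of samples from the same platform and passing the two series to `pearson` reproduces the kind of comparison reported for S1 versus S2; doing the same across two platforms on their shared positions corresponds to S7 versus S8.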
3.3 Stability and optimization of global sequencing error estimation GSE
To formally investigate the properties of ABEMUS GSE background estimates, we compared the coverage-based AF threshold measures computed on synthetic germline data (Synthetic Dataset #1) across different mean coverages (N = 4; 2000×, 1000×, 500× and 200×), target sizes (N = 4; 36, 17.7, 4.4 and 0.4 Mb) and detection specificities (0.99, 0.995, 0.999). Overall, although estimations of AF thresholds were relatively stable across different mean coverages and target sizes (Fig. 3A), especially for strict values of detection specificity, poorly populated coverage bins demonstrated sparse distributions (Fig. 3A and Supplementary Fig. S1). To correct for this bias, we implemented a refined procedure to identify the most suitable coverage-based AF threshold also in those bins that are problematic due to low cardinality. Briefly, assuming that the stability of a coverage bin is a function of its cardinality, we tested the stability of each coverage bin by performing a sub-sampling analysis on the bin having the closest but higher cardinality. Specifically, each coverage bin is first decomposed into two subsets containing positions with AFs > 0 and AFs = 0, respectively. Then, coverage bins are sorted by decreasing cardinality of the non-zero AF subset and, starting from the most populated bin of non-zero AFs (and sequentially for each ith coverage bin), k random samplings (with k set by default) of non-zero and zero AFs are performed from the corresponding subsets of the more populated bin. For each random sub-sample, the resulting set of AFs is used to estimate the AF threshold. The variability across the k estimated values is quantified using the coefficient of variation. For each ith coverage bin, if the coefficient of variation is below a cutoff (set by default), the cardinality of the coverage bin is considered reliable for the AF threshold estimation, hence the threshold is computed using the AFs of the bin itself. Otherwise, the threshold is set to that of bin j, where j < i is the last coverage bin whose coefficient of variation is below the cutoff. If no coverage bin has a coefficient of variation below the cutoff, then all thresholds are set to the coverage-independent AF threshold.
As shown in Figure 3B, the refined procedure resulted in highly stable AF thresholds across different combinations of mean coverage, target size and detection specificity.
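The last step of the refined procedure, in which unreliable bins inherit a threshold from a more populated bin, can be sketched in Python. The sketch takes per-bin thresholds and coefficients of variation as given, with bins already sorted by decreasing cardinality. The function names, the cutoff value in the usage note, and the choice of assigning the coverage-independent threshold to unreliable bins that precede the first reliable one are assumptions of this example, not details confirmed by the text.

```python
import statistics


def coefficient_of_variation(estimates):
    """Standard deviation over mean of the k sub-sampled threshold estimates."""
    spread = statistics.pstdev(estimates)
    if spread == 0:
        return 0.0  # identical estimates: no variability
    mean = statistics.fmean(estimates)
    return spread / mean if mean > 0 else float("inf")


def refine_thresholds(bin_thresholds, bin_cvs, max_cv, global_threshold):
    """Replace thresholds of unreliable bins.

    bin_thresholds and bin_cvs are ordered by decreasing bin cardinality.
    A bin is reliable when its coefficient of variation is below max_cv.
    """
    refined = []
    last_reliable = None
    for threshold, cv in zip(bin_thresholds, bin_cvs):
        if cv < max_cv:
            last_reliable = threshold  # reliable: keep the bin's own threshold
            refined.append(threshold)
        elif last_reliable is not None:
            refined.append(last_reliable)  # inherit from last reliable bin j < i
        else:
            refined.append(global_threshold)  # assumption: no reliable bin yet
    return refined
```

With thresholds `[0.01, 0.02, 0.05, 0.03]`, coefficients of variation `[0.05, 0.5, 0.02, 0.9]` and an illustrative cutoff of 0.1, the second and fourth bins inherit the thresholds of the first and third bins, respectively.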
3.4 Assessment of scaling factors to maximize ABEMUS performance
Synthetic Dataset #1 was used to identify the best scaling factor R to maximize ABEMUS precision and recall for combinations of coverage and target size at different ctDNA levels. We tested a wide range of values (N = 71, min = 0.5, max = 8, step 0.01) and evaluated the corresponding F1 scores. For each combination of target size, mean coverage and ctDNA level, we selected the minimum factor R among those for which the F1 score achieved by ABEMUS using that scaling factor meets a custom threshold. Analyses at a fixed threshold indicate that the wider the genomic target and the higher the mean coverage, the lower the optimal R required to reach a desired F1 score. Conversely, for the same combination of target size and coverage, lower admixtures require a greater R (Fig. 4 and Supplementary Fig. S2).
Since the ctDNA level information might not be available upfront for a plasma sample, we also defined the optimal scaling factor maximizing precision and recall across a set of admixtures based only on target size and coverage. Using a set of thresholds for the F1 score (N = 11, min = 0.9, max = 1, step 0.01), we selected the minimum R generating an F1 score greater than the highest observed threshold in the greatest number of ctDNA levels considered (N = 9).
Using these optimization results, ABEMUS enables the selection of the R factor that best fits the sample’s target size, mean coverage and when available ctDNA level; alternatively, the user can set a preferred scaling factor R or disable the scaling factor (R = 1).
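The per-scenario selection rule (smallest scaling factor whose F1 score meets a threshold) can be illustrated with a short Python sketch. The helper names and the F1 values in the usage note are made up for illustration, and the sketch assumes that "meets" means greater than or equal to the threshold.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, written in terms of call counts."""
    denominator = 2 * true_positives + false_positives + false_negatives
    return 2 * true_positives / denominator if denominator else 0.0


def min_scaling_factor(f1_by_factor, threshold):
    """Smallest scaling factor whose F1 score reaches the threshold, else None.

    f1_by_factor maps each tested scaling factor R to the F1 score obtained
    with it for one combination of target size, coverage and ctDNA level.
    """
    eligible = [factor for factor, f1 in f1_by_factor.items() if f1 >= threshold]
    return min(eligible) if eligible else None
```

For example, with hypothetical scores `{0.5: 0.60, 1.0: 0.92, 1.5: 0.95, 2.0: 0.90}` and a threshold of 0.9, the selected factor is 1.0, the smallest of the three eligible values.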
3.5 SNVs detection precision and recall on synthetic data
ABEMUS performances at different target sizes, mean coverages and ctDNA levels were assessed using Synthetic Dataset #2 and were compared to the performances of four tools commonly used in tumor tissue-based analysis: SomaticSniper (Larson et al., 2012), MuTect (Cibulskis et al., 2013) (run both in standard mode and with creation and usage of a panel of normals), VarScan2 (Koboldt et al., 2012) and Strelka2 (Kim et al., 2018). All tools were run following developers’ instructions reported on the relevant websites. As described previously, background sequencing error in Synthetic Dataset #2 was introduced using a per-base error model computed from real sequencing data and synthetic reads were generated using two different base quality models. ABEMUS was run using the optimized scaling factors resulting from the previous analysis. As shown in Figure 5 and Supplementary Table S4, at the lowest ctDNA level and lowest depth of coverage that we considered, ABEMUS is the only tool reaching an F1 score of 0.1 with a precision above 60%; all other tools demonstrated extremely low F1 scores, with Strelka2 being the only one with precision and recall above zero. With increasing depth of coverage, the performances of all tools increase, with ABEMUS always among the best performing tools for ctDNA levels ≥10% and outperforming all other tools for ctDNA levels <10%. Of note, performances reported in the literature (Narzisi et al., 2018) for the tools used in this comparison are in line with our results.
3.6 Comparison on in silico dilutions of real cfDNA samples
The performances of ABEMUS were further investigated using Synthetic Dataset #3, which contains synthetic dilutions we computed from real data generated at high coverage and for a small target (Carreira et al., 2014). ABEMUS was compared with Strelka2 and SomaticSniper (the tools that in the previous analysis achieved reasonable results in a scenario similar to the one described by Synthetic Dataset #3) and with SiNVICT, a tool designed for the ultra-sensitive detection of SNVs and InDels in cfDNA samples (Kockan et al., 2017). To measure the performances of the four tools, we used as reference the overall set S of SNVs reported in the original study (Carreira et al., 2014) that were manually reviewed and/or experimentally validated through ddPCR (i.e. SNVs in AR, TP53, FOXA1 and PTEN genes). We defined the positive predictive value (PPV), calculated as the number of SNVs in S that are detected over the total number of detected SNVs in AR, TP53, FOXA1 and PTEN genes; the true positive rate (TPR), calculated as the number of SNVs in S that are detected over the total number of SNVs in S; and the product TPR*PPV. PPV, TPR and TPR*PPV were computed considering the set of calls across the four genes of interest made by each tool across the whole set of 291 in silico diluted samples. Although the optimal ABEMUS scaling factor R for Synthetic Dataset #3 was 1.1 (for all synthetic samples), we also tested R values around the optimal value, specifically from 0.5 to 1.5. As shown in Figure 6A, SomaticSniper obtained the best results in terms of PPV for most ctDNA levels, but failed in terms of TPR and TPR*PPV, indicating very low sensitivity. SiNVICT, instead, obtained reasonable TPR but failed in terms of PPV and TPR*PPV, indicating a potentially high fraction of false positives among the detected somatic SNVs.
ABEMUS performed better than Strelka2 in terms of PPV for almost all the tested R values; with the optimal scaling factor R = 1.1, it demonstrated better PPV than Strelka2 at all ctDNA levels except for the lowest one, where the PPV values were equal. In terms of TPR, ABEMUS and Strelka2 showed similar performances, with better ABEMUS results at lower R values. ABEMUS was the best tool in terms of TPR*PPV for most scenarios and for the majority of R scaling factors; with the optimal scaling factor R = 1.1, it performed better than Strelka2 in all conditions except for a ctDNA level of 2.5%, where the two TPR*PPV values were equal. Overall, ABEMUS demonstrated the best performances in the majority of tested conditions, especially when the pre-computed optimal scaling factor R was applied.
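The three metrics used in this comparison follow directly from their definitions and can be computed as in the Python sketch below. The function name and the tuple encoding of an SNV are illustrative; the caller is assumed to have already restricted the detected calls to the four genes of interest.

```python
def ppv_tpr(detected, reference):
    """Return (PPV, TPR, TPR*PPV) for a set of calls against a reference set S.

    detected: SNVs called by a tool within the genes of interest.
    reference: the curated set S of reviewed or validated SNVs.
    """
    detected, reference = set(detected), set(reference)
    true_positives = len(detected & reference)
    ppv = true_positives / len(detected) if detected else 0.0
    tpr = true_positives / len(reference) if reference else 0.0
    return ppv, tpr, ppv * tpr
```

For example, a tool reporting four SNVs, three of which belong to a reference set of six, obtains PPV = 0.75, TPR = 0.5 and TPR*PPV = 0.375.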
3.7 Performances on real cfDNA sequencing data
We finally compared ABEMUS and Strelka2 on a set of serial plasma samples (Carreira et al., 2014). Performances of both tools were tested relying on detection of SNVs previously annotated in relevant studies (Abida et al., 2019; Robinson et al., 2015) or in COSMIC (Forbes et al., 2017); for COSMIC, only variants annotated as confirmed somatic variants and with primary site Prostate were considered. Scaling factors R optimized for mean coverage and target size were used. As shown in Figure 6B, we observed high concordance between ABEMUS and Strelka2, but ABEMUS was able to detect SNVs in positions at low AF where Strelka2 was not. Among the three calls made only by ABEMUS, two were also validated in the original study and present in other samples from the same patient. These two SNVs were identified in patient V4023: the first in the TP53 gene with AF 0.014 and protein change Cys135* in sample 11-244-B with estimated ctDNA of 13.1%, and the second in the FOXA1 gene with AF 0.016 and protein change Asp226His in sample 10-315-B with estimated ctDNA of 15.5%. The remaining SNV identified at low AF only by ABEMUS was found in another sample from the same patient V4012 by both tools, strongly supporting the validity of the ABEMUS private call.
Considering that optimized scaling factors resulted in R = 1.1 for all plasma serial samples, we also tested to what extent the knowledge of ctDNA level would have improved ABEMUS calls. Considering ctDNA levels reported in the original study (Carreira et al., 2014), ABEMUS was run again on all plasma samples with results that were overall concordant. ABEMUS was in this case able to identify in patient V4048 a further SNV with an AF concordant with another SNV captured by both tools in the same sample and in the same gene.
Overall, both ABEMUS and Strelka2 achieved good results, but ABEMUS demonstrated increased power in detecting low AF SNVs in plasma samples with low ctDNA levels. In addition, upfront knowledge of a sample’s ctDNA level could be used to further improve detection sensitivity.
4 Discussion
Different approaches have been proposed in the past years to characterize somatic mutations in cfDNA. While methods like optimized quantitative PCR (Taly et al., 2012) or dPCR are highly sensitive (Didelot et al., 2013; Yu et al., 2017), they are limited in the number of mutations that can be tested via multiplexing, while requiring up to 3 ng of input DNA. NGS approaches can instead be used to screen a large number of mutations with sensitivity that is limited by background noise and dependent on the sequencing depth. Although recent studies (Mouliere et al., 2018) suggested that fragment size selection might improve somatic SNV detection sensitivity, highly sub-clonal somatic SNVs due for instance to intra-patient heterogeneity or treatment resistance would remain extremely difficult to detect. In this challenging scenario, tools designed to detect low AF variants (Carrot-Zhang and Majewski, 2017) or computational pipelines specifically tailored for cfDNA data are necessary. So far, cfDNA-specific approaches were either tuned for amplicon-based NGS-targeted platforms (Kleftogiannis et al., 2019; Pécuchet et al., 2016) or only partially benchmarked against standard SNV methods across different scenarios of coverage depth and target size (Kockan et al., 2017), potentially limiting their widespread applicability.
Here, we presented a new NGS-based computational method named ABEMUS that uses control samples to build global and local sequencing error reference models that are used to improve the detection of SNVs in cfDNA samples.
We showed that local sequencing error, namely the per-base error measure, is platform specific and that hence platform-specific sequencing error reference models are needed to effectively discriminate between true SNVs and artifactual signals in the challenging cfDNA scenario. In this respect, ABEMUS provides an automatic approach to build platform-specific reference models from NGS control samples.
We showed that ABEMUS sequencing error reference models are stable across a broad range of depth of coverage and target size scenarios and we optimized, across the same scenarios, the precision and recall of the ABEMUS SNV detection engine.
ABEMUS performances were tested against tools commonly used to identify SNVs in tumor tissue samples and against tools specifically designed for cfDNA samples, using synthetic data, in silico dilutions of cancer patient cfDNA data and multi-sample cfDNA data from cancer patients. Overall, we showed that ABEMUS improves the detection of low AF SNVs in plasma samples with low ctDNA levels in scenarios spanning from whole-exome data (tens of Mb) to small targeted panel data (tens of kb). Of note, a limitation of the current version of ABEMUS is the absence of a module for the detection of InDels. ABEMUS is easy to use, can be applied to any custom or commercial platform or gene panel and can be integrated in any NGS processing and analysis pipeline.
Acknowledgements
We thank the members of the Caryl and Israel Englander Institute for Precision Medicine (WCM) for fruitful discussions and the LaBSSAH-CIBIO Next Generation Sequencing Facility of the University of Trento for input on the True Seq Custom Amplicon assay.
Funding
This work was supported by Fondazione Cassa di Risparmio Trento e Rovereto (CARITRO to F.D.); National Cancer Institute SPORE [P50-CA211024 to H.B. and F.D.]; Cancer Research UK [A13239 to G.A.]; and Prostate Cancer UK [PG12-49 to G.A. and F.D.].
Conflict of Interest: none declared.
References
Author notes
The authors wish it to be known that, in their opinion, Nicola Casiraghi and Francesco Orlando should be regarded as Joint First Authors.