GCparagon: evaluating and correcting GC biases in cell-free DNA at the fragment level

Abstract Analyses of cell-free DNA (cfDNA) are increasingly being employed for various diagnostic and research applications. Many technologies aim to increase resolution, e.g. for detecting early-stage cancer or minimal residual disease. However, these efforts may be confounded by inherent base composition biases of cfDNA, specifically the over- and underrepresentation of guanine (G) and cytosine (C) sequences. Currently, there is no universally applicable tool to correct these effects on sequencing read-level data. Here, we present GCparagon, a two-stage algorithm for computing and correcting GC biases in cfDNA samples. In the initial step, length and GC base count parameters are determined; in this step, our algorithm minimizes the inclusion of known problematic genomic regions, such as low-mappability regions, in its calculations. In the second step, GCparagon computes weights counterbalancing the distortion of cfDNA attributes (correction matrix). These fragment weights are added to a binary alignment map (BAM) file as alignment tags for individual reads. The GC correction matrix or the tagged BAM file can be used for downstream analyses. Parallel computing allows for GC bias estimation in under 1 min. We demonstrate that GCparagon vastly improves the analysis of regulatory regions, which frequently show specific GC composition patterns, and will contribute to standardized cfDNA applications.


Introduction
Detailed analyses of circulating cell-free DNA (cfDNA) are increasingly used to detect, diagnose and monitor an array of pathological and physiological processes (1-5). However, all of these analyses depend on an unbiased representation of cfDNA. A known confounding factor is fragment guanine and cytosine (GC) sequence content-based overrepresentation ('GC bias') in the analyte, which is frequently introduced during library preparation, particularly during polymerase chain reaction (PCR) amplification and binding-based purification or enrichment steps, resulting in an underrepresentation of regions that are either extraordinarily GC-rich or GC-poor (6). This GC bias may vary considerably between samples, and variation in GC bias has even been found between libraries generated from the same starting material (7). As, for example, PCR-free approaches to library preparation are currently impractical for cfDNA due to the low amounts of input DNA, GC bias represents an inevitable problem in cfDNA analyses.
Genome segmentation-based GC bias corrections have been successfully implemented for establishing copy number alterations from cfDNA (8-10). However, these approaches are not applicable to locus-based analyses, as local GC content may differ greatly from the genome-wide average distribution.
As an alternative solution, a fragment-based GC bias correction was previously introduced (7) and implemented in the deepTools suite (https://deeptools.readthedocs.io/en/develop/) (11) for chromatin or transcription factor (TF) immunoprecipitation data. As such data are experimentally fragmented (e.g. by physical or enzymatic treatment of genomic DNA from cells or tissues) with a narrow length distribution, the deepTools implementation considers only one representative (mean) fragment length. This disregards any length variability naturally present from the biological processes that contribute to cfDNA or that might be present in different cfDNA datasets.
A method termed Griffin adapted the Benjamini and Speed GC bias correction (7) within a computational framework for profiling nucleosome protection and accessibility from cfDNA (12). However, the Griffin algorithm excludes 18.6% of the reference genome from analysis based on the unique mappability track. Furthermore, the GC bias computation is hindered by the necessity of a separate precomputed GC frequency file specific to the reference genome build and the included regions. Moreover, the Griffin framework requires the user to edit multiple pipeline configuration files for each batch of samples and directly aggregates GC-corrected values. It does not offer the option to output modified alignments [i.e. a binary alignment map (BAM) file] containing weights in the form of read-level tags for correction, hindering integration into other analysis pipelines and the application of alternative read aggregation methods.
Hence, there is a lack of stand-alone GC bias correction tools that are especially suited to cfDNA and reliably preserve fragmentomics signals, such as the accessibility of regulatory open chromatin regions or fragment length shifts across genomic regions. We aimed to fill this gap and to develop a GC correction procedure with particular features: correction at the fragment level based on both GC content and fragment length, minimal exclusion of genomic regions, high performance, simplicity of use and seamless integration into bioinformatics pipelines.

Materials and methods
Our algorithm operates over predefined, reference genome-specific regions of equal size, which are automatically selected; a sophisticated procedure excludes regions known to be highly problematic for short-read alignment experiments (Figure 1A). GCparagon is then based on the principle of comparing the obtained (experimental) cfDNA data with the results expected from even sampling of the human reference genome sequence (Figure 1B).

Genomic region pre-selection
The Encyclopedia of DNA Elements exclusion list version 2 was used to define problematic regions (13). The following additional regions were excluded (using file concatenation, coordinate sorting and bedtools merge): N-masked gaps in the primary assembly of standard chromosomes (chr1-chr22, chrX, chrY); gaps in short chromosome arms, between contigs and between scaffolds; telomeres; centromeres; gaps caused by sequencing issues related to heterochromatic DNA; alternate (ALT) contigs; tandem repeats from the tandem repeat finder software (14); and low-mappability regions from the Global Alliance for Genomics & Health (15) (https://github.com/genome-in-a-bottle/genome-stratifications). The Y chromosome and regions within 1 Mb of any chromosome end were ignored during region pre-selection. A BED file of these merged, excluded regions is generated (Figure 1A).
The genome scaffold was read and split into intervals of 1 Mb. The genome split was repeated three times, each time shifting the chromosomal start position downstream by 250 kb. Each split was saved to a BED file. The resulting four BED files were intersected with the excluded regions as defined above, storing the number of overlapping bases for each genomic interval in tables. Intervals with >33% excluded bases (default; user-definable) were removed. The remaining intervals from the four splits were sorted in ascending order of their overlap with the exclusion list and selected following this sort order. Intervals overlapping any previously selected ones were ignored.
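The pre-selection described above can be sketched as follows. This is a hypothetical helper, not GCparagon's actual code; it assumes the per-interval exclusion-list overlaps have already been computed (e.g. with bedtools intersect) and are supplied as plain tuples.

```python
# Sketch of the genomic interval pre-selection: filter intervals by
# exclusion-list overlap, sort ascending by overlap, then greedily select
# non-overlapping intervals (hypothetical helper, not GCparagon's code).

def preselect_intervals(candidates, max_excluded_frac=0.33):
    """candidates: (chrom, start, end, excluded_bases) tuples drawn from
    four 1 Mb genome splits shifted by 0, 250, 500 and 750 kb."""
    # Drop intervals with too large an overlap with the exclusion list.
    kept = [c for c in candidates
            if c[3] / (c[2] - c[1]) <= max_excluded_frac]
    # Prefer intervals with the smallest exclusion overlap.
    kept.sort(key=lambda c: c[3])
    selected = []
    for chrom, start, end, excl in kept:
        # Skip intervals overlapping any previously selected one.
        if any(ch == chrom and start < e and end > s
               for ch, s, e, _ in selected):
            continue
        selected.append((chrom, start, end, excl))
    return selected
```

Because candidates from all four shifted splits compete in one sorted pool, a shifted interval with a smaller exclusion overlap displaces the overlapping unshifted one.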

Algorithm description
GCparagon comprises two phases (Figure 1B): the first calculates weights for GC bias correction and the second adds these weights to the BAM file as a tag. Both algorithm stages use multiprocessing and can be carried out independently. GCparagon uses aligned, paired-end short DNA sequencing reads in BAM format (conforming to the SAM format specification) as input. The input BAM file must be coordinate-sorted. For properly paired reads, the observed template length reported in the BAM file is used as the best available fragment length estimate; non-properly paired reads, mates with at least one extensively clipped alignment (>25% of the alignment), non-primary alignments and supplementary alignments are not considered. The input BAM file must have been created with alignment software that adheres to the SAM format specification. In particular, aligners such as bwa mem, which implement the observed template length metric as in the TLEN#1 definition of the SAM format specification (https://samtools.github.io/hts-specs/SAMv1.pdf), are supported by GCparagon. For each properly paired read, only the alignment with a positive observed template length is processed to avoid double counting of fragments.
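The alignment filters above can be condensed into a single predicate. The field names below are hypothetical; with pysam they would map to `read.is_proper_pair`, `read.template_length`, `read.is_secondary`, `read.is_supplementary` and clip lengths derived from the CIGAR string.

```python
# Minimal sketch of the described alignment filters (hypothetical signature,
# not GCparagon's actual code).

def use_alignment(is_proper_pair, template_length, is_secondary,
                  is_supplementary, aligned_length, clipped_bases,
                  max_clip_frac=0.25):
    """Return True if this alignment should be counted as one fragment."""
    if not is_proper_pair or is_secondary or is_supplementary:
        return False
    # Only the mate with a positive observed template length is processed,
    # so each fragment is counted exactly once.
    if template_length <= 0:
        return False
    # Reject extensively clipped alignments (> 25% of the alignment).
    if clipped_bases > max_clip_frac * aligned_length:
        return False
    return True
```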
GCparagon uses two features of each cfDNA fragment within a user-defined fragment length range, estimated fragment length and GC base count, to create a count matrix. The GC base count is inferred from the reference sequence corresponding to each fragment. The option to freely select the fragment length range makes GCparagon applicable to virtually all cfDNA applications. For example, users can include all Illumina-sequenceable cfDNA fragments by choosing an expected fragment length range of 20-550 bp; if only a specific range is analyzed, as in an application such as DELFI [DNA evaluation of fragments for early interception (16)], the analysis can be limited to fragments between 100 and 220 bp.
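Conceptually, the per-sample count matrix over the two fragment attributes can be sketched as follows (a minimal illustration with an assumed matrix layout, not GCparagon's internal representation):

```python
import numpy as np

# Count matrix over (fragment length, GC base count): rows index fragment
# length within the user-defined range, columns index GC base count.

def attribute_matrix(fragments, min_len=20, max_len=550):
    """fragments: iterable of (length, gc_base_count) pairs."""
    n_len = max_len - min_len + 1
    counts = np.zeros((n_len, max_len + 1), dtype=np.int64)
    for length, gc in fragments:
        # A fragment cannot contain more GC bases than its length.
        if min_len <= length <= max_len and 0 <= gc <= length:
            counts[length - min_len, gc] += 1
    return counts

# Example: two identical 167 bp fragments with 70 GC bases, one 150 bp
# fragment with 60 GC bases.
m = attribute_matrix([(167, 70), (167, 70), (150, 60)])
```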
Besides the fragment length range, we added further customization options for other experimental setups. To facilitate these, we defined four parameter settings, which we refer to as presets 0-3 (see Table 1). Preset 0 is the default, whereas presets 1-3 offer different settings for the target number of processed fragments, simulation rounds, minimum fragment attribute pair count, outlier detection, weight smoothing and smoothing strength (details are available at https://github.com/BGSpiegl/GCparagon). Different target numbers of processed fragments have an enormous impact on computation time, but surprisingly little effect on the quality of the GC correction (see the 'Results and discussion' section).
To calculate the output weights for GC bias correction, GCparagon uses the DNA intervals generated as described in Figure 1A. Python's multiprocessing library allows for the parallel processing of p intervals (large box with matrices; Figure 1B). After counting the fragment attributes observed in an interval of the BAM file, the bias computation step simulates the expected GC base count, assuming that each location within the interval may act as a fragment origin with equal probability. During this simulation, fragment sequences are repeatedly drawn for each genomic interval following the fragment length distribution observed in the same genomic interval. Two features, fragment length and GC base count, are recorded for each observed fragment passing filters and for each simulated fragment.
Drawing a sequence from the reference genome is repeated if any N bases are present in the simulated fragment sequence. If 55 sequences or one-third of the draws for any fragment length (whichever is higher) need to be redrawn, the affected genomic interval is discarded. Hence, rare fragment lengths not occurring at least 55 times in a genomic interval cannot cause a genomic interval to be dropped during simulation. The chosen number 55 represents a trade-off between allowing poorly pre-selected genomic intervals (i.e. containing many 'N' bases) to be automatically removed from the GC bias computation while not over-rejecting intervals with only a few Ns. The optimum of this parameter generally depends on the chosen size of pre-selected genomic intervals. We empirically established the upper threshold of 55 during testing of the algorithm and found it appropriate for the provided pre-selected genomic intervals and whole-genome sequencing (WGS) samples. The feature combination counts obtained by random sampling are stored separately for each simulation round. After reaching, for example, 5 million processed fragments for preset 1, or after exhausting the list of genomic intervals, the resulting matrices are combined across all processed regions by adding, for each simulation round n, the S_GC matrices of that round, and by adding all O_GC matrices. This yields one O_GC matrix and n S_GC matrices, one for each simulation round. GC bias correction weights are then computed as follows: for each simulation round n, the S_GC matrix is divided element-wise by the O_GC matrix. Feature combinations missing from either the O_GC or the S_GC matrix, or that occurred only rarely (e.g. once for preset 1), are masked out before division. Masked attribute combinations remain uncorrected with a default weight of 1.
Weight computation is done for all n simulation rounds separately. The element-wise average across weight matrices is computed afterward to reduce random effects, which could otherwise manifest as outliers. Such outlier weights result, for example, from randomly drawing ultra-rare attribute combinations, which may be overrepresented in the experimental data. However, averaging does not remove extremely high weights arising from experimentally highly underrepresented attribute combinations observed regularly from random draws.
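The masking, division and averaging steps can be sketched as follows (a simplified illustration under assumed thresholds; preset-specific behavior and post-processing are omitted):

```python
import numpy as np

# Sketch of steps 3-5: mask rare/absent attribute pairs, divide the expected
# (simulated) matrix by the observed matrix, and average across simulation
# rounds (hypothetical helper, not GCparagon's actual code).

def correction_weights(observed, simulated_rounds, min_count=2):
    """observed: 2D count matrix O_GC; simulated_rounds: list of S_GC
    matrices, one per simulation round. Returns the consolidated weights."""
    weights = []
    for simulated in simulated_rounds:
        # Mask attribute pairs that are absent or too rare in either matrix.
        mask = (observed >= min_count) & (simulated >= min_count)
        w = np.ones_like(observed, dtype=float)  # default weight of 1
        w[mask] = simulated[mask] / observed[mask]
        weights.append(w)
    # Element-wise averaging across rounds reduces random sampling effects.
    return np.mean(weights, axis=0)
```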
Thus, extremely high outlier weights require post-processing. The first post-processing step computes an upper weight threshold W_threshold based on Equation (1), where W_Q3 is the third quartile of non-default weights and IQR_W is their interquartile range, and caps extreme outliers at this threshold. The IQR stringency factor f_s can be user-defined via the outlier detection stringency command-line parameter S_od [Equation (2)], where the maximal stringency value of 7 results in an f_s of 3, capping the largest number of non-default weights at the computed threshold.
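The bodies of Equations (1) and (2) did not survive extraction. Based on the definitions above (third quartile plus a stringency-scaled interquartile range), Equation (1) plausibly takes the standard Tukey-fence form; this is a hedged reconstruction under that assumption, not taken from the published article:

```latex
W_{\mathrm{threshold}} = W_{Q3} + f_s \cdot \mathrm{IQR}_W \qquad (1)
```

The exact mapping of S_od to f_s [Equation (2)] is not reconstructed here; the text fixes only one point of it (S_od = 7 yields f_s = 3).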
In the second post-processing step, smoothing is applied to the weight matrix by convolution with a two-dimensional Gaussian kernel. This mitigates random effects, e.g. from stopping the bias computation early, as for preset 1 computations, and effects that might arise from alignment artifacts or fragment length estimation inaccuracies via the observed template length in the presence of indels. Weight matrices from each post-processing step are saved to text files (correction matrix) and can be used for subsequent alignment tagging. Visualizations of the weight matrices, fragment attribute combination counts, the weight computation mask and the observed fragment length distribution can be created automatically.
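The smoothing step amounts to a separable 2D Gaussian convolution over the weight matrix. A numpy-only sketch (kernel radius and edge handling are assumptions; GCparagon instead exposes a smoothing-strength setting):

```python
import numpy as np

def gaussian_kernel1d(sigma, radius=None):
    """Normalized 1D Gaussian kernel sampled at integer offsets."""
    if radius is None:
        radius = int(4 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def smooth_weights(weight_matrix, sigma=1.0):
    """Smooth a 2D weight matrix by separable Gaussian convolution."""
    k = gaussian_kernel1d(sigma)
    pad = len(k) // 2
    # Edge-pad so border weights are not pulled toward zero.
    padded = np.pad(weight_matrix, pad, mode="edge")
    # Separable convolution: rows first, then columns.
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda c: np.convolve(c, k, mode="valid"), 0, rows)

W = np.ones((10, 10))
W[5, 5] = 8.0  # an isolated outlier weight
W_smooth = smooth_weights(W, sigma=1.0)
```

An isolated outlier weight is damped toward its neighborhood, while a flat weight surface passes through unchanged.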

Benchmarking and validation
The computation time of 683 samples processed on a SLURM (Simple Linux Utility for Resource Management) high-performance computing (HPC) cluster was extracted from log file time stamps. The expected fragment GC content distribution for each sample was computed by drawing 500 million fragments equally distributed over all chromosomes, excluding Y, following the sample's observed fragment length distribution.
The total number of DNA fragments overrepresented by a distorting sample processing step, like PCR, must be identical to the number of fragments underrepresented by the same process, assuming that fragment abundance in the original population is not limiting. According to this assumption, we computed the fidelity of our GC bias correction by comparing the total fragment count in the O_GC matrix to the sum of all weights in the computed W_GC matrix. We also corrected GC bias in samples B01, H01, C01 and P01 using the Griffin algorithm and compared the results to ours (Supplementary Tables S1-S3).

Coverage analysis of regulatory regions
Depth of coverage (DoC) signals for transcription start site (TSS) and transcription factor binding site (TFBS) loci were determined as described previously (17). Unmapped reads, alignments with mapping quality zero and supplementary alignments were not considered. The 10 000 most reported TFBSs were extracted from the Gene Transcription Regulation Database (GTRD) version 21.12 (18). TSS loci were oriented left to right in 5′→3′ direction according to the gene's strand. Fragment sizes between 110 and 210 bp were included for TSS and TFBS plots. For visualization, coverages were normalized by dividing the coverage of each locus by the average coverage across two 1.5 kb regions, each starting 1.5 kb up- or downstream of the site of interest. Of these normalized coverage values, the base-wise mean across regions of the values between the 10th and the 90th percentiles was taken. An adapted version of pysam, available at https://github.com/Faruk-K/pysam, was used to directly extract GC correction weights from alignment tags to compute the DoC over each fragment's central 60 bp.
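The normalization and trimmed averaging can be sketched as follows (window size, flank placement and the trimming convention are assumptions based on the description above):

```python
import numpy as np

# Sketch of the locus-wise coverage normalization: divide each locus by its
# mean coverage over two 1.5 kb flanks starting 1.5 kb up- and downstream
# of the site, then take the base-wise mean across loci of values between
# the 10th and 90th percentiles (hypothetical helper).

def normalized_profile(coverage, flank_start=1500, flank_len=1500):
    """coverage: (n_loci, window) array centered on the site of interest."""
    center = coverage.shape[1] // 2
    left = coverage[:, center - flank_start - flank_len:center - flank_start]
    right = coverage[:, center + flank_start:center + flank_start + flank_len]
    baseline = np.concatenate([left, right], axis=1).mean(axis=1,
                                                          keepdims=True)
    norm = coverage / baseline
    # Base-wise trimmed mean across loci (values between P10 and P90).
    lo, hi = np.percentile(norm, [10, 90], axis=0)
    trimmed = np.where((norm >= lo) & (norm <= hi), norm, np.nan)
    return np.nanmean(trimmed, axis=0)

# Toy example: three loci with identical shape but different sequencing
# depth; normalization removes the depth differences.
cov = np.ones((3, 8000)) * np.array([[2.0], [4.0], [8.0]])
prof = normalized_profile(cov)
```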

Reference genome
The GRCh38 human reference genome provided as hg38 by UCSC (14) (without ALT scaffolds) was used for all presented results.

Samples used to demonstrate GCparagon's capabilities
GCparagon was tested on 683 cfDNA samples collected from our liquid biopsy studies. Each study was approved by the Ethics Committee of the Medical University of Graz [approval numbers: 29-272 ex16/17 (healthy individuals), 21-228 ex09/10 (prostate cancer), 21-227 ex09/10 (breast cancer) and 21-229 ex13/14 (colon cancer)] and was conducted according to the Declaration of Helsinki. Written informed consent was obtained in each case. We highlight results from four samples [healthy donor (H01) and patients with breast (B01), colon (C01) and prostate (P01) cancer] in our figures.

Genomic coverage and region pre-selection
Careful region pre-selection is vital to make GC correction broadly applicable. GCparagon uses predefined genomic bins and excludes them only if a user-defined threshold of excluded bases is reached (Figure 1A). Our default exclusion list contains a total of 361.7 Mb, corresponding to 11.9% of the GRCh38 reference genome (excluding Y chromosomal and mitochondrial sequences); hence, 88.1% of the human reference genome (2.67 Gb) remains unmasked. To the best of our knowledge, GCparagon's default exclusion list keeps the largest proportion of the human genome in its analyses compared to other published procedures (12,24). For example, the Griffin procedure excludes 18.6% instead of our 11.9% (12).

Effective GC correction and validation of GCparagon
We applied GCparagon to several hundred cfDNA samples (n = 683) for benchmarking and used four exemplary cfDNA samples (H01, B01, P01, C01) with various GC biases to illustrate the validity of our results (Figure 2A and B). As outlined above, we first analyze the observed GC content (O_GC) of a sample and then compute the theoretical GC content (S_GC) by randomly sampling fragments with the same length distribution from the reference genome. GCparagon is used to rebalance the distorted GC content; successful GC correction should be reflected in a high similarity of the corrected fragment GC content distribution to the genome-wide simulated fragment GC content distribution, comparable to Griffin algorithm results (Figure 2A).
As expected, the impact of the GCparagon algorithm on sequencing data depends greatly on the GC bias present. For example, the average GC content of fragment sequences of the healthy control sample H01 was 41.4%, only 1.1% above the randomly sampled GC content of 40.4% (Figure 2A). Accordingly, the algorithm's output visually confirmed the high agreement between O_GC and S_GC values (Figure 2B).
In contrast, cfDNA sample B01 had a GC content of 38.2%, below the human genome average, reflected in a left shift of the observed fragment distribution toward lower GC contents resulting from an overrepresentation of low-GC fragments and an underrepresentation of high-GC fragments (Figure 2A). In such a case, the correction must rebalance the misrepresented fragments in a way that leads to an increased overall GC content, which GCparagon reflects and visualizes in its correction matrix (Figure 2C). The corrected GC content profile overlaps closely with the expected profile (Figure 2A). Conversely, cfDNA sample P01 had a GC content of 45.4%, above the human genome average, visible as a right shift in the original fragment GC content plot (Figure 2A). In such a case, GCparagon decreases the overall GC content (Figure 2C).
Comparing the P01 correction weight matrices from GCparagon and Griffin revealed a pronounced artifact for Griffin caused by extensively soft-clipped read alignments in the low GC content range (Supplementary Figure S3).
Furthermore, we extensively tested presets 1-3 and deemed them appropriate for WGS samples with a DoC between 0.5× and 60×. Comparing the total fragment count in the O_GC matrix to the sum of all corrected weights for the four samples results in an average deviation of +0.58%, −0.04% and −0.09% for presets 1, 2 and 3, respectively. In contrast, Griffin fragment pool size changes were much more extensive (reduction of up to 45% for B01; see Supplementary Table S3).
Overall, GCparagon achieved a close approximation between corrected and simulated GC content in all tested cases.

Potential use of the weight matrix
A GC bias-minimized DoC signal across genomic loci can be obtained using GC tags of fragments instead of raw fragment counts, increasing the comparability between loci and samples of similar average DoC. Alternatively, fragment weights can be used directly from the correction matrix text file in combination with alignments from the input BAM file in any custom code, obviating the need to write a new BAM file. However, in our experience, repeatedly associating alignments with GC bias correction weights is slower than writing the tagged BAM file once and using the tagged alignments for further analysis. Moreover, using two linked files instead of one standard input file increases the risk of user errors.
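Computing a weighted DoC signal from tagged fragments can be sketched as follows. The interval handling is illustrative; with a tagged BAM, each fragment's weight would be read from its alignment tag via pysam rather than passed in as a tuple.

```python
import numpy as np

# Sketch of a GC bias-minimized DoC signal: per-base sums of fragment
# correction weights replace raw fragment counts (hypothetical helper).

def weighted_coverage(fragments, region_start, region_end):
    """fragments: iterable of (start, end, weight) in reference coordinates,
    end-exclusive. Returns per-base corrected coverage over
    [region_start, region_end)."""
    doc = np.zeros(region_end - region_start)
    for start, end, weight in fragments:
        # Clip the fragment to the region of interest.
        lo = max(start, region_start) - region_start
        hi = min(end, region_end) - region_start
        if hi > lo:
            doc[lo:hi] += weight  # weight instead of +1 per fragment
    return doc
```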

Time complexity and memory footprint
The GC bias computation using preset 1 with a target fragment number of 5 × 10^6 is very fast. All 683 tested samples finished in under 3 min on an 11-node SLURM HPC cluster, with 58.3% of samples taking 70 s or less (Supplementary Figure S1A). Computation time is linear in the number of processed fragments (Supplementary Figure S1B). Memory consumption is linear in the number of cores used for parallel processing and the number of sampling rounds per genomic interval. Memory usage was around 2.5 GiB using 12 cores (∼200 MB per process) and independent of the number of processed fragments (Supplementary Figure S2A). A maximum consumption of 3.8 GiB was observed for P01 at preset 2, with the highest memory consumption occurring during BAM tagging (Supplementary Figure S2B). Compared to Griffin, GCparagon correction was 62-144 times faster (endpoint: correction table output) and 19-31 times faster if the output of the tagged BAM file was used for comparison (see Supplementary Table S1). The memory footprint was up to 2.4 times higher for GCparagon than for Griffin but still below 4 GiB (see Supplementary Table S2).

GC bias and regulatory regions
Overlays of multiple regulatory open chromatin regions, such as TSSs and TFBSs, have been reported to have non-uniform GC content between the binding site and flanking regions (25) due to shared sequence motifs and other shared sequence features of the stacked regions. Therefore, we examined the GC content of the human reference genome at TSSs (Figure 3A) and TFBSs (Figure 3C). We used the TSSs of 1228 housekeeping (HK) genes (26) and 1172 genes that are unexpressed according to the Protein Atlas (www.proteinatlas.org) [Protein Atlas unexpressed (PAU) genes] (27). We observed that the GC content started to increase already 250 bp upstream of the TSS of PAU genes, and even further upstream for HK genes (Figure 3A). This GC content increase over exon 1 is more pronounced for HK genes than for PAU genes. Downstream of the TSS, the GC content decreases after 250 bases for HK genes, while the GC content of PAU genes stays at a moderately increased level until ∼750 bases (Figure 3A). Generally, the GC content of TSS sequences of HK genes is much higher than that of PAU genes, in line with their well-described association with CpG (5′-C-phosphate-G-3′) islands. Further, when we analyzed the TFBSs of GRHL2, a pioneer TF for epithelial cells (28), and the hematopoietic TF LYL1 (29), we observed that both TFBSs contain GC-rich binding motifs, which result in an increased GC content close to the binding sites (Figure 3C). Hence, non-uniform GC content at regulatory regions is an inherent characteristic of cfDNA analyses, which will cause signal distortions when combined with GC biases in the analyzed samples.
In cfDNA analyses, TSS regions of HK genes have a characteristic coverage pattern, with low relative coverage at the TSS position and oscillating patterns arising from phased nucleosomes mostly downstream of the TSS (Figure 3B). In contrast, TSSs of the repressed PAU genes show less variable coverage patterns, reflecting the denser nucleosome packaging and the resulting stronger cfDNA protection (17,30). For samples without appreciable GC bias, such as H01, applying GCparagon has only a minor effect on the coverage profile (Figure 3B). However, cfDNA samples with positive GC biases display increased coverages downstream of the TSS, particularly in exon 1 regions (P01 in Figure 3B). A reverse pattern, i.e. decreased coverage, can be observed in sample B01 with negative GC bias (Figure 3B). For PAU genes, a similar but much weaker distortion downstream of the TSS is observed, following the reference GC content. Again, these distortions are effectively corrected by GCparagon (Figure 3B).
Accessible TFBSs show decreased coverage at their center flanked by oscillating patterns in cfDNA analyses, indicating phased positioning of nucleosomes around the TFBS (31). However, increased GC sequence content close to the binding sites (Figure 3C) results in increased coverage between 250 bp upstream and downstream in samples with positive GC bias (P01 in Figure 3D) or decreased coverage in samples with negative GC bias (B01 in Figure 3D). After GC bias correction, the profiles show a regular oscillating pattern (Figure 3D).

GCparagon enables downstream analyses
Assessment of TSS and TFBS coverage plots is evolving into a new tool to draw conclusions about TF accessibility from cfDNA and hence infer essential biological information from the respective samples (12,31,32). Interestingly, we found that genomic features exhibiting a weak accessibility signal, like TSSs of inactive genes or moderately bound TFBSs, were particularly prone to severe coverage distortions interfering with their effective detection and impacting their detection limits. Such distortions may even lead to false positive signals in regions of otherwise nominal coverage, as was observed for the LYL1 TFBSs of P01 (Figure 3D). However, stronger accessibility signals may also show substantial distortions in some cases. Examples of such signals, which correspond to deep relative coverage drops, are the difference between the central dip and downstream peak for B01 HK genes (Figure 3B) and the difference between the LYL1/GRHL2 central dip and surrounding coverage peaks for B01 and P01, respectively (Figure 3D).
In conclusion, the distortions seen for various loci generally depend on the sample's overall GC bias, the loci's average GC content, the specific sequence composition of overlaid loci and the locally observed fragment lengths. Our results demonstrate the importance of GC bias corrections for such applications.

Figure 1 .
Figure 1. GCparagon: genomic region pre-selection and outline of the algorithm. (A) Genomic region pre-selection: Individual BED (browser extensible data) files of exclusion-masked regions (array of stacked fields next to the second 'INPUT' mark) are combined into a single merged BED file containing non-overlapping, non-adjacent regions. Genomic scaffolds are read from a genome TSV (tab-separated values) file (left column, first 'INPUT' mark) and split into intervals of 1 Mb. Splitting is repeated three times, with an increasing offset of 250 kb. These four files are intersected with the exclusion-masked regions, generating four tables as output. Intervals are sorted in ascending order of overlap percentage. Top bins are selected up to a specified threshold on the exclusion overlap. Genomic intervals overlapping any previously selected ones are skipped. Finally, a file of predefined 1 Mb intervals is generated, which is used by GCparagon. Figure created with BioRender (biorender.com). (B) The two main phases of the GC bias correction by GCparagon. In phase 1 (upper section), the GC bias is computed in the form of correction weights, which are the reciprocals of the corresponding bias values. This is done in four or five steps (big upper box including the box containing heatmaps; vertical block arrow 'phase 1' on the left and block arrows 'step 1' to 'step 5'). In the second phase (bottom section), a weight-tagged BAM file is created from the input using the computed correction weights (bottom right box; bottom left vertical block arrow 'phase 2' and block arrow 'step 6'). The BAM input is processed in pre-selected genomic intervals using parallel processing in step 1 (box 'count attributes') and step 2 (box 'simulate attributes'). In step 1, the fragment attribute pairs 'observed template length' and 'GC base count' are recorded in the observed attribute matrix (heatmap 'observed fragment attributes') for each fragment passing alignment filters from the input file. Simulated attribute pair counts (heatmap 'expected fragment attributes') are obtained in the second step by sampling the reference genome sequence uniformly using the fragment length distribution observed in the same genomic interval. Simulation is repeated n times to create n expected fragment attribute matrices. A binary mask (plot 'attribute pair mask') is created in step 3 (box 'mask attribute pairs') based on a minimum attribute pair count threshold applied to both the observed and expected attribute matrices. In step 4 (box 'compute correction weights'), the binary mask is used to exclude both rare and absent attribute pairs from the computation of the n correction weight matrices (rightmost heatmap 'attribute pair weights'). The latter are computed by element-wise division of one of the n expected fragment attribute matrices by the observed attribute matrix. The n correction weight matrices are consolidated by element-wise averaging across matrices. The resulting consolidated weight matrix can be post-processed in an optional fifth step (box 'optional: post-processing') to reduce the effect of outliers and noise on the correction (boxes with dashed edges). In the second phase, a weight-tagged version of the input BAM file is generated for downstream analysis. The (post-processed) consolidated weight matrix is used for tagging. Figure created with BioRender (biorender.com).
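The matrix operations of steps 3 to 5 can be sketched in NumPy as follows. This is an illustrative sketch, not GCparagon's actual implementation; the function name, the `min_count` threshold value and the default weight of 1 for masked-out attribute pairs are assumptions based on the description above.

```python
import numpy as np

def compute_correction_weights(observed, expected_list, min_count=2):
    """Sketch of phase 1, steps 3-5: mask rare attribute pairs,
    divide each expected matrix element-wise by the observed matrix,
    then consolidate by element-wise averaging.

    observed      -- 2-D count matrix (fragment length x GC base count)
    expected_list -- list of n simulated count matrices, same shape
    min_count     -- minimum attribute-pair count threshold (assumed)
    """
    observed = np.asarray(observed, dtype=float)
    # step 3: binary mask -- keep only attribute pairs sufficiently
    # frequent in the observed and in every expected matrix
    mask = observed >= min_count
    for expected in expected_list:
        mask &= np.asarray(expected) >= min_count
    # step 4: one weight matrix per simulation, expected / observed;
    # masked-out pairs keep the neutral value 1 via out=/where=
    stack = [np.divide(np.asarray(e, dtype=float), observed,
                       out=np.ones_like(observed), where=mask)
             for e in expected_list]
    # step 5: consolidate by element-wise averaging; default weight 1
    # for rare or absent attribute pairs
    weights = np.ones_like(observed)
    weights[mask] = np.mean(stack, axis=0)[mask]
    return weights
```

Because the weight is the reciprocal of the bias (expected over observed), underrepresented attribute pairs receive weights above 1 and overrepresented pairs weights below 1.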

Figure 2. Validation of GCparagon's GC correction performance. (A) Density plot of per-fragment GC content across cfDNA samples B01 (patient with breast cancer), H01 (healthy individual), C01 (colon cancer) and P01 (prostate cancer). For each sample, three lines are displayed. Dashed lines represent the original GC content for the entire input BAM file. Theoretical GC content, as established by simulating fragments across the whole genome using the observed fragment length distribution, is shown as solid, thick dark lines. GC content after correction with GCparagon is displayed as solid lines (medium thickness, lighter color), and as dashed lines (medium thickness, lighter color) after correction with the Griffin algorithm. (B) Count matrices for observed (O_GC) and randomly sampled (S_GC) cfDNA fragments of a healthy individual (H01) with an uncorrected average GC content of 41.4%, computed with parameter preset 1. The y-axis displays the cfDNA fragment length, and the x-axis displays the GC base count. (C) Correction weight matrix of the GC correction: The y-axis displays the cfDNA fragment length (minimum set to 20 bp, maximum set to 550 bp). The correction value equals the reciprocal of the bias value. The visualization zooms in on non-default weights. The x-axis indicates the GC base count of a fragment. Impossible GC base counts receive a default value of zero and manifest as dark triangles in the upper right part of the heatmap. A color scale indicates the weights for correcting the fragment length and GC base count-specific GC bias. Red indicates a high correction weight for highly underrepresented fragments (e.g. right side of B01 and left side of P01, non-default weights in the fragment length range 120 bp to 210 bp). Default weights of 1 for extremely rare fragment attribute combinations are uniformly colored (lower left part of the heatmap). The left panel displays the weight matrix of sample B01 with an average fragment GC content of 38.2%. As the sample's GC content is below the expected 40.4% for humans, the overall GC content needs to be increased through correction, which is visible as turquoise to red regions for high GC content fragments (right part of non-default weights) and darker blue regions in the weight matrix for low GC content fragments (left part of non-default weights). The middle panel shows the H01 sample, which has a low GC bias of approximately +1%. Accordingly, the colors in this correction matrix indicate values closer to 1, and the upper end of the color scale is the lowest among the shown samples. The right panel illustrates the correction weights of P01 with an average GC content of 45.4%. In this case, the GC content needs to be decreased, which is visible as the inverse color pattern compared to B01. Weights were computed using preset 3.
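The per-fragment lookup that turns such a weight matrix into alignment tags in phase 2 can be sketched as follows. The helper below is hypothetical (GCparagon writes the weight into a BAM alignment tag; here only the (fragment length, GC base count) → weight lookup is illustrated, with an assumed default weight of 1 for out-of-range attribute pairs).

```python
import numpy as np

def fragment_weights(fragments, weight_matrix, default=1.0):
    """Hypothetical phase-2 lookup: map each fragment's
    (template length, GC base count) pair to its correction weight.

    fragments     -- iterable of (length, gc_base_count) pairs
    weight_matrix -- 2-D array indexed [length, gc_base_count]
    default       -- weight for pairs outside the matrix (assumed 1)
    """
    max_len, max_gc = weight_matrix.shape
    weights = []
    for length, gc_count in fragments:
        if 0 <= length < max_len and 0 <= gc_count < max_gc:
            weights.append(weight_matrix[length, gc_count])
        else:
            weights.append(default)  # e.g. fragments longer than 550 bp
    return np.array(weights)
```

Downstream analyses can then sum these weights instead of raw fragment counts, which is what produces the corrected GC-content densities shown in panel (A).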

Figure 3. The effect of GC bias on coverage signals of regulatory regions. (A) GC content of the GRCh38 reference genome averaged across TSSs of HK genes (orange line; top) and PAU genes (gray line; bottom). GC content is shown twice for each gene group: unfiltered (transparent) and smoothed using a 15 bp Hamming window (opaque). A sharp GC decline upstream of the TSS (promoter region) indicates the location of the TATA box. (B) Original and corrected TSS coverage plots for HK genes (top three panels) and PAU genes (bottom three panels), each for samples B01 (cfDNA from a patient with breast cancer), H01 (healthy control) and P01 (prostate cancer). Coverage is computed using the fragment's central 61 bp, or fewer if the fragment is shorter. The reference genome's GC content plots for HK genes and PAU genes from panel (A) are included for comparison. The original coverage is shown with a thin, colored line, and the corrected coverage with a bold, colored line. The coverage after correction with the Griffin algorithm is shown as a solid black line. (C) GC content of the GRCh38 reference genome averaged across the top 10 000 TF binding sites reported in the GTRD for two TFs, GRHL2 (green line; bottom) and LYL1 (purple line; top), respectively. GC content is shown twice for each TF: unfiltered (transparent) and smoothed using a 15 bp Hamming window (opaque). Both GRHL2 and LYL1 TFBSs show highly increased central GC content. (D) Original and corrected TFBS coverage plots (coverage computed using the central 60 bp of fragments if 60 bp or larger) for LYL1 (top three panels) and GRHL2 (bottom three panels) from the same cfDNA samples as in panel (B). The genome reference GC content plots for the TFBSs from panel (C) are included for comparison. The coverage after correction with the Griffin algorithm is shown as a solid black line.
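The weighted central-coverage computation described for panels (B) and (D), where each fragment contributes its correction weight over its central 61 bp (or the whole fragment if shorter), can be sketched as follows. This is illustrative only; the function and parameter names are assumptions, and in practice the fragment coordinates and weights would come from the weight-tagged BAM file.

```python
import numpy as np

def central_coverage(fragments, weights, region_start, region_length, core=61):
    """Sketch of a weighted coverage profile around a TSS/TFBS.

    fragments     -- iterable of (start, end) genomic coordinates
    weights       -- per-fragment GC correction weights
    region_start  -- genomic start of the plotted window
    region_length -- window size in bp
    core          -- central portion of each fragment that counts
    """
    cov = np.zeros(region_length)
    for (start, end), w in zip(fragments, weights):
        length = end - start
        span = min(core, length)                 # central core bp, or whole fragment
        core_start = start + (length - span) // 2
        lo = max(core_start, region_start)       # clip to the plotted window
        hi = min(core_start + span, region_start + region_length)
        if hi > lo:
            cov[lo - region_start:hi - region_start] += w
    return cov
```

Summing correction weights rather than raw fragment counts is what removes the spurious coverage dips and peaks at GC-rich or GC-poor regulatory elements.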