Characterization of FFPE-induced bacterial DNA damage and development of a repair method

Abstract Formalin-fixed, paraffin-embedded (FFPE) specimens have huge potential as source material in the field of human microbiome research. However, the effects of FFPE processing on bacterial DNA remain uncharacterized. Any effects are relevant for microbiome studies, where DNA template is often minimal and sequences studied are not limited to one genome. As such, we aimed to both characterize this FFPE-induced bacterial DNA damage and develop strategies to reduce and repair this damage. Our analyses indicate that bacterial FFPE DNA is highly fragmented, a poor template for PCR, crosslinked and bears sequence artefacts derived predominantly from oxidative DNA damage. Two strategies to reduce this damage were devised – an optimized decrosslinking procedure reducing sequence artefacts generated by high-temperature incubation, and secondly, an in vitro reconstitution of the base excision repair pathway. As evidenced by whole genome sequencing, treatment with these strategies significantly increased fragment length, reduced the appearance of sequence artefacts and improved the sequencing readability of bacterial and mammalian FFPE DNA. This study provides a new understanding of the condition of bacterial DNA in FFPE specimens and how this impacts downstream analyses, in addition to a strategy to improve the sequencing quality of bacterial and possibly mammalian FFPE DNA.

Bacterial DNA is likely to be similarly damaged, but this is uncharacterized to date. The consequence of such bacterial DNA damage is that FFPE samples will have several associated limitations that must be considered before their effective use in microbiome studies. DNA fragmentation reduces the quantity of DNA fragments within a sample of suitable length for amplicon-based sequencing strategies such as 16S rRNA gene sequencing (~460 bp for V3-V4 [29]). This can exacerbate the characteristic low bacterial biomass found in FFPE samples. FFPE-induced sequence alterations can decrease sequence quality and lead to false speciation events. These are considerable hurdles standing in the way of accurate, reproducible microbiome research from FFPE samples. All research reported to date, and protocols for purifying and repairing FFPE DNA, relate to mammalian (human) DNA. Differences in DNA conformation and packaging, methylation patterns, and replication and transcription rates between human and bacteria may lead to different FFPE damage profiles [30-32]. A better understanding of potential differences is essential for the proper design of workflows that ensure bacterial DNA quality and guarantee reliable and reproducible sequencing analysis [33]. No characterization of FFPE-induced bacterial DNA damage exists to date.
The interaction between formaldehyde and biomolecules results in the formation of crosslinks (methylene bridges). These are ubiquitous in FFPE specimens, where they are most frequently found as DNA-protein crosslinks (DPC) between dG and the amino acids lysine and cysteine [34,35]. These crosslinks strain the DNA structure, promoting apurinic (AP) sites and ss-breaks [16,36] and inhibiting polymerase chain reaction (PCR) [37]. Fortunately, crosslinks are reversible: the intermediary products (Schiff bases) are reversed by hydration [35] and methylene bridges are reversed with heat treatment [13]. As such, all available protocols for FFPE DNA purification incorporate a heat treatment decrosslinking step, typically a 1 h incubation at 90 °C [14]. However, recent studies have shown that this high incubation temperature increases the frequency of ss-breaks and chimeric sequences. A lower incubation temperature reduces the appearance of these sequence artefacts, but with the caveat of reducing the total yield of decrosslinked DNA [20,21]. This suggests that there is potential for an optimized decrosslinking strategy allowing for reduced incubation temperature without diminishing the yield of decrosslinked DNA.
Assuming the existence of sequence artefacts (AP sites, damaged bases, ss-breaks), the base excision repair (BER) pathway, the main cellular pathway for repairing such lesions in metabolically active cells, represents a promising route for their repair [38,39]. Indeed, improvement of the sequencing quality of FFPE specimens has been attempted using DNA glycosylases (uracil-DNA glycosylase) [40]. In addition, commercial kits for some degree of FFPE DNA repair have recently become available ('NEB FFPE DNA Repair' and 'Illumina Infinium FFPE Repair'); however, their composition is proprietary and undisclosed. Despite such advances, there is a gap in the literature characterizing DNA damage recognition by DNA glycosylases on FFPE specimens, which is essential for designing approaches to reconstitute the BER pathway to repair FFPE DNA damage [41]. The BER pathway can be summarized in five steps: (i) base excision by a DNA glycosylase, followed by (ii) backbone excision by an AP lyase or AP endonuclease, (iii) end processing by a polynucleotide kinase (PNK) or exonuclease, (iv) gap filling by a polymerase and (v) nick ligation by a ligase [38,39]. The type of DNA glycosylase determines the downstream repair workflow. Monofunctional DNA glycosylases (e.g. UDG) yield an AP site that is excised by an AP endonuclease (e.g. Endo IV), generating a nick with a deoxyribose phosphate (dRP) residue at the 5′ terminus and a clean 3′-OH terminus. This 5′ dRP residue is removed by the 3′-5′ exonuclease activity of DNA polymerase (long-patch BER). In contrast, bifunctional glycosylases (i.e. those repairing oxidative damage) can cleave the backbone by β- or β,δ-elimination. The product of β-elimination is a phospho-α,β-unsaturated aldehyde that can be removed by an AP endonuclease and repaired as for monofunctional glycosylases. The product of β,δ-elimination is a 3′-end phosphate that is removed by a PNK, and the lesion is filled through short-patch BER [38,39,42-46].
In this study, a 'mock' FFPE model replicating the conditions found in clinical FFPE samples was used to characterize the nature and severity of FFPE-induced damage in bacterial DNA, followed by development of an effective strategy for repairing it. Quantitative PCR and high-resolution melt (HRM) analysis, along with Sanger sequencing, were used to screen decrosslinking conditions and available DNA glycosylases and to shortlist those found most effective. These were then further tested individually and in combination, with a final validation by whole genome sequencing (WGS) analysis used to determine the most effective DNA repair strategy.

Counting fixed bacterial cells
The cell suspension was counted using a bacterial counting kit for flow cytometry (Invitrogen). In brief, a 10% aliquot from the bacterial suspension was serially diluted to 1 × 10^6 cells in 989 µl of NaCl. Bacterial cells were stained with 1 µl of SytoBC, and 10 µl (1 × 10^6) of counting beads were added to the suspension. Cells were counted in an LSR II flow cytometer (BD Biosciences). The acquisition trigger was set to side scatter and regulated for each bacterial strain to filter out electronic noise without missing bacterial cells. This value was ~800. The volume corresponding to ~2 × 10^7 CFU of each bacterial strain and 2.2 × 10^7 4T1 cells were mixed together.

Cell culture
Mus musculus mammary gland cancer cells (4T1) were grown at 37 °C, 5% CO2, in RPMI-1640 (Sigma-Aldrich) media supplemented with 10% FBS (Sigma-Aldrich), 100 U/ml penicillin and 100 µg/ml of streptomycin (Thermo Fisher) and counted with a NucleoCounter® NC-100™ (Chemometec, Copenhagen).
[48]. Processed Protoblocks were placed in a 1.5 cm × 1.5 cm embedding mould and mounted to a processing cassette. Blocks were sectioned using aseptic technique, either at 4 µm for imaging or at 15 µm for DNA purification. The cell load of each slide was calculated by dividing the total bacterial load by the volume of each slide.

Immunofluorescence and histochemistry
Cell morphology was evaluated with Gram staining (Sigma-Aldrich) or H&E staining with Mayer's haematoxylin (Sigma-Aldrich). Bacterial counts were confirmed in three sections stained with DAPI, 1:50 anti-E. coli (Abcam, 137967), or 1:400 anti-S. aureus (Abcam, 20920), and counterstained with Alexa Fluor 488 (Jackson Immunoresearch Laboratories Inc., USA) donkey anti-rabbit Ig. Stained sections were mounted in ProLong Gold Antifade reagent with DAPI (Invitrogen, UK). Gram-stained sections were counted in bright field using an Olympus BX51 microscope with a ×100 lens. Immunofluorescent stained slides were counted at ×20 (4T1 cells) or ×60 (bacteria) with a fluorescence microscope (Evos FL Auto). For each slide, at least 20 randomly selected fields of view were counted. The area of the field of view was recorded using the microscope's software and used to calculate the volume counted.

DNA purification
For purifying DNA from Protoblocks, unless otherwise specified, 10 µm × 15 µm aseptically collected sections were deparaffinized with two xylene washes and processed following the procedures specified in the QIAGEN FFPE DNA kit protocol (Qiagen Inc., Valencia, CA, USA). DNA was eluted in Tris-HCl (pH 8) and quantified with a Qubit™ dsDNA HS Assay Kit (Invitrogen, USA).
For non-fixed (NF) bacteria, bacterial cultures were grown to an OD600 of 1. Aliquots of 2 ml were processed following the procedures of the GenElute™ Bacterial Genomic DNA Kit Protocol with lysozyme and lysostaphin (Sigma) and eluted in 50 µl of Tris-HCl (pH 8). In all cases, DNA was stored at −20 °C until further analysis.

Quantitative PCR
For quantitative PCR (qPCR), reactions were prepared using LUNA Universal qPCR (NEB, Ipswich, MA, USA) and 0.25 µM of each primer (Supplementary Table S1). The thermal profile included an initial denaturation of 1 min at 95 °C, and 40 cycles of denaturation at 95 °C for 10 s, annealing for 15 s at the primers' optimal temperature (54-56 °C, as specified by the NEB calculator for Hot Start Taq) and 20-40 s of extension at 68 °C (20 s for 200 bp amplicons and 40 s for 400-500 bp amplicons).

High-fidelity quantitative PCR reaction set-up
Reactions were prepared using NEBNext-Ultra II Q5 Master Mix and 0.5 µM of each primer (Supplementary Table S1).

Quantitative PCR assay parameters
Amplification was performed in an AriaMx (Agilent Technologies, USA) using the DNA-binding dye absolute quantitation experiment type. Each assay included triplicates of five-point standards using log-dilutions of a 10^7 copies gene block, designed upon a species-specific genetic region. Primers targeting these regions and maintaining a similar Tm (62 °C) were designed using the National Center for Biotechnology Information (NCBI) primer design tool, and their parameters (ΔG, hairpins and dimers) were verified using IDT's oligo analyser tool. Primers and gene blocks were acquired from IDT (Coralville, USA) (see Supplementary Table S1). qPCR efficiencies between 95% and 105% and R^2 values >0.995 were deemed acceptable. All samples were run in triplicate.
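The acceptance criteria for the standard curve can be checked with a short calculation. The Python sketch below is illustrative only: it assumes the conventional relationship between the slope of the standard curve (Cq against log10 copy number) and amplification efficiency, E = 10^(-1/slope) - 1, which the text does not state explicitly. All function names and Cq values are hypothetical.

```python
def fit_standard_curve(log10_copies, cq_values):
    """Least-squares fit of Cq = slope * log10(copies) + intercept.

    Returns (slope, intercept, r_squared).
    """
    n = len(log10_copies)
    mean_x = sum(log10_copies) / n
    mean_y = sum(cq_values) / n
    sxx = sum((x - mean_x) ** 2 for x in log10_copies)
    syy = sum((y - mean_y) ** 2 for y in cq_values)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(log10_copies, cq_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy ** 2) / (sxx * syy)
    return slope, intercept, r_squared


def amplification_efficiency(slope):
    """Efficiency as a fraction; 1.0 corresponds to 100% (doubling per cycle)."""
    return 10 ** (-1.0 / slope) - 1.0


def is_acceptable(slope, r_squared):
    """Apply the thresholds quoted in the text: 95-105% efficiency, R^2 > 0.995."""
    efficiency = amplification_efficiency(slope)
    return 0.95 <= efficiency <= 1.05 and r_squared > 0.995


def copies_from_cq(cq, slope, intercept):
    """Interpolate a copy number from a sample Cq using the fitted curve."""
    return 10 ** ((cq - intercept) / slope)


if __name__ == "__main__":
    # Hypothetical five-point standard: 10^7 down to 10^3 copies.
    standards = [7, 6, 5, 4, 3]
    cqs = [14.8, 18.1, 21.4, 24.8, 28.1]  # made-up values for illustration
    slope, intercept, r2 = fit_standard_curve(standards, cqs)
    print(slope, amplification_efficiency(slope), r2, is_acceptable(slope, r2))
```

A slope close to -3.32 corresponds to an efficiency close to 100%, so the 95-105% window translates into a fairly narrow range of acceptable slopes.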

High-resolution melt (HRM) curve analysis
For melt curve analysis, it was essential to first normalize the amplifiable DNA fraction of the samples tested. To achieve this, a qPCR was performed for fragments of the same length. The copy numbers measured by qPCR were used to normalize the samples to 1 × 10^6 copies/µl. The 20 µl reactions were prepared using 1× NEB Luna probe qPCR mix, 1.25 µM EvaGreen Dye (Biotium, CA, USA), 37.5 nM ROX as reference dye, 0.25 µM of each primer and 2.5 µl of copy number normalized template DNA. Escherichia coli primers rendering amplicons of 100, 200 and 500 bp were used for this assay (Supplementary Table S1). The target region was first amplified as specified for absolute quantitation, with the addition of a final 2 min extension step at 68 °C. This was followed by HRM analysis set to read fluorescence every 0.2 °C with a 10 s soak time from 65 °C to 95 °C. All experiments were performed using an AriaMx thermocycler (Agilent Technologies).
Here, normalized fluorescence (Rn) obtained every 0.2 °C across the temperature gradient (65-95 °C) was used to monitor the melting temperature (Tm) profile of the template. Changes in the Tm profile are indicative of changes in the template sequence. To better observe these changes, the Tm profiles were plotted on a Tm difference (ΔTm) plot, where the Tm difference is represented by the deviation of the Rn values recorded for a test sample from those recorded for an NF reference, for which the ΔTm is 0. Therefore, ΔTm = Rn(test) − Rn(reference). Aberrant profiles that differ from NF DNA with ΔTm < 0.1 °C are typical of FFPE DNA and are indicative of low-level, non-identical changes randomly distributed across the template [49]. Therefore, in these plots, a lower ΔTm is indicative of a lower number of sequence artefacts in the template. Raw Tm values were extracted from the AriaMx software and analysed in the R environment, v3.4.4.
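As an illustration of the difference-plot calculation described above, the following Python sketch subtracts a reference melt curve from a test melt curve point by point across the 65-95 °C gradient. It is a minimal sketch under stated assumptions, not the analysis used in the study (which was performed in R): the min-max rescaling of fluorescence, the peak-deviation summary and the simulated sigmoid curves are assumptions introduced here, and all function names are hypothetical.

```python
import math


def temperature_grid(start=65.0, stop=95.0, step=0.2):
    """Temperatures at which fluorescence is read (every 0.2 degrees C)."""
    n = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 1) for i in range(n)]


def normalise(rn):
    """Rescale a melt curve to the 0-1 range (assumed pre-processing step)."""
    lo, hi = min(rn), max(rn)
    if hi == lo:
        raise ValueError("flat melt curve cannot be normalised")
    return [(v - lo) / (hi - lo) for v in rn]


def difference_curve(rn_test, rn_reference):
    """Point-wise difference: Rn(test) - Rn(reference)."""
    if len(rn_test) != len(rn_reference):
        raise ValueError("curves must share the same temperature grid")
    return [t - r for t, r in zip(normalise(rn_test), normalise(rn_reference))]


def peak_deviation_percent(diff):
    """Largest absolute deviation from the reference, as a percentage."""
    return 100.0 * max(abs(d) for d in diff)


def simulated_melt_curve(temps, tm, width=0.8):
    """Hypothetical sigmoid melt curve, used only to exercise the functions."""
    return [1.0 / (1.0 + math.exp((t - tm) / width)) for t in temps]


if __name__ == "__main__":
    temps = temperature_grid()
    reference = simulated_melt_curve(temps, tm=85.0)   # stands in for NF DNA
    test = simulated_melt_curve(temps, tm=84.5)        # stands in for FFPE DNA
    diff = difference_curve(test, reference)
    print(peak_deviation_percent(diff))
```

By construction the reference compared with itself gives a flat line at 0, which matches the convention that the NF reference has a difference of 0.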

Sanger sequencing
Sanger sequencing was performed on 500 ng of purified and/or treated DNA for each replicate on the same genomic regions analysed by qPCR. Sequencing was performed by Eurofins Genomics.

Assembling BER reaction
Buffer
The BER pathway was reconstituted in a final buffer of 1× NEB CutSmart buffer (50 mM potassium acetate, 20 mM Tris-acetate, 10 mM magnesium acetate and 100 µg/ml of bovine serum albumin, pH 7.9), supplemented with 100 µM of dNTPs, 50 µM of NAD+ and 2 mM of DTT. Enzyme efficiency in this buffer was analysed by comparing each enzyme's activity with its activity in the buffer provided by the manufacturer. The compared enzyme activity was used to adjust the enzyme units used for the BER reaction.

Bioinformatics and statistical analysis
qPCR and HRM data analysis
Statistical analysis was performed in the base R environment (v3.6.1). Visualizations were carried out using the ggplot2 package (v3.2.1).

Sanger sequence analysis
The effect of DNA repair enzymes on DNA sequence length and readability was assessed by Sanger sequencing. To this end, the ratio of clipped sequence length to unclipped sequence length was compared between samples. Statistical analysis was performed in the base R environment (v3.6.1). Visualizations were carried out using the ggplot2 package (v3.2.1).

WGS sequence analysis
All metrics relating to sequence data were calculated in the Linux environment using the QUAST tool (v5.0.2), and statistical analysis was performed in the base R environment (v3.6.1). Visualizations were carried out using the ggplot2 package.

Method for variant calling
Filtering
HiSeq sequence data were quality filtered. Only very high-quality bases (Phred score >30) were considered, to minimize the risk of sequencing errors causing false-positive variants. Short fragments were also removed to reduce the likelihood of spurious alignments of regions from contaminant bacterial genomes. Trimmomatic (v0.38) was used to remove all reads shorter than 60 bp in length, and to trim reads when the average per-base quality in a sliding window of size 4 dropped below 30.
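The filtering logic described above can be sketched in a few lines of Python. This is a simplified illustration of sliding-window quality trimming followed by a minimum-length filter, not Trimmomatic's actual implementation, whose handling of the cut position may differ in detail. The parameters mirror the text (window of 4, mean quality 30, minimum length 60) and would presumably correspond to Trimmomatic's SLIDINGWINDOW and MINLEN steps; the function names are hypothetical.

```python
def sliding_window_trim(qualities, window=4, min_mean_quality=30):
    """Return how many leading bases to keep.

    Scans from the 5' end and cuts at the start of the first window whose
    mean Phred quality falls below the threshold.
    """
    for start in range(0, len(qualities) - window + 1):
        current = qualities[start:start + window]
        if sum(current) / window < min_mean_quality:
            return start
    return len(qualities)


def filter_read(sequence, qualities, min_length=60,
                window=4, min_mean_quality=30):
    """Trim a read, then discard it if it is shorter than min_length.

    Returns (trimmed_sequence, trimmed_qualities), or None if discarded.
    """
    keep = sliding_window_trim(qualities, window, min_mean_quality)
    if keep < min_length:
        return None
    return sequence[:keep], qualities[:keep]


if __name__ == "__main__":
    # Hypothetical 80 bp read whose quality collapses over the last 10 bases.
    read = "ACGT" * 20
    quals = [35] * 70 + [10] * 10
    result = filter_read(read, quals)
    print(None if result is None else len(result[0]))
```

Applying the length filter after trimming, as here, means that a read can be discarded because of its quality profile even if its raw length exceeds 60 bp.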

Alignment
Of the three possible Burrows-Wheeler alignment tools, the BWA-mem aligner was used as the average read length was 150 bp, and BWA-mem (v0.7.17) is recommended when reads are over 70 bp in length. Default settings were used with the exception of allowing alignments with a minimum score of 0, rather than the default 30. Given the stringent parameters used for read length and quality filtering, relaxing the minimum alignment score gave the best possible chance of variant detection. All samples were aligned with the original reference genomes.

Variant calling
Variant calling was done with BCFtools, using the call function. The variants were then filtered using the norm and filter functions within BCFtools. Filtering removed variants when the read depth was below 10, when the quality was below 40, or when the variant identified was not supported by both the forward and reverse read of a read pair. The number of variants identified was then normalized between samples based on the read coverage in the initial alignment BAM file.
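The filtering thresholds and the coverage normalization described above can be expressed as a small Python sketch. It is illustrative only and does not reproduce the BCFtools commands used: the `Variant` record, its field names and the choice to normalize by dividing the variant count by mean coverage are assumptions made here for the example.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical minimal variant record."""
    position: int
    depth: int              # read depth at the site
    quality: float          # variant quality score
    forward_support: int    # supporting reads on the forward strand
    reverse_support: int    # supporting reads on the reverse strand


def passes_filters(variant, min_depth=10, min_quality=40):
    """Keep variants with depth >= 10, quality >= 40 and support from both reads."""
    return (variant.depth >= min_depth
            and variant.quality >= min_quality
            and variant.forward_support > 0
            and variant.reverse_support > 0)


def filter_variants(variants):
    """Return only the variants that pass all three filters."""
    return [v for v in variants if passes_filters(v)]


def variants_per_coverage(variants, mean_coverage):
    """Number of filtered variants per unit of mean read coverage (assumed metric)."""
    if mean_coverage <= 0:
        raise ValueError("mean coverage must be positive")
    return len(filter_variants(variants)) / mean_coverage


if __name__ == "__main__":
    calls = [
        Variant(100, 25, 60.0, 12, 13),   # passes
        Variant(200, 8, 90.0, 4, 4),      # fails: depth below 10
        Variant(300, 30, 55.0, 30, 0),    # fails: no reverse-read support
    ]
    print(variants_per_coverage(calls, mean_coverage=20.0))
```

Normalizing by coverage makes counts comparable between samples that were sequenced to different depths, which matters here because FFPE and NF samples differ widely in the number of usable reads.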

Validation
Using the Picard tool within the Genome Analysis Toolkit suite, all samples were down-sampled to the lowest coverage level present among the samples, to ensure that the SNP:coverage ratio remained constant when coverage was reduced.

Characterization of bacterial FFPE DNA damage
Measuring fragmentation of PCR-readable DNA
The length of PCR-readable fragments from bacterial DNA subjected to FFPE treatment was measured by quantitative PCR. Targeting a 525 bp chromosomal region, primers were designed to amplify DNA fragments of 200 bp, 300 bp, 400 bp and 500 bp. Template DNA was purified from FFPE blocks loaded with 1 × 10^8 E. coli cells, fixed for 48 h and stored for >6 months. Each qPCR reaction was loaded with 5 ng of DNA, corresponding to 1 × 10^6 CFU. As seen in Fig. 1a, the quantity of amplifiable DNA is significantly reduced after FFPE treatment. For NF DNA, the amplification of PCR-readable fragments is almost 100% and is independent of fragment size, whereas a log-fold reduction of amplifiable DNA is observed for even short (200 bp) fragments of FFPE DNA (P < 0.001). This becomes more pronounced as fragment length increases, with a significant correlation between the reduction in the quantity of amplifiable DNA and fragment length, leading to a log-fold reduction in amplifiable DNA quantity between 200 bp and 500 bp fragments (P < 0.001).

Assessing the extent of formaldehyde crosslinks in FFPE bacterial DNA
The presence and frequency of formaldehyde crosslinks in bacterial DNA were assessed by comparing the quantity of amplifiable DNA obtained after performing or omitting a crosslink reversal incubation on paired samples (n = 6), a strategy resembling the formaldehyde-assisted isolation of regulatory elements (FAIRE) method [13]. As can be seen in Fig. 1b, an increase in amplifiable DNA was observed after crosslink reversal, indicating that 95-97% of the amplifiable DNA in the sample held crosslinks that inhibited its amplification.
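The paired comparison above implies a simple calculation, sketched here in Python. It assumes that the crosslinked fraction is estimated as one minus the ratio of amplifiable copies measured without and with crosslink reversal; the text does not give the formula, so this is an interpretation, and the copy numbers and function names are hypothetical.

```python
def crosslinked_fraction(copies_without_reversal, copies_with_reversal):
    """Fraction of amplifiable template that only amplifies after reversal."""
    if copies_with_reversal <= 0:
        raise ValueError("copies_with_reversal must be positive")
    return 1.0 - copies_without_reversal / copies_with_reversal


def fold_increase(fraction):
    """Fold increase in amplifiable DNA implied by a given crosslinked fraction."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return 1.0 / (1.0 - fraction)


if __name__ == "__main__":
    # Hypothetical paired measurement (copies per reaction).
    without_reversal = 3.0e4
    with_reversal = 1.0e6
    fraction = crosslinked_fraction(without_reversal, with_reversal)
    print(fraction, fold_increase(fraction))
```

Under this reading, a crosslinked fraction of 95-97% corresponds to roughly a 20- to 33-fold gain in amplifiable DNA after crosslink reversal.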

Evaluating the presence of damaged nucleotides
The presence of damaged bases in bacterial FFPE DNA was investigated by subjecting FFPE DNA to the activity of DNA glycosylases targeting base oxidation, deamination and carboxylation, using the enzymes listed in Supplementary Table S3. DNA lesions resulting from DNA glycosylase activity (AP sites and 3′-P) [39,43] inhibit amplification [50]. Therefore, DNA glycosylase activity can be measured by comparing the quantity of amplifiable DNA in a sample with and without DNA glycosylase treatment, with a decrease in amplification implying the presence of the targeted DNA damage. As seen in Fig. 1c, a decrease in amplifiable DNA was noticeable in concentration-normalized samples after treatment with all glycosylases, with the highest activity observed for UDG and FPG, as indicated by the 35-50% and 67-80% reductions, respectively, in the recovery of PCR-readable DNA fragments after treatment (P < 0.001). It should be noted that Endo VIII activity is not measurable by this PCR analysis: the lesions targeted by this enzyme (hydantoins) are themselves PCR inhibitory, so their removal would not have any effect on the amount of amplifiable DNA template [51].

Assessment of DNA sequence quality by sequencing
Overall DNA damage is reflected in the outputs of sequencing. Damaged bases and single-strand breaks present as sequencing misreads, such as chimeras, indels and SNPs that lead to poor quality reads, which will be routinely filtered out prior to analysis. As seen in Fig. 1d, a significant decrease in high-quality, sequencing-readable DNA was observed in both Sanger sequencing and WGS, for FFPE samples compared with their paired NF samples. This was accentuated by prolonged DNA fixation, where the reduction of high-quality sequences reaches 30% (P < 0.001).

Development of a DNA repair strategy
Having characterized the nature of FFPE-induced damage to bacterial DNA, an appropriate repair strategy was devised, as outlined in Fig. 2.

Optimization of decrosslinking
Crosslinks block polymerase processivity, reducing yields of PCR-readable DNA [37]. Recently, it has been shown that decrosslinking incubation at 90 °C reduces DNA sequence quality [20,21]. For this reason, we investigated strategies that reduce heat exposure, in order to find the balance that improves the output DNA sequence quality without significantly affecting its yield.

Temperature
The effect of decrosslinking temperature on the yield of amplifiable DNA was investigated by quantitative PCR in DNA extracted from FFPE blocks loaded with Staphylococcus aureus (Fig. 3ai) and E. coli (Fig. 3aii), fixed for 24 h and stored for 3 months. Reactions were loaded with 10^6 copies of template and incubated at 90 °C for 1 h (reference protocol = industry standard mammalian DNA isolation from FFPE tissue), 80 °C × 1 h, 72 °C × 2 h or 65 °C × 3 h. Compared with the reference 90 °C protocol, no significant difference in amplification of PCR-readable DNA was observed at 80 °C for either bacterium (P > 0.05), while a 4-fold (E. coli) and a 10-fold (S. aureus) decrease in the amount of PCR-readable DNA was evident at both 72 °C and 65 °C (P < 0.001). In this case, PCR amplification is indicative of the template fraction that was efficiently decrosslinked.

Buffers
The ability of three protein lysis buffers to set reaction conditions (enthalpy disruption) that facilitate decrosslinking at 80 °C was examined: Test Buffer 1 (TB1), based upon the protein-denaturing properties of a chaotropic agent (GuHCl); Test Buffer 2 (TB2), denaturing proteins with a reducing agent (DTT); and Test Buffer 3 (TB3), relying on the denaturing properties of an ionic detergent (sodium dodecyl sulphate). Decrosslinking with the three buffers was tested against the reference buffer (Buffer ATL, Qiagen FFPE Kit) at 80 °C × 1 h. The effect of each buffer upon decrosslinking efficiency was assessed quantitatively by comparing the quantity of amplifiable DNA recovered after treatment. Contents of FFPE slides loaded with E. coli and S. aureus cells were suspended in each buffer (n = 6). Purified DNA was subjected to qPCR for amplification of a 200 bp fragment. TB1 and the reference buffer displayed the highest yields, with no significant difference between them (P > 0.05), and both were significantly higher than TB2 (P < 0.05) and TB3 (P < 0.01) (Fig. 3b).

Evaluating DNA sequence quality of optimized strategy
The optimized strategy (1 h at 80 °C in TB1) was tested against the standard protocol (1 h at 90 °C in QIAGEN ATL Buffer) for its capacity to decrosslink DNA, indicated by the yield of 500 bp PCR products (Fig. 3ci), and for the sequence quality of the fragments yielded (Fig. 3cii and iii). This was tested in DNA sourced from FFPE blocks loaded with E. coli, fixed for 48 h and stored for 1 year (representing maximum damage conditions). For quantitative analysis (Fig. 3ci), reactions were loaded with normalized DNA concentration. For qualitative analysis (Fig. 3cii and iii), reactions were loaded with 10^6 amplifiable copies of the DNA fragments. The results shown in Fig. 3ci reflect those in Fig. 3b: the yield of amplifiable DNA did not differ significantly from that of the reference protocol. However, the sequence quality of the DNA recovered was improved with the new strategy. As can be seen in Fig. 3cii, the Tm of samples treated with the new strategy was less variable and closer to that of paired NF DNA, exhibiting a Tm difference [ΔTm (%)] of 2.88 (not significant), vs. 3.02 (P < 0.05) for the reference protocol. This was further confirmed with HRM (detailed in the 'Material and methods' section), where profiles that deviate from that of NF DNA are indicative of the randomly distributed sequence aberrations typical of FFPE DNA [49]. The ΔTm plots in Fig. 3ciii show that the ΔTm for samples decrosslinked with the new strategy [ΔTm (%) = 3.5] is significantly lower than that of the reference protocol (buffer ATL at 90 °C) [ΔTm (%) = 6.1] (P < 0.05). This indicates that, with the new strategy, the decrosslinked template is less damaged (more closely resembles NF DNA) without compromising DNA yields.

DNA glycosylases reduce sequence alterations in FFPE DNA
After examining their activity on FFPE DNA (Fig. 1c), the effect of treatment with DNA glycosylases on DNA sequence quality was assessed by (i) Tm analysis, (ii) Sanger sequencing and (iii) HRM. For Tm analysis and HRM, all reactions were loaded with 1 × 10^6 genome copies of DNA sourced from FFPE blocks loaded with E. coli and set to amplify 3 × 100 bp fragments (Fig. 4a and Supplementary Fig. S1). For all the regions analysed, the Tm of samples treated with any of the DNA glycosylases changed significantly from that of untreated FFPE samples (P < 0.001) and more closely resembled that of the NF reference. This was further assessed by HRM, by comparing the melting profile of a 200 bp fragment (as explained in Fig. 3 and Methods). As seen in Fig. 4c, the plotted ΔTm (from paired NF) of glycosylase-treated FFPE DNA was much lower than that of untreated FFPE DNA. The same effect was evident with Sanger sequencing (Fig. 4b): DNA glycosylases significantly improved (P < 0.001) the number of high-quality reads recovered, increasing the readability of DNA to levels no longer significantly different from NF DNA. Interestingly, samples treated with Endo VIII alone showed an improved sequence quality. Given that damage targeted by Endo VIII is PCR inhibitory, this might be indicative of activity on non-blocking lesions (Fapy-A), reflect PCR errors triggered by blocking lesions (jumping PCR) or be due to a reduction of Taq polymerase fidelity (A rule and/or deletions) [52,53].

To further confirm these results, six replicates treated with each mix were pooled (n = Σ6) and analysed by WGS. The data confirmed that all mixes improved (i) sequence coverage and (ii) the number of total and QF-passed reads, and (iii) reduced the number of SNPs. The best performance in all cases was observed with the BER mix containing FPG and Endo VIII.

Figure 6: Combined protocol, bacterial DNA. Outputs of Bioanalyzer and WGS for bacterial FFPE DNA exposed to the combined treatment (blue, labelled new protocol, Σn = 6). This was compared with that obtained from six pooled paired samples decrosslinked with the reference protocol and unrepaired (grey, labelled reference protocol, Σn = 6) and that from DNA obtained from NF samples with the same bacterial and DNA content (orange, labelled NF). Axis: average fragment size (bp).

Development of an in vitro BER system
For the in vitro reconstitution of the BER pathway, a suitable universal buffer was sought and tested by examining the activity of each enzyme (see Methods) and comparing it with the activity in the recommended buffer (see Supplementary Fig. S2). Optimization of enzyme and co-factor quantities was then performed (Supplementary Tables S2-S4).
First, the BER pathway was reconstituted for single repair pathways triggered by a single DNA glycosylase (UDG, FPG or Endo VIII), with units and enzymes listed in Supplementary Tables S3 and S4, and its performance was tested by HRM analysis. Figure 5a shows the HRM plots of DNA exposed to the BER pathway reconstituted for FPG, UDG or Endo VIII. As explained in Methods, the more similar a DNA sequence is to the NF reference, the lower the difference in melting temperature (ΔTm closer to 0). As seen in Fig. 5a, exposure of DNA to each reconstituted BER pathway led to a reduction in ΔTm in FFPE DNA and an increase in the quantity of PCR-readable template (Supplementary Fig. S3), suggesting a reduction in the frequency of sequence artefacts. The reduction in sequence artefacts was greatest for the FPG-driven BER reaction, with a ~50% decrease in ΔTm relative to untreated samples, followed by Endo VIII with a ~31% reduction and finally UDG with a ~14% decrease in ΔTm. These results indicate that BER was reconstituted correctly and that these reconstituted pathways effectively corrected sequence artefacts without reducing the PCR-readable template.
Subsequently, the reconstitution of a BER system able to target the different types of DNA damage found in FFPE samples was addressed by combining the pathways for the glycosylases tested. Since FPG-BER (Fig. 5a) yielded the best results among the single glycosylase-BER reactions, this enzyme was combined with Endo VIII and/or UDG, and the efficiency of each combination in reducing sequence artefacts was tested by HRM. As shown in Fig. 5b, all combinations resulted in sequences with ΔTm lower than those of untreated FFPE DNA. The FPG + UDG mix showed the best performance at reducing the ΔTm (31%), followed by FPG + Endo VIII (18%). However, in terms of improving the PCR readability of a 500 bp fragment, FPG + Endo VIII (47% increase, P < 0.01) outperformed FPG + UDG (30% increase, P < 0.01), as measured by Taq qPCR. This effect was confirmed by high-fidelity qPCR (providing a more stringent discrimination of damaged and repaired sequence), where FPG + Endo VIII showed a 20% increase and FPG + UDG only a 4% increase in amplifiable DNA (Supplementary Fig. S4). To confirm these results, a normalized DNA quantity from six replicates for each BER mix and six unrepaired samples were pooled into one (n = Σ6) and sent for analysis by WGS (Fig. 5c). At this level of resolution, it is evident that the repair mix with FPG + Endo VIII offered the highest improvements in sequence quality, providing (i) 4-fold higher coverage than unrepaired samples, (ii) 4-fold more total reads and quality filter (QF)-passed reads and (iii) a 50% reduction in the number of variants detected per sequence coverage. This repair mix was thus selected as the best repair mix for bacterial FFPE DNA.

Analysis of combined decrosslinking and BER treatment
The combination of the above treatment strategies (decrosslinking and DNA repair) was tested by WGS in DNA sourced from FFPE blocks containing a mix of five bacterial strains, fixed for 48 h and stored for 2 months. DNA was decrosslinked at 80 °C with TB1 (Methods) and repaired with the FPG + Endo VIII-BER repair mix. The results were compared with those obtained from paired samples treated with the reference protocol (decrosslinking at 90 °C with QIAGEN ATL buffer, without DNA repair), and with NF DNA obtained from equal cell contents. Experimental replicates were pooled (n = Σ6) and sent for WGS analysis. Results for this analysis are shown in Fig. 6 and Supplementary Fig. S5. Bacterial FFPE DNA treated with the proposed new protocol shows an improvement in integrity, readability and sequence quality, as evidenced by the following. (i) Integrity [average fragment length (Fig. 6a, b)]: plotted in Fig. 6a are the average fragment lengths measured with a fragment analyser. The fragment length of DNA treated with the new protocol (444 bp) is 3.3-fold that of DNA treated with the reference protocol (136 bp). Importantly, this raises the average fragment length to approximately that of fragments typically desired for 16S sequencing (460 bp). The same effect was observed in the length of fragments read by WGS, where fragment lengths were 2-3 bp longer on average (Fig. 6b). (ii) Readability: with the new protocol, the number of total reads and QF-passed reads per layer of coverage was increased by 24% and 34%, respectively, and the ratio of QF-passed to total reads increased by 8.4%. (iii) Sequence quality: this was measured in terms of the number of sequence artefacts detected. The number of chimeric reads per coverage detected in samples treated with the new protocol was reduced by 57% (P = 0.37) (Fig. 6e). Similarly, the number of SNPs detected was reduced by 58% (P = 0.41) (Fig. 6f and Supplementary Fig. S5) in all strains tested. Despite the reduction in SNPs being uniform across all strains tested, FFPE was found to produce a different SNP profile in Gram-positive vs. Gram-negative bacteria. As seen in Supplementary Fig. S6, a broad spectrum of SNPs was proportionally more abundant in FFPE Gram-negative E. coli when compared to the NF reference, while this was less pronounced in the Gram-positive B. longum. In these profiles, a reduction in SNPs derived from oxidative damage is observable in both Gram-positive and Gram-negative bacteria; however, there is still a prevalence of SNPs derived from cytosine deamination.
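To make the notion of an SNP damage profile concrete, the Python sketch below groups substitutions by their most commonly attributed cause. It is not the analysis used to produce Supplementary Fig. S6, which is not described in the text. It assumes the widely used attribution of C>T (and the complementary G>A) to cytosine deamination, and of G>T (and the complementary C>A) to oxidative damage of guanine; the function names and example data are hypothetical.

```python
from collections import Counter

# Assumed attribution of substitution types to damage classes.
DEAMINATION = {("C", "T"), ("G", "A")}
OXIDATION = {("G", "T"), ("C", "A")}


def classify_snp(ref, alt):
    """Assign a single-base substitution to an assumed damage class."""
    pair = (ref.upper(), alt.upper())
    if pair in DEAMINATION:
        return "cytosine deamination"
    if pair in OXIDATION:
        return "oxidative damage"
    return "other"


def damage_profile(snps):
    """Proportion of SNPs in each class, given (ref, alt) pairs."""
    counts = Counter(classify_snp(ref, alt) for ref, alt in snps)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical SNP calls from one sample.
    calls = [("C", "T"), ("G", "A"), ("G", "T"), ("A", "C")]
    print(damage_profile(calls))
```

Comparing such proportions between repaired and unrepaired samples is one way to see whether a repair mix preferentially removes one class of damage, as the text reports for oxidative lesions.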
Improvements in DNA quality and quantity similar to those shown for bacterial DNA were also obtained for the mammalian cell line used (4T1), where a 21% decrease in the number of SNPs per layer of genome coverage and a 65% increase in the breadth of genome coverage were observed in the DNA treated with the proposed method (Fig. 7). Although these improvements were accompanied by a slight rise in chimeric sequences, the fact that this is seen in both the repaired samples and the NF reference indicates that it is likely a function of the increased reference genome coverage seen for these samples. All of these findings are consistent with results from quantitative PCR and Tm analysis. Although these improvements did not reach statistical significance, the considerable effect sizes lead us to attribute this to the small sample size. Altogether, the strategies proposed here were thoroughly investigated by PCR and sequencing, both individually and in combination. These results consistently indicate an improvement in the sequence integrity, readability and quality of readable bacterial FFPE DNA.
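The relative measures used in this section (fold change, percentage change and counts normalized per layer of coverage) can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: the function names are hypothetical, and the only study values used are the fragment lengths reported above.

```python
def fold_change(treated: float, reference: float) -> float:
    """Ratio of the treated value to the reference value."""
    return treated / reference


def percent_change(treated: float, reference: float) -> float:
    """Signed percentage change relative to the reference value."""
    return 100.0 * (treated - reference) / reference


def per_coverage(count: float, coverage_layers: float) -> float:
    """Normalize a raw count (e.g. SNPs) by layers of genome coverage."""
    return count / coverage_layers


# Average fragment lengths reported in the text (bp).
NEW_PROTOCOL_BP = 444.0
REFERENCE_PROTOCOL_BP = 136.0
TARGET_16S_BP = 460.0

if __name__ == "__main__":
    # Fold change in fragment length, new vs. reference protocol.
    print(round(fold_change(NEW_PROTOCOL_BP, REFERENCE_PROTOCOL_BP), 1))  # 3.3
    # Fraction of the desired 16S amplicon length reached.
    print(round(NEW_PROTOCOL_BP / TARGET_16S_BP, 2))  # 0.97
```

Normalizing artefact counts by coverage before computing a percentage change avoids attributing a difference in sequencing depth to the treatment itself.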

Discussion
A plethora of studies have characterized FFPE-induced damage in human/mammalian DNA, where the abundance of DNA present in FFPE samples and the presence of a well-characterized reference genome allow for high-quality, reproducible research. To our knowledge, this is the first such study in prokaryotic DNA, where an understanding of the effects of FFPE processing on DNA, and of their impact on downstream analyses, is arguably even more important.
Our results show bacterial FFPE DNA to be a poor PCR template, with a log-fold reduction in the recovery of DNA fragments. This can be at least partially attributed to DNA fragmentation, since an inverse correlation between fragment size and PCR readability was shown (Fig. 1a), culminating in a log-fold reduction in recovery between 200 and 500 bp fragments. Crosslinks were found to be ubiquitous in FFPE bacterial DNA (Fig. 1b), and potentially more prevalent than in FFPE human DNA, based on previous research [12,20]. Current decrosslinking protocols have been found to induce sequence alterations [21], and reducing heat exposure has been proposed to prevent this damage [20,21]. Our results agree with these hypotheses, as a reduction from 90 °C (current protocols) to 80 °C significantly reduced off-target effects without compromising decrosslinking efficiency. We hypothesize that TB1 [containing 50 mM Tris-HCl (pH 8.0), 30 mM EDTA, 800 mM GuHCl, 0.5% Triton-X, 0.5% Tween-20] provided reaction conditions that promote decrosslinking at a lower temperature. This could be explained by a higher degree of protein denaturation, facilitated by GuHCl, which interacts with multiple protein groups, including the backbone and hydrophobic and polar side chains. This is supported by the ability of GuHCl to increase the activity of Proteinase K and the torsional mobility of denatured proteins (at 1 M concentration) [54][55][56]. Furthermore, unlike SDS, chaotropes interact with nucleic acids, altering their secondary and tertiary structure [57,58]. In fact, 1 M GuHCl has been shown to reduce the Tm of DNA by 13 °C and to increase the stringency of its hybridization, promoting correct base pairing [59].
All of this would facilitate the exposure and hydrolysis of ubiquitous DPC [60,61] and DNA-DNA complexes [62][63][64] at a lower temperature [65], reducing potential strain on the DNA structure and maintaining high base-pairing fidelity. This could also have been assisted by other reaction conditions, such as pH and ionic strength [62,66,67,68], the formaldehyde-scavenging activity of Tris-HCl [61,69] or possibly guanidinium-formaldehyde interactions, but this requires further investigation.
Treatment with glycosylases significantly reduces the appearance of sequence artefacts in FFPE DNA. Glycosylases generate blocked ends that are, in most cases, unsuitable for amplification. This effect was confirmed for all glycosylases tested. Studies performed on human DNA have shown that cytosine deamination to uracil is the main source of sequence artefacts in FFPE DNA [12], although this has been controversial [14,20,21,70]. Our data suggest that the DNA damage found in bacterial FFPE DNA is primarily driven by oxidation and cytosine deamination, as evidenced by the higher activity observed for FPG, Endo VIII and UDG. It is known that oxidized products of cytosine can trigger its deamination [71]. While UDG repairs cytosine deamination and some of the oxidized deaminated lesions (5-OH dU), Endo VIII has a broader spectrum for these targets. Altogether, quantitative and qualitative analysis by qPCR (Fig. 5a and Supplementary Fig. S4) and sequencing (Fig. 5c) of samples treated with Endo VIII BER consistently yielded better results than UDG BER, in terms of template readability and sequence fidelity. The same can be said for FPG + Endo VIII BER when compared to FPG + Endo VIII + UDG BER, despite some SNPs derived from cytosine deamination being evident in the repaired DNA profiles (Supplementary Fig. S6). While the HRM melting curve analysis provided a valuable guide, confirmation was provided by qPCR and sequencing data. After exhaustive comparison of different approaches to the problem, the strategy found to be most effective involves decrosslinking using a chaotropic agent at 80 °C, followed by DNA repair using a combination of formamidopyrimidine DNA glycosylase and Endonuclease VIII.
To conclude, the information generated here provides a better understanding of FFPE-derived DNA damage, informing strategies for its repair. We also present a thoroughly characterized method to address this damage. Given the increased activity in, and controversy surrounding, the field of low-biomass microbiome analysis, methods that improve the quality of microbiome studies (through improved sensitivity or access to larger sample sizes), such as that described here, are necessary. Given the paucity of published information on mammalian FFPE DNA repair, and the absence of any on bacterial FFPE DNA repair, the strategy devised here provides compelling evidence to further pursue BER strategies to improve the sequencing quality of bacterial FFPE DNA and possibly mammalian FFPE DNA.

Data availability
All sequencing data have been uploaded to the Sequence Read