Snapper: high-sensitive detection of methylation motifs based on Oxford Nanopore reads

Abstract Motivation The Oxford Nanopore technology has a great potential for the analysis of methylated motifs in genomes, including whole-genome methylome profiling. However, we found that there are no methylation motifs detection algorithms, which would be sensitive enough and return deterministic results. Thus, the MEME suit does not extract all Helicobacter pylori methylation sites de novo even using the iterative approach implemented in the most up-to-date methylation analysis tool Nanodisco. Results We present Snapper, a new highly sensitive approach, to extract methylation motif sequences based on a greedy motif selection algorithm. Snapper does not require manual control during the enrichment process and has enrichment sensitivity higher than MEME coupled with Tombo or Nanodisco instruments that was demonstrated on H.pylori strain J99 studied earlier by the PacBio technology and on four external datasets representing different bacterial species. We used Snapper to characterize the total methylome of a new H.pylori strain A45. At least four methylation sites that have not been described for H.pylori earlier were revealed. We experimentally confirmed the presence of a new CCAG-specific methyltransferase and inferred a gene encoding a new CCAAK-specific methyltransferase. Availability and implementation Snapper is implemented using Python and is freely available as a pip package named “snapper-ont.” Also, Snapper and the demo dataset are available in Zenodo (10.5281/zenodo.10117651).

The amplicon was purified using Cleanup Standard Kit (Evrogen, Russia), cloned into the pCR2.1 vector system using The Original TA Cloning Kit (Invitrogen, USA ) according to the manufacturer's recommendations and transformed into E. coli Top10 strain.Cloning was verified by PCR with the M13F/R vector-based primer set (Supplementary Table 7).The plasmids pCR2.1-8,pCR2.1-9containing the expected amplicon (Supplementary Table 8) were purified using the QIAprep® Spin Miniprep Kit (Qiagen, Germany) and sequenced.Plasmids with verified amplicon nucleotide sequence were digested with restriction enzymes using XbaI/KpnI sites for hp0008 gene disruption amplicon and KpnI/NotI sites for hp0944 gene disruption amplicon.Amplicons were purified from agarose gel in appropriate concentrations and used to transform H. pylori A45 cells as described earlier.
Supplementary Table 1.Plasmids and strains used in this study.
This study  Supplementary Figure 2. Two-step PCR conditions used for amplification of full-length fragments 1671 bp for hp0008 and 1567 bp for hp0944 gene disruption in H. pylori A45 strain, respectively .

H. pylori mutant strains construction. Plasmids used for gene inactivation restriction-modification systems
H. pylori mutant strains hpy, hp91/92, and hp1352 were obtained earlier [11].A cassette containing kanamycin resistance gene (aphA-3) as a selectable marker was inserted into the target region of the genome.For a detailed description of the plasmids used to inactivate the genes involved in restriction-modification systems, see Supplementary Materials Table 9 (3).Validation of the correct insertion of the cassette into the H. pylori genome was confirmed by PCR-analysis and sequencing of the obtained strains.Target inactivation -genes encoding methyltransferases was confirmed by restriction analysis of genomic DNA of the obtained strains using methyl-sensitive restriction endonucleases: Mbol (GATC), Hin1ll (CATG), Hinfl (GANTC).H. pylori mutant strains hp8, hp944 were obtained during this work by homologous recombination.For construction hp944 mutant strain (A45 strain disrupted in the hp0944 gene) the amplicon, consisting of chloramphenicol resistance catGC cassette flanked by two fragments of the hp0944 of 363 bp and 456 bp, was obtained by two-step PCR amplification.For construction hp8 mutant strain (A45 strain disrupted in the hp0008 gene) the amplicon, consisting of kanamycin resistance gene under chloramphenicol promoter region flanked by two fragments of the hp0008 gene of 325 bp and 463 bp, was obtained by two-step PCR amplification.All H. pylori flanking regions were amplified from genomic DNA of A45 strain using Tersus polymerase kit (Evrogen, Russia) according to the directions of the manufacturers.All corresponding primer sets listed in Supplementary Table 7 (2).Schematic map of the primer location and amplicon structure used for hp0008 and hp0944 gene disruption in H. pylori A45 strain illustrated by Supplementary Figure 5 (1) in Supplementary Materials.
Plasmid Construction pGEM-HpyIM-к (hpy) Two fragments of the gene encoding HpyAI methyltransferase of H. pylori strain A45 labeled IMl and IM2 were amplified using primers 1IMfw, 1IMrev, 2IMfvv, 2IMrev (Supplementary Table 9).For the full-length gene assembly, the following reaction mixture was composed: 200 ng of each IM1 and IM2 fragments, 20 pM of each primers 1lmfw and 2Imrev, 2mM dNTPs, 1xPCR buffer, 4u of Taq polymerase.PCR was performed with the following amplification conditions: 94°C -30 sec, 45°C -1 min, 72°C -1 min.30 sec.A total of 35 amplification cycles were performed.The resulting fragment, after being treated with Xhol restriction enzyme, was cloned into the pGEM-T-Easy plasmid (Promega, USA) which resulted in formation of the plasmid pGEM-HpylM.The aphA-3 kanamycin resistance gene was amplified from the pHel3 plasmid [1] using primers X-aph-fw and X-aph-rev, then the resulting fragment was digested with restriction enzyme Xhol and cloned into pGEM-HpylM plasmid treated with Xhol restriction enzyme, which resulted in formation of the plasmid pGEM-HpyIM-k.

Motif enrichment algorithm
The input data required by the algorithm are a list of k-mers (by default k = 15) that are likely to bring a modified base in the FASTA format, a list of long k-mers (by default k = 29), a reference genome in the FASTA format, signal shifts for each considered short k-mer and signal shifts for each long k-mer.
The optional algorithm parameters are the desired confidence level threshold conflevel (default is 100 to provide higher sensitivity) and the maximum number of extracted motifs max_motifs (default is 20).
1) The first stage is the generation of motif_variants, the list of all potential motif variants.An individual motif variant is a vector with size of k looking as follows: where the i-th element represents the nitrogenous base (A, G, T, or C) located in the i-th position of k-mer.
The maximum size of motif_variants is limited by the total set of k-mers presented in the reference genome.Since on this step we mainly focus only on R-M systems of types II or III these motif variants should satisfy certain constraints that are common for all Type II-III R-M system recognition sites.These constraints are listed below: 2) Next, the algorithm generates the reference_set list containing all k-mers presented in the reference genome.It will be used as a control set.The set of all k-mers that are most likely to bring a modified base will be named seq_set.
3) Next, the algorithm performs the chi-square test for each motif presented in seq_set in order to compare its frequency in seq_set with the frequency in reference_set.After all motifs in seq_set are processed, the motif_variants list is sorted according to the chi-square statistic value in descending order.The top motif is extracted and adjusted.
4) The adjustment process works as follows.All `N`bases in the considered motif variant that are neighboring with canonical bases are iteratively changed to 'A', 'C', 'G', and 'T' and the frequency of resulting motifs is compared with the reference.If the A:C:G:T ratio in one particular position in seq_set is close to reference_set, this base remains to be 'N'.Otherwise, it changed to a more suitable letter (Y, W, S, H, etc).For the adjustment, the algorithm uses only non-filtered seq_set regardless of iteration.
5) Next, the algorithm checks for the completeness of the extracted motif.It collects all k-mers that match the considered motif from the reference_seq, merges signal levels for these motifs for native and control samples independently and compare them using K-S test.If the K-S test p-value is lower than the threshold, the algorithm considers the motif as incomplete.If the motif is incomplete, the algorithm checks if this motif is a part of a long motif.
To do that, the algorithm collects all long modified k-mers that contain this pattern, and performs a local motif enrichment with the logic similar to the main enrichment algorithms.
The algorithm generates a list of all long motif variants, where each long motif variant should satisfy the heuristics generated using the Golden Standard REBASE dataset of long motifs (http://rebase.neb.com/cgi-bin/trdlist?g).For each long motif variant, the algorithm performs the chi-square test, returns 15 motifs with the highest chi-square statistics value and writes to the output file.If the considered short motif is actually a part of a long motif, these 15 long motif variants are quite similar, and the top variant describes the actual long motif sequence.
Otherwise, the motif should be considered as short but not fully-methylated in the reference genome.
6) Next, the algorithm checks if the adjusted motif is a submotif for one of the extracted earlier motifs.If it is, the motif is transformed to its supermotif.For example, if we have already extracted the motif 'NNNCATGNNNN' with higher confidence level, and the current adjusted motif variant is 'NNNNCATGWNN', it will be transformed to 'NNNNCATGNNN'.
7) Thereafter, the algorithm extracts the adjusted motif variant and performs filtering of the seq_set by removing all k-mers that match this motif variant.
8) If the confidence level of the last extracted motif is less than conflevel or the number of extracted motifs equals max_motifs, the algorithm stops.Otherwise, the algorithm goes to 3) and repeats processing of the filtered seq_set to try to extract the next motif.
The algorithm output is the list of extracted motifs.

Long methylation motifs detection example
The main Snapper pipeline is designed to detect only short methylation motifs with a maximum length of 8 bases.However, we developed an additional module that performs local enrichment for motifs that seem incomplete (details are available in Supplementary Materials 3, step 5).As a result, this module proposes a list of longer motif variants and corresponding confidence levels.For example, in N. gonorrhoeae FA 1090, the main Snapper pipeline discovers a short GCA motif as methylated but not complete.The long motifs enrichment module supposes that GCA is a TRD1 recognition site sequence and searches for the most probable TRD2 sequence.In this particular case, the best statistics was obtained for GCANNNNNNTGC which is an actual Type I R-M system recognition site in the considered N. gonorrhoeae strain.The algorithm saves the modified contexts containing the target motif so that a user could perform motif enrichment with MEME to additionally approve the motif sequence or ise Nanodisco to define the methylation type.
In cases when TRD2 sequence with satisfactory confidence has not been found, the algorithm concludes that the motif is short but not fully-methylated.confidence level legend: red <= 500 500 < yellow <= 3000 3000 < green effect size legend: red <=0.25 0.25 < yellow <= 0.5 0.5 < green

TCNNGA motif in J99 strain
Snapper did not extract the TCNNGA motif as methylated while PacBio did.To verify the results we manually observed all the modified context containing this motif.There were 1255 different 11-mers that had a significant signal shift and contained TCNNGA subsequence.Actually, TCNNGA is a quite highly-represented motif in the J99 genome that might cause a false-positive inference.We checked the presence of other motifs in each context and found that almost all of them (1190 contexts) were explained by the presence of one or more other confirmed methylation motifs (Supplementary Listing 1).Among the other 65 probably modified contexts more than half (38) contained homopolymeric fragments which as we found might often cause a false-positive signal shift.Generally, despite a quite high number of TCNNGA-containing contexts the number of unexplained contexts containing only TCNNGA but not any other methylation motifs is insufficient for TCNNGA inference.Thus, we can conclude that TCNNGA is not an individual methylation motif in our J99 strain.

Supplementary Listing 1.
A shortened list of TCNNGA-contaning contexts that are likely to bring a modified base.The second column shows the presence of a non-TCNNGA motif that explains the signal shift in each case.Only 25 out of 1255 contexts are shown just as an example.In total, 1190 TCNNGA-contaning contexts that have a significant signal change are explained by other motifs.

LC-MS analysis
Liquid chromatographic separation was performed on a reverse phase column (15 cm × 75 µm i.d., Agilent Technologies, USA) packed with Zorbax 300SB-C18 resin (particle size -3 um, pore diameter -100 A) that was coupled to a Q-Exactive HF hybrid quadrupole-Orbitrap mass spectrometer (Thermo Fisher Scientific, Germany) via a nanoelectrospray source Nanospray Flex (Thermo Fisher Scientific, Germany).Source voltage was set at 2200 V and capillary temperature at 325°C.Each sample was introduced through EASY-nLC (Thermo, USA) chromatography system in trap-elute configuration (trap column was a 2 cm × 75 µm i.d.Acclaim PepMap column by Dionex, USA, with C18 resin with 3 um-particles with 100 A pores).Samples were introduced onto trap-column with 10 uL of solvent A (0.1% v/v formic acid) at constant pressure 500 bar.Peptides were eluted with a gradient of0 5 to 50 % (v/v) of solvent B (0.1 v/v formic acid, 79.9 v/v acetonitrile) across 60 minutes at flowrate of 500 nl/min in 3 linear steps (10 minutes to 10% B, 25 min to 20% B, 15 min to 35% B, 10 min to 50% B).After each elution system and columns were washed with 100% of solvent B for 10 minutes and regenerated with 5% of solvent B for 20 minutes.
The mass-spectrometer was operated in positive mode in a data-dependent experiment with survey scans acquired at a resolution of 120,000 at m/z 400 within m/z range of 200-1600 with automatic gain control set for 3x106 and maximum injection time of 32 ms.As many as 20 of the most abundant precursor ions with a charge +2 and above from the survey scan were selected for HCD fragmentation.The normalized collision energy was 27.MS2 spectra were acquired at resolution of 7500 at m/z 400, automatic gain control was set for 2x105 and maximum injection time for 32 ms.After fragmentation ions were dynamically excluded from consideration for 45 s.

Quantification
Proteins were relatively quantified using the MaxQuant software version 1.6.10.43.Raw files were searched with an integrated Andromeda search engine against the core peptide database.that consisted of peptides that could be produced by strictly one protein.Trypsin/P was specified as the cleavage enzyme, and two missed cleavages were permissible, minimum peptide length was 7. The FDR was set to 0.01 for peptide spectral matches.
Protein abundance was estimated as a total intensity of its 3 most intense peptides.Difference in each protein's abundance was considered significant if it allowed the FDR to be below 0.05, in addition.

1 .
Schematic map of the primer location and amplicons structure used for hp0008 and hp0944 gene disruption in H. pylori A45 strain.

Table 2 .
Oligonucleotides used in this study.
. A motif variant should not contain more than 8 letters other than N. 6.A motif variant should not contain more than two sequential N inside.
1.A motif variant has strictly k letters.2. A motif variant should contain at least 3 non-degenerate letters.3. A motif variant should contain either A or C nucleotides (or both).4. A motif variant should not contain individual non-degenerate nucleotides surrounded by two N. 5

Table 5 .
The comparison of Snapper, Nanodisco and Tombo on four external datasets.

Table 11 .
The list of proteins significantly overrepresented in hpy and hp91/92.

Table 12 .
BLAST annotation of the proteins overrepresented in hpy and hp9192 mutants.