Precise characterization of somatic complex structural variations from tumor/control paired long-read sequencing data with nanomonsv

Abstract We present our novel software, nanomonsv, for detecting somatic structural variations (SVs) using tumor and matched control long-read sequencing data with a single-base resolution. The current version of nanomonsv includes two detection modules, Canonical SV module, and Single breakend SV module. Using tumor/control paired long-read sequencing data from three cancer and their matched lymphoblastoid lines, we demonstrate that Canonical SV module can identify somatic SVs that can be captured by short-read technologies with higher precision and recall than existing methods. In addition, we have developed a workflow to classify mobile element insertions while elucidating their in-depth properties, such as 5′ truncations, internal inversions, as well as source sites for 3′ transductions. Furthermore, Single breakend SV module enables the detection of complex SVs that can only be identified by long-reads, such as SVs involving highly-repetitive centromeric sequences, and LINE1- and virus-mediated rearrangements. In summary, our approaches applied to cancer long-read sequencing data can reveal various features of somatic SVs and will lead to a better understanding of mutational processes and functional consequences of somatic SVs.


INTRODUCTION
Structural variations (SVs) have been known to play an important role in cancer patho genesis. Ad vances in highthroughput sequencing technologies have enabled us to perform genome-wide somatic SV detection, and a number of cancer-driving SVs have been identified (1)(2)(3). On the other hand, millions of repetiti v e elements are widely distributed throughout the human genome, which hinders unambiguous alignment by current standard short-read technologies. According to several computational predictions, such repeat sequences comprise one-half to two-thirds of the human genome ( 4 , 5 ). Since the majority of the current sequencing data is collected using short-read sequencing technologies, se v eral classes of SVs, especially those whose br eakpoints ar e located in these r epeat r egions, have been 1. Canonical SV module can capture not only most of the SVs that can be identified using short-read sequencing platforms but also additional ones. 2. For insertions, the full-length inserted sequences obtained by the nanomonsv allowed us to characterize their genetic properties (such as 5 truncations, internal inversions, and target site duplications) and to identify source sites for 3 transduction mediated by LINE1. 3. Single breakend SV module of nanomonsv can identify single breakend SVs w here onl y one breakpoint is iden-tified because the other breakpoint is typically located in repetiti v e regions. Examples of single breakend SVs include LINE1-mediated r earrangements, r earrangements associated with centromeric regions, and viral integrations.

Whole genome sequencing using o xf or d nanopor e technologies and illumina no v aseq 6000
The cell-lines used in this study (COL O829, COL O829BL, H2009, BL2009, HCC1954 and HCC1954BL) were obtained from ATCC (American Type Culture Collection). For Oxford Nanopore Technologies (ONT) sequencing data, high-molecular-weight (HMW) genomic DNAs were extracted from these cell-lines with QIAGEN Genomictip 500 / G (QIAGEN). HBV-positi v e li v er cancer cell-line PRC / PRF / 5 was obtained from the JCRB cell bank (National Institutes of Biomedical Innovation, Health and Nutrition), and HMW-genomic DNA was isolated using SmartDN A chip (Anal ytik Jena). DN A libraries were then pr epar ed using the Ligation Sequencing Kit 1D and sequenced on the PromethION platform with R9.4.1 flow cells, to generate fast5 files. Then, these fast5 files were basecalled and converted to FASTQ files using Guppy 3.4.5.
Then, these were aligned by minimap2 with '-ax map-ont -t 8 -p 0.1' option to the human r efer ence genome provided at the Genomic Data Commons w e bsite (GRCh38.d1.vd1). Summary statistics were calculated using NanoStat package ( 22 ) after removing secondary and supplementary alignments from BAM files. For Illumina short-read sequencing data, we performed Illumina Novaseq 6000 with a standard 150 bp pairedend read protocol, and these were aligned by BWA-MEM ( 23 ) version 0.1.17 to the same human r efer ence genome and were sorted by the genomic coordinates, followed b y remov al of PCR duplicates via biobambam ( https:// github.com/gt1/biobambam ) version 0.0.191 as previously described ( 24 ). In addition, we performed somatic structural variation detection using manta ( 25 ), SvABA ( 26 ), GRIDSS ( 27 , 28 ), GenomonSV, and TraFiC-mem ( 20 ) (see Supplementary Text for detail).

Ov ervie w of nanomonsv
Here we describe an ov ervie w of nanomonsv. A more detailed description of the algorithm can be found in the Supplementary Text. In this paper, SVs were largely classified into two categories: • Canonical SV: SVs characterized by two breakpoints and inserted sequences between them. These SVs include insertions where two breakpoints are typically close together. • Single breakend SV: SVs characterized by a single breakpoint and the sequence after the breakpoint, which are often not uniquely aligned to the r efer ence genome, and their positions are not precisely located.
Nanomonsv consists of two related detection modules designed to detect each of the above SVs; Canonical SV PAGE 3 OF 18 Nucleic Acids Research, 2023, Vol. 51, No. 14 e74 module and Single breakend SV module. Prior to performing nanomonsv, we assume that both tumor and control sequence files are already aligned to the r efer ence genomes with minimap2 ( 29 ). The procedures of Canonical SV module and Single breakend SV module are depicted in Figures  1 and 2 A, respecti v ely.
Both modules consist of four steps: parsing, clustering, refinement, and validation. In the 'clustering' step, the reads from the tumor sample that presumably cover the same SVs are clustered as SV candidates with possible breakpoint ranges. If we observe apparent supporting reads in the matched control sample or in non-matched control panel samples (30 Nanopore sequencing data from the Human P angenome Refer ence Consortium ( 30 ) ar e used in this study), these are eliminated. The 'refinement' step in Canonical SV module plays an essential role in determining the single-nucleotide resolution breakpoints as well as error-corrected inserted sequences using the modified Smith-Waterman algorithm, which allows one-time jump from one genomic region to the other (see Supplementary Figure S1 and similar algorithm in a previous study ( 31 )). Particularly, polished inserted sequences are beneficial for classifying and characterizing insertion e v ents. The last 'validation' steps in both modules thoroughly confirm whether the candidate SV is truly specific for the tumor. More specifically, aligning the putati v e SV segment sequence to each read close to putati v e breakpoints enables precise detection of variant supporting reads, especially those partially covering the breakpoints and not counted in the parsing step (similar approaches have been attempted in previous studies ( 32 ), albeit in a different context than somatic SV confirmation). Lastly, for deletions and insertions, we focus on those whose sizes are 100 bp or larger. We also removed deletions and insertions confined within simple repeat regions.
We also de v eloped a wor kflow to characterize putati v e single breakend SVs by realigning the consensus sequence to the r efer ence genome and ex ecution of RepeatMasker (Figure 2 B, see Supplementary Text for detail). For SVs specifically identified by Single breakend SV module, if their breakpoints on the other side were unambiguously identified, they were reclassified as canonical SVs. They included SVs that were filtered out in Canonical SV module because they did not marginally exceed the threshold in the various filtering steps.

PCR validation
To generate primer sequences for PCR validation for each canonical somatic SV, we first pr epar ed the sequence template by concatenating 800 bp nucleotides from the first breakpoint, the inserted sequence, and 800 bp nucleotides from the second breakpoint. Then, the Python bindings of Primer3 ( 33 ) are performed, setting the sequence target as 25 bp nucleotides from the first breakpoint, the inserted sequence, and 25 bp nucleotides from the second br eakpoint. Her e, we cr eated fiv e pairs of primer sequences for each primer product size range of 201-300, 301-400, 401-500, . . . , and 1501-1600. Next, we performed Genome-Tester ( 34 ) to remove pairs of primer sequences that have too many binding sites ( > 5 for left or right primers) and too many alternati v e PCR products (more than two for insertion and deletion and more than one for other types of SVs). Finally, for each somatic SV for validation, we selected one primer pair that has a smaller product size, less number of primer binding sites, and alternati v e PCR products.
To design a primer for highly repetiti v e sequences such as centromere and telomere, we selected primer sequences that were expected to occur once in the sample genomes. For example, for these primer sequences, we should be able to observe them about 15 times in a 30x coverage FASTQ sequence (in the haploid r efer ence genome). Ther efor e, we designed the primer sequence as follows: 1. We parsed k -mers ( k = 19) from the original FASTQ, and calculated the histogram for each k -mer. 2. For each k -mer subsequence in the assembled contigs for single breakend SVs, we masked it with 'N' if it occurred less than 8 times or more than 50 times. 3. We conca tena ted the pr e-br eakpoint sequence and assembled contigs (we limited to 2000 bp) masked by the above, and designed primers using primer3 on this sequence.
All PCR r eactions wer e performed in a total of 20 ul volume using 10 ul of Go Taq Master Mix (Promega), 1 ul of each primer (final 0.5 nM), 1 ml of gDNA (20 ng), and 8 ul of double-distilled water. The PCR samples were denatured at 95 • C for 2 min, subjected to 40 cycles of amplification (95 • C for 30 s, 55 • C for 30 s and 72 • C for (product size (bp) / 1000) min and followed by a final extension step a t 72 • C . A list of primers is provided in Supplementary Data 3. PCR products were resolved by agarose gel electrophor esis. Repr esentati v e PCR products were purified using QIAquick Gel Extraction Kit (Qiagen) according to the manufactur ers' r ecommended protocols. Finally, the purified samples were subjected to direct capillary sequencing (eurofin). All sequence data were analyzed using A pE ( https://jorgensen.biolo gy.utah.edu/wayned/a pe/ ) and the Chromas Lite viewer (Technlysium Pty., Ltd.).

Evaluation of nanomonsv using benchmark dataset and simulation
For highly reliable somatic SV sets, we used two datasets. The one is high-confidence somatic SV files obtained from the high-coverage NovaSeq data ( 35 ) ( https://www.n ygenome.org/bioinf ormatics/3-cancer-celllines-on-2-sequencers/COLO-829-NovaSeq--COLO-829BL-NovaSeq.sv.annota ted.v6.soma tic.high confidence. final.bedpe ). The other is from somatic SV truth set generated by multi-platform and experimental validation ( 36 ) (truthset somaticSVs COLO829.vcf available at https://zenodo.org/recor d/3988185 ), which is conv erted to GRCh38 coordinates with liftOver. We removed insertions and deletions with ≤100 bp lengths because these were the out-of-score in this paper. For high coverage Nanopore sequence data (ERR2752451, ERR2752452) and PacBio sequence data (ERR2808247, ERR2808248) of COLO829 and its matched control, we downloaded FASTQ sequencing data of ENA study accession PRJEB27698 ( 36 ), and Workflow of somatic SV detection in nanomonsv Canonical SV module. Canonical SV module for nanomonsv consists of the following four steps. P arsing: the r eads likely supporting SVs ar e extracted fr om both tumor and matched contr ol BAM files using CIGAR string and supplementary alignment information. Clustering: the reads from the tumor sample that presumably span the same SVs are clustered, and the possible ranges of breakpoints are inferred for each possible SV. If there exist apparent supporting reads in the matched control sample (or non-matched control panel samples when they are available), these are also removed. Refinement: Extract the portions of the supporting reads around the breakpoints, and perform error-correction using racon ( 78 ) to generate a consensus sequence for each candidate SV. Then, aligning the consensus sequence to those around the possible breakpoint regions in the reference genome using a modified Smith-Waterman algorithm (which allows a one-time jump from one genomic region to the other, see Supplementary Figure S1), we identify the exact breakpoint positions and the inserted sequence inside them. Validation: From the breakpoint determined in the previous step, we generate the 'putati v e SV segment sequence.' Then we collect the reads around the breakpoint of putati v e SVs and check whether the putati v e SV segment sequence exists (then the read is set as a 'variant supporting read') or not (then the read is classified to a 'r efer ence r ead') in each read of the tumor and matched control. Finally, candidate SVs with ≥3 variants supporting reads in the tumor and no variant supporting reads in the matched control sample are kept as the final SVs. See Supplementary Text for detail. aligned to the r efer ence genome with minimap2 to the GRCh38 r efer ence genome and sorted and index ed using samtools. Then, nanomonsv was performed on these data as described in the previous section.
For the comparison with nanomonsv, we adopted 'separate detection and subtraction a pproach', w here we independentl y a pplied standard SV detection tools (Sniffles ( 10 , 37 ) ( https://github.com/fritzsedlazeck/Sniffles ) version 2.0.7, cuteSV ( 15 ) version 2.0.0, Delly ( 38 ) version 1.0.3, SVIM ( 13 ) version 2.0.0) to both tumor and matched control samples with different thresholds, and eliminated the SVs called in matched controls from those found in tumors. We first aligned the FASTQ files of tumor and matched control using minimap2 with the same setting with nanomonsv.
Then, we performed Sniiffles, cuteSV, Delly and SVIM on tumor and matched control B AM files, separ ately. The option of each software were: • Sniffles: '-minsupport 1 -non-germline' For each method, we extracted SVs from tumor samples with ≥3 supporting reads (5 ≥ for high coverage data from PRJEB27698) and removed those whose consists of the f ollowing f our steps. P arsing: the r eads putati v ely supporting single breakend SVs are extracted from both tumor and matched control BAM files using soft clipping information in the CIGAR strings. Clustering: the reads from the tumor sample that presumably support the same single breakend SVs are clustered. The candidates are removed if apparent supporting reads are detected in the matched control sample (or non-matched control panel samples when they are available). Refinement: Gather the soft-clipped part of the reads with 100 bp margins inside the breakpoints and generate an error-corrected consensus sequence by two round iterations of all-vs-all alignment by minimap2 ( 29 ) and polishing with racon ( 78 ). Then, aligning the consensus sequence to those around the possible breakpoint regions by Smith-Waterman algorithm, we detect single base resolution breakpoints and the consensus sequence after the breakpoint. Validation: from the breakpoint determined in the previous step and the error-corrected consensus sequence after the breakpoint, we generate the 'putati v e SV segment sequence.' Then, as with Canonical SV module, the reads around the breakpoint of putati v e single breakend SVs are classified into 'variant supporting read' or 'reference read' for both tumor and matched control. breakpoints overlapped with any of SVs detected from normal samples allowing for 200 bp margins. We also removed SVs confined within simple repeat regions. We also used CAMPHORsomatic ( 39 ) ( https://github.com/ afujimoto/CAMPHORsomatic ) on commit 7ad6bdb for somatic SV detection. We applied our own patch ( https://github.com/ncc-ccat-gap/module box aokad/blob/ master/20221005-CAMPHORsoma tic/SB CH.pa tch ) to CAMPHORsomatic since it could not be executed without tha t modifica tion. We run CAMPHORsoma tic with the default setting.
For simulations, we pr epar ed two haploid human genomes; extracted 22 autosomes and chromosome X from the human r efer ence genome (GRCh38), and injected in-silico germline SVs (2500 duplications, 5000 indels , 100 inversions , 50 in version-deletion, and 50 in versionduplications) using the 'simSV' command of by SUR-VIVOR (version 1.0.6, https://github.com/fritzsedlazeck/ SURVIVOR ) ( 40 ). Then, we merged the haploid human genomes to make diploid human genomes with germline SVs to constitute an in-silico matched control genome. Then, we further generate 'somatic SVs' (100 duplications, 200 indels, 100 translocations and 100 inversions) on the in-silico matched control genomes to make up an in-silico tumor genome. Since the coordinate system of the simula ted soma tic SVs is based on the in-silico matched control genome, we converted the coordinate system of the simula ted soma tic SV list back to the GRCh38. Next, we performed NanoSim ( 41 ) ( https://github.com/bcgsc/NanoSim , version 2.6.0) on this in-silico tumor and matched control genome to generate Nanopore-like tumor and two matched control (one is literally for matched control data and the other is for mixing with tumor sequencing data) sequencing data. After learning the parameters using Nanopore reads of COLO829BL aligned to chromosome 22 via the r ead analysis.p y script, we generated simulated Nanopore reads with sufficient depths ( ∼180Gb yields) via the simulator.py script. These FASTQ files were aligned with min-imap2 to generate BAM files. Finally, we sub-sampled Nanopore-like BAM files to generate tumor and matched control BAM data with specified sequencing amounts (10x, 20x, 30x, 40x, and 50x) and the tumor purities (0%, 20%, 40%, 60%, 80% and 100%) and performed nanomonsv as well as Sniffles, cuteSV, Delly, SVIM and CAMPHORsomatic as described above to obtain somatic SV calls from each method.

Methylation analysis
To quantify the amount of methylation, we used nanopolish version 0.11.1 ( https://github.com/jts/nanopolish ). First, we performed the 'nanopolish index' command from the original fast5 file to generate the index that associates read IDs and their signal-de v el data. Then, we executed the 'nanopolish call-methylation' command to make the TSV file summarizing the log-likelihood ratio for methylation for each read ID and genomic position. Then, we obtained the methylation frequency at each genomic position using the script provided on the software w e bsite. To measure the significance of methylation frequency difference between the tumor and the match control at each LINE1 source element, we first calculated the P -value at each locus using Fisher's exact test with the alternative hypothesis of onesided, and then obtained an asymptotically exact P -value using harmonicmeanp package version 3.0 ( 42 ).

Calculation of higher -or der r epeat match scor e
First, single breakend SVs that are classified as 'High Repeat single breakend SVs' and that are mostly annotated with 'Satellite / centr' by RepeatMasker are e xtracted. Ne xt, we executed the StringDecomposer ( 43 ) version 1.1.2 for each contig against the final monomer FASTA files generated by HORmon ( 44 ) (cen* monomers.fa files under the monomersFinal directory, downloaded from https: //figshare.com/articles/dataset/HORmon/16755097/1 ). Then, for each chromosome monomer file result (final decomposition.tsv), the degree of monomer concordance is calculated. More specifically, we read the result the files one line at a time, and if the pr e-/ post-r elationship of the monomers (curated from cen* hors.tsv files from HORmonHORs directory, see Supplementary Data 1) is consistent, ( < end-pos > -< start-pos > ) * < identity > / 100 is added, and the divided by the length of the contig is the HOR match score (see Supplementary Figure S2).

Comparison with short-read sequencing data
We used three cancer cell-lines (COLO829, H2009 and HCC1954) and their matched controls (COLO829BL, BL2009 and HCC1954BL) for the evaluation (see Table 1 for the detailed description of these cell-lines). Long-read whole-genome sequencing was conducted using GridION and PromethION. The total outputs were 59.13 to 156. 30 Gbps, and the N50 sequence lengths ranged from 14 309 to 24 501 bp (see Table 1 , Supplementary Figure S3). To compare with a short-read platform, we also performed highcoverage sequencing of these three paired cell-lines using Illumina Novaseq 6000 platform. The total amounts of yield after polymerase chain reaction (PCR) duplication removal were 205.76 Gbps to 484.26 Gbps.
A ppl ying nanomonsv to these long-read data and rescuing canonical SVs identified from Single breakend SV module, we identified 49, 724 and 748 canonical SVs for COLO829, H2009 and HCC1954, respecti v ely (Figure 3 A, Supplementary Figure S4, Supplementary Data 2). Those included 39 SVs that were specifically identified by Single breakend SV module and reclassified into canonical SVs. For the evaluation of precision, we performed the PCR on 139 randomly selected SVs, and 132 (94.9%) showed tumor sample-specific bands with predicted product sizes (see Supplementary Data 3, Supplementary Figures S5, 6). Except for insertions, the validated ratio was reasonably high [96.1% (99 / 103)]. A relati v ely low valida tion ra tio for insertions [89.92% (33 / 37)] might be partly due to the larger size of their PCR products. Even for the insertions not v alidated b y PCR, we observed tumor-specific supporting reads by manual inspection with a genome viewer ( 45 ) in most cases (Supplementary Figure S5c). To evaluate recall, we compared with SVs commonly detected by four algorithms (manta ( 25 ), SvABA ( 26 ), GRIDSS ( 27 , 28 ) and GenomonSV) in the short-read platform, which were considered to be 'true' somatic SVs with a high degree of Table 1. Summary statistics of long-read (Nanopore) and short-read (Illumina) data from six cell-lines. COLO829 (from a metastatic cutaneous melanoma patient) and COLO829BL (from a lymphoblastoid line of the same patient) have been often used as a benchmark in many previous studies ( 35 , 47 , 77 ). Although this cell-line has been known to have hypermutated nature for somatic single nucleotide variants as well as double nucleotide ones, the number of somatic SVs seems to be relati v ely low. H2009 (from metastatic lung adenocarcinoma) has many long insertions mainly by high LINE1 activity and has been used in studies investigating the mechanism of MEIs ( 20 , 21 ). HCC1954 (from ductal breast carcinoma) and HCC1954BL also have been frequently used as a benchmark (TCGA mutation calling benchmark 4, https://gdc.cancer.gov/ ) and seem to have a relati v ely large number of somatic SVs. Although these cell-lines have been used in many studies, there have been few efforts to characterize e xhausti v e and accurate lists of somatic SVs from these cell-lines  (Figure 3 B), suggesting the high sensitivity of nanomonsv on long-read sequencing data e v en for relati v ely low cov erage compar ed to short-r ead sequencing data. For COLO829, H2009, and HCC1954, 6, 87 and 51 (7.1-12.0%), respecti v ely, were ne wly detected by long-read sequencing data (not identified by any of the four algorithms or by TraFiC-mem ( 20 ) applied to high-coverage Illumina short-read sequencing data). These long-read specific SVs were also validated by PCR method with similar accuracy as SVs detected in the short-read technology (Supplementary Figure S5a). These long-read specific SVs were mostly insertions or SVs with two breakpoints located in repeat or low-complexity regions (Supplementary Figure S7). For instance, the somatic translocation connecting chromosomes 3 and 6 (chr3:26390429-chr6:26193811) in COLO829 was missed by Illumina sequence data, probably because the short-read alignment was highly ambiguous around the breakpoint of chromosome 3 (overlapping with LINE1 annotation). Some of the SVs in this category had clear signals of copy number changes around the breakpoints (Supplementary Figure S8), giving another evidence that they were genuine somatic SVs.
Breakpoint positions detected by nanomonsv on ONT sequencing data were mostly (96.7%) within two bp of those detected by Illumina sequencing data (Figure 3 C), despite the difference in error rate between the two platforms. Ther efor e, r easonably accura te identifica tion of breakpoint positions is possible with error correction and careful examination of supporting reads from error-prone long-read sequencing.
Ninety-nine somatic SVs were those affecting known cancer-related genes ( 46 ). These included important cancer genes such as the 12 kb deletion of PTEN in COLO829 ( 47 ) and the 5kb deletion of STK11 in H2009 though these were also identified by the short-read platform.

Evaluation of nanomonsv using benchmark dataset and simulation
We compared 49 somatic SVs obtained by nanomonsv using ONT sequencing data of COLO829 with high-confidence somatic SV sets for the same cell-line generated by high-coverage short-read platforms and multiple variant callers ( 35 )  . Although the recall was slightly lower for Arora benchmark, the number of supporting reads for their sequence data was generally small (Supplementary Figure S9b).
To evaluate the importance of the a pproach jointl y utilizing tumor and matched control samples, we separately applied regular SV detection tools (Sniffles2 ( 37 ), cuteSV ( 15 ), Delly ( 38 ) and SVIM ( 13 )) to tumor and matched control samples with different thresholds, and filtered out the SVs called in matched controls from those in tumors (we call this approach as 'separate detection and subtraction approach'). The precision and recall of this approach were inferior to those of nanomonsv, suggesting that sim ultaneousl y utilizing tumor and matched control data is effecti v e for the sensiti v e and accura te identifica tion of somatic SVs (Figure 3 F, Supplementary Figure S9c). We hav e also e valua ted the software CAMPHORsoma tic ( 39 ), which handles tumor and matched control samples simultaneously. The precision and recall of nanomonsv were better than those of CAMPHORsomatic (Figure 3 F, Supplementary Figure S9c). Next, we evaluated the performance of nanomonsv using simulation data with different tumor purity and sequence yields. Overall, the precision and recall of nanomonsv were superior to other approaches. Although the recall ratio became small for very low tumor purities and sequence yields, precision was relati v ely stable, implying the robustness of nanomonsv (Supplementary  Figures S10-S12). Especially in the case of low tumor purity, the sensitivity is significantly reduced without a decent sequence yield. Ther efor e, e v en for long reads, it is desirab le to hav e 30-40 × cov erage (roughly equi valent to 90-120 Gbps yield) as in typical short-read-based studies ( 48 ).

Characterization of mobile element insertions
Canonical SV module identified a total of 509 insertions, among which 492 were from H2009. For insertions, our approach can identify complete inserted sequences as well as inserted positions. Ther e ar e many possible types of insertions, such as tandem duplication, mobile element insertions (MEIs), viral sequence integration, and processed pseudo gene. To systematicall y characterize the inserted sequences, especially focusing on MEIs, we have developed a pipeline for classifying the inserted sequences based on comparison with transcriptome, annotation with repeat sequence information, and re-alignment to the r efer ence genome (see Figure 4 A). First, if the inserted sequence significantly matched with a transcript, the insertions were classified into processed pseudo gene ( 49 , 50 ), w hich are copies of mRNAs integrated into the genome by re v erse transcriptase acti vity of LINE1 elements. We identified two processed pseudogenes affecting IBTK and CARNMT1 genes in H2009 (see Supplementary Figure S13). Although the existence of these pseudogene insertions had been identified by the short-read platform using the same cell-line ( 49 ), a detailed structure of the entire inserted sequence such as the position of the inversion breakpoint was first confirmed in this study.
Second, when either of three major mobile elements (LINE1, Alu and SVA) covered most of the inserted sequence ( ≥80% by Repea tMasker, http://www.repea tmasker. org ), the inserted sequences were categorized into each class. We identified 323 LINE1 and 15 Alu insertions in thr ee cell-lines, r especti v ely (Figure 4 B). The LINE1 insertions are frequently accompanied by inversion at the 5 end, whose mechanism can be explained by 'twin priming' ( 51 ). In fact, by investigating inserted sequences, the 5 inversions were observed in 81 (25.1%) of LINE1 insertions. In addition, 5 inversions were frequently accompanied by the partial loss of internal LINE1 sequences, which might occur during the integration process (Figure 4 C). We also observ ed other comple x structural changes. One e xample was 1100 bp insertion at chromosome 14, which was a direct conca tena tion of 160 bp 5 end and 900 bp 3 end LINE1 sequence without a 5 inversion. These diversities of insertion structures produce some deviations between inferred insert sequence lengths from short-read and long-read sequence data (Supplementary Figure S14) because accurate inference of the insert nucleotide length from short-read sequencing data is difficult.
Next, the remaining insertions were aligned to the human genome to explore LINE1 3 transductions, in which unique DNA segments downstream of LINE1 elements are mobilized as part of aberrant retrotransposition e v ents ( 52 ). Transposed sequences can be a combination of LINE1 elements and their downstream sequences (partnered transductions) or only downstream ones (orphan transductions).
When a LINE1 element existed upstream of the aligned site of inserted sequences, we can infer that the LINE1 element is the source of transduction. As possible LINE1 sources, we first extracted 5228 full-length evolutionarily recent primate-specific LINE1 elements from the human r efer ence genome (r efer ence putati v e LINE1 source elements). In addition, since se v eral acti v e non-r efer ence LINE1 sour ce elements can be detected as polymorphic insertions, we also included 652 and 2610 full-length LINE1 insertions identified in 1000 genomes Phase 3 ( 53 ) and gnomAD v2.1 ( 54 ), respecti v el y. Furthermore, w hen many inserted sequences were aligned to the same genomic locations, we searched for the germline LINE1 insertion near those positions from the normal sequence data and manually curated the putati v e rare germline LINE1 insertions that were considered as the sources of LINE1 3 transduction.
We identified 107 somatic 3 transduction e v ents (61 partnered and 46 orphan transductions) from 33 putati v e LINE1 source elements, of which 105 from 31 source elements were from H2009 ( Figure 5 ). Of the 24 LINE1 sources from the reference genome, 20 belonged to the human-specific LINE1 (L1HS) subfamily, three to L1PA2, and one to L1PA4 (the second and fourth youngest primatespecific subfamilies, respecti v ely). Nine were deri v ed from non-r efer ence LINE1 sour ce elements (four from 1000 Genome Phase 3, three from gnomAD, and two from manual cur ation), corrobor ating the importance of populationand individual-specific hot LINE1 elements ( 55 ). Several transductions included the 5 inversions, implying the same mechanism as solo L1 insertion, such as twin priming functions during re v erse transcription. For each LINE1 source element, 3 end positions of the inserted sequences tended to concentra te a t the close genomic positions. This may be because these 3 end positions are probably the location where the transcription is terminated, and the positions with a potency of transcription termination may be scattered because they r equir e some characteristic sequences. As localized h ypo-meth ylation of the LINE1 promoter region has been reported to dri v e the somatic activation as source elements ( 20 ), we quantified the methylation le v el using nanopolish ( 56 ) on raw signal-le v el data of ONT sequencing. For all the 23 r efer ence LINE1 sour ce elements, the methyla tion ra tios w ere significantly low er in tumors than in the matched controls (Figure 6 A, Supplementary Figure S15). We also identified two examples of nested LINE1 transduction ( 20 ), where somatically inserted LINE1 elements themselves became the source of the next LINE1 transduction (Figure 6 B).
The refinement step of the nanomonsv procedure performs error correction of the insert sequences. The accuracy of the insert sequences by nanomonsv was estimated to be mostly more than 95% (Supplementary Figure S16a). This refinement of inserted sequences enabled us to investigate the features such as target site duplications and polyA tails, which were frequently accompanied by MEIs (Supplementary Figure S16b). Target site duplications and poly-A tails were observed in 67.2% (314 / 467), and 96.8% (452 / 467), respecti v ely (Figure 4 D, E). These results suggest that longread sequencing has great potential for characterizing various mechanisms of genomic insertions.

SVs connected with centromere and telomere sequences
Single breakend SV module identified in a total of 91 somatic single breakend SVs (3, 38 and 50 in COLO829, H2009 and HCC1954, respecti v ely, see Supplementary Data 5,6). Of those, 32 single breakend SVs were bound to satellite (23 and 5 SVs for alpha satellite and human satellite sequences, respecti v ely) or simple repeat sequences (4 SVs).
Although e v en short-read sequences can be used to identify single breakend SVs with satellite or simple repeat sequences ( 28 ), long-read sequencing enables us to elicit more refined information about their nature by assembling the raw read after the breakpoint. In alpha satellite regions, various types of a pproximatel y 171 bp monomer sequences constitute high order repeat (HOR) structure per centromere region ( 44 ). In chromosome X, 12 di v ergent monomers are ordered to form an approximately 2000 bp canonical HOR (ABCDEFGHIJKL), which occupies most of the centromeric region over millions of bases ( 57 ). On the other hand, non-canonical forms of HOR structures specific to populations and individuals are occasionally observed ( 58 , 59 ). For each of the 21 single PAGE 13 OF 18 Nucleic Acids Research, 2023, Vol. 51, No. 14 e74 breakend SVs leading to alpha-satellites (excluding two that matched inacti v e alpha-satellite sequences), we e xamined the consistency of the contig sequence with the HOR pattern at the centromere of each chromosome by calculating the HOR match score (Figure 7 A, see Materials and Methods for details). At least, 12 single breakend SVs were estimated to be interchromosomal (Supplementary Data 6), and se v en of them corresponded to the deri vati v e chromosomes inferred by previous SKY karyotype experiments (resource hosted on the Cellosaurus w e bsite ( 60 , 61 )). Also, w e could validate 7 out of 8 using PCR (see Supplementary Figur e S17). Ther efor e, translocation involving centromer e sequences may be a frequent e v ent.
Most of the estimated HOR from the contig centromere sequence were canonical ones which are chromosomespecific and evolutionary defined ( 44 ). On the other hand, we identified non-canonical HORs in three single breakend SVs bound to alpha satellite sequences. One single breakend SV at chromosome 11 connected to the centromere sequence of chromosome X had a 17-mer monomer of ABCDEFGHIJKLHIJKL (Supplementary Figure S18).
We detected a single breakend SV joining a centromere sequence of chromosome 13 and complex rearranged regions in RB1 , a well-characterized tumor suppressor gene located in the region distant from the centromere sequences (Figure 7 B, Supplementary Figure S19). Furthermore, we identified a single breakend SV at chromosome 20 connected to chromosome 8 alpha satellite sequences with an inversion in the alpha satellite side near the breakpoint, which was validated by PCR (Figure 7 C). We have also identified three single breakend SVs leading to telomeric sequences (Figure 7 D, Supplementary Figure S20) ( 62 , 63 ), two of which, corresponding deri vati v e chromosomes have been detected by previous SKY (der (14)t(X;14) in H2009 and der(2)t(2;8;4) in HCC1954) ( 60 , 61 ). These observations suggest that SVs involving centromere and telomere sequences are common e v ents in cancer, and our approach can help re v eal their complex structures.

LINE1-mediated r earr angements detected by single br eakend SV module
Many contig sequences of single breakend SVs showed prominent pa tterns indica ti v e of LINE1-mediated rearrangement, where the first portion matched the LINE1 sequence and the remaining portion unambiguously matched the human genome sequence distant from the breakpoints (Supplementary Figure S21). Although its presence is widely known, LINE1-mediated rearrangement has been notoriously difficult to detect from short-read sequencing data.
In the H2009 cell line, where LINE1-mediated deletions were analyzed e xtensi v ely in pre vious studies using a shortread platform ( 20 , 21 ). Our analysis detected 12 LINE1mediated deletion and rearrangement e v ents. Ten of these were accompanied by local deletions (112-10430 bp), of which six had also been detected in previous studies. The newly detected ones tended to have shorter inserted LINE1 sequences. We also newly identified one large intrachromosomal rearrangement and one interchromosomal translocation mediated by LINE1 sequences. Three newly identified LINE1-mediated rearrangements were validated by PCR (Supplementary Data 7). Most LINE1-mediated SVs had a relati v el y simple structure w here different locations were connected via LINE1 segments. Howe v er, we also identified two complex LINE1-mediated rearrangements ( Figure  8 A, Supplementary Figure S22). One was predicted to be an insertion with a pproximatel y 30 000 bp in length from a distant genomic region, mediated by a 658 bp LINE1 segment and an orphan transduction. The other was an inv ersion e v ent affecting the CENPI gene with two breakends, one of which was deri v ed from a partnered transduction from a non-r efer ence LINE1 sour ce site on 3q21.1. In the HCC1954 cell line, we also identified one interchromosomal translocation mediated by a LINE1 and one putati v e Alu-mediated deletion (Supplementary Figure S23). While poly-A tails were observed in the majority of LINE1mediated rearrangements (10 out of 12), no rearrangements had target site duplications, consistent with previous studies ( 21 , 64 ).

Hepatitis B virus integration detection
Vir al integr ation into the cancer genome is fairly frequent in cancers such as human papillomavirus ( ∼8000 bp) in multiple cancers ( 65 ), hepatitis B virus (HBV) ( ∼3300 bp) in li v er cancers ( 66 ) and human T-cell leukemia virus type I ( ∼9000 bp) in adult T-cell leukemia / lymphoma ( 67 ). We have applied nanomonsv to a cell-line, PRC / PRF / 5, known to have HBV integration. Since there were no matched controls for this cell-line, we used BL2009 cell-line as a dummy matched control and just focused on HBV integration detection. We identified 12 HBV integrations. Most of these integrations were identified by Single breakend SV module because the integrations were usually accompanied by large deletions and translocations. Nanomonsv identified not only all the integrations identified in previous studies by Illumina short-read platforms but also one new integration (Supplementary Data 8). Howe v er, the advantage of long-reads is the ability to reconstruct the HBV insertion site and internal sequence completely. We observed that one integration had characteristic inverted duplication consisting of HBV and human genome sequences around the integration sites. For example, in the HBV-mediated rearrangement that connected the TERT gene promoter (known as the frequent HBV integration site ( 66 , 68 , 69 )) on chromosome 5 and to the locus of chromosome 13, intermittent segments of human and viral sequences formed inverted duplication (Figure 8 B). Furthermore, we identified an HBVmediated rearrangement at chromosome 8 connected to a Human Satellite 2 sequence, whose origin was predicted to be the one in chromosome 1 by alignment to the CHM13 r efer ence genome ( 70 ), suggesting that this e v ent is an HBVmediated interchromosomal translocation.

DISCUSSION
We proposed two approaches for identifying somatic structural variations (SVs), Canonical SV module and Single breakend SV module. Canonical SV module can identify the majority of the SVs identified from short-read platforms as well as novel ones. The precision and recall of Canonical SV module were demonstrated to be superior to the 'separate detection and subtraction approach' using existing SV detection tools. Furthermore, we have developed a workflow for detecting and classifying single breakend SVs (Single breakend SV module). We demonstra ted tha t it could identify complex SVs, such as those involving satellite sequences, LINE1-mediated rearrangement, and viral integration, which had been difficult to detect by short reads. We could determine the breakpoints of SVs with a singlenucleotide resolution with non-templated sequence insertions to some extent. Currently, most sophisticated algorithms on short-read platforms support single-nucleotide resolution detection using split-read evidence or local assemb ly. Howe v er, there has been little evaluation on the resolution of breakpoints of SVs using noisy long-read sequencing data. Identifying breakpoints at single-nucleotide resolution allows us to identify micro-homology and nontemplated sequence insertions, which can provide us with valua ble information a bout the mechanisms that generate SVs ( 71 , 72 ). In addition, it is highly preferable for comparison and annotation with SVs r egister ed in a public database.
In this paper, we did not focus on somatic VNTR / microsa tellite repea t e xpansion e v ents ( 73 ). Although long-read sequencing technology can potentially improve the detection of repeat expansion events, the current frame wor k based on the reference genome may not be appropriate to capture long repeat expansion e v ents because the r efer ence genome is not reliable at the location susceptible to these events. One possible approach to capture these e v ents may be to list microsatellite and VNTR r egions befor ehand, count the number of repeats using short tandem repeat aware alignment algorithms, and measure the difference in repeat count profiles between tumor and matched control data.
Although the current approach successfully identified somatic SVs and MEIs, detection of those present in the minority of cells (subclones) is still challenging with a modest sequencing depth. One way to deal with this is to perform target region amplification by adapti v e sampling ( 74 , 75 ). Another possibility to tackle this problem would be to combine single-cell sequencing technologies ( 76 ) with long-read platforms.
On the other hand, the interpretation of the detailed structure and properties of complex SVs is not fully automa ted a t present, and m uch of the work is done manuall y, which remains a challenge for processing many samples. For this purpose, there is a need to cover and classify more 'complex' forms of SVs. In addition, visualization methods need to be de v eloped to facilita te interpreta tion. It will also be necessary to establish an appropriate format for describing complex SVs in the future.
Single breakend SV module incorporates some assembly. Howe v er, it cannot detect SVs where both of the breakpoints are located in areas where reference genomes are not well-characterized, such as highly repetiti v e regions. It will be necessary to obtain and utilize a complete r efer ence genome for each individual ( 70 ) or consider using a graph genome that covers a major variation of human genomes ( 30 ).

DA T A A V AILABILITY
The raw Oxford Nanopore sequence data and Illumina short-read sequence data used in this study are available through the public sequence repository service (BioProject ID: PRJDB10898).

SUPPLEMENT ARY DA T A
Supplementary Data are available at NAR Online.