BleTIES: annotation of natural genome editing in ciliates using long read sequencing

Abstract Summary Ciliates are single-celled eukaryotes that eliminate specific, interspersed DNA sequences (internally eliminated sequences, IESs) from their genomes during development. These are challenging to annotate and assemble because IES-containing sequences are typically much less abundant in the cell than those without, and IES sequences themselves often contain repetitive and low-complexity sequences. Long-read sequencing technologies from Pacific Biosciences and Oxford Nanopore have the potential to reconstruct longer IESs than has been possible with short reads but require a different assembly strategy. Here we present BleTIES, a software toolkit for detecting, assembling, and analyzing IESs using mapped long reads. Availability and implementation BleTIES is implemented in Python 3. Source code is available at https://github.com/Swart-lab/bleties (MIT license) and also distributed via Bioconda. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
Ciliate cells have two different types of nuclei: somatic macronuclei (MAC) and germline micronuclei (MIC).MACs develop from MICs during conjugation following meiosis and fertilization.During MAC development, the genome is extensively edited, to excise thousands of interspersed, internally eliminated sequences (IESs) through an RNA-guided mechanism, accompanied by chromosome fragmentation, and segmental rearrangement, with significant variation in the nature and degree of editing between species (reviewed in Chalker et al., 2013).
The task of identifying such edits from sequence data is a specialized subset of structural variant detection, whereby characteristics peculiar to ciliates, i.e. short tandem repeats called 'pointers' (Chen et al., 2014) or TA repeats characteristic of domesticated excisases (Klobutcher and Herrick, 1995) at IES junctions, can be exploited for annotation.When mapped against a MAC reference genome, genomic DNA reads that originate from MICs or developing MACs contain IESs or arrangements not present in the mature MAC.These are observed as inserts or clips in the aligned reads and can be used to annotate IES and genome scrambling junctions and to reconstruct the IES sequences.Accurate IES annotations are necessary to analyze the effects of knockdown of candidate genome rearrangement genes by measuring IES retention across known loci, expressed as an 'IES retention score' (Denby Wilkes et al., 2016;Swart et al., 2014).Software for annotating IESs, genome scrambling, and scoring retention have been developed for short read sequencing data, for example, ParTIES (Denby Wilkes et al., 2016), ADFinder (Zheng et al., 2020), and SIGAR (Feng et al., 2020).
Single-molecule long read sequencing from Pacific Biosciences (PacBio) and Oxford Nanopore offer advantages over short reads for IES prediction: (i) long reads several kb long may span multiple IESs, unlike short reads ($300-400 bp insert size) which span either one or zero IES sites; (ii) the entire IES sequence can potentially be covered by the read; (iii) IESs containing repetitive sequences are more likely to be detected with more sequence context.Because each long read represents a single genomic DNA molecule, the correlation of IES excision or retention between neighboring IESs can be investigated.Existing software packages like ParTIES were designed for short reads, make assumptions inapplicable to long reads with higher error rates, and would not be easily adapted to accommodate them.Therefore, we developed BleTIES (Basic Longread Enabled Toolkit for Interspersed DNA Elimination Studies) for IES detection and analysis from long read libraries.

Description
BleTIES has a modular design: the main module for de novo IES reconstruction, MILRAA, depends on standard bioinformatics libraries Biopython, pysam, and htslib (Cock et al., 2009;Li et al., 2009).Input data are the MAC genome assembly (Fasta format) and genomic long reads mapped to it in BAM format (Li et al., 2009), e.g. using minimap2 (Li, 2018).If a read contains IESs, the mapper should report them as inserts relative to the assembly.PacBio circular consensus sequences (CCSs) are sufficiently accurate to infer IES coordinates directly, but the naive calling of insert coordinates and sequences from uncorrected long reads (PacBio subreads or Nanopore reads) is inaccurate because of their high error rates.Therefore, with such reads, MILRAA identifies clusters of inserts within a specified threshold distance, and if a cluster is above the minimum coverage cutoff, the insert and flanking 100 bp regions are extracted from each mapped read and assembled to a more accurate consensus sequence with SPOA (Vaser et al., 2017).The consensus is re-aligned to the reference, with the included flanking sequence helping to 'anchor' the consensus in the reference, simultaneously predicting a more accurate insert position and sequence (Fig. 1).Tandem repeats (pointers) and/or TA motifs flanking the insert are also annotated if present.The software is currently limited to nonscrambled IESs.
In addition to the previously defined IES retention score, which is on a per-read basis, BleTIES reports a per-read retention score: the fraction of annotated IES sites spanned by a read where the IES is not excised.This metric can be used to bin reads that have high IES retention, using the MILCOR module, useful for partial assembly of the MIC genome.MILTEL predicts alternative chromosome breakage sites by looking for reads bearing telomeric repeats in their clipped regions.
To test MILRAA against defined inputs and to estimate the coverage required for IES prediction, we simulated long read libraries containing mixtures of published MAC-only and IEScontaining sequences with pbsim2 v2.0.1 (Ono et al., 2021) from Paramecium tetraurelia (Arnaiz et al., 2012), whose predicted IESs are mostly <100 bp in length.About 20Â coverage of IEScontaining PacBio subreads was sufficient to recover most of the original IES coordinates (exact: 82% recall, 89% precision; 65 bp: 92% recall, 99% precision).Almost all assembled IESs had flanking TA sequences reconstructed; about 48% of original IESs were assembled exactly, and 70% with a maximum of 1 mismatch and 1 indel.As expected, more IESs were identified and correctly assembled with higher coverage of IES-containing reads (Supplementary Material).
MILRAA was also tested with real mixed libraries of MIC and MAC (NCBI accession: SAMN12736122) mapped against a MAC genome reference (Sheng et al., 2020) for Tetrahymena tetraurelia, with 7551 previously characterized IESs [median length 2.78 kb (Hamilton et al., 2016)].Two-thirds of these could be recovered (median length $2 kb) with median sequence identity $99%, at ca. 50-60Â read coverage.In addition, about 2800 new IESs were predicted.IESs that were not recovered were on average longer (median $5 kb), hence insufficient reads could span the entire IES for prediction by MILRAA (Supplementary Material).These could be resolved by binning out IES-containing reads with MILCOR and reassembling them, or by increasing the average read length.

Conclusion
We have demonstrated the benefits of single-molecule long read sequencing data over short reads in annotating genome editing in ciliates, using our software tool BleTIES.These include the detection and assembly of longer IESs and direct reporting of IES sequences and retention scores.One SMRT cell of the PacBio Sequel II platform currently (2021) yields $300 Gb of subreads, and $20 Gb of error-corrected CCS reads.Given modest ciliate genomes, for example, 72 Mb (MAC) and 98 Mb (MIC) for P. tetraurelia (Gue ´rin et al., 2017), several IES retention experiments could potentially be analyzed with a single multiplexed sequencing run.

Fig. 1 .
Fig.1.Overview of BleTIES MILRAA method to reconstruct IES junctions from error-corrected CCS reads versus from subreads.(A) CCS reads have accuracy >99%, so the consensus insert coordinate from read mapping is adequate if coverage is sufficient.(B) (1) Subreads have higher error rates, so mapping alone is insufficient to define the insert position.(2) Therefore, clusters of adjacent inserts are identified, from which the insert sequence plus flanking 6100 bp are extracted from the subreads.(3) Extracted sequences are assembled with SPOA(Vaser et al., 2017).(4 and 5) This consensus is realigned to the reference to obtain a more accurate estimate of the insert position