ChimeraTE: a pipeline to detect chimeric transcripts derived from genes and transposable elements

Abstract Transposable elements (TEs) produce structural variants and are considered an important source of genetic diversity. Notably, TE-gene fusion transcripts, i.e. chimeric transcripts, have been associated with adaptation in several species. However, the identification of these chimeras remains hindered due to the lack of detection tools at a transcriptome-wide scale, and to the reliance on a reference genome, even though different individuals/cells/strains have different TE insertions. Therefore, we developed ChimeraTE, a pipeline that uses paired-end RNA-seq reads to identify chimeric transcripts through two different modes. Mode 1 is the reference-guided approach that employs canonical genome alignment, and Mode 2 identifies chimeras derived from fixed or insertionally polymorphic TEs without any reference genome. We have validated both modes using RNA-seq data from four Drosophila melanogaster wild-type strains. We found ∼1.12% of all genes generating chimeric transcripts, most of them from TE-exonized sequences. Approximately ∼23% of all detected chimeras were absent from the reference genome, indicating that TEs belonging to chimeric transcripts may be recent, polymorphic insertions. ChimeraTE is the first pipeline able to automatically uncover chimeric transcripts without a reference genome, consisting of two running Modes that can be used as a tool to investigate the contribution of TEs to transcriptome plasticity.


INTRODUCTION
Transposable elements (TEs) are mobile DNA sequences that comprise a large fraction of eukaryotic genomes, from 15% in Drosophila melanogaster ( 1 ), 45% in humans ( 2 ), to 85% in maize ( 3 ).Many TE copies have lost their ability to transpose as a result of accumulated mutations and recombination throughout evolution ( 4 ).Despite their lack of mobility, such ancient TE insertions may still harbor functional protein domains, alternati v e splice sites, and cis -acting regulatory sequences, as transcription factor binding sites (TF-BSs) and polyadenylation (PolyA) sites.Ther efor e, TEs ar e a major source of genetic di v ersity, not only due to their mobilization, but also because they donate protein domains to genes (5)(6)(7)(8), and regulatory sequences that modify the expression of nearby genes (9)(10)(11)(12)(13).The participation of TEderi v ed sequences in the host biology is a process called domestica tion or exapta tion ( 14 ).The ancestral role of the TE sequence can be domesticated into an essential host function, but it can also be modified, and adapted, into a ne w e xapted function that ma y also be essential f or the host species ( 14 ).
Chimeric transcripts are RNAs stemming from two sequences of different origins ( 15 ).Hereafter we define chimeric transcripts as mature transcripts that have both gene and TE-deri v ed sequences.These transcripts can be divided into three types: ( 1 ) TE-initiated transcripts: chimeric transcripts with a TE transcription start site (TSS) ( 16 , 17 ); ( 2 ) TE-exonized transcripts: TE sequences are incorporated into the transcript either partially or as full-length exons (18)(19)(20); and ( 3 ) TE-terminated transcripts: chimeric transcripts with a TE transcription termination site ( 21 , 22 ).TE-initia ted and TE-termina ted transcripts might modulate gene expression levels either by the presence of TF-BSs , PolyA sites , or chromatin changes; while TE-exonized transcripts may alter the protein sequence of coding genes and have a direct effect on the protein function.Regardless of the TE position, such e v ents of TE e xaptation and domestication have been associated with many biological roles and are widespread among eukaryotic species ( 14 ).In D. melanogaster , the CHKov1 gene produces a chimeric transcript with a truncated mRNA resulting in resistance to insecticide and viral infection ( 23 ).In humans, the SET-MAR gene produces a chimeric transcript containing a Hs-mar1 copy, involved in non-homologous end-joining DNA repair ( 24 ).In cancer, TEs become acti v e due to a global h ypometh yla tion sta te ( 25 ) and such activation may generate new chimeric transcripts with detrimental outcomes ( 26 ), a process called onco-exaptation ( 9 ).For example, in large B-cell lymphoma, the FABP7 gene has an endogenous retrovirus LTR co-opted as a promoter, generating a nov el protein involv ed in abnormal cell proliferation ( 27 ).Recently, such phenomenon has been observed in a larger scale.A pan-cancer study re v ealed 1,068 tumor-specific TEinitiated transcripts from genes coding antigens, showing the high prevalence of TE exaptation in cancer ( 28 ).Therefore, chimeric transcripts have a large impact on host biology, but their study r emains hinder ed by the ubiquitous repetiti v e nature of TE copies, and lack of methods to identify chimeric transcripts.
Previous studies with Cap Analysis Gene Expression (CAGE) re v ealed a significant percentage of genes producing TE-initiated tr anscripts, r anging from 3-14% in humans and mice, depending on the tissue ( 29 ).More specifically, in human pluripotent stem cells, chimeric transcripts comprise 26% of coding and 65% of noncoding transcripts ( 30 ).In D. melanogaster , a study using expressed sequence tags (ESTs) showed that the proportion of genes with chimeric transcripts is only ∼1% ( 31 ).Another study used an approach based on cap-enriched mRNA sequencing but with a preliminary transcript elongation step allowing not only to detect TSSs but also to estimate mRNA expression rates (RAMPAGE) ( 32 ).This study detected TE-initiated chimeric transcripts in 36 stages of D. melanogaster life cy cle, and observ ed that up to 1.6% of all tr anscripts were chimer as, representing up to ∼1% of all genes ( 33 ).More recently, a tissue-specific study has shown that 264 genes produce chimeric transcripts in the midbrain of D. melanogaster , corresponding to ∼1.5% of all genes ( 34 ).
Se v eral bioinformatic methods have been de v eloped to take advantage of RNA sequencing (RNA-seq) to identify chimeric transcripts, such as CLIFinder ( 35 ) and LIONS ( 36 ).The former is designed to identify chimeric transcripts deri v ed from long interspersed nuclear elements (LINEs) in the human genome, whereas the latter identifies only TE-initiated transcripts.Both methods need a r efer ence genome and they only detect chimeric transcripts deri v ed from TE insertions present in the r efer ence genome.Ther efore, it is impossible to identify chimeric transcripts derived from polymorphic TE insertions that may exist in other populations , strains , or individuals.Finally, the latest addition to chimeric transcript detection, TEchim ( 34 ), can detect chimeras with TEs that are polymorphic and absent from the r efer ence genome of D. melanogaster , but it is not a pipeline designed to run automatically with any other genome.
In this study, we hav e de v eloped ChimeraTE, a pipeline that uses paired-end RNA-seq reads to identify chimeric transcripts.The pipeline has two Modes: Mode 1 can predict chimeric transcripts through genome alignment, whereas Mode 2 performs chimeric transcript searches without a r efer ence genome, being able to identify chimeras deri v ed from fixed or polymorphic TE insertions.To benchmark the pipeline, we have used RNA-seq from ovaries of four D. melanogaster wild-type strains, for which we have assembled and annotated genomes.We found that ∼1.12% of genes have chimeric transcripts in the ovarian transcriptome, of which ∼88.97% are TE-exonized transcripts.Our results also re v eal that the retrotransposon roo is the most frequent exonized TE family.In addition, with Mode 2, we found 11 polymorphic chimeric transcripts deriving from TE insertions that are absent from the D. melanogaster refer ence genome.Ther efor e, this w ork pro vides a new strategy to identify chimeric transcripts with or without the reference genome, in a transcriptome-wide manner.

ChimeraTE: the pipeline
ChimeraTE was de v eloped to detect chimeric transcripts with paired-end RNA-seq reads.It is de v eloped in python3 v.3.6.15 language, and it is able to fully automate the process in only one command line.The pipeline has two detection Modes: ( 1 ) genome-guided, the r efer ence genome is provided and chimeric transcripts are detected aligning reads against it; and ( 2 ) genome-blind, the r efer ence genome is not provided and chimeric transcripts are predicted for fixed or polymorphic TEs.These Modes have distinct approaches that may be used for different purposes.In Mode 1, chimeric transcripts will be detected considering the genomic location of TE insertions and exons.Chimeras from this Mode can be classified as TE-initiated transcripts (TE located upstream of the gene), TE-exonized transcripts (TE within introns or embedded in gene exons), and TE-terminated transcripts (TE located downstream of the gene).In addition, results from Mode 1 can be visualized in a genome bro wser, which allo ws a manual curation of chimeric transcripts in the r efer ence genome.Mode 1 does not detect chimeric transcripts deri v ed from TE insertions absent from the provided r efer ence genome.Mode 2 predicts chimeric transcripts considering the alignment of reads against transcripts and TE insertions, in addition to a transcriptome assembly (user optional).Hence, Mode 2 detects chimeric transcripts from de novo TE insertions and an assembled genome is not necessary.In this Mode, two alignments are performed: (i) transcript alignment and (ii) TE alignment.Then, based on both alignments, the pipeline identifies chimeric reads that support chimeric transcripts, regardless of the TE genomic location.In Mode 2, since there is no alignment against an annotated genome, it is not possible to classify chimeric transcripts considering the TE position as in Mode 1.
Both ChimeraTE Modes use chimeric reads, which are defined by paired-end reads that contain both TE and exon sequences, as evidence of chimeric transcripts.This method has been widely demonstrated by other authors as a potential source of artifactual reads, mainly due to the occurr ence of mix ed clusters ( index hopping ) on the Illumina's flow cell that may be too close to each other, as well as jumping PCR ( in vitro crossover artifact), generating read pairs that are connecting two cDNA portions that are not joined in the sample (37)(38)(39)(40).Indeed, it has been shown that up to 1.56% of all reads produced by Illumina multiplexed approaches might generate chimeric reads ( 41 ), including cases that may support chimeric transcripts deri v ed from different genes.These artifactual reads originate more likely from highly expressed genes since ther e ar e mor e molecules on the Illumina's cell.Conversely, because TE-derived sequences might comprise a low proportion of the transcriptome, artifactual reads from TEs should be produced at a low fr equency.Furthermor e, it is unlikely that artifactual reads from the same gene and TE family among RNA-seq replicates would be produced.Nonetheless, to avoid including false chimeric reads, both Modes of ChimeraTE only call chimeric transcripts that are detected in at least two RNA-seq replicates (user optional for more).

ChimeraTE mode 1: genome-guided approach
In ChimeraTE Mode 1, paired-end RNA-seq reads, a whole genome assembly, and its respective gene / TE annotation are used to predict chimeric transcripts (Figure 1 A).We highlight the need for robust gene and TE annotations to submit an analysis to Mode 1, since each TE insertion and exon will be used to identify chimeric reads.Firstly, RNA-seq alignment to the genome is performed with STAR v.2.7.6a with default parameters (Figure 1 B), and transcript expression is assessed with Cufflinks v.2.2.1 ( 42 ) for each RNA-seq replicate.We consider FPKM ≥1 as an expressed gene by default, but it can be changed by the user with thefpkm parameter.Only concordant reads (both reads from the pair are aligned) with unique alignment ( samtools -q 255 ) are selected and converted to BED format with samtools v. 1.10 ( 43 ) and bedtools v.2.30.0 ( 44 ), implemented in p ython3 with p ybedtools v.0.9.0 ( 45 ).Split r eads ar e also considered as chimeric reads when one mate of the pair stems from the TE to the exon, or vice-versa (Supplementary Figure 1).The alignment position of these reads are considered independently with bedtools bamtobed -split .Finally, reads aligned into the forward and re v erse strands are separated with samtools ( 43 ).
The IDs from reads that have aligned against exons are identified with bedtools intersect (Figure 1 B).Then, reads with at least 50% of their length ( -o ver lap 0.50 ) aligned against r efer ence TE copies have their IDs selected (Figure 1 B), and TE copies without aligned reads are removed from the downstream analysis.These two lists of IDs are intersected between each other and reads that have aligned to both exons and TEs are identified, generating a raw list of chimeric reads (Figure 1 C).TE-initiated transcripts are the first to be searched by ChimeraTE Mode 1. Expressed genes that have TE copies within their 3 kb upstream region (default but adjustable with -window parameter) are identified with bedtools ( 44 ).For each gene, the pipeline will check whether the TE located upstream has chimeric r eads shar ed with the gene exons (Figure 1 D).If there are no chimeric reads, both gene and TE are discarded.The same method is applied to identify TE-terminated transcripts, but with genes harbouring downstream TEs up to 3 kb (user adjustable value) (Figure 1 1D).Finally, TE-exonized transcripts are also identified thanks to chimeric reads aligned to exons and TEs located within genes.Based on the TE position towar ds e xons, they are classified as embedded, intronic and overlapped.Embedded TEs are located entirely within an exon.In these cases, reads aligned only to the TE sequence are not considered chimeric reads (Supplementary Figure 1).Indeed, the exon can be artificially divided into two portions, and chimeric reads must have a read (or a split read) aligned to the TE, whereas its mate (or the extension from the split read) aligned to at least one exon portion (Supplementary Figure 1).This method avoids autonomous TE expression as evidence of chimeric transcript expression.The same is performed to identify TE-exonized transcripts deri v ed from TEs that are partiall y overla pping exons but are also located in introns (Supplementary Figure 1).Finally, TE-exonized transcripts with TEs located in introns are identified with bedtools intersect , and chimeric r eads ar e detected to support chimeric transcripts.
These steps are repeated for all RNA-seq replicates provided in the input.Then, the raw results from replicates ar e compar ed, and all chimeric transcripts that were found in ≥2 replicates and have been identified with at least ≥2 chimeric reads on average between r eplicates ar e considered true chimeric transcripts (Figure 1 C).These thresholds may be changed by the user with -cutoff and -replicate parameters.Mode 1 output is a table with a list of predicted chimeric transcripts categorized by TE position, with gene ID, TE family, chimeric read coverage, TE location, gene location, and gene expression (Figure 1 D).

ChimeraTE mode 2: a genome-blind approach to uncover chimeric transcripts
ChimeraTE Mode 2 is the genome-blind approach of the pipeline.The input data are stranded RNA-seq reads, gene transcripts, and r efer ence TE insertions.TE consensus can be used, but it will decrease the alignment rate, and ther efor e it should be used only when a fasta file with TE insertions is not available (Figure 2 A).The data will be used to perform two alignments with bowtie2 v.2.4.5 ( 46 ), one against all transcripts and another against all TE sequences, both with parameters: -D 20 -R 3 -N 1 -L 20 -i S,1,0.50 (Figure 2 A).To avoid very low-expressed transcripts predicted as chimeras and to ensure reasonable processing time, the SAM alignment is converted to BAM with samtools v. 1.10 ( 43 ), and FPKMs are computed for the reference transcripts provided in the input using eXpress v. 1.5.1 ( 47 ).Then, all genes with average FPKM < 1 are removed from the downstream analysis (Figure 2 B).To identify chimeric reads between TEs and gene transcripts, both alignments are converted to BED with bedtools v.2.30.0 ( 44 ).Among all aligned pairedend reads, the pipeline considers as chimeric transcripts the ones that have at least one read aligned to the TE sequence (singleton mapping) and its mate to the gene transcript, or when both reads have aligned (concordant mapping) to the TE and gene transcript.In order to identify these reads, the TE alignment output is used to create a list with all read 1 IDs that have aligned against TEs, and another list with all The alignment against transcripts is performed and their expression is calculated.Transcripts with FPKM < 1 are removed from the downstream analysis.Next, a list of reads aligned against transcripts is created.Through the alignment of reads against TE insertions, a second list with reads stemming from TEs is also cr eated.Then, mapped pair ed-end r eads and singletons are identified, generating the list of chimeric reads, for all replicates.All chimeric transcripts that have an average of chimeric reads > = 2 and are present in > = 2 replicates are maintained as true chimer as. ( C ) Tr anscriptome assembly and chimeric reads: The de novo transcriptome assembly is a non-default option of ChimeraTE Mode 2. It performs a transcriptome assembly and aligns reads against the assembled transcripts.Then, TE insertions in the assembled transcripts are identified with RepeatMasker and the TE reads are recovered.Using the two lists of reads (transcripts and TEs), the chimeric read list is generated and the putati v e assembled chimeric transcripts ar e pr edicted.Next, a blastn is performed between these transcripts and the r efer ence transcripts provided in the input.All transcripts with length > = 80% are selected.The process is repeated for all RNA-seq replicates and chimeric transcripts assembled > = 2 replicates are maintained as true chimeras.( D ) Chimeric transcripts: if the assembly is activated, ChimeraTE mode 2 provides three outputs: (1) Chimeric reads: predicted only based on the method depicted in B; (2) Assembled transcripts: predicted only based on the transcriptome assembly method depicted in C; and (3) Double evidence: predicted by both methods -B and C-.r ead 2 IDs, r egardless if their mates have also aligned (concordant mappings or singleton mappings).The same lists ar e cr eated by using the transcript BED file: ( 1 ) r ead 1 IDs of transcript mapping reads and ( 2 ) mate 2 IDs of all mate 2 r eads, r egardless of mate mapping.All mate 2 IDs that have a TE-aligned read 1 are searched in the list of transcriptaligned mate 2. The same is performed in the opposite direction (TE-aligned read 2, transcript-aligned mate 1).These read pairs will therefore be comprised of two mates from the same pair that were singletons in the alignments, i.e. pairs comprised of one read that has aligned against a TE, and its mate against a gene transcript.The cases in which the chimeric transcript does not have the TE insertion in the r efer ence transcript will be supported only by these singleton chimeric reads (Figure 2 D).For cases in which the TE insertion is present inside the reference transcript, chimeric reads supporting it may either be singleton or concordant r eads.Ther efor e, chimeric r eads can be concordant r eads in both alignments (TEs and genes), or they may be concordant only in the gene transcript alignment and singleton in the TE alignment.Due to the repetiti v e nature of TEs, short-read alignment methods provide very few unique aligned reads against loci-specific TE copies as most reads align ambiguously between similar TE insertions.Therefore, when a chimeric transcript has been identified involving more than one TE family, the TE family with the highest coverage of chimeric reads is maintained.Subsequently, ChimeraTE uses two chimeric reads as a threshold for calling a chimeric transcript, that can be modified by the user with the -cutoff parameter.Such value does not r epr esent transcript expression nor TE expression, but it represents the coverage supporting the junction between a gene transcript (CDSs / UTRs) and a TE sequence.Finally, the output tables show the list of genes and the respective TE families detected as chimeras, r efer ence transcript ID and the total coverage of chimeric reads supporting it.
The support of chimeric transcripts performed by ChimeraTE Mode 2 is from chimeric reads aligned by an end-to-end approach.Such alignment may reduce alignment sensiti vity, since e xon / TE junctions may be covered by split reads.To mitigate the loss of detection power due to the alignment method with Mode 2, alongside with chimeric read detection using alignments against transcripts and TEs, there is an option to run Mode 2 with a transcriptome assembl y a pproach, w hich can be activated with -assembly parameter (Figure 2 C).This approach will use RNA-seq reads to perform a de novo transcriptome assembly with Trinity v2.9.1 ( 48 ).RepeatMasker v4.1.2 ( 49 ) is used to identify assembled transcripts that may hav e TE-deri v ed sequences providing -ref TEs , a custom TE library, or predefined TE consensus sequences from Dfam v3.3 ( 50 ), according to the tax onom y level, i.e.: flies , mouse , humans.Then, RNA-seq r eads ar e aligned with bowtie2 v.2.4.5 ( 46 ) against the assembled transcripts.Subsequently, the alignment is used to identify whether transcripts containing TEderi v ed sequences have chimeric reads, including split reads.All assembled transcripts with chimeric transcripts are selected as candidate chimeric transcripts.Next, these candidates are submitted to a homolo gy anal ysis with r efer ence transcripts using blastn v2.11.0 ( 51 ).Finally, all assembled transcripts with masked TEs that have at least 80% of similarity with r efer ence transcripts across 80% of their length (can be modified withmin length parameter) are considered chimeric transcripts.All these steps are repeated for all RNA-seq replicates provided in the input.Finally, the list of chimeric transcripts obtained from all replicates with the transcriptome assembly approach is compared, and all chimeras that have been identified with at least ≥2 chimeric r eads and wer e found in ≥2 r eplicates, ar e consider ed as true chimeric transcripts.By activating the -assembly option in Mode 2, the output table will provide chimeric transcripts that have been predicted based on different evidences (Fig-

D. melanogaster wild-type strains: genome assemblies and RNA-seq
To assess ChimeraTE's performance, as well as the efficiency in the identification of chimeric transcripts deri v ed from polymorphic TE insertions, we have mined previousl y available RN A-seq data from ovaries of four D. melanogaster wild-type strains ( 52 ), two from Gotheron, France (44 56 0 N 04 53 30 E), named dmgoth101 and dmgoth63; and two from S ˜ ao Jos é do Rio Preto, Brazil (20 41 04.3 S 49 21 26.1 W), named dmsj23 and dmsj7.Such RNA-seq data was produced with mRNA libraries, which is strongly advised to use with ChimeraTE, since total RNA libraries contain pre-mRNA sequences and might introduce many false positi v es of TE-e xonized transcripts.
Sequencing was performed on an Illumina Hiseq (125 bp pair ed-end r eads), with two biological r eplicates.All RNAseq libraries were trimmed by quality, and adapters were removed with Trimmomatic ( 53 ).Each strain had also its genome previously sequenced by Nanopore long reads and assembled ( 54 ).The high-quality assemblies allowed us to manually check whether chimeric transcripts predicted by both ChimeraTE Modes have the predicted TE insertions inside / near genes, as well as manually curate the presence of chimeric reads.

Running ChimeraTE with D. melanogaster data
To run ChimeraTE Mode 1 on the available RNAseq data, we performed gene annotation in the four D. melanogaster genomes with Liftoff v. 1.6.3 ( 55 ) using -s 0.5 -s 0.75 -ex c lude par tial parameters and the r6.49D. melanogaster genome ( dm6 strain) available in Flybase as r efer ence.TE annotation was performed with RepeatMasker v4.1.2( 49 ), with parameters:nolow ;norna ; -s; and -lib with the TE sequence library for D. melanogaster provided by Bergman's lab v.10.2 ( https://github.com/bergmanlab/drosophila-transposons/blob/master/current/D mel transposon sequence set.fa ).Small TE insertions with length < 80 bp were discarded due to the lack of robustness to define them as TEs, following the 80-80-80 rule ( 56 ).In addition, due to the presence of simple sequence repeats (SSRs) in many TEs from D. melanogaster , such as r oo , 412 , tir ant ( 57 ), se v eral insertions from RepeatMasker might be indistinguishable from SSRs if the r epeat r egion comprises a large proportion of the TE sequence ( 58 , 59 ).Ther efor e, to incr ease the robustness of the TE annotation, we identified the presence of SSRs with TRF v. 4.09 ( 60 ) across all insertions annotated on each strain, with parameters 2 , 5 , 6 , 75 , 20 , 50 , 500 , -m , -h and removed TEs that presented SSRs over > 50% of their sequences.We used ChimeraTE Mode 1 with default parameters, setting -strand rf-stranded , since our paired-end reads have the orientation as re v erse and f orward f or the first and second r eads, r especti v ely.In case the stranded paired-end reads were produced as forward and reserve (first and second r eads, r especti v ely), the parameter would be -strand fwd-stranded .For ChimeraTE Mode 2, we demonstrate its potential in detecting chimeric transcripts deri v ed from TE insertions that are not present in a reference genome, e v en though the transcript sequences and TE copies provided as input stemmed from the dm6 r efer ence genome.Ther efor e, we have used ChimeraTE mode 2 with RNA-seq from the four wild-type strains listed previously, with reference transcripts from D. melanogaster ( dm6 strain) available in Flybase r6.49 ( http://ftp.flybase.net/genomes/Drosophilamelano gaster/ current/fasta/dmel-all-transcript-r6.49.fasta.gz ) ( 61 ).The TE annotation for the dm6 genome was assessed with the same protocol used on the four wild-type strains, with RepeatMasker ( 49 ).We have used default parameters, ex cept b y -assembly .

Additional species data
In order to demonstrate that ChimeraTE can be used with other species than Drosophila spp., we applied it to Arabidopsis thaliana , Homo sapiens and the fish Poecilia reticulata (guppy) RNA-seq datasets.The latter was selected as an example of non-model species.The genome and gene annotation of A. thaliana was obtained from NCBI (GCF 000001735.14),version TAIR10.1.The TE annotation was assessed with RepeatMasker v.4.1.2( 49 ), with par ameter s -species ar abidopsis -cutoff 225 -nolo wnorna -a -s .We downloaded two replicates of pairedend RNA-seq from A. thaliana leaf, available on NCBI (SRR21230172, SRR21230173).Regarding human data, the genome, gene and TE annotations wer e r etrie v ed from NCBI ( 62 ), version GRCh38.p14.Two paired-end RNAseq replicates from the human leukemic cell-line K562 were downloaded from NCBI (SRR521457, SRR521461).Finally, P. reticulata' s genome and gene annotation were also retrie v ed from NCBI (GCF 000633615.1),and TE annotation was assessed with RepeatMasker v.4.1.2, parameters : -species Actinopteri -cutoff 225 -nolow -norna -a -s .Two RNA-seq replicates from ovary follicular tissue were downloaded from NCBI (SRR17332506, SRR17332508).Both ChimeraTE Mode 1 and Mode 2 were used with default parameters, except by -threads 32 , and -ram 64 .

Benchmarking polymorphic chimeric transcripts with nanopore genomes
Once chimeric transcripts were identified, we used the highquality Nanopore assemblies for dmgoth101, dmgoth63, dmsj23 and dmsj7 previously published ( 54 ) to confirm whether genes predicted by Mode 1 as chimeric transcripts have indeed the respective TE insertion located near or within them.To do so, we have used an ad-hoc bash script to create three bed files from genes: 3 kb upstream; 3 kb downstr eam, and gene r egion.Then, we used bedtools intersect ( 44 ) to identify genes with TEs located in the three regions.For Mode 1 we have randomly sampled 100 chimeric transcripts of each wild-type strain to visualize the alignments performed by Mode 1 on the IGV genome browser ( 63 ).For Mode 2, all genes not found by bedtools intersect with the predicted TE insertion were visualized in IGV .In both manual curations, we considered false positi v es those cases in which we did not find the TE insertion, or we found the TE insertion, but without chimeric reads.
In order to assess the number of chimeric transcripts found by Mode 2 in wild-type strains deri v ed from TE insertions absent in the dm6 genome, we also used the adhoc bash script to create the bed files with 3 kb ± and the gene regions for dm6 .Then, we used bedtools intersect ( 44 ) with TE annotation and the gene regions.By using this method, we generated a list of genes with TEs located 3 kb upstream, inside genes (introns / exons), and 3 kb downstream for the dm6 genome.Then, the polymorphic chimeric transcripts were identified with the comparison of genes with TEs inside / nearby in dm6 and the list of chimeric transcripts in the wild-type strains.In addition, all chimeric transcripts deri v ed from TEs that were not found in dm6 were manually curated with the IGV genome browser ( 63 ).

Validation of chimeric transcripts by RT-PCR
D. melanogaster strains were kept at 24 • C in standard laboratory conditions on cornmeal-sugar-yeast-agar medium.
For sampling, 3-7 day old D. melanogaster females were immersed in Phospha te-buf fered saline solution (PBS) and dissected under a ster eomicroscope.Thr ee biological samples were collected in buffer TA (35 mM Tris / HCl, 25 mM KCl, 10 mM MgCl 2 , 250 mM sucrose, pH 7.5), each composed of 30 pairs of ovaries.All samples were collected on ice in 1.5 ml RNAse free tubes and stored at −80 • C until use.Total RNA was extracted using QIAGEN AllPrep DN A / RN A Micro extraction kit (Qiagen 80284) following the manufacturer's guideline.cDNA synthesis was performed on 1 g of total RNA or water (RT negati v e control) with BIORAD iScript cDNA systhesis (Biorad 1708891).Primers were designed considering the TE and the exon location with aligned chimeric reads (Supplementary Table 1) and PCRs were performed with Taq'Ozyme (Ozya001-1000).

Sequence and protein analysis of TE-e x onized elements
The sequences of TE-exonized elements were extracted from wild-type genomes with bedtools g etf asta ( 44 ), parameter -s, using BED files created by ChimeraTE Mode 1.Since the TE reading frame incorporated into the gene transcript is unknown, we r ecover ed all open r eading frames (ORFs) with EMBOSS v.6.6.0 getorf ( 64 ) in the three coding frames, from both strands.Next, the protein domains in these sequences were assessed with pfam-scan v. 1.6 ( 65 ), using Pf am-A.hmmda tabase fr om InterPr o ( 66 ), with default parameters.We applied a filtering step for protein domains with multiple overlapped matches in the same TE insertion, keeping in the downstream analysis only the longest match.We also removed protein domains that had no association with any TE function, following the description from In-terPro.Finall y, to anal yze w hether r oo elements genera ting TE-exonized embedded transcripts provided a specific motif to the chimeric exons, we extracted their sequences with bedtools g etf asta , and aligned with r oo consensus sequence with MUSCLE v.5.1 ( 67 ).Alignment plots were performed with MIToS v.2.11.1 ( 68 ), using the package Plots .

Differ ential expr ession analysis
The read count for differential gene expression analysis for the four wild-type strains was obtained from Fablet et al. ( 52 ).The data were normalized with DEseq2 v. 3.10 ( 69 ), with a designed model ( ∼genotype), assuming that the only variable to identify differ entially expr essed genes between strains must be the genotype.We performed pairwise comparisons with the four wild-type strains, considering differentially expressed genes the ones with adjusted p-value < 0.05.

RESULTS AND DISCUSSION
ChimeraTE predicts chimeric transcripts deri v ed from genes and TEs using two different strategies.Mode 1 is a genome-guided approach that will predict chimeric transcripts from paired-end RNA-seq through chimeric read pair detection.The main advantages of Mode 1 in comparison to Mode 2 are that the first one can detect split reads between exons and TEs, capture chimeric transcripts with low cov erage / e xpression and classify chimeric transcripts according to the TE position: TE-initiated, TE-exonized, TEterminated transcripts.Howe v er, Mode 1 misses chimeric transcripts deri v ed from TE insertions that are absent from a r efer ence genome.Mode 2, w hich is a genome-blind a pproach, performs two alignments against r efer ence transcripts and TE copies and then, similar to Mode 1, predicts chimeric reads between these two alignments.In addition, Mode 2 can optionally perform a de novo transcriptome assemb ly ab le to detect chimeric transcripts and chimeric reads through split read alignment, improving the sensitivity.Despite the similarities between both Modes, the results from Mode 1 and 2 had an overlap of 50.46% (Supplementary Data, note 1).We have shown that such differences are likely to be associated with the alignment used by the Modes, and read length (Supplementary Data, note 1, Fig S1 .1)( 70 , 71 ).We also observed that increasing the minimum number of chimeric reads (default = 2) causes a decrease of 48.45% for Mode 1 and 33.48% for Mode 2 in the total number of chimeras found in two RNA-seq replicates (Supplementary Data, note 2).Ther efor e, a substantial part of the results is detected with low coverage of chimeric reads.Finally, we demonstrated that the use of RNA-seq replicates is crucial to reduce the frequency of potential chimeras supported by artifactual chimeric reads (Supplementary Data, note 3) ( 37 , 72-77 ).

Setting up the datasets for ChimeraTE
Each ChimeraTE Mode r equir es differ ent input datasets.To run Mode 1 (genome-guided), gene and TE annotations, along with a genome fasta file are necessary.We took advantage of available paired-end RNA-seq datasets from the ovaries of four wild-type strains of D. melanogaster ( 52 ), for which high-quality genome assemblies were also available ( 54 ).We performed gene annotation in the new assemblies using D. melanogaster 's genome ( dm6 ) from Flybase r6.49 as a r efer ence and obtained ∼17,278 genes per genome (Supplementary Tab le 2).Regar ding TE annotation, we found ∼10.48% of TE content in the four wild-type strains, similar to our previous estimates for these strains ( 54 ).Filtering out TE insertions smaller than 80 bp, removed ∼5,549 TEs (20.59%) across the four wild-type strains.In addition, to remove potential mapping artifacts, we discarded ∼752 TE copies (2.79%) per strain that harbored SSRs longer than half the TE length.On average across strains, 85% of the TEs filtered out due to the presence of SSRs were roo elements (Supplementary Figure 2).These roo copies were around 112 bp (Supplementary Figure 3), which is the same length as the known SSR present in the roo consensus sequence ( 57 ).In total, 123 TE families in the four genomes wer e uncover ed, comprising a mean of ∼21,402 TE insertions ( standard deviation = 883).In all genomes the TE content in bp is higher for LTR, then LINE elements, followed by DNA and Helitron families (Supplementary Table 2) ( 78 ).In order to run Mode 2 (genomeblind), we used r efer ence transcripts from Flybase r6.49 and performed TE annotation on the r efer ence dm6 genome.We have obtained 29,086 TE insertions, r epr esenting 16.37% of the genome content, following similar TE family proportions as seen for the four wild-type genomes (Supplemen-tary Table 2).Nevertheless, the dm6 genome TE annotation shows an extra ∼6% of TE content compared to the wildtype genomes, mainly due to LTRs and LINEs (Supplementary Table 2).The D. melanogaster's heter ochr omatin corresponds to ∼32.5% of its genome ( 79 ), and includes repeat sequences, as TE copies ( 80 ).The wild-type assemblies have a high coverage for euchromatin regions, since 96% of gene content was r ecover ed, but a poor heter ochr omatin coverage, corresponding to the lack of TEs compared to the dm6 genome.Howe v er, the search for chimeric transcripts with ChimeraTE Mode 1 is restricted to the gene region and its 3 kb boundaries , hence , the lack of heter ochr omatic TE insertions does not impact the chimeric transcript search.

Chimer aTE mode 1 r ev eals that 1.12% of genes pr oduce chimeric transcripts in D.melanogaster wild-type strains
ChimeraTE mode 1 was run on the four wild-type strain genomes and their respecti v e ovarian RNA-seq data ( 52 ).Across all strains, we found 327 genes producing 408 chimeric transcripts, accounting for 1.83% of the total genes in D. melanogaster , 4.04% of the expressed genes (FPKM > 1), and r epr esenting 1.12% of all genes when normalized by the four strains.Previous studies have found similar proportion of genes generating chimeras in D. melanogaster.Among them, Coronado-Zamora et al. is the only study that performed chimeric transcripts search in multiple strains and different tissues, including ovaries, for which the authors found 1.41% of genes with chimeric transcripts when normalized by all strains ( 81 ).To verify whether ChimeraTE identified the correct TE family for each chimera, we have compared the genomic coordinates of TEs and genes (3 kb upstr eam / downstr eam and inside genes) with bedtools intersect ( 44 ) and predicted genes with TE copies in these regions.All chimeric transcripts in the four wild-type strains had at least one TE insertion in the expected position from the predicted TE family (Supplementary Tables 3-6).In addition, we randomly selected 100 chimeric transcripts in each wild-type strain to visualize in IGV ( 63 ) and confirm the presence of chimeric reads as expected (Figure 3 A).All the 400 manually inspected chimeras wer e corr ectly found in the genome browser.Among all genes generating chimeric transcripts, around 89% are TE-exonized (61.84% correspond to TE-exonized embedded, 17.18% to TE-exonized intronic, 9.95% to TE-exonized overlapped), 8.64% TE-terminated and 2.36% TE-initiated transcripts (Figure 3 B).It is important to note that ChimeraTE does not classify an internal TE copy as either a TE-initiated or a TE-terminated transcript, e v en if the transcript does indeed start, or end, at the TE sequence.TE-initia ted and TE-termina ted repea ts are necessarily outside of gene regions as per ChimeraTE categories (see Materials and Methods).Ther efor e, one should assume that the high prevalence of TE-exonized transcripts ( ∼89%) might be associated with potential cases of misannota ted TE-initia ted and TE-termina ted transcripts.Indeed, a previous study in D. melanogaster has shown that ∼2.8% of genes hav e TE-deri v ed promoters in embryos, pupae, and larvae tissues ( 33 ).
A recent analysis in D. melanogaster ovaries has found 884 chimeric transcripts deri v ed from 549 genes, among fiv e wild-type strains ( 81 ), indicating that a gene is capable of producing multiple chimeric transcripts involving different TE copies.In agreement, TE-exonized chimeric genes detected by ChimeraTE, show evidence of multiple TE sequences acting as chimeric transcripts (22.6% of TEembedded genes, 30.95% for overlapped, and 30.34% for intronic).Ne v ertheless, TE-initiated transcripts do not involve more than one TE per gene.Among 46 TE-terminated transcripts found in all strains, only fiv e genes were reported with more than one TE copy: ( 1 ): NADH dehydr og enase ( CG40002 ) in dmgoth63; ( 2 , 3 ) Max and rudimentar y-lik e in dmsj23; and ( 4 , 5 ) six-banded , CG14464 in dmsj23.All of them, except CG14464 , were associated with two TE insertions from the same TE family, located close to each other (Supplementary Figure 4), suggesting they might be multiple hits from RepeatMasker corresponding to only one TE copy ( 82 ).Thus, multiple TE copies producing chimeric transcripts exist onl y w hen TE copies are within genes.We then tested whether the number of TEs within genes has a correlation with the presence of multiple TEexonized chimeras.By using all the genes expressed in ovaries (FPKM > 1) with at least one TE insertion within, we found weak positi v e correlation between them (Supplementary Figure 5), among which the highest was found in dmsj23 (Pearson; r = 0.22; P = 2.6E-08), and nonsignificant for dmgoth101 (Pearson; r = 0.06; P = 0.08).Howe v er, when we take into account the number of TE copies per TE family present within genes, and the respecti v e number of chimeras generated by TE family, we observ ed a positi v e correlation (Pearson; r = 0.87; P < 2.2E-16) (Supplementary Figure 6).Therefore, TE-rich genes are not the most frequent among the ones generating chimeras, but TE families with high copy number within genes are the ones likely to produce TE-exonized transcripts.
Among all chimeric transcripts predicted by ChimeraTE Mode 1, 11 chimeras have been previously described in D. melanogaster using EST data ( 31 ).It is important to note that the authors did not consider heter ochr omatic regions, as well as chimeric transcripts deri v ed from INE-1 elements.From these 11 chimeras, fiv e were found in all strains: Ssdp (gene) -HMS-Beagle (TE family); Agpat1 -1360 ; anne-1360 ; Atf6-Xelement and ctp-HMS-Beagle .The other six chimeric transcripts are absent from some strains: CG6191-joc k ey ; CG15347-HB ; CG3164-McClintoc k; Svilroo; CHKov1-Doc and Kmn1-pogo (Supplementary Table 7).We manually checked these chimeric transcripts in the IGV genome browser and we confirmed the presence of the TE families predicted by ChimeraTE (Supplementary Table 7).In addition, we also found three chimeras reported previously with RAMPAGE: CG31999 -Tc1-2 ; CkII α-1366 ; Svilroo ( 33 ).Finally, in another study on D. melanogaster midbrain, 264 genes were shown to produce chimeric transcripts using single-cell RNA-seq data ( 34 ).The authors demonstra te tha t retrotransposons have splice donor and acceptor sites that generate new chimeric isoforms through alternati v e splicing ( 34 ).Despite the differences between tissues and methods, we found 22 chimeric transcripts identified previously using scRNA-seq ( 34 ) (Supplementary Table 7), of which two wer e pr eviously found in ESTs ( 31 ).Taken together, the genome-wide analysis performed by ChimeraTE Mode 1 has uncovered 296 genes with chimeric transcripts in the D. melanogaster ovarian tissue for the first time.

ChimeraTE mode 2: a method to uncover chimeric transcripts without genome alignment
D. melanogaster has a high rate of TE insertion polymorphism across worldwide population (83)(84)(85).To test the ability of ChimeraTE Mode 2 in detecting chimeric transcripts deri v ed from TE insertions absent from a reference genome, we used RNA-seq from the four wildtype strains with transcript and TE annotations from the D. melanogaster r efer ence genome ( dm6 ).ChimeraTE  2C)."Chimeric reads": chimeric transcripts detected only by the method of chimeric reads (Figure 2B)."Double evidence": chimeric transcripts detected by both methods.( B ) Correlation between chimeric read coverage and gene expression of chimeric transcripts found by Mode 2. In all strains, we did not observe correlation between gene expression (FPKM) and the coverage of chimeric reads between the three categories of evidence.All double evidence chimeras were found with high coverage of chimeric reads, suggesting high reliability of chimeras when both methods (chimeric reads and transcriptome) are considered together.
Mode 2 may predict chimeric transcripts based on two types of evidence: either using chimeric reads only or b y taking adv anta ge of de nov o transcriptome assembly.Chimeric transcripts with both evidences are named 'doub le-e vidence' (chimeric reads and transcript assembly -Figure 2 B and C).Among the four wild-type strains, ChimeraTE Mode 2 has identified 324 genes (Supplementary Table 8-11) producing 339 chimeric transcripts (Figur e 4 A), r epr esenting 1.81% of the total D. melanogaster genes, and 1.07% when normalized by the four wild-type strains.Comparing the 'chimeric read' approach with the 'transcript assembly' one, the former method found ∼51% as many chimeric transcripts than the detection by transcriptome assembly (Figure 4 A).This is probably due to the technical limits of de novo transcriptome assembly of repetiti v e-rich transcripts ( 86 ), which may lead to the lack of chimeric transcript isoforms.Between both methods, we did not find any correlation with chimeric read coverage and gene expression (FPKM), indicating that e v en low expressed genes (FPKM < 10) can be detected by both methods independently (Figure 4 B).Double evidence chimeras r epr esents ∼14% of all chimeras detected by Mode 2, and they have been found only with high chimeric read coverage (Figure 4 B).
As each chimeric transcript was detected based on reference transcripts and TE insertions, we used the Nanopore assemblies to manually inspect the presence of the predicted TE family inside and near ( ±3 kb) genes, with the intersection of genomic coordinates from genes and TEs with bedtools intersect ( 44 ).We considered true chimeric transcripts the cases in which we found in the assembly the presence of the predicted TE insertion inside / near the gene.The alignment of RNA-seq libraries against the genome sequence was also used to confirm the presence of the TE insertion, as well as the presence of chimeric reads, with IGV genome browser ( 63 ).The manual curation was performed with the three groups of results from Mode 2 ('chimeric reads', 'transcriptome assembly' and 'double-evidence').We found that 96.30% of all chimeric transcripts predicted by double evidence were true, whereas we observed 90.59% from transcriptome assembly and 73.13% from chimeric r eads.Ther efore, taken together, ChimeraTE Mode 2 has provided a reliable inference with 81.2% sensitivity (weighted average), based on genomic manual curation.
The main difference of ChimeraTE Mode 2 from the previous pipelines is its ability to detect chimeric transcripts deri v ed from TEs that are absent in a r efer ence genome, using RNA-seq from non-r efer ence individuals / cells / strains.We first identified in the dm6 r efer ence genome the genes with TEs located upstream, inside, and downstream.We found that 2,239 genes in the dm6 genome have TEs located 3 kb upstream, 1,863 inside (introns and exons), and 2,320 downstr eam.These genes wer e selected as candidates to produce chimeras in the dm6 genome and then compared with the list of chimeric transcripts generated by ChimeraTE Mode 2 in the four wild-type strains.In addition to the comparison with dm6 genes harboring TEs inside / near, we have manually curated the TE insertions and the presence of chimeric transcripts with the IGV genome browser ( 63 ).Such manual curation was performed with the wild-type genomes, by checking the presence of the TE and chimeric read alignments.For instance, the Mps1-FB chimera is an interesting case, since it is the only chimeric transcript specific to French populations, as we found it in both dmgoth101 and dmgoth63.In the dm6 r efer ence genome, Mps1 has an overlap between its 3 UTR and the 3 UTR of the alt gene, located on the other strand (Figure 5 A).The same distribution of both genes is found in the Brazilian strains, dmsj23, and dmsj7 (Figure 5 B).Conversely, in dmgoth101 and dm-goth63, there is a gap of ∼9,500 bp between Mps1 and alt , with four small FB copies of ∼412 bp in length (Figure 5 C), indicating that it may be an old insertion.Furthermore, one of the FB copies is located downstream of the alt gene, which has also been identified as TE-terminated downstream in dmgoth63 and dmgoth101.This result shows that ChimeraTE Mode 2 can detect chimeric transcripts deri v ed from TEs that are not present in the reference genome.
Taking into account all chimeric transcripts detected by Mode 2, we found 11 genes (3.40%) generating chimeric transcripts deri v ed fr om TEs that are absent fr om the dm6 genome (Table 1 ).There are specific chimeras from French strains: Mps1-FB , CG1358-S and CG46280-POGON1 genes, but only Mps1 had chimeric transcripts in both French dmgoth63 and dmgoth101 strains, whereas CG1358 was found as chimera only in the dmgoth101, and CG46280 only in dmgoth63.The Ythdc1-roo chimera was observed with a strain-specific polymorphic roo element inside an exon in the dmgoth63 strain.Regarding the Brazilian strains, we found two TE insertions present in both dmsj23 and dmsj7, involving the genes cic and TrpRS-m , but they were found as chimeras only in the dmsj23 strain.We also found rb , r-1 and ArfGAP1 with dmsj23-specific TE insertions, whereas in the dmsj7 we found caps and Ubi-p5E with specific TE insertions giving rise to chimeric transcripts.
To evaluate Mode 2 accuracy in detecting chimeric transcripts deri v ed from TEs absent in the r efer ence genome, we have performed RT-PCRs for 10 out of 11 cases predicted (it was not possible to design specific primers for the TrpRSmprotop a chimera).The primers were designed considering the TE and the exon location with aligned chimeric reads (Supplementary table 1).For Mps1 -FB , caps-pogon1 , Ythdc1roo , Ubip5E-gypsy, CG46280 -pogon1 , the amplified fragment sizes were in agreement with ChimeraTE Mode 2 predictions (Supplementary Figure 7A-E).The r1-stalker2 chimera was amplified in both Brazilian strains, e v en though the TE insertion generating this chimeric transcript was absent in dmsj7 genome assembly Table 1.11 polymorphic chimeric transcripts from TE insertions that are not present in the dm6 r efer ence genome.The r ed triangle in a line r epr esents the presence of a chimeric transcript; the white triangle in a line r epr esents the pr esence of the TE insertion but without a chimeric transcript; the line without triangles r epr esents the absence of the TE insertion (Supplementary Figure 8A).We hypothesized that this might be due to a low-frequency TE in the dmsj7, absent from the assembly and without enough read coverage in the RNA-seq to predict it.Indeed, we confirmed the presence of this TE insertion by PCR in both Brazilian strains (Supplementary Figure 8B).The chimeric transcript from rb-mdg3 was detected only in dmsj23 by ChimeraTE Mode 2, but the RT-PCR assay was successful only in the dmsj7 strain (Supplementary Figure 9).The chimera deri v ed from CG1358-S was predicted only in dmgoth101 (Supplementary Figure 10), despite the TE insertion being also present in dmgoth63 genome.By RT-PCR, we detected this chimeric transcript in both strains, suggesting that it was not detected in dm-goth63 by ChimeraTE due to Illumina coverage issues.A similar result was f ound f or the cic-pogon1 chimera, for which we amplified the chimeric transcript in both Brazilian strains, but it was detected only in dmsj23 by ChimeraTE (Supplementary Figure 11).Altogether, these results indica te tha t Mode 2 can predict chimeric transcripts from TEs absent from the r efer ence genome, but this prediction depends upon the coverage of chimeric reads in the RNA-seq data.

Evaluation of chimeric transcripts in dmgoth101 through nanopore RNA-seq
Short-read RNA-seq may introduce biases during library pr eparation (e.g.r e v erse tr anscription, second-str and synthesis, PCR amplification) ( 87 ), or e v en during transcriptome assembly, with loss of information regarding isoform di v ersity, especially for those with low-expression and repetiti v e regions ( 88 ), which might be the case of many chimeric transcripts.In the past years, the third generation of longread sequencing has been used to mitigate such biases, because there is no need to perform transcriptome assembly.Thus, we validated chimeric transcripts predicted by ChimeraTE Mode 1 with short reads using long reads produced with Oxford Nanopore Technology (ONT) from dm-goth101 ovaries ( 89 ).We did not perform the same validation with RNA long reads for chimeras from Mode 2 predicted with short reads, due to the lack of information regarding the genomic position of TEs generating them.The long reads were aligned to the dmgoth101 genome and then checked for the presence of all chimeric transcripts detected by ChimeraTE Mode 1 in dmgoth101.The presence of the chimeric transcripts was confirmed when the long read was aligned into the gene exons, and it had the TE-deri v ed sequence as part of the reads.From two TEinitiated transcripts, we found only CG14613-S2 chimera by ONT reads.In TE-terminated transcripts, six (40%) out of 16 were also observed by ONT data.Finally, in TEexonized transcripts, we identified 89 (62.24%) out of 143 embedded TEs, 12 (52.17%)out of 23 overlapped TEs, and 13 (46.43%)out of 28 intronic.The undetected chimeras through ONT reads might be associated with the longread library construction protocol, TeloPrime, which was used to generate full-length cDNA libraries.Such a protocol might miss weakly expressed isoforms, as well as truncated 3 ends due to RNA degradation in vivo or during RNA manipula tion ( 90 ).We investiga ted whether shortread coverage could explain the lack of detection of certain chimeric transcripts with ONT data.Taking all dm-goth101 chimeras from Mode 1, chimeric transcripts found by both ChimeraTE and ONT data had an average chimeric read coverage of 69.44, whereas the undetected ones had 10.83.Ther efor e, these r esults indica te tha t the detection of chimeras with ONT reads is likely to be more efficient with highly expressed chimeric tr anscripts.Over all, our results demonstra te tha t 50.16% of all chimeric transcripts detected in the ovary transcriptome of D. melanogaster with ChimeraTE Mode 1 can also be found by ONT RNA-seq.

The high pr ev alence of roo elements in chimeric transcripts is due to ∼132 nt TE insertions embedded into e x ons
Se v eral TE families hav e been associated with chimeric transcripts in D. melanogaster ( 31 , 34 ).Here, we found 76 TE families producing at least one chimeric transcript in the four wild-type strains, r epr esenting 61.79% of all TE families annotated.We found a positi v e correlation between the frequency of TE families in chimeric transcripts and their respecti v e abundance in the D. melanogaster genomes (Pearson; r = 0.53; P < 2.2E-16).In the TE-initiated transcripts, S2 , INE-1 , hobo , 1360 and 412 had the highest frequencies (Figure 6 ).In TE-exonized transcripts, the prevalence of roo elements is more pronounced in the embedded TEs group, comprising 50% of all chimeras, followed by INE-1 elements, with 19%.
Interestingly, for both TE-exonized intronic and overlapped, INE-1 had the highest frequencies, with 37.37% and 23.25%, respecti v el y (Figure 6  We then investigated whether exonized TE insertions generating TE-exonized transcripts could contribute to protein domains, especially roo due to its high prevalence in TE-exonized embedded chimeras.Through our analysis with pfam-scan , we found in total 28 genes with TE-exonized transcripts from TE insertions harbouring a TE protein domain, corresponding to an average of ∼16 per strain (Supplementary Table 12).We assessed the completeness of these protein domains conserved on these insertions, assuming domains with > 75% of completeness as complete, and < 75% as partial.Our analysis re v ealed that ∼45% of all domains are likely to be conserved (Figure 7 A), for which 92% correspond to catalytic domains, and 8% to zinc finger binding domains (PRE C2HC -PF07530.14;zf-CCHC -PF00098.26).In addition, 17.69% of all TE-deri v ed protein domains stem from embedded TE insertions, 27.43% from overlapped with exon, and 54.86% from intron regions.Furthermore, only one roo insertion had protein domains detected (Peptidase A17; DUF5641), which is embedded within CR46472 long noncoding RNA gene.Taking into account that 50% of the TE-exonized transcripts are deri v ed from embedded roo insertions, this result suggests that despite their high frequency, they are old insertions without catalytic potential.
Since we predict these chimeric transcripts only based on chimeric reads, we are not able to confirm whether these domains have been fully incorporated into gene transcripts, limiting our biological conclusions regarding catalytic or binding effects.There are only three chimeric transcripts with TE-deri v ed protein domains for which we found the whole TE sequence incorporated into the transcript with ONT RNA-seq data (Supplementary Figure 12).Among them, CG4004hobo , and nxf2 -TART-A are present into CDS regions, suggesting a putati v e functional role of the TE sequences.The third chimera, CR42653 -Doc , is a noncoding pseudo-gene, which suggests that the TE-deri v ed protein domain does not have a functional role.
TE insertions disrupting coding regions and promoters ar e mor e likel y to be deleterious, being consequentl y eliminated by purifying selection ( 91 ), although se v eral e xaptation e v ents hav e been documented ( 23 , 92 ).The presence of roo elements in 50% of all TE-exonized embedded transcripts suggests a neutral or adapti v e role of this family when incorporated into gene tr anscripts.R oo is an LTR r etrotransposon, encoding thr ee proteins: gag , pol , and env , which have been through domestication e v ents from retroelements in many species, including Drosophila ( 8 , 93-94 ).It is the most abundant euchromatic TE family of D. melanogaster ( 1 , 95 , 96 ), and its insertions have been associated with modifications in the expression level of stress response genes due to the presence of TFBSs ( 97 ).They have also low enrichment of r epr essi v e histone marks ( 98 , 99 ), potentially explaining their high transposition rates ( 100 , 101 ).We analyzed the roo sequences that are embedded into exons of TE-exonized transcripts.The length of these chimeric insertions is ∼132 bp, confirming that they are old insertions since the full-length roo consensus sequence is 9,250 bp (Figure 7 B).Subsequently, we analyzed whether these exonized roo insertions are donors of preferential motifs to the chimeric transcripts.All roo insertions stem from a specific region, between the 5 UTR and the roo openr eading frame (Figur e 7 B).Compar ed to a roo consensus sequence, in dmgoth63 and dmsj7, most insertions show a deletion from the 5 UTR up to the middle of the ORF.Despite the low nucleotide di v ersity of roo insertions in the D. melanogaster genome, the 5 UTR region has a hypervariable region, including deletions and repeats, with se v eral copies missing a tandem repeat of 99 bp ( 57 , 102 ).It has been proposed that this region may have a role in roo transposition, by heter ochr oma tiniza tion, recruitment of RNA pol II, and interaction with other enzymes ( 102 ).Curiously, a study assessed the nucleotide di v ersity between roo insertions, and characterized the same region as a deletion hotspot ( 78 ).A recent work also observed this region as part of TE-exonized transcripts in a transcriptome-wide manner in D. melanogaster ( 81 ).Why this region has been maintained through evolution, and whether it could be adapti v e or neutral, remains unclear.genomic context, we observed ∼1,767 polymorphic TE insertions in these three regions regardless of the presence of chimeric transcripts, for which we found an equal distribution between TEs upstream ( ∼32%), TEs inside genes ( ∼35%) and TEs downstr eam ( ∼32%).Such r esult suggests that polymorphic insertions are not pr efer entially maintained by selection in specific gene locations, but chimeric transcripts from polymorphic TEs ar e mor e likely to be generated when they are within gene regions.In addition, we investigated whether the TE insertions absent from the reference genome could either be population-specific (Brazil or France) or strain-specific.The four TE-initiated transcripts deri v ed from polymorphic TE insertions were strainspecific, two from dmgoth63 and two from dmsj7 (Figure 8 A).For TEs inserted inside genes, 66.67% were found only in one strain.Only four chimeric transcripts are common to all strains CG10543roo , CG10077roo , RASSF8roo and CG7239roo (Figure 8 B).We did not find a TE-e xonized deri v ed from a TE insertion present only in the Br azilian str ains, but we did find one for the French str ains, the simj -roo chimer a.Interestingly, nine chimer as were found between pairs of Brazilian and French strains (Figure 8 B), indicating retention of ancestral polymorphisms in these populations due do incomplete lineage sorting ( 107 ).Finally, for TEs located downstream genes (Figure 8 C), the chimera Mps1-FB is the only one found in both French strains, and absent from Brazilian strains, as we observed by Mode 2 (Figure 5 ).Taken together, these results reinforce the potential of TEs in generating genetic novelty between D. melanogaster strains.

Modification in gene expression is not a general rule for chimeric transcripts
TEs can modify gene expression when inserted inside and in their close vicinity.We hypothesized that genes with Mode 1 (Pearson; r = 0.99; P = 0.01), as well as for in Mode 2 (Pearson; r = 0.99; P = 0.009) (Supplementary Figure 14).

CONCLUSION
In the last decades, RNA-seq has provided the opportunity to understand transcriptome plasticity, which can lead to phenotypic di v ergence, from related species or indi viduals from a population ( 114 ).Among se v eral sources of modification in gene expression and novel isoform transcripts, TEs have been considered fundamental suppliers of transcriptome plasticity, participating in gene networks and incorporating either regulatory sequences or protein domains into gene transcripts.The identification of chimeric transcripts is an important step to the understanding of transcriptome plasticity, since they may be triggered by ectopic conditions, such as cancer, oxidati v e stress, and heat shock ( 26 , 115 , 116 ), which may lead to both detrimental and advantageous outcomes ( 117 ).Ther efor e, uncov ering the e xtent of chimeric transcripts between individual / cell / strain transcriptomes is a crucial first step to investigate potential exapta tion / domestica tion events, or gene disruption and loss of function.
Chimeric transcripts have been identified more recently by different methods exploiting RNA-seq data ( 34-36 , 81 ), but none of them provided the possibility to predict chimeras from TEs that are absent from r efer ence genomes.Here, we de v eloped ChimeraTE, a pipeline ab le to identify chimeric transcripts from TEs that are absent from the r efer ence genome.Mode 1 is a genome-guided approach and may be used either when the user is not interested in chimeras deri v ed from TEs absent from the r efer ence genome, or when the user has a high-quality genome assembly for each individual / cell / strain; whereas Mode 2 is a genome-blind approach, with the ability to predict chimeric transcripts without the assembled genome, but missing chimeras where the TE is smaller than the length of the reads.
We analyzed ovarian RNA-seq from four D. melanogaster wild-type strains, for which we have genome assemblies.Altogether, we found that ∼1.12% of all genes are generating chimeric transcripts in ovaries, following the pr oportion fr om previous studies obtained with ESTs, and RNA-seq in midbrain tissue ( 31 , 34 ).Furthermore, our r esults r e v ealed that 50% of all TE-e xonized transcripts with TEs embedded deri v e from roo elements, and most specifically, a small region between tandem repeats in the 5 UTR and the beginning of the roo ORF.These results suggest that these roo insertions have neutral or advantageous effects as they are maintained within these gene transcripts.Howe v er, we did not provide enough support to claim these roo -exonized transcripts as exaptation or domestication e v ents, due to the lack of e vidence regar ding the functional role of these chimeras.Further studies must be performed to clarify this subject, mainly because chimeric transcripts detected by ChimeraTE can be degraded by surveillance pathways that degrade aberrant mRNA, such as non-sense-mediated mRNA decay ( 118 ), no-go decay ( 119 ), and non-stop decay ( 120 ).
Altogether, this new approach allows studying the impact of new mobilization events between populations or be-tween treatment conditions, providing insights into biological questions from a broad community of r esear chers, ranging from cancer r esear ch, to population transcriptomics, and adapta tion studies.ChimeraTE implementa tion will be useful for the ne xt discov eries regar ding the e volutionary role of TEs and their impact on the host transcriptome.

DA T A A V AILABILITY
ChimeraTE is an open-source collaborati v e initiati v e available in the Zenodo repository ( https://doi.org/10.5281/zenodo.8143045).The RNA-seq data used in this study are available in the NCBI BioProject database ( https://www.ncbi.nlm.nih.gov/bioproject/ ), under PRJNA795668 accession, while the long-read ONT data is available under PR-JNA956863.

SUPPLEMENT ARY DA T A
Supplementary Data are available at NAR Online.

Figure 1 .
Figure 1.ChimeraTE Mode 1 (genome-guided) workflow.Round white boxes: input data; square boxes: pipeline step; round gray boxes: thresholds that can be modified.( A ) Input data: fasta file with the genome assembly, gtf files with gene and TE annotations, as well as stranded paired-end reads from RN A-seq (fastq).( B ) Alignment: RN A-seq alignment to the genome is used to calculate gene expression levels.Genes with FPKM < 1 are removed from downstream analyses.A subsequent list of reads that have aligned against genes or TE insertions is created.( C ) Chimeric read detection & filtering: both r ead lists ar e then compar ed and r ead pairs that have common r eads between the two lists are named chimeric r eads, i.e. pair ed-end r eads mapping to a gene and a TE copy.The average of these reads between replicates is used as chimeric read coverage for each putative chimeric transcript.All putative chimeras are then processed with three ChimeraTE scripts to categorize them into TE-initiated, TE-exonized, and TE-terminated transcripts.These steps are perf ormed f or all RN A-seq replicates.Finall y, all chimeric transcripts present in at least 2 replicates and with at least 2 chimeric reads on average between replicates are maintained.( D ) Chimeric transcripts: Three predictions obtained from Mode 1. Blue bo xes: ex ons; red bo xes: TEs; arro whead in between TE and ex on bo xes: transcription sense; blue and red boxes linked by a line: chimeric reads.The ChimeraTE mode 1 output is divided into three predictions: (i) TE-initiated transcript: the TE insertion is located upstream of the gene region; (ii) TE-exonized transcript: the TE insertion is present within e xons (embedded), ov erlapping e xons (ov erlapped), or intr ons (intr onic); (iii) TE-terminated transcript: the TE insertion is located downstream of the gene region.

Figure 2 .
Figure 2. ChimeraTE Mode 2 (genome-b lind)wor kflow.Round white boxes: input data; square boxes: pipeline step; round gray boxes: thresholds that can be modified.( A ) Input data: two fasta files containing r efer ence transcripts and TE insertions, as well as stranded pair ed-end r eads from RNA-seq (fastq).( B ) Alignment and chimeric reads:The alignment against transcripts is performed and their expression is calculated.Transcripts with FPKM < 1 are removed from the downstream analysis.Next, a list of reads aligned against transcripts is created.Through the alignment of reads against TE insertions, a second list with reads stemming from TEs is also cr eated.Then, mapped pair ed-end r eads and singletons are identified, generating the list of chimeric reads, for all replicates.All chimeric transcripts that have an average of chimeric reads > = 2 and are present in > = 2 replicates are maintained as true chimer as. ( C ) Tr anscriptome assembly and chimeric reads: The de novo transcriptome assembly is a non-default option of ChimeraTE Mode 2. It performs a transcriptome assembly and aligns reads against the assembled transcripts.Then, TE insertions in the assembled transcripts are identified with RepeatMasker and the TE reads are recovered.Using the two lists of reads (transcripts and TEs), the chimeric read list is generated and the putati v e assembled chimeric transcripts ar e pr edicted.Next, a blastn is performed between these transcripts and the r efer ence transcripts provided in the input.All transcripts with length > = 80% are selected.The process is repeated for all RNA-seq replicates and chimeric transcripts assembled > = 2 replicates are maintained as true chimeras.( D ) Chimeric transcripts: if the assembly is activated, ChimeraTE mode 2 provides three outputs: (1) Chimeric reads: predicted only based on the method depicted in B; (2) Assembled transcripts: predicted only based on the transcriptome assembly method depicted in C; and (3) Double evidence: predicted by both methods -B and C-.

Figure 3 .
Figure 3. General results from Chimera Mode 1. ( A ) Fi v e e xamples of chimeric transcripts manually curated with the IGV genome bro wser.Red bo xes: TE insertion; b lue boxes: e xons and UTRs; b lack density graphs: cov erage of RNA-seq reads; head arrows: transcription sense, blue and r ed box es linked by a line: chimeric reads.( B ) Total genes generating chimeric transcripts, following the TE position classification in the four wild-type strains.

Figure 4 .
Figure 4. ( A ) Total number of chimeric transcripts found by ChimeraTE mode 2. "Assembled transcripts": chimeric transcripts detected only by the method of transcriptome assembly (Figure2C)."Chimeric reads": chimeric transcripts detected only by the method of chimeric reads (Figure2B)."Double evidence": chimeric transcripts detected by both methods.( B ) Correlation between chimeric read coverage and gene expression of chimeric transcripts found by Mode 2. In all strains, we did not observe correlation between gene expression (FPKM) and the coverage of chimeric reads between the three categories of evidence.All double evidence chimeras were found with high coverage of chimeric reads, suggesting high reliability of chimeras when both methods (chimeric reads and transcriptome) are considered together.

Figure 5 .
Figure 5. Mps1 gene and its downstr eam r egion in the genomes of the four wild-type strains.( A ) Mps1 in the dm6 r efer ence genome and the alt gene located downstream to it, on opposite strands and with overlapped 3' UTRs.( B ) In the dmsj23 and dmsj7, Mps1 and alt are distributed as found in the r efer ence genome.( C ) In dmgoth101 and dmgoth63, there is a FB insertion located downstream to Mps1 , which has chimeric reads supporting a TE-terminated downstream in both strains.

Figure 6 .
Figure 6.The frequency of the 76 TE families generating chimeric transcripts in the four wild-type strains.In chimeric transcripts deri v ed from TEs near genes, INE-1 was the most frequent (15%) in TE-terminated downstr eam, wher eas for TE-initiated upstream, S2 , INE-1 , hobo , 1360 and 412 had the same frequency (11%).Regarding TEs inside genes, the roo element has the highest frequency of TE-exonized embedded, representing 50% of all chimeras.In TE-exonized intronic and overlapped, INE-1 was the most frequent TE family.
); w hereas roo element was observed in only one chimera ( Lk6roo , in dmgoth63 and dmsj23) as a TE-exonized overla pped.Notabl y, the 1360 element is among the most frequent TE families in all TEexonized categories.The other 38 families have an average frequency of 2%.Finally, in TE-terminated transcripts, INE-1 is the most frequent TE, with 15.5% of all cases, followed by 1360 and 412 , with 13% and 11%, respecti v ely.Taken together, these results suggest that INE-1, S2, 412 and 1360 are candidate TE families to investigate the exaptation of regulatory motifs since they were the most abundant in TE-initiated and TE-terminated transcripts.On the other hand, in TE-exonized transcripts, roo might have a potential for exapta tion / domestica tion of both regulatory and protein domain motifs, since it is observed with high pre valence within e xons (UTRs and CDSs).Regar ding TEe xonized transcripts deri v ed fr om TEs at intr onic regions and overlapping with exons, INE-1 and 1360 are the families with the highest frequency of intron retention, suggesting they might harbour splice sites allowing incorporation into the gene transcripts.

Figure 7 .
Figure 7. ( A ) Completeness of protein domains identified in TE insertions generating TE-exonized transcripts, with p-value < 0.05.( B ) Alignment depth with roo consensus and embedded roo insertions generating TE-exonized transcripts.At the top, the scheme of the full-length r oo element: br o wn bo xes:LTRs; yello w bo x: first tandem repea ts a t 5' UTR; blue box: second tandem repea t a t 5' UTR; r ed box: Open r eading fr ame (ORF).The cover age depth of the multiple alignments between embedded roo insertions and the consensus is separated by strain.

Figure 8 .
Figure 8.The 76 chimeric transcripts deri v ed from TE insertionsthat are absent in the dm6 r efer ence genome, 5.19% of them corr espond to TEs located upstream, 74.03% to TEs located inside genes (introns and exons), and 20.78% to TEs located downstr eam.( A ) TE upstr eam: Chimeric transcripts in which the TE is located up to 3kb upstream of the gene.( B ) TE inside: Chimeric transcripts with TE insertions located inside the gene region (exons and introns).Ther e ar e f our chimeric transcripts f ound in all str ains, and one specific to French str ains.( C ) TE downstream: Chimeric transcripts in which the TE is located up to 3kb downstream of the gene.Only Mps1-FB was specific to French strains, whereas all the other 15 chimeras are strain-specific.

Figure 9 .
Figure 9. Normalized counts from DEseq2 indicating gene expression of genes generating polymorphic chimeric transcripts among the four wild-type strains.Colorful forms r epr esent differ entially expr essed genes (adj.p-value < 0.05); transpar ent forms r epr esent non-differ entially expr essed genes (adj.p-value).( A ) Expression le v el of genes producing chimeric transcripts only in dmgoth101, in comparison to dmgoth63, dmsj23, and dmsj7.( B ) Expression le v el of genes producing chimeric transcripts only in dmgoth63, in comparison to dmgoth101, dmsj23, and dmsj7.( C ) Expression le v el of genes producing chimeric transcripts only in dmsj23, in comparison to dmgoth101, dmgoth63, and dmsj7.( D ) Expression le v el of genes producing chimeric transcripts only in dmsj7, in comparison to dmgoth101, dmgoth63, and dmsj23.Overall, gene expression does not change when comparing strains with and without chimeric transcripts.