Analysis of 19 Heliconiine Butterflies Shows Rapid TE-based Diversification and Multiple SINE Births and Deaths

Transposable elements (TEs) play major roles in the evolution of genome structure and function. However, because of their repetitive nature, they are difficult to annotate and discovering the specific roles they may play in a lineage can be a daunting task. Heliconiine butterflies are models for the study of multiple evolutionary processes including phenotype evolution and hybridization. We attempted to determine how TEs may play a role in the diversification of genomes within this clade by performing a detailed examination of TE content and accumulation in 19 species whose genomes were recently sequenced. We found that TE content has diverged substantially and rapidly in the time since several subclades shared a common ancestor with each lineage harboring a unique TE repertoire. Several novel SINE lineages have been established that are restricted to a subset of species. Furthermore, the previously described SINE, Metulj, appears to have gone extinct in two subclades while expanding to significant numbers in others. Finally, a burst of TE origination corresponds temporally to a burst of speciation in the clade, potentially providing support to hypotheses that TEs are drivers of genotypic and phenotypic diversification. This diversity in TE content and activity has the potential to impact how heliconiine butterflies continue to evolve and diverge.

. We examined the raw RepeatMasker output from each 174 genome for the presence of any ZenoSINE elements greater than 100 bp in length. Counts ranged from 5-175 21 in the melpomene and sylvaniform clades. Sixty-two were found in A. vanillae, and only 12 were 176 identifiable in Dryas iulia. Examination of the extracted hits on a clade by clade basis reveals that relatively 177 few are likely to be genuine ZenoSINE elements. All of the hits from A. vanillae and members of the 178 melpomene and sylvaniform clades were about half the size of the average ZenoSINE consensus, truncated 179 at the 5' end. For hits in the D. iulia genome, the hits were also short but the truncation occurred at the 3' 180 end. We suggest that the vast majority, if not all, such low-copy-number hits in Table 2 follow are similarly 181 false positives. 182 Previous efforts by Lavoie et al. (2013)were unsuccessful in determining the LINE partner for Metulj, but 193 based on our more complete analysis of the TE content of all 19 genomes, we now suspect that it is 194 mobilized by an RTE family LINE (Supplemental Figure 7A). ZenoSINE,Fritillar,and Flambeau show 195 similarity between their tails and the tail of LINEs from the Zenon family (Supplemental Figure 7B), 196 suggesting a similar relationship. Flambeau exhibits 3' similarity with R1 LINEs (data not shown). 197 The SINE tail may influence the success of the SINE in hijacking the LINE enzymatic machinery 198 at the ribosome (Dewannieux and Heidmann 2005). Our investigations into the evolution of the 3' tail 199 revealed informative patterns (Supplemental Figure 8). In most Heliconius, young Metulj show a distinct 200 bias toward A and T over G and C and A:T ratios are biased slightly toward T in young insertions, a signal 201 not observed in older elements. The A prevelance over C and G is slightly higher in members of the erato 202 and sara clades and distinctly higher in D. iulia, A. vanillae, and H. doris. 203 Because of the apparent partnership that has evolved between these SINEs and their partner LINEs, 204 one might expect similar recent accumulation profiles. However, no relationship between the accumulation 205 patterns is easily resolvable (Figure 4). Indeed, while there does appear to be some mirroring in H. doris, 206 H. burneyi, and possibly in the erato and sara clades, the accumulation patterns observed in melpomene and 207 sylvaniform are essentially opposite. A similar lack of correspondence in landscapes is apparent for 208 ZenoSINE and Zenon LINEs. Examining correllations between recently accumulated SINEs and LINEs 209 also reveals no discernable pattern (Supplemental Figure 9). While the expected high correspondence 210 between ZenoSINE and Zenon LINEs is observed, so are high correlations with Dong and RTE-BovB. 211 Further, the expected correlation between RTE-type LINEs and Metulj is not observed. 212

SINE Birth and Death: 213
Four of the novel SINEs likely originated recently within the Heliconiini. A BLAST search of all 214 taxa excluding Heliconius in the NCBI WGS database using ZenoSINE consensus sequences yields only 215 severely truncated and low similarity hits in the genomes of other lineages. Analysis of Fritillar suggested 216 that it is restricted to A. vanillae, strongly suggesting that it originated in that lineage. The BLAST search 217 produced 12 high similarity, partial hits to the fellow nymphalid butterfly Vanessa tameamea (the 218 Kamehameha butterfly, GCA_002938995.1). The hits are limited to the 5' (likely tRNA-derived) half of 219 Metulj are present in all species examined, suggesting that their origin is more ancient but at least 225 prior to the diversification of the Heliconiini. A BLAST search of the NCBI WGS database yields hits only 226 in heliconiine genomes thus far deposited with NCBI. Similar results are obtained by a broader search of 227 all insect nucleotides in the database. Thus, while a specific point of origin cannot be identified, we suggest 228 that Metulj originated with the clade or shortly thereafter. The lack of any substantial recent accumulation 229 in members of the melpomene and sylvaniform clades (Figure 3) strongly suggest that Metulj is dead or 230 dying in those lineages. 231 TE origination rates: 232 Table 2 details the rates at which various branches in the phylogeny gained novel TEs. In agreement 233 with much of the data presented above, the erato and sara clades along with H. doris and the three outgroups found no significant correlation (p=0.11). However, we did find a marginally significant correlation of 251 genome size with TE count (p=0.0165). We repeated the analysis accounting for phylogenetic relatedness 252 using the pic function in the R package ape v5.1 using a tree generated from concatenated, non-coding, Because these species diverged very recently, we hypothesized that recent insertions may be more 257 relevant for differences in genome size. However, this was not consistent with the data. When only 258 considering TE insertions with divergence values less than 0.05, we found no association of genome size 259 with either TE length (p=0.0891) or TE count (p=0.482). 260 To determine if any one element could be influencing genome size evolution, we next classified 261 each insertion based on both class and family and analyzed each independently. For the full data (recent 262 and old elements), after correcting for multiple comparisons, only I.Nimb elements were significantly 263 associated with genome size (I.Nimb length p=5.17e -5 , I.Nimb count p= 8.76e -5 ). However, I.Nimb 264 elements make up only a small fraction of the genome, and the pattern appears to be driven by two outliers, 265 H. telesiphe and E. tales (Supplemental Figure 10). For the recent elements, again a single element, 266 Penelope, is associated with genome size (Supplemental Figure 11), but here the association is with count 267 alone, and again it appears to be driven by the high density of Penelope in H. telesiphe (Penelope length 268 p=.059, Penelope count p=1.1e -4 ). These observations suggest that there are substantial differences in the ways that each species deals 289 with genomic stress caused by TE mobilization and that TE defense strategies diverge rapidly in each 290 lineage. This is consistent with the model of piRNA clusters acting as TE 'traps' in which, upon an 291 element's insertion into a cluster, a piRNA-based defense against that element is mounted (Lu and Clark  The reason for the death of Metulj in the latter clades is unclear, as is the cause of the resurgence 316 in other species. Why any SINE goes extinct is unknown and could be influenced by multiple factors 317 including genomic defenses, the partner LINE, mutations in the SINEs themselves, and population genetic new subfamilies that are unique to this clade suggests a simple cessation of retrotransposition. If we are correct in our conclusion that RTE LINEs are responsible for Metulj mobilization, some clues may be found 321 by examining those elements. One potential explanation is to view the SINE-LINE relationship not as a 322 partnership but as a competition for the enzymatic machinery produced by LINEs. If the SINEs are 323 particularly effective at hijacking that machinery, it may be possible for them to suppress LINE 324 mobilization to some extent, even to the eventual demise of the LINE partner, as was recently hypothesized 325 in sigmodontine rodents (Yang et al. pers. comm.). Our analysis of Metulj tails suggests that the ancestral 326 tail of Metulj SINEs was A-rich and that a switch toward tails containing more T residues may be involved 327 in the success of this SINE in the erato and sara clades. This hypothesis does not, however, hold true for D. 328 iulia, A. vanillae, or H. doris, which have all experienced high rates of recent Metulj accumulation but 329 exhibit a bias toward A nucleotides in their tails. These results suggest that the reasons for the differential 330 success in heliconiine genomes may be many, and complex. 331 Not surprisingly, the outgroup species, with their deeper divergences, exhibit their own unique 332 patterns. D. iulia, with the highest proportion of Metulj in its genome, experienced a recent surge in 333 accumulation that outpaced any other heliconiine examined. E. tales mirrors the erato and sara clades while 334 A. vanillae appears to have experienced a gradual increase in accumulation very recently. 335 Previous analyses (Lavoie, et al. 2013) suggested that larger TEs were removed via non-336 homologous recombination. This hypothesis is not refuted by our data. Examination of the TE landscape 337 plots described above suggests that, unlike the pattern observed in mammalian genomes, where TEs remain 338 as molecular fossils over large swaths of evolutionary time (Lander, et al. 2001;Waterston, et al. 2002), 339 there is substantial turnover of TEs in these butterfly genomes. For example, when examining the temporal 340 accumulation landscapes of Metulj, a SINE that averages well under 300 bp, we can readily see evidence 341 of ancient accumulation (Figure 4). The LINE TE classes exhibit much less clear signatures: we rarely see 342 ancient peaks in accumulation plots (Supplemental Figure 12). This suggests that these genomes can rapidly 343 diverge over evolutionary time once reproductive isolation is acquired, with distinct lineages retaining little 344 ancient TE-derived homology from larger elements across their genomes. 345 Drosophila reproductive isolation (Petrov, et al. 1995). 373 TEs have been shown to respond to environmental stressors, thereby leading to substantial genomic 374 instability (Rey, et al. 2016). Such instability has the potential, in turn, to provide novel genotypes and 375 phenotypes upon which selection can act, either through direct changes to coding regions (Clark, et al. more closely related lineages, we demonstrate that even over relatively short periods, the TE landscapes in 389 members of a single genus can diverge rapidly due to differential TE dynamics. Lineages whose common 390 ancestor harbored a single complement of TEs now play host to very distinctive complements of recently 391 active TEs with patterns that resemble genomic fingerprints. Even in the case of LTR accumulation, where 392 no significant difference exists with regard to overall accumulation amounts, the identities of the elements 393 SINEs were inspected for a repetitive tail and A and B boxes. LTR retrotransposons were required to have 446 recognizable hallmarks such as TG, TGT or TGTT at their 5' and the inverse at the 3' ends. Because of the 447 complexity of SINE evolution, putative SINEs were analyzed uniquely as described below. While 448 sequences in the unknown category could be transposable elements, they formed only a very small fraction 449 of the total putative TE sequence, and they could also represent segmental duplications or other non-TE

SINEs: 454
SINE evolution is complex and identifying subfamily structure is a difficult problem, primarily due 455 to the high number of insertions typical of a genome. Initial analysis suggested three SINE families in these 456 genomes. The first is the previously described Metulj family. The second is a novel family that appears to 457 be derived from the fusion of Zenon LINE 3' tails with a 5' head of unknown origin, which we call 458 ZenoSINE. A small subfamily distinct from the main ZenoSINE family was identified in and restricted to 459 the A. vanillae genome A. vanillae is commonly known as the Gulf Fritillary. Thus, we dubbed this 460 subfamily 'Fritillar'. Finally, a third family that is derived from R1 elements is restricted to D. iulia. One 461 common moniker for this species is 'flambeau' and we suggest the same name, Flambeau, for this family 462 of SINEs. 463 Metulj SINEs were far more numerous and widespread than their ZenoSINE cousins (discussed 464 below), and therefore represented a more difficult analytical problem. A recently developed network-based 465 method for subfamily (aka community) detection was used to identify Metulj subfamilies (Levy, et al. identical consensus sequences were merged (such pairs were identified using BLAST requiring 100% 481 identity and 95% query coverage). Consensus sequences were computed per subfamily and were used to 482 refine the subfamily annotation, resulting in a final set of 2,493 subfamilies (Supplemental File 2). This set 483 was further grouped into 147 clusters to simplify downstream analyses using cd-hit-est. Clustering criterion by non-homologous recombination. As a result, we focused on the LINE open reading frame to increase 488 the potential for comparable data. A special effort was made to identify full-or near full-length open reading 489 frames (ORFs) for each clade. First, we identified all known LINE elements from the H. melpomene 490 genome in RepBase. These were combined with any LINEs identified in our de novo analysis after 491 removing possible duplicates. All remaining elements were filtered, retaining any with intact ORFs of at 492 least 2kb, starting with methionine, and with clearly identifiable start and stop codons using 'getorf' from 493 the EMBOSS package (Rice, et al. 2000). 494 To identify subfamily structure of LINEs, phylogenetic analysis of these ORFs was accomplished 495 by masking each genome with the resulting library and retaining any hits of 1.5kb or longer. Generally, 496 extracted hits were aligned using MUSCLE and subjected to a neighbor-joining (NJ) analysis (described 497 below). However, large numbers of hits impeded efficient alignment in some cases due to memory 498 limitations. To work around this problem, we reduced the number of hits by randomly selecting smaller 499 numbers of sequences from the pool and re-aligning until successful. In some of these cases, there was a 500 lack of overlapping sites that impeded the NJ analysis. In these cases, we extended our filter to include hits 501 that were at least 2kb, producing the needed overlapping regions. 502 Each set of aligned ORFs was subjected to NJ analysis to identify any apparent structure. NJ  The 3' tails of Metulj elements exhibited substantial complexity, with a variety of structures 535 including poly-A tracts, poly-T tracts, repeated ATTTA motifs, and repeated GATG motifs, among several 536 others. Based on previous work, we suspected that differences in the tail may influence relative success in 537 retrotransposition (Dewannieux and Heidmann 2005; Ohshima and Okada 2005). To investigate how tail 538 structure evolved, we extracted 100 random full-length Metulj insertions from each taxon. Each set of 539 extracts was aligned to representative consensus sequences. This was repeated ten times for each taxon. 540 The 3' ends of each alignment were degapped starting where the tail begins and the ratios of each pair of 541 nucleotides was identified and plotted after log-transformation. This was conducted separately for 'old' and 542 'young' SINEs. 543 To determine if either Metulj or ZenoSINE accumulation patterns were correlated with any LINE 544 elements, Pearson correlation coefficients based on proportion of each genome occupied were visualized 545 using the "corrplot" package in R and RStudio v1.0.143. 546

TE origination rates: 547
To estimate approximate rates that lineages evolved new TE lineages, we calculated the number of 548 branch-specific TEs using RepeatMasker output. A TE was scored as 'present' (score = 1) in a genome if 549 at least 5000 bp of sequence attributable to that TE was identifiable in the genomes of terminal branches. sum threshold scores were subclade specific based on the number of species examined. For example, the 553 erato subclade, with four members, had a presence sum threshold of 3.5. Branch times were obtained using 554 the median scores for each node calculated using TimeTree (Kumar, et al. 2017). Rates of TE origination 555 were calculated by dividing the number of branch-specific insertions by the time that the branch likely 556 existed. 557 We estimated lineage-specific DNA contributions to selected branches of the tree by identifying 558 DNA that was deposited by novel TEs that evolved on those branches. We then calculated both the genome 559 proportions occupied by those elements and the total bp. For example, we summed the total contributions 560 made by each of the 118 novel TEs identified in the D. iulia genome (Table 2). Similarly, we summed total 561 the total bp deposited by each novel TE identified on the erato-sara common branch in each member of 562 those clades and calculated the mean (Supplemental File 1). 563

Genome size correlations: 564
Using the annotations generated, we compiled summary statistics of transposable element content 565 in each heliconiine genome, in terms of TE bases per base pair (TE length) and number of insertions per 566 base pair (TE count). We obtained genome size estimates from Edelman et al. (submitted). Because the 567 absolute values of these measures are several orders of magnitude apart, we Z-transformed each category 568 by subtracting the mean and dividing by the standard deviation. 569

Acknowledgments 570
The authors wish to thank the College of Arts and Sciences at Texas Tech University for funding 571 related to this work. In addition, we would like to thank the Texas Tech HPCC 572 (http://www.depts.ttu.edu/hpcc/) for providing computational resources necessary to complete this 573 project. Angela Peace provided assistance with early conceptual analyses. The 20-genome Heliconius 574 project was funded by a SPARC Grant from the Broad Institute of Harvard and MIT as well as startup and 575 studentship funds from Harvard University. 576 Table 1. Total numbers of SINE insertions >100 bp present from each family described in the 19 genomes 757 examined. Color coding indicates relative counts, darker green depicts higher numbers in each category. 758  The combined plot at the top represents 'all' data. Species and their phylogenetic relationships (Figure 1) are depicted on the X-axis. Values on the Y-axis are genome proportions calculated as described in the text.
Abbreviations are as described in Supplemental Table 1. Briefly, the first letter indicates genus, and the following three (or four) letters, except in the cases of H. hecale and H. hecalesia, indicate species as listed in Figure 1.