Relaxation of Natural Selection in the Evolution of the Giant Lungfish Genomes

Abstract Nonadaptive hypotheses on the evolution of eukaryotic genome size predict an expansion when the process of purifying selection becomes weak. Accordingly, species with huge genomes, such as lungfish, are expected to show a genome-wide relaxation signature of selection compared with other organisms. However, few studies have empirically tested this prediction using genomic data in a comparative framework. Here, we show that 1) the newly assembled transcriptome of the Australian lungfish, Neoceratodus forsteri, is characterized by an excess of pervasive transcription, or transcriptional leakage, possibly due to suboptimal transcriptional control, and 2) a significant relaxation signature in coding genes in lungfish species compared with other vertebrates. Based on these observations, we propose that the largest known animal genomes evolved in a nearly neutral scenario where genome expansion is less efficiently constrained.


Collection of samples, library preparation and transcriptomic sequencing
Two adult male specimens of N. forsteri, with estimated age of 45 and 39 years (Fallon et al. 2019), respectively, were sampled with an electrofishing boat in the Brisbane River, southeast Queensland, Australia.The specimens were anaesthetized with AQUI-S anesthetic, transported on ice back to the laboratories of the Griffith University.The required permits for sampling for scientific purposes were obtained from the Queensland authorities (Australian Ethics Committee protocol number ENV/03/18/AEC; Queensland fisheries permit number 182081).
The specimens were dissected and samples from all the main organs and tissues were immediately placed in RNAlater solution (Sigma) and stored at -80°C until extraction.Homogenized lung, brain, liver and testis samples were selected for total RNA extraction.RNA was extracted using Tri Reagent (Sigma) following the manufacturer's protocol, using 1-Bromo-3-chloropropane for phase separation and 50:50 isopropanol and high salt solution (0.8M sodium citrate, 1.2M sodium chloride) for precipitation.RNA integrity was assessed on an Agilent 4200 Tapestation High Sensitivity ScreenTape.To provide a broader representation of the landscape of mRNAs expressed in lungfish, the RNAs extracted from 5 additional tissues (i.e.kidney, blood, eye, muscle, and heart) in the first individual were mixed in equimolar quantities to obtain a pooled sample.
Upon an initial check that a suitable quality and quantity of extracted RNAs was available for all samples, carried out with Nanodrop 2000 (Thermo Fisher Scientific, Waltham, MA, USA) and Tapestation 2200 (Santa Clara, CA, USA) instruments, barcoded stranded polyA-selected cDNA libraries compatible with Illumina sequencing were prepared with a TruSeq Stranded mRNA kit (Illumina, San Diego, CA, USA).
All libraries were prepared and sequenced by Omega Bioservices (Norcross, GA, USA) and Australian Genome Research Facility (AGRF, Melbourne, VIC, Australia).The libraries obtained from the first individual were sequenced on a HiSeq 2500 platform (Illumina, San Diego, CA, USA) with a 2x150 paired-end (PE) strategy, providing the reference material for the subsequent de novo transcriptome assembly process.The libraries obtained from the second individual were sequenced on a NovaSeq 6000 platform (Illumina, San Diego, CA, USA) using a 1x100 strategy, generating single-end (SE) reads that provided a biological replicate for an accurate evaluation of transcriptional activity.

De novo transcriptome assembly and functional annotation
All raw reads were demultiplexed, quality-checked, and trimmed according to base-calling quality scores and the presence of residual Illumina sequencing adaptors and barcodes using FastQC v. 0.11.9 (Andrews 2010).The first 15 nucleotides at the 5'end of each read were trimmed to remove the compositional bias observed in this region due to the non-random adapter ligation (Wright et al. 2019).Trimmed reads shorter than 75 bp were discarded.
The trimmed PE reads from the first individual were de novo assembled using the Oyster River Protocol v.2.3.1 (MacManes 2018), which combines the outputs of three different assemblers, i.e.Trinity (Grabherr et al. 2011), SPAdes (Bankevich et al. 2012) and TransABySS (Robertson et al. 2010) to generate a nonredundant highly representative transcriptome.Poorly expressed transcripts, possibly resulting from exogenous contamination or otherwise not biologically relevant, were removed by setting the TPM_FILT parameter to 1.
The completeness of the transcriptome assembly was evaluated by the analysis of a highly conserved set of single-copy orthologous genes shared by all vertebrates, extracted from OrthoDB v.10 ( Kriventseva et al. 2008), using BUSCO v.3 (Simão et al. 2015), computing the fraction of complete, fragmented and missing orthologs.

Assessment of transposon activity
A de novo library of the repeated elements found in the reference genome of N. forsteri (Meyer et al. 2021) was generated with RepeatScout v.1.0.5 (Price et al. 2005).The output of this process was analysed with RepeatMaskerv.4.1.2 (Smit 2015), which allowed the removal of the sequences observed less than 10 times in the entire genome, and therefore unlikely to represent bona fide transposable elements.All the elements included in the filtered library were then classified with the Transposon Classifier RFSB tool included in the TransposonUltimate bundle (Riehl et al. 2022).
The trimmed reads obtained from lung, brain, liver, and testis in both individuals were individually mapped against the de novo transcriptome assembly and the filtered repeat library with Salmon v. 1.3.0 (Patro et al. 2017).The global transcriptional activity of transposable elements was computed by calculating the fraction of the reads mapped to the repeat library for all samples, relative to the number of reads mapped to the complete transcriptome assembly.

Assessment of pervasive transcription
Different approaches were used to evaluate the rate of pervasive transcription in lungfish.First, the contigs included in the newly generated de novo transcriptome assembly were matched with the complete gene sequences annotated in the reference genome (Meyer et al. 2021) to identify non-coding transcripts putatively derived from intergenic regions.This task was carried out with a BLASTn search, based on an evalue threshold of 1 × 10 −10 and included all annotated exon sequences.Assembled contigs lacking significant matches were further inspected for the presence of Open Reading Frames (ORFs) larger than 100 codons with Transdecoder v.5.5.0, to remove genuine protein-coding transcripts encoded by unannotated or wrongly annotated genes, based on the previous report of the suboptimal BUSCO score of the reference genome annotation (Meyer et al. 2021).The remaining contigs were also screened with RepeatMaskerv.4.1.2,removing those characterized by the presence of any element included in the repeat library described in the previous section, ruling out an association with transposable element activity.
Finally, the presence of transcripts completely embedded within the introns of annotated genes was assessed with a BLASTn search, using the same e-value threshold listed above, only retaining hits with 100% query coverage.All resulting transcripts lacked protein-coding potential, significant sequence homology with the annotated gene set and repeated elements, being therefore considered as the likely product of pervasive transcription.
To further investigate the impact of this phenomenon at a comparative level, gene annotations and multiple publicly available RNA-Seq datasets were retrieved from NCBI for a number of representative bony fish species and for one amphibian with a giant genome, the axolotl Ambystoma mexicanum (see table S1 for the complete list of SRA accession IDs).To improve the detection of pervasively transcribed genomic regions in N. forsteri, the sequencing data generated in this study was complemented with additional publicly available libraries generated from a juvenile individual of undetermined sex, which only included the brain, liver and gonad tissues (BioProject accession ID: PRJNA645042).
A snakemake (Köster and Rahmann 2018) pipeline was built by combining several tools, i.e., fastp (Chen et al. 2018) for raw read trimming, STAR (Dobin et al. 2013) for mapping trimmed reads against the reference sequence datasets, samtools (Danecek et al. 2021), bed tools (Quinlan and Hall 2010) and bedops (Neph et al. 2012) for handling and analysing mapping files.Briefly, mapping only retained reads marked with a single non-ambiguous alignment against the reference genome (i.e.those flagged with NH:i:1), and all multi-mapping read were discarded.To prevent the inclusion of misalignments, stringent mapping parameters were used to exclude low scoring matches (-q parameter was set to 20), and only reads mapped in pairs were used to assess pervasive transcription.The read mapping files derived from the alignment between all available RNA-seq datasets and annotated reference genomes for each species were analysed with the aim to identify stretches of intergenic genomic sequence displaying high mapping coverage (i.e. at least 20X higher than expectations in case of random mapping), indicative of transcriptional activity.To further mitigate the impact of non-specific mappings, only regions longer than 254 nucleotides (i.e. the median exon length of the genes annotated in N. forsteri) were taken into account.
The identified regions were then used to compute the fraction of the genome subjected to pervasive transcription, both in terms of primary and spliced transcript size (i.e. by including or excluding introns, respectively).

Orthologous sequence recovery
To estimate the rate of evolution of the lungfish lineage, we analyzed the coding sequence (CDS) from a set of orthologous genes expected to be highly conserved and not subjected to duplication in all vertebrates.
These sequences were extracted from the genomes (where available) and/or transcriptomes of several selected species (table S2).Transcriptome assemblies were retrieved from the NCBI TSA database or, alternatively, de novo assembled as described above for the Australian lungfish, using the raw RNA-seq data retrieved from the SRA database as an input.Orthologous genes were recovered with BUSCO v.3 (Simão et al. 2015) based on the matches obtained against the vertebrate OthoDB v.10 dataset (Kriventseva et al. 2008).Only unambiguous matches (i.e.complete and not duplicated) were subjected to further analysis.The extracted CDS for each ortholog were aligned with MACSE v.0.9b1 (Ranwez et al. 2011) and alignments were manually inspected to identify extremely divergent sequences in one or a few species, which may indicate the presence of a paralogous gene or of incorrect gene predictions in that (or those) specific lineages.In those cases, the entire gene was removed from the dataset.Finally, the multiple sequence alignments were concatenated and trimmed with BMGE (Criscuolo and Gribaldo 2010), run in a codon mode (selecting a BLOSUM95 substitution matrix), removing all gapped positions.

Inference of phylogenetic tree
A maximum-likelihood tree was inferred with IQ-TREE v1.6.10 ( (Nguyen et al. 2015) using a set of 1790 conserved single-copy genes (BUSCO) from 15 vertebrate species.The dataset was partitioned by a single gene and a substitution model was selected for each partition by the built-in ModelFinder option MFP+MERGE (Kalyaanamoorthy et al. 2017) and 1,000 ultrafast bootstrap replicates (Hoang et al. 2018) were performed.Final figures were made using iTOL (Letunic and Bork 2019).

Molecular evolution analyses: dN/dS estimates
Starting from the dataset of 1,790 BUSCO genes, dN/dS rates (ω) were computed with CODEML, included in the PAML v.4.8 package (Yang 2007), testing different models.In particular, to identify signatures or relaxation of natural selection we performed a branch test, which can detect a different rate of evolution in only one (or a group of) lineage of a tree (Yang 1998).In this analysis, the branches of the phylogenetic tree are divided into foreground and background branches and selection is tested with a comparison of two models.We used the contrast of the one-ratio model in which the foreground and the background have the same ω value (ωfg = ωbg, the null) and the two-ratio model in which the foreground and the background have two different ω values (ωfg ≠ ωbg, the alternative).In particular, ωfg > ωbg indicates positive or relaxed selection acting on the foreground branches, while ωfg < ωbg indicates an excess of constraint in the foreground.In the first analysis, lungfish (foreground) were compared with the rest of the tree (background, SR1 in fig. 2b main text).Additional partitions of the tree (SR2-SR6) were also considered to confirm that the signal showed by the lungfish clade was robust.Each of the new partitions was designed to account for a specific bias or clade-specific effect.In detail, to confirm that the result of SR1 was lungfish-specific, tetrapods and lungfish were switched, and tetrapods were compared with the rest of the tree (SR2).To further control for the possible effect of having lungfish in the background, tetrapods were compared with fish, lungfish excluded (SR3).The next two partitions, SR4 and SR5, were designed to evaluate the evolutionary signal of the two giant genome urodeles (A.mexicanum and C. orientalis).Finally, lungfish were compared with the other fish of the dataset for direct comparison with the results of SR3-SR5 analyses.We also used each lungfish branch as foreground branch against the rest of the phylogeny in 9 independent analyses.To identify whether the relaxation signature was gene specific (i.e. the same genes are relaxed in all the lungfish branches) or branch specific, we tested if the overlap between the candidate genes identified in the species-specific foreground branch and in the entire lungfish clade was significantly larger than expected.Significant results are expected when the signal at the clade level is due to a lineagespecific relaxation.
The statistical significance of the differences in ω observed in different tree partitions was evaluated with a likelihood-ratio test (LTR).The resulting p-value was corrected for multiple testing with the Benjamini-Hochberg method (False Discovery Rate, FDR, Benjamini and Hochberg 1995) through the p.adjust function of R v3.6.2 (R Core Team 2019) with a significance threshold of FDR < 0.01.Finally, we calculated branchspecific ω values after concatenating the 1,790 genes, using the free-ratio model implemented in CODEML (Yang 2007).
The same dataset consisting of 1,790 genes and 15 species, as well as the same tree partition scheme (fig. 2b, main text), were considered in a second set of analyses more specifically designed for inferring relaxation of natural selection.These analyses were performed with the software RELAX implemented in the HyPhy v2.5.27 package (Wertheim et al. 2015).The method is based on BS-REL (Branch-Site Random Effects Likelihood) models, developed to improve previous branch-site models, which often generated false positives and false negatives when assumptions were violated (Kosakovsky Pond et al. 2011;Wertheim et al. 2015).The approach is based on the principle that relaxation pushes all ω categories toward neutrality (ω = 1) reducing the intensity of both positive (ω>1) and purifying (ω<1) natural selection.Similar to CODEML analysis, a priori partitions in the phylogeny are defined with a test group, consisting of the branches of interest (foreground in CODEML), compared with the reference group (background in CODEML).The method estimates a discrete distribution of ω shared by all branches of the phylogeny, defined by three classes of ω (ω1, ω2, ω3) to each of which it assigns a certain proportion of sites in the sequence (p1, p2, p3).The test constrains the distribution in the test and reference branches so that ωT = ωR k , and each class of ω in the test partition is obtained by raising the corresponding class in the reference partition to the power of k. k defines how the intensity of selection in the test branches (T) varies with respect to the reference branches (R) and always takes values greater than or equal to 0. An LRT test is then performed between a null model in which k = 1, i.e., ωT = ωR, and an alternative model in which k is a free parameter.In this work, the resulting p-value was corrected for multiple testing with the Benjamini-Hochberg method and a significance threshold of FDR < 0.01 (False Discovery Rate, FDR, Benjamini and Hochberg 1995).Similar to the CODEML free-ratio, RELAX also provides a General Descriptive Model that infers how the relative intensity of selection varies along the phylogeny, without defining a priori partitions within the tree.In this model, all branches share a single distribution of ω, but each branch has its own k parameter that intensifies (k > 1) or relaxes (k < 1) that distribution and can take values between 0 and 50 (Wertheim et al. 2015).We performed this analysis for the 1,790 genes, obtaining a branch-specific k parameter for each gene.We then calculated the proportion of putatively relaxed genes, i.e. genes with k<1 relative to the total number of genes that produced a RELAX result (fig.2a, main text).

Private NS changes
We translated the aligned coding sequences used for CODEML and RELAX analyses into their respective amino acid sequences, and we identified in each of the 15 terminal branches of the tree in fig.2a the positions where 14 lineages share the same amino acid, while a single lineage carries an alternative residue (script available at https://gitlab.com/54mu/enrichment_test).These private amino acid changes were then classified into either radical or conservative according to the Conservative-Radical Index (CRI, Sharbrough et al. 2018).Briefly, seven different amino acid classification schemes are used to evaluate rates and patterns of radical and conservative amino acid evolution.Each scheme highlights different amino acid properties that are likely to shape protein evolution, for example, amino acid charge, threedimensional structure, changes between polar and nonpolar amino acids, volume and aromaticity.The overall index of the degree of amino acid change CRI is calculated for each possible amino acid change by averaging across the seven amino acid classification schemes (radical changes assigned a value = 1.0, conservative changes assigned a value = 0.0).
1.9 GO enrichment of genes under different evolutionary rates.
Gene Ontology terms relative to each BUSCO id were retrieved from OrthoDB v.10 ( Kriventseva et al. 2019).
The enrichment tests were performed with a hypergeometric test, run on the relaxed and constrained gene sets, implemented as a python script (available at https://gitlab.com/54mu/enrichment_test)using an FDRcorrected p-value threshold < 0.05 and Observed-Expected threshold > 2 for the selection of significant results.

Sequencing, de novo transcriptome assembly and identification of non-coding transcripts
The sequencing of the five libraries from the first lungfish individual yielded a total of 197.6M raw PE reads, which were roughly reduced by 3.5% following the trimming procedure, leading to a total of 190.8M clean reads.These were subdivided among the five libraries as follows: 35.9M reads for brain, 43M reads for liver, 37.8M reads for lung, 39M reads for testis and 35.1M reads for the mixed library (table S3).The four libraries obtained from the second lungfish individual generated a total of 161.6M raw SE reads, reduced by 2.2% to 158.1M by the trimming process, for a final set of 43.2M reads for brain, 31.5Mreads for liver, 45.5M reads for lung and 37.9M reads for testis.
The de novo assembly produced a total of 289,539 contigs, with a length range comprised between 200 and 18,641 nucleotides (mean and median length = 516.33 and 290 nt, respectively, N50 = 778 nt, with 2,522 contigs longer than 5 Kb, see table S3).In spite of a high proportion of short assembled contigs, the transcriptome assembly was found to include a high fraction of transcripts with full-length CDSs and to be highly representative of the complete landscape of protein-coding mRNAs expected to be expressed in this species across a broad range of tissues and physiological conditions.Overall, 917 out of the 978 vertebrate BUSCOs (93.76%) were detected as complete, 53 (5.41%) were fragmented and just 8 (0.81%) were missing.
Complete/fragmented and complete/missing ratios were 17.30 and 114.63, respectively.
In further support of the high quality of the de novo assembled transcriptome, high mapping rates were observed for nearly all libraries.These were 85.44% (brain), 89.70% (brain), 92.51% (lung) and 90.07% (testis) in the first individual and 82.14% (brain), 92.84% (liver), 91.61% (lung) and 92.14% (testis) in the second individual.The only exception was the library obtained from mixed tissues in the first individual, where the mapping rate was 61.63%, possibly owing to a lower quality of extracted RNA.
The functional annotation associated 24,318 contigs (8.40% of the total) with Gene Ontology terms and 14,329 contigs (4.95% of the total) with Pfam conserved domains.In summary, 24,473 contigs (8.45% of the total) could be associated with a functional annotation (table S3).
Owing to this particularly low annotation rate and the high number of de novo assembled transcripts, our assessment of protein-coding potential, overlap with annotated genes in the reference N. forsteri genome and presence of repeated elements led to the identification of a total of 189,822 contigs (65.56% of the total) denoting non-coding mRNAs unrelated with TE activity.Out of these, only 1,528 (0.53% of the total) could be tentatively associated with intronic polyadenylated transcripts, even though the present availability of Illumina libraries only and the lack of a detailed map of polyadenylation sites prevented to discriminate such transcripts from fragments of aberrant mRNAs retaining introns.
These results suggest that nearly two-thirds of all assembled expressed sequences in N. forsteri (accounting for 34.78% of the complete transcriptome assembly size) may be linked with pervasive non-coding transcription.

Evaluation of transposable element activity
Although the strategy used for the identification of pervasively transcribed mRNA described above had already excluded an association with TE activity, we performed an additional analysis to further support the lack of involvement of these elements in the widespread transcription of intergenic regions.Based on the newly generated repeat library (table S4), we computed the global contribution of different classes of transposons to the transcriptional activity observed in brain, liver, lung, and testis in the two analyzed lungfish individuals.Albeit this analysis evidenced a significant variation both among different classes of transposons and between the two biological replicates (with the first individual showing a higher TE activity), TEs cumulatively only accounted for a minor fraction of global transcription in all samples, averaging 1.02% of the total.In general, the brain was the tissue characterized by the highest transposon activity, reaching 2.08% in individual 1, whereas in the majority of the other samples these values were lower than 1%, with the minimum (0.49%) recorded in the testis from individual 2 (fig.S1). .Total transcriptional effort ascribable to transposable elements, displayed as the fraction of reads mapped against the repeat library, with a breakdown of the relative contribution of the main TE subclasses (see table S4 for further details).

Impact of pervasive transcription on global transcriptional activity
In lungfish, we could detect that approximately 1.42 Gb of intergenic genomic sequence was subjected to pervasive transcription, accounting for 4.12% of the total genome assembly size, which is consistent with the identification of a very high number of contigs presumably derived from pervasive transcription in the de novo assembled transcriptome (fig.S2).This estimate takes into account the full size of the inferred primary transcripts (i.e. by including introns in the computation), whereas both the spliced size (44.53Mb) and genomic coverage (0.13%) were significantly smaller, as expected from the very large reported intron size observed in N. forsteri (Meyer et al. 2021).The comparative analysis carried out by analyzing the same phenomenon in multiple other bony fish species, as well as in axolotl, an amphibian similarly characterized by a giant genome revealed, as expected, a tendency of larger genomes to display a proportionally larger total amount of pervasively transcribed genomic regions (fig.1a, main text, table S5), (Spearman rs =0.95, P < 10-4).Nevertheless, the fraction of the genome subjected to pervasive transcription was quite uniform across all bony fish species, regardless of genome size, with the lone exception of the European grayling Thymallus thymallus, whose genome has been subjected to an additional round of whole duplication compared with other Actinopterygii approximately 80 Mya (Sävilammi et al. 2019).These fractions of pervasively transcribed regions were roughly 8 times lower than lungfish.Even though an increased amount of transcribed intergenic genomic sequence was also noted in the axolotl, this fraction was inferior by approximately 2.5 times compared with lungfish (fig.1b, main text, table S5).
These observations were further supported by the contribution provided by such regions to the total transcriptional efforts in multiple tissues, which can be interpreted as a proxy for the energy investment of an organism towards pervasive transcription.The data collected about the expression of pervasively transcribed mRNAs clearly identified the lungfish as an outlier compared to all other species, including axolotl, where no more than 6.39% of all transcriptional effort was linked with this class of mRNAs (fig.1c, main text, table S6).On average, despite the presence of a few outlier tissues and the slightly higher expression observed in T. thymallus, pervasive transcription only accounted for 3.06% (with a standard deviation of 2.34%) in bony fishes.Pervasive transcription was on the other hand highly prevalent in all lungfish tissues, with little variations between the two analyzed individuals, ranging from 24.30% in the liver of the first individual to 38.68% in the testis of the second individual.Overall, the testis was the tissue where pervasive transcription was more significant, which is consistent with the expected relaxed state of chromatin during spermatogenesis, which may increase its accessibility by the RNA polymerase II complex and accessory factors regulating transcription.Widespread intergenic transcription was less prominent in liver, whereas lung and brain showed an intermediate level of activity.

Relaxation of selection in lungfish and private nonsynonymous changes
Figure S3 shows the number of significant genes in CODEML and RELAX analysis.The first analysis was carried out with the aim of detecting any selective relaxation present in lungfish (foreground or test) relative to the rest of the species (SR1, fig.2b, main text, and fig.S3).Out of the initial 1,790 genes, CODEML identified 611 genes for which a model with two different estimated ω for the two partitions was favored, compared to the null model with a single ω estimated for the entire topology (FDR < 0.01).For 601 of these 611 genes, the foreground ω resulted greater than that of the rest of the tree.This result suggests that 33.6% of the BUSCO genes analyzed showed a faster evolutionary rate in the lungfish branches (ωfg > ωbg), indicating either relaxation or positive selection.For the remaining 10 genes (0.6%) CODEML suggested higher conservation in the lungfish branches (ωfg < ωbg; fig.S3).
We further verified whether the pattern of relaxed selection was consistent in all or most of the branches, or if different genes were under relaxed selection in different evolutionary lineages.Most of the candidate genes for relaxed selection are found when assuming a different selective regime in the lungfish clade (Lung), and not its common ancestor (LNP), compared to the rest of the phylogeny (table S7, figure S4a).
Accordingly, most of the genes under relaxed selection at the clade level are accumulating non synonymous substitutions in specific branches (see permutation tests in table S7), suggesting that relaxation was not gene specific but rather lineage specific.This reinforces the hypothesis that purifying selection is not efficient across the whole clade, resulting in independent and random accumulation of slightly deleterious mutations in each lineage.It is worth noting that these different patterns of selective regimes were identified despite the slower evolutionary rate in the lungfish branches than in the rest of the tree, affecting both nonsynonymous and synonymous sites (fig.S4b and S4c).b Number of genes that have a selective pressure in the foreground branch(es) significantly larger than in the background branches (ωfg > ωbg; FDR < 0.01) according to the branch-test in CODEML.The fraction f refers to the ratio between this number (n) and the total number of genes that have a significantly different (either larger or lower) selective pressure in the foreground branch(es) than in the background branches.
c Mean (Standard Deviation) selective pressure intensity (ω) estimated across all 1790 genes, assuming two distinct values, one for foreground and one for the background branches.
*/** = the overlap between the candidate genes identified in the specific foreground branch and in the Lung clade is significantly larger than expected (100,000 permutations; * P < 0.05, **P < 0.000001).

Figure S4a
. Candidate genes for relaxed selection in lungfish.The number of genes detected as candidates for relaxed selection (FDR < 0.01) is shown for different configurations of foreground (in colour) and background branches (in black), which were assumed to be under distinct selective pressures (ωf and ωb, respectively).A) each terminal and internal branch was tested separately as foreground branch; B) the lungfish clade was considered as foreground; C) both the lungfish clade and its ancestor were considered as foreground; C) three ω values are assumed, one for the lungfish clade (ωf1), one for its ancestor (ωf2) and one for the rest of the phylogeny (ωb); genes were classified as candidates if ωf1 > ωb and ωf2 > ωb.Branch codes are as in table S7.

Figure S4a
. Candidate genes for relaxed selection in lungfish.The number of genes detected as candidates for relaxed selection (FDR < 0.01) is shown for different configurations of foreground (in and background branches (in black), which were assumed to be under distinct selective pressures (ωfg and ωbg, respectively).A) each terminal and internal branch was tested separately as foreground branch; B) the lungfish clade was considered as foreground; C) both the lungfish clade and its ancestor were considered as foreground; C) three ω values are assumed, one for the lungfish clade (ωfg1), one for its ancestor (ωfg2) and one for the rest of the phylogeny (ωbg); genes were classified as candidates if ωfg1 > ωbg and ωfg2 > ωbg.Branch codes are as in table S7.The same 1,790 genes were analyzed with RELAX algorithm in two independent runs.After removing those genes that failed to produce an output in at least one of the two runs (n=87), 146 genes (8.9%) were identified as relaxed (k<1) and 14 genes (0.8%) under intensified selection (k>1) (fig.S3).
Comparing the results of CODEML and RELAX, 139 genes showed a consistent signal of relaxed selection in lungfish, while 6 genes consistently showed signatures of higher conservation in these branches (fig.S3, fig. 2b main text).When tetrapods instead of lungfish were compared with the rest of the tree (SR2), we observed an opposite trend.In particular, the number of relaxed genes appeared drastically reduced (CODEML ωfg > ωbg intersection RELAX k<1 n=36), and the number of constrained genes increased (CODEML ωfg < ωbg intersection RELAX k>1 n=137) (fig.S3).However, since the presence of the "relaxed" lungfish clade in the background/reference group may increase the signal of conservation in tetrapods, here representing the foreground/test, the same analyses were performed excluding the lungfish clade (SR3).In this case, a tetrapod-specific relaxation signal was detected for 105 genes (CODEML ωfg > ωbg intersection RELAX k<1), while 33 genes resulted more constrained (CODEML ωfg < ωbg intersection RELAX k>1) (fig S3).Although the number of relaxed genes in tetrapods increased when lungfish were excluded from the analysis (SR2=36 vs SR3=105), the relaxation signal remained significantly higher in lungfish than in tetrapods (chi-square test p < 0.05).Likewise, even though the estimated number of constrained genes in tetrapods decreased after removing lungfish from the analysis (SR2=137 vs SR3=33), constrained genes remained significantly underrepresented in lungfish than in tetrapods (chi-square test p < 0.001).By comparing SR3 and SR4, the effect of the giant urodele genomes in the foreground/test appeared negligible, while SR5 showed that the two urodeles alone in the foreground/test behave as the other tetrapods.Finally, SR6 showed that the lungfish maintained their relaxed signature when compared to the fish alone (fig.S3, fig.2b main text).
Under strong selection relaxation, relaxed genes are expected to be enriched for deleterious mutations, and thus relaxed phylogenetic lineages are expected to show an excess of radical amino acid changes at conserved protein positions compared to the rest of the tree.To whether this was the case for the lungfish lineages, we translated the aligned coding sequences used for CODEML and RELAX analyses into their respective amino acid sequences, and we identified, in each of the 15 terminal branches of the tree in fig.2a (main text), the positions where 14 lineages share the same amino acid, while a single lineage carries an alternative residue.These private amino acid changes were then classified into either radical or conservative according to the Conservative-Radical Index (CRI, Sharbrough et al. 2018).With the exception of P. annectens, the other four lungfish species show similar trends to the rest of the tree, suggesting that mutations with large selective coefficients are purged by purifying selection in relaxed phylogenetic lineages (fig.S5).

GO enrichment of genes under different evolutionary rates.
We found ten significantly enriched GO biological process or molecular function terms in the candidate genes for relaxation of selection suggested by both CODEML and RELAX, mostly related to protein turnover (table S8).Six genes turned out to be more constrained in lungfish than in the rest of the tree.These loci code for KIF1-binding protein (IPR022083), Mannose-6-phosphate isomerase (IPR016305), probable ATP-dependent DNA helicase HFM1 (IPR014001), ubiquitin carboxyl-terminal hydrolase 7 (IPR008974), podocin (IPR001107) and structural maintenance of chromosomes protein 5 (IPR027131).

Figure S2 .
Figure S2.Total transcripts per million (TPM) of three main transcript categories in the de novo assembly, separated by tissue.

Figure S3 .
Figure S3.Detailed results of CODEML and RELAX analyses run starting from six different tree partitions (SR1-SR6).Blue branches: background/reference; orange branches: foreground/test; grey branches: excluded from the analysis.Dataset: 1,790 orthologous genes (or subsets).Venn diagrams show the number of genes for which analysis yielded significant results for CODEML and/or RELAX.Green: relaxed genes (RELAX, k<1, solid line) and/or genes with increased ω in the foreground (CODEML, ωfg > ωbg, discontinuous line); red: genes under intensified selection (RELAX, k>1, solid line) and/or genes under purifying selection (CODEML, ωfg < ωbg, discontinuous line).

Figure S4b .Figure S4c .
Figure S4b.Phylogenetic trees with topology and branch lengths corresponding to the average dN (panel A) and dS (panel B) values estimated using the two-ratio branch model in CODEML.The topology corresponds to the unrooted tree used to estimate substitution rates across the phylogeny.The lungfish clade (in red) was set as foreground branch and the rest of the phylogeny as background branches.

Figure S5 .
Figure S5.Proportion of private amino acid changes classified as radical (CRI > 0.5) or conservative (CRI<0.5)according to the overall index of the degree of amino acid change (CRI) developed by Sharbrough et al. (2018).

Table S1 .
Raw read accession IDs used in the pervasive transcription comparative analysis.

Table S2 .
Species names and respective transcriptome resources for orthologous sequence recovery and dN/dS analyses.

Table S3 .
Summary of the sequencing and de novo transcriptome assembly metrics.

Table S4 .
number of sequences in library for each class of transposable elements.

Table S5 .
Total size of the genomic regions subjected to intergenic pervasive transcription across different species and total fraction of the genome subjected to pervasive transcription (%), based on detected primary transcripts (i.e. by including introns).

Table S6 .
Total contribution of pervasively transcribed intergenic regions to the global transcriptional effort in different tissues from multiple species.Gonads may refer either to the testis or to the ovary, depending on available data (see fig. 1c main text).

Table S7 .
Candidate genes for relaxed selection across the lungfish phylogeny.Branch (or clade) set as foreground in the branch-test of CODEML: LNP & Lung = common ancestor of Lung and Lung had two independent selective pressures (candidate relaxed genes correspond to those that were detected as having a selective pressure (ω) significantly larger in both LNP and Lung; ω mean values for the two sets of branches are reported); (LNP+Lung) = common ancestor of Lung and Lung; Lung = lungfish clade; LNP = common ancestor of Lung; Lpar = terminal branch to L. paradoxa; Nfor = terminal branch to N. forsteri; Paet = terminal branch to P. aethiopicus; Pann = terminal branch to P. annectens; Pdol = terminal branch to P. dolloi; LP = common ancestor of L. paradoxa and the Protopterus clade; P2 = (P.aethiopicus, P. annectens) ancestor; P3 = Protopterus ancestor; Lungany = in at least a branch within Lung; LNP-or-Lungany = in LNP or at least a branch within Lung; Lung-and-Lungany = in Lung and at least a branch within Lung; LNPonly = in LNP but not in Lung. a

Table S8 .
GO enriched in the 139 genes identified as relaxed by CODEML and RELAX in lungfish compared with the rest of the phylogeny (FDR < 0.05 and Obs -Exp > 2).