Molecular Traits of Long Non-protein Coding RNAs from Diverse Plant Species Show Little Evidence of Phylogenetic Relationships

Long non-coding RNAs (lncRNAs) represent a diverse class of regulatory loci with roles in development and stress responses throughout all kingdoms of life. LncRNAs, however, remain under-studied in plants compared to animal systems. To address this deficiency, we applied a machine learning prediction tool, Classifying RNA by Ensemble Machine learning Algorithm (CREMA), to analyze RNAseq data from 11 plant species chosen to represent a wide range of evolutionary histories. Transcript sequences of all expressed and/or annotated loci from plants grown in unstressed (control) conditions were assembled and input into CREMA for comparative analyses. On average, 6.4% of the plant transcripts were identified by CREMA as encoding lncRNAs. Gene annotation associated with the transcripts showed that up to 99% of all predicted lncRNAs for Solanum tuberosum and Amborella trichopoda were missing from their reference annotations whereas the reference annotation for the genetic model plant Arabidopsis thaliana contains 96% of all predicted lncRNAs for this species. Thus a reliance on reference annotations for use in lncRNA research in less well-studied plants can be impeded by the near absence of annotations associated with these regulatory transcripts. Moreover, our work using phylogenetic signal analyses suggests that molecular traits of plant lncRNAs display different evolutionary patterns than all other transcripts in plants and have molecular traits that do not follow a classic evolutionary pattern. Specifically, GC content was the only tested trait of lncRNAs with consistently significant and high phylogenetic signal, contrary to high signal in all tested molecular traits for the other transcripts in our tested plant species.

although Champigny et al. (2013) presented an additional 665 transcriptional units for which the reference genome had no annotation. Recently, Yin et al. (2018) have added to the number of novel transcripts in E. salsugineum with evidence of expression of an additional 65 transcripts, none of which are available in the reference annotation of E. salsugineum.
LncRNAs may be missing from genome annotations because they are difficult to identify due to their low, tissue-and condition-dependent expression (Derrien et al. 2012). Further, contrary to protein-coding genes and other non-coding loci, the evolution of lncRNAs is not well understood. Limited nucleotide conservation has been identified in mammalian lncRNAs (Hezroni et al. 2015), and structural conservation remains controversial (Rivas et al. 2017). Instead of using homology, distinguishing traits such as transcript length (Kapranov et al. 2007), open reading frame (ORF) (or lack of) length (Kapranov et al. 2007), GC content (Niazi and Valadkhan 2012), and number of exons in a transcript (Derrien et al. 2012) are often used in lncRNA prediction studies. Detected phylogenetic signal in traits of transcripts, rather than sequence homology, can indicate that trait values follow the expected evolutionary patterns of tested species. For example, high phylogenetic signal implies traits are more similar in closely related species, whereas low phylogenetic signal suggests the opposite: less similarity in tested traits than expected in closely related species. However, identifying which evolutionary process may be influencing a significant phylogenetic signal is complex and many different processes are associated with both high or low signal estimates (Revell et al. 2008).
Phylogenetic signal can be measured using many different indices, however three different approaches are prevalent throughout estimation methods: Brownian motion, an Ornstein-Uhlenbeck process and spatial autocorrelation. High phylogenetic signal detected by a signal estimation method that uses the theory of Brownian motion, or evolution following a random walk, can be observed in both natural selection and genetic drift scenarios (Revell et al. 2008). Conversely, low detected phylogenetic signal can be inferred as the lack of similarity in tested traits, as opposed to divergence of traits, and is common in adaptive radiation or other fast adaptive processes (Kamilar and Cooper 2013). First described with applications to evolution by Hansen (1997), the Ornstein-Uhlenbeck process allows for a random walk, similar to Brownian motion, but also for species to evolve toward an adaptive peak or fitness optimum, thus suggesting data that fit an Ornstein-Uhlenbeck process as evidence of an adaptive process. Furthermore, local estimates of Moran's I, based on the concept of spatial autocorrelation, estimate phylogenetic signal throughout evolutionary time. Positive autocorrelation indicates similarity of trait values at a given phylogenetic distance, while negative autocorrelation suggests dissimilarity at a given phylogenetic distance. Diniz-Filho (2001) has shown, however, that it is the changes in local autocorrelation over phylogenetic distance beyond a significance threshold that are important for evolutionary process inference. A trait following an Ornstein-Uhlenbeck adaptive process would have a reduced phylogenetic distance at which Moran's I changes in magnitude and crosses a threshold of no significant autocorrelation, also called a "phylogenetic patch". Thus, it is necessary to test for phylogenetic signal in traits using multiple and diverse estimation methods to infer putative evolutionary processes that may be driving significant signal.
In this study, we predicted lncRNAs from transcriptomes of 11 plant species with widely different evolutionary histories. Transcripts were assembled from RNASeq data without restriction of existing reference annotation in order to obtain a representation of all expressed loci in each study. Transcript sequences were then input into CREMA (Simopoulos et al. 2018) for accurate lncRNA prediction and ranking.
Unlike other lncRNA prediction tools, CREMA is trained only on experimentally validated lncRNA transcripts and has been tested for use on transcript sequences of multiple plant species. Following lncRNA prediction, we observed that up to 99% of predicted lncRNAs may not be present in their corresponding reference annotation. Thus, we caution that researchers should not rely only on publicly available annotation for lncRNA research. Finally, as there has been little evidence for sequence conservation in lncRNAs between species in different families (Nelson et al. 2016), a phylogenetic signal was not expected in the distinguishing molecular traits of lncRNAs, such as transcript length and GC content. However, our comparative study detected a consistently high phylogenetic signal in GC content of lncRNAs with no similarity identified in the other traits tested in these regulatory transcripts. In particular, GC content differences relative to proteincoding RNA represent a trait that could help researchers distinguish putative functional lncRNAs from non-functional and spurious transcription, or fragmented protein-coding RNAs.

Data collection
RNASeq data from multiple plant species were downloaded from the SRA database (See Table 1 for accession and SRA IDs). All plants in this analysis have a publicly available sequenced genome. All RNASeq reads except for those from E. salsugineum were downloaded from the Sequence Read Archive (SRA) (https://www.ncbi.nlm.nih.gov/sra). To be considered for analysis, plants must have been grown under control conditions without being subjected to stress. For consistency, preference was given to studies that used leaf tissue from mature plants, although use of older seedlings was accepted. Leaves or leaf-like tissue represented a shared morphological feature of photosynthetically active cells throughout all tested species and was used throughout the study to ensure an equal opportunity for capturing lncRNA expression. Additionally, only RNASeq reads from Illumina technology were considered, however both paired and single end reads were used.
E. salsugineum reads were sequenced using Illumina technology from Shandong ecotype rosette leaves grown under control, unstressed conditions as outlined in MacLeod et al. (2015). Fully-expanded leaves were used for RNA sequencing, and were collected between 8 and 10 hr into the day cycle. RNA was extracted from leaves flash-frozen using liquid nitrogen using a modified hot borate method as described by Champigny et al. (2013). A quality control analysis was completed on the RNA using RNA Nano 600 chips on a Bioanalyzer 2100 and purified using three on-column purifications by Genelute mRNA miniprep kit (Cat. No. MRN10,Sigma). Finally, preparation of cDNA for sequencing was performed with the NEBNext multiplex cDNA synthesis kit for Illumina using random hexamers (Cat. No. E7335, New England Biolabs, Ipswich, MA). Cleanup of fragmented RNA was performed with Agencourt AMPure XP Beads (Cat. No. 163987, Beckman Coulter, Mississauga, ON) following the manufacturer's protocol. Raw FASTQ files were deposited to the SRA with submission ID SRR7962298 and BioProject accession PRJNA494564.
Transcript assembly and lncRNA prediction Reads from all plant species were trimmed using Trimmomatic v0.36 (Bolger et al. 2014) and aligned to their corresponding genomes using STAR v2.5.2b (Dobin et al. 2013) with default settings other than-outFilterIntronMotifs set to RemoveNoncanonical and-alignEndsType EndtoEnd. Aligned reads were assembled into transcripts by StringTie v1.3.4d (Pertea et al. 2015). GTF files of assembled transcripts were merged with GTF files of annotated genomes and are stored on GitHub: https://github.com/caitsimop/lncRNA-compGenomics. Alignment quality was tested using gffcompare v0.10.4 (https://github.com/gpertea/ gffcompare) by comparing assembled transcript GFF files with reference genome GFF files. Alignment quality metrics were used to confirm alignment quality and transcript assembly quality using accuracy and precision values (File S2). Gffcompare output was also used to identify novel transcripts and to quantify transcript exon numbers in each RNASeq library.

Identifying lncRNAs From RNASeq data
Assembled transcript sequences were input into CREMA (https:// github.com/gbgolding/crema) (Simopoulos et al. 2018) for ranked lncRNA prediction. The number of lncRNAs in each species was calculated as a percentage of all transcripts (the sum of novel assembled transcripts and transcripts in reference annotation). The percentage of lncRNAs was used for normalization across all studied plant species to ensure appropriate comparisons to species with different sized transcriptomes.

Phylogenetic signal in lncRNA traits
Four continuous molecular traits were chosen for phylogenetic signal analysis on predicted lncRNAs: 1. Number of exons in transcript, 2. GC content of transcript, 3. Length of transcript, and 4. Length of maximal ORF. Features were extracted from transcript sequences and gffcompare outputs using a custom Python script. The phylosignal R package (Keck et al. 2016) was used to detect phylogenetic signal in lncRNAs, all other transcripts, and the differences between lncRNAs and all other transcripts for each trait in all species except for Boea hygrometrica. Separate phylogenetic signal tests were completed for each trait. Although we expect there to be correlation between transcript length and ORF length, we did not observe a correlation, particularly in lncRNAs. Phylogenetic signal of the mean value of the four traits was calculated using three separate methods: Moran's I (Moran 1948;Gittleman and Kot 1990), Blomberg's K (Blomberg et al. 2003) and Pagel's l (Pagel 1999). Local autocorrelation estimates at 100 phylogenetic distance points were also computed using Moran's I and phylosignal to identify the location and sign of the detected autocorrelation. To identify significant autocorrelation estimates, 1000 bootstrap replicates were used for 95% confidence interval calculation. Autocorrelation estimates were considered significant if 95% confidence intervals did not overlap the null hypothesis threshold of -0.111. The null hypothesis that there is no detectable phylogenetic signal, or, autocorrelation, was a threshold of 21=ðn 2 1Þ where n ¼ 10, or the number tested species, as suggested by Keck et al. (2016). Because branch lengths were required by the phylosignal package, branch lengths were estimated using the dnaml program in PHYLIP (Felsenstein 1993) and a MAFFT v.7.205 (Katoh et al. 2002) alignment of rps16, atp2, 18s, 26s and SMC1 (File S4). B. hygrometrica was not included in phylogenetic signal analysis due to the limited percentage of annotated loci in genome annotation. The tree topology of land plants as reported by the Amborella Genome Project (2013) was used, and branch lengths were estimated from this topology. Branch lengths representing site changes were converted to relative age of branches using the R package ape (Paradis and Schliep 2019). A lambda value of 0 was chosen from 0, 0.1 and 1 after testing for the lowest log likelihood of lambda options.
Trait values with high K values (K .1) were chosen for further testing for better fit to models that consider an Ornstein-Uhlenbeck process and may indicate selection on traits. Traits values were fit with macroevolutionary models using the geiger ) R package and the fitContinuous function. Both "BM" (Brownian motion) and "OU" (Ornstein-Uhlenbeck) models were considered. Log likelihood estimates were used to test for the goodness of fit for "BM" and "OU" models while also considering the number of parameters that each model contains.

Data Availability
Raw FASTQ RNA sequencing data are available in the SRA with submission ID SRR7962298 and BioProject accession PRJNA494564. Ranked lncRNA prediction scores and GFF files of assembled transcripts are available on the author's GitHub https://github.com/caitsimop/ lncRNA-compGenomics. Classifications of lncRNAs made by gffcompare are made available in File S1. Quality of transcriptome assemblies are available in File S2. The FASTA file containing sequences of the genes used in the estimation of branch lengths is available in File S3. The phylogenetic tree with branch lengths adjusted relative to time is found in Figure S4. Supplemental material available at FigShare: https:// doi.org/10.25387/g3.8250953.

Multispecies lncRNA prediction
We chose plant species with diverse and divergent evolutionary histories for lncRNA prediction comparisons. Specifically, we included Angiosperms (both monocots and dicots), a Lycophyte, a Bryophyte and an algal species for this work (see Table 1 in Materials and Methods).
As novel transcripts were of importance to this work, RNASeq data from published experiments were used when available to assemble transcripts or were prepared de novo as in the case of one species (E. salsugineum). To be chosen for this study, data from the SRA database had to include samples produced with Illumina sequencing technology. Species chosen must have publicly available reference genome sequences and corresponding annotation. For consistency, leaf tissue from mature plants was preferred but younger leaves, or entire organisms in the case of Chlamydomonas reinhardtii, were used. To remove any potential biases toward transcripts induced by stress, only control, unstressed samples were chosen for analyses. After read mapping to appropriate plant genomes, transcripts were assembled using StringTie allowing for identification of novel transcripts. Sequences of assembled transcripts were input into CREMA, a lncRNA prediction tool (Simopoulos et al. 2018) and total lncRNA numbers in each plant species are described in Figure 1 and Table 2. Ranked prediction scores of all transcripts in each species are available on GitHub: https:// github.com/caitsimop/lncRNA-compGenomics. The percentage of total transcripts predicted as lncRNAs range from 3% in E. salsugineum to 16.6% in Amborella trichopoda with a mean percentage of 6.4% 61:1% for the 11 analyzed plant species (Table 2). On average, 52% of the predicted lncRNAs were found in intergenic regions leaving almost half of the predicted lncRNAs overlapping with a protein coding gene or as putative splicing variants (File S1).
To determine how reference annotation may affect lncRNA research in plant systems, all assembled transcripts from each species, including novel transcripts, were compared to those found in the corresponding reference annotation. Transcripts that were not found in reference annotation and also predicted as a putative lncRNA were identified and are referred to as "novel" lncRNAs throughout this manuscript. The proportion of novel lncRNA in all predicted lncRNAs ranged among species from a low 4.5% predicted in A. thaliana and a high 99.6% in Solanum tuberosum (Figure 1). Because A. thaliana is a well studied model plant with an almost fully annotated genome, we expected this species to have fewer novel transcripts assembled from the RNASeq data. Additionally, we expected that most lncRNAs predicted from the A. thaliana transcriptome to already be found in the reference annotation. The high percentage of lncRNAs found in the reference annotation of A. thaliana indicates that CREMA makes accurate predictions and suggests that the lower percentages of known lncRNAs identified in the other species are due to incomplete annotations ( Figure 1).

Phylogenetic signal in molecular traits of plant lncRNAs
We first considered overall trends in some typical distinguishing traits of lncRNAs: ORF length, GC content, number of exons, and transcript length. All species showed a similar trend where putative lncRNAs had a lower GC%, fewer exons, and shorter ORF length compared to the other transcripts in their corresponding transcriptomes (Figure 2). Length of transcripts, however, deviated from this trend where Zea mays, Selaginella moellendorffi and A. trichopoda all have putative lncRNAs longer than other transcripts in their transcriptome (Figure 2). Deviation from the expected trend of shorter lncRNA transcripts in three of the selected species suggests that transcript length may not be a useful distinguishing trait in lncRNA prediction.
We tested for phylogenetic signal in mean trait values of the four molecular traits previously mentioned in both lncRNAs and all transcripts other than lncRNAs. Phylogenetic signal estimates were calculated using three different methods that employ two different models of evolution, Moran's I, Pagel's l and Blomberg's K. We estimated phylogenetic signal in all species but B. hygrometrica due to the incomplete status of its genome annotation.
Since each phylogenetic signal estimation method is based on different concepts, all estimates cannot be interpreted the same. First, Moran's I is a measure of autocorrelation (Gittleman and Kot 1990). Autocorrelation, when referring to phylogenetic signal, indicates how correlated traits are in terms of phylogenetic distance. Due to the use of 10 species, Moran's I must be compared to a calculated threshold of -0.111 (Keck et al. 2016) to determine significant autocorrelation and phylogenetic signal. A significant estimate greater than -0.111 indicates positive significant global autocorrelation and that trait values of closely related species are more similar to each other. Conversely, a significant estimate less than -0.111 suggests global significant negative autocorrelation. The Brownian motion model, originally used to describe the motion of particles suspended in fluid, is another model used to describe how traits evolve through time. In the case of phylogenetic signal, a trait following the Brownian motion model exhibits a random walk where the value of the trait can change in any direction at any time. Pagel's l uses this Brownian motion model and can be interpreted as the transformation the phylogeny requires to explain trait distribution if the trait followed Brownian motion (Pagel 1999). Thus, a value of 1 would indicate a phylogeny as expected under Brownian motion and high phylogenetic signal, and a significant value of 0 would mean a trait distribution that does not follow Brownian motion, consequently, low phylogenetic signal. Finally, Blomberg's K, which also uses the Brownian motion model, can be interpreted as the ratio of observed values over expected values if the trait follows the Brownian motion model (Blomberg et al. 2003). A value of K = 1 can be interpreted as a trait distribution following Brownian motion, and as K becomes larger than 1, a stronger signal is detected. Conversely a value of K , 1 indicates low phylogenetic signal, and less similarity between closely related tested species. Table 3 shows phylogenetic signal estimates using all three methods for each of the four molecular traits. The mean trait values and phylogenetic relationships of the tested species are presented in Figure 2. In predicted lncRNAs, GC content was the only trait that demonstrates high phylogenetic signal using all phylogenetic signal detection methods (Table 3). While ORF length was had a significant positive global autocorrelation (I = 0.040; Table 3), a value of K , 1 indicates less similarity than expected under Brownian motion, suggesting unclear phylogenetic signal estimation. Blomberg's K also suggests less similarity than expected in the number of exons of lncRNAs, however no other method indicated detectable significant phylogenetic signal. Transcript lengths of lncRNAs in tested species also demonstrate a moderate positive global autocorrelation. Conversely, all four traits consistently had significant phylogenetic signal estimates when all transcripts other than lncRNAs were evaluated, although l for ORF length, number of exons and transcript length were slightly less than 1.

Evolutionary processes and phylogenetic signal
We examined traits with an estimated K . 1 with an evolutionary model that may suggest natural selection, the Ornstein-Uhlenbeck process (Hansen 1997), because high phylogenetic signal defined as K . 1 (Kamilar and Cooper 2013) can indicate similarity by both genetic drift and natural selection. In lncRNAs, GC content is the only trait with K . 1, and has significant phylogenetic signal detected using all three methods (Table 3). We compared the fit of a Brownian motion model vs. an Ornstein-Uhlenbeck model in our data using log likelihood values and a chi square test for significance estimates. Although the Ornstein-Uhlenbeck model had the smallest log-likelihood, a chi square test indicated that there was no significant fit difference when comparing the Brownian motion and Ornstein-Uhlenbeck model (P = 0.81). Because the Brownian motion model has the least number of parameters, this suggests that a Brownian motion model is the most reasonable fit for the data, and there is a lack of evidence for an adaptive process.
All four traits of all transcripts other than lncRNAs had significant high phylogenetic signal when estimated using Blomberg's K (K .1), therefore we also tested for a better fit explained by the Ornstein-Uhlenbeck process. Again, the Ornstein-Uhlenbeck process was not a significantly better fit than a Brownian motion model when considering log likelihood values and a chi-square test (ORF length: P = 1, GC content: P = 1, number of exons: P = 1, transcript length: P = 0.75).
We tested for local autocorrelation at 100 phylogenetic distances considering the most recent common ancestors of all species as a robust approach to an analysis on a small phylogeny. Confidence intervals were computed using 1000 bootstrapping replicates for a non-parametric significance estimate using a calculated threshold of -0.111. Figure 3 visualizes the local correlations of traits in both lncRNAs and all other transcripts and are limited to the phylogenetic distances of the tested phylogeny (0-1 phylogenetic distance). We detected significant positive local autocorrelation at short phylogenetic distances in ORF length and GC content of lncRNAs ( Figure 3). This suggests that closely related species contain lncRNAs with similar ORF lengths and GC content. There was no consistent significant autocorrelation at any short phylogenetic distances in any tested traits in all other transcripts (Figure 3). Detected phylogenetic patches are shorter in the ORF and transcript lengths of lncRNAs compared to all other transripts. The opposite is true in the GC content of lncRNAs, where longer phylogenetic patches are observed. Shorter phylogenetic patches suggest an adaptive process as described by an Ornstein-Uhlenbeck model, rather than genetic drift.

DISCUSSION
We used raw RNASeq data from multiple independent studies to make inferences on the numbers of predicted lncRNAs in 11 phylogenetically divergent plant species, and to identify putative phylogenetic signal in these regulatory loci. Our data-mining approach enabled us to use the same protocols for read mapping, transcript assembly, and lncRNA prediction for each species. In performing the same read-mapping and lncRNA prediction protocols, we were able to address a concern raised by Kapusta and Feschotte (2014) that comparisons between lncRNA numbers in animals can be misleading when prediction numbers are products of meta-analyses involving different prediction methods and lncRNA criteria. We found that the percentage of transcripts predicted Figure 1 Total predicted lncRNAs from 10 plant species. The counts of putative lncRNAs are categorized by transcripts that appear in the reference annotation of each species (purple) and novel transcripts, or those that did not appear in transcriptome annotation (coral). The percentages of novel transcripts (coral) predicted as lncRNAs appear above each bar.
n However, there is a lack of evidence for function in these thousands of transcripts as they were identified by a transcript filtering method and the majority remain without experimental validation. In particular, NONCODE v5 has predicted function in only 1961 of their identified lncRNAs. Nonetheless, due to the methods used regarding read mapping, our results rely on the genome quality of each species. Fragmented genomes, as seen in A. trichopoda, E. salsugineum and S. moellendorffi, may have limited our predictive abilities, and low predicted lncRNA numbers may in part be due to data availability. The review by Kapusta and Feschotte (2014) also included a metaanalysis describing variation in predicted lncRNA numbers among multiple animal species, a comparison similar to our observed prediction numbers in plants. In addition to their concern about transcriptome data arising from different methodologies, Kapusta and Feschotte (2014) also raised the issue of temporal and location specific lncRNA expression. We share a comparable concern in that plant lncRNAs have yet to be predicted in all tissue types for all developmental time points in all possible environments, so undoubtedly the number of putative lncRNAs detected in plants will increase over time. For example, in our study, we identified 2918 putative lncRNAs in A. thaliana plants that were grown under conditions designed to avoid exposing plants to sources of stress. In contrast, although using different prediction methods, Zhao et al. (2018) identified 6150 putative lncRNAs in A. thaliana plants undergoing cold, ABA and drought treatments. This difference in predicted lncRNAs is consistent with the expectation that lncRNAs likely play a role in stress responses and hence one expects to find increased diversity and abundance of lncRNAs in stressed relative to unstressed plants. Interestingly, Zhao et al. (2018) found that lncRNAs in A. thaliana are shorter and have fewer exons than all other transcripts, observations that agree with our study (Figure 2). Thus our machine learning-based methodology that was trained on only empirically characterized, functional lncRNAs and the filtering method employed by Zhao et al. (2018) led to similar conclusions on traits shared by lncRNAs that distinguish them from other transcripts. This finding is also consistent with our previous, cross-validation estimates of CREMA's accuracy (96%) and specificity (0.994) (Simopoulos et al. 2018) making CREMA a suitable method for lncRNA prediction from RNASeq data.
The genome of A. trichopoda, the sister taxon to all other extant angiosperms, represents a species with a unique evolutionary history. During genome annotation, The Amborella Genome Project (2013) observed a larger number of the atypical 23 to 24nt plant miRNAs than expected as they were found in two times greater frequency than any other land plant. Additionally, eight predicted miRNA families in A. trichopoda have evidence of loss in more recent angiosperms (Amborella Genome Project 2013). The excess of miRNAs in A. trichopoda may reflect the high proportion of lncRNAs predicted in this study (at 16%; Table 2) as miRNA progenitors are considered to be lncRNAs (Saini et al. 2008).
Two plants, namely E. salsugineum and B. hygrometrica, were found to have the lowest proportion of lncRNAs in their transcriptomes (Table 2). E. salsugineum represents a plant with a halophytic life strategy and a capacity to tolerate a variety of extreme environmental conditions (Kazachkova et al. 2018). Indeed, E. salsugineum has Figure 2 Mean trait values of transcripts predicted as lncRNAs (coral, circle) and all other assembled transcripts (purple, triangle). Species are ordered as per phylogenetic relationships.
n P , 0.05. I = Moran's I, K = Blomberg's K, l = Pagel's l a Where "ORF" refers to ORF length," GC%" refers to GC content, "Exons" refers to number of exons and "Length" refers to transcript length.
been used as a model plant in stress response studies due to its naturally high tolerance to abiotic stresses such as salt (Taji et al. 2004), cold (Griffith et al. 2007), drought (MacLeod et al. 2015), and nutritional deficiencies (Velasco et al. 2016). Moreover, E. salsugineum shows constitutive expression of genes reported to be stressresponsive in many plants (Taji et al. 2004;Gong et al. 2005;Wong et al. 2006;Velasco et al. 2016). B. hygrometrica, aptly named "the resurrection plant", is also considered an extremophile by virtue of its capacity to survive desiccation (Xiao et al. 2015). However, B. hygrometrica shows different expression pattern changes when experiencing stress compared to E. salsugineum. Zhu et al. (2015) did not observe constitutively high expression of stress tolerance genes in B. hygrometrica during desiccation. Instead, B. hygrometrica seemed to require gradual dehydration priming for survival after rehydration post-desiccation (Zhu et al. 2015). B. hygrometrica plants that have been subsequently rehydrated after this dehydration "training" have expression patterns more similar to desiccated plants than those without drought priming (Zhu et al. 2015). In other words, after experiencing a first gradual dehydration there are expression differences between well-watered B. hygrometrica plants and ones that experienced desiccation. The observation that B. hygrometrica plants can show "preparedness" among expressed genes normally responsive to a stressful condition is somewhat analogous to the constitutive nature of expressed genes in E. salsugineum. Specifically, E. salsugineum plants grown in the absence of high salt display the expression of genes reported to be salt-responsive in other plants (Taji et al. 2004;Wong et al. 2006). Interestingly, both E. salsugineum and B. hygrometrica display a low proportion of predicted lncRNAs in their unstressed transcriptomes suggesting a possible connection between high natural stress tolerance and low lncRNA number. Conceivably, with stress-related genes constantly expressed under a primed condition, a plant adapted to an extreme environment may not require the precise regulation conferred by the recruitment of diverse lncRNAs, an important role proposed for the function of lncRNAs in plant stress responses. This hypothesis merits further work as little is known of the importance, or lack thereof, of lncRNAs in extremophile species.
E. salsugineum and A. trichopoda have distinct evolutionary patterns and yet both species have few putative lncRNAs present in their reference annotations. In total, five of the ten tested plant species had less than 50% of predicted lncRNAs in their respective genome annotations. As genome annotation often relies on homology of predicted genes for functional annotation (Bolger et al. 2018), particularly homology to A. thaliana protein-coding genes, lncRNAs can often be left out of genome annotation. Researchers studying plant lncRNAs frequently rely on bioinformatic analyses to assemble novel transcripts for lncRNA prediction (Liu et al. 2018;Shuai et al. 2014), indicating that missing lncRNA annotation should be taken into consideration in forthcoming genome annotation projects. Similar to our work, Jackson et al. (2018) recently described misannotation of lncRNAs in mammalian genomes. Gaps in annotation and the ensuing problem with lncRNA identification is exacerbated by the fact that lncRNAs do not follow classic evolutionary conservation. Instead, in both plant and animal systems, lncRNAs often depict a positional conservation pattern, even with minimal sequence conservation, making functional predictions also difficult (Hezroni et al. 2015;Mohammadin et al. 2015). A lack of extensive lncRNA conservation between species led to our investigation into phylogenetic signal detection in molecular traits of lncRNAs.
In plant species, lncRNA homology has been shown to be virtually non-existent outside of the family classification. Only 1% of predicted A. thaliana (Brassicaceae) lncRNAs were identified as homologous in Tarenaya hassleriana (Cleomaceae) (Nelson et al. 2016). Interestingly, although distantly related, human and plant lncRNAs bear similarities in number of exons, and transcript and ORF length. For example, Derrien et al. (2012) describe human lncRNAs as being shorter than protein coding genes, and by commonly having fewer exons which was also observed in our plant species analyses ( Figure 2). However, human lncRNAs are typically spliced and most often have two exons, while our study suggests plant lncRNAs are more often unspliced and comprised of a single exon (Figure 2). Domination of single exon novel lncRNAs in our work may be indication of genomic contamination in source data or bioinformatic transcript assembly artifacts. Nevertheless, Figure 3 Moran's I local correlogram of mean trait values in lncRNAs and All Other Transcripts. Coral points indicate significant phylogenetic signal at a particular phylogenetic distance. The horizontal line represents a value of the null hypothesis that no phylogenetic signal is detected. The null hypothesis value is -0.111, or 21=ðn 2 1Þ where n ¼ 10, or the number of tested species. The 95% confidence intervals, computed using bootstrapping, are also plotted and were used to identify significant values. genomic contamination is unlikely seen in all ten independent experiments, and transcript assembly artifacts more likely result from de novo assembly rather than genome guided mapping. Further, Haerty and Ponting (2015) also observed a lower GC content in lncRNAs than the protein coding genes of metazoan lncRNAs, with singleexon lncRNAs having the lowest GC content of all exon-types, reminiscent of the uni-exonic plant lncRNA majority of our analyses. However, the extent to which conserved traits typify plant and animal lncRNAs is difficult to assess at present. CREMA is trained on only validated lncRNA and may be subject to a prediction bias to animallike lncRNA sequences given that non-plant sources currently comprise the majority of validated lncRNAs (Simopoulos et al. 2018). As more plant lncRNA undergo validation, the extent of conservation among lncRNAs from diverse organisms will be easier to detect and estimate with greater precision.
Despite these concerns over bias, among the lncRNAs predicted by CREMA we found significant phylogenetic signal by at least one method in all four tested traits of lncRNAs: ORF length, GC content, the number of exons and transcript length (Table 3). High phylogenetic signal detection in traits of lncRNAs was not expected given the lack of evidence for sequence conservation in this class of RNA (Nelson et al. 2016;Hezroni et al. 2015). Importantly, because the phylogenetic relationships of species are influencing the tested traits as indicated by detectable phylogenetic signal, the species can no longer be considered independent and statistical methods that assume independence cannot be used. While it is possible to detect phylogenetic signal in our data, it is difficult to infer evolutionary processes from phylogenetic signal estimates as many unique processes can invoke a similar signal estimation (Revell et al. 2008). However, high phylogenetic signal and consequent high similarity of GC content in lncRNAs of closely related species is not unexpected, as Haerty and Ponting (2015) demonstrated evidence of selection on the GC content of intergenic lncRNAs in animal species. Additionally, as previously mentioned, the GC content of animal lncRNAs is lower than that of protein-coding genes, mirroring what our work identified in plant lncRNAs.
A lack of similarity in ORF length, number of exons, and transcript length of lncRNAs in close relatives may be due to a variety of processes, including but not limited to: stabilizing selection with high selective strength, selection with variable strength that is bounded by phenotypic limits, punctuated divergent selection, or genetic drift of which rate of drift began low and increased toward present time (Revell et al. 2008). Because of the variety of possible complex interpretations of phylogenetic signal and process, Revell et al. (2008) do not recommend over-interpretting evolutionary processes from signal data. We have found, however, unique patterns in the phylogenetic signals of molecular traits of lncRNAs compared to all other transcripts in plant species that imply lncRNAs are not following similar evolutionary trends as most other transcripts. Moreover, the lack of similarity in three of the four tested molecular traits in lncRNAs is of interest and this observation implies that evolution of lncRNAs could be species specific, and is not easily defined by an over-arching evolutionary process. On the other hand, it is possible that there are subclasses of lncRNAs with conserved molecular traits yet to be defined due to a lack of validated transcripts.
Because CREMA (Simopoulos et al. 2018) predicts lncRNAs using a complex ensemble machine learning model that initially uses ORF length, GC content and transcript length as features for transcript classification, it is possible that high detected phylogenetic signal in these features is a product of the lncRNA prediction tool. However, CRE-MA's logistic regression ensemble classifier that is used for the final lncRNA prediction does not use molecular traits as prediction features, but instead binary outputs from eight gradient boosting models. Additionally, we have identified low phylogenetic signal in two of these three molecular traits (Table 3) suggesting that CREMA is able to predict lncRNAs with varying ORF and transcript lengths.
In this work we show that the annotation status of plant species can affect lncRNAs prediction with up to 99% of predicted lncRNAs missing from reference annotation. While researchers may be striving to increase the volume of lncRNA research, the effort to annotate genomes with lncRNAs is not reflective of the increased interest in this RNA class. As such, we caution researchers interested in these regulatory loci to be wary of relying solely upon genome and transcriptome annotations for lncRNA identification. Additionally, our work shows that plant lncRNAs have inconsistent detectable phylogenetic signal in sequence traits, further confirming the complex evolutionary history of lncRNAs. In particular, the differences in detected phylogenetic signal in lncRNAs compared to all other transcripts suggests that lncRNAs evolve, on average, differently than other loci.