RNA element discovery from germ cell to blastocyst

Abstract Recent studies have shown that tissue-specific transcriptomes contain multiple types of RNAs that are transcribed from intronic and intergenic sequences. The current study presents a tool for the discovery of transcribed, unannotated sequence elements from RNA-seq libraries. This RNA Element (RE) discovery algorithm (REDa) was applied to a spectrum of tissues and cells representing germline, embryonic, and somatic tissues and examined as a function of differentiation through the first set of cell divisions of human development. This highlighted extensive transcription throughout the genome, yielding previously unidentified human spermatogenic RNAs. Both exonic and novel X-chromosome REs were subject to robust meiotic sex chromosome inactivation, although an extensive de-repression occurred in the post-meiotic stages of spermatogenesis. Surprisingly, 2.4% of the 10,395 X chromosome exonic REs were present in mature sperm. Transcribed genomic repetitive sequences, including simple centromeric repeats, HERVE and HSAT1, were also shown to be associated with RE expression during spermatogenesis. These results suggest that pervasive intergenic repetitive sequence expression during human spermatogenesis may play a role in regulating chromatin dynamics. Repetitive REs switching repeat classes during differentiation upon fertilization and embryonic genome activation was evident.


INTRODUCTION
Expression profiles of known RNAs have been catalogued for a range of cell types, with the use of expression arrays and, more recently though RNA deep-sequencing studies. This has yielded a series of useful databases including GTEx (https://www.gtexportal.org/home/), EMBL-EBI's Expression Atlas (https://www.ebi.ac.uk/gxa/home/), The Human Protein Atlas (https://www.proteinatlas.org/), and ENCODE (www.encodeproject.org) (1)(2)(3)(4)(5)(6). These databases and RNA-seq studies generally focus on annotated genes and transcript variants that are derived from transcript modeling programs such as Cufflinks (7) and are provided as part of the Refseq and Gencode annotations (8,9).
Both coding and non-coding RNAs play major roles in all cellular processes. In addition to protein-coding RNAs, at present, there are 48 different non-coding and pseudogene classes of RNA documented in the version 27 annotation of the Human Gencode. Approximately 40% of the annotated genes in Gencode correspond to long and short non-coding RNA genes (10). Non-coding intergenic regions are known to contain regulatory RNAs. These include long intergenic non-protein coding RNA (lincRNA), enhancer RNA (eRNA), piwi-interacting RNA (piRNA) and circular RNAs, with others just beginning to be described (11)(12)(13)(14). The human transcriptome is likely to be more complex than even these annotations indicate, as an estimated three quarters of the human genome is transcribed (15). This would include novel tissue-specific RNAs, whose roles remain to be established (16).
The palette of RNAs appear enriched in certain specific tissues, with each providing a specialized function, e.g., brain--cognitive and functional system level control, and germline--stem cell--defining development (17)(18)(19). Their corresponding complexity is exemplified in the testis by the collection of unique structural and functional spermatozoal-specific transcript variants (20) that are observed during maturation, as sperm assume their unique shape. This culminates with the compaction of the sperm nucleus to a transcriptionally and translationally inert structure. The latter is ensured by fragmenting rRNAs (21), as well as others and completes with the expulsion of the majority of the cytoplasm. In addition to the paternal genome and sperm encapsulated RNAs (22), RNA/proteins and other molecules from distant tissues acquired during epidydimal transit (23,24) are delivered at fertilization. This provides a pathway for soma-to-germline transmission (22,25,26) that perhaps conveys signals echoing how other tissues have responded to the environment (reviewed in (27)).
We have previously shown that unannotated transcripts corresponding to intronic and intergenic regions of the spermatogenic genome are comparatively abundant in human sperm (20,(28)(29)(30). They vary amongst species and in response to and can provide markers of disease (30)(31)(32). These observations drove the development of this algorithm to systematically identify the genomic locations of RNAs, defined as RNA elements (RE), i.e., regions transcribed throughout the genome. This unbiased analysis tool is not limited to those RNAs currently defined in the databases, as it does not seek to generate gene structures from REs. It is compatible with a range of Next Generation Sequencing (NGS) platforms, RNAs from varied sources, abundance, quality, and levels of fragmentation, i.e., FFPE-like RNAs. The algorithm only requires the BAM file of genomic alignments to detect transcribed regions of novel loci in conjunction with well-known annotated loci.
In the current study, the RE discovery algorithm was applied from the perspective of the human male germ cell to blastocyst paradigm. A series of spermatogenesis and embryogenesis pattern specific intergenic human REs were identified, indicating that the transcriptome extends wellbeyond the annotated genes, including those delivered at fertilization. Tissue-specific REs comprised of intronic and intergenic REs were uncovered and, in some cases, exon boundaries extended. Transcribed genomic repetitive sequences, such as simple repeats, HERVE, and HSAT1, were shown to be associated with RE expression during spermatogenesis, and may play a developmental stage specific role. Similarly, in the human embryo, MER73 was associated with RE transcription at the minor wave of zygotic genome activation and MLT2A1 and SVA-D expressed through the major wave during the transition to the embryonic genome. Overall, this study provides a deeper understanding of the dynamic transcriptome of human sperm, as well as uncovering the possible role of specific repetitive sequences in the spermatogenesis.

RE discovery
The current study used Gencode release 26 (for GRCh38) and the GRCh38 genome for RE discovery, which is detailed in Supplemental Appendix A. RNA-seq samples used in RE discovery are described in Supplementary Table S1. Sample reads were pre-processed prior to RE discovery with Trimmomatic version 0.36, trimming Illumina adaptors and poly(A + ) sequences, where appropriate, with parameters 'LEADING:3 TRAILING:3 SLIDINGWIN-DOW:4:15'. Reads were aligned to the GRCh38 genome using HISAT2 (version 2.0.6), using the parameters '-p10 -max-seeds 30 -k 2'. Read coverage was provided to the RE discovery tool in bigwig format, generated by converting BAM files to bedgraph format, using the bedtools tool genomeCoverageBed, with the parameters '-split -bg', and subsequently bigwig format, using the bedGraphToBigWig program (available from the UCSC Genome Browser utilities). The threshold parameter for RE discovery was set to 2.5 reads per million, to minimize their contribution of background noise. Novel REs from each study were combined using custom R commands, which merged overlap-ping novel REs, re-annotated the merged REs, and added the merged REs to the exonic REs, to produce a collective set of REs. The collective set of REs for the different samples is given in Supplementary Table S2 and was subsequently used in all analyses. For comparison of RE discovery to established transcript-building software, Cufflinks (v2.2.1) and Stringtie (v1.3.4) were used on the same preprocessed reads previously used for RE discovery (7,33). Default parameters for both Cufflinks and Stringtie were employed, using Gencode release 26 (for GRCh38) as the reference annotation.

Differential expression (LMEM, fold change, LM)
A linear mixed-effects model (LMEM) was used to calculate differential expression between poly(A + ) and total RNA libraries from oocyte through early embryonic development (34)(35)(36). The LMEM was used with a random slope and intercept for each cell type, to consider heterogeneity across cell types [formula = RPKM ∼ RNA.type + (1 + RNA.type | Tissue.simple)]. Residuals of randomly selected REs were analyzed for homoscedasticity, ensuring that the assumptions of the LMEM were satisfied. Multiple testing correction was applied to P-values for resultant slopes, using Benjamini-Hochberg correction (37).
Differential expression of poly(A + ) and total RNA libraries in sperm and testis tissue was determined using a fold-change (fold change = log 2 mean(Total RNA) mean(poly(A + )) ). The use of an expression ratio, rather than linear modeling, was necessary due to the technical differences between the total RNA sperm samples (30) and the three poly(A + ) sperm libraries (38), as well as the absence of multiple independent total RNA testis samples (20).

RE enrichment for repetitive sequences
In cases when median RE RPKM in spermatozoa exceeded 1 RPKM (thus removing REs with low coverage in most samples), peak RE RPKM was 25 RPKM and subsequently used as an expression threshold (Supplementary Figure S1B). REs were first assigned as 'Expressed' if the median RPKM for the cell/tissue type was >25 RPKM. The enrichment or depletion of repetitive sequences in the expressed REs was calculated using UCSC's Repeatmasker track (for GRCh38), a hypergeometric test and custom R code. The proportion of each genomic repeat in all available REs was used as input probability, with the number of expressed REs for the given cell type used as the sample size. The probability of drawing the actual number of expressed REs overlapping the given repeat type was adjusted using a Bonferroni correction (39). To identify repeats of interest, significantly enriched or depleted repeats were additionally filtered to remove repeats with minimal over-or under-enrichment. Thus, only repeats whose difference between the expected and observed RE count was >10 REs were retained.

Expression clustering
Expression patterns across spermatogenic cell types were identified using the R package Mfuzz (30,(40)(41)(42). A total Nucleic Acids Research, 2019, Vol. 47, No. 5 2265 of 20 cluster patterns were generated per analysis, with a minimal membership value of 0.7 required to assign a RE to a given cluster.

FDR calculation for GTEx and PPV calculation for sperm
The accuracy of the RE discovery algorithm to identify expressed loci was calculated using the Jodar et al. dataset, which consisted of seven fertile human sperm samples, prepared using total RNA (30). RE discovery was performed on the seven samples, at a range of from 1 to 10 RPM, at increments of 0.5 RPM. The RPKM of the resulting novel REs for each sample was calculated, along with the median RPKM across the seven samples. Experimental thresholds for calling a RE as 'expressed' ranged from 1 to 200 RPKM, at increments of 1 RPKM. At each expression threshold, the number of REs with a median RPKM at or exceeding the threshold were recorded. The positive predictive value (PPV) at each expression threshold and was calculated as PPV = Novel REs > Expression threshold Novel REs > Expression threshold + Novel REs ≤ Expression Threshold The ability of the RE approach to recapitulate tissue expression in the established databases was determined using the testis expression in the GTEx database (3). The median TPM for all GTEx testis samples was downloaded and was processed to replace duplicated common gene names with the mean TPM for all instances of the gene name. Only gene names found in both GTEx and exonic REs were used in subsequent intersect analysis. The unique gene names with expression exceeding 5 TPM were compared to those of the exonic REs exceeding 25 RPKM.

RE identification and classification
The RE discovery algorithm was designed to detect expressed regions of the genome using RNA-seq, regardless of the sequencing platform or read structure. A detailed description of RE discovery is presented in Supplementary Appendix A, and the corresponding code is provided online (https://github.com/mestill7/RE discovery). Briefly, the known gene annotation (e.g. RefSeq, Ensembl, The trimmed reads are then aligned to the genome, and the duplicate alignments removed. Read coverage is then used to identify 10 bp segments with read coverage surpassing . The expressed 10 bp segments (shown in blue) are merged and annotated according to their adjacency to exons (shown in red). (B) RE discovery workflow for theoretical RNA-seq samples. Each sample has a different library size, and correspondingly, different read coverage thresholds at a of 2.5 reads per million (RPM). Non-exonic regions of read coverage surpassing the assigned threshold are deemed 'Novel REs'. Merging novel REs from the four different samples, yields two novel REs, one from Sample 1 (S1) and one from Sample 3 (S3) that are separated by up to 150 bp. Novel REs of the different samples S1 and S3 are merged into a final RNA element, represented in purple. Exonic REs are excluded from this merging step. The final novel RE set for the four samples is then annotated as Intronic, Near-Exon, purple, (<10 kb from exon), and Orphan REs (>10 kb from exon).
Gencode) for the genome of interest is parsed into individual exon locations. In the current study, Gencode release 26 (GRCh38) was used, with non-coding entries considered as annotated 'exons' (10). As summarized in Figure  1A, RE discovery first requires the sequenced reads to be processed, e.g. adaptors trimmed and low-quality bases removed, prior to alignment to the genome of interest. For the unannotated regions of the genome, the mean read coverage was calculated for each 10 bp genomic segment and the 10 bp segments with sufficient read coverage, determined by a threshold , retained. For the purposes of this study, = 2.5 reads per million provided well-balanced signal to noise ratio (Supplemental Supplementary Figure S2) that was suited for RNA libraries generated from low-input, potentially fragmented RNAs, as is often found in clinical formalin-fixed paraffin-embedded (FFPE) samples and spermatozoa (20,43). The overlapping 10 bp regions were subsequently merged to yield the final novel REs for each collection of samples studied. The merging steps allow for a maximum of 150 bp between element bins, intended to allow for gaps in coverage caused by sequencing bias and/or biological fragmentation. Novel REs were then annotated according to their genomic position, relative to known exons ( Figure 1B). 'Intronic' REs were located within introns, while any nonintronic REs located within 10 kb of an annotated exon were designated 'Near Exon' REs; 'Orphan' REs were at a distance greater than 10 kb from any known exon. An exonic RE was extended into a near exon RE if they were within 50 bp and the difference in read coverage was <50%. As summarized in Figure 2, previously published RNA-seq studies representative of human spermatogenesis, mature sperm, oocyte, embryonic stages, and liver samples, detailed in Supplementary Table S1, were subject to RE discovery. This set of RNA-seq libraries encompassed both poly(A + ) selected and total RNA preparations. A database of REs across the different tissue and types was created by merging the novel REs from each study with the exonic and noncoding transcript REs (Supplementary Table S2) and used in all subsequent analyses. For any given tissue type, the majority of REs are lowly expressed, necessitating filtering of lowly expressed REs prior to analysis ( Supplementary Figure S1A and B). RE length was on average higher in exonic REs compared to novel REs (Supplementary Figure S1C).
The above RE identification method was developed to ensure accuracy in face of extensive RNA fragmentation, naturally occurring in human sperm. Certain tissue preparations, such as FFPE, also yield compromised RNA preparations. Given that several established transcript-building algorithms are readily available, we compared both Stringtie (v1.3.4) and Cufflinks (v2.2.1) to the RE approach for two random sperm samples and two male human cell lines. RNA-seq datasets from human cell lines, i.e. SRR020288 (h1 hESC) and SRR3192556 (OCI-LY7, derived from a B cell lymphoma), provided independent datasets when testing the RE method. Using minimal thresholds of expression (>10 RPKM in REs, >1 FPKM in Cufflinks and Stringtie), the majority of expressed REs overlap locations of transcripts generated using transcript-building software. Across the four samples, 67-92% of 'expressed' REs overlap Stringtie results, and 81-90% overlap Cufflinks results at the above thresholds of expression (Supplementary Table S3). Notably, regardless of the transcript-building method and required expression thresholds, a majority of REs (complete range 21-93%) lacking overlaps with Cufflinks and Stringtie results are Exonic REs, suggesting that the established transcript-building methods are less than ideal for fragmented or unevenly covered transcripts.
With the function of the novel REs being unknown, we hypothesized that the novel REs may have regulatory roles. To assess this, REs were overlapped with a series of epigenetic marks and regulatory genomic sequences (Supplementary Figure S3). For regulatory chromatin marks (proximity to DNase hypersensitive regions, proximity to CTCF binding sites, proximity to topologically associating domains (TADs), and proximity to ENCODE Transcription Factor Binding Sites (TFBS)) (44)(45)(46)(47)(48)(49), the novel RE classes largely showed a similar overlap proportion as Exonic REs. All RE classes showed very little overlap with piRNA clusters (50,51). Notably, all classes of novel REs had a high overlap (>50%) with repetitive sequences (UCSC's Repeatmasker track (for GRCh38) (52), compared to the ∼22% overlap in Exonic REs.

Poly(A + ) selection reduces RNA-seq complexity
The RE discovery algorithm was developed to identify transcribed intergenic loci from RNA-seq data. Many novel loci (e.g. Near-exon and Orphan REs) were hypothesized to be derived from non-polyadenylated RNAs, since this class appears underrepresented in the genome and the GENCODE annotations. A series of poly(A + ) selected and Total RNAs from a range of cell types that capture the period from fertilization to early embryonic development from oocyte, zygote, 2-cell embryo, 4-cell embryo, 8-cell embryo and morula ( Figure 2) and the male germline, through ejaculated sperm and testis, were examined (34)(35)(36). Applying a linear mixed-effects model (LMEM) to Total and poly(A + )selected RNAs from the human oocyte and various stages of early embryonic development, revealed a comparatively lower number of REs detected within the poly(A + ) selected fraction ( Figure 3A). In general, the number of novel REs that were either increased or depleted by poly(A + )enrichment do not markedly differ ( Figure 3B). Interestingly, the number of Orphan REs approximately doubled upon poly(A + )-enrichment as compared to the Total RNA fractionation. This suggests that a population of Orphan REs belong to a larger, yet unknown set of polyadenylated transcripts. To determine if poly(A + ) enrichment of Orphan REs reflected a specific class of genomic repeat, the distribution of repeats within the 150 poly(A + ) enriched Orphan REs was assessed and is shown in Figure 3C. Within the 129 Orphan REs that contain a repetitive element, the majority are LTRs and SINEs. It is worth noting that 40 of the 55 LTR-containing REs are ERVL-MaLRs (Supplementary  Table S4). This is a non-autonomous LTR-retrotransposon element derived from ERV (53,54) that may function in regulating gene expression during the oocyte-to-embryo transition (55).
The effect of poly(A + ) enrichment was also assessed individually for human sperm and testis samples, providing  Table S5).

Round spermatids and mature sperm have numerous intergenic SREs
RE expression throughout spermatogenesis was examined as a comparison to previously published patterns of whole transcript expression during the spermatogenic cycle(42). The spermatogenic stages encompassed six cell types (Spermatogonia through Round Spermatids), isolated using laser capture microdissection (42). Clustering of the various REs expression patterns across spermatogenesis was initially performed using Mfuzz, (40,41) with the published six cell types (42). As shown in Supplementary Figure S5, RE expression across spermatogenesis recapitulated those patterns previously observed using whole transcripts (values for each RE are provided in Supplementary Table S6). To extend the analysis to the final stage of spermiogenesis, RNA-seq from ejaculated sperm datasets from fertile males (30) were included (Figure 2). The addition of mature sperm enabled the discovery of several patterns specific to early round spermatids and maturing round spermatids, as observed through mature spermatozoa ( Figure 4A-D). The final stages of spermatogenesis involve a burst of transcription, as well as the formation (and eventual loss) of the residual body as the majority of the cytoplasm is expunged from the cell. The burst of transcription in round spermatids is observed as a general increase in transcription of exonic REs that include 34,226 REs found in round spermatids but not in the late pachytene stage spermatocytes. Interestingly, a large portion of spermatid and/or mature sperm-specific clusters are generated from novel REs, suggesting that intergenic and intronic REs play a substantial role in the final stages of spermatogenesis that forms each spermatozoon as summarized in Supplementary Table S7. To verify these observations, expressed (median expression >25 RPKM) REs for each spermatogenic stage were partitioned according to RE class ( Figure 4E). The vast majority of REs expressed in pre-meiotic and meiotic stages were exonic (85 ± 7%). This was followed by a notable increase in the number of novel REs in Round Spermatids and Spermatozoa. The contribution of novel REs to the total transcriptome rose to 47% in mature sperm.
Ontological analysis of the exonic and novel REs (with the exception of Orphan REs) showed that the most abundant REs in round spermatids were enriched for genes involved in organelle biogenesis and maintenance. This is in accord with the physiological changes occurring dur- ing spermiogenesis. REs that are abundant in both round spermatids and spermatozoa were enriched for TNF-alpha signaling, associated with maintaining a homeostatic state (56,57). REs that were primarily abundant in spermatozoa were associated with a range of signaling pathways, such as Glutamate Receptor signaling, WNT signaling, NGF signaling, EGFR1, and Signaling by Rho GTPases (Supplementary Table S8). WNT signaling has several roles in spermatogenesis, from maintenance to maturation, and thus motility (58-60). The role of NGF signaling in spermatogenesis in humans is unclear but has been implicated in mammalian Sertoli-germ cell signaling, sperm motility, and the acrosome reaction (61,62). Sperm EGFR activation is a major driver of sperm capacitation (63,64), while Rho GTPases are likely to aid as mediators of the acrosome reaction (65). Odorant receptors may be required for sperm chemotaxis in mammals (38), while glutamate receptors may also be involved in capacitation and/or sperm chemotaxis (66,67) although such functions have yet to be demonstrated in mammalian systems.

Sex-chromosome expression during spermatogenesis
Meiotic sex chromosome inactivation (MSCI), the process by which genes located on the X-chromosome are repressed during meiosis, is essential for successful meiosis during human spermatogenesis (68). However, abundant evidence suggests that numerous X-linked genes escape post-meiotic X chromosome silencing (PMCI), a process that may be less effective in humans than other species (69,70). In comparison, most classes of Y-linked REs undergo silencing during MSCI, with the exception of Y-linked Orphan REs that are present throughout spermatogenesis.
As shown in Figure 5, repression of exonic X-linked REs during spermatogenic MSCI is evident. This is followed by de-repression of X-linked exonic and novel REs, that return to pre-meiotic levels in mature sperm ( Figure  5A). Notably, several X-linked REs are intensely expressed (at a threshold of 25 RPKM) in solely one spermatogenic stage, including the post-meiotic stages, i.e., round spermatids and to a greater extent, mature sperm ( Figure 5B). Overall, the patterns of X-linked REs across spermatogenesis imply a larger upregulation of genes and novel REs in the post-meiotic stages than previously thought, with the number of expressed X-linked REs largely following the patterns laid by autosomes. We note that two of the 289 paternally transmitted REs were located on the Xchromosome, and both were exonic REs. The two REs are located (in hg38 coordinates) at chrX 2717605 2717652 and chrX 149929645 149930127, corresponding to CD99 and XX-FW81066F1.2, respectively. The spermatogenic roles of CD99, a cell surface glycoprotein involved in T-cell adhesion processes, and XX-FW81066F1.2, a poorly described transcript with a putative protein structure or antisense lncRNA function (5), are unknown.

Paternal transmission of REs
It has been proposed and shown in vitro that human sperm deliver a cadre of RNAs upon fertilization (20,27,71,72). In the current study, the series of human RNA-seq profiles from sperm, oocyte, and embryo allowed for the identification of REs that are transmitted to the human oocyte solely by sperm. These are in addition to those 26,740 zygotic REs (5% FDR), associated with a total of 6,118 individual named genes, which are essentially provided by the oocyte, but not present in sperm (Supplementary Figure S6). Up to 289 sperm REs were identified as a majority contributed by paternal transmittance, with an FDR of ∼3.4%, and 75 REs essentially provided by the sperm, at an FDR of ∼2.7% (Supplementary Figure S7A,B and Table S9). Interestingly, the 289 REs were enriched for 'cycling of RAN in nucleocytoplasmic transport' (P = 8.36 × 10 -8 ) and the Unc 51 Like Kinase (P = 1.47 × 10 -3 ). RAN cycling is required for effective translocation of RNA and proteins across the nuclear pore. The human sperm REs contain RANGAP1, XPO7, XPO6, NUP210 and NUP214, as members of the nucleoporin complex. Interestingly others have shown that at least in embryonic stem cells, the nucleoporin complex may regulate parentally imprinted genes (73). In comparison, the Unc 51 Like kinase is associated with autophagy a process that is essential for the oocyte-to-embryo transi-tion (74). These observations are consistent with the view that the paternal RNAs may contribute to re-establishing nuclear transport in the zygote and clearance of extraneous cellular complexes post-fertilization, when cell lineages begin to be established.

Differential gene expression during embryogenesis
Transcriptomic changes across mammalian embryogenesis have been well-studied, using both microarrays and RNAseq (75)(76)(77)(78)(79). However, these experiments have not addressed the contribution of intergenic RNAs to embryogenesis and, importantly, during human embryogenesis. Towards filling this gap, we examined expression changes of novel REs from oocyte to blastocyst while considering the contribution of the spermatozoon, testing the hypothesis that both exonic and novel REs would exhibit distinct patterns.
To identify differentially expressed REs, a linear model was applied to the single-cell oocyte and embryonic RNAseq datasets (35,80). Differential expression with REs reiterated previous analysis of RefSeq-annotated genes suggesting that the oocyte, zygote, and 2-cell embryo contain a similar distribution of transcripts (35). Few differences (59 REs) were identified between oocyte and zygote, and no differential REs were identified between zygote and 2-cell embryo (Supplementary Figure S8). As expected, exonic REs  Figure S9) (81). This included a set of novel maternal REs specific to the early zygote (maternal genes) and EGA (the 4-and 8-cell stage). The majority of these novel maternal REs are intronic, suggesting e.g. (1) incomplete processing, (2) expression within an intron, (3) retention of circular RNA, or some other form. They are supplemented by a series of maternal intergenic Orphan REs. Interestingly, these REs also followed similar patterns, defining clusters of REs that are present during the minor first wave of human ZGA, as well as clusters that are active during EGA ( Figure 6). While the novel REs with a maternal gene pattern are enriched for neuronal genes (Neuronal system, P = 2.12e-05), those expressed during EGA are associated with protein metabolism (P = 4.90e-06), consistent with the energy and synthesis requirements of the early embryo.

Expression of repeats during spermatogenesis and early embryonic development
Genomic repetitive elements and small non-coding RNAs are thought to play a role in confrontation-consolidation of the maternal and paternal genomes after fertilization (29,82). As novel REs tend to overlap genomic repetitive sequences, we employed RE expression to determine what genomic repeats may influence RNA expression throughout spermatogenesis and early human embryo development (Figure 7). The relative enrichment or depletion of repetitive sequences in the expressed REs was calculated for each available cell type. Briefly, the number of instances of genomic repeats overlapping expressed REs in each cell type were compared to an expected random distribution, with the random distribution drawn from the repeat occurrence in all available REs. Using a hypergeometric test, both relative enrichment and depletion of repeat families were calculated across cell types. Despite the many instances of repeat depletion, there were relatively few instances of enrichment.
Although several studies have examined the influence of environment on epigenetic marks, such as DNA methylation, at genomic repeats in spermatozoa, much less is known about genomic repeat expression during spermatogenesis and if genomic repeats are in part driving spermatogenesis, perhaps through transcriptional regulation or chromosomal reorganization (83,84). Four repeat families, LTR71B, HERVE-int, HSAT1 and MER1A were primarily expressed in both round spermatids and mature spermatozoa, while the centromeric repeat AATGG(n) showed greatest expression in the leptotyne/zygotene and late pachytene stages through the post-meiotic phase (85). The simple repeat AAGA(n) was enriched solely in mature spermatozoa. The genomic repeats identified here as expressed during spermatogenesis suggest that different repeats have different roles in spermatogenesis. For example, the centromeric repeat AATGG(n) likely plays a role in establishing stage specific chromosomal structure and position throughout spermatogenesis (86,87). The simple repeat AAGA(n) and HSAT1, primate-specific Satellite repetitive element, may also play a role in organizing sperm nuclear structure through Matrix-Associated Regions (MARs) of sperm, which are enriched in TTCT(n) and TCTT(n) repeats (87). The remaining spermatogenesis-associated repeats LTR71B, HERVE, MER1A are all members of the HERV family of retroviruses or DNA transposons. The murine embryo and sperm are known to express a LINE-1-encoded Reverse Transcriptase (RT) that may serve to reverse transcribe the sperm-supplied retroviral and transposon RNAs for integration into the genome (88)(89)(90). However, we note that the presence of LINE-1-encoded RT in mature murine spermatozoon, does not appear to be extended to an enrichment of LINE1 RNAs in human sperm. This likely reflects a species differences, although one cannot exclude the influence of differing methodologies. However, MLT2A1 and SVA-D are both present during EGA, while MER73 was strongly enriched in oocyte and the early embryo (Figure 7). Both MLT2A1 (primate-specific) and MER73 are LTRs for ERVL endogenous retrovirus, while SVA-D is a hominid-specific composite retroelement (SINE-R + VNTR + Alu) (54). Although SVA-D is a marker of naive human ESCs, consistent with the enrichment from 4 cell to morula stage, it is not enriched in blastocyst stage, from which human ESC cell lines are derived (91). The ERVL retrotransposon has been previously implicated in mammalian embryonic development (55).

DISCUSSION
In this study, we sought to enrich our understanding of the transcriptome across the human germ cell and early embryogenesis. To accomplish this, we interrogated publicly available RNA-seq datasets using a new method 'REDa' to identify novel RNA elements (REs). This method was used to detect REs in differentiating cells of the germline, embryonic cells, and somatic tissues. The RE discovery algorithm possesses a robust positive predictive value (PPV) (Supplementary Figure S2), eliminating background signal even at the lower thresholds. Novel REs are annotated as Intronic, near-exon (within 10 kb of an exon), and Orphan (>10 kb from an exon), and are considered along with the previously known exonic REs.
The accuracy of the RE approach, which separates exons into individual units, rather than linking exons into a whole transcript, was tested by comparing expressed REs in testis libraries to the testis expression levels given by GTEx. At least 91% of gene names associated with testisexpressed exonic REs overlap with gene names expressed in GTEx testis tissue, suggesting that the RE approach can recapitulate the patterns designated in established expression databases. The accuracy of the RE approach was further tested on poly(A+) selected libraries, reiterating previous studies that indicate a reduction of transcript diversity and exon expression upon poly(A+) enrichment (92). The number of human zygotic LTR and SINE-associated REs that may be derived from poly(A+) intergenic transcripts is of note. In accord with the data of others (93)(94)(95)(96), this could afford transcript stabilization and nuclear export (97) perhaps increasing their retention in a given cell of the dividing embryo. Notably, at least in mouse, the transcription of retrotransposon-derived RNAs is thought to impact chromatin accessibility, and thus embryonic development (98).
Isoform discovery approaches, such as Cufflinks and Stringtie (7,33), provide methods to suggest novel genes/isoforms, often relying on key structures like exonintron junctions (99). Spermatozoal RNAs are often fragmented, limiting the efficacy of established transcriptbuilding and differential expression algorithms. The RE method is solely intended for identifying expressed regions, which can subsequently be interrogated for the presence of novel isoforms or gene structures. A comparison of the RE approach to that of Cufflinks and Stringtie suggested that the established transcript-building methods are not sufficient for fragmented or unevenly covered transcripts. Additionally, the presence of spliced reads, a critical component to transcript-building, is reduced in spermatozoal RNAs (36% -40%) compared to RNAs from cell lines (41-64%). We note that others have also employed a targeted Cufflinks (35) discovery approach to identify novel linear embryo transcripts. Reflective of the low level of expression and rigor required for identification, the majority of these linear transcripts were not discovered using the RE strategy (data not shown).
The transcriptome of the human male germline has largely been limited to the whole testis, with a few studies generating information from isolated germ cell populations from this mildly heterogeneous tissue. This contrasts with the mature mammalian spermatozoon, which is known to contain a complex transcript population and can be obtained in a relatively pure form (30,100). As described above, a large proportion of novel REs contribute to the post-meiotic phase of human spermatogenesis. GO analysis suggested that a range of signaling pathways, such as Glutamate Receptor signaling, WNT signaling, NGF signaling, EGFR1, and Signaling by Rho GTPases, are associated with REs present in ejaculated spermatozoa, with several of these pathways linked to sperm capacitation and the acrosome reaction. The TNF-alpha signaling associated REs enriched throughout the post-meiotic phase of spermatogenesis may be another part of a surveillance mechanism to ensure an optimal contribution (32).
Relatively few paternal full-length RNAs are likely to be exclusively contributed to the embryo (22). Of note, the genes associated with the paternally transmitted REs did not overlap those long RNAs suggested to be paternally derived in mouse (101). This is likely due to the differences in genome activation, which occurs in the late 1-cell zygote in mouse (102), compared to the later 4-8 cell stage of human embryos, or other sperm derived RNAs providing a substitutive function (20,103). The paternally transmitted REs in human were associated with RAN cycling and autophagy, suggesting that the paternal RNAs may contribute to re-establishing nuclear transport in the zygote and clearance of extraneous cellular complexes postfertilization. Several paternal RNAs, all of which are expressed in human sperm (104), are generated from genes involved in RAN cycling (RANGAP1, XPO7, XPO6) or nucleoporins (NUP210, NUP214).
Although few paternally derived zygotic RNAs are Xlinked, the expression patterns of REs located on the X chromosome are congruent with the current paradigm of Meiotic sex chromosome inactivation (MSCI) and reactivation during spermatogenesis (70). The current study also showed robust repression of exonic X-linked REs during spermatogenesis, as required for successful meiosis. A robust post-meiotic de-repression of exonic and novel Xlinked exonic and novel REs also became apparent. The data suggest that the process of post-meiotic X chromosome silencing (PMCI) during human spermatogenesis is selective, as many genes and novel REs escape silencing.
The mechanism(s) driving spermatogenesis may involve the use of repetitive sequences as regulators of transcription and/or chromatin states (98,105,106). Its nuclear architecture reflects the complex and orchestrated compaction and restructuring of its chromatin via protamination. This is linked through the nuclear matrix/lamina in a non-random manner (107), consistent with the current 3D models (108). The enrichment of centromeric AATGG(n) repeat RNAs appears in the leptotyne/zygotene and late pachytene stages through the post-meiotic phase (85). This repeat can form a double folded hairpin (85), that in mice can promote RNA:DNA hybrids mediating heterochromatin formation (109). Perhaps this aids in excluding large repetitive DNA domains from homology searching enhancing the fidelity of meiosis as observed by the clustering of pericentromeric chromatin during meiosis (110).
A similar simple nuclear matrix/lamina associated repeat AAGA(n) that resides within the inner nuclear compartment (22) was enriched solely in mature spermatozoa yet does not appear in the oocyte or the developing embryo. As shown above four repeat families (LTR71B, HERVEint, HSAT1 and MER1A) are transcribed coincident with the physiological changes of spermiogenesis with marked enrichment in both round spermatids and mature spermatozoon. LINE1 RNAs, which encode reverse transcriptase, were not enriched in human sperm or the zygote. However, the presence of an RT in the early embryo would provide the opportunity for LTR71B, HERVE-int and MER1A, components of HERVs and DNA transposons, to undergo transposition (111). Insertion by retrotransposition might then act to provide regulatory networks, or genetically/ epigenetically modify the developing embryo (88,89) during syngamy.
Repetitive elements enriched during early embryogenesis were also identified in this study. Upon fertilization repeat classes expressed from spermatogenesis switch to MER73 of the oocyte, which then later change during EGA to include an endogenous retrovirus ( Figure 6). This suggests that the majority of zygotic repetitive element-containing RNAs are of maternal origin. During EGA, repeat expression again switches, to SVA-D and MLT2A1. Both ERVL retrotransposon LTRs have been implicated in mammalian embryonic development. HERVK is expected to increase in the morula and blastocyst stage human preimplantation embryos (112). However, we did not observe an enrichment for HERVK or HERVK LTRs in this set of expressed REs, a discrepancy that may be due to differing methods in library preparation and read assignment.
This study introduces a RE discovery algorithm (REDa) that identifies tissue and cell type specific expression in both exonic and intergenic REs. Expression patterns of REs were identified across human spermatogenesis, extending the current knowledge of the transcriptome in developing human sperm. In addition to observing considerable effects of poly(A + ) enrichment, the sheer abundance of intergenic RNAs suggests that they play a large role in spermiogenesis. Of note, extensive expression of repetitive elements during spermatogenesis, suggests that perhaps these are driving spermatogenesis, while sperm-delivered repeat-derived RNAs may play more of a regulatory role in the human embryo.

SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.