Intron exon boundary junctions in human genome have in-built unique structural and energetic signals

Abstract Precise identification of correct exon–intron boundaries is a prerequisite to analyze the location and structure of genes. The existing framework for genomic signals, delineating exon and introns in a genomic segment, seems insufficient, predominantly due to poor sequence consensus as well as limitations of training on available experimental data sets. We present here a novel concept for characterizing exon–intron boundaries in genomic segments on the basis of structural and energetic properties. We analyzed boundary junctions on both sides of all the exons (3 28 368) of protein coding genes from human genome (GENCODE database) using 28 structural and three energy parameters. Study of sequence conservation at these sites shows very poor consensus. It is observed that DNA adopts a unique structural and energy state at the boundary junctions. Also, signals are somewhat different for housekeeping and tissue specific genes. Clustering of 31 parameters into four derived vectors gives some additional insights into the physical mechanisms involved in this biological process. Sites of structural and energy signals correlate well to the positions playing important roles in pre-mRNA splicing.


INTRODUCTION
Discovery of eukaryotic genes as discontinuous structures, with protein-coding segments or exons disrupted by nonprotein coding segments or introns, was one of the most unanticipated findings in molecular biology (1), whose mystery is yet to be solved fully. In fact, identification of genes with correct exon-intron architecture is one of the hardest problems in eukaryotic genome annotation. The key sig-nals used for this purpose are the splice-site (SS) sensors which conventionally include a G-T sequence signal at the 5 SS (+1/+2 position at the 5 end of intron) and A-G sequence signal at the 3 SS (last two positions at the 3 end of intron). There exists a plethora of sequence variations at these sites, considering that thousands of different sequences act as naturally occurring splice sites in the human transcriptome (2), along with many variably located cryptic SSs (3). This greatly reduces the accuracy and fidelity of SS sequence signals for the identification of precise exon-intron boundaries, and the situation becomes more challenging in the wake of alternative splicing happening so prevalently in eukaryotes (4). The exact locations of the exon-intron boundaries are crucial not only for defining the encoded amino acid sequence but also for understanding the molecular mechanism underlying the regulation of pre-mRNA splicing, which is fundamental to understand gene expression and is of great medical relevance as at least 15% of human genetic disorders and many diseases are caused by aberrant pre-mRNA splicing (5).
Over the years, computational methods have emerged as a major force in fast and accurate characterization of genes/genomic segments. Some algorithms have been developed which determine a splice site based on a score calculated by measuring its concordance to matrices built using large collections of splice sites (6)(7)(8)(9). Earlier gene prediction tools like Genscan, Genomescan combine exonintron and splice signal models with similarity to known protein sequences in an integrated mode for gene predictions (10)(11)(12). Some other important tools like Genewise, Genomewise (13), Augustus (14), Fgenesh (15), GeneParser (16), GeneID (17) are ab initio gene prediction tools which have been developed using various programming models (Dynamic or Hidden Markov Model) on sequence information for functional signals including splice sites. These methods have been developed on huge training data and performance is high for a species/organism but for a naïve genome/genomic fragment, it decreases considerably. Some currently available splice-junction prediction tools identify exon-intron boundaries in mRNA sequences, for organisms with reference genome (18)(19)(20)(21) as well as without a reference genome (22), but these tools are unable to annotate splice junctions in DNA sequence. Recently, role of chromatin organization and nucleosome positioning as determinant of exon-intron boundary was also investigated; though it looks at the problem with a new angle, satisfactory level of sensitivity and specificity could not be achieved (23). Despite the many insights resulting from such studies over the years, it is apparent that our conceptual frameworks are not adequate yet. New ideas and models are needed for identification of splice sites in genome sequences.
So, if the discriminatory signals for exon-intron boundaries are not uniformly present in the corresponding sequence, where to look for such signals? It is a wellknown fact that DNA in living cell is not a uniform linear macromolecule but displays local structural and energetic variations which have been found to facilitate interactions with proteins and play a key role in several biological processes (24). The known structural biology of B-form DNA advanced dramatically with the solution of the crystal structure of the B-form oligonucleotide duplex d(CGCGAATTCGCG) in 1981, indicating the first observation of sequence dependent structural heterogeneity at the molecular level (25). Considerable subsequent efforts to gather data pertaining to sequence effects on the structure, during the last few decades, have led to revolutionary evolution in the analysis of nucleic acids structure (26)(27)(28)(29)(30)(31)(32)(33). Many studies over the years have found that similar sequences may lead to similar structure and energetics, but reverse is not true however, different sequences can lead to DNA molecules with similar structure and energetic properties (34)(35). Do exon-intron boundaries too represent a similar case where DNA attains a uniform and unique structural and energetic state, despite the presence of huge sequence variations, at these sites? But why would DNA structure and energetics change at exon-intron boundaries as splicing is an affair between pre-mRNA and spliceosome (a dynamic macromolecular machine composed of five small nuclear RNAs, associated polypeptides and many other protein factors) and there is never a direct interaction between spliceosome and DNA and so, this idea initially seemed unlikely. However, while carrying out literature survey, we started getting clues. Some studies have shown that exons have higher thermodynamic stability compared to introns, untranslated regions (UTRs) and intergenic regions, (36)(37). Though these studies do not investigate the energetics of splice sites, they indicate that exon-intron boundaries might show some signal depicting the transition in thermodynamic property from exon to intron or vice versa. Further, a large number of evidences have shown that pre-mRNA splicing is pre-dominantly co-transcriptional (38)(39)(40)(41). Evidences exist for kinetic coupling (42)(43)(44) as well as physical and mechanistic coupling (45)(46) between transcription and splicing. These studies are indicating an indirect link between DNA template and splicing. Do structure and energetics of DNA template at exon-intron boundaries offer some mechanisms to regulate both-the elongation rate of pre-mRNA as well as splicing of upstream intron, or, are they offering some platform to physically/mechanistically link the RNA polymerase II and spliceosome? Before going any further in this direction, it became imperative that the structural and energetic behavior of exon-intron boundaries be investigated. Since these signals do not manifest directly in the sequence itself, previous studies pertaining to sequence analysis of exon-intron boundaries (6-17) do not offer information regarding the structure and energy signals of boundary junctions.
Over the years, some very remarkable methods have become available for the analysis of nucleic acids structure (26)(27)(28)(29)(30)(31)(32)(33). During the last 15 years, we have also made some significant efforts to understand the DNA language in terms of its energetics and structure (47)(48)(49)(50)(51)(52)(53)(54)(55). For the present study, we proceeded by downloading all the exons (3 28 368) from protein coding genes of human genome from GENCODE database and obtained the genomic coordinates of exonstart and exon-end position. Using these genomic coordinates, two boundary sequences datasets were prepared-Dataset I and Dataset II, each having 3 28 368 sequences of length 401 nucleotides (detail in method section). These boundary sequences were subjected to structural and energetic characterization using 28 structural and three energy parameters. To obtain numeric values of conformational parameters for the unique di-nucleotides steps, we downloaded the crystals structures of B-DNA from Nucleic Acids Database (NDB) (55) and applied the Curves+ webserver (31) on these structures for the same. In-house programs were used for calculating the energy parameters (53). Here, we report that these parameters provide unique structural and energetic signatures at SS junctions and the information for these signatures is in-built in their sequences. Our results offer a whole new paradigm for understanding pre-mRNA splicing which can go a long way in understanding regulation of eukaryotic gene expression.

Boundary sequence dataset
Genome annotation file of human genome was downloaded from GENCODE database and from this, all the exons (3 28 368) from protein coding genes were extracted, and for each exon-exon-start and exon-end genomic coordinates were taken out. Using these genomic coordinates, two datasets for boundary sequences were prepared: Dataset I and Dataset II. Dataset I was prepared by extracting 401 nucleotides, spanning 200 nucleotides upstream and downstream, with respect to the exon end position, taking it as '0'; these sequences, each of length 401 nucleotides, represent exon sequence from −200 to 0 and intron sequence from +1 to +200. Likewise, Dataset II was prepared with respect to exon start position (the sequences here represent intron sequence from −200 to −1 and exon sequences from 0 to +200). In this way, each dataset has 3 28 368 sequences, of length 401 nucleotides each. As control dataset (Dataset III), we extracted 30 140 sequences of length 401 nucleotides from the middle of exons, which are >1000 nucleotides long (www.scfbio-iitd.res.in/chemgenome/intron exon).

Parameters for characterization of genomic sequences
We have used 28 structural and three energetic parameters. The structural parameters include--nine backbone (Alpha, Beta, Gamma, Delta, Epsilon, Zeta, Chi, Phase and Amplitude), eight inter-BP (Shift, Slide, Rise, Tilt, Roll, Twist, H-Rise and H-Twist), six intra-BP (Shear, Stretch, Stagger, Buckle, Propel and Opening) and five BP-axis (X Displacement, Y Displacement, Inclination, Tip and Axis-Bend) parameters. The values of these parameters were calculated by applying Curves+ webserver (44) on 74 B-DNA crystal structures obtained from NDB database (Supplementary Table S1.1) (55). After calculating values for all the parameters for each B-DNA structure, all occurrences of unique 10 di-nucleotide steps in the 5 to 3 direction were considered for each parameter and the average of all the occurrences were calculated. Proper methods were used for the statistical analysis of angular values (56)(57)(58).
The energy parameters include-hydrogen bond energy, stacking energy and solvation energy. The values for these three energy parameters for the unique 10 di-nucleotide steps was done as reported in our previous work (53).
The numeric values, of all the 30 parameters, for the unique di-nucleotides steps, obtained above are provided in supplementary Table S1.2. All the numeric conversions of the present study were made according to this table.

Obtaining the structural and energy numeric profile of each sequence
The calculated di-nucleotide values for each parameter (from Supplementary Table S1.2) were used for getting numeric profile of each sequence of all the three datasets by performing moving average calculation on a sliding window of 25 bp covering 24 di-nucleotide steps (the first element of the moving average is obtained by taking the average of the initial first 24 di-nucleotide steps then the window is shifted forward, excluding the first number and including the next set of 24 di-nucleotide steps) (selection of the 25-bp window size was based on initial screening of sample data with window sizes of 15, 20, 25 and 30). The same exercise was performed independently on all the selected sequences for all the 31 parameters. In this way, 31 numeric profiles were obtained for each of the 3 28 368 sequences, for both the datasets: Dataset I and Dataset II. Likewise, all sequences of Dataset III (CDSs) were also subjected to numeric profile generation; 31 numeric profiles were generated for each sequence. [The term 'Profile' here is used for the unique set of numeric values for each nucleotide position (from −200 to +200, through 0) along the length of the sequence.] Data is available in raw csv format as supplementary file 2.

Normalization of values
To bring all the parameters on the same scale, the values were made dimensionless using normalization. The values were normalized between 0 and 1 by subtracting the minimum value of the profile from each value and then by dividing the value with the range of the profile (i.e. max -min).

Error analysis of data
The standard error of the mean at each position from −200 to +200 for all the parameters was calculated by dividing standard deviation of values at that position divided by square root of total number of observations. The standard error along with mean value is presented in Supplementary Figure S1.1a-e and S1.2a-e as shaded error bars.

Profile plotting of sequences
The plotting was performed using MATLAB software.

Examining the observations on individual sequences
To examine the generality of observations on individual sequences of Datasets I and II, following methodology was used.
For sequences of both datasets, for each parameter, a vector of 61 residues in length (spanning −30 to +30 through 0) was taken and was named as junction vector. To generate the CDS vector (as control, for comparison), for every position in the junction vector, a relative position towards the exon region was mapped at 150 residues away from it. For Dataset I (exon from −200 to 0 and intron from +1 to +200) the control vector was upstream of junction vector while for Dataset II (intron from −200 to +1 and exon from 0 to +200) control vector was downstream of junction vector. Then for every pair of junction vector and corresponding CDS vectors, the area enclosed by them was calculated. Logic is that, those pairs of junctions and CDS vectors where area enclosed by them is small (<2 standard deviations from the mean) will be indistinguishable, whereas vice versa is true for pairs having area greater than this value. Thus, sequences which qualified the threshold criteria of (mean -2 × standard deviations) for the area calculated, were selected as having significant junction signals and those not meeting the threshold criteria were considered as sequences not having the signal. Formulas for calculation of area under the curve and optimization process of threshold values are given in supplementary methodology S1.1a, b and Supplementary Figures S1.3-S1.4.

Signals in housekeeping genes and tissue specific genes
In order to compare the signals, at splice junctions, of housekeeping genes with those of tissue specific genes, we obtained the complete list of 53 exons from 11 housekeeping genes (59) and 141 exons from 11 tissue specific genes (top 6 brain specific and top 5 liver specific genes) (60). With respect to the exon start and end position in each case, 200 nucleotides were extracted from each side from the corresponding genomic sequence, as explained earlier, to prepare two datasets for both housekeeping genes (Dataset HK I and Dataset HK II) and tissue specific genes (Dataset TS I and Dataset TS II). Plotting was done as explained earlier.

Clustering the data into sets of few plots
To simplify the data to facilitate a better interpretation, the data of 31 plots for both the Datasets were clustered into a set of a few plots. Data were sorted out by the value near Nucleic Acids Research, 2021, Vol. 49, No. 5 2677 position −25, identified as having positive or negative values (and slopes). Five parameters (Y-displacement, Opening, Delta, stacking energy and Solvation energy) did not match these profiles and so were eliminated. Rest 26 parameters were clustered in two groups, based on the value near −25 position. Group I represents those parameters which exhibited an increase near −25 position and included 13 parameters-Stretch, Rise, Tilt, Roll, Twist, H-rise, H-twist, beta, Gamma, Epsilon, Phase, Amplitude, Hydrogen Bond Energy. While 13 parameters (X-displacement, Inclination, Tip, ax-Bend, Shear, Stagger, Buckle, Propeller twist, Shift, slide, alpha, Zeta and Chi) showed a decrease in value near −25 position and were clustered together as group II. The plots of these two groups for both the Datasets-I and II, were generated by scaling each data set as follows: (dataaverage(data)/(max(data) -min(data)).

RESULTS AND DISCUSSIONS
Numeric profiles of 31 parameters (28 structural and 3 energy) were obtained for pooled sequences of each dataset: Dataset I (3 28 368 sequences), Dataset II (3 28 368 sequences) and Dataset III (30 140 sequences) (each sequence of length 401 nucleotides). For this, for each parameter, numeric profiles of all the individual sequences belonging to a particular dataset were superimposed and average over all numeric sequences for each position was calculated. In this way, for each dataset, we obtained 31 average numeric profiles and these average profiles were then used for the plotting purpose, with abscissa showing nucleotide position and ordinate representing the numeric value of that parameter (Supplementary Figures S1.1-S1.2a-e, parameter-wise plot for the three datasets (Datasets I, II and III), showing the error bars too). To evaluate all the parameters on single scale, values were normalized and all the 31 normalized parameters were plotted together on this new scale, for the three datasets-Datasets I, II and III. (Figure 1).
It is clear from Figure 1 that for all the 31 parameters, a unique pattern is observed from −50th to +25th position, which is quite distinct from the corresponding upstream and downstream regions, indicating a considerable change in DNA structural and energetic properties at these locations. Figure 1a represents the signal obtained for Dataset I (exon sequence from −200 to 0 and intron sequence from +1 to +200), showing the parameter profile as we move from exon to intron while Figure 1B represents the profile of Dataset II (intron sequence from −200 to −1 and exon sequence from 0 to +200) as sequence transitions from intron to exon. It is quite notable that though change in values of each parameter starts happening from around −50th position, the pattern of change is quite unique for each parameter for both the datasets; for some parameters, values increase initially followed by sudden decrease and the reverse for others. On the other hand, the plots of CDSs (Dataset III), as shown in Figure 1C, come as straight lines, with no changes occurring anywhere across the entire length of sequence (parameter wise value is given in Supplementary  File S2). This clearly suggests that DNA undergoes a distinct change in its structure and energy as it transitions from exon to intron and vice versa while no such change occurs across the length of CDS.
Since Figure 1 is an average plot of all the sequences of a particular dataset, it becomes imperative to know the generality of this observations across the individual sequences. Using the methodology as explained in method section, it was observed that for both the datasets, the signal for each parameter, at positions '−30 to +30' was observed in >95% of the sequences (detailed results are available in Supplementary Table S1.3 and S1.4). A distribution plot of area calculated for every pair of junction vector and CDS vector (3 28 368) is not feasible, with such a large data. The observation of structural and energy signals on such a large percentage of data led us to investigate the situation at sequence level too. We wanted to know whether at these positions (i.e. −30 to +30) some consensus exists at sequence level or not. For this, sequences within each dataset (Dataset I and Dataset II) were aligned from −30 to +30 positions and consensus was observed using WebLogo3 software (61) (Figure 2).
It is very clear from Figure 2 that for both the datasets, some consensus is observed only at positions 'from −5/4 to +6/7', whereas for rest of the positions, there is no consensus. The results are clearly indicating towards the universality of structure and energy signals compared to sequence signals for splice site identification. The presently available methods (6)(7)(8)(9)(10)(11)(12)(13)(14)(15)(16)(17) for splice site predictions are predominantly based on sequence information for various functional regions at and near to the acceptor and donor splice site. However, since the present study does not offer the designing of a method/algorithm for splice site prediction (though it is the eventual target of our study), rather represents the first stage only (dealing with the identification and characterization of structure and energy signals at splice sites), a comparison with the existing methods for prediction is not possible at this stage.
When signals at splice junctions were compared for housekeeping versus tissue specific genes, comparatively sharper and distinct signals were observed for housekeeping genes ( Figure 3A and C) than tissue specific genes ( Figure  3B and D) (Figure 3).
Further, for both types of genes, signals were sharper when sequences move from intron to exon ( Figure 3C and D) compared to exon to intron ( Figure 3A and B). This observation can lead to some deeper insights into the role of in-built designs of genes in gene expression, though it is difficult to comment further on this issue with the present set of observations. Further studies are needed to give clear insights on this aspect.
We attempted to understand our findings in the light of existing mechanisms for pre-mRNA splicing. Two unique spliceosomes coexist in most eukaryotes. We preferred the most common mechanism involving U2-dependent spliceosome, ignoring the one with less abundant U12-dependent spliceosome which is present in only a subset of eukaryotes (62).
There are many cis-acting elements present on both sides of splice junctions which play important role in spliceosome assembly and splicing ( Figure 4A). These include branch site (BS), polypyridine tract (PYT), exonic and intronic splicing enhancers (ESEs and ISEs) or silencers (ESSs and ISSs). The BS is typically located 18-40 nucleotides upstream from the 3 SS while PYT is present variably between   BS and 3 SS. Location and sequence of silencers and enhancers is highly variable.
To correlate these important sequence positions to the corresponding state of DNA structure and energy, we put a simplified structure and energy parameter plot (involving 26 parameters only) just below it (as Figure 4B-E) (details in method section). Further, to gain some insights into the underlying correlations/anti-correlations among these 26 parameters, position specific correlation coefficients were calculated for the positions from −25 to −35, for both the datasets (Supplementary File S3) and corresponding heat plots were also generated (Supplementary Figure S1 . Such an observation opens up gates for understanding many underlying mechanisms governing DNA structure as well as the purpose behind the changes in various structural parameters at the intron-exon boundary junctions, definitely calling for more research for further clarity. Some changes corroborated well with the established facts like--the negative coupling of two pairs of dihedral angles: epsilon-zeta (epsilon increased while zeta decreased) (63) and alpha-gamma (alpha decreased while gamma increased (64), which are associated with change of DNA structure from canonical to non-canonical state.
Spliceosome assembly occurs by the ordered interaction of the spliceosomal snRNPs (small nuclear ribonucleic proteins) and numerous other splicing factors (SF) (65)(66). In the first step, the 5 SS is recognized by U1 snRNP, the BS by SF1, and the PYT by U2AF65 (SF). The BS and PYT show poor sequence consensus while for the −3 to +6 region of the 5 SS, >9000 sequence variants have been recently recorded (67). Our effort to find the nucleotide frequency at individual positions, from −25 to +25 with respect to each splice site, also revealed no significant consensus frequencies for these positions ( Figure 4F and G) (Position wise frequency data is given in Supplementary File S4). What then drives the initial identification of these three important sites? Figure 4B-E indicate towards a unique structural and energetic state of DNA at these positions which might act as the identification signal for these sites. In recent times, many studies have emerged which show that DNA structural and energetic properties greatly aid the targeting and functionality of DNA-binding proteins in a wide variety of ways (24,68).
In the second step of spliceosome assembly, the U2 snRNP joins BS by replacing SF1 (forms the A complex), followed by subsequent joining of the U4/U6.U5 tri-snRNP (the B complex); extensive structural rearrangements occur at this stage to activate the spliceosome. After the rearrangements (where U1 and U4 snRNP leave the assembly), the U6 snRNP directly interacts with 5 SS and U5 snRNP directly interacts with 3 SS. Since U6 and U5 are linked, being part of tri-snRNPs, it brings 5 SS and 3 SS in a juxtaposed orientation (activated B complex). It is followed by first trans-esterification reaction at 5 SS, resulting in formation of C complex where second catalytic reaction occurs at 3 SS, resulting in release of intron and ligation of exon ends. In the light of the results obtained in the present study, it can be speculated that exact pattern of structural and energy changes occurring at both ends of intron (Figures 1 and 4) might have some role to bring the two ends of intron in juxtaposed orientation; more studies are definitely needed to corroborate the fact.
The above results, undoubtedly, affirm the active role of DNA structure and energetics at the SS junctions, though from the present study it is difficult to interpret the exact nature of their role. On further contemplation, a surge of queries emerges. What type of structure and energy state DNA adopts at SS junctions? Why is the DNA undergoing such changes at SS junctions when it is not directly interacting with spliceosome? How RNA polymerase II responds to this change in template? Does splicing of pre-mRNA at SSs is affected/stimulated by structure and energy states of corresponding sites at template DNA, if yes then how? Numerous evidences do exist for post-transcriptional mechanisms of pre-mRNA splicing, but the role of DNA templatebased changes is unclear. Attempts to answer above questions would reveal fundamentally new insights into the regulation of gene expression. We plan to address some of these questions in near future and anticipate that many new dimensions would be added by the scientific community involved in similar work.

CONCLUSION
Structural and energy analysis of 6 56 736 genomic sequences, pertaining to exon-intron boundary sites (3 28 368 sequences of each type--exon to intron and vice versa) clearly points to the existence of physico-chemical fingerprints for these locations, irrespective of whether consensus exists at sequence level or not. Identification of precise splice sites in eukaryotic genes/genome annotation has re-mained a big challenge till date because of poor sequence consensus at these sites. Using the observations of present study, we hope to develop an efficient algorithm/method for splice site prediction. Existence of physico-chemical fingerprints conveying the functional destiny of DNA sequences has earlier been used for identification of many elements, like promoters, operator/regulators as well as for different classes of RNAs, in genome/genomic segments by many scientific groups.

SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.