-
PDF
- Split View
-
Views
-
Cite
Cite
Wenzhen Cheng, Conghao Hong, Fang Zeng, Nan Liu, Hongbo Gao, Sequence variations affect the 5′ splice site selection of plant introns, Plant Physiology, Volume 193, Issue 2, October 2023, Pages 1281–1296, https://doi.org/10.1093/plphys/kiad375
- Share Icon Share
Abstract
Introns are noncoding sequences spliced out of pre-mRNAs by the spliceosome to produce mature mRNAs. The 5′ ends of introns mostly begin with GU and have a conserved sequence motif of AG/GUAAGU that could base-pair with the core sequence of U1 snRNA of the spliceosome. Intriguingly, ∼ 1% of introns in various eukaryotic species begin with GC. This occurrence could cause misannotation of genes; however, the underlying splicing mechanism is unclear. We analyzed the sequences around the intron 5′ splice site (ss) in Arabidopsis (Arabidopsis thaliana) and found sequences at the GC intron ss are much more stringent than those of GT introns. Mutational analysis at various positions of the intron 5′ ss revealed that although mutations impair base pairing, different mutations at the same site can have different effects, suggesting that steric hindrance also affects splicing. Moreover, mutations of 5′ ss often activate a hidden ss nearby. Our data suggest that the 5′ ss is selected via a competition between the major ss and the nearby minor ss. This work not only provides insights into the splicing mechanism of intron 5′ ss but also improves the accuracy of gene annotation and the study of the evolution of intron 5′ ss.
Introduction
Most eukaryotic genes consist of exons and introns. The precise removal of introns from pre-mRNAs is necessary to enable protein biosynthesis. Typically, the intron in pre-mRNA usually starts with GU and ends with AG, and the whole intron contains 3 conserved sequences: intron 5′ end starting sequence composed of GUAAGU, intron 3′ end sequence composed of (Py) nNPyAG, and a “branch point” of a conserved adenine 18 to 40 nucleotides upstream of the 3′ end (Sheth et al. 2006; Wilkinson et al. 2020). A single gene could undergo alternative splicing to produce multiple mRNA transcripts through different combinations of exons and introns (Reddy et al. 2013). Alternative splicing is mainly classified into the following categories: alternative 5′ splice site (ss), alternative 3′ ss, exon skipping, intron retention, and mutually exclusive exons (Reddy et al. 2013). Alternative splicing increases the diversity and complexity of gene expression as well as regulates the function and interactions of proteins (Staiger and Brown 2013; Jia et al. 2016; Li et al. 2020).
Splicing of pre-mRNAs into mature mRNA requires the involvement of a highly dynamic supramolecular ribonucleoprotein complex spliceosome. Two different types of spliceosomes exist in most cells: the major class or U2-type spliceosomes prevalent in eukaryotes and a minor class or U12-type spliceosomes that may not be present in some organisms (Burge et al. 1999). The splicing reaction undergoes several independent stages, and the recognition of ss depends on the conserved sequence motifs at the ends of introns. The first step in the splicing process is the recognition of the splice donor by U1 snRNP at the highly conserved GU dinucleotide (Konarska 1998; Pomeranz Krummel et al. 2009). U1 snRNAs pair their single-stranded 5′ end with a string of 4 to 6 bp bases at the 5′ ss. Structural and RNA binding analyses indicate that U1 snRNP selects 5′ ss nucleotides mainly by complementary pairing with U1 snRNA bases (Kondo et al. 2015). The U1 snRNP complex comprises U1 core and U1 auxiliary subunits. The U1 core mediates the interaction between U1 snRNA and the 5′ ss, and the U1 auxiliary subunits stabilize this interaction (Forch et al. 2002; Chang et al. 2022; Espinosa et al. 2022).
Although the 5′ ss selection and its splicing are believed to be mainly determined by base pairing, it was reported that the key component of U1 snRNP, U1C, can recognize the 5′ ss in the absence of base pairing (Du and Rosbash 2002), suggesting other factors may also be involved in 5′ ss selection. The selection of the 5′ ss is a crucial step in pre-mRNA splicing, as it defines the boundaries of the intron–exon junction and affects the coding potential of the resulting mRNA. Many factors affect the selection of the 5′ ss, including the sequence of the ss, splicing regulatory elements, splicing factors, and their regulators (Reddy 2007; Gao et al. 2015). These factors are important for the selection of the alternative 5′ ss.
Genome-wide analyses of the transcripts and the ss of various species, including Arabidopsis (Arabidopsis thaliana), many other plants, and metazoans, revealed that GT-AG introns are the dominant types of introns, accounting for ∼98% of introns, while GC-AG introns, a typical type of noncanonical introns (Supplemental Fig. S1), account for ∼1% of introns (Sheth et al. 2006; Ye et al. 2017; Pucker and Brockington 2018; Frey and Pucker 2020). There are various sequence combinations at the 5′ ss of introns that should affect the efficiency and the selection of the 5′ ss (De Conti et al. 2012; Carranza et al. 2022). Previous studies of 5′ ss mainly focused on bioinformatic and statistical analysis of the sequence at the 5′ ss. However, the underlying molecular mechanism of these sequence variations and their consequences are not functionally dissected.
In our sequence analysis of the Filamenting temperature-sensitive mutant Z1 (FtsZ1) gene in plants, we found that it has noncanonical GC-AG introns in various species, which could cause alternative splicing of the genes and misannotation of the genes with insertion or deletions. We then conducted a comprehensive analysis of the sequence at the 5′ ss in Arabidopsis and many sets of elaborately designed mutational analyses to study the relationship between sequence variations at 5′ ss and their consequences. Our results suggest that besides affecting the base pairing with U1 snRNA, sequence variations at 5′ ss influence the interaction with U1 snRNP by steric hindrance. This occurrence not only has an effect on the splicing efficiency of the ss but also may result in alternative splicing of the intron.
Results
GC-AG splicing results in misannotation of FtsZ1 in various plant species
FtsZ1 is an important chloroplast division protein well-conserved in plants (Osteryoung and McAndrew 2001; Liu et al. 2022). In a BLAST search of FtsZ1 in the NCBI protein database, we found that in many plants, the sequences between “KRSLQ” and “ALEAI” often have insertions or deletions (INDEL) of amino acids comparing with the FtsZ1 protein in Arabidopsis and other plants (Fig. 1A and Supplemental Table S1). This region is highly conserved (Fig. 1A), and insertion or deletion of amino acid residues should have adverse effects on the function of the protein. Therefore, we sought to verify the sequence and the annotation of this region in these species.
Initially, we selected poplar (Populus tomentosa) to conduct reverse transcription PCR (RT-PCR). The sequencing result of cDNA was translated into protein sequences, and no amino acids were found to be added or missed in this region (Fig. 1, A and B). This finding suggested that the gene was misannotated. Further analysis of the genomic DNA sequence revealed that instead of using GT bases at the 5′ end of the ss of the third intron as in Arabidopsis, GC bases were present at this site. The misannotation was caused by choosing a GT ss close to the true ss. Similarly, the FtsZ1 gene was also misannotated in many other plant species (Fig. 1 and Supplemental Table S1).

GC-AG splicing results in misannotation of FtsZ1 in various plant species. A) Multiple sequence alignment and misannotation of FtsZ1 sequences flanking the third intron in various plant species. FtsZ1 were misannotated with insertion or deletion in some plants, which are shown with red font. B) Multiple sequence alignment of the third intron and the flanking sequence of FtsZ1 in some plant species. Sequences with red font are misannotated as exon or intron, which are shown with upper and lower cases, respectively. Sequence analysis of more proteins can be found in Supplemental Table S1.
Sequences near the ss of noncanonical GC-AG intron are more conserved than canonical GT-AG intron
Based on TAIR 10.0 annotation, the Arabidopsis genome contains 122,809 introns, including 1,240 GC-AG introns (Fig. 2A). We have also expanded our analysis to rice (Oryza sativa), maize (Zea mays), cotton (Gossypium hirsutum), and rose (Rosa chinensis) (Supplemental Table S2). The result indicated that in rice, there are a total of 139,472 introns, including 1,801 predicted GC-AG introns. In maize, there are a total of 242,267 introns, including 3,357 predicted GC-AG introns. In cotton, there are a total of 93,305 introns, including 1,203 predicted GC-AG introns. In rose, there are a total of 130,158 introns, including 1,102 predicted GC-AG introns. Although the GT-AG splice type is predominant in these plants, the GC-AG splice type does exist as a minor type. Therefore, there might be a special splicing mechanism for GC-AG introns.

Frequency of the sequences flanking the 5′ intron ss in Arabidopsis. A) The number of introns with different splicing types in Arabidopsis genome. Sequences of +1 and + 2 positions at the 5′ ss and the sequences of −2 and −1 positions at the 3′ ss were analyzed. B) WebLogo plots of the frequencies of bases from +1 to +6 and −6 to −1 in GT-AG and GC-AG introns. Base frequencies from +1 to +20 and −20 to −1 are shown in Supplemental Fig. S3. C) Frequencies of different sequences at +3 to +5 positions of the 5′ ss in GT-AG and GC-AG introns. Frequencies of the top 20 combinations are shown. D) and E) Frequency of the upstream sequence of 5′ ss. 20 bp exon sequences upstream of the intron ss were plotted with WebLogo. Sequences of the first 5 bp of the introns and their frequencies were shown above each plot. D) The top 6 types of GU introns. E) The top 6 types of GC introns.
To explore the underlying mechanism for GC-type splicing, we analyzed the sequence flanking the 5′ ss of the introns of protein-coding genes and transposable elements in the Arabidopsis genome. Based on the base pairing with snRNP, 6 bp sequences at the 5′ or 3′ ends of all the GT-AG and GC-AG introns were chosen from the genomic DNA for the analysis (Fig. 2B). The ratios of the bases near the GT-AG and GC-AG ss differ, with “GCAAG” accounting for a substantially higher ratio in GC-AG introns than “GTAAG” in GT-AG introns (Figs. 2B and S2). A more detailed analysis of the combinations of the 3 bp following GT or GC suggested that although AAG is the most prevalent sequence combination in both modes, it accounts for only 17.17% in the GT mode but 58.06% in the GC mode (Fig. 2C). Since the 2 sequence groups have different sizes, the Mann–Whitney U test and t test were performed to minimize the effect of sample sizes. The P values were far <0.01 (<1e−99). Therefore, the sequence combination of AAG at +3, + 4, and +5 positions is much more preferred by GC introns. The next top 4 combinations in GT mode and GC mode are the same. However, the combinations in GT mode are much more diverse in total (Figs. 2C and S3 and Supplemental Tables S3 and S4).
We also analyzed the 20 bp sequence upstream of the ss of the top 6 base combinations (Fig. 2, D and E). Interestingly, for the most prevalent combination, “GTAAG,” there is no strong base preference at −1 and −2 positions. In contrast, all 6 combinations of the GC splicing mode have a very high preference for AG at −1 and −2 positions. This is another special character of the GC splicing mode.
+2 position is very important for intron splicing, while C at +2 position is only conditionally important
Since the FtsZ1 gene has both GT-AG and GC-AG splicing modes in various species, we selected the AtFtsZ1 gene and PtFtsZ1 for the experimental test to learn whether the splicing process would be normal when GT and GC ss are swapped (Figs. 3, A and B, and S4). In Arabidopsis, the splicing mode of the third intron of the AtFtsZ1 gene is GT-AG, while in poplar, the splicing mode is GC-AG (Fig. 1). These 2 genes and the corresponding base-swapping mutants were then transiently expressed in Nicotiana benthamiana, a widely used dicot plant for transient gene expression assays. Sequencing of the RT-PCR products indicated that when the ss GT was mutated to GC at the 5′ end of the third intron of AtFtsZ1, a 5 bp sequence or the whole intron was retained, whereas for PtFtsZ1, the minor aberrant splice isoform disappear after GC was mutated to GT (Fig. 3B). Next, we further tested 3 introns of Tic20-I and LLG1 (Figs. 3, A and B, and S4) by constructing vectors that mutated the ss of the second intron of Tic20-I and the second intron of LLG1 from GC to GT. The sequencing results of RT-PCR products showed that the introns were spliced normally or even with higher efficiency (Fig. 3, A and B). However, when GT was mutated to GC at the 5′ ss of the first intron of LLG1, the first intron was entirely retained (Fig. 3, A and B). These results indicated that the 5′ intron ss could tolerate the swapping from GC to GT at +1 and +2 positions but not GT to GC.

A preliminary analysis of the effects of C to T or T to C mutations at +2 position on intron splicing. A) Electrophoresis of the RT-PCR products. L, the lower band. B) Sequencing analysis of the RT-PCR products in A). Genomic DNA sequences are shown below the name of each gene or mutant. The mutated bases are shown in red font. L indicates that for Tic20-I, LLG1, and their mutants, the lower bands in A) were sequenced. The mutated bases in DNA sequencing results are indicated with “*”.
As shown above, swapping the sequence at the ss of the first intron of LLG1 from GTACA to GCACA (Fig. 3B) abolished splicing. Since GCAAG is the dominant type of sequence at the GC intron ss, we mutated the sequence further from GCACA to GCAAG to test the effect (Fig. 3). In this case, the first intron was spliced out normally. Therefore, the following 3 bp sequence has an important effect on the splicing of the GC intron.
To further dissect the role of the +2 position on intron splicing, we conducted mutational analysis with the UMP1a gene, which has a 5′ ss sequence of AG/GTAAGC of the third intron (Fig. 4). When the +2 ss base was mutated from T to C, it had no effect on mRNA splicing (Fig. 4, A, B, and C). When the +2 ss base was mutated to A, a large proportion of the intron was unspliced, whereas mutating to G resulted in the intron being entirely unspliced and retained (Fig. 4, A, B, and C).

A further mutational analysis of +2 bases on intron splicing. A) Electrophoresis of the RT-PCR products. L, lower bands; M, middle bands. B) The effect of base mutations on the splicing of the second intron of UMP1a. The relative proportions of the middle bands (M, unspliced) and the lower bands (L, spliced) in A) were analyzed. Error bars represent Sd of 3 measurements. C) Sequencing analysis of the RT-PCR products in A). Genomic DNA sequences are shown below the name of each gene. The mutated bases are shown in red font. L and M indicate the lower or middle bands, respectively, in A) were sequenced. The mutated bases in DNA sequencing results are indicated with “*”. D) Structural diagram of the U1 snRNP protein complex and a set of mutants at +2 position of the ss. Shown are the snapshots of the Arabidopsis U1 snRNP after 200 ns molecular dynamics. The ss of UMP1a and its 3 mutants were analyzed. The mutated bases are shown in red font. Lengths of the hydrogen bonds are indicated. Global views of these complexes are shown in Supplemental Fig. S5.
We then performed molecular dynamic simulations using the predicted U1 snRNPs complex and the ss sequences of the wild type and the 3 mutants of UMP1a (Figs. 4D and 5A). The root-mean-square deviation (RMSD) values of the distance between +1 to +3 bases and corresponding pairing bases from U1 snRNA were calculated from the molecular dynamics (MD) trajectories. The results showed that the wild-type pre-mRNA sequence AG/GUAAGC had the lowest distance. The distances were increased when the +2 position of the ss base was mutated from T to C, A and G, respectively. Especially for the A and G mutations, the distances were greatly increased, which would severely impair the base pairing with U1 snRNA and then the splicing of the intron. It is worth noting that for the T to C mutant, the base distance was not increased so much, which could be because the base C is smaller than A and G and its structure is similar to U. This may explain why this mutant was spliced normally. Thus, the steric hindrance effect of +2 position seems to play an important role in intron splicing.

Molecular dynamics simulations of base distance maps of the mutants at different time periods. The effects of mutations at various positions were analyzed. A) Mutations at +2 position of UMP1a intron as shown in Fig. 4. B) Mutations at +5 position of Tic20-I intron as shown in Fig. 6. C) Mutations at −1 position of LLG1 intron as shown in Fig. 7. D) Mutations at −1 and −2 positions of LLG1 intron as shown in Fig. 7.
Mutations at +5 positions affect intron splicing
The +5 position of the intron of Tic20 is G, which is the most preferred base. When this G was mutated to C and T in Tic20-5c and Tic20-5t, respectively, the ratios of the retained intron were apparently increased (Fig. 6, A and B). When this G was mutated to A in Tic20-5a, the splicing of intron was almost abolished. However, a further C to T mutation at +2 position in Tic20-2t5a reversed the splicing to levels comparable with that of the wild type (Fig. 6, A, B, and C). Molecular dynamic simulations were also performed on these mutants (Fig. 5B). The result was generally consistent with the RT-PCR results (Fig. 6). We can see that the distances between the ss and U1 snRNA were increased in Tic20-5c and Tic20-5t, greatly increased in Tic20-5a, and returned to normal in Tic20-2t5a. Thus, the +5 position also plays an important role in intron splicing, especially for the GC intron.

Mutational analysis of +5 bases on intron splicing of Tic20-I.A) Electrophoresis of the RT-PCR products. B) The effect of base mutations on intron splicing of Tic20. The relative proportions of the upper bands (U, unspliced) and the lower bands (L, spliced) in A) were analyzed. Error bars represent Sd of 3 measurements. C) Sequencing analysis of the RT-PCR products in A). Genomic DNA sequences are shown below the name of each gene. The mutated bases are shown in red font. U and L indicate that the upper and lower bands, respectively, in A) were sequenced. The mutated bases in DNA sequencing results are indicated with “*”.
Mutations at −1 position affect intron splicing
In the WebLogo plot presented in Figs. 2D and 2E, the probability of the AG bases at −2 and −1 ss is very high in both the GT-AG splicing mode and GC-AG splicing mode in most of the sequence combinations. To assess the role of these 2 bases on intron splicing, we mutated the original AA/GCAAGT of the LLG1 gene to AA/GTAAGT and AG/GTAAGT (Fig. 7). The RT-PCR result showed that the ratios of spliced cDNA were increased in both mutants. The efficiency of splicing decreased when the original AG/GCATGT of the Tic20-I gene was mutated to AA/GCATGT and GG/GCATGT (Fig. 7, A and B). For the AG to AA mutation, the major ss was moved to a 4 bp downstream ss in most of the transcripts (Fig. 7C), suggesting it was preferred by the spliceosome over the mutated ss. For the AG to GG mutation, the splicing was slightly affected. These data suggested that the −1 position has an important role in the splicing of GC intron, but probably not GT intron.

The effect of mutations at −1 and −2 positions on intron splicing. A) Electrophoresis of the RT-PCR products. B) Analysis of the relative proportions of the upper bands (U, unspliced) and the lower bands (L, spliced) in A). Error bars represent Sd of 3 measurements. C) Sequencing analysis of the RT-PCR products in A). Genomic DNA sequences are shown below the name of each gene. The mutated bases are shown in red font. U and L indicate that the upper and lower bands, respectively, in A) were sequenced. The mutated bases in DNA sequencing results are indicated with “*”.
A nearby GT ss is more competitive than the GC ss
The results of AtFtsZ1-3i2c, PtFtsZ1, and Tic20-−1A (Figs. 3 and 7) shown above suggest that 2 5′ ss close to each other may lead to alternative 5′ ss selection in 1 intron. To investigate the relationship between ss selection and sequences near the ss, several mutants of Tic20-I were created and analyzed (Fig. 8, A and B).

The effect of a close GT ss on the splicing of GC ss in Tic20-I.A) Electrophoresis of the RT-PCR products. B) Sequencing analysis of the RT-PCR products in A). Genomic DNA sequences are shown below the name of each gene. The mutated bases are shown in red font. L indicates that the lower bands in A) were sequenced. The mutated bases in DNA sequencing results are indicated with “*”.
Sequencing results showed that Tic20-gt1 (AG/GCAGGT) underwent alternative splicing, with either a 4 bp GC-AG retention or correct splicing, with the mutant transcript having a higher ratio, indicating that the newly created downstream mutant GT ss was preferred (Fig. 8). In Tic20-gt2 (AG/GCAAGGT) with 2 bp AG insertion, GCAAG was retained in almost all of the transcripts, indicating that the upstream GC ss has become less competitive, although its sequence combination is the most preferred GC type.
The GC and GT ss are very close to each other in Tic20-gt1 and Tic20-gt2 mutants, which may interfere with each other. Therefore, we designed 2 additional mutants, Tic20-gtgc and Tic20-gcgt, with 6 bp between the 2 sites (Fig. 8). In these 2 mutants, both the GC and GT ss have the optimal base combination, i.e. “AG/GCAAG” and “AG/GTAAG.” In Tic20-gtgc, the GT ss is located upstream, while in Tic20-gcgt, the GC ss is located upstream. This design may also be able to test if a downstream GT splice is more competitive than an upstream GC ss. The sequencing results showed that in Tic20-gtgc, the mRNA was alternatively spliced, with the majority spliced normally and a small proportion spliced at the downstream GC site. In Tic20-gcgt, the mRNA was not spliced at the upstream GC ss but spliced at the downstream GT ss. Thus, the GT ss was more competitive than the GC ss, even if it was downstream, and it seems that the downstream GC ss was more competitive than the upstream GC ss.
An analysis of the splicing-defective FtsZ1 gene in transgenic plants
As shown in Fig. 3, the AtFtsZ1-3i2c mutant is splicing defective with either a 4 bp insertion or retention of the intron 3 in the mRNA in the transient assay. To learn its splicing in stable transgenic plants, both the wild-type AtFtsZ1 gene and the splicing-defective AtFtsZ1-3i2c mutant gene were transformed into the ftsZ1 mutant (Fig. 9).
![Overexpression of a splicing-defective ftsz1 gene in ftsz1 mutant. A) Chloroplast phenotype of transgenic plants in Arabidopsis, scale bar = 10 µm. The wild-type AtFtsZ1 and splicing-defective AtFtsZ1-3i2c shown in Fig. 3 was transformed into ftsz1 mutant. The mutant was complemented in line 1 (L1) but not in line 2 (L2). B) Semiquantitative PCR analysis of expression levels of AtFtsZ1 gene in plants shown in A). The black triangle indicates the gradient dilution of RT-PCR from left to right: 1, 1/4, and 1/16, respectively. AtPP2AA3 was used as the internal control gene. C) Immunoblot analysis of the protein level of FtsZ1. For each line, 2 different plant samples were used for the analysis. CBB staining was used as a loading control. Because FtsZ1 was highly overexpressed in L2, its loading was 1/5 of the other samples. D) Sequencing results of the FtsZ1 cDNA in different plants. For AtFtsZ1-3i2c transgenic lines, different splice variants were indicated. E) A diagram of the alternative splicing of AtFtsZ1-3i2c in L1 as shown in D). Please note that overexpression of the wild-type FtsZ1 could block chloroplast division [A) to C)].](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/plphys/193/2/10.1093_plphys_kiad375/1/m_kiad375f9.jpeg?Expires=1747933283&Signature=fM7lAhoHYk8dsLzLR-tDyUgRTUVb5lmBvz4yQ7351Pn9wRoWHnsP0UX4T9fEXHA0htbZDg2gEhv1~8JFX6VIvLYl02q2WQVLpnH0sw20sXDWHlZpEpVvZeA4I-Lfzuupmr~IvTfoGSM3l88d-0snFRxREAaV8ozk7WkFbQO~AwQwPJ45Pe9eqc5ju0JUjbsqSeGkimXRlNoodhufzQwCJI1rEnlz5pfRymieds-P9PpNN8AHh5cqv17a0Kc0nxX5csFqdrIaRcmOJcatk1jqoVaqTyWMlYMOmKRSMcAyyTUcexhlN7ou49lhVb7tHNU1SlOr1-amJUaT-CFHoSB9AQ__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Overexpression of a splicing-defective ftsz1 gene in ftsz1 mutant. A) Chloroplast phenotype of transgenic plants in Arabidopsis, scale bar = 10 µm. The wild-type AtFtsZ1 and splicing-defective AtFtsZ1-3i2c shown in Fig. 3 was transformed into ftsz1 mutant. The mutant was complemented in line 1 (L1) but not in line 2 (L2). B) Semiquantitative PCR analysis of expression levels of AtFtsZ1 gene in plants shown in A). The black triangle indicates the gradient dilution of RT-PCR from left to right: 1, 1/4, and 1/16, respectively. AtPP2AA3 was used as the internal control gene. C) Immunoblot analysis of the protein level of FtsZ1. For each line, 2 different plant samples were used for the analysis. CBB staining was used as a loading control. Because FtsZ1 was highly overexpressed in L2, its loading was 1/5 of the other samples. D) Sequencing results of the FtsZ1 cDNA in different plants. For AtFtsZ1-3i2c transgenic lines, different splice variants were indicated. E) A diagram of the alternative splicing of AtFtsZ1-3i2c in L1 as shown in D). Please note that overexpression of the wild-type FtsZ1 could block chloroplast division [A) to C)].
Probably due to the positional effect, the expression levels of these transgenes varied. When the wild-type AtFtsZ1 gene was expressed at a level close to that of the endogenous AtFtsZ1 (L1), the chloroplast division defect was corrected (Fig. 9, A, B, and C). When the wild-type AtFtsZ1 gene was highly overexpressed (L2), it blocked chloroplast division (Fig. 9, A, B, and C). Interestingly, when the AtFtsZ1-3i2c mutant gene was highly overexpressed (L1), the mutant phenotype was rescued (Fig. 9, A and B). Immunoblot results indicated that the FtsZ1 protein level in these lines was close to that of the wild type (Fig. 9C). When the AtFtsZ1-3i2c mutant was overexpressed, but the actual expression level was not very high (L2), the mutant phenotype was retained (Fig. 9, A and B). Immunoblot results indicated that the FtsZ1 protein level in these lines was barely seen (Fig. 9C).
To further dissect the splicing mechanism of the AtFtsZ1-3i2c mutant in stable transgenic plants, the cDNA of these plants was sequenced (Fig. 9D). The results indicated that the third intron was also aberrantly spliced, which is very similar to the result of the transient expression (Fig. 3B). In the plant of recovered phenotype (L1), a very minor part of the cDNA that came from the correctly spliced mRNA was detected (Fig. 9, D and E). While in the plant of unrecovered phenotype (L2), the correctly spliced cDNA was undetectable (Fig. 9, D and E). One explanation of this special phenomenon is that the ratio of correctly spliced cDNA is very low; it can only be seen in the lines with a very high expression level. Moreover, the correctly spliced mRNA is more stable than the mis-spliced mRNA, which could be degraded through nonsense-mediated mRNA decay.
Discussion
The selection of 5′ ss is an important process of pre-mRNA processing, which could be complicated by the sequence at the ss (De Conti et al. 2012; Leader et al. 2021) and may result in alternative splicing of a gene (Ast 2004; Wong et al. 2018). The 5′ ss could be misannotated due to the complex situation at the ss. In the case of FtsZ1, due to the noncanonical GC ss in some plants and the forced choosing of a nearby in-frame GT as the ss by the computer program, many genes were misannotated with INDEL (Fig. 1 and Supplemental Table S1). To uncover the underlying splicing mechanism of GC introns, we analyzed all the introns in Arabidopsis, which has a high-quality genome and annotation. The results revealed that although GT introns are prevalent, GC introns do exist as a minor part, with a total of 1,240 introns identified (Fig. 2). Further analyses of the sequences around the ss of GT introns and GC introns revealed that they are generally more stringent and less diverse in GC introns in that GC introns have a much higher preference for AG at −1 and −2 positions and AAG at +3, + 4, and +5 positions.
To learn the different effects of GT or GC on intron splicing, we performed some swapping experiments. The results showed that the mutation of GC to GT resulted in normal or improved intron splicing, while the mutation of GT to GC resulted in worse or failed intron splicing (Fig. 3). A further mutation of the following sequence from ACA to AAG resulted in the recovery of failed splicing to normal levels. Analysis of the +2 position of the second intron of UMP1a, which has the optimal sequence combination for the 5′ ss, revealed that the GT to GC mutation had no effect on splicing, while GC to GA and GG mutations severely affected splicing (Fig. 4). Therefore, the +2 position plays an important role in intron splicing. We have also made a set of mutations on the +5 positions of the second intron of Tic20 (Fig. 6). The result showed that only the G to A mutation had a severe effect, but it was recovered by an additional C to T mutation at the +2 position. Furthermore, mutation analysis on A and G at −2 and −1 sequence suggested that G at −1 position is more important for splicing (Fig. 7).
Based on our experimental results and the sequence and structural analyses shown above (Figs. 2 to 8 and S2 to S5 and Supplemental Tables S2 and S3), we proposed a working model for the splicing of the 5′ ss by the spliceosome (Fig. 10, A and B). The traditional view is that the splicing efficiency of pre-mRNA by U1 snRNP complex depends primarily on the base pairing between the 5′ ss of pre-mRNA from −2 to +5 and U1 snRNA (Pomeranz Krummel et al. 2009; Malard et al. 2022). Our results support this view. For example, the splicing was increased in the mutants with C to T mutation at +2 positions (Fig. 3), and it was recovered in the GCACA to GCAAG mutant (Fig. 3). However, our data also suggest that steric hindrance may play an important role in splicing. This is supported by the mutation of T to C at the +2 position having no effect on the splicing; however, the mutation of T to A and G severely affected the splicing (Fig. 4). C does not base-pair with A, but A and G do not base-pair with A either. However, A and G are much bigger than C and occupy more space. Similarly, G to C or T mutations at the +5 position only had a mild effect on splicing; however, G to A mutation at the +5 position almost abolished the splicing (Fig. 6). This probably is because A is larger; therefore, it cannot fit the space in the spliceosome as well as C or T do. Similarly, sequence variations of other ss, especially at +5 and −1 positions, will not only cause base pair mismatches but also cause steric hindrance (Figs. 6 and 7). This will increase the RMSD value and reduce the splicing efficiency.

Models of sequence variations on intron splicing at 5′ ss. A) A schematic diagram of the effect of +2 site mutations on splicing efficiency. The effect of the mutations on splicing depends on the base changes. Mutations at +2 site may not only affect base pairing with U1 snRNA but also cause steric hindrance and thus reduce the catalysis of U1 snRNP complex. The asterisk (*) marks the position of the +2 site. B) A schematic diagram of the effect of base variations on splicing efficiency. Various mutations may not only affect the base pairing with U1 snRNA but also the binding of U1 snRNP complex and thus increase RMSD value and reduce the splicing efficiency. The effect of mutations on splicing depends on their positions, base changes, and numbers. C) A schematic diagram of the selection mechanism of the 5′ ss. The ss were simply divided into 3 classes, with class I as the strongest and class III as the weakest. The situations of splicing were also simply divided into 3 types. In type (1), if 2 ss are strong, although many of the introns could be spliced well, part of them may have alternative 5′ ss. In type (2), if one ss is not strong, but the other is very weak, the weak ss may be visible as the alternative 5′ ss. In type (3), one ss is apparently stronger than the other so-called hidden ss, no alternative 5′ ss will be detectable. This is the situation for most of the introns.
Our results showed that if the dominant ss is not strong enough, another weak ss may be chosen by the spliceosome, leading to alternative 5′ ss or even intron retention. For instance, the AtFtsZ1-3i2c and PtFtsZ1 gene use GC as the major ss and GT as the minor ss (Fig. 3). While in the AtFtsZ1 and PtFtsZ1-3i2t, which use GT as the major ss at the corresponding position, splicing at the minor GT ss was hard to be detected. In a set of GC and GT ss competition experiments, we found that a minor ss near the major ss could also be detected (Fig. 8).
Based on these results and our sequence analysis data, we proposed a working model for the selection mechanism of the 5′ ss (Fig. 10C). It has been observed that sequences close to a normal 5′ ss will often contain a hidden weak 5′ ss (Thanaraj and Clark 2001). In many mutants with a defective 5′ ss, a nearby alternative 5′ ss was chosen by the spliceosome (Frances et al. 1998; Isshiki et al. 1998; Li et al. 2019; Luo et al. 2019; Huang et al. 2022). The major reason why these 5′ hidden ss were not detected in the wild type may be that they were too weak compared with the normal 5′ ss. Of course, there should be many situations in the genome. In general, the selection of the 5′ ss depends on the competition and the strength of the normal ss and the nearby hidden ss. If one is apparently stronger than the other, alternative 5′ ss will be invisible, which is the case for most of the introns. If both of them are strong, it will result in alternative 5′ ss. If both are weak, it will cause alternative 5′ ss and intron retention.
As shown above, competition of upstream 5′ ss within an intron can lead to alternative splicing and exons of different lengths and sequences (Fig. 8). Alternative 5′ ss can also affect mRNA stability, translation efficiency, protein localization and stability, protein–protein interactions, and other aspects (Cheng et al. 2020; Huang et al. 2022). The regulation of alternative splicing and alternative 5′ ss was believed to mainly depend on specific splicing factors (Fang et al. 2004; Raczynska et al. 2009). The tissue-specific or condition-specific expression of these factors can result in specific alternative splicing patterns and functional regulation of the genes (Wang et al. 2012; Ding et al. 2014; Xin et al. 2017; Kathare et al. 2022; Misra et al. 2022; Steward et al. 2022). Our work suggests that the sequences near the ss may contain hidden ss. For GC-type 5′ ss and other weak 5′ ss, or mutants of the original 5′ ss, the hidden 5′ ss can be chosen with a considerable frequency (Figs. 3, 4, and 6 to 8). Thus, sequence variation of 5′ ss could have an important effect on the function of a gene and is another important factor contributing to alternative 5′ ss.
The identification of alternative 5′ ss is essential for accurate gene annotation in plants. Since sequence variation of 5′ ss and alternative 5′ ss can complicate the expression and regulation of genes (De Conti et al. 2012; Singh and Singh 2019; Leader et al. 2021), they can also affect the accuracy of gene annotation and the study of gene function. For instance, they can affect the prediction of start codons, length of exons, and even the reading frames, which could alter or disrupt protein domains with important functions. FtsZ1 is an essential chloroplast division protein highly conserved in plants (Osteryoung and McAndrew 2001; Liu et al. 2022). However, due to the usage of atypical GC ss, it was computationally mispredicated and thus misannotated in many plant species (Fig. 1 and Supplemental Table S1). Our analyses and results provide insights into the 5′ ss selection mechanism, which are useful for improving the accuracy of gene annotation.
Materials and methods
Plant materials
Arabidopsis (A. thaliana) wild-type and the filamenting temperature-sensitive mutant z1 (ftsz1) mutant (Liu et al. 2022) are Col-0 ecotype. Tobacco (N. benthamiana) plants used for transient assay are 35 d old. Plants were grown in perlite/vermiculite/peat nutrient soil (1:1:1, v/v/v) in a constant temperature incubation chamber at 22 °C with 16 h light/8 h dark at a light intensity of 80 to 100 μmol m−2 s−1. Poplar (P. tomentosa) plants were grown in flasks with 1/2 MS solid medium and provided by Professor Xinmin An, Beijing Forestry University.
Bioinformatic analysis
The FtsZ1 protein sequences in various plant species were downloaded using the following sequence IDs: AT5G55280 (TAIR, available at https://www.arabidopsis.org/) for Arabidopsis, XP_021608385.1 for cassava (Manihot esculenta), XP_031491207.1 for lotus (Nymphaea colorata), KAG6793582.1 for poplar, KAF7145509.1 for azalea (Rhododendron simsii), and KAG5597923.1 for potato (Solanum commersonii). The corresponding genomic DNA sequences of the FtsZ1 genes were retrieved from TAIR for AtFtsZ1 and from NCBI (available at https://www.ncbi.nlm.nih.gov/) in the genomic DNA assemblies RSFT01003764.1, VYXN01000280.1, lcl|CM031968.1, lcl|CM024956.1, and lcl|CM031112.1, for cassava, lotus, poplar, azalea, and potato, respectively. Additional FtsZ1 protein sequences with misannotations can be found in Supplemental Table S1.
The intron sequences of rice (O. sativa), maize (Z. mays), cotton (G. hirsutum), and rose (R. chinensis) were extracted using custom Python scripts, which utilized species-specific GFF annotation files and corresponding genome sequence files. The sequence files of rice (accession: GCA_004007595), maize (accession: GCF_000005005), and cotton (accession: GCF_000005005) were downloaded from NCBI (https://www.ncbi.nlm.nih.gov/). The sequence files of rose were downloaded from the website of https://www.rosaceae.org/species/rosa/chinensis/genome_v1.0.
To validate the hypothesis that the frequency of the combination “AAG” is significantly greater in the GC group as compared with the GT group, we employed statistical evaluations using the Mann–Whitney U test and Student's t test, with the integration of a bootstrapping methodology encompassing 10,000 iterations. Each iteration had a sample size of 1,200, allowing for the random selection of combinations from both groups to calculate the frequency of the “AAG” sequence occurrence. The U-stat or T-stat and the P values were then calculated.
Protein–RNA structure preparation
To study the interaction of 5′ exon–intron junction of pre-mRNA with U1 snRNP, the amino acid sequences of different chains in human U1 snRNP complex were retrieved in FASTA format from RCSB (https://www.rcsb.org) and the homologous proteins in Arabidopsis were identified by BLASTP on TAIR website. The Arabidopsis homologous sequences used for the U1 snRNP modeling are as follows: AT1G76300.1, AT5G44500.2, AT3G62840.1, AT4G30330.1, AT4G30220.1, AT2G23930.1, AT3G50670.2, AT4G02840.1, and AT4G03120.1. Alphafold2 multimer was used to generate the predicted U1 snRNP complex based on the crystal structure of human (4PJO). After aligning the 2 complexes, the human snRNP complex was removed, leaving only the Arabidopsis snRNP complex and snRNA. The initial models were visualized by PyMOL software (www.pymol.org/), where align function was applied to combine the predicted Arabidopsis U1 snRNP with the 5′ ss of pre-mRNA. We carefully eliminated the clashes that may have been caused by the alignment tools in PyMOL and conducted a 10 ns “pre” MD simulation to mimic the crystallization process, which made the structure more stable and suitable for subsequent MD simulations. Our statistical analysis showed that “AGGUAAG” is present in the majority of introns in Arabidopsis. Therefore, we used the reverse complement of this sequence as the standard U1 snRNA for this study.
To stabilize the complex, a short-time molecular dynamics simulation was conducted. CHARMM36 (Lee et al. 2016) all-force field parameters were generated by CHARMM-GUI (https://charmm-gui.org/). The system was solvated in a TIP3P water model and neutralized with counterions and a salt concentration of 0.15 M KCl. System equilibration was performed at a constant number of particles, volume, and temperature (NVT), whereas production simulations were performed at constant particles, pressure, and temperature (NPT). Simulations were conducted at the temperature of 277.15 K and 1 atm pressure to mimic the crystallization of protein. System equilibration was performed for 1 ns, followed by production simulation for 10 ns, to get the standard U1 snRNP complex.
Mutations and molecular dynamics
The standard U1 snRNP complex from the previous step was loaded into PyMOL to simulate the mutation of bases near the ss of 5′ ss of pre-mRNA using the mutagenesis function. The system was solvated in a TIP3P water model and neutralized with counterions of KCl. Instead of setting the default ions concentration to 0.15 M, the concentration was decreased to 0.05 M to relieve the random disturbs between bases and ions. Simulations were conducted at the temperature of 295.15 K and 1 atm pressure. System equilibration was performed for 1 ns at NVT, whereas production simulations were performed for 200 ns at NPT.
Visual Molecular Dynamic (VMD) version 1.9.4a55 (Humphrey et al. 1996) was used to visualize the molecular dynamics results. The RMSD value was extracted from the MD simulation to evaluate the stability of bases binding between the 5′ ss of pre-mRNA and U1 snRNA.
Vector construction and Agrobacterium transformation
To generate various vectors for analyzing the effect of mutations at the −2 to +5 site of 5′ ss on intron splicing efficiency, we selected AtFtsZ1, LLG1, Tic20-I, and UMP1a in A. thaliana and PtFtsZ1 in P. tomentosa. These genes were used to make various wild-type and mutant constructs for the transient expression assay, with all the genes driven by the 35S promoter. For FtsZ1, we made 2 wild-type constructs, AtFtsZ1 and PtFtsZ1, and 2 mutant constructs, AtFtsZ1-3i2c and PtFtsZ1-3i2t. For LLG1, 4 mutant constructs were made, namely LLG1-2i2t, LLG1-1i2c, LLG1-1i2c4a5g, and LLG1-2i-1G2t. For Tic20-I, 11 mutant constructs were made, and for UMP1a, 9 mutant constructs were made. These constructs were transiently expressed in N. benthamiana leaves via Agrobacterium infiltration. Primers used are listed in Supplemental Table S5.
Gene expression analysis
Forty-eight h after infiltration, a small piece of leaf was excised, and total RNA was extracted for cDNA synthesis and RT-PCR. The PCR products were separated via 1.5% agarose gel electrophoresis, and the fully spliced and unspliced bands were quantified using the multiband analysis function of AlphaView software (Bio-Techne, Minneapolis, MN, USA). The DNA bands in gels were cut out for sequencing analysis.
The expression levels of the AtFtsZ1 gene in Arabidopsis plants were analyzed semiquantitatively with a gradient dilution of 1, 1/4, and 1/16. AtPP2AA3 was used as the internal reference gene. Primers used are listed in Supplemental Table S5.
Chloroplast phenotype analysis
Rosette leaves of 4-wk-old Arabidopsis plants were detached for fixation as before (Liu et al. 2022). Chloroplasts were observed with an Olympus CX21 microscope (Olympus, Tokyo, Japan) equipped with a USB 2.0 digital camera (Changheng, Beijing, China).
Immunoblot analysis
Total proteins were extracted from Arabidopsis leaves concurrently with RNA extraction. The protein samples were then separated on a 12% SDS–PAGE gel and transferred onto a PVDF membrane. The membrane was blocked and then probed with AtFtsZ1 primary antibodies (Liu et al. 2022) and followed by incubation with secondary antibodies. Finally, an ECL Western Blot kit (Beijing ComWin Biotech Company, China) was used for the film development. Coomassie Brilliant Blue (CBB) staining was used as a loading control.
Accession numbers
Sequence data of FtsZ1 proteins in various plants from this article can be found in the GenBank/EMBL data libraries under accession numbers: A. thaliana, NP_200339.1; M. esculenta, XP_021608385.1; N. colorata, XP_031491207.1; P. tomentosa, KAG6793582.1; R. simsii, KAF7145509.1; and S. commersonii, KAG5597923.1. More sequence data of FtsZ1 proteins in various plants can be found in Supplemental Table S1. Sequence data of genes for vector construction in this article can be found in the TAIR (www.arabidopsis.org) database under accession numbers: AT5G55280 (AtFtsZ1), AT5G56170 (AtLLG1), AT1G04940 (AtTic20-I), and AT1G67250 (AtUMP1a).
Acknowledgments
We would like to thank Jinjie An for the technical support.
Author contributions
H.G. conceived the project; W.C., C.H., and H.G. designed the experiments and prepared the manuscript; W.C., C.H., F.Z., and N.L. performed the experiments. All authors read and approved the final manuscript.
Supplemental data
The following materials are available in the online version of this article.
Supplemental Figure S1. Schematic diagram of the GU-AG and GC-AG splicing modes.
Supplemental Figure S2. Diagrams of the base frequency from +1 to +20 and −20 to −1 in GT-AG and GC-AG introns, respectively.
Supplemental Figure S3. Base combination frequency of sequences at +3, + 4, and +5 splice sites in GT-AG and GC-AG introns.
Supplemental Figure S4. Schematic diagram of the gene structures of genes AtFtsZ1, PtFtsZ1, LLG1, Tic20-I, and UMP1a.
Supplemental Figure S5. Structural diagram of the U1 snRNP protein complex and a set of mutants at +2 position of the splice site as shown in Fig. 4D.
Supplemental Table S1. FtsZ1 was misannotated in various plants with INDEL.
Supplemental Table S2 The number of introns of different splicing types in the genome of various plants.
Supplemental Table S3. Frequencies of the base combinations from +3 to +5 in GT-AG introns.
Supplemental Table S4. Frequencies of the base combinations from +3 to +5 in GC-AG introns.
Supplemental Table S5. Primers used in this study.
Funding
This work was supported by the Fundamental Research Funds for the Central Universities (2022BLRD14) and the National Natural Science Foundation of China (32070696).
Data availability
All experimental data are available and accessible via the main text and/or the supplemental data.
References
Author notes
The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (https://academic.oup.com/plphys/pages/General-Instructions) is Hongbo Gao.
Conflict of interest statement. None declared.