Non-coding RNA may be associated with cytoplasmic male sterility in Silene vulgaris

Highlight A comparison of mitochondrial transcript levels and RNA editing between females and hermaphrodites of Silene vulgaris suggests that mitochondrial non-coding RNA is putatively associated with cytoplasmic male sterility.


Read mapping
Initial alignment was performed using GSNAP v. 2014-12-23 (Wu andNacu, 2010) in paired-end mode using known splice junctions (Sloan et al., 2012) and default settings otherwise. Alignments were strand-separated using the view function in SAMtools v. 1.1 (Li et al., 2009), filtering by SAM flags according to read-pair orientation. First reads in forward orientation corresponded to positive-strand transcription, while those in reverse orientation corresponded to the negative strand. The opposite rules applied for second reads. An initial round of variant discovery using the HaplotypeCaller module in GATK 3.4 (McKenna et al., 2010) was performed separately on the plus-and minus-strand alignments, with default settings except that the minimum call and emit thresholds were set to 20. Only variant sites that could be attributed to canonical mitochondrial RNA editing were retained (C-to-T variants from the plus-strand alignment, G-to-A from the minus strand).
High-identity (>99%) repeat regions longer than 300 bp and regions with high plastid similarity, identified through mitochondrion-by-mitochondrion and mitochondrion-by-plastid genome blast searches, were ignored during variant calling. Variant sites with a GATK quality score of 1000 or lower were discarded, reducing false positive rates at the expense of excluding some legitimate editing sites, particularly partial and intergenic sites.
This threshold was chosen based on manual inspection of low quality variant sites not fitting the canonical editing pattern. Filtered variant sites were compared with raw variant sites to estimate false positive rates. The set of filtered, high quality variant sites and known splice junctions were then used to generate splice-and editing-tolerant alignments with GSNAP using the same settings as before, but including the set of high quality variant sites in "SNP-tolerant" mode. The GATK base recalibrator module was used with the set of high quality variant sites and the set of splice-and editing-tolerant alignments to recalibrate base quality scores. These recalibrated alignments were used for a final round of variant discovery. High quality (GATK haplotype caller score of >1000) variant sites were retained for final analysis. The complete set of variant sites from the final variant discovery, regardless of quality score, was retained for select analyses, particularly comparison with edit sites reported in other studies (Sloan et al., 2010, Bentolila et al., 2013, Wu et al., 2015a.
Lastly, a set of final splice-and editing-tolerant alignments using the final, filtered set of editing sites and known splice junctions was generated as before and used for unbiased editing rate estimation as well as transcript abundance estimation.

Defining islands of transcription
In order to objectively identify high coverage intergenic features, we first gathered mt-genome-wide information about distributions of alignment depth of coverage. We used the makewindows function in bedtools v 2.24.0 (Quinlan and Hall, 2010) to generate 200-bp sliding window "tiles" across the mt genome at 100-bp intervals. We then used the bedtools intersect function to classify these windows as genic, intronic, ORF, or intergenic. A window with any overlap with an exon was classified as genic. Any remaining windows that overlapped introns were classified as intronic. In the case of trans-spliced introns, the boundaries were estimated, so it is possible that additional windows adjacent to our conservative trans-spliced intron boundaries actually contained intronic sequence. Any remaining windows that overlapped with ORFs with a minimum length of 300 bp were classified as such. All remaining windows were classified as intergenic. As the true transspliced intron boundaries are unknown and transcription from all transcribed features may continue beyond their defined boundaries, we calculated the distance between intergenic features and the nearest annotation.
Intergenic tiles were further subdivided into "isolated" or "adjacent to another feature" categories, depending on whether or not they were within 500 bp of an annotated feature. Normalized depth of coverage, based on the total number of mt-mapped reads for each sample, was calculated for all windows using the bedtools coverage function and averaged across samples. Intergenic features with average depth of coverage exceeding 1020× were classified as "islands of transcription." This threshold was chosen as the depth of coverage at which genic features had a higher probability density than intergenic features. All windows overlapping, adjoining, or within 100 bp of each other were merged into a single "island" using the merge function in bedtools.

Transcript abundance estimation
Transcript abundance estimation was performed using the bedtools coverage function and a custom AWK script. Per-base coverage was calculated and averaged over the length of annotated features for each sample. To calculate rates of RNA editing, the SAMtools mpileup function was supplied with the list of identified editing sites. Editing extent was calculated as the count of Ts divided by the sum of Cs and Ts. These calculations were performed to determine the overall editing rate and to distinguish between the extent of editing in transcripts before and after intron splicing for all sites where pre-and post-splice reads could be identified in each sample.