Detection of internal N7-methylguanosine (m7G) RNA modifications by mutational profiling sequencing

Abstract Methylation of guanosine on position N7 (m7G) on internal RNA positions has been found in all domains of life and have been implicated in human disease. Here, we present m7G Mutational Profiling sequencing (m7G-MaP-seq), which allows high throughput detection of m7G modifications at nucleotide resolution. In our method, m7G modified positions are converted to abasic sites by reduction with sodium borohydride, directly recorded as cDNA mutations through reverse transcription and sequenced. We detect positions with increased mutation rates in the reduced and control samples taking the possibility of sequencing/alignment error into account and use replicates to calculate statistical significance based on log likelihood ratio tests. We show that m7G-MaP-seq efficiently detects known m7G modifications in rRNA with mutational rates up to 25% and we map a previously uncharacterised evolutionarily conserved rRNA modification at position 1581 in Arabidopsis thaliana SSU rRNA. Furthermore, we identify m7G modifications in budding yeast, human and arabidopsis tRNAs and demonstrate that m7G modification occurs before tRNA splicing. We do not find any evidence for internal m7G modifications being present in other small RNA, such as miRNA, snoRNA and sRNA, including human Let-7e. Likewise, high sequencing depth m7G-MaP-seq analysis of mRNA from E. coli or yeast cells did not identify any internal m7G modifications.


Supplementary methods: Calculation of mutation frequencies and statistical analysis
We use the mpileup function from the samtools package to compile sequencing results for each base position present in the sequence file that was used for mapping. For each position, this function summarises the sequencing data into the observed bases with their corresponding quality score for each sample (1,2). We developed the getFreq tool to estimate the mutation frequencies for each position by taking the observed sequenced bases and possibility of sequencing/alignment error into account.

Formal description of the estimation of mutation frequencies
For each position, we first removed bases/indels that are observed at very low frequency (default=0.1%) or bases/indels that are only observed a low number of times (default=1). For the remaining bases/indels, we estimated the allele frequency based on a likelihood function that includes the possibility of sequencing error modelled based on the base quality score.
where X is the sequencing data for the position and f in the frequency of the possible alleles including indels e.g. f=(fA,fG,fAGG) if both A, G and indel AGG is observed at this loci. The likelihood assumes independence per read such that for N reads The probability of observed the sequencing data for a single read depends on the true allele that was sequenced and we assumed that the true allele, A, is one of the T observed ones such that ( 2 | ) = 7 8 2 9 = ; < ( = ; | ) > ;45 Where we sum over all of the possible true alleles aj. The probability of ( = ; | ) is simply the frequency of the aj allele. To obtain the probability of the sequencing data of the i th read we convert the quality score in to a probability of error, e, based on the Phred scaling. We assumed that if we observe allele of type b that 8 2 = 9 = ; < = @ 1 − , = ; /3, ℎ Where the division of 3 is motivated by the fact that sequencing error of a base can results in three other bases with equal probability. Since we also allow for indels this should be seen as an approximation.

Formal description of the calculation of p-values for positions having mutations:
To test if a position is polymorphic (has a significant amount of mutations), we calculated the likelihood based on the estimates allele frequency " and under the null where all alleles are the same, L , using a likelihood ratio statistics = R 8 9 " < ( | L ) S.
If there are T types of possible alleles then ~> W5 X , where T-1 is the number of degrees of freedom.
When multiple samples are analysed, we can test for differences between samples or groups of samples by estimating the frequencies jointly and separately in the groups. The test statistics then become = R ∏ 8 Z 9 Z [ < Z4\ Z45 8 9 " < S Where X k is the sequencing data in group k of K groups, f k is the allele frequency in group k. Here we will have (K-1)*T degrees of freedom.

Outputs from the getFreq function:
For each position, the getFreq function reports the estimated mutation frequencies for the control (AltFreqCC0) and treated (AltFreqCC1) samples as well as the mutation rate difference (relFreq). Plot of the raw reveres transcription termination counts (collapse on barcodes to remove potential PCR duplicates) for NaBH4 treated sample across A) E. coli SSU rRNA and B) LSU rRNA as determined by the RNAprobR R package. C) Log2 ratio of counts from NaBH4 treated samples divided by counts from control samples for the same three replicates shown in Figure 1D). The analysis in D was performed as previously described using barcode counts (3).    * Only alleles that are observed with more than "minCount" in at least one sample are counted. The getFreq function takes the "minCount" as an input and will only perform analysis of the position if the total number of observed mutations at the position (totalCounts) > minCounts. Yeast_tRNA-Val-AAC-1 + + + *The three tRNAs, met-CAU, Lys-UUU and Cys-GCA all have increased relative frequency and -10*log transformed p-value, but does not reach our cut-off, most likely due to low coverage. Conversely, these tRNA obtain the highest normalized cleavage values in Marchand et. al, which may indicate that abasic sites created in these tRNA are more prone to strand breakage. ** MODOMICS: a database of RNA modification pathways (5).