FilTar: using RNA-Seq data to improve microRNA target prediction accuracy in animals

Abstract
Motivation: MicroRNA (miRNA) target prediction algorithms do not generally consider biological context, and generic target prediction based on seed binding can therefore lead to a high level of false-positive predictions. Here, we present FilTar, a method that incorporates RNA-Seq data to make miRNA target prediction specific to a given cell type or tissue of interest.
Results: We demonstrate that FilTar can be used to (i) provide sample-specific 3′-UTR reannotation, extending or truncating default annotations based on RNA-Seq read evidence, and (ii) filter putative miRNA target predictions by transcript expression level, thus removing putative interactions where the target transcript is not expressed in the tissue or cell line of interest. We test the method on a variety of miRNA transfection datasets and demonstrate increased accuracy versus generic miRNA target prediction methods.
Availability and implementation: FilTar is freely available and can be downloaded from https://github.com/TBradley27/FilTar. The tool is implemented using the Python and R programming languages, and is supported on GNU/Linux operating systems.
Supplementary information: Supplementary data are available at Bioinformatics online.


Data selection
For analysis of miRNA transfection experiments, FASTQ sequencing data generated from RNA-Seq experiments in human or mouse cell lines with at least two biological replicates were selected for further processing. Transfection with a specific miRNA is expected to reduce the expression of its targets relative to the control sample.
After differential expression analysis, if inspection of cumulative plots did not show the predicted miRNA targets to be downregulated relative to non-target transcripts, the transfection experiment was considered to have failed, and the relevant datasets were not used for downstream analysis.
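The cumulative plots referred to above compare the empirical cumulative distribution of fold changes for predicted targets against that for non-targets. The following Python sketch shows how such curves can be computed; it is an illustration only, the fold change values are hypothetical, and the assessment described above was made by visual inspection rather than by any rule encoded here.

```python
from bisect import bisect_right


def ecdf(values):
    """Return a function giving the fraction of values <= x."""
    ordered = sorted(values)
    n = len(ordered)

    def cumulative_fraction(x):
        return bisect_right(ordered, x) / n

    return cumulative_fraction


# Hypothetical log2 fold changes (miRNA mimic vs. negative control).
target_lfc = [-0.9, -0.6, -0.4, -0.3, 0.1]
non_target_lfc = [-0.3, -0.1, 0.0, 0.1, 0.2, 0.4]

target_cdf = ecdf(target_lfc)
non_target_cdf = ecdf(non_target_lfc)

# When targets are downregulated, the target curve is shifted to the left
# and therefore lies above the non-target curve at a given fold change.
for x in (-0.5, 0.0):
    print(x, target_cdf(x), non_target_cdf(x))
```

In a successful transfection, the target curve would be expected to lie above the non-target curve across most of the negative fold change range.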
A summary of datasets used with relevant database accessions is provided (Supplementary

Quality control and statistics
FASTQ data quality scores, GC-content, read lengths and similar statistics were generated using FASTQC (v0.11.5) (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Output from FASTQC was collated with data from the log files of other processes to produce a summary statistics report for each BioProject used, using MultiQC (v1.6) (9) (output summarised in Supplementary Table S2).

Differential expression analysis
Differential expression analysis for miRNA transfection experiments was completed within the R (v3.5.0) statistical computing environment (10). Transcript-level read count data derived from RNA sequencing of miRNA mimic or negative control transfected cell lines were imported using the tximport package (v1.10.1) (11). Differential expression analysis on length- and library-size-normalised read counts was performed using DESeq2 (v1.22.2) (12), comparing expression between negative control and miRNA mimic transfection conditions. Log2 fold change values were subsequently shrunken using the default DESeq2 'normal' shrinkage estimator (12) to account for the large uncertainty in predicted fold change values at low transcript expression values. For plotting, records corresponding to non-coding RNA transcripts were discarded. Transcript records were also discarded when there was zero expression for all control and transfection replicates, because fold change values could not be calculated in these cases. Target prediction data were used to label the remaining records as either predicted targets or non-targets of the transfected miRNA.
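The record filtering and labelling steps described above can be sketched in Python as follows. This is a minimal illustration rather than the FilTar implementation: the record keys (`transcript_id`, `biotype`, `control_counts`, `mimic_counts`), the function name and the example values are hypothetical, and treating every biotype other than `protein_coding` as non-coding is a simplifying assumption.

```python
def label_records(records, predicted_targets):
    """Filter transcript records and label them as target or non-target.

    records: list of dicts with the hypothetical keys 'transcript_id',
             'biotype', 'control_counts' and 'mimic_counts'.
    predicted_targets: set of transcript IDs predicted to be targets
                       of the transfected miRNA.
    """
    labelled = []
    for rec in records:
        # Discard non-coding RNA transcripts (assumption: anything that
        # is not annotated as protein_coding counts as non-coding).
        if rec["biotype"] != "protein_coding":
            continue
        # Discard records with zero expression in every control and
        # transfection replicate, where no fold change can be computed.
        if not any(rec["control_counts"]) and not any(rec["mimic_counts"]):
            continue
        is_target = rec["transcript_id"] in predicted_targets
        labelled.append({**rec, "label": "target" if is_target else "non-target"})
    return labelled


# Hypothetical example records.
records = [
    {"transcript_id": "T1", "biotype": "protein_coding",
     "control_counts": [120, 98], "mimic_counts": [60, 55]},
    {"transcript_id": "T2", "biotype": "lncRNA",
     "control_counts": [15, 20], "mimic_counts": [14, 22]},
    {"transcript_id": "T3", "biotype": "protein_coding",
     "control_counts": [0, 0], "mimic_counts": [0, 0]},
    {"transcript_id": "T4", "biotype": "protein_coding",
     "control_counts": [40, 35], "mimic_counts": [42, 38]},
]

labelled = label_records(records, predicted_targets={"T1", "T3"})
print([(rec["transcript_id"], rec["label"]) for rec in labelled])
```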
For some differential expression analyses, null hypothesis significance testing was performed using two-sample, one-sided Kolmogorov-Smirnov tests to test whether different fold change distributions were sampled from the same underlying distribution.
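As an illustration of the test described above, the following Python sketch computes a one-sided two-sample Kolmogorov-Smirnov statistic for two sets of fold changes. Several points are assumptions made for the example: the function name and input values are hypothetical, the direction of the alternative (targets shifted towards lower fold changes) is assumed, and the p-value uses a first-order large-sample approximation that may differ from the values returned by a statistical package, particularly for small samples.

```python
import math


def ks_one_sided(targets, non_targets):
    """One-sided two-sample KS statistic D+ and an approximate p-value.

    D+ is the largest amount by which the empirical CDF of `targets`
    exceeds that of `non_targets`, i.e. evidence that the target fold
    changes are shifted towards lower values.
    """
    n, m = len(targets), len(non_targets)
    a, b = sorted(targets), sorted(non_targets)
    d_plus = 0.0
    i = j = 0
    for x in sorted(set(a + b)):
        # Advance each pointer past all observations <= x.
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d_plus = max(d_plus, i / n - j / m)
    # First-order asymptotic approximation (an assumption of this sketch).
    p_value = math.exp(-2.0 * d_plus ** 2 * n * m / (n + m))
    return d_plus, p_value


# Hypothetical log2 fold changes.
target_lfc = [-1.0, -0.8, -0.5, -0.2]
non_target_lfc = [-0.1, 0.0, 0.1, 0.3]

d_plus, p_value = ks_one_sided(target_lfc, non_target_lfc)
print(d_plus, p_value)
```

A small p-value here would suggest that the two fold change distributions were not sampled from the same underlying distribution, with the targets shifted towards lower values.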
For Figure 1, the filtered miRNA predicted targets curve represents protein-coding transcripts with a miRNA seed target site for the transfected miRNA mimic, which have been filtered at an expression threshold of 0.1 transcripts per million (TPM) (14).
For Figure 2, the 'added seed sites' are identified as those transcripts which had not previously been labelled as predicted miRNA targets using target prediction results derived from existing Ensembl 3'UTR annotations, but had been identified as predicted miRNA targets using target prediction results derived from 3'UTR sequences reannotated using the FilTar workflow due to 3'UTR extension.
For Figure 3, the 'removed seed sites' are identified as those transcripts which had previously been labelled as predicted miRNA targets using target prediction results derived from existing Ensembl 3'UTR annotations, but had not been identified as predicted miRNA targets using target prediction results derived from 3'UTR sequences reannotated using the FilTar workflow due to 3'UTR truncation. Filtering for all groups occurred at an expression threshold of greater than or equal to 5 TPM. This threshold was chosen to reduce the number of false positive 3'UTR truncations (see Discussion).
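The 'added' and 'removed' groups defined for Figures 2 and 3 correspond to set differences between the two collections of predicted targets. The Python sketch below illustrates this under stated assumptions: the function name, transcript identifiers and TPM values are hypothetical, the 5 TPM threshold is applied to both groups for simplicity, and the set difference alone does not check that a change arose from 3'UTR extension or truncation.

```python
def classify_seed_site_changes(ensembl_targets, filtar_targets, tpm, min_tpm=5.0):
    """Return (added, removed) transcript ID sets after expression filtering.

    ensembl_targets: IDs predicted as targets using existing annotations.
    filtar_targets:  IDs predicted as targets using reannotated 3'UTRs.
    tpm:             dict mapping transcript ID to expression in TPM.
    """
    # Keep only transcripts expressed at or above the threshold.
    expressed = {tid for tid, value in tpm.items() if value >= min_tpm}
    # Targets only after reannotation.
    added = (filtar_targets - ensembl_targets) & expressed
    # Targets only before reannotation.
    removed = (ensembl_targets - filtar_targets) & expressed
    return added, removed


# Hypothetical example data.
ensembl_targets = {"T1", "T2", "T3"}
filtar_targets = {"T2", "T3", "T4", "T5"}
tpm = {"T1": 12.0, "T2": 30.5, "T3": 0.4, "T4": 5.0, "T5": 1.2}

added, removed = classify_seed_site_changes(ensembl_targets, filtar_targets, tpm)
print(sorted(added), sorted(removed))
```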
Additional plots for the remaining datasets analysed are contained within Supplementary Figures S1, S4 and S5, with the exception of cases where there was an insufficient number of added or removed target transcripts predicted (n < 15).