MIO: microRNA target analysis system for immuno-oncology

Abstract Summary MicroRNAs have been shown to be able to modulate the tumor microenvironment and the immune response and hence could be interesting biomarkers and therapeutic targets in immuno-oncology; however, dedicated analysis tools are missing. Here, we present a user-friendly web platform MIO and a Python toolkit miopy integrating various methods for visualization and analysis of provided or custom bulk microRNA and gene expression data. We include regularized regression and survival analysis and provide information of 40 microRNA target prediction tools as well as a collection of curated immune related gene and microRNA signatures and processed TCGA data including estimations of infiltrated immune cells and the immunophenoscore. The integration of several machine learning methods enables the selection of prognostic and predictive microRNAs and gene interaction network biomarkers. Availability and implementation https://mio.icbi.at, https://github.com/icbi-lab/mio and https://github.com/icbi-lab/miopy. Supplementary information Supplementary data are available at Bioinformatics online.

The python library miopy includes the same functionality and data sets as the web-based system MIO. The library miopy together with its documentation/Jupyter notebook and source code is available at https://github.com/icbi-lab/miopy.

Datasets
In MIO and miopy it is possible to analyze custom data consisting of one file with gene expression data, one file with microRNA expression data. An optional metafile with clinical data, including for instance survival data in months and the corresponding event data, may also be provided. Patient sample identifiers need to be identical. The analyses are not limited to specific applications, however, only human data is supported. In order to get the most benefit for immuno-oncology we have integrated a number of publicly available datasets, including immune related gene and microRNA sets as well as clinical meta data, gene and microRNA expression from large patient cohorts of various cancer types e.g. from The Cancer Genome Atlas (TCGA). This data collection is available for all users and will be expanded regularly. In order to get fast access to the results different analyses from these data are pre-calculated.

Expression datasets with meta information
We have downloaded pre-processed gene expression and microRNA expression data (RNAseq raw counts) as well as clinical metadata from 33 cancer types from The Cancer Genome Atlas (TCGA) datasets (Table S3) via the Genomic Data Commons Data Portal (https://portal.gdc.cancer.gov) using the GDCRNATools R package (Li et al., 2018). Gene IDs were updated to official gene symbols (HGNC) and data were normalized by voom transformation (Law et al., 2014). Expression data from the Gene Expression Omnibus (GEO) were downloaded using the R package GEOquery. Microarray data were normalized using the rma from the R package affy and probe sets with highest mean expression levels were kept and annotated according official gene symbols (HGNC) and mature miRBase IDs. For ovarian cancer, another data set from an Australian patient cohort were downloaded from the International Cancer Genome Consortium (ICGC) (https://dcc.icgc.org) and processed accordingly. These public datasets (available for all users) can be used for all type of analyses and filtered based on specific levels of a selected variable, which is provided as a column in the clinical meta data file. For instance, cancer datasets can be filtered for primary tumor samples or for specific tumor stages. There is also the possibility to perform differential expression analyses between two specified levels using limma (Ritchie et al., 2015). For example, only significantly differential expressed genes and microRNAs which are more than two-fold changed between paired samples from normal tissue and primary tumor tissue -at a false discovery rate (Benjamini-Hochberg adjusted p-values) <0.1 -can be selected for further analyses.

Integration of microRNA target prediction tools
A variety of target prediction tools with different advantages using sequenced based computational methods and experimental verified targets have been developed over the years and we integrated established tools in order to enable the identification of direct microRNA target genes (Table S4). We downloaded and preprocessed the data according to the information summarized in Table S5. We were taking advantage of a recent approach, MirDIP, which includes a collection of 30 resources (Tokar et al., 2018) and the R/Bioconductor package multiMIR (Ru et al., 2014). All gene and transcript information were mapped to official gene symbols (HGNC) and all microRNA names/identifiers were mapped to mature microRNA ID from miRBase Release 22.1 (Kozomara et al., 2019). Data were transformed into a matrix with binary encoding the interaction between genes (rows) and microRNAs according to the respective prediction tool (columns). We also analyzed individual microRNA targets with the experimentally validated microRNA target database DIANA Tarbase v8 (Karagkouni et al., 2018) via the online service (http://carolina.imis.athena-innovation.gr). A logic combination (and/or) of these 40 microRNA target prediction databases or the number of supporting databases can be applied to specified candidates or to filtered candidates from previously performed correlation analyses.

Detailed methods
All computational methods in MIO are implemented in Python version 3.8.8 with respective modules such as scipy and sklearn (https://www.python.org), or the statistical software environment R version 4.1.0 (https://www.R-project.org, R Foundation for Statistical Computing, Vienna, Austria).

Correlation and regularized regression analyses
We use correlation analysis methods as well as regularized regression models to compute the relationship between microRNAs and genes, based on their normalized expression across all selected samples. Correlation is based on Pearson's correlation, Spearman's rank based correlation ρ, and Kendall's τ using the scipy module. Correlation analyses based on the nonparametric Hoeffding's independence test (Hoeffding, 1948) and the randomized dependence coefficient (RDC) were directly implemented. In regularized regression models each gene is presented as a linear combination of microRNAs and the coefficients of individual microRNAs in that models can be used as another measure for the association strength (Muniategui et al., 2012). The least absolute shrinkage and selection operator (Lasso) uses L1 regularization (Tibshirani, 1996), the Ridge regression uses L2 regularization (squared coefficient as penalty term to the loss function), whereas in Elastic Net regression a combination of both is used (Zou and Hastie, 2005). This regression models were implemented using the Python sklearn module. Each method has different assumptions and advantages and it has been shown that ensemble methods, which integrate different methods, perform better than the individual component methods. Therefore, we used a rank scoring system after Jean-Charles de Borda based on individual rankings for each method (Le et al., 2015;Marbach et al., 2012). In order to rule out random correlations, a background (null) model distribution of the correlation of a microRNA with 1000 randomly selected genes without evidence of any of the 40 target prediction tools is tested (List et al., 2019).
If the survival meta-data are included, log2 hazard ratios (log2 HR) can be computed for each expression ratio of a miRNA and its potential target gene using a Cox proportional hazard model implemented in the lifeness Python module. Absolute values of log2 HR are then also included in the ranking system. The results from correlation analyses can be filtered based on the correlation and logical combination of integrated prediction tools and visualized as interactive microRNA-gene network using Cytoscape with several network layout options. MicroRNAs and genes are represented as nodes that can be colored according to log2 hazard ratio, and edges according to correlation.

MicroRNA targeting a complete pathway or gene set in total (module score)
Of interest is not only which individual genes a microRNA is targeting, but if a pathway or process can be modulated/regulated in total by this microRNA. For this purpose we have adapted a previously suggested score (Buffa et al., 2011), whereby the idea is that expression of genes are weighted based on supporting evidence to be a direct target of the microRNA i.e. number of target prediction tools (ns) for the respective gene j and the weight pj=ns/40. The weighted expression score S for a specific microRNA and a sample (patient) is calculated across all genes in the pathways (gene set) as S = ∑pj x ej / ∑pj with ej indicating the expression of the respective gene j. This score is calculated for each patient sample and the association between the expression score and the microRNA expression over all patients is calculated using the Spearman's rank correlation and visualized as scatter plot.

MicroRNAs association with the immunophenoscore or estimated tumor infiltrated immune cells
In order to analyze whether microRNAs are associated with immune scores such as immunophenoscore (IPS) or estimated tumor infiltrated immune cells the same correlation and regularized regression methods as for the calculation of the microRNA-gene correlations are available. For pre-calculation, we have used the R package immunedeconv (Sturm et al., 2019) including four tools EPIC (Racle et al., 2017), MCP counter (Becht et al., 2016), quanTIseq (Finotello et al., 2019), xCell (Aran et al., 2017) to estimate infiltrated immune cells in different cancer types from TCGA based on normalized RNA sequencing data (TPM) and implemented in Python the calculation of the immunophenoscore (IPS) using the continuous raw score (averaged z-score) as described in (Charoentong et al., 2017). Results are presented in tabular format and as a heatmap. Individual associations can be displayed as scatter plot.

Identification of prognostic biomarkers and survival analyses
In order to identify prognostic biomarkers MIO implements a feature selection analysis for microRNAs, genes, or ratios of microRNAs and genes using three survival analysis methods (Poelsterl, 2020). This include gradient boosting (using 1000 trees and a learning rate of 0.5), fast survival support vector machines, and a penalized Cox model. In order to obtain the most robust estimator, each model is run several times with different parts of the dataset to avoid over-fitting. For each method the features are ranked according to the average importance of the feature in all partitions and the overall percentage is given, indicating the number of times this feature has appeared in all models with all partitions. For overall ranking of features a scoring system after Jean-Charles de Borda based on individual rankings of each method is provided (Le et al., 2015;Marbach et al., 2012).
Patients can be divided into two groups based on microRNA or gene expression levels using quantile (e.g. median) or optimized dichotomization (with maximal log-rank statistics using R package survminer) and Kaplan Meier curves for provided survival data are visualized and the difference is tested with log-rank test and hazard ratio is calculated using the lifeness Python module.

Identification of predictive biomarkers and classification analyses
Similar to survival analyses, a feature selection analysis for classification of two groups of patients (e.g. responder versus non-responder of a therapy) including microRNAs, genes, or ratios of microRNAs and genes (representing microRNA-gene network biomarkers) is possible. For this purpose, six different machine learning methods (Hastie et al., 2011) are applied in a cross validation procedure to avoid overfitting. These different methods are implemented in the sklearn Python module and includes a random forest classifier (Breiman, 2001) with 300 trees, logistic regression with L2 penalty function and a maximum of 10000 iterations, a Ridge classifier with maximum of 10000 iterations, support vector machines (Vapnik and Chervonenkis, 1974), an Ada boosting classifier (Yoav and Schapire, 1997) with 300 trees, and a bagging classifier (Breiman, 1996) with 300 trees. Some of these tools take advantage of ensemble methods by combining the prediction of several base estimators to improve generalizability or robustness over a single estimator. For each method, the features can be ranked according to the average importance of the feature in all partitions and the overall percentage is given, indicating the number of times this feature has appeared in all models with all partitions.
Based on selected features a classification analysis can be performed by applying the widely used methods random forest (Breiman, 2001), logistic regression, or support vector machines (SVM) (Vapnik and Chervonenkis, 1974) including a k-fold cross validation procedure. The performance of the classifier can be assessed using a receiver operating characteristics curve (ROC) and the respective area under curve (AUC). Ranked feature importance (weights) can be visualized as barplots and separation of samples by the expression of selected features are accessible by principal component analyses. As it is highly recommended to test the classification model also on an independent validation data set, the model and its parameters can be saved and applied to other datasets.

Prediction of microRNAs targeting genes which are synthetic lethal to other candidates
For this type of analysis, we have taken advantage of previous efforts in identifying synthetic lethal gene pairs. The ISLE (identification of clinically relevant synthetic lethality) algorithm (Lee et al., 2018), which is based on data on CRISPR/RNAi in vitro screens of cancer cell lines with impact on survival of patients from different TCGA cohorts as well as with phylogenetic evidence was applied and synthetic lethal partner genes were identified (using R code from https://github.com/jooslee/ISLE). Individual candidate genes or genes from a specified gene set (for instance target or essential genes for immunotherapy or targeted therapy) can be selected and by using information of combined microRNA target prediction tools, microRNAs can be identified potentially targeting genes which are synthetic lethal to these candidates. Analysis of potential (off) targets of microRNAs can be performed using the respective feature in MIO for microRNA target prediction.

Use Case S1: MicroRNAs targeting immune modulators including PD-L1
We were interested in finding out which are the most important microRNAs regulating immunecheckpoints in tumor cells. For this purpose, within MIO we employed a correlation analysis for the available lung adenocarcinoma dataset (TCGA-LUAD) filtered for primary tumor (Filter samples to get correlation: Group name in metadata for filter=sample_type, Group Name in metadata for correlation=PrimaryTumor) and for the immune checkpoints genes (Select geneset to analyze=Immune Checkpoints [ICBI]). MicroRNA target predictions were then based on this correlation analysis (Use miRNA/gene correlation result=LUAD-COR), FDR<0.1 (Maximum Adjust P. Value*=0.1), with a correlation of microRNA and target gene smaller than -0.3 (Coefficient equal or lower than* =-0.3, Coefficient equal or higher than* =1.0) and at least 10 supporting prediction tools (Filter method to target Databases=OR, selection of all tools, Min db*=10) for a list of immune checkpoint genes (Select Genset to Analysis=Immune Checkpoints [ICBI]).

Use Case S2: Prognostic relevant microRNA target gene pairs
There are different types of immune modulators known, which could be stimulatory or inhibitory to the immune response thereby affecting cancer patients overall survival. We are testing which microRNAs are targeting the immune checkpoints and are prognostic in lung adenocarcinoma. One particular analysis possible with MIO is to identify which microRNA/gene ratios are prognostic. We used MicroRNA/Gene Ratio Feature Selection, Select geneset to analyze=Immune Checkpoints [ICBI], Select dataset to analyze=TCGA-LUAD, Group name in metadata=event, Obtain top predictors=50, Number of cross-validation=3. One of the top candidates with highest feature importance (Cox PH model) was the hsa-miR-150-5p/CD276 ratio (Figure S3 B). Interestingly, we could identify that hsa-miR-150-5p is targeting CD276 (also known as B7-H3) with opposite effect on patient survival outcome as indicated by the network visualization with log2 hazard ratio and in the Kaplan-Meier survival curves with optimized dichotomization ( Figure S3A). Figure S3: Network visualization of microRNA targeting the immune modulator CD276 and Kaplan-Meier survival curves for hsa-miR-150-5p and CD276 with optimized dichotomization in the lung adenocarcinoma TCGA data set (A). Feature importance for identifying prognostic microRNA/gene ratios (B). Node color represents log2 hazard ratio according to legend to the right calculated between patient groups with high gene or microRNA expression versus low gene or microRNA expression based on median dichotomization.

Use Case S3: Genes involved in antigen processing and presentation targeted by microRNAs
Deficient or down-regulated genes of the antigen processing and presentation machinery (APP) have been associated with response prediction to cancer immunotherapy. In order to study which microRNAs are potentially able to down-regulate not only individual genes but also the complete geneset/pathway we performed a correlation analysis using for each microRNA a weighted expression score calculated based on supporting microRNA target prediction tools for the respective genes. As an example, we used two cohorts from different cancer types from TCGA the bladder cancer (BLCA-TCGA) and the acute myeloid leukemia dataset (LAML-TCGA) to test against the antigen processing and presentation geneset (Select geneset to analyze=Antigen Processing and Presentation [ImmPort]). The top microRNAs with most negative correlation with the antigen processing and presentation pathway were filtered for Pearson's correlation coefficient R< -0.4 at an FDR<0.1. Using this settings there are 22 microRNAs in BLCA potentially targeting the APP (Table S6) and 3 microRNAs in AML (Table  S7). In MIO, this association can be also visualized as scatter plot and an respective example is shown in Figure S3.  hsa-miR-181b-5p < 0.0001 -0.450 Figure S4: Scatterplot showing the weighted score (module score) for the antigen processing and presentation gene set for each patient is negatively associated with the expression of hsa-miR-181c-5p in the LAML-TCGA dataset.
These two examples show that microRNAs targeting APP can vary depending on the cancer type. In bladder cancer, members of the miR-200 family of microRNAs are among the top candidates. In acute myeloid leukemia, miR-181 family members potentially target APP, which may be of particular interest for immunotherapy of NPM1-mutated AML, as the mutant NPM1 protein has been identified as the source of most immunogenic epitopes (Forghieri et al., 2021). Moreover, the miR-181 family members have been described to be directly involved in myeloid differentiation and AML (Su et al., 2015).
Use case S4: Identifying a microRNA signature predictive for microsatellite instable colorectal cancer samples.
Microsatellite instability (MSI) in colorectal cancer -often associated with deficient DNA mismatch repair pathway genes and a high mutational burdenhas been shown to be predictive for immune checkpoint inhibitor therapy (Dudley et al., 2016;Le et al., 2017). We have also previously demonstrated that the MSI molecular subtype in colorectal cancer, in particular, is associated with an immunophenotype that includes a variety of tumor-infiltrating lymphocytes (TILs) (Angelova et al., 2015). Our aim was to identify a microRNA signature that predicts microsatellite high status (MSI-H) compared to microsatellite low/microsatellite stable (MSI-L/MSS) status in colorectal cancer to learn which patients might benefit from immune checkpoint blocker therapy (e.g., anti-PD-1 or anti-PD-L1). Therefore, we retrieved data from the TCGA for a combined colon and rectal cancer cohort ( Figure S5A. Based on the expression of these 25 top microRNA candidates a principal component analysis (PCA) was performed in MIO and the projection to the first three components were visualized to indicate how the two groups can be separated ( Figure S5B).
Including the identified microRNAs (which in MIO can be saved as a microRNA set) a classification analysis was performed on the TCGA-CRC data as a training set using a logistic regression model and 10-fold cross validation to separate the MSI samples (Select data to analyze=TCGA CRC, Select Models=Logistic Regression, Number of cross validation=10, Group name in metadata=MSI, Select miRNAset to analyze=name as saved after feature selection).
In the TCGA-CRC training set the MSI groups could be separated with an area under receiver operating characteristics (ROC) curve of 0.908 and mean accuracy over 10-fold cross validation of 0.962 ( Figure S6). Feature importance (weights) of the individual microRNAs in the logistic regression model are shown in Figure S7. The trained model can be saved in MIO and the performance of this model can then be tested in the validation cohort. Ultimately, the MSI groups were also well separated in the validation cohort as evident from the ROC curve (AUC=0.860) and PCA of the expression of these microRNAs ( Figure S8).

Use Case S5: MicroRNA target genes synthetic lethal to immune (therapy) essential genes
Using genome-scale CRISPR-Cas9 library Patel et al. profiled genes whose loss in tumor cells impaired the effector function of CD8+ T cells (Patel et al., 2017). In order to identify synthetic lethal partner genes in tumor cells we have taken advantage of previous efforts and used the ISLE algorithm for calculation (Lee et al., 2018), which is available within MIO. We were interested in identifying microRNAs targeting genes which are synthetic lethal to immune (therapy) essential genes. With regard to combination immunotherapy, either the immunoessential gene is highly expressed/not mutated and tumor cells may be targeted by immunotherapy, or in the case of a low expressed/mutated immune essential gene, targeting of its synthetic lethal gene by e.g. microRNAs could be detrimental to tumor cells and reveal therapeutic vulnerabilities. In MIO we used the option Target Prediction, miRNA Synthetic Lethal Prediction, Select geneset to analyze=Immune essential genes [Patel], Filter method to target Databases=OR, select all prediction tools, Min db*=25. In order to provide robust target predictions and to limit the possible targets to a reasonable number we have chosen that microRNA targets are supported by at least 25 of the 40 target prediction tools. This resulted in 321 interactions of microRNAs and synthetic lethal partners of immune essential genes. For clarity, we summarized only the top 25 interactions (supported by 28 target prediction tools) in Table S8 and visualized them as a network ( Figure S9). As microRNAs are targeting also other genes (off targets) MIO provides the possibility to identify all targets for individual microRNAs (miRNA/Gene Target Prediction). In addition, MIO can perform an overrepresentation analysis for microRNAs based on the number of synthetic lethal target genes compared to all potential target genes (Table S9). SNX17 hsa-miR-19a-3p 28 Figure S9: Interaction network of microRNAs targeting genes, which are synthetic lethal to immune therapy essential genes supported by at least 28 microRNA target prediction tools.

Use Case S6: MicroRNAs associated with the immunophenoscore (IPS) and estimated infiltrated immune cells in ovarian cancer (TCGA-OV).
We are interested whether microRNAs are associated with the Immunophenoscore (IPS) or estimated infiltration of immune cells in ovarian cancer (TCGA-OV). Therefore, we performed miRNA/Immunophenoscore correlation analysis and Select dataset to analyze: TCGA-OV and Select the coefficient for applying the filters: Pearson(R), Maximum adjust P.value*:0.1, Coefficient equal or lower than*:-0.3, Coefficient equal or higher than*:1.0). These settings resulted in one microRNA, hsa-miR-223-3p, with the respective scatterplot ( Figure S10). Figure S10: Association of microRNAs with immunophenoscore (IPS) in ovarian cancer (TCGA-OV).
Interestingly hsa-miR-223-3p has been shown to target PARP1 which is involved in homologous recombination repair in ovarian cancer (Srinivasan et al., 2019). PARP inhibition therapy is approved in ovarian cancer and probably could impact the immune environment and immune therapy response via the STING pathway (Ding et al., 2018). Since we could observe an association of hsa-miR-223-3p with the immunophenoscore, we were also interested whether there is an association of microRNAs and in particular hsa-miR-223-3p with decreased estimated immune cell infiltration. Thus, we performed miRNA/Immune Cell Infiltration Correlation analyses, Select dataset to analyze: TCGA-OV, and selected QUANTISEQ (B cell,Macrophage M1,Macrophage M2,Monocyte,Neutrophil,NK cell,, T cell CD8+, T cell regulatory (Tregs), Myeolid dendritic cells). Results of selected examples are summarized in Table S10. Although hsa-miR-223-3p was not significantly associated with estimated CD8+ T cells it shows a positive correlation with a suppressive immune environment (e.g. Tregs).
In summary, a combination of the presented use cases may help to develop new hypotheses in immuno-oncology or identify new targets for cancer immunotherapy.