MSIsensor-RNA: Microsatellite Instability Detection for Bulk and Single-cell Gene Expression Data

Microsatellite instability (MSI) is an indispensable biomarker in cancer immunotherapy. MSI scoring methods based on high-throughput omics data have gained popularity and demonstrated better performance than the gold-standard method for MSI detection. However, MSI detection methods for expression data, especially single-cell expression data, are still lacking, which limits the scope of clinical application and prevents the investigation of MSI at the single-cell level. Herein, we developed MSIsensor-RNA, an accurate, robust, adaptable, and standalone software to detect MSI status based on expression values of MSI-associated genes. We demonstrated the favorable performance and promise of MSIsensor-RNA in both bulk and single-cell gene expression data across multiple platforms, including RNA-seq, microarray, and single-cell RNA-seq. MSIsensor-RNA is a versatile, efficient, and robust method for MSI status detection from both bulk and single-cell gene expression data in clinical studies and applications. MSIsensor-RNA is available at https://github.com/xjtu-omics/msisensor-rna.


Background
Microsatellite instability (MSI) refers to hypermutations of microsatellite sites due to inactivating alterations of mismatch repair (MMR) genes in malignancies [1,2].
Currently, MSI is an indispensable pan-cancer biomarker in cancer immunotherapy and prognosis, and it is routinely examined in multiple cancer types, particularly in colorectal cancer (CRC), stomach adenocarcinoma (STAD), and uterine corpus endometrial carcinoma (UCEC) [2][3][4][5]. For example, MSI-positive patients are often resistant to 5-fluorouracil treatment but have a better outcome with immune checkpoint blockade treatment [4,5].
Genomics-based methods quantify MSI according to genetic mutations at microsatellite sites; they achieve high accuracy and are becoming popular in clinical MSI detection. For example, MSIsensor [9] detects MSI with a concordance of 99.4% on the MSK-IMPACT panel [21]. The epigenomics-based method MIRMMR [18] detects MSI using methylation levels in the MMR pathway with an AUC of 0.97. In addition, transcription levels of MSI-associated genes correlate with MSI, hinting at the possibility of MSI detection using transcriptomics data [15][16][17].
Besides these high-throughput technologies, deep learning algorithms have also been applied to hematoxylin and eosin-stained slides to detect MSI [19,20]. However, all these methods detect MSI at the sample level and lack cell-level measurement of MSI. Recently, single-cell RNA-seq (scRNA-seq) technology has enabled investigation of cell-specific transcriptomes and shed light on tumor heterogeneity and tumor stages. In particular, single-cell and spatial transcriptomics enable the dynamic analysis of MSI in the complex tumor microenvironment, such as in metastatic and recurrent cancer [22]. However, current MSI detection methods designed for bulk gene expression data do not perform well on scRNA-seq samples. For example, the only software for gene expression data, PreMSIm [16], provides only fixed signatures and a fixed model for all cancers, which limits the wide application of the method. Moreover, the normalization method in PreMSIm also leads to poor performance with abnormal samples. Here, we developed MSIsensor-RNA, a robust method for MSI-associated gene detection and MSI evaluation for both bulk gene expression data and single-cell RNA-seq data.

Implementation
Dataset. We downloaded RNA-seq data of 1,428 TCGA samples across CRC, STAD, and UCEC from the TCGA Research Network (https://portal.gdc.cancer.gov) and obtained their MSI status determined by gold standards (Table S1). We obtained 141 RNA-seq samples of ICGC from the ICGC data portal (https://dcc.icgc.org), with their MSI status reported by MIMcall [23]. Another 106 RNA-seq samples with matched MSI status were downloaded from a publication of the Clinical Proteomic Tumor Analysis Consortium (CPTAC) [24]. We also downloaded Microarray data and the MSI status of 1,468 samples across CRC and STAD from the GEO database (https://www.ncbi.nlm.nih.gov/geo). For scRNA-seq data, we obtained the gene expression data and MSI status of 133 CRC samples from two recent publications [25,26].
Overall design. The pipeline of MSIsensor-RNA consists of data preprocessing, informative gene selection, model training, and model testing (Fig. 1 and Fig. S1).
First, we preprocess the expression values of samples from Microarray, bulk RNA-seq, and scRNA-seq. Next, we select an informative gene set for MSI detection from 1,428 TCGA samples. Then, we use these TCGA samples to train a machine learning model for each cancer type for MSI scoring. Finally, we apply the trained model to independent datasets to test the performance of MSIsensor-RNA for each cancer type.
Data preprocessing. MSIsensor-RNA accepts Microarray expression values, FPKM, TPM, and RSEM read counts as input. All values of the expression matrix were incremented by 1 and then log2 transformed. Then, for each sample or cell, expression values were normalized to a Gaussian distribution with mean 0 and standard deviation 1. For a scRNA-seq sample, to obtain an accurate MSI status, we only included high-quality cells with at least 20% of genes detected. If the number of high-quality cells was less than 20, we sorted all cells by the ratio of detected genes in descending order and used the top 20 cells for MSI detection. To address the dropout problem of scRNA-seq, we imputed zero values with the average expression value of the gene in the given sample.
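The preprocessing steps above can be sketched in plain Python as follows. This is a minimal illustration, not the MSIsensor-RNA implementation: the function names are hypothetical, the population standard deviation is assumed for the Z-score, and the imputation is assumed to use the per-gene mean over the retained cells (zeros included) before the log transform, since the text does not specify these details.

```python
import math
from statistics import mean, pstdev

def log_zscore(values):
    """log2(x + 1) transform, then Z-score within one sample or cell."""
    logged = [math.log2(v + 1.0) for v in values]
    mu, sd = mean(logged), pstdev(logged)
    if sd == 0:
        return [0.0 for _ in logged]  # constant profile: nothing to scale
    return [(v - mu) / sd for v in logged]

def select_cells(cells, min_detected_ratio=0.2, min_cells=20):
    """Keep cells with enough detected genes; otherwise fall back to the
    top `min_cells` cells ranked by the ratio of detected genes."""
    def detected_ratio(cell):
        return sum(1 for v in cell if v > 0) / len(cell)
    kept = [c for c in cells if detected_ratio(c) >= min_detected_ratio]
    if len(kept) < min_cells:
        kept = sorted(cells, key=detected_ratio, reverse=True)[:min_cells]
    return kept

def impute_dropouts(cells):
    """Replace zeros by the gene's average over the cells of the sample
    (assumption: the average includes the zero entries)."""
    n_genes = len(cells[0])
    gene_means = [mean(c[g] for c in cells) for g in range(n_genes)]
    return [[v if v > 0 else gene_means[g] for g, v in enumerate(c)]
            for c in cells]
```

A cell would then pass through `select_cells`, `impute_dropouts`, and `log_zscore` in that order; the order itself is an assumption of this sketch.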
Selection of informative genes. We select informative genes for MSI classification in terms of stability, discrimination, and generalization. Firstly, we remove ribosomal genes, mitochondrial genes, and genes with low FPKM in the TCGA dataset. Secondly, we select genes whose expression discriminates between MSI samples and MSS samples. We perform a rank-sum test on the expression values of MSI samples versus MSS samples for each gene, and only genes with P value < 0.01 are included in the following analysis. Furthermore, we compute the fold of the ith gene by:

$$\mathrm{fold}_i = \left| \log_2 \left( \frac{\frac{1}{n} \sum_{j=1}^{n} G_{ij}}{\frac{1}{m-n} \sum_{j=n+1}^{m} G_{ij}} \right) \right|$$
where m is the number of samples used for informative gene selection, n is the number of MSI samples (indexed first, so that samples n+1 to m are MSS), and $G_{ij}$ is the expression value of the ith gene in the jth sample. We only select genes with fold > 0.5 as candidate informative genes. Finally, we keep genes with better generalization ability for MSI detection. We calculate the area under the receiver operating characteristic curve (AUC) of the gene expression value, and only genes with AUC > 0.65 are kept for the next step. We also calculate the 10-fold cross-validation score of SVM and random forest, and only genes in the first quartile are included in the final informative gene set (Fig. S2).
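The three per-gene filters (rank-sum P value, fold, and AUC) can be illustrated with a short, dependency-free sketch. Several points are assumptions rather than details from the paper: the fold is taken as the absolute log2 ratio of mean expression in MSI versus MSS samples, the rank-sum P value uses a normal approximation without tie correction, the AUC is made direction-agnostic with max(AUC, 1 - AUC), and the function names are hypothetical. The cross-validation step with SVM and random forest is not shown.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def gene_stats(msi, mss):
    """Return (p_value, fold, auc) for one gene.

    msi, mss: positive expression values of the gene in MSI / MSS samples.
    """
    n1, n2 = len(msi), len(mss)
    ranks = average_ranks(list(msi) + list(mss))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0      # Mann-Whitney U of MSI group
    auc = u / (n1 * n2)
    auc = max(auc, 1.0 - auc)                       # assumption: ignore direction
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - n1 * n2 / 2.0) / sigma                 # normal approximation
    p_value = math.erfc(abs(z) / math.sqrt(2.0))    # two-sided
    fold = abs(math.log2((sum(msi) / n1) / (sum(mss) / n2)))
    return p_value, fold, auc

def is_informative(p_value, fold, auc):
    """Thresholds quoted in the text: P < 0.01, fold > 0.5, AUC > 0.65."""
    return p_value < 0.01 and fold > 0.5 and auc > 0.65
```

For example, `is_informative(*gene_stats(msi_values, mss_values))` applies all three thresholds to a single gene.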
Machine learning model training and testing. We build a support vector machine (SVM) model to classify the MSI status for CRC, STAD, and UCEC in the TCGA dataset. Firstly, we utilized SMOTE [27] to correct the imbalance between MSI and MSS in each cancer type by oversampling the MSI samples. Then, we utilized the expression values from the corrected data as input to train the SVM model for MSI classification. To evaluate the performance of MSIsensor-RNA, we tested the trained model with 1,848 independent samples from multiple platforms, including 247 RNA-seq, 1,468 Microarray, and 133 scRNA-seq samples. For a scRNA-seq sample, we calculated the MSI score with the SVM model for each high-quality cell. Then, the average cell MSI score is used to evaluate the MSI status of the scRNA-seq sample.
PreMSIm running. To compare the performance of MSIsensor-RNA with the only standalone software, PreMSIm, we also applied PreMSIm to the Microarray, RNA-seq, and scRNA-seq data of the 1,848 independent samples. For Microarray and RNA-seq samples, we tested PreMSIm in two modes: PreMSIm-all and PreMSIm-split. In PreMSIm-all, we passed all input samples together to the PreMSIm normalization and prediction modules. In PreMSIm-split, we input the samples of one database per run.
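To make the class-balancing and cell-averaging steps concrete, the sketch below shows the core interpolation idea of SMOTE and the averaging of per-cell scores. It is a simplified stand-in written for illustration, not the SMOTE implementation cited in [27] nor the code used by MSIsensor-RNA; the function names, the choice of k, and the seed are assumptions, and the SVM training itself is not shown.

```python
import math
import random

def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    random minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic

def sample_msi_score(cell_scores):
    """Sample-level score of a scRNA-seq sample: mean of per-cell scores."""
    return sum(cell_scores) / len(cell_scores)
```

In a setting like the one described, `n_new` would typically be the number of MSS samples minus the number of MSI samples, so that both classes have equal size before training.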
Performance comparison of MSIsensor-RNA and PreMSIm. In MSIsensor-RNA, the MSI probability predicted by the SVM model was used to score the MSI status.
The probabilities were further transformed to MSI status by the Youden index [28].
We first compared the MSIsensor-RNA score between MSI and MSS samples by rank-sum test to assess the performance of MSIsensor-RNA across platforms. To further evaluate the performance of the two MSI detection methods, we calculated the AUC, accuracy, F-score, precision, sensitivity, and specificity of MSIsensor-RNA and PreMSIm for the different sequencing technologies.
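The conversion of probabilities into MSI calls and the reported metrics can be illustrated as follows. This is a minimal sketch under stated assumptions: labels are coded 1 for MSI and 0 for MSS, a sample is called MSI when its score is at or above the threshold, both classes are present, and the function names are hypothetical.

```python
def confusion(scores, labels, threshold):
    """Count (tp, fp, tn, fn), calling MSI when score >= threshold."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def youden_threshold(scores, labels):
    """Threshold maximizing J = sensitivity + specificity - 1."""
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp, fp, tn, fn = confusion(scores, labels, t)
        j = tp / (tp + fn) + tn / (tn + fp) - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

def metrics(scores, labels, threshold):
    """Accuracy, precision, sensitivity, specificity, and F1 at a threshold."""
    tp, fp, tn, fn = confusion(scores, labels, threshold)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    prec = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
    return {"accuracy": (tp + tn) / len(labels), "precision": prec,
            "sensitivity": sens, "specificity": spec, "f1": f1}
```

The threshold would normally be chosen on training data and then applied unchanged to the independent test samples.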

Robustness testing of MSIsensor-RNA and PreMSIm. To test the performance of MSIsensor-RNA and PreMSIm with different normalization methods, we ran both methods on the FPKM, TPM, and read count formats of TCGA samples and calculated the AUC, F1-score, accuracy, precision, sensitivity, and specificity for each normalization method. To overcome the bias of different normalization methods and sequencing technologies, MSIsensor-RNA normalizes the input data of each sample to a Gaussian distribution with mean 0 and standard deviation 1. In PreMSIm, however, normalization is performed per gene, which means the normalized input data of a sample is influenced by the other samples in the batch. We therefore tested PreMSIm in two ways. Firstly, we input TCGA samples separately for the three cancer types and calculated the performance of the predicted MSI result. Secondly, we input all TCGA samples together to evaluate its performance. We then compared the MSI results and performance of these two ways and found that the performance of PreMSIm was affected by how the input was provided.
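The difference between the two normalization strategies can be seen in a toy example. The sketch below contrasts a per-sample Z-score with a generic per-gene Z-score; the latter is only a stand-in chosen for illustration and is not PreMSIm's actual normalization code, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def zscore(values):
    """Scale a list of numbers to mean 0 and standard deviation 1."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd else 0.0 for v in values]

def normalize_per_sample(matrix):
    """Rows are samples, columns are genes; each row is scaled on its own."""
    return [zscore(row) for row in matrix]

def normalize_per_gene(matrix):
    """Each gene (column) is scaled across all samples in the batch."""
    cols = [zscore(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*cols)]

# The same sample placed in two different batches.
sample = [1.0, 2.0, 3.0]
batch_a = [sample, [2.0, 2.0, 5.0]]
batch_b = [sample, [9.0, 1.0, 4.0], [0.0, 7.0, 8.0]]

# Per-sample scaling gives the sample identical values in both batches;
# per-gene scaling makes its values depend on the accompanying samples.
same = normalize_per_sample(batch_a)[0] == normalize_per_sample(batch_b)[0]
differs = normalize_per_gene(batch_a)[0] != normalize_per_gene(batch_b)[0]
```

Both `same` and `differs` evaluate to true here, which mirrors why a per-gene scheme can give different predictions for the same sample depending on how the input is grouped.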

Results
The workflow of MSIsensor-RNA includes four modules (Fig. 1 and Fig. S1). First, we preprocess the expression values of Microarray, bulk RNA-seq, and scRNA-seq data. Then, we select a set of informative genes for MSI detection. Next, we train a support vector machine (SVM) model to estimate MSI scores using gene expression values of the selected informative genes. Finally, we apply the trained model to predict the MSI score for either one clinical sample or a single cell (Table S1). For a given scRNA-seq sample, we also developed a model to report the MSI status of the sample by integrating the MSI scores of its cells.
MSIsensor-RNA accepts a variety of expression data as input, including FPKM, RSEM normalized read counts, TPM, and microarray expression values. Input expression values were incremented by 1 and log2 transformed, followed by Z-score normalization per sample or cell. In particular, in the single-cell module of MSIsensor-RNA, we only included high-quality cells in the following steps, and the missing values of each gene in high-quality cells were imputed with the average expression value of that gene in the sample.
The informative gene selection module consists of three key steps (Fig. S2): (i) removing mitochondrial genes and ribosomal genes; (ii) filtering out genes whose expression values do not differ significantly between MSI and MSS samples; (iii) keeping genes whose expression values have high generalization scores for MSI detection (online methods). We applied the gene selection module to 1,428 samples based on gene expression (FPKM values) from three MSI-popular cancer types (CRC, STAD, and UCEC) in the TCGA dataset and obtained 109 informative genes for MSI classification. We also performed this step separately for CRC, STAD, and UCEC, yielding 397, 206, and 86 informative genes, respectively (Fig. S4 and Table S2-S5). We found that only eight informative genes are detected in all three cancer types. Among them, MLH1 was the most important informative gene for MSI detection, as confirmed by previous reports [15][16][17] (Fig. S5).
To assess the performance of MSIsensor-RNA in bulk sample data, we first trained tumor-specific models for CRC, STAD, and UCEC, as well as a model for all three MSI-popular cancer types in the TCGA dataset. Then we compared the two kinds of models (tumor-specific and MSI-popular) with the standalone software, PreMSIm, in terms of the area under the curve (AUC) of the receiver operating characteristic (ROC), accuracy, sensitivity, and specificity in 1,715 independent samples (1,468 Microarray and 247 bulk RNA-seq samples). Notably, MSIsensor-RNA normalizes the expression values of informative genes for each sample independently, while PreMSIm must normalize each gene across multiple samples at the same time. Thus, we examined PreMSIm with all samples normalized together (PreMSIm-all) or by database (PreMSIm-split).
We found that MSIsensor-RNA achieved an AUC of 0.982 ± 0.040, indicating the robustness of MSIsensor-RNA regardless of how gene expression is measured (Table S10).
To assess the performance of MSIsensor-RNA and PreMSIm in scRNA-seq samples, we applied the trained model of MSIsensor-RNA to 23,902 high-quality cells from 133 samples to obtain sample-specific MSI status and compared it with the ratio of cells labeled as MSI by PreMSIm. The result showed MSIsensor-RNA detected MSI for scRNA-seq samples with 0.958 AUC, 0.9231 sensitivity, and more directly reflective of the features of MSI and easier to obtain. In this study, we developed a robust method, MSIsensor-RNA, for MSI detection with gene expression data. MSIsensor-RNA provides informative gene selection, model training, and MSI detection modules. MSIsensor-RNA is able to process data from multiple platforms, including Microarray, RNA-seq, and single-cell RNA-seq.
Compared to the standalone method PreMSIm, MSIsensor-RNA also provided modules for informative gene selection and model training so that users could apply MSIsensor-RNA for different cancer types.MSIsensor-RNA also improved the normalization method of the data, yielding a more robust result than PreMSIm (Fig. 2).In addition, MSIsensor-RNA facilitates the evaluation of MSI status at the single cell level, which will be critical to better understanding the mechanism of MSI in cancer immunotherapy in the future.
In most MSI detection methods, such as MSIsensor [10] and MSIsensor-pro [11], MSI is quantified according to genetic mutations at microsatellite sites, which are the consequence of MSI, rather than the deficiency of the MMR system, which is its direct cause. In this study, a set of MSI-associated genes was identified, and their expression values were used for MSI evaluation. We found that MLH1 is the most important gene in all tested cancer types. In addition, unexpected expression of MLH1 is commonly seen in Lynch syndrome [29]. Thus, we tested the performance of MSIsensor-RNA on samples with abnormal MLH1 expression. We trained a model based on all informative genes and tested it on samples with simulated abnormal MLH1 gene expression (Table S14). We found that the model achieved AUCs of 0.974 and 0.972 when we set the MLH1 expression value to the maximum and minimum of all gene expression values, respectively. Furthermore, when MLH1 was excluded from the informative gene set, MSIsensor-RNA still achieved an AUC of 0.977, indicating the robustness of MSIsensor-RNA for MSI detection.
We demonstrate that MSIsensor-RNA achieved higher performance than other methods based on gene expression and comparable performance compared to DNA-based methods (Table S15). In our study design, MSIsensor-RNA detects MSI according to the gene expression signature of genes on MSI-associated pathways, while MSIsensor evaluates MSI by computing the ratio of somatic microsatellite mutations. Although MSIsensor achieved slightly higher performance than MSIsensor-RNA, it cannot replace the applications of MSIsensor-RNA in gene expression data. Currently, MSIsensor-RNA reports favorable performance in all three MSI-popular cancers, including colorectal cancer, stomach adenocarcinoma, and uterine corpus endometrial carcinoma. The MSI features are different in different cancer types. Thus, the model obtained low performance when the testing samples were inconsistent with training samples in cancer types (Table S16-S18).
Therefore, the performance of MSIsensor-RNA in other cancer types needs further validation in the future.
MSIsensor-RNA is available for non-commercial use by academic, government, and non-profit/not-for-profit institutions. A commercial version of the software is available and licensed through Xi'an Jiaotong University. For more information, please contact pengjia@stu.xjtu.edu.cn or kaiye@xjtu.edu.cn.