Making mouse transcriptomics deconvolution accessible with immunedeconv

Abstract Summary Transcriptome deconvolution has emerged as a reliable technique to estimate cell-type abundances from bulk RNA sequencing data. Unlike their human equivalents, methods to quantify the cellular composition of complex tissues from murine transcriptomics are sparse and sometimes not easy to use. We extended the immunedeconv R package to facilitate the deconvolution of mouse transcriptomics, enabling the quantification of murine immune-cell types using 13 different methods. Through immunedeconv, we further offer the possibility of tweaking cell signatures used by deconvolution methods, providing custom annotations tailored for specific cell types and tissues. These developments strongly facilitate the study of the immune-cell composition of mouse models and further open new avenues in the investigation of the cellular composition of other tissues and organisms. Availability and implementation The R package and the documentation are available at https://github.com/omnideconv/immunedeconv.


Methods implementation in immunedeconv
mMCP-counter (Petitprez et al.) was accessed from the corresponding R package.In the original implementation of the method, the gene identifiers (Mouse Genome Informatics Gene Symbol) were based on the GRCm38 version of the murine genome.After the publication of the method, the Genome Reference Consortium (GRC) updated the reference mouse genome from the GRCm38 to the GRCm39 version.Since the alignment of raw reads to get the gene counts is usually performed on the latest version of the genome, this resulted in some marker genes missing from the bulk dataset.Therefore, we retrieved from ENSEMBL the updated Gene Symbols and updated the mMCP-counter gene signature.In turn, the mMCP-counter package was then updated to allow the choice of the version of the marker genes to use for the analysis.
The method seqImmuCC (Chen et al., 2018) was originally available only as a web server.It was therefore reimplemented in immunedeconv.Deconvolution can be performed using either nu support vector regression (nu-SVR, requires the user to provide CIBERSORT R script) or linear least squares regression (LLSR).The R script for the latter was provided by the authors.The seqImmuCC RNA-seq-based signature matrix (available from https://github.com/wuaipinglab/ImmuCC)was updated with the latest Gene Symbols, as done for mMCP-counter.
The original DCQ (Altboum et al., 2014) algorithm is implemented in the R ComICS package, which contains all the required files to run the analysis.It is important to note that DCQ is a method intended to be applied to differential expression data since it was first developed to find the differences in terms of cell composition across samples taken at different time points or which underwent different treatments.The authors recommend normalizing gene expressions for each sample using the mean and the standard deviation of the expression in the reference samples.Therefore, a function was implemented to perform this operation, which lets the user specify a set of samples to be considered for the normalization.If not provided, the mean expression across all samples is considered as the reference expression profile.
The BASE (Varn et al., 2016) algorithm is available as an R script with the original publication.This method relies on a "cell compendium", i.e., a signature matrix where the expression of various cell types is z-scored.We followed the authors' instructions to build the compendium for murine hematopoietic cell types starting from several signature matrices.The resulting compendiums were tested on the Petitprez validation dataset.The signatures matrices considered were those of DCQ and seqImmuCC.In addition, the two hematopoietic mouse datasets GSE109125 and GSE15907 were accessed from the Immunological Genome Project website (https://www.immgen.org/):these contain reference transcriptomes for many hematopoietic murine cell types, as well as control samples and unassigned cells which had to be removed.The compendium obtained from GSE15907 was the one selected.DCQ signature and BASE compendium encompass 207 and 185 fine-grained cell phenotypes, respectively, identified mainly by their surface markers.Using the cell type-specific metadata provided by each method, we manually curated a table to assign each cell phenotype to a coarser category (e.g. from "T.4MEM.LN" to "CD4+ Memory T cell").This table is used internally by immunedeconv once the deconvolution is performed to combine the results.The scores of the cell subtypes belonging to the same type are grouped considering either their sum in the case of DCQ, or their scaled median in the case of BASE.Three additional human-based methods were included in the package.We implemented ABIS (Monaco et al., 2019), originally available as a Shiny app, adapting its source code in the package.ConsensusTME (Jiménez-Sánchez et al., 2019) was included with its own R package.ConsensusTME, like TIMER, requires the user to specify the cancer type of the samples being analyzed, allowing only one cancer type per analysis.In order to extend the analysis to a set of samples from different cancer types, we added a function to perform deconvolution multiple times, grouping samples of the same cancer type, and merging the results together.For ESTIMATE (Yoshihara et al., 2013), the R package was obtained from rforge and its code and gene sets were integrated into immunedeconv.
Immunedeconv uses a cell type table to map the results of each deconvolution analysis to a controlled vocabulary to make the cell types comparable across different methodologies.The cell type table wasupdated to include all cell types estimated by the newly-introduced methods.
All human-based methods were also extended to the deconvolution of murine RNA-seq data.The principle is that the murine gene names from the bulk RNA-seq data are converted to their corresponding human orthologs, so that the human-based methods can be used for their deconvolution without changing their internal implementation and/or signature matrix/gene sets.Murine gene names are often converted to the respective human orthologs by capitalizing the letters, however this approach can be unreliable.Immunedeconv currently retrieves the human and murine genes by querying the ENSEMBL database, retrieving the MCG gene symbols and the corresponding HGNC names.In case the ENSEMBL website is unresponsive, the Mouse Genome Database is used (http://www.informatics.jax.org/downloads/reports/HOM_MouseHumanSequence.rpt).Finally, four methods (CIBERSORT, EPIC, ConsensusTME and seqImmuCC) allow, in their native implementation, the use of a signature matrix other than the one implemented.New method-specific functions that require the necessary data to perform custom deconvolution (i.e.signature matrix, marker genes, etc.) were therefore implemented.

RNA-seq data access and processing
Raw FASTQ files were retrieved from ArrayExpress using the accession codes E-MTAB-9271 (Petitprez dataset) and E-MTAB-6458 (Chen dataset).Both datasets were processed using the nf-core/rnaseq framework (Ewels et al., 2020) and the GRCm39 genome from the GRC.Briefly, FASTQ files were first aligned to the reference genome with STAR (Dobin et al., 2013), and then summarized with Salmon (Patro et al., 2017) to obtain both read counts and transcripts per millions (TPMs).The flow cytometry estimates for the various cell types were obtained from the authors (Petitprez dataset) and from the method's Github repository (https://github.com/wuaipinglab/ImmuCC/tree/master)(Chen dataset), respectively.The Tsuyama dataset (Tsuyama et al., 2023) was accessed from the Gene Expression Omnibus (GEO) archive through the accession number GSE202603.Along with the FASTQ files, the TPM and FPKM counts are available.The TPM values were retrieved for the samples cultivated in normoxic conditions.The Tabula Muris (Tabula Muris Consortium et al., 2018) dataset, composed of single-cell RNA sequencing (scRNA-seq) data generated with the 10x Genomics technology and the associated cell-type annotations, was accessed from Figshare (https://figshare.com/projects/Tabula_Muris_Transcriptomic_characterization_of_20_organs_and_tissues_from_Mus_musculus_at_single_cell_resolution/27733).

Deconvolution of mouse RNA-seq data
All mouse deconvolution methods were run on the TPM data, according to the methods specifications.For seqImmuCC, the nu-SVR approach was used (algorithm='SVR').For DCQ, the mean and standard deviation of all samples were used to scale the data (ref.samples=NULL).Human deconvolution methods were run on the TPM data, mapping the results to the cell types present in the flow cytometry references.For TIMER and ConsensusTME the tumor samples from the Petitprez were considered as lung squamous cell carcinoma (cancer_type='LUSC').The remaining samples and the Chen dataset were considered as diffuse large B cell lymphoma (cancer type='DLBL').This cancer type was determined to be the most appropriate for deconvolution using the TIMER 2.0 web app (T.Li et al., 2020).For quanTIseq and EPIC, the analysis was carried separately for blood PBMC samples (tumor=FALSE), and on the tissue samples (tumor=TRUE).For methods that allow analysis of RNA-seq and microarray data (ABIS, EPIC, quanTIseq), the correction for microarray data was disabled (arrays=FALSE).Finally, for EPIC and quanTIseq mRNA bias correction was enabled (scale_mrna=TRUE).

Simulation and deconvolution of mouse mammary gland RNA-seq pseudo-bulk data
The preprocessed scRNA-seq data (gene counts) of from the Tabula Muris "mammary gland" dataset was processed with the R package Seurat (Hao et al., 2023) to extract the gene counts and the cell-type annotations.20 pseudo-bulk samples were simulated using the R package SimBu (Dietrich et al., 2022) (simulate_bulk function) sampling 1000 cells from the dataset among B cells, T cells, macrophages, endothelial cells, and stromal cells; basal and epithelial cells were excluded from the simulation as they are not quantified by any of the immunedeconv methods.Cell fractions in the pseudo-bulk samples were based on the fractions in the singlecell dataset with an added variability (scenario='mirror_db', balance_even_mirror_scenario=0.05).The pseudo-bulk were simulated accounting for cell-type=specific mRNA bias (scaling_factor='expressed_genes').Deconvolution was performed on the count per million (CPM)-scaled data, as for the other analyses.For ConsensusTME and TIMER, the breast cancer type was selected (cancer_type='BRCA'). Deconvolution was performed with the same options described above.For quanTIseq and EPIC, the option tumor='TRUE' was used.The stromal cells were matched with deconvolution results obtained for stromal cells (BASE, xCell) or fibroblasts (EPC, MCP and mMCP-counter, ConsensusTME).

Deconvolution of mouse pancreatic RNA-seq data using custom signatures
Four methods currently allow the use of custom signature matrices (EPIC, CIBERSORT, ConsensusTME and seqImmuCC).Since they provide cell fractions estimates, CIBERSORT, seqImmuCC and EPIC were used to deconvolve the Tsuyama dataset.The methods require the user to provide a signature matrix, while the latter also requires a set of marker genes for the cell types of interest, respectively.Single-cell transcriptomic profiles of the pancreatic cells were retrieved from (Baron et al., 2016) and a signature matrix was generated with CIBERSORTx (Newman et al., 2019).Pancreatic Islets marker genes were accessed from the R package Bseq-SC (Baron et al., 2016) (R objects pancreasIslet and pancreasMarkers).Since these were human-gene-based, we converted them to the correspondent mouse orthologous genes using the newly introduced immunedeconv function.(Yoshihara et al., 2013)