MetChem: a new pipeline to explore structural similarity across metabolite modules

Abstract Summary Computational analysis and interpretation of metabolomic profiling data remains a major challenge in translational research. Exploring metabolic biomarkers and dysregulated metabolic pathways associated with a patient phenotype could offer new opportunities for targeted therapeutic intervention. Metabolite clustering based on structural similarity has the potential to uncover common underpinnings of biological processes. To address this need, we have developed the MetChem package. MetChem is a quick and simple tool that allows to classify metabolites in structurally related modules, thus revealing their functional information. Availabilityand implementation MetChem is freely available from the R archive CRAN (http://cran.r-project.org). The software is distributed under the GNU General Public License (version 3 or later).


Introduction
Metabolomics is a post-genomic research field that consists in the unbiased analysis of small molecule metabolites in biological samples such as tissue specimens (Priolo et al., 2014), cell lines (Zadra et al., 2019) and biofluids (Aimetti et al., 2012;Bertini et al., 2012). Each cell type or biofluid is characterized by unique metabolic signature, which can be altered in pathophysiological conditions (Cacciatore and Loda, 2015;Cacciatore et al., 2021). Metabolic concentrations are regulated by complex biological pathways, of which enzymes are key modulators. Despite their known specificity, many enzymes are reported to have promiscuous activity and catalyze reactions of molecules that are chemically like their specific substrates (Piedrafita et al., 2015). Pathological events can also change the concentration of specific molecule classes in non-enzymatic manner, such as through mechanical occlusion (Elebo et al., 2021). During both biological activity and preanalytical procedures, nonenzymatic reactions can also muddle the structure and concentration of metabolites that either belong to the same chemical class or harbor specific functional groups (Cacciatore et al., 2013(Cacciatore et al., , 2017a. Metabolites with a similar chemical structure tend to share similar properties and biological activity (Ekaney et al., 2021;Menikarachchi et al., 2013).
Most conventional computational tools for metabolite mapping and enrichment analysis rely primarily on a wide variety of curated pathway databases (Banimfreg et al., 2022;Pang et al., 2021). Unfortunately, the coverage of currently annotated pathways in the human metabolic network remains poor. Recently, chemical similarity has been used to extend known metabolic pathways to include unreported metabolites (Grapov et al., 2015;Pertusi et al. 2015).
The first study that applied structural similarity to classifying Escherichia coli metabolome into chemically and functionally related groups was reported by Nobeli et al. (2003). After this, there have been many attempts to develop computer aided tools that integrate chemical structure into metabolite analysis and visualization such as MetaMapR (Grapov et al., 2015) and ChemRICH (Barupal and Fiehn, 2017). However, conventional algorithms fail to identify local clusters of metabolites with similar structures (Andronov et al., 2021). Thus, there is still an unmet 1 need for tools that can identify chemical similarity cluster with high accuracy.
Here, we present MetChem, a new R package built on the KODAMA, an unsupervised machine learning algorithm for dimensionality reduction that has been successfully applied for clustering identification (Cacciatore et al., 2014, Zinga et al., 2023. MetChem is designed to provide metabolomics researchers with a user-friendly pipeline able to investigate features in metabolic modules defined by their structural similarity. MetChem is also equipped with different functions to provide identified clusters with information from the Human Metabolite Database (HMDB) (Wishart et al., 2022).

Data input
The MetChem pipeline requires as input the metabolites' concentrations of each sample, the simplified molecular-input line-entry system (SMILES) and the HMDB identifiers associated to each chemical name. These can be retrieved from PubChem Identifier Exchange Service (https://pubchem.ncbi.nlm.nih.gov/idexchange/ idexchange.cgi).

Metabolite structural clustering
As the first step in the pipeline, the clusters.detection function uses the molecular structure of metabolites represented by the SMILES is used to visualize in a two-dimensional space the chemical similarity across metabolites and, subsequently, to identify the local clusters ( Fig. 1). This function converts SMILES into molecular fingerprints, encoding their structural characteristics as a vector. The distance between two metabolites is calculated using a distance method such as Tanimoto to produce a dissimilarity matrix (Fig. 1A). This dissimilarity matrix is then converted into a multidimensional space (Fig. 1B) prior to being processed by KODAMA, using the homonym package (Cacciatore et al., 2017b). The output generated by KODAMA is a two-dimensional plot representing the 'chemical distances' between all metabolites (Fig. 1C). Metabolite hierarchical clustering is then built on the KODAMA output. Rousseeuw's Silhouette quality index (Rousseeuw, 1987) is used to calculate the optimal number of clusters.
To ensure the reproducibility of the results, the entire procedure is repeated (10 times as default) and average Rousseeuw's Silhouette quality index values are calculated (Fig. 1D).

Weighted metabolite chemical similarity analysis
The weighted gene co-expression network analysis (WGCNA) is a method that was originally developed not only to quantify the correlation between a pair of genes, but also to determine the extent of shared neighbors and define topological overlap in microarray dataset (Langfelder and Horvath, 2008). Recently, WGCNA has been also applied to metabolomic research (Mock et al., 2018). Akin to WGCNA, we have developed the Weighted Metabolite Chemical Similarity Analysis (WMCSA). In this pipeline, WMCSA is applied to a selected cluster of metabolites identified by the clusters.detection function ( Fig. 2A). WMCSA is implemented in the homonym function WMCSA. This function summarizes metabolite concentrations in modules that are defined by metabolite chemical similarity. Modules are generated by cutting all possible different branches of the hierarchical clustering performed on the selected class of metabolites. Each module is represented by one hypothetical vector called 'module eigen metabolite', which is equivalent to the first component of the principal component analysis (Fig. 2B).

HMDB data retrieval and feature analysis
In HMDB, the metabolite structural and functional information is stored as a so called 'metabocard' where each metabocard contains more than hundred descriptors of biological, clinical and physical-chemical data hyperlinked to other databases (e.g. KEGG, PubChem, MetaCyc, ChEBI).
The MetChem package includes functions to download, extract and process information from HMDB. The function readMet gives full access to metabolite descriptors that are stored as individual metabocards in the HMDB and convert them into an accessible data format for R users. The functions nameMet, propertiesMet, substituentsMet and taxonomyMet, EnzymesMet, diseasesMet and pathwaysMet extract names, chemical properties, substituents, taxonomy, enzymes, disease and pathways for all metabolites in each module.
After retrieval of the relevant information from HMDB, each module is associated with chemical and biological attributes. The function features compare the different attributes of each metabolite Fig. 1. Metabolite clustering based on chemical similarity. In the first step, (A) SMILES are converted in a molecular fingerprint. Tanimoto distance is used to produce a dissimilarity matrix between metabolites. (B) Multidimensional scaling is applied to convert the dissimilarity matrix in a multidimensional space, with 50 dimensions as default. (C) KODAMA analysis is performed on the output of the multidimensional scaling to highlight the chemical similarity between metabolites. (D) The optimal number of clusters representing different chemical classes is defined by Rousseeuw's Silhouette quality index according to their belonging to each module (Fig. 2C). The final output of the MetChem pipeline is a summary of metabolite data in different modules with defined chemical and biological properties.

Case study
The MetChem pipeline was applied to a metabolomic dataset from the study by Labbé et al. (2019). The dataset contains the metabolic profiles from ventral prostate tissues of mice that overexpress the human c-MYC transgene (MYC) in the prostate epithelium and wild-type littermates (WT). Mice were fed either a high fat diet (HFD; 60% kcal from fat; lard-rich in saturated fat) or a control diet (CTD; 10% kcal from fat) for 9 weeks. The dataset includes six replicates for each group (i.e. WT_CTD, MYC_CTD, WT_HFD and MYC_HFD). A total of 398 metabolites were associated with their own SMILES representation. In this dataset, eight different chemical classes were identified ( Fig. 1C and D). A WMCSA was performed in one selected chemical class as representative example to generate module representation (Fig. 2B). Statistically significant different modules between MYC and WT were identified. The module 12 was chosen for further inspection. The module consists of the metabolites anserine, carnosine, cyclo(his-pro), N-acetylhistidine and N-acetyl-1-methylhistidine, which are characterized by the presence of common significant substituents, such as imidazolyl carboxylic acid. Related enzymes were also identified (Fig. 2C). Although, the conversion of methylhistidine from anserine and carnosine has been described as indication of muscle breakdown (Hsieh et al., 2022), the relation with cyclo(his-pro), N-acetylhistidine and N-acetyl-1-methylhistidine has not been characterized yet, which opens opportunities for further investigation. We observed that reduced concentration of these metabolites was associated with the overexpression of MYC, a well-known oncogene involved in the regulation of the glycolysis (Dang, 2012). Since carnosine exerts metabolismdependent effects, including potential inhibition of glycolysis (Gaunitz and Hipkiss, 2012), we speculate that the metabolic effects of carnosine reduction in the MYC-transformed prostate may also involve other imidazolyl carboxylic acids of the module, which is warrant further investigation.

Summary and outlook
MetChem allows integrating chemical and concentration information in the analysis of metabolomic data. Thus, MetChem may provide important biological insights in pathological and physiological conditions.