imputomics: web server and R package for missing values imputation in metabolomics data

Abstract Motivation Missing values are commonly observed in metabolomics data from mass spectrometry. Imputing them is crucial because it assures data completeness, increases the statistical power of analyses, prevents inaccurate results, and improves the quality of exploratory analysis, statistical modeling, and machine learning. Numerous Missing Value Imputation Algorithms (MVIAs) employ heuristics or statistical models to replace missing information with estimates. In the context of metabolomics data, we identified 52 MVIAs implemented across 70 R functions. Nevertheless, the usage of those 52 established methods poses challenges due to package dependency issues, lack of documentation, and their instability. Results Our R package, ‘imputomics’, provides a convenient wrapper around 41 (plus random imputation as a baseline model) out of 52 MVIAs in the form of a command-line tool and a web application. In addition, we propose a novel functionality for selecting MVIAs recommended for metabolomics data with the best performance or execution time. Availability and implementation ‘imputomics’ is freely available as an R package (github.com/BioGenies/imputomics) and a Shiny web application (biogenies.info/imputomics-ws). The documentation is available at biogenies.info/imputomics.


SI1 Acquisition of missing value imputation algorithms
We gathered a collection of articles related to missing value imputation algorithms (MVIAs) in metabolomics research were gathered using the following PubMed: • "imputation missing metabolomics" • "metabolomics missing values", • "metabolomics missing" • "metabolomics imputation" We checked if found publications used metabolomics data for the imputation.Then we selected only those MVIAs that were written in R (R Core Team 2022).The full pipeline of collection, curation, and extraction of imputation functions from the articles can be seen in Figure S1.
Our search yielded 26 articles that employed various methods for imputing missing values in metabolomics datasets.20 of them (76.92%)employed at least one method implemented in R (R Core Team 2022).Within those articles, we identified 52 distinct MVIAs with 70 different implementations.
As a baseline model, we have added random imputation, where a missing value in a column is imputed with a randomly sampled non-missing value from the column.We assume that if every MVIA that fails to outperform the random imputation should be applied very carefully.As the code necessary to use certain MVIAs (such as MINMA, BayesMetab, or ORI) was not readily available, we have emailed the respective authors to obtain access to them.We received access to MINMA and BayesMetab from their respective authors.

SI1.1 Exclusion of MVIAs
We have chosen MVIAs to implement using the following criteria: 1. Runs on default parameters without requiring additional data besides metabolite intensities.2. Each column can contain missing values.3. Does not modify original data besides imputing missing values.Thus, several imputation methods were excluded from our analysis as they didn't meet our standards.The kNN-obs-sel and MICE-pmm methods from the UnMetImp package and impute.SLR function from MINMA were eliminated, as we were unable to run them on default hyperparameters as they need more than metabolites intensities.Similarly, llsImpute from pcaMethods, rfImpute from randomForest, TS.Lasso from GSMimpute, and MINMA from MINMA were removed because they required at least one column to be free of missing values.The MBimpute function from DanteR and remat from rMisbeta were excluded as they modify the data during the imputation process.
We chose not to include the random forest, kNN-TN, and kNN imputation methods employed in the study published in (Li et al. 2020), as the specific implementations used were not specified.Furthermore, we were unable to obtain the ORI code from the authors despite our request for access.
In terms of the MAI package, we only incorporated the best-performing imputation methods for missing value types ('Random forest' as the MCAR algorithm and 'Single' as the MNAR algorithm).

SI2 MVIAs
Table S1: List of all found MVIAs implemented in R along with their corresponding implementations and articles where they were used.The full name column indicates the name of the MVIA.The implementation column contains the R package and function used, the square brackets indicate the package name.If no package is listed, the imputation function was taken from the paper supplements.If there wasn't a specified implementation of MVIA then it is annotated as unknown.The 'base' label indicates functions implemented in base R. The citation column lists the papers where the method was used.full name implementation citation half-minimum imputation [base] Di Guida et al. (2016); Miller et al. (2021); Wei et al. (2018); Kumar, Hoque, and Sugimoto (2021); S. Taylor et al. (2022); S. L. Taylor et al. (2016); Jin, Kang, and Yu (2018); Kokla et al. (2019)

SI3.1 Obtaining datasets from MW -datasets preparation
On March 10, 2023, we employed HTTP requests through the REST API to retrieve 2,306 sets of data from Metabolomics Workbench (MW).Subsequently, the datasets lacking numeric values and consisting solely of named samples with named groups were excluded.These datasets represent much larger (sample-or metabolite-wise) studies that we have excluded due to the computational limitations on our side.This resulted in a collection of 2,288 datasets of metabolite intensities (Fig. S3, DATA ACQUISITION).
Next, we defined the number of metabolites in a dataset as the number of metabolites that have no more than 40% missing values (denoted as zeros in MW), and we selected those studies that have at least 5 metabolites/samples, which resulted in a total of 1812 datasets.Then, referring to the marginal distributions of numbers of metabolites/samples, we created a grid comprising 23 points (6 compounds × 4 samples -1).The compounds spanned a range from 10 to 370, while the samples varied from 10 to 195 (Fig. S4).This grid represents the most commonly occurring relations between sample number and metabolite number in MW (Fig. S3, GRID DESIGN).
Next, we were looking for datasets without missing values to find ones suitable for the benchmark of imputation methods.Thus, we identified 488 datasets without zeroes in MW studies (representing missing values, which was confirmed by personal correspondence with the MW support), out of which 417 contained at least 5 samples/metabolites.Afterward, we calculated the distance of each dataset from the points within our grid.In this context, distance is defined as the joint difference between the number of metabolites/samples present in the dataset and the corresponding number of metabolites/samples for a given point in the grid.Based on that measure, we identified the datasets that are the closest to the points from our grid.If more than one dataset was at the same distance from the grid point, we have randomly sampled one of them (Fig. S3, DATASET SELECTION).The 23 selected datasets (and their distance from the specific grid point) are presented in Table S2.The selected datasets are also presented visually in Figure S5.

SI3.2 Amputation
Next, we amputed each grid point 200 times based on a given missing value scheme.Each iteration produced a modified dataset with different missing values.After generating the amputed datasets, the next step was to filter and reject any datasets that had more than 40% missing values in any metabolite.This ensured that only datasets with a manageable level of missing data were considered for further analysis.Once the excessive missing values were removed, we assessed if any datasets remained.If there was at least one dataset left, we then selected the case where the overall missing ratio was closest to the assumed value.This selection process aimed to find the dataset that best represented the expected level of missingness for subsequent analysis or modeling.
Below, we provide detailed descriptions of the amputation schemes for MCAR, MAR, and MNAR.In each scheme, the dataset is amputed to ensure that the total number of missing values falls within a 40% threshold, and the missing values are distributed randomly across the columns.

MCAR
MCAR mechanism is implemented within ampute_MCAR function as follows: 1. Generate a random vector from a multinomial distribution to determine the number of missing values per column.
2. Iteratively distribute the excess number of missing values above the threshold among the columns.
3. Randomly ampute values according to the assigned numbers of missing values per column.

MAR
Our implementation within ampute_MAR function follows the steps: 1. Generate a random vector from a multinomial distribution to determine the number of missing values per column, excluding one randomly selected complete column.
2. Calculate the excess of missing values above the threshold.
3. Enter a loop that continues as long as there is an excess of missing values.
• Randomly select columns with missing values below the threshold and increment their missing values by the total sum of excess values.
• Set the missing values in columns exceeding the threshold to the threshold value.
4. Once the number of missing values for all columns is determined, we proceed to perform data amputation and for each column separately we execute the following steps: • Select a random number k of columns without missing values to sample from.
• Assign missing values to the corresponding rows in the column, based on the order of the scaled sum.

MNAR
Metabolomics version of MNAR mechanism is related to LOD (limit of detection, where the missing values are measurements of intensity below the sensitivity of the measuring device).The function ampute_MNAR follows the following steps: 1. Generate a random vector from a multinomial distribution to determine the number of missing values per column.
2. Iteratively distribute the excess number of missing values above the threshold among the columns.
3. Remove the observations with the lowest values in each column.This ensures that the missing values are introduced in a systematic manner based on the order of values within each column.

Mixtures
In cases where multiple mechanisms are present, the data is amputed independently for different missing value patterns via the function simulate_miss_value.It is possible that more than one pattern attempts to remove the same observation, resulting in an overall missing ratio lower than the assumed value.

SI3.3 Imputation
In order to impute missing values in datasets, we utilized the implementations of the methods listed in Table S5.Each technique was invoked following the safe imputation strategy.This strategy involves attempting to impute the missing values using the provided imputing function running it with the default parameters.The process is repeated a maximum of 3 times, or until successful imputation is achieved.If successful imputation is achieved, the function returns the imputed dataset.However, if imputation fails in all attempts, the function returns the original dataset without any modifications.By employing safe imputation strategy, the function aims to offer a reliable imputation solution while taking into account potential challenges or errors that may arise during the imputation process.Moreover, each method has 1 minute time limit for evaluation.Thus, we identify three different types of imputation error: • computional -method failed to impute data in each iteration of safe imputation, i.e. the data still contains missing values, • modification -considerable modification of the original observations that were not missing (difference above 2.220446e − 15), • timeout -method failed to impute data in two minutes.
We have not scaled the data before the imputation.

SI4 Performance measure: NRMSE
In our study, we employed normalized root mean squared error (NRMSE) as the chosen performance measure.
Let x = (x 1 , . . ., x n ) denote a vector of all observed values that have been removed from the data where n is a total number of missing observations.Let x = (x 1 , . . ., xn ) be a vector of imputed values corresponding to x.
We have chosen NRMSE, because it allows comparison of MVIAs regardless of the scale in the original data.

SI6 Summary of simulation results
To establish the best-performing MVIAs for specific missing value scenarios, we have filtered out the method with stability of 0.99 or better.Next, we excluded the results for scenarios allowing mixtures of missing values.For scenarios with only a single source of missing values, we computed median NRMSE (Supplementary Table S4 and Supplementary Figure S6).The outcomes from our simulation highlight that there is no singular MVIA that universally proves effective for every type of missing value MV.Instead, the selection of MVIAs should follow a meticulous analysis of the specific datasets under study.According to our findings, for the MCAR type, the most favorable results, ranked from the lowest NRMSE to higher, were observed for metabimpute_rf, eucknn, missmda_em, ppca, bpca.In the case of MAR, the optimal choices were missmda_em, bpca, svd, ppca, eucknn.Lastly, for MNAR, the top-performing methods were metabimpute_min, min, halfmin, metabimpute_halfmin, cm.For further insights into the usage of imputomics package, refer to the vignette.

SI8 imputomics web server
The imputomics application serves as an intuitive online graphical interface for the imputomics package.It is tailored to simplify the management of missing data using sophisticated imputation techniques, eliminating the requirement for programming skills.It is available as a web server and within R package under the function imputomics_gui().
Within the imputomics interface, users have the option to upload their own data or utilize pre-prepared sample data provided by us, accessible through the "Example Data" button.When uploading your data, choose the appropriate representation for missing values, such as "0," "1," or "NA."For instance, if zeros signify missing values in your dataset, select "0".Within the "Manage Variables" tab, you are able to specify columns that should be excluded from the imputation process.This involves marking columns that, for instance, represent categorical variables, such as biological groups, but were inadvertently loaded as numerical data.
Figure S8: Managing variables after uploading.
Within the "Imputation Dataset" tab, you can review the variables that will be included in the imputation process.Moving on to the "Visualization" panel, you gain insights into the distribution of missing observations throughout the entire dataset.This includes details such as the percentage of gaps in each column and the overall pattern of missing data.The charts' appearance can be tailored to your preferences, and you have the option to download them using the "Download" button.Subsequently, within the "Missing Values Analysis" tab, you can execute the removal of variables based on the 80% rule, meaning variables with less than 80% observed values will be excluded.Furthermore, when selecting a column describing biological groups, you are able to eliminate variables with less than 80% observations in each group.The application also provides the option to specify a threshold other than 80%.
Utilizing the switches "Add the 10 fastest methods" and "Add the 10 best-performing methods," you can effortlessly augment the list of selected methods with those that have demonstrated superior speed or performance in terms of NRMSE (Normalized Root Mean Squared Error) and a high frequency of successful imputation in our simulations.
Moreover, you have the option to choose 5 methods that have proven to be the most effective in scenarios involving specific missing value patterns, such as MCAR, MAR and MNAR.
In the "Results" tab, you can inspect the data frame post-imputation for each method.Meanwhile, within the "Summary" tab, a chart is available illustrating the distribution of imputed data in contrast to the observed data.Finally, within the "Download" tab, you have the option to download an Excel spreadsheet containing the imputation results for each method, presented in individual sheets.

Figure S1 :
Figure S1: Scheme for gathering and curating articles.

Figure S3 :
Figure S3: Scheme of the dataset selection, data amputation and MVIAs benchmark.
Figure S4: Selected MW datasets within a computed grid (red points).Brown and oceanblue points represent datasets without and with missing values (respectively).The contours and shades represent the 2D kernel estimator of density.The dashed line represents the threshold of 5 samples and metabolites, respectively.Both X-and Y-axis are presented in the log-scale.

Figure S5 :
Figure S5: Selected datasets (green) with a computed grid (red) in log-scale.

Figure S7 :
Figure S7: Uploading data in the imputomics web server

Figure S10 :
Figure S10: Visualization of missing values frequency.Percentage occurrence of missing values per each column.

Figure S11 :
Figure S11: Visualization of missing values pattern.Heatmap of missing values against the background of entire dataset.

Figure S12 :
Figure S12: Visualization of missing values pattern.Heatmap of missing values against the background of entire dataset.

Figure S14 :
Figure S14: Results of imputation contained in the table.

Figure S15 :
Figure S15: Visualisation of the results: imputed data againts observed data.

Figure S16 :
Figure S16: Downloading the results

Table S3 :
Imputation results performed on simulated data.The MCAR, MAR, MNAR, total MV columns indicate the fraction of missing values of a given type in the simulated data.The NRMSE column represents the Normalized Root Mean Square Error, which quantifies the accuracy of the imputation method compared to the true values.If the algorithm failed to converge, NRMSE has NA value.The computation time column denotes the time taken by the imputation algorithm to process the data.Finally, the fraction of successful computation column displays the proportion of cases where the imputation process was completed successfully without errors.

Table S3 :
Imputation results performed on simulated data.The MCAR, MAR, MNAR, total MV columns indicate the fraction of missing values of a given type in the simulated data.The NRMSE column represents the Normalized Root Mean Square Error, which quantifies the accuracy of the imputation method compared to the true values.If the algorithm failed to converge, NRMSE has NA value.The computation time column denotes the time taken by the imputation algorithm to process the data.Finally, the fraction of successful computation column displays the proportion of cases where the imputation process was completed successfully without errors.(continued)

Table S3 :
Imputation results performed on simulated data.The MCAR, MAR, MNAR, total MV columns indicate the fraction of missing values of a given type in the simulated data.The NRMSE column represents the Normalized Root Mean Square Error, which quantifies the accuracy of the imputation method compared to the true values.If the algorithm failed to converge, NRMSE has NA value.The computation time column denotes the time taken by the imputation algorithm to process the data.Finally, the fraction of successful computation column displays the proportion of cases where the imputation process was completed successfully without errors.(continued)

Table S3 :
Imputation results performed on simulated data.The MCAR, MAR, MNAR, total MV columns indicate the fraction of missing values of a given type in the simulated data.The NRMSE column represents the Normalized Root Mean Square Error, which quantifies the accuracy of the imputation method compared to the true values.If the algorithm failed to converge, NRMSE has NA value.The computation time column denotes the time taken by the imputation algorithm to process the data.Finally, the fraction of successful computation column displays the proportion of cases where the imputation process was completed successfully without errors.(continued)

Table S3 :
Imputation results performed on simulated data.The MCAR, MAR, MNAR, total MV columns indicate the fraction of missing values of a given type in the simulated data.The NRMSE column represents the Normalized Root Mean Square Error, which quantifies the accuracy of the imputation method compared to the true values.If the algorithm failed to converge, NRMSE has NA value.The computation time column denotes the time taken by the imputation algorithm to process the data.Finally, the fraction of successful computation column displays the proportion of cases where the imputation process was completed successfully without errors.(continued)

Table S3 :
Imputation results performed on simulated data.The MCAR, MAR, MNAR, total MV columns indicate the fraction of missing values of a given type in the simulated data.The NRMSE column represents the Normalized Root Mean Square Error, which quantifies the accuracy of the imputation method compared to the true values.If the algorithm failed to converge, NRMSE has NA value.The computation time column denotes the time taken by the imputation algorithm to process the data.Finally, the fraction of successful computation column displays the proportion of cases where the imputation process was completed successfully without errors.(continued)

Table S3 :
Imputation results performed on simulated data.The MCAR, MAR, MNAR, total MV columns indicate the fraction of missing values of a given type in the simulated data.The NRMSE column represents the Normalized Root Mean Square Error, which quantifies the accuracy of the imputation method compared to the true values.If the algorithm failed to converge, NRMSE has NA value.The computation time column denotes the time taken by the imputation algorithm to process the data.Finally, the fraction of successful computation column displays the proportion of cases where the imputation process was completed successfully without errors.(continued)

Table S4 :
Summary of simulation results.The MCAR, MAR, MNAR columns contain average NRMSE of each method for appropriate type of MV.