NAguideR: performing and prioritizing missing value imputations for consistent bottom-up proteomic analyses

Abstract Mass spectrometry (MS)-based quantitative proteomics experiments frequently generate data with missing values, which may profoundly affect downstream analyses. A wide variety of imputation methods have been established to deal with the missing-value issue. To date, however, there is a scarcity of efficient, systematic, and easy-to-handle tools tailored for the proteomics community. Herein, we developed a user-friendly and powerful stand-alone software, NAguideR, to enable the implementation and evaluation of 23 widely used missing-value imputation algorithms. NAguideR further evaluates data imputation results through classic computational criteria and, unprecedentedly, proteomic empirical criteria, such as quantitative consistency between different charge states of the same peptide, different peptides belonging to the same protein, and individual proteins participating in protein complexes and functional interactions. We applied NAguideR to three label-free proteomic datasets featuring peptide-level, protein-level, and phosphoproteomic variables respectively, all generated by data-independent acquisition mass spectrometry (DIA-MS) with substantial biological replicates. The results indicate that NAguideR is able to distinguish the optimal imputation methods that facilitate DIA-MS experiments from sub-optimal and low-performance algorithms. NAguideR further provides downloadable tables and figures supporting flexible data analysis and interpretation. NAguideR is freely available at http://www.omicsolution.org/wukong/NAguideR/ and the source code is available at https://github.com/wangshisheng/NAguideR/.


II. Supplementary tables and figures
Table S1. Description of 23 missing value imputation methods. Table S2. The summary of NAguideR tested on different operating systems and browsers.

I. Supplementary notes
NAguideR integrates 23 commonly used missing value imputation methods (described in Table S1) and provides two categories of evaluation criteria (four classic computational criteria and four empirical proteomics criteria) to assess the imputation performance of the various methods. Here we present a detailed introduction to NAguideR and its operation, which users can follow to analyze their own data freely and conveniently.
Users can visit this site: http://www.omicsolution.org/wukong/NAguideR. The website homepage is then shown as below:
The two input files required here (expression data and sample information data) can be readily generated from the results of several popular tools such as MaxQuant (20), PEAKS (21), Spectronaut (22), DIA-NN (23), OpenSWATH (24), and so on. Users can then upload both files to NAguideR in the correct formats and start the subsequent analysis.

Expression data
There are currently four types of proteomics expression data supported in NAguideR (i.e., 'Peptides+Charges+Proteins', 'Peptides+Charges', 'Peptides+Proteins', 'Proteins'), which differ mainly in their first few columns. In addition, users may upload other kinds of omics data (e.g., genomics, metabolomics), for which they only need to choose the fifth type ('Others').
Please note that the fifth type cannot generate results based on the proteomic criteria.
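The relationship between the five input types and their leading identifier columns can be sketched as follows. This is a minimal illustration, not NAguideR code: the `ID_COLUMNS` mapping, the `split_row` helper, the sample names, and the peptide and protein values in the example table are all hypothetical.

```python
import csv
import io

# Number of leading identifier columns for each input type named above.
# This mapping is an illustrative assumption derived from the text.
ID_COLUMNS = {
    "Peptides+Charges+Proteins": 3,
    "Peptides+Charges": 2,
    "Peptides+Proteins": 2,
    "Proteins": 1,
    "Others": 1,
}


def split_row(row, data_type):
    """Split one table row into (identifier columns, intensity columns)."""
    n_ids = ID_COLUMNS[data_type]
    return row[:n_ids], row[n_ids:]


# Hypothetical two-sample table in the 'Peptides+Charges+Proteins' layout.
example = io.StringIO(
    "Peptide,Charge,Protein,Sample1,Sample2\n"
    "AAAGEFADDPCSSVK,2,P12345,1520.4,NA\n"
)
reader = csv.reader(example)
header = next(reader)
ids, intensities = split_row(next(reader), "Peptides+Charges+Proteins")
print(ids)          # identifier columns
print(intensities)  # per-sample intensities, still as text
```

Everything after the identifier columns is treated as one intensity column per sample, which is the only structural difference between the input types.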

Expression data with peptide sequences, peptide charge states, and protein ids
In this situation, peptide sequences, peptide charge states, and protein ids are sequentially provided in the first three columns of the input file. Peptide sequences in the first column can be peptides with any post-translational modification (PTM, written in any routine format) or stripped peptides (sequences without PTM). The second column contains the peptide charge states. The protein ids in the third column should be UniProt ids. From the fourth column onward, the peptide/protein expression intensity or signal abundance in every sample should be listed. The data structure is shown as below:

Expression data with peptide sequences and peptide charge states
Similar to the above situation, peptide sequences and peptide charge states are sequentially provided in the first two columns of the input file. Peptide sequences in the first column can be peptides with post-translational modification (PTM, written in any routine format) or stripped peptides (without PTM). The second column contains the peptide charge states. From the third column onward, the peptide/protein expression intensity or signal abundance in every sample should be listed. The data structure is shown as below:

Expression data with peptide sequences and protein ids
Under this circumstance, peptide sequences and protein ids are sequentially provided in the first two columns of the input file. Peptide sequences in the first column can be peptides with post-translational modification (PTM, written in any routine format) or stripped peptides (without PTM). The protein ids in the second column should be UniProt ids. From the third column onward, the peptide/protein expression intensity or signal abundance in every sample should be listed. The data structure is shown as below:

Expression data with protein ids
In this situation, protein ids are provided in the first column of the input file. The protein ids here should be UniProt ids. From the second column onward, the protein expression intensity or signal abundance in every sample should be listed. The data structure is shown as below:

Other kinds of omics data
If users want to use NAguideR for other omics data (e.g., genomics, metabolomics), gene/metabolite ids or names should be provided in the first column of the input file. From the second column onward, the gene/metabolite expression intensity or signal abundance in every sample should be listed. The data structure may be shown as below:

Samples information data
Sample information here means the group identity of each sample, which users should provide. This information enables, for example, a filtering strategy to be applied to each group separately in a later step (see below). The sample names are in the first column, in the same order as in the expression data. Group information is in the second column. The data structure is shown as below:
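Because the sample names must appear in the same order in both files, a quick consistency check before uploading can save a failed run. The sketch below is a hypothetical helper (`check_sample_info` is not part of NAguideR) and assumes the expression header and the sample information rows have already been read into Python lists.

```python
def check_sample_info(expression_header, n_id_columns, sample_info):
    """Return a list of problems; an empty list means the two tables agree.

    expression_header: column names of the expression table.
    n_id_columns: number of leading identifier columns (1 to 3).
    sample_info: list of (sample_name, group) pairs in file order.
    """
    expr_samples = expression_header[n_id_columns:]
    info_samples = [name for name, _group in sample_info]
    problems = []
    if len(expr_samples) != len(info_samples):
        problems.append("different number of samples")
    # Compare names position by position, since the order must match.
    for position, (a, b) in enumerate(zip(expr_samples, info_samples), start=1):
        if a != b:
            problems.append(f"sample {position}: '{a}' != '{b}'")
    return problems


# Hypothetical example: two samples in two groups.
header = ["Protein", "S1", "S2"]
info = [("S1", "Control"), ("S2", "Treated")]
print(check_sample_info(header, 1, info))
```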

Download example datasets
If users want to download the example datasets to their own computer and check the data format locally, they can do so as follows. First, select "Load example data"; the example data will be shown interactively on the right panel.
Users can then inspect what the data look like.
Second, users can download the example data (expression data and sample information data) by clicking the corresponding button. The data are saved in .csv format and can be opened in other software, such as Excel.

Import Data
This is the first step, in which users should upload their data in one of the above formats or load the example data. By default, we use the example data to show the result of every step.

Uploading data.
When users have prepared their data (expression data and sample information data), they can upload them here. There are two main panels: the parameters panel, where users can adjust parameters, and the results panel, where the results obtained with those parameters are shown and can be downloaded.
In the parameters panel of "Import Data", there are two choices for users:
a. Load experimental data. When users choose this option, they can upload their own data. Users should select the right format for their data and then click the "Browse" button to import the data.
First row as column names: whether the first row contains the column names. If so, users should select this parameter.
First column as row names: whether the first column contains the row names. If so, users should select this parameter.
b. Load example data. As described in part 1.3, users can choose this option and download the example data to check them locally.
In the results panel of "Import Data", if users have not uploaded their data, the message "NAguideR detects that you did not upload your data. Please upload the expression data (or sample information data), or load the example data to check first" will be shown as a warning.
Before uploading expression data, users should also determine which type their data belong to and set the "The first few column types" parameter accordingly. The column types are described above (Data Preparation part).

NA Overview
In this step, users can check the missing values in their own data and filter out the items with a high proportion of missing values. Note that "NA" is short for Not Available and denotes a missing value here (see below).

1. Missing value type: what the missing values look like in the expression data. For example, Spectronaut (25,26) usually exports "Filtered" for missing values, so users should change this parameter to "Filtered" if their data contain "Filtered". NAguideR will recognize these characters and replace them with NAs. Any other characters indicating a missing value can be defined similarly.
2. Whether the NA ratio is calculated within each group. For example, suppose there are 2 groups (10 biological replicates in each group) and a peptide has 1 missing value in the first group and 5 in the second. If users select this parameter, NAguideR will calculate 2 NA ratios for this peptide (first group: 1/10=0.1, second group: 5/10=0.5); otherwise, only one NA ratio is calculated: 6/20=0.3.
3. NA ratio: the threshold of NA ratio. Those peptides/proteins with an NA ratio above this threshold will be removed.
4. Median normalization or not: if true, NAguideR will apply median normalization to the original data. (Note that NAguideR was not designed to perform sophisticated normalization analysis. Any normalized dataset with NAs can be accepted for analysis.)
5. Log or not: if true, the data will be transformed to the logarithmic scale with base 2.
6. CV threshold (raw scale): the threshold of the coefficient of variation (CV). Those peptides/proteins with a CV above this threshold will be removed. "Raw scale" here means that the CV of each peptide/protein is calculated using the data before logarithmic transformation.
Once these parameters are set, users click the "calculate" button and the results will appear on the right panel.
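The NA ratio and CV filters described in this list can be sketched in a few lines. This is a minimal illustration rather than NAguideR's implementation: the function names are hypothetical, and two behaviours are assumptions not stated in the text, namely that a row is removed as soon as any group exceeds the NA ratio threshold, and that the CV is computed over all observed values of a row.

```python
import statistics


def to_na(cell, na_token="NA"):
    """Convert a text cell to float, mapping the NA token (or blank) to None."""
    text = cell.strip()
    if text == "" or text == na_token:
        return None
    return float(text)


def na_ratios(values, groups, by_group):
    """Return one NA ratio per group if by_group is true, else one overall."""
    if not by_group:
        return {"all": sum(v is None for v in values) / len(values)}
    ratios = {}
    for group in dict.fromkeys(groups):  # keeps first-seen group order
        members = [v for v, g in zip(values, groups) if g == group]
        ratios[group] = sum(v is None for v in members) / len(members)
    return ratios


def cv_raw(values):
    """Coefficient of variation on the raw (non-log) scale, ignoring NAs."""
    observed = [v for v in values if v is not None]
    return statistics.stdev(observed) / statistics.mean(observed)


def keep_row(values, groups, na_threshold, cv_threshold, by_group):
    """Assumption: drop the row if ANY NA ratio or the CV exceeds its threshold."""
    ratios = na_ratios(values, groups, by_group)
    if any(r > na_threshold for r in ratios.values()):
        return False
    return cv_raw(values) <= cv_threshold


# The example from the text: 2 groups of 10 replicates, 1 and 5 missing values.
groups = ["A"] * 10 + ["B"] * 10
cells = ["Filtered"] + ["100"] * 9 + ["Filtered"] * 5 + ["110"] * 5
values = [to_na(c, na_token="Filtered") for c in cells]
print(na_ratios(values, groups, by_group=True))   # {'A': 0.1, 'B': 0.5}
print(na_ratios(values, groups, by_group=False))  # {'all': 0.3}
```

With a threshold of 0.4, this peptide would pass the overall filter (0.3) but fail the per-group filter (0.5 in the second group), which is why the choice matters for unbalanced missingness.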

Results of NA overview
a. NA Distribution. This part contains three sub-parts:
a.1 NA data. This shows the data in which the missing values, as defined by "Missing value type", are displayed as blank cells. Users can click the "Download" button to download this result to their own computer:
b. NA filter. This part shows the filtered result. That is, on the basis of the preset parameters (i.e., NA ratio, CV threshold), those objects (peptides/proteins/genes/metabolites) that do not meet the requirements are removed.
c. Input data check. This part shows a summary note on the input data.
By default, if more than half (>50%) of the objects remain in the filtered data, NAguideR considers this acceptable and gives users a message like the one below:
Otherwise, NAguideR gives some warnings, which means users should pay more attention to their own data and the preset parameters. Users are advised to make sure that there are no problems before proceeding to the next step:
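The default acceptance rule of the input data check described above reduces to a single comparison. The sketch below uses a hypothetical function name and assumes the rule is evaluated on object counts before and after filtering.

```python
def input_data_check(n_before, n_after, min_fraction=0.5):
    """Return True if more than min_fraction of the objects survive filtering."""
    if n_before <= 0:
        raise ValueError("n_before must be positive")
    return n_after / n_before > min_fraction


# Hypothetical counts: 1000 peptides uploaded, 620 left after filtering.
print(input_data_check(1000, 620))
```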

Methods
In this step, users can select any of the 23 missing value imputation methods that are currently supported. All methods have been classified into three categories based on their algorithms (single value approaches, global structure approaches, and local similarity approaches). To control the running time, the fast methods (17 methods) are selected by default. The slow methods (6 methods) are not selected by default because they lengthen the running time; users who want to try them just need to select the corresponding methods. Detailed information about each method can be found in Table S1. In addition, we provide the reference for every method just below each option on the web:
After selecting suitable methods, users need to click the 'Calculate' button. A popup window will then show the selected methods; click the 'OK' button to continue:

Results and Assessments
This step performs missing value imputation and performance evaluation for every method that users selected in the "Methods" step. After users click "Results and Assessments", NAguideR will start to impute the missing values, and a progress bar will appear in the bottom right corner to show how far the calculation has advanced:
The result from every imputation method will be shown on the "Results" panel (Step 4, left panel). The figure example below shows the results of the method "zero".
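To make the simplest case concrete, the sketch below shows a single value approach, assuming that the "zero" method simply replaces every missing value with 0, as its name suggests. The function name and the example matrix are hypothetical, and missing values are represented as None.

```python
def impute_constant(matrix, fill_value=0.0):
    """Replace every missing entry (None) with one constant value."""
    return [[fill_value if v is None else v for v in row] for row in matrix]


# Hypothetical 2 x 3 intensity matrix with two missing values.
data = [
    [10.2, None, 9.8],
    [None, 7.5, 7.9],
]
print(impute_constant(data))
```

The input matrix is left unchanged, so the values before and after imputation can still be compared side by side.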
Next, click 'Final check' to review the final imputation results as a summary note. NAguideR will recheck the scores under every criterion. If everything is acceptable (see below), NAguideR will show a message like:
Here, NAguideR performs a simple check of whether the imputation methods differ substantially under more than half of the criteria. By default, NAguideR checks the fold change between the maximum score and the minimum score for each criterion; a fold change below 2 suggests that there is no big difference under the corresponding criterion, i.e., that NAguideR cannot provide significantly discriminating guidance on NA method selection. When the methods do not differ substantially, NAguideR will give some warnings and possible solutions so that users can review or re-calculate the imputation results:
Last but not least, NAguideR implements one optional function, 'Targeted check', which is designed for biologists with specific experimental aims. For example, this feature allows users to directly visualize the results for a particular peptide or protein (e.g., spiked-in standard peptides or proteins, or known housekeeping proteins such as beta-actin). Following their experimental design, users can type the peptide sequence or protein id in the text area and click the 'Check' button.
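The default fold-change rule of the 'Final check' step can be sketched as follows. This is an illustration under stated assumptions, not NAguideR's code: the function name and data layout are hypothetical, scores are assumed to be strictly positive, and the check is assumed to pass when a fold change of at least 2 is found under more than half of the criteria.

```python
def final_check(scores_by_criterion, min_fold_change=2.0):
    """Report whether the methods differ under more than half of the criteria.

    scores_by_criterion: {criterion name: {method name: score}}.
    Returns (passed, list of criteria with a fold change >= min_fold_change).
    """
    discriminating = []
    for criterion, scores in scores_by_criterion.items():
        lowest, highest = min(scores.values()), max(scores.values())
        if lowest <= 0:
            # Assumption: a fold change is only meaningful for positive scores.
            raise ValueError(f"non-positive score under {criterion}")
        if highest / lowest >= min_fold_change:
            discriminating.append(criterion)
    passed = len(discriminating) > len(scores_by_criterion) / 2
    return passed, discriminating


# Hypothetical scores for two methods under three criteria.
scores = {
    "criterion_1": {"method_a": 1.0, "method_b": 3.0},
    "criterion_2": {"method_a": 1.0, "method_b": 1.5},
    "criterion_3": {"method_a": 2.0, "method_b": 4.0},
}
print(final_check(scores))
```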
After the 'Check' button is clicked, NAguideR locates this peptide or protein id in the input and resulting matrices. If the peptide/protein is not listed in the user's input data, NAguideR gives the message "Target protein/peptide not found. Please make sure the item is included in the input table" (example 1 below). If the peptide/protein is found, NAguideR shows the results before and after imputation as bar plots and provides the note "Target protein/peptide was missed in N=X samples among all N=Y samples" (example 2 below). This plot should help users inspect the results in light of their particular experimental design. If the target protein/peptide is quantified in every sample and needs no NA imputation, NAguideR will still display the bar plots and provide the note "Target protein/peptide was not missed in any sample" (example 3 below).
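The three outcomes of the 'Targeted check' can be sketched as a small lookup. The message texts are taken from the description above; the function name and data layout are hypothetical, missing values are represented as None, and the bar plots are omitted.

```python
def targeted_check(target, row_ids, matrix):
    """Return the note for one target peptide/protein.

    row_ids: identifiers of the rows of the input matrix.
    matrix: rows of intensities before imputation, with None for NA.
    """
    if target not in row_ids:
        return ("Target protein/peptide not found. "
                "Please make sure the item is included in the input table")
    row = matrix[row_ids.index(target)]
    n_missing = sum(v is None for v in row)
    if n_missing == 0:
        return "Target protein/peptide was not missed in any sample"
    return (f"Target protein/peptide was missed in N={n_missing} samples "
            f"among all N={len(row)} samples")


# Hypothetical input with two proteins and four samples.
ids = ["P60709", "P12345"]
data = [[5.1, None, 4.9, None], [3.3, 3.1, 3.4, 3.2]]
print(targeted_check("P60709", ids, data))
```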
Example 1 ("Target protein/peptide not found. Please make sure the item is included in the input table")

Help
This part provides a brief introduction to NAguideR and an operation manual, so that users can quickly learn and start using the tool.

How to run this tool locally?
NAguideR is open-source software for non-commercial use and all code can be obtained on

II. Supplementary tables and figures
Table S1. Description of 23 missing value imputation methods.
The optimal combination is found by LLS
Note: 'Total number' here means the number of identified peptides/proteins in each dataset. 'Missing value number' means the number of quantified peptides/proteins with a missing value in at least one sample; the number in parentheses is the rate of missing values relative to 'Total number'. 'Number after filtered' means the number of quantified peptides/proteins after removing those with a high proportion of missing values or a high coefficient of variation (e.g., those peptides/proteins with a 50% proportion of missing values or a coefficient of variation above 30% will be removed).
Figure S1. Distribution of the time consumption of each imputation method. Results were obtained from the ProtSWATH dataset, only to demonstrate the speed difference between methods. We repeated every method 100 times. Note that the time is just a reference for users because it is also related to data size and internet status (or computer hardware configuration if running