3D-MSNet: a point cloud-based deep learning model for untargeted feature detection and quantification in profile LC-HRMS data

Abstract
Motivation: Liquid chromatography coupled with high-resolution mass spectrometry is widely used for composition profiling in untargeted metabolomics research. While retaining complete sample information, mass spectrometry (MS) data naturally have the characteristics of high dimensionality, high complexity, and huge data volume. Among mainstream quantification methods, none can perform direct 3D analysis on lossless profile MS signals. All existing software simplifies the computation by dimensionality reduction or lossy grid transformation, ignoring the full 3D signal distribution of MS data and resulting in inaccurate feature detection and quantification.
Results: On the basis that neural networks are effective for high-dimensional data analysis and can discover implicit features in large amounts of complex data, we propose 3D-MSNet, a novel deep learning-based model for untargeted feature extraction. 3D-MSNet performs direct feature detection on 3D MS point clouds as an instance segmentation task. After training on a self-annotated 3D feature dataset, we compared our model with nine popular software tools (MS-DIAL, MZmine 2, XCMS Online, MarkerView, Compound Discoverer, MaxQuant, Dinosaur, DeepIso, PointIso) on three public benchmark datasets (two metabolomics, one proteomics). Our 3D-MSNet model outperformed the other software with significant improvements in feature detection and quantification accuracy on all evaluation datasets. Furthermore, 3D-MSNet is highly robust in feature extraction and can be widely applied to profile MS data acquired with various high-resolution mass spectrometers at various resolutions.
Availability and implementation: 3D-MSNet is an open-source model and is freely available at https://github.com/CSi-Studio/3D-MSNet under a permissive license. Benchmark datasets, training dataset, evaluation methods, and results are available at https://doi.org/10.5281/zenodo.6582912.

Table S1. The loss functions of 3D-MSNet. Semantic loss is the loss of the feature semantic prediction branch. Center loss is the loss of the feature center prediction branch. Polar mask loss is the loss of the polar mask prediction branch. The total loss of 3D-MSNet is the weighted sum of the semantic loss, the center loss, and the polar mask loss.

Table S2. The accuracy functions of 3D-MSNet. These accuracy functions were used in the calculation of the accuracy curve. Semantic accuracy is the accuracy of the feature semantic prediction branch. Center accuracy is the accuracy of the feature center prediction branch. Polar mask accuracy is the accuracy of the polar mask prediction branch.

Table S3. Optimized parameters used in the TripleTOF 6600 dataset evaluation.

In the matching of the high-confidence features against the feature detection results, each high-confidence feature may match more than one detected feature, which is often attributed to duplicate or erroneous feature extractions. 3D-MSNet achieved the best feature extraction accuracy among the compared software, with the lowest multi-match ratio in most runs (56 of 57) and the lowest average multi-match ratio across all 12 samples.

To perform comprehensive comparisons of 3D-MSNet, we also evaluated proteomics analysis software on the metabolomics dataset. However, the proteomics analysis software did not achieve competitive results. Since the data distributions differ between the proteomics and metabolomics datasets, we believed it was unfair to present their inferior performance on the metabolomics datasets in the main text.

The results are summarized below for readers' reference; they also show that 3D-MSNet can support the analysis of metabolomics and proteomics datasets at the same time. The specific results are shown in the table below.

Table S11. Differences between 3D-MSNet and PointIso.
Considering that readers will be interested in the differences between the two point-cloud-based deep learning methods, 3D-MSNet and PointIso, we present a comparison table below. 3D-MSNet and PointIso were developed with different intentions.

Appendix S1. Marker selection criteria on the metabolomics datasets.
For differential marker selection from the metabolomics data, we changed the original criteria of the benchmark study (FC < 0.5 or FC > 2) and replaced them with "consensus features whose fold-changes fall outside a 20% tolerance of the range (0.5, 2)".
Although the original criteria are widely used for marker selection, they are a compromise that helps the initial screening of potential markers when the distribution of relative quantification results is unknown. In this study, we know that there are two groups of compounds (Gd3, Gd4) whose fold-changes are exactly equal to 0.5 and 2. Following the original criteria, about half of the compounds in these two groups will be rigidly classified as "non-markers" because of the bias caused by sample preparation and LC-MS acquisition, and the number of markers can no longer be used as a measure of software performance. For example, suppose there are 30 compounds in Gd4 and their ground-truth fold-changes are distributed between 1.8 and 2.2. Accounting for the bias of LC-MS acquisition, we assume that 10 compounds have a fold-change less than 2 and 20 compounds have a fold-change greater than 2. Screening with the original criteria, software A detects 18 markers and software B detects 22 markers. What can be inferred from these marker numbers? Is software B better than software A? We cannot tell. Under the original criteria, the quantification bias of the LC-MS system prevents the number of markers from serving as a measure of software accuracy. But if we relax the criteria, for example by adjusting the threshold to 1.7, outside the bias distribution interval, and software A detects 27 markers while software B detects 29 markers, then the number of markers clearly reflects that B is better than A.
Considering that the commonly used quantification deviation is 20%, and that we also used a 20% tolerance when selecting accurately quantified features, we added a 20% relaxation to the original criteria [0, 2^(-1)] ∪ [2^1, +∞) and obtained the criteria used in this study, [0, 2^(-(1-0.2))] ∪ [2^(1-0.2), +∞). Using the relaxed criteria, the number of markers can be directly used as a measure of software accuracy. The relaxed criteria are therefore more meaningful for software comparisons than the original criteria.
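For illustration, the relaxed thresholds evaluate to 2^(1-0.2) ≈ 1.74 on the upper side and 2^(-(1-0.2)) ≈ 0.57 on the lower side. The following minimal Python sketch (the function name and constant are ours and not part of the released code) shows how a consensus feature's fold-change could be screened under the relaxed criteria.

RELAXATION = 0.2  # 20% relaxation of the exponent, as described above

def is_marker(fold_change, relaxation=RELAXATION):
    """Return True if a fold-change passes the relaxed criteria
    [0, 2^(-(1-relaxation))] U [2^(1-relaxation), +inf)."""
    upper = 2 ** (1 - relaxation)       # ~1.74
    lower = 2 ** -(1 - relaxation)      # ~0.57
    return fold_change <= lower or fold_change >= upper

# A Gd4 compound measured at FC = 1.8 is kept as a marker under the relaxed
# criteria, whereas the original FC > 2 rule would reject it.
print(is_marker(1.8))   # True
print(is_marker(0.9))   # False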
Appendix S2. Feature matching method in evaluation of the proteomics Orbitrap XL dataset.
In the comparison on the proteomics dataset, we matched the feature detection results of the different software with the high-confidence identifications of MASCOT. When a feature detection result only has center RT and m/z information, we performed matching according to m/z and RT tolerances. When a feature detection result contains the m/z and RT distribution ranges of the feature, we filtered again based on this information and only kept features whose m/z and RT ranges contained the MASCOT result. Because the high-confidence identifications were obtained by scoring MS2 spectra, we believed that only features containing the precursor m/z and RT of the MS2 spectra were correct matches.
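A minimal Python sketch of this matching logic is shown below; the field names and tolerance values are illustrative assumptions, not the exact implementation used in our evaluation scripts.

MZ_TOL = 0.01   # Da, assumed m/z tolerance
RT_TOL = 2.0    # min, assumed RT tolerance

def is_correct_match(feature, identification, mz_tol=MZ_TOL, rt_tol=RT_TOL):
    """Match a detected feature against a MASCOT identification.

    feature: dict with center 'mz' and 'rt' and, when the software reports
             them, 'mz_min'/'mz_max'/'rt_min'/'rt_max' range fields.
    identification: dict with the precursor 'mz' and 'rt' of the MS2 spectrum.
    """
    # Step 1: center-based matching with m/z and RT tolerances
    if abs(feature['mz'] - identification['mz']) > mz_tol:
        return False
    if abs(feature['rt'] - identification['rt']) > rt_tol:
        return False
    # Step 2: if m/z and RT ranges are reported, require the identification
    # to fall inside the reported feature boundaries
    if 'mz_min' in feature and 'rt_min' in feature:
        if not feature['mz_min'] <= identification['mz'] <= feature['mz_max']:
            return False
        if not feature['rt_min'] <= identification['rt'] <= feature['rt_max']:
            return False
    return True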
In the result matching of PointIso, the detection rate of PointIso (94.19%) in Fig. 5a of this study is not consistent with that reported in the PointIso paper ("an average detection rate of PointIso is 98.01% across 12 samples"). With the filtering method described above, PointIso indeed achieved a detection rate of about 98% before filtering by RT range; after filtering, it dropped to about 94%, as shown in Fig. 5a. In the evaluation of the proteomics software, only MaxQuant was exempted from RT range filtering, since we did not find RT range information in its result table.
Appendix S3. Evaluation on the Orbitrap XL dataset with the targeted peptide library.

Dataset
The Orbitrap XL dataset contains 12 samples at different dilution levels. Each sample has 4 to 7 replicate injections and is composed of synthetic potato peptides, synthetic human peptides, and non-varying Streptococcus pyogenes strain SF370 background peptides. The dilution levels are listed in the following table.

Library
The Orbitrap XL dataset did not provide a spike-in and background peptide library with peptide m/z and RT in its paper or on the project FTP. However, the Orbitrap XL dataset has a twin targeted dataset, which was acquired with a targeted SRM (selected reaction monitoring) method using the same LC platform. The SRM dataset has a peptide library, which contains 47 synthetic potato peptides, 78 synthetic human peptides, and 29 background peptides.
To evaluate the feature detection and quantification performance on the Orbitrap XL dataset, we tried to use the targeted SRM library for result matching.

Alignment
In the evaluations of the metabolomics datasets, we did not introduce additional alignment methods, since those datasets have fewer injections (8 and 10) and almost no RT bias. However, the Orbitrap XL dataset has 57 runs, and its RT shifts are too obvious to ignore.
We performed alignment with G-Aligner, a self-developed hybrid non-centric alignment method, which first performs coarse RT warping and then performs fine assignment based on graph theory and discrete optimization to achieve feature-to-feature alignment. The alignment procedure was performed in two steps. In the first step, we aligned the replicates of each of the 12 samples. In the second step, we aligned the 12 samples based on the median m/z and RT of the aligned replicates and assembled the alignment results. We used the same set of parameters for all software results.
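The two-step procedure can be illustrated with the following conceptual Python sketch. It only shows how aligned replicate groups are summarized by their median m/z and RT and then aligned across samples; it does not reproduce G-Aligner's internal RT warping or graph-based assignment, and the align_runs callback stands in for the actual aligner.

import statistics

def two_step_alignment(samples, align_runs):
    """samples: dict mapping sample name -> list of runs, where each run is a
    list of feature dicts with 'mz' and 'rt'. align_runs performs
    feature-to-feature alignment across runs and returns matched groups."""
    # Step 1: align the replicate injections within each of the 12 samples
    per_sample_groups = {name: align_runs(runs) for name, runs in samples.items()}

    # Step 2: represent each aligned group by its median m/z and RT,
    # then align these representatives across the 12 samples
    representatives = []
    for name, groups in per_sample_groups.items():
        representatives.append([
            {'mz': statistics.median(f['mz'] for f in group),
             'rt': statistics.median(f['rt'] for f in group),
             'sample': name,
             'members': group}
            for group in groups
        ])
    return align_runs(representatives)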
To find features corresponding to the peptide library, we searched the feature extraction results with a fixed search window (0.01 Da m/z tolerance, 2 min RT tolerance). The m/z tolerance was the same as that used in the MASCOT high-confidence feature library evaluation. The RT tolerance was the same as the tolerance set in the targeted SRM acquisition, which was the narrowest RT range we could find in the dataset. In our experiments, the number of matches barely changed when a larger RT tolerance was used.
The matched features should not only be detected within the matching window but also follow the trend of the theoretical dilution concentrations. In the trend estimation step, the trend of each peptide was estimated by least-squares linear regression on part of the results: potato peptides on samples 1 to 6, human peptides on samples 7 to 12, and background peptides on all samples. The unselected results had low intensities, were hard to detect, and were prone to noise.
For multi-matched peptides, the feature detected in the most injections was selected first, and high intensity was used as the second priority.
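A minimal sketch of the trend check and multi-match resolution is shown below, with our own naming and a simplified acceptance rule (a positive regression slope against the theoretical concentrations); the exact acceptance rule used in the evaluation may differ.

import numpy as np

def follows_dilution_trend(intensities, concentrations):
    """Check that a peptide's measured intensities follow the theoretical
    dilution concentrations, using least-squares linear regression."""
    slope, _ = np.polyfit(concentrations, intensities, 1)
    return slope > 0

def resolve_multi_match(candidates):
    """Resolve a multi-matched peptide: prefer the feature detected in the
    largest number of injections, then the one with the higher intensity."""
    return max(candidates,
               key=lambda c: (c['num_detected_injections'], c['intensity']))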

Normalization
In the evaluation of the metabolomics datasets, we did not introduce additional normalization methods since the quantification results were stable. However, the intensity varies considerably in the Orbitrap XL dataset, even among replicates of the same sample.
To eliminate the intensity deviation caused by pretreatment and the LC-MS system, we estimated the intensity trend of the background peptides among injections. The trend estimation was performed in two steps. In the first step, we calculated the trend of each background peptide, which is its quantification results across all samples divided by their median. In the second step, the overall trend was estimated by taking the median of the background peptide trends.
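A minimal sketch of this two-step trend estimation, assuming the background-peptide quantification results are arranged as a peptides-by-runs matrix, is shown below.

import numpy as np

def estimate_normalization_trend(background_intensities):
    """background_intensities: array of shape (n_background_peptides, n_runs)
    holding the quantification result of each background peptide in each run.
    Returns the per-run intensity trend."""
    # Step 1: per-peptide trend = intensities divided by that peptide's median
    peptide_trends = background_intensities / np.median(
        background_intensities, axis=1, keepdims=True)
    # Step 2: overall trend = median of the per-peptide trends in each run
    return np.median(peptide_trends, axis=0)

# Normalization then divides each run's quantification results by the trend
# value estimated for that run.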
The estimated trends of the compared software are shown in Figure A3.2. The trends of MaxQuant, Dinosaur, and 3D-MSNet were highly similar, supporting the high quantification accuracy of these tools. The trend of PointIso was less similar, which was caused by its less accurate quantification (Figures A3.3, A3.5, A3.7). The trend of DeepIso was different from that of MaxQuant, and the results of DeepIso could only be normalized with its own trend, which might be caused by a different quantification strategy. If normalized by other trends, for example the trend of MaxQuant, the quantification results of DeepIso showed larger deviations among replicates. We normalized each software's results with its own trend and present the results in Figure A3.3. Each colored line represents the intensity change of a matched peptide across the different samples. As shown in Figure A3.3, MaxQuant, Dinosaur, DeepIso, and 3D-MSNet produced results that complied with the distribution of the dilution levels. PointIso was significantly lower than the other methods in both the number of matches and quantification accuracy. PointIso had low quantification accuracy, resulting in few results that met the matching rules (see Appendix S1 4. Feature matching). Considered together with its high false positive rate (see Table S8), we believed that the current version of PointIso still needed more tuning.

Feature detection result evaluation
Compared with the feature detection rates presented in Fig. 5a, the detection rate rankings of DeepIso and PointIso in the matched results are different. This is because we screened out quantitatively biased results during feature matching (see Appendix S1 4. Feature matching). PointIso showed lower quantification accuracy, which led to a decline in its matching rate.

Feature quantification result evaluation
To evaluate the stability of quantification, we calculated the coefficient of variation (CV) of the normalized quantification results of matched features among replicates. The distributions of the CV values are shown in Figure A3.5. Each point represents the CV of the quantification results of a peptide across all replicates of the current sample. Since low-concentration features that are rarely detected introduce statistical errors, we only focus on the distributions of the samples with more detected features (Potato set: samples 1 to 7; Human set: samples 5 to 12; Background set: all samples). As shown in Figure A3.5, MaxQuant, Dinosaur, DeepIso, and 3D-MSNet had similar distributions. Among them, DeepIso had higher CV values.
For a clearer quantitative evaluation, we calculated the mean of the CV means among these samples (Potato set: samples 1 to 7; Human set: samples 5 to 12; Background set: all samples). The calculated results are summarized in Table A3.6. 3D-MSNet achieved the lowest CV value in all three sets, which indicates that 3D-MSNet had the highest quantification stability. The stability ranking is summarized as follows: 3D-MSNet > Dinosaur > MaxQuant > DeepIso > PointIso.
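A minimal sketch of how these stability metrics can be computed is shown below; whether the sample or population standard deviation was used is not specified in the text, so np.std's default is an assumption.

import numpy as np

def replicate_cv(areas):
    """Coefficient of variation (CV) of one peptide's normalized
    quantification results across the replicates of one sample."""
    areas = np.asarray(areas, dtype=float)
    return np.std(areas) / np.mean(areas)

def mean_of_cv_means(cvs_per_sample):
    """cvs_per_sample: dict mapping a sample to the list of per-peptide CV
    values in that sample. Returns the mean of the per-sample CV means."""
    return float(np.mean([np.mean(cvs) for cvs in cvs_per_sample.values()]))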
To evaluate the accuracy of quantification, we calculated the mean area across replicates to obtain the quantification result of each matched peptide in each group. Then, we divided the areas of adjacent samples to obtain fold changes (FC) and compared the fold changes with the theoretical values in Table A3.1.
For each measured fold change, we divided the fold change by the theoretical concentration ratio to obtain the FC ratio. The distributions of the FC ratios are shown in Figure A3.7. Each point in the figure represents the measured FC ratio of a peptide in two adjacent samples. The closer the FC ratio is to 1, the more consistent the quantification result is with the theoretical concentration ratio. Since low-concentration features that are rarely detected introduce statistical errors, we only focus on the distributions of the samples with more detected features (Potato set: samples 1 to 7; Human set: samples 5 to 12; Background set: all samples). As shown in Figure A3.7, MaxQuant, Dinosaur, DeepIso, and 3D-MSNet had similar distributions.
For a clearer quantitative evaluation, we calculated the mean quantification bias percentage (|FC ratio − 1| × 100) among these samples (Potato set: samples 1 to 7; Human set: samples 5 to 12; Background set: all samples). The calculated results are summarized in Table A3.8. 3D-MSNet obtained the lowest quantification bias percentage in the human set and the background set, and MaxQuant achieved the lowest in the potato set. The lower the quantification bias percentage, the higher the quantification accuracy. The accuracy ranking is summarized as follows: 3D-MSNet = MaxQuant > Dinosaur > DeepIso > PointIso.
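A minimal sketch of the FC ratio and quantification bias percentage calculations is shown below; the function names are ours, and the direction of the area division is assumed to follow the order of the adjacent samples.

def fold_change(mean_area_current, mean_area_adjacent):
    """Fold change between two adjacent samples, computed from the mean areas
    of a matched peptide across the replicates of each sample."""
    return mean_area_current / mean_area_adjacent

def quantification_bias_percentage(measured_fc, theoretical_fc):
    """Quantification bias percentage, defined as |FC ratio - 1| * 100,
    where FC ratio = measured FC / theoretical concentration ratio."""
    fc_ratio = measured_fc / theoretical_fc
    return abs(fc_ratio - 1.0) * 100.0

# Example: a measured FC of 1.8 against a theoretical ratio of 2.0 gives an
# FC ratio of 0.9 and a quantification bias percentage of about 10%.
print(quantification_bias_percentage(1.8, 2.0))  # ~10.0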

Conclusion
The rankings from the feature detection and quantification evaluations are summarized below: