LipidFinder 2.0: advanced informatics pipeline for lipidomics discovery applications

Abstract Summary We present LipidFinder 2.0, incorporating four new modules that apply artefact filters to remove lipid and contaminant stacks, in-source fragments and salt clusters, and a new isotope deletion method which is significantly more sensitive than available open-access alternatives. We also incorporate a novel false discovery rate method, utilizing a target–decoy strategy, which allows users to assess data quality. A renewed lipid profiling method is introduced which searches three different databases from LIPID MAPS and returns bulk lipid structures only, along with a lipid category scatter plot offering a color-blind-friendly palette. An API interface with XCMS Online is made available on LipidFinder's online version. We show using real data that LipidFinder 2.0 provides a significant improvement in filtering of non-lipid metabolites and in lipid profiling, compared to available tools. Availability and implementation LipidFinder 2.0 is freely available at https://github.com/ODonnell-Lipidomics/LipidFinder and http://lipidmaps.org/resources/tools/lipidfinder. Supplementary information Supplementary data are available at Bioinformatics online.


Supplementary Introduction
Liquid chromatography-mass spectrometry (LC/MS) is an essential method for discovery and characterisation of lipids in biological samples. Pre-processing is a required step in the lipidomics pipeline; however, most available informatics solutions are tailored for metabolomics rather than lipidomics specifically and thus retain large numbers of artefactual ions in datasets. This leads to major challenges for statistical analysis and renders robust identification of novel lipids impossible.
Liquid chromatography coupled to mass spectrometry (LC/MS) is one of the most popular analysis methods used for profiling of lipids in biological samples. This method (untargeted lipidomics) is widely used due to its high coverage of molecules, aiming to support discovery of biomarkers related to development, disease and therapeutic response (Covey, 1986). However, depending on the experimental methodology and the tissue analysed, output datasets can contain up to tens of thousands of ions/peaks (features). For example, with human platelets, raw datasets can comprise close to 60 K features, of which it is estimated only around 4-5 K are real lipids (Slatter, et al., 2016). The rest are common artefacts, e.g. isotope peaks, electrospray ionisation (ESI) contaminant ions, salt clusters, in-source fragments and others. Also, a single lipid may be represented by many features. If these are not removed, two main problems result. First, novel lipids cannot be easily distinguished from non-lipids. Second, when comparing datasets for biomarker discovery, statistical power is reduced. Thus, new computational methods are urgently required to help researchers cope with this significant Big Data problem.
In lipidomics/metabolomics (Weckwerth, 2003), the common data processing pipeline starts with interrogation of biological samples using LC/MS. The output can be represented in a 3D chromatogram: mass-to-charge ratio (m/z) vs retention time vs signal intensity. It is essential to detect all "features" in the data, that is, every unique representation of m/z and retention time, and return these in a usable format for further analysis. Hence, the raw dataset needs to be "preprocessed". For this, several tools have been implemented over the last 10-15 years: XCMS, MZmine and MetAlign are examples of open access options (Lommen and Kools, 2012; Pluskal, et al., 2010; Smith, et al., 2006). The last step of the pipeline is to match each feature to a known analyte, while also retaining unidentified ions for further analysis as potential new lipids. LC/MS is extremely useful for lipid discovery and comparative profiling. However, it is always recommended to follow LC/MS up with targeted methods such as tandem mass spectrometry (MS/MS) to validate the putative profiling returned by this "untargeted" lipidomics. Preprocessing tools detect peaks and extract every peak's key information from the raw LC/MS data (e.g. mass centroid position and area under the curve). Peak alignment corrects for minor changes in retention time arising from fluctuations of environmental temperature and humidity, and the health of chromatographic columns (Smith, et al., 2015; Zhang, et al., 2014). Insufficiently addressing alignment negatively impacts subsequent statistical analysis and identification of lipids/metabolites (Zhou, et al., 2012). Furthermore, some tools also include additional filters and methodologies to confirm that returned features correspond to known analytes. XCMS/CAMERA incorporates baseline and noise elimination, an isotope and adduct detection method, and putative identification of features (Zhou, et al., 2012).

Input dataset and configuration process
In contrast to LipidFinder 1.0, which only worked with XCMS and SIEVE™ (ThermoFisher) datasets, LipidFinder 2.0 works with input files from widely used preprocessing tools. Here, the input file layout is far more flexible: a first column with a unique identifier per feature, m/z and retention time columns, and all sample measurement columns together are the only requirements. All columns except the identifier can be placed anywhere in the dataset. Also, the file can contain extra columns that can be retained in the output. The list of supported formats has also been extended: in addition to comma-separated values (CSV), LipidFinder can now read tab-separated values (TSV) and Microsoft Excel spreadsheet (XLS and XLSX) files.
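As an illustration of these layout requirements, the following minimal sketch (not LipidFinder's actual reader; the column names "id", "mz" and "rt" are assumptions for illustration) parses a CSV or TSV feature table and checks that the required columns are present:

```python
import csv
import io

# Required columns: a unique identifier, m/z and retention time.
# These names are illustrative assumptions, not LipidFinder's real headers.
REQUIRED = {"id", "mz", "rt"}

def parse_features(text, fmt="csv"):
    """Parse a feature table and verify the flexible layout requirements:
    required columns may appear anywhere, and any extra columns (e.g.
    per-sample intensities) are simply retained."""
    delim = "\t" if fmt == "tsv" else ","
    rows = list(csv.DictReader(io.StringIO(text), delimiter=delim))
    missing = REQUIRED - set(rows[0].keys())
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    return rows
```

Excel input would follow the same pattern with a spreadsheet reader (e.g. pandas `read_excel`) in place of the CSV parser.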
Configuring LipidFinder 1.0 was considered challenging for inexperienced users, because it required editing a CSV file containing the parameter names, their description, restrictions and values. Thus, typos, incorrect data types and invalid values (due to threshold violations) were common mistakes during this step. To address this, a new configuration process with two interface options has been added to each stage. A graphical user interface (GUI) and a command-line interface (CLI) have been developed paying special attention to their usability to improve user interaction with the software. For example, we include an additional help text for parameters to provide support for new users and default values optimised for high-resolution LC/MS data. The resulting configuration file is still saved in a readable text format, so experienced users can edit it manually. We have also incorporated the functionality to transfer parameters between stages, that is, users can import the configuration file of a previous stage to copy the values of shared parameters, saving time and preventing errors during the configuration. Finally, LipidFinder 2.0 provides backward compatibility, so any configuration file from its previous version can be used as a foundation to generate the new ones.
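The kind of validation the new configuration interfaces perform can be sketched as follows; the parameter names, types and ranges in the template are hypothetical, not LipidFinder's real parameter set:

```python
# Hypothetical parameter template: name -> (expected type, (min, max)).
# Both names and bounds are illustrative assumptions.
TEMPLATE = {
    "mzFixedError": (float, (0.0, 1.0)),
    "minIntensity": (int, (0, 10**9)),
}

def validate(params):
    """Catch the common configuration mistakes: typos in parameter names,
    incorrect data types, and values violating their thresholds."""
    errors = []
    for name, value in params.items():
        if name not in TEMPLATE:
            errors.append(f"unknown parameter: {name}")
            continue
        ptype, (lo, hi) = TEMPLATE[name]
        if not isinstance(value, ptype):
            errors.append(f"{name}: expected {ptype.__name__}")
        elif not (lo <= value <= hi):
            errors.append(f"{name}: {value} outside [{lo}, {hi}]")
    return errors

def import_shared(previous_stage, current_stage):
    """Copy the values of parameters shared with a previous stage's
    configuration, as LipidFinder 2.0's import feature does."""
    shared = {k: v for k, v in previous_stage.items() if k in current_stage}
    return {**current_stage, **shared}
```

This mirrors the described workflow: a stage's configuration is checked before being saved in a readable text format, and a previous stage's file can seed shared values.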

PeakFilter
The first module in LipidFinder's workflow is PeakFilter, designed to improve the quality of datasets by applying a series of corrective steps, outlined in Supplementary Figure 1. However, LipidFinder 1.0 still retained artefactual ions that need to be removed in order to ensure data quality. To this end, we have incorporated four new modules into PeakFilter, depicted in orange in Supplementary Figure 1, along with significantly improving the functionality of our Stack Removal module. These are described below.

In-source fragmentation removal
In-source fragmentation is a common problem in LC/MS where fragmentation of ions, generating product ions, occurs in the source despite ESI being considered a soft ionization method (Gabelica and De Pauw, 2005). These product ions can be indistinguishable from real molecular ions and are misidentified in database searches, as they are isobaric with the precursor ions of real lipids (Xu, et al., 2015). To solve this issue, we developed a method that searches for common, well-known lipid-related in-source ion fragments and removes them from the dataset. The most common forms are either i) fatty acyl ions, and related ketenes generated from phospholipids, detected in negative mode, or ii) neutral loss ions arising from the loss of common fatty acyls, headgroups, CO2 or H2O, which can appear in either positive or negative mode. The default in-source fragments and common neutral losses targeted, as well as the algorithm, are provided in Supplementary Information (Appendix 1). The use of this module requires LC separation, since the default targets will induce the filter to remove free fatty acids unless they are clearly separated from larger lipids.
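The underlying idea can be sketched as follows: a feature is flagged as a possible in-source fragment if a larger, co-eluting feature exists whose m/z minus a known neutral loss matches it. This is an illustrative simplification of the algorithm in Appendix 1; the tolerances and the loss list are assumptions, not LipidFinder's defaults:

```python
def flag_in_source_fragments(features, neutral_losses,
                             mz_tol=0.005, rt_tol=0.1):
    """Return indices of features that look like in-source fragments.
    `features` is a list of (mz, rt) pairs; `neutral_losses` holds the
    masses of common losses (e.g. H2O, CO2, headgroups)."""
    flagged = set()
    for i, (mz_i, rt_i) in enumerate(features):
        for j, (mz_j, rt_j) in enumerate(features):
            # Only a larger, co-eluting feature can be the precursor.
            if mz_j <= mz_i or abs(rt_i - rt_j) > rt_tol:
                continue
            for loss in neutral_losses:
                if abs((mz_j - loss) - mz_i) <= mz_tol:
                    flagged.add(i)  # likely fragment of feature j
    return flagged
```

Note how this captures the caveat in the text: a free fatty acid co-eluting with a larger lipid that could lose it would be flagged, so LC separation is required.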

Stack removal
LipidFinder 1.0 included a module to remove lipid and contaminant stacks. However, the implementation of this filter was too strict and some stacks remained in the resulting datasets. In this update we have revisited the definition of both types of stacks to improve the accuracy of the algorithm. Generally, a stack is formed by a series of features differing from each other by the same m/z gap (or multiples of it), specifically those identified as common contaminating ions and adducts in ESI (http://www.waters.com/webassets/cms/support/docs/bckgrnd_ion_mstr_lst_4_13_2010.pdf; Keller, et al., 2008). Here, a lipid stack is a cluster of such features with the same retention time. On the other hand, a contaminant stack comprises a series of features that elute with a gap between retention times that is maintained across the series. Thus, when plotting features in an m/z versus retention time scatter plot, lipid stacks are visualized as vertical sets whilst contaminant stacks appear as diagonally spaced features. Importantly, some lipid classes appear in such scatter plots as clusters eluting in a diagonal fashion. For example, triglycerides that differ by 2C fragments and degree of saturation follow predictable patterns in retention time shift as the mobile phase lipophilicity increases. To avoid removing true lipids that follow this pattern, we have added the condition that a stack must contain at least four features to be categorized as such before it is removed.
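A minimal sketch of lipid-stack detection under this definition follows: features sharing a retention time whose m/z values differ by multiples of one gap, with the four-feature minimum guarding true lipid series. The gap and tolerance values are illustrative, not LipidFinder's defaults:

```python
def find_lipid_stacks(features, gap, mz_tol=0.005, rt_tol=0.05, min_len=4):
    """Find lipid stacks: >= min_len features at the same retention time
    whose m/z values differ by integer multiples of `gap` (a common
    contaminant/adduct repeat unit). `features` is a list of (mz, rt)."""
    stacks = []
    feats = sorted(features)  # ascending m/z
    used = set()
    for i, (mz0, rt0) in enumerate(feats):
        if i in used:
            continue
        stack = [i]
        for j in range(i + 1, len(feats)):
            mz, rt = feats[j]
            if abs(rt - rt0) > rt_tol:
                continue  # lipid stacks share retention time
            n = round((mz - mz0) / gap)
            if n >= 1 and abs(mz - mz0 - n * gap) <= mz_tol:
                stack.append(j)
        if len(stack) >= min_len:  # avoid removing short true-lipid series
            stacks.append([feats[k] for k in stack])
            used.update(stack)
    return stacks
```

A contaminant stack would use the same m/z logic but require a constant retention-time gap across the series instead of a shared retention time.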

Isotope removal
Isotope peaks, due to the presence of one or more 13C atoms per molecule, show a predictable pattern of features in scatter plots, eluting at the same retention time. These isotopes need to be either removed or combined with the molecular ion prior to searching databases for matches. This is important since related lipids often show multiple M+2 molecular species (differences in double bonds), and depending on instrument resolution an isotope containing two 13C atoms can appear isobaric with the molecular ion of a similar lipid with one double bond fewer.
XCMS includes in its latest versions the CAMERA package which, among other features, detects and labels isotopic peaks. However, we found that the intensity ratio check used in its algorithm is not adequate for lipidomics, because the decrease in intensity of isotope peaks follows a nonlinear relationship with the m/z increment in lipids (Yergey, 1983). Also, CAMERA labels a feature as an isotope if at least a chosen number of samples comply with the requirements (50% by default). Many studies in lipidomics involve diverse sample conditions, so different lipids (and isotopes) may be present only in some samples. Thus, CAMERA's approach can be fallible, particularly since it does not distinguish between technical and biological replicates. We have developed a new deisotoping filter that calculates the isotopic distribution based on polynomial expansion of the parent intensity rather than the linear functions used in CAMERA. Furthermore, the conditions are checked independently for each set of technical replicates belonging to the same biological sample, treating them as separate entities. The method is described in detail in Appendix 2, and we have validated our formulae using the Yergey algorithm.
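A simplified, carbon-only version of this nonlinear intensity relationship can be sketched from the binomial isotope distribution (the full method in Appendix 2 expands a polynomial over all elements, following Yergey; the tolerance below is an assumption):

```python
def expected_m1_ratio(n_carbons, p13c=0.0107):
    """Expected M+1/M intensity ratio considering carbon only.
    From the binomial distribution: M ~ (1-p)^n, M+1 ~ n*p*(1-p)^(n-1),
    so the ratio grows with carbon count rather than being constant."""
    return n_carbons * p13c / (1.0 - p13c)

def is_isotope(parent_intensity, candidate_intensity, n_carbons,
               rel_tol=0.5):
    """Check whether a candidate M+1 peak's intensity is consistent with
    the expected ratio for its parent. LipidFinder applies such checks
    independently per technical-replicate set; a single pair is tested
    here for illustration."""
    expected = expected_m1_ratio(n_carbons) * parent_intensity
    return abs(candidate_intensity - expected) <= rel_tol * expected
```

For example, a 42-carbon phosphatidylcholine should show an M+1 peak at roughly 45% of the parent intensity, while a fixed linear rule tuned for small metabolites would misjudge it.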

Salt cluster removal
We have also implemented the salt cluster filter published by McMillan, et al. (2016). This algorithm targets common ESI artefacts recognized by their characteristically high mass defects and early elution in LC/MS data. The method uses a mass-defect cutoff and a whitelist of bona fide metabolites (with high mass defects) to screen out the salt clusters.
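A minimal sketch of such a mass-defect filter follows; the cutoff values and exact-match whitelist handling are illustrative assumptions, not the published parameters of McMillan et al.:

```python
def remove_salt_clusters(features, defect_cutoff=0.85, rt_cutoff=1.0,
                         whitelist=()):
    """Drop early-eluting features with unusually high mass defects
    (likely salt clusters), unless their m/z matches a whitelisted
    bona fide metabolite. `features` is a list of (mz, rt) pairs."""
    kept = []
    for mz, rt in features:
        defect = mz - int(mz)  # fractional part of m/z
        is_salt = defect >= defect_cutoff and rt <= rt_cutoff
        if is_salt and not any(abs(mz - w) < 0.001 for w in whitelist):
            continue  # discard likely salt cluster
        kept.append((mz, rt))
    return kept
```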

False Discovery Rate
The presence of diverse isotope and adduct features in samples is a critical challenge for the identification process, leading to false positives. The false discovery rate (FDR) is a common metric used to estimate the error of metabolite-spectrum matches and, hence, evaluate the confidence of the resulting profile (Schrimpe-Rutledge, et al., 2016). Although there is currently no agreed upon method to determine FDR in lipidomics, it is widely accepted in MS-based proteomics to use a target-decoy strategy (Elias and Gygi, 2007;Kall, et al., 2008).
The last feature added to PeakFilter is an FDR method developed by adopting a similar approach. This uses target and decoy databases to measure the robustness of a method, retrieving the number of hits from a dataset found in the decoy database (N_d) versus the number of hits in the target database (N_t), and calculating FDR as N_d/N_t. The target database is based on real molecules, while the decoy database is made up of false ions, e.g. m/z values which would not be expected in nature (described in detail below). In this method, the number of decoy hits is considered to mirror the frequency of false lipid matches by LipidFinder's putative profiling. Overall, FDR estimates provide a sense of the robustness of the data after PeakFilter, that is, a theoretical estimate of how many lipid-like artefacts and/or non-lipid metabolites might remain afterwards.
To create its decoy database, JUMPm, a metabolite identification tool, altered the metabolites' m/z values from the target database by violating the octet rule of chemistry (Jones, 2016). This means that each m/z value was adjusted to another theoretical value that did not fit this rule. We found that this m/z alteration does not work in lipidomics since many lipids differ from each other by the mass of one hydrogen, resulting in a decoy database that contains a significant percentage of m/z values matching real lipids. Thus, we instead built a decoy database by adding 0.5 Da to every analyte, since this is a very rare mass defect in lipids. For our target database, we used an in-silico database (COMP_DB) composed of major classes of lipid species, generated from a list of commonly occurring acyl/alkyl chains. The almost 30,000 lipid species in COMP_DB are listed in "bulk format", specifying the total number of carbons and double bonds in their constituent chains. Our decoy (COMP_DB_0.5Da) was generated from COMP_DB, as above. We show a validation and an application of the approach using real data in Sections 4.1 and 4.4.
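The target-decoy calculation can be sketched as follows, with the decoy built by shifting every target m/z by +0.5 Da and FDR computed as N_d/N_t (a minimal illustration, not LipidFinder's implementation):

```python
def fdr(dataset_mzs, target_mzs, tol=0.001, shift=0.5):
    """Target-decoy FDR sketch: decoy m/z values are the target values
    shifted by +0.5 Da (a mass defect rarely seen in lipids). Returns
    N_d / N_t, where N_t and N_d are the numbers of dataset features
    with at least one match in the target and decoy databases."""
    decoy_mzs = [mz + shift for mz in target_mzs]

    def hits(database):
        return sum(any(abs(mz - ref) <= tol for ref in database)
                   for mz in dataset_mzs)

    n_target, n_decoy = hits(target_mzs), hits(decoy_mzs)
    return n_decoy / n_target if n_target else float("nan")
```

A well-cleaned dataset should yield few decoy hits and hence a low FDR, while retained artefacts or calibration drift inflate it.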

MS Search
Lipidomics datasets are large and complex, and there are significant challenges relating to both their identification and their subsequent visualisation. MS data can only ever provide bulk information on putative structures (e.g. potential lipid category, number of carbons and rings/double bonds on fatty acyl substituents, etc.). It cannot be used to define a lipid down to its actual fatty acids, confirm head groups, or assign geometric isomers/enantiomers or positions of functional groups (e.g. for oxidized groups). With this in mind, m/z searches must only return the correct bulk structures for MS data without providing detailed structural annotation. To address this issue, once lipidomics data has been processed through PeakFilter's modules (Supplementary Figure 1) and amalgamated (if needed), MS Search performs putative mass assignment through a bulk structure search on LIPID MAPS. This module queries the selected repository lipid database to determine whether each remaining feature is a known analyte based on its m/z. This revised stage offers three in-house lipidomics databases: COMP_DB, a computationally generated lipid database composed of about 30,000 bulk (isobaric) species covering the major lipid classes, customized for precursor ion searching; ALL_LMSD, the entire LIPID MAPS structure database (LMSD) of over 43,000 unique biologically relevant lipid structures; and CURATED_LMSD, a subset of approximately 21,000 curated structures from LMSD that have been reported in the literature (excluding computationally generated structures). MS Search allows the user to specify the mass tolerance and to restrict the list of adducts and lipid categories being searched. When searching against one of the LMSD databases, matched lipids are only included in the putative profile for a feature if no bulk structures are found.
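A minimal sketch of this tolerance-based search follows; the adduct masses are standard values, but the database entries, function signature and adduct list are illustrative assumptions, not MS Search's actual implementation:

```python
PROTON = 1.007276  # mass of a proton, Da

def search(mz, database, tol=0.001, adducts=None):
    """Match an observed m/z against a bulk-structure database within a
    user-defined tolerance, over a restricted adduct list. `database`
    maps bulk structure names to neutral monoisotopic masses."""
    adducts = adducts or {"[M+H]+": PROTON, "[M+Na]+": 22.989218}
    matches = []
    for name, neutral_mass in database.items():
        for adduct, delta in adducts.items():
            if abs(mz - (neutral_mass + delta)) <= tol:
                matches.append((name, adduct))
    return matches
```

Restricting `adducts` corresponds to the "common" search condition discussed later, while an exhaustive adduct list corresponds to the "all" condition.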
The output file format has been changed from CSV to XLSX to enable inclusion of interactive content (via hyperlinks) for output results, as we did on LipidFinder's online version (Fahy, et al., 2019).
Once putative matches have been assigned, visualization of data is required in a manner that allows the user to interpret the large amount of resulting data. This is a major challenge when dealing with often thousands of individual m/z values, including a large proportion of ions that have not been matched to any database entry. To aid this, MS Search now allows the user to create a summary file and a lipid category scatter plot from the putative identifications. The summary file retains information on the closest match of the most frequent lipid category for each unique m/z and retention time. To generate the lipid category scatter plot, the eight main lipid categories described by the LIPID MAPS Lipid Classification System (http://www.lipidmaps.org/data/classification/LM_classification_exp.php) are used, with the addition of an "unknown" category for unidentified features. Each lipid category is always assigned the same color, easing the comparisons between plots during the analysis. Additionally, LipidFinder 2.0 provides the option to use a color blind friendly palette.
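The fixed category-to-color mapping can be sketched as below; the hex values follow an Okabe-Ito-style color-blind-safe palette and are illustrative, not necessarily LipidFinder's exact colors:

```python
# The eight main LIPID MAPS categories plus "unknown" for unmatched
# features; each category always receives the same color so plots stay
# comparable across analyses.
CATEGORIES = ["Fatty Acyls", "Glycerolipids", "Glycerophospholipids",
              "Sphingolipids", "Sterol Lipids", "Prenol Lipids",
              "Saccharolipids", "Polyketides", "unknown"]

# Illustrative color-blind-safe hex values (Okabe-Ito-style).
COLOR_BLIND_PALETTE = ["#E69F00", "#56B4E9", "#009E73", "#F0E442",
                       "#0072B2", "#D55E00", "#CC79A7", "#999999",
                       "#000000"]

def category_colors(palette=COLOR_BLIND_PALETTE):
    """Return the fixed category -> color mapping used for the lipid
    category scatter plot."""
    return dict(zip(CATEGORIES, palette))
```

A plotting library (e.g. matplotlib) would then color each (retention time, m/z) point with `category_colors()[category]`.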
It is important to note that unlike metabolites, lipids generally behave similarly across individual categories/classes in large cohort sample sets (e.g. phospholipids as a group tend to change together with disease or genotype), and it is highly unusual that a single lipid will stratify without others that are structurally related. We propose that taking the approach of plotting and statistically analyzing lipids in their LIPID MAPS categories has several advantages: i) if several lipids within a category alter similarly, it provides confidence that the difference is real, and ii) if analyzed in sub-groups, the statistical power is increased when applying corrections for multiple comparison testing. Our suggested approach is to screen categories for differences first using an untargeted method as described herein, then to validate the changes across the individual lipid categories using gold standard targeted quantitative MS/MS methods (generally restricted to single categories only) as the second step.

FDR analysis
Decoy databases are designed to be populated with ions that should not commonly exist in the types of samples being analysed, and as expected, ours (COMP_DB_0.5Da) does not share any m/z values with COMP_DB ions. Thus, high quality lipidomics data from well-calibrated instruments that has been effectively cleaned up should return a low FDR using a decoy-versus-target database approach, as only implausible lipids will be recognised by the decoy. To validate this approach, we created two "theoretical" sets of experimental data for testing, each of which contains 400 high resolution m/z values randomly selected from the target database, COMP_DB. Each set was then searched under two conditions: a) a restricted list of common lipid ions (common), which are more representative of real lipidomics datasets, or b) all available ions in the LIPID MAPS MS search tool, including a number of potential adducts of less direct relevance to lipids (all). We chose a standard mass tolerance for high resolution MS (± 0.001 Da) for both tests to fit with the expected instrumentation imperfections, e.g. suboptimal calibration and other random electronic noise, that make real experimental m/z values differ slightly from the theoretical ones. Supplementary Table 1 shows the results for both datasets in each search condition. The number of features that returned at least one match in the target (N_t) and in the decoy (N_d) databases is also shown for completeness.
Supplementary Table 1 shows that our method returns a 0% FDR (with 100% of features matching the target, N_t) when searching for common lipid ions. These results demonstrate that the construction of a decoy database by adding 0.5 Da to every bulk lipid structure included in the target database is an effective strategy, since the method did not return any decoy hits even with a standard tolerance. Even if we expand our database search space to include every possible adduct available in the LIPID MAPS MS search tool, the FDR is negligible. This suggests that our proposed FDR approach can be used as a metric to assess the quality of LC/MS lipidomics datasets, e.g. retained artefactual ions or potential calibration drifts.

[Supplementary Table 1 column headers: Source ions, Target ions, Tolerance (Da); the table body is not reproduced here.]

Isotope removal analysis
Here we compared annotation of isotope peaks as achieved by CAMERA and LipidFinder 2.0 using a dataset of lipids generated by high-resolution LC/MS analysis of lipid extracts from RAW cells and peritoneal macrophages. Both approaches were applied to the datasets first preprocessed with XCMS Online with the Orbitrap II parameter set (Tautenhahn, et al., 2012). Following this, we ran MS Search to generate putative identifications of as many features as possible. This profiling was done searching against the COMP_DB database with a mass tolerance of ± 0.001 Da, and screening for all available ions (of the corresponding polarity). Supplementary Table 2 presents the isotope annotation returned by both software tools for 8 features (4 positive, 4 negative ions) and their putative identification.
Overall, CAMERA and LipidFinder 2.0 agreed on many isotope annotations. In several cases this agreement was over all sample replicates (6 RAW 264.7 and 6 resident peritoneal macrophages). For instance, in Supplementary Table 2 we find m/z 668.6188 (positive mode) and m/z 802.5756 (negative mode). On average, disagreement between the two methods happened in fewer than 50% of replicates. As a result, CAMERA, with its "majority rule", annotated whole features as isotopes that LipidFinder 2.0, with its more restrictive intensity thresholds, only annotated partially, i.e. defining those features as isotopes for some but not all sample replicates. This problem becomes critical when the discrepancy delineates the diverse biological groups of the dataset. See two examples in Supplementary Table 2: m/z 941.6996 (+ve mode), whose M+1 isotope was found only in the peritoneal group; and m/z 730.5391 (-ve mode), whose M+1 isotope was only annotated in RAW samples. In both cases the intensity was too low in the untagged group, meaning that those features are likely to be isotopes in their annotated group, and another type of analyte in the other.
Finally, we focus on features where each method returned a different annotation. In Supplementary Table 2 we show m/z 831.6834 (+ve mode) and m/z 751.5724 (-ve mode), for which LipidFinder 2.0 did not detect isotopes in any sample replicate whilst CAMERA did. In the former instance, the isotope has a higher intensity than our defined upper threshold for every sample. In the negative-mode example, the mass of the feature annotated as the M+1 isotope by CAMERA is much lower than the expected increment of one 13C: the absolute difference between the "parent" and its isotope is 0.9887 Da (instead of 1.0033 Da). Furthermore, both M+1 isotopes were putatively identified as MGDG(38:0) and PC(32:0), respectively, another reason to believe they may have been incorrectly annotated as isotopes. Conversely, Supplementary Table 2 also contains features for which LipidFinder 2.0 annotated isotopes whilst CAMERA did not. The latter method assigns different peak groups (pcgroup) to the "parent" features and their corresponding isotopes, even though they are less than 3 seconds apart from each other in both cases.
Overall performance analysis
LipidFinder 2.0 includes several new modules, all designed to improve data quality. To demonstrate this, we provide a comparison of the effectiveness and performance of the lipidomics pipeline (Supplementary Figure 2) with and without either the old or new versions of LipidFinder. For XCMS we chose the online version with the Orbitrap II set of parameters. We used the output files from XCMS Online as input for both LipidFinder versions, with the default parameterization of LipidFinder 2.0 applied in both cases. Supplementary Figure 2 shows the retained features of the macrophages dataset in both positive and negative mode in scatter plots of m/z versus retention time. Last, an example of the color blind palette for both a scatter plot and a volcano plot of the data is presented in Supplementary Figures 3 and 4.
Prior to using LipidFinder, XCMS datasets contain eluting features that form repeating patterns (Supplementary Figure 2, left panels), for example, series of dots following (almost) straight lines. These are in many cases related to well-known artefacts such as in-source fragments, lipid stacks, ESI contaminants or salt clusters. Applying the FDR method to these XCMS datasets, we obtain high %FDR values, indicating that many features match our decoy lipid m/z values (Supplementary Table 3). LipidFinder 1.0 performs reasonably at removing several of these artefacts, resulting in a reduced FDR (Supplementary Table 3; Supplementary Figure 2, middle panels). However, significant artefacts still remain, e.g. the dense column of dots throughout the first minute (salt clusters), or the short vertical lines of equally distant dots (isotopes). In contrast, the incorporation of the new modules presented here into the lipidomics pipeline results in further improvements to data output, as shown by an improved FDR (Supplementary Table 3; Supplementary Figure 2, right panels). The first release of LipidFinder removed a third of the total features in each dataset and reduced the number of decoy hits to as little as a quarter of that of XCMS Online. The reduction of target hits is also expected, as LipidFinder has modules designed to "simplify" the dataset. For instance, adduct removal will find all adducts for a given lipid and retain only the most intense one. LipidFinder 2.0 improves on the results of its predecessor, removing two thirds of the total number of features from XCMS Online while retaining almost as many target hits as LipidFinder 1.0 and decreasing the FDR by half. Here, the minor reduction of target hits is due mainly to in-source fragment removal, e.g. fatty acids will be retained only if they do not co-elute with other features from which they may have originated.
Our previous version of LipidFinder typically took 30 minutes to analyze 12 samples (with about 23 K features per mode). In the new version, we sought to improve processing speed through parallelization and algorithm redesign based on computational cost analysis. Here we compare the time taken to process the macrophages datasets (positive and negative mode) with both versions, using the same configurations and conditions as for the previous analysis, and find that the total runtime has been reduced by 10 minutes (Supplementary Table 4). The benchmarks show an increased time cost for PeakFilter. This is due to the computational requirements of the FDR method, which creates a bottleneck. If we omit that step, the new implementation outperforms its predecessor, even with the addition of three new modules. We observe similar improvements in Amalgamator and MS Search. Furthermore, during the aforementioned algorithm redesign we paid special attention to computational complexity. This means that not only is LipidFinder 2.0 more efficient, but its time cost will also increase more slowly with larger datasets.
LipidFinder on LIPID MAPS
LipidFinder 2.0 replaces LipidFinder 1.0 in its online version, available on LIPID MAPS as a web application (Fahy, et al., 2019). As an additional feature, LipidFinder on LIPID MAPS incorporates the option to import preprocessed files directly from XCMS Online. To do so, users are asked to provide their XCMS Online username and the JobID of the job that generated the preprocessed file they want to import. This feature will only fetch datasets from pairwise and multigroup job types. After the file has been imported correctly, it can be processed with LipidFinder 2.0 and analysed with the statistical methods available.