Integrative open workflow for confident annotation and molecular networking of metabolomics MSE/DIA data

Abstract Liquid chromatography coupled with high-resolution mass spectrometry in data-independent acquisition mode (LC-HRMS/DIA), including MSE, enables comprehensive metabolomics analyses, although it poses challenges for data processing with automatic annotation and molecular networking (MN) implementation. To address this, we introduce DIA-IntOpenStream, a new integrated workflow combining open-source software to streamline MSE data handling. It provides 'in-house' custom database construction, allows the conversion of raw MSE data to a universal format (.mzML) and leverages open software (MZmine 3 and MS-DIAL), all of which are advantages for confident annotation and effective MN data interpretation. This pipeline significantly enhances the accessibility, reliability and reproducibility of complex MSE/DIA studies, overcoming previous limitations of proprietary software and non-universal MS data formats that restricted integrative analysis. We demonstrate the utility of DIA-IntOpenStream with two independent datasets: dataset 1 consists of new data from 60 plant extracts from the Ocotea genus; dataset 2 is a publicly available actinobacterial extract spiked with authentic standards, used for a detailed comparative analysis with existing methods. This user-friendly pipeline enables broader adoption of cutting-edge MS tools and provides value to the scientific community. Overall, it holds promise for speeding up metabolite discovery and for fostering a more collaborative and open research environment.


INTRODUCTION
Classical purification/isolation procedures for chemical characterization in the field of natural products (NPs) are known for their laborious nature, involving multiple chromatographic steps and frequently affording well-known compounds. To address this problem, chemical annotation using liquid chromatography coupled with high-resolution mass spectrometry (LC-HRMS) has become the gold standard for rapid and efficient assessment of metabolite content, both for annotating known compounds and for guiding the isolation of unknown ones [1-4].
Data-independent acquisition (DIA) is a mass spectrometry (MS) acquisition mode that systematically fragments precursor ions within a specific mass-to-charge ratio (m/z) range. It has the advantage of detecting low-abundance metabolites, which are often overlooked by conventional data-dependent acquisition (DDA) methods because of their unavoidable loss of MS data coverage [5,6]. MSE, developed by Waters™ for quadrupole time-of-flight (Q-TOF) MS analyzers, is a DIA method that fragments all precursor ions within the entire acquisition window by alternating between low- and high-collision energies, thereby obtaining consecutive scans of precursors and their fragments. This tandem MS approach is therefore considered DIA because it fragments precursors irrespective of their abundance [7,8]. The terms MSall and all-ion fragmentation (AIF) have also been employed for a similar type of fragmentation with Orbitrap analyzer-based instruments from Thermo Fisher™ [5,7,9].
Despite advances in MS techniques, challenges persist, especially in software availability for processing data obtained through DIA methods. While there are robust vendor-specific proprietary options, e.g. UNIFI (Waters™), open software options are limited. In this scenario, MS-DIAL is an extensively used option for users of a wide variety of mass spectrometers. Other more recent approaches for processing and annotation include DIAMetAlyzer, DecoID and MetaboMSDIA, each of which has particular advantages and limitations [10-13].
In molecular networking (MN), each processed mass spectrum is represented as a node, and spectral similarities between nodes can be calculated using different algorithms, such as the cosine similarity used by the Global Natural Product Social Molecular Networking (GNPS) platform [14,15]. Despite its potential, processing DIA-MS data alone for automated annotation and MN in metabolomics remains challenging compared to the well-established DDA workflows [14,16]. These challenges motivated the development of the DIA-IntOpenStream pipeline. The present study brings novelty by offering a comprehensive pipeline for processing LC-HRMS-DIA/MSE data that automates the generation of custom databases using freely available software. We additionally showcase the pipeline's advantages through the successful use of the universal .mzML MS data format to process, annotate and generate functional MNs. Finally, we focus on the current challenges in LC-HRMS-DIA/MSE data analysis, offering strategies to mitigate these drawbacks and providing critical insights for future advancements.
This integrative approach enhances confidence in the annotation of known compounds and facilitates the discovery of novel and/or structurally related compounds. It therefore also enables the prioritization of unknown metabolites of interest for further investigation. Our study validates the DIA-IntOpenStream pipeline with two independent datasets. Dataset 1 consists of LC-HRMS/DIA data from 60 Ocotea plant extracts, showcasing the pipeline's applicability in plant metabolomics. Dataset 2 is a publicly available actinobacterial extract dataset enriched with a diverse pool of authentic chemical standards, encompassing a range of antimicrobial and naturally occurring compounds. The inclusion of known standards allows evaluation of the pipeline's annotation accuracy and efficiency. Thus, this dataset provides a solid foundation for a detailed comparison with the original, well-designed study, which used non-open software [17]. A key advantage of DIA-IntOpenStream is that it relies exclusively on open-source software. It is thus a cost-effective alternative that achieves results equivalent, and even complementary, to those of standard approaches, thereby enabling high-quality metabolomics analysis based on LC-MS/DIA data.

General pipeline workflow
LC-HRMS/DIA techniques such as MSE generate highly complex datasets that require specialized software for processing and annotation. Until recently, MN generation required vendor software or the use of a non-universal MS data format (e.g. ABF from MS-DIAL), limiting the execution of integrated and complete MSE data analyses. In contrast, DIA-IntOpenStream uses a standardized MS data format (.mzML) and open software tools for MSE data processing and annotation. The integrated workflow provides enhanced confidence for general, automated annotation strategies and offers an accessible way to increase the reliability of metabolome annotation coverage using DIA data. Indeed, the pipeline is adaptable to any MSE or AIF LC-MS analysis.
Step 1 starts with the raw MSE acquisition data. In step 2, MSE raw data are converted into the standard .mzML format using Waters2mzML (https://github.com/AnP311/Waters2mzML), first published on GitHub in late 2022; however, it is still under development and limited to Microsoft Windows operating systems. Waters2mzML implements a Python-based wrapper for ProteoWizard msConvert (https://proteowizard.sourceforge.io/), the most widely used open MS converter software. Step 3 is the Konstanz Information Miner (KNIME) workflow, which can be rapidly executed and generates custom 'in-house' databases (DBs). These DBs are imported in the following steps 4 (MZmine 3 processing) and 5 (MS-DIAL processing). Step 3 is important for automatic enhanced annotation with level 3 of confidence during the data processing steps, leveraging the quality, processing power and annotation of both MZmine 3 and MS-DIAL 4.9 software [18,19]. Of note, the generated KNIME database output is exported as a .csv file and subsequently imported into MZmine 3, while a .txt file is used for MS-DIAL. Furthermore, from MS-DIAL, DIA data are exported in .mgf spectra format together with the GNPS feature table (.txt). In step 6, these two files and the additional metadata are submitted using WinSCP remote server software for feature-based molecular networking (FBMN) in the GNPS platform. Step 7 applies FBMN analysis and automated annotation with level 2 of confidence. Step 8 consists of semi-automated strategies in Cytoscape software that are employed for data inspection, visualization and integration of FBMN and in-house DB annotations.
The integration of the processed data with online MS spectral libraries allowed for the automated annotation of metabolites. Supplementation with data gathered from customized 'in-house' annotations has bolstered confidence in the annotation of the molecular families generated. The construction of tailored in-house DBs with metabolites of interest is critical to increase annotation reliability, as it provides matches with metabolites specific to a given taxon under investigation. For example, the use of the OcoteaDB and ActinomarineDB built with KNIME substantially enhanced the reliability of our metabolite annotations by reducing the likelihood of false positives, thereby increasing the proportion of true hits. Overall, the strategic integration of automated, custom 'in-house' annotations with online libraries and the optimized GNPS parameters described in this pipeline enables MN with robust metabolite annotation of DIA/MSE data, as schematically shown in Figure 1.

Figure 1.
Schematic representation of the integrated process for generating an 'in-house' automatic database and performing LC-HRMS/DIA data processing to create molecular networks. After data acquisition (step 1), LC-HRMS/DIA data are converted to the standardized, universal .mzML format using Waters2mzML V1.2.0 (step 2). A custom 'in-house' DB specific to the research is automatically prepared in the KNIME platform (step 3) using chemical structures that are drawn or downloaded from online libraries in various standard formats (.mol, .mol2, .sdf). Table formats (.csv) containing SMILES, CAS number, InChIKey or IUPAC chemical name can also be used. The converted .mzML data are then imported into MZmine 3 (step 4) and MS-DIAL (step 5) for processing, where the output 'in-house' DB generated in KNIME is integrated to enable automatic annotation (level 3 of confidence). MS-DIAL exports the alignment results as a GNPS feature table (.txt) and an MS2 file (.mgf) that, along with the custom metadata (.txt), are submitted via WinSCP remote server software to the GNPS environment (step 6). Feature-Based Molecular Networking (FBMN) analysis with automated level 2 of confidence annotation is performed on GNPS (step 7). Semi-automated strategies in Cytoscape software are employed for data visualization and integration of FBMN and in-house DB annotations at levels 2 and 3 of confidence, respectively (step 8). * Low energy channel scans (LECS)/MS1.

In-house database and KNIME workflow
In general, this workflow accepts the four most common types of chemical input data, namely, .mol, .mol2, .sdf and .csv. The table input files can contain Simplified Molecular Input Line Entry System (SMILES) strings, International Chemical Identifier keys (InChIKey), Chemical Abstracts Service (CAS) numbers or International Union of Pure and Applied Chemistry (IUPAC) names. The output is a .csv file with three columns: chemical structure name, calculated molecular formula and calculated monoisotopic mass. To generate the OcoteaDB dataset, the KNIME workflow was run with 492 molecular structures from Ocotea spp. in .mol format, drawn from online databases. The total runtime of the workflow was 84 s (see the Methods section for the desktop configuration), and the resulting database was used for later 'in-house' annotation during data processing. The ActinomarineDB dataset was generated from 6481 NPs sourced from the online NP Atlas database (https://www.npatlas.org/) in .csv format, comprising the genera Actinomyces, Streptomyces, Salinispora, Micromonospora, Nocardia, Actinomadura and Rhodococcus; the workflow ran for 63 s. The OcoteaDB and ActinomarineDB .csv files were then uploaded into MZmine 3 for annotation. Additionally, .txt export versions can be imported into MS-DIAL for annotation. Figure 2 illustrates the reader, converter and writer nodes.
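The KNIME workflow itself is graphical, but the shape of its output can be illustrated with a minimal Python sketch. It assumes that the molecular formula of each structure is already known (KNIME derives it from the structure files), and the function names and column headers are hypothetical, chosen only for illustration; the headers actually expected by MZmine 3 or MS-DIAL may differ. The two alkaloids used as input are compounds discussed in this study.

```python
import csv
import io
import re

# Monoisotopic masses (Da) of the most abundant isotope of each element.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}


def monoisotopic_mass(formula: str) -> float:
    """Sum isotope masses for a simple formula such as 'C19H23NO4'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONO[element] * (int(count) if count else 1)
    return total


def build_db(rows):
    """Return a three-column CSV (name, formula, monoisotopic mass) as text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["name", "formula", "monoisotopic_mass"])  # assumed headers
    for name, formula in rows:
        writer.writerow([name, formula, f"{monoisotopic_mass(formula):.4f}"])
    return out.getvalue()


db_csv = build_db([("reticuline", "C19H23NO4"), ("glaucine", "C21H25NO4")])
print(db_csv)
```

In practice a cheminformatics toolkit would compute the formula and mass directly from the structure; the sketch only shows how the three output columns relate to one another.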

LC-HRMS data conversion and processing
Dataset 1, along with quality controls (QCs) and blanks in the Waters™ .raw format, was effectively converted with Waters2mzML to generate functional centroided .mzML files. The same step was performed for dataset 2. These conversions took ~36 and 1.5 h, respectively, on our computer configuration (detailed in the Methods section). The files were then processed using MZmine 3 and MS-DIAL 4.9. For MZmine 3, despite the large cohort of dataset 1, final batch processing required only ~7 min per ionization mode, while dataset 2 took only 1 min. While actual processing is quite fast (a few minutes), software parameter optimization is time-demanding, although it enables robust data processing for complex samples. Dataset 1 yielded 18 805 aligned features in the positive mode, including 3983 annotation hits from OcoteaDB with all potential adducts identified. Similarly, the negative mode yielded 23 304 features with 3216 database annotation hits.
In contrast, the positive mode analysis with MS-DIAL 4.9 required ~2.11 h for dataset 1, resulting in the annotation of 22 572 features, while the negative mode analysis took around 1.58 h, yielding 21 838 features. For dataset 2, MS-DIAL 4.9 processing required only 4 min. The dataset acquisition parameters and data size have a great influence on processing time, especially when aligning a large number of samples, as in the case of dataset 1. In addition, despite inherent variations in the parameters and algorithms employed by the two programs, the results generated were comparable. More specifically, MS-DIAL exhibited a longer processing time because it processes and assigns MS2 fragment ions to MS1 precursor ions in DIA data. This step is the slowest during data processing and is mandatory for MN implementation. Details of the dataset 2 processing results are provided in SM-4 and SM-5.
Although DIA processing algorithms are present in MZmine 3, they were not employed in this pipeline because they remain in an experimental phase, without publicly available guides or tutorials to standardize parameter values, unlike DDA data processing, which is very well established. As such, we used MZmine only to perform MS1 data processing, which explains its faster data processing when compared to MS-DIAL. Despite this limitation, we highlight that MZmine 3 demonstrates remarkable transparency and guidance in data processing and annotation. Further insights and considerations are provided in the Discussion section.

Figure 2.
KNIME workflow for high-confidence, automatic in-house database assembly. The workflow includes .mol, .mol2 and .sdf chemical structures as input data for the respective reader nodes. The last node is a .csv table reader that accepts tables containing either SMILES, CAS numbers, InChIKey or IUPAC names as input data. The final output is a .csv (or .txt) file with three columns: chemical name, calculated molecular formula and calculated monoisotopic mass. The in-house DB specific to the samples under study can then be integrated into the data processing step to increase confidence in the metabolite annotation.

Chemical annotation
Following established guidelines for untargeted metabolomics, QC samples were prepared for dataset 1. After data analyses, consistent peak distributions and reproducible metabolic fingerprints across the QC injections were observed (Figures 3 and S1). Using QC samples to ensure that predefined quality thresholds are met is a critical step, as it validates that the analytical system can acquire high-quality metabolomics data from the experimental samples [20,21]. The developed LC-HRMS/MSE method effectively separated and detected major and minor metabolite components in Ocotea spp. The comprehensive list of annotations with level 2 of confidence for the acquired metabolic fingerprint is shown in Table 1. Even with the complexity of our metabolomics data, careful and rigorous data processing led to reliable and clearly reproducible results, as shown by the superimposed chromatograms of experimental replicates (Figure S2). The superimposed chromatograms of extract samples and QCs from both positive and negative modes are illustrated in Figures 3 and S3-S5, evidencing the high chemical complexity of the data. Details of dataset 2 are provided in Figures S16-S18 and Tables S3 and S4.
Aporphine and benzylisoquinoline alkaloids were annotated as major compounds, all of which have known fragmentation patterns and were typically distinguished from each other by neutral losses (Figure 4). Fragmentation patterns of the morphinandienone and phenanthrene alkaloids, as well as the NP subclasses of lignoids and flavonoids, were also demonstrated (Figures 5 and 6). Automated integration of the 'in-house' DB by matching MS1 monoisotopic masses enabled manual examination of MSE spectra using MZmine 3, leading to level 2 confidence identification of 66 specialized metabolites. No public mass spectral data were available for 15 annotated metabolites, so fragmentation was proposed based on relevant literature and chemical knowledge. Thus, for these metabolites, level 2 of confidence was given based on the classification by Schymanski et al. [22]. Diagnostic fragment ions were searched in the MassBank of North America (MoNA) and GNPS libraries. In parallel, automated annotation of MSE spectra via MS-DIAL-GNPS-FBMN identified 155 NPs. A comparison between the manual and automated annotations revealed just 18 shared annotations (Table 1). The complete observed product ions and GNPS annotations are detailed in Table S1 and online (https://zenodo.org/records/10383866), respectively.
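The MS1-level matching of feature m/z values against the in-house DB can be sketched as follows. This is a simplified illustration rather than the MZmine 3 implementation: the adduct list, the 5 ppm tolerance, the function names and the example feature m/z are all assumptions chosen for the example. Because several database entries may share a mass within tolerance, such hits support only tentative (level 3) annotation until the fragment spectra are inspected.

```python
# Mass shifts (Da) added to the neutral monoisotopic mass for common adducts.
ADDUCTS = {
    "[M+H]+": 1.007276,
    "[M+Na]+": 22.989221,
    "[M-H]-": -1.007276,
}


def ppm_error(observed: float, theoretical: float) -> float:
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def annotate(feature_mz, database, adducts, tol_ppm=5.0):
    """Return (name, adduct, ppm error) for every DB entry within tolerance."""
    hits = []
    for name, neutral_mass in database.items():
        for adduct in adducts:
            theoretical = neutral_mass + ADDUCTS[adduct]
            error = ppm_error(feature_mz, theoretical)
            if abs(error) <= tol_ppm:
                hits.append((name, adduct, round(error, 2)))
    return hits


# Hypothetical two-entry database and one positive-mode feature.
in_house_db = {"reticuline": 329.1627, "glaucine": 355.1784}
hits = annotate(330.1700, in_house_db, ["[M+H]+", "[M+Na]+"])
print(hits)
```

Widening the tolerance or the adduct list increases the number of hits, which is one reason why a taxon-specific database helps to keep the candidate list short.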

Gas-phase fragmentation reactions
The key distinction observed for alkaloids was based on the N-substitution pattern. Noraporphines showed a 14.01 Da mass reduction versus aporphines due to the presence of a hydrogen instead of a methyl group as the N-substituent. This allowed a distinction between the two alkaloid subclasses. Prevalent neutral losses were 17.03 Da (NH3) and 31.04 Da (CH3NH2) from isoquinoline ring opening (Figure 4). Additionally, losses of CH3OH (32.03 Da) occurred due to adjacent hydroxyl and methoxy groups in aporphine rings, followed by neutral CO loss (27.99 Da). Fragmentation patterns of some of the less common alkaloids found in Ocotea spp., including benzylisoquinolines, morphinandienones and phenanthrenes, are also depicted in Figure 5. This reveals some shared and distinctive fragmentation patterns among the diversity of the annotated alkaloids, as explained in ST-1. Fragmentation proposals were based on chemical knowledge and supported by the literature [23-27].

The IDs with the respective ion fragments observed on MSE spectra are detailed in Table S1. Chemical structures are provided in Figures S8-S12. Spectral matching was done manually for all IDs using online MS reference spectra libraries (MoNA and GNPS). The proposed fragmentations were based on the literature with diagnostic ions for the annotated metabolite classes [23-27].

Flavonoids and lignoids displayed characteristic neutral losses and fragment ions as well (Figure 6). The fragmentation of lignoids was evidenced by neutral losses of methyl (14.01 Da) and methoxy (32.03 Da), retro-Diels-Alder reactions and aromatic ring cleavages. These fragmentations formed diagnostic ions that allowed the differentiation of bicyclo neolignans and benzofuran lignoids. For flavonoids, fragmentation predominantly involved glycosidic bond cleavages and losses of saccharide units. These included losses of pentoses (132.04 Da), deoxyhexoses (146.06 Da), hexoses (162.05 Da), glucuronic acids (176.03 Da) and rutinoses (308.11 Da), giving characteristic product ions. These neutral losses provided clues to the types of glycosylation present on the flavonoid scaffolds. Key diagnostic ions for flavonoid aglycones allowed differentiation between aglycone types such as apigenin, quercetin and kaempferol. Overall, these characteristic fragmentation patterns allowed differentiation between the main flavonoid subclasses present in Ocotea species (ST-1).
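The neutral losses discussed in this section can be checked with a short calculation. The sketch below computes the monoisotopic mass of each lost species from its elemental composition and labels the difference between a precursor and a fragment m/z. The 0.01 Da tolerance, the function names and the example ion pair (theoretical [M-H]- values for a quercetin hexoside and its aglycone) are assumptions made for illustration and are not values taken from the study.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotope of each element.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}


def mass(formula: str) -> float:
    """Monoisotopic mass of a simple elemental formula."""
    return sum(MONO[el] * (int(n) if n else 1)
               for el, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula))


# Neutral species named in the text, written as elemental compositions.
NEUTRAL_LOSSES = {
    "ammonia": "NH3",
    "methylamine": "CH5N",
    "methanol": "CH4O",
    "carbon monoxide": "CO",
    "pentose": "C5H8O4",
    "deoxyhexose": "C6H10O4",
    "hexose": "C6H10O5",
    "glucuronic acid": "C6H8O6",
    "rutinose": "C12H20O9",
}


def explain_loss(precursor_mz, fragment_mz, tol_da=0.01):
    """Return the names of neutral losses matching the m/z difference."""
    delta = precursor_mz - fragment_mz
    return [name for name, formula in NEUTRAL_LOSSES.items()
            if abs(delta - mass(formula)) <= tol_da]


# Illustrative pair: hexoside precursor and aglycone fragment in negative mode.
print(explain_loss(463.0882, 301.0354))
```

A lookup of this kind only suggests which unit was lost; the position and identity of the sugar still require diagnostic ions or reference spectra.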

FBMN
Regarding the FBMN jobs with GNPS, positive mode analysis required ~5 h, whereas the negative mode took ~6 h. These analyses resulted in the generation of highly complex metabolic networks (Figures S6 and S7). Besides overall complexity, MN revealed intricate cluster families in the metabolome of Ocotea spp., which could be individually analyzed to obtain deeper information. In addition, to extract more nuanced insights, LC-HRMS-DIA/MSE data were reprocessed with higher amplitude cutoffs (e.g. 50 000 counts). The FBMN jobs on the reprocessed data from both datasets required less than 15 min to finish. The simplified MNs generated from the reprocessed data aided in the visualization and identification of key Ocotea spp. molecular families (Figure 8) and actinobacterial MNs (Figure 9). Spectra were queried against GNPS libraries related to our dataset (e.g. IQAMDB and NIH NPs for positive and negative mode, respectively), and a complete list of matches is provided in Tables 1 and S1 and online at the Zenodo open digital library (https://zenodo.org/records/10383866). The FBMN analysis revealed distinct families of aporphine and benzylisoquinoline alkaloids (positive mode), alongside predominantly O-glycosylated flavonoids (negative mode) across the Ocotea spp. (Figure 8). The pie charts illustrate relative metabolite abundance across 60 Ocotea species based on MS1 precursor ion areas. Visual inspection of a positive mode alkaloid cluster shows 3-hydroxynornuciferine as highly abundant but specific to only a few Ocotea species, while glaucine is specific to only a few others. Reticuline appears conserved across most Ocotea spp., suggesting a potential genus chemomarker. N-methylcoclaurine is also broadly present, but with lower abundance in this cluster family.
All of the spectral matches represented were thoroughly inspected to check their annotation and spectral similarity accuracy. Regarding the MN in negative mode, the highlighted family of glycosylated flavonoids is demonstrated, with the main metabolites across the highlighted cluster family [...] conservation of these flavonoid metabolites suggests the presence of essential, shared biosynthetic pathways within the Ocotea genus.

DISCUSSION
A core challenge in omics fields, including metabolomics, is the conversion and processing of DIA data (e.g. MSE) compared to traditional DDA workflows. In DDA, preselected precursors are fragmented, enabling straightforward data conversion and processing by most open tools. However, DDA induces significant losses in spectral data coverage, varying with the sensitivity of the instrument and, relatedly, the cycle time of the method, because it selects only ions above a certain cut-off area or intensity for fragmentation [5,28]. In contrast, MSE methods fragment all ions without any previous precursor ion selection, generating complex but unbiased spectra with more complete metabolomic data coverage [7]. The lack of predefined precursors in MSE means that fragment-precursor relationships must be reconstructed post-acquisition through computational deconvolution approaches, which can be performed using different algorithms [5-7,29]. This is more challenging compared to the inherent precursor-to-fragment associations made with DDA methods. Herein, we present an integrative strategy to leverage MSE data in the community standard .mzML format using only freely available software and platforms. This study demonstrates the power of MSE with accessible tools while highlighting current, ongoing challenges in data conversion, processing and interpretation.
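The idea behind post-acquisition reconstruction of fragment-precursor relationships can be illustrated with a deliberately simplified sketch: fragments are linked to the precursor whose elution profile they follow most closely. This is not the deconvolution algorithm used by MS-DIAL or any other tool named here; the Pearson-correlation criterion, the 0.9 threshold, the function names and the intensity traces are all assumptions invented for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long intensity traces."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat trace carries no elution-shape information
    return cov / sqrt(var_x * var_y)


def assign_fragments(precursors, fragments, min_r=0.9):
    """Link each fragment to the best-correlated precursor, if r >= min_r."""
    links = {name: [] for name in precursors}
    for fragment_mz, trace in fragments.items():
        best = max(precursors, key=lambda p: pearson(precursors[p], trace))
        if pearson(precursors[best], trace) >= min_r:
            links[best].append(fragment_mz)
    return links


# Low-energy traces of two partially co-eluting precursors (hypothetical).
precursors = {
    "A": [0, 10, 50, 100, 50, 10, 0, 0],
    "B": [0, 0, 0, 10, 50, 100, 50, 10],
}
# High-energy traces of three fragment ions over the same scans (hypothetical).
fragments = {
    192.10: [0, 2, 10, 20, 10, 2, 0, 0],   # follows precursor A
    297.11: [0, 0, 0, 3, 15, 30, 15, 3],   # follows precursor B
    150.05: [5, 5, 5, 5, 5, 5, 5, 5],      # background, no peak shape
}
links = assign_fragments(precursors, fragments)
print(links)
```

When two precursors co-elute almost perfectly, profile correlation alone cannot separate their fragments, which is one source of the reduced MS2 specificity of DIA data.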
This integrated pipeline provides an efficient and customizable solution for extracting the most biological information from MSE data. To our knowledge, this is the first report of Waters2mzML in an applied case study. Waters2mzML is a recently introduced tool specifically designed to address the challenges associated with converting and centroiding Waters MSE data without the need to use vendor software such as UNIFI, Symphony or Progenesis QI. It is a simple tool that ensures compatibility and offers independence when converting raw Waters MSE data into a more widely used format. This open tool can correctly reassign MS2 level data to MSE MS/MS scans. Therefore, it is now possible to freely generate standard and functional .mzML spectra from MSE data for further integrated downstream metabolomics analyses.
Moreover, this pipeline includes the ability to generate a custom, 'in-house' database through a KNIME workflow, enhancing its utility. The integration of this database into LC-HRMS-DIA/MSE data processing significantly improves metabolite annotation of MNs when using MZmine and MS-DIAL. In our case studies, our focus was to enhance the annotation of metabolites specifically from Ocotea plants and marine actinobacteria datasets. The creation and incorporation of OcoteaDB and ActinomarineDB provided a [...]
Regarding data processing, the export functionality of MS-DIAL allows DIA data to be cosine matched with high-quality spectral libraries on the GNPS platform, also allowing integration with any GNPS tools for enhanced data interpretation, including FBMN, which was used in this workflow. It also allows other GNPS analyses, such as MS2LDA, Network Annotation Propagation (NAP) and MolNetEnhancer, which are already well implemented for DDA data in the GNPS platform [32-34]. MS-DIAL software therefore enables the full processing of MSE spectra with correct GNPS export in a generic file format (.mgf). Importantly, processed data from MS-DIAL can be employed for spectral similarity searches through a range of different algorithms and tools. The strength of MS-DIAL lies in its robust algorithm, MS2Dec, which deconvolutes precursor ions and reassociates precursor-fragment links and whose effectiveness has been widely proven [14,19,32,33].
In contrast, MZmine 3 is a powerful software for processing and analyzing DDA data, while effectively handling DIA data and integrating with GNPS are still ongoing challenges. The dissociation of MS1 [...] manual inspection of raw chromatograms and spectra, which was previously only possible with vendor-provided software. While the integration of DIA algorithms within MZmine 3 remains a work in progress, it has emerged as a well-known ecosystem for open MS data processing [11,18].
Even though MZmine 3 could not export processed DIA data to GNPS, it was used as the software platform in this pipeline to visualize the metabolic fingerprints of the pooled QC and crude extract samples, as well as to perform accurate level 3 annotation (Figure 3), providing an integrative MSE data analysis. The flexibility in setting up parameters is particularly beneficial for the sample alignment step, allowing researchers to make modifications and visualize new results without the need to reprocess previous steps, as required in MS-DIAL. The use of MZmine 3 therefore enabled us to generate precise metabolic fingerprint images of the crude extracts from both datasets. Manual annotation of metabolites matching the 'in-house' DBs and online spectral libraries, such as the MoNA spectral database, was also incorporated for increased confidence in the annotation of matched metabolites.
The great advantage of this integrated pipeline is the ability to reliably process the entirety of MSE data, providing optimal visualization of MS1 and MS2 raw and processed data within a user-friendly and open software environment. Even though both programs used here present limitations, we sought to benefit from their advantages to help overcome the bottlenecks of MSE data analysis with MN implementation. The extensive development of MZmine 3 is evident through its active GitHub community and frequent updates, showcasing its commitment to continuous improvement. This dynamic environment fosters innovation, and DIA implementation tools seem to be on the horizon. In contrast, MS-DIAL has seen slower recent development, less frequent updates and fewer data processing features and parameters, indicating the need for further improvements. However, the limitations of MS-DIAL do not diminish its effectiveness in performing the necessary tasks for DIA data analysis.
This workflow offers guidance to the community for handling LC-HRMS-DIA/MSE (and AIF) data, for which standardized protocols were previously lacking. Future software developments (e.g. DIA algorithms in MZmine 3) will build upon, rather than invalidate, the core foundations established here, as we have delineated key data handling steps for DIA workflow implementation, from data conversion to parameter tuning (see Supplementary Section). Overall, this provides an open-source framework to empower DIA-AIF/MSE users with customizable workflows for enhanced metabolomics analyses.
Specific parameter adjustments were performed to ensure reliable results for our MSE data during FBMN jobs, considering that most MN examples available are based on DDA data. Given the larger size of the dataset and the complexity of MSE, we carefully modified search parameters such as cosine score and number of matched fragment ions, as well as network organization parameters including TopK and maximum connected component size. It is crucial to fine-tune these parameters due to the lack of MS2 specificity in MSE data, where fragment ions originate from all co-eluted precursors (see Methods section). In particular, the TopK value directly influences the number of edges retained in the network and therefore the connections between nodes and the overall structure of the MN. We highlight the importance of considering appropriate values for TopK in the investigation of molecular families of any DIA or MSE data. In addition, DDA is generally less effective compared to DIA for low-abundance compounds [35]. Lowering cosine parameters is also applicable to DIA data, since the search criteria need to be less restrictive for matches to occur. The combination of these strategies provides a solid foundation for future improvements in metabolite identification and cluster analysis of DIA, AIF or MSE data.
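How the cosine threshold, the minimum number of matched fragment ions and TopK interact can be illustrated with a minimal sketch. It uses a plain cosine with greedy peak matching rather than the modified cosine and intensity scaling applied on GNPS, and all threshold values, function names and spectra below are assumptions for illustration, not the settings used in this study (those are given in the Methods section). An edge is retained only when each node appears among the other's TopK best-scoring neighbours.

```python
from math import sqrt


def cosine_score(spec_a, spec_b, tol=0.02):
    """Plain cosine between centroided spectra [(mz, intensity), ...].

    Returns (score, number of matched peaks); peaks are paired greedily.
    """
    pairs = sorted(
        ((ia * ib, i, j)
         for i, (ma, ia) in enumerate(spec_a)
         for j, (mb, ib) in enumerate(spec_b)
         if abs(ma - mb) <= tol),
        reverse=True)
    used_a, used_b, dot, matched = set(), set(), 0.0, 0
    for product, i, j in pairs:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            dot += product
            matched += 1
    norm = (sqrt(sum(i * i for _, i in spec_a))
            * sqrt(sum(i * i for _, i in spec_b)))
    return (dot / norm if norm else 0.0), matched


def build_edges(spectra, min_cos=0.6, min_matched=4, top_k=10):
    """Score all node pairs, then keep edges that are mutual TopK neighbours."""
    ids = sorted(spectra)
    scored = []
    for x, a in enumerate(ids):
        for b in ids[x + 1:]:
            score, matched = cosine_score(spectra[a], spectra[b])
            if score >= min_cos and matched >= min_matched:
                scored.append((score, a, b))
    neighbours = {n: [] for n in ids}
    for score, a, b in sorted(scored, reverse=True):  # best edges first
        neighbours[a].append(b)
        neighbours[b].append(a)
    top = {n: set(v[:top_k]) for n, v in neighbours.items()}
    return [(a, b, round(s, 3)) for s, a, b in scored
            if b in top[a] and a in top[b]]


# Four hypothetical nodes: three similar spectra and one unrelated spectrum.
spectra = {
    "n1": [(100.0, 10), (150.0, 50), (200.0, 100)],
    "n2": [(100.0, 12), (150.0, 45), (200.0, 100)],
    "n3": [(100.0, 10), (150.0, 55), (200.0, 90)],
    "n4": [(300.0, 100), (350.0, 20)],
}
print(build_edges(spectra, min_matched=3, top_k=1))
print(build_edges(spectra, min_matched=3, top_k=2))
```

With `top_k=1` only the single best mutual pair survives, whereas `top_k=2` retains all three edges of the similar trio, which shows how strongly this one parameter shapes the resulting molecular families.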
Furthermore, we recommend using more specific metabolite libraries in GNPS-like IQAMDB (IsoQuinoline and Annonaceous Metabolites Database) and NIH natural products-for broad metabolite coverage.DB-based annotation was consolidated with FBMN through feature metrics.For optimal annotation accuracy, curated, phylogeny-relevant libraries are preferable over comprehensive public counterparts.Targeted matching of detected metabolites to expected biosynthetic origins avoids erroneous assignments.Overall, harnessing biosynthetic knowledge through tailored libraries boosts reliability by connecting metabolites specifically to validated biological sources [36].
In this study, we rigorously demonstrate the utility and robustness of the DIA-IntOpenStream pipeline through its application to two distinct and carefully selected datasets, each chosen to showcase different aspects and capabilities of the workf low.In addition, the gas-phase fragmentation reactions were proposed for the different NP classes.The selection of dataset 1 was driven by its potential impact and applicability.Although, Ocotea spp.hold significant ethnobotanical importance, display promising medicinal potential and face taxonomic and ecological challenges.In addition, only a limited number of species within the genus have been chemically characterized.Given the high therapeutic potential of the Ocotea genus for drug discovery, there is an urgent need for NP chemical studies to support the bioprospecting use of Ocotea species, particularly those endangered in Brazil.Research topics focusing on the Ocotea genus have importance by themselves, and thus, this dataset also adds value and purpose to our study.
Dataset 2, a publicly available MS dataset of an actinobacterial extract spiked at high and low concentrations with 20 different chemical standards, was specifically chosen for a detailed comparative analysis with existing methods. The addition of known standards allows robust validation of the pipeline's annotation accuracy and efficiency. By using an actinobacterial extract, we also demonstrate the workflow's applicability to microbial metabolomics, an area of significant interest due to the role of microorganisms in environmental processes and human health. A comparative analysis with existing methods that have used this dataset highlights the improvements that DIA-IntOpenStream offers in terms of data processing efficiency, annotation accuracy and the ability to handle complex NP matrices. Several high-confidence annotations were achieved for both datasets. The results include a significant number of chemical annotations with level 2 confidence according to MSI guidelines [30,31,37,38], based on spectral matches against comprehensive spectral libraries of standard compounds (GNPS and MoNA).
Dataset 2, previously examined in a high-quality study [17], involved an advanced LC-HRMS analysis of complex NP mixtures. Among the strategies explored was the use of MS data acquired by DIA, specifically MSE. In that study, vendor software was used for data analysis, which was a very common approach at the time. We re-analyzed their DIA data with the DIA-IntOpenStream pipeline, and the results reinforced the pipeline's effectiveness. It allowed the annotation of many authentic chemical standards in the complex microbial NP matrix, a key indicator of its reliability. The initial study identified 18 high-abundance and 16 low-abundance standards. In comparison, our pipeline yielded similar results, with only two and three fewer standards at the high and low concentration levels, respectively (Figure S15 and Table S3).
On the other hand, our pipeline demonstrated enhanced efficacy in five main aspects. (i) FBMN matched seven authentic standards with annotation confidence level 2 (Table S3), surpassing the two standard matches in the original study, which used classical MN. (ii) We obtained an additional seven annotations with confidence level 2 for the original actinobacterial extract by using the built-in ActinomarineDB and manual data inspection (Figures S16-S19 and Table S4). (iii) FBMN analysis revealed various GNPS-matched standards and related compounds in the actinobacterial extract, including the annotation of rosamicin in the azithromycin standard cluster family. (iv) We uniquely detected important genus-specific compounds such as fluostatin A, kinobscurinone and γ-actinorhodin in the actinobacterial extract (Figures 9 and S14). (v) Lastly, our pipeline annotated 450 features at confidence level 3 (details published online at https://zenodo.org/records/10383866). As such, our approach not only aligns with existing literature data but also provides complementary insights. The robust comparison against an established benchmark dataset highlights the reliability and validity of DIA-IntOpenStream. Our commitment is to offer a freely accessible, robust tool that gives the metabolomics community independence from proprietary software.
As a final comment, the metabolite annotation of constitutional isomers, as observed in the Ocotea dataset, can be facilitated by using standard compounds in combined MS/MS experiments. However, the stereochemistry of compounds with a high degree of structural similarity demands additional characterization to confirm chemical identity, as in the case of the aporphines boldine and isoboldine, which exhibit the same parent ion and product ions (Figure 4 and Table 1). The relative proportions of the fragment ions formed differ and, under standardized MS conditions, might help to distinguish isomers and epimers for reliable spectral matching. Implementation of our pipeline enabled us to characterize the studied Ocotea species as mainly alkaloid producers. Multiple aporphine alkaloids bearing various substitution patterns were annotated with level 2 confidence. A range of glycosylated flavonoids was annotated as well. Lastly, a wide variety of lignoids was annotated with level 3 confidence (available online at https://zenodo.org/records/10383866). It is worth mentioning that the first report on the chemical composition of several of these Ocotea spp. endemic to Brazil was published in 2023 [39].
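The point about fragment-ion proportions can be illustrated with a plain cosine score. The sketch below is not the modified cosine used by GNPS and uses invented peak lists rather than measured boldine or isoboldine spectra; the function name and the 0.01 Da tolerance are assumptions made for illustration only.

```python
import math

def cosine_score(spec_a, spec_b, tol=0.01):
    """Plain cosine similarity between two centroided spectra.

    Each spectrum is a list of (m/z, intensity) tuples; peaks are paired
    greedily with the closest unused peak within `tol` Da.
    """
    used = set()
    dot = 0.0
    for mz_a, int_a in spec_a:
        best = None
        for j, (mz_b, _) in enumerate(spec_b):
            if j in used:
                continue
            delta = abs(mz_a - mz_b)
            if delta <= tol and (best is None or delta < best[0]):
                best = (delta, j)
        if best is not None:
            used.add(best[1])
            dot += int_a * spec_b[best[1]][1]
    norm_a = math.sqrt(sum(i * i for _, i in spec_a))
    norm_b = math.sqrt(sum(i * i for _, i in spec_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical isomers: identical product ions, different relative intensities
isomer_1 = [(100.0, 100.0), (150.0, 40.0), (200.0, 10.0)]
isomer_2 = [(100.0, 30.0), (150.0, 100.0), (200.0, 10.0)]

same = cosine_score(isomer_1, isomer_1)
cross = cosine_score(isomer_1, isomer_2)
```

Here the two invented spectra share every product ion, yet their cosine score falls well below that of a spectrum compared with itself. This is why intensity ratios can only discriminate isomers when acquisition conditions are standardized.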
Several metabolites not previously reported in the Ocotea genus were annotated at level 2 confidence. For instance, dehydronuciferine (24), annotated here in some Ocotea extracts, has only been documented in other plants, such as the sacred lotus Nelumbo nucifera (Nymphaeaceae). NP research on N. nucifera led to the isolation of dehydronuciferine together with other aporphines such as nuciferine (21) and nornuciferine (29), which are common in the Ocotea genus and are also reported by us in the present investigation. Likewise, the alkaloid leucoxylonine (37) is reported in the literature as produced by only two species of the Ocotea genus, Ocotea leucoxylon and Ocotea minarum [40,41]. In this work, it was annotated for the first time in other species, with high-intensity peak areas in the VZ, VA and PU Ocotea extracts (Table S2). In this manner, the present research helps to fill this gap and might contribute chemical characterization data to further taxonomic classification studies based on chemosystematic strategies.
Using a custom database of metabolites previously isolated from the same biological source greatly improves annotation confidence. Large databases can complement this approach but require careful analysis to avoid improbable assignments. Our 'in-house' DBs built in KNIME enabled high-confidence annotations, since the matched compounds had previously been isolated from the genera targeted in these studies. For biological samples such as urine and blood, a range of other databases is available, including HMDB (https://hmdb.ca/) and other online repositories. In addition, automated annotation with a higher level of confidence can also be performed in MS-DIAL with metabolomics MSP spectral kits or by directly exporting data to the GNPS platform and selecting available spectral libraries. Importantly for the quality of the analysis, manual and automated annotations were largely complementary. For example, only 18 level 2 annotations in dataset 1 were common to both strategies, indicating that both are relevant and that combining them can be highly effective.
In conclusion, the ongoing challenges around LC-HRMS/DIA analysis motivated us to build this pipeline. We believe it represents an advancement in the field, providing an accessible and efficient workflow for handling complex MSE data and conducting MN analyses. It can aid NP bioprospecting worldwide, as shown here by revealing the chemical diversity of plant and marine bacterial extracts. Also, the inclusion of known standards in dataset 2 allowed robust validation of the pipeline's annotation accuracy and efficiency. The use of both datasets highlighted DIA-IntOpenStream's versatility and potential in diverse metabolomics studies. By prioritizing accessibility and transparency, our pipeline ensures that all aspects of data analysis, including processing steps, parameters, software versions and computational setup, are precisely documented and available to the scientific community. This commitment to reproducibility fosters scientific progress and collaboration. Future work may integrate other valuable open tools for data preprocessing, MS/MS annotation and in silico fragmentation into this pipeline, such as the TidyMS Python library, SIRIUS software and MS-FINDER, respectively [42][43][44]. Overall, this pipeline embodies scientific rigor, and its implementation holds promise for speeding up chemical discoveries, ultimately guiding researchers toward a more collaborative and open environment for research.

Solvents, plant material and crude extract preparation
Details regarding the solvents used and sample preparation methods are provided in the Supplementary Material SM-1. Information on solvent sources, purity levels, plant material, maceration extraction conditions and sample-handling procedures is included.

Data acquisition and sample analysis
Chromatographic analysis was performed on an ultra-performance liquid chromatography-quadrupole time-of-flight tandem MS instrument (Xevo qTOF MS, Waters Corp., Milford, USA). Details concerning the QC preparation, chromatographic column, method and mobile phase system are described in Supplementary Material SM-2. The electrospray ionization (ESI) source operated in both positive and negative ion modes to capture a comprehensive range of analytes. MSE, a type of DIA analysis, was conducted using MassLynx™ (v4.2; Waters Corp., Milford, USA). The mass spectrometer and MSE acquisition parameters are fully detailed in Supplementary Material SM-2.

Public samples dataset
We validated our pipeline using a publicly available LC-HRMS-DIA/MSE dataset of a marine actinobacterial extract. This dataset, as described in the publication by Carnevale et al., was enriched with a pool of 20 authentic standards [17]. The chosen standards encompass a wide range of antimicrobial and chemotherapeutic agents, along with naturally occurring compounds, thereby providing a diverse chemical profile suitable for comprehensive analysis. The dataset was obtained from the MassIVE repository (MSV000088316) and is accessible through the Global Natural Products Social Molecular Networking (GNPS) platform.

Data processing and annotation workflow
The KNIME workflow and subsequent open software in the pipeline were executed on a Windows 11 desktop computer with a 12-core (8 used), 64-bit Intel Core i7-12700 processor (2.10 GHz) and 32 GB of RAM. The GPU was an NVIDIA T1000 (8 GB).

KNIME in-house database workflow
To perform the experiments, we developed a robust workflow to establish an integrated 'in-house' database for use within the MZmine and MS-DIAL data processing software, using KNIME (University of Konstanz, Zurich, Switzerland, version 4.6.5). KNIME (www.knime.org) is an open-source workflow system with a graphical user interface built on a set of nodes known as 'extensions' that process data and transmit it via connections between those nodes. Thus, KNIME provides a simple visual workbench that allows scientists to build and visualize complex workflows [45,46]. The workflow is publicly available online (https://hub.knime.com/-/spaces/-/~8bZEbbknV8tVptea/current-state/). Details regarding our custom 'in-house' DB are provided in SM-3. The 'in-house' database allows level 3 annotation following the guidelines of the MSI [30,31]. These annotations nevertheless carry added confidence, because the 'in-house' DB restricts candidates to metabolites previously isolated in the family or genus under study, which also speeds up annotation.

MS data conversion
To ensure compatibility, accessibility and comparability, the raw Waters MSE data from both independent datasets were converted to the widely used .mzML format using the recently developed open-source tool Waters2mzML 1.2.0, available on GitHub (https://github.com/AnP311/Waters2mzML) (SM-3). The generated .mzML files can be readily processed using MZmine 3 and MS-DIAL software for further analyses and interpretation.
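After conversion, a quick sanity check is to count how many spectra of each MS level a file contains. The sketch below is one possible way to do this with the Python standard library only; it relies on the mzML namespace and on the PSI-MS controlled vocabulary term MS:1000511 ('ms level'). How Waters2mzML assigns the low- and high-energy MSE functions to MS levels is not described here, so the check only reports what the file declares. The function name and the file name in the comment are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

MZML_NS = "{http://psi.hupo.org/ms/mzml}"
MS_LEVEL_ACCESSION = "MS:1000511"  # PSI-MS controlled vocabulary: "ms level"

def count_ms_levels(path):
    """Count spectra per declared MS level in an mzML file.

    The file is streamed with iterparse so that large files do not need
    to be loaded into memory at once.
    """
    counts = Counter()
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == MZML_NS + "spectrum":
            for cv in elem.iter(MZML_NS + "cvParam"):
                if cv.get("accession") == MS_LEVEL_ACCESSION:
                    counts[int(cv.get("value"))] += 1
                    break
            elem.clear()  # release the parsed spectrum
    return counts

# Example call (hypothetical file name):
# counts = count_ms_levels("sample_01.mzML")
```

A file that declares only one MS level after conversion would suggest that the high-energy function was not exported as expected and that the conversion settings should be reviewed before processing.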

Mzmine 3 data processing and analysis
The raw data containing peak area and Rt-m/z pairs of 71 Ocotea samples (60 Ocotea spp. sample extracts, four QCs, four blanks and three VI sample extracts [replicates]) and 17 samples from the actinobacterial extract (replicates and blanks), previously converted to .mzML format, were imported into MZmine 3.4.27 (https://mzmine.github.io/; MZmine Development Team). One QC and one blank sample replicate were excluded from processing due to larger Rt shifts compared to the other replicates. The detected peaks were deconvoluted, isotopes were removed, identical peaks in the different chromatograms were aligned, the remaining gaps were filled, duplicated features were filtered and the blank chromatograms were subtracted. Then, the features were annotated according to their monoisotopic masses. Data from each ionization mode were processed separately. The data processing parameters are fully detailed in the Supplementary Material SM-4.
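Annotation by monoisotopic mass can be sketched as a ppm-tolerance lookup against an in-house list. The example below is only an illustration of the principle and not the MZmine implementation: the function names, the 10 ppm tolerance and the restriction to [M+H]+ and [M-H]- ions are assumptions, and the three database masses were computed here from the molecular formulae (C19H21NO4 for boldine and isoboldine, C19H21NO2 for nuciferine).

```python
PROTON_MASS = 1.007276  # Da

def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6

def annotate_by_mass(feature_mz, database, polarity="positive", tol_ppm=10.0):
    """Return (name, ppm error) pairs for entries matching within tol_ppm.

    database: dict mapping compound name -> neutral monoisotopic mass (Da).
    Only [M+H]+ (positive) or [M-H]- (negative) ions are considered.
    """
    shift = PROTON_MASS if polarity == "positive" else -PROTON_MASS
    hits = []
    for name, mono_mass in database.items():
        error = ppm_error(feature_mz, mono_mass + shift)
        if abs(error) <= tol_ppm:
            hits.append((name, round(error, 2)))
    return sorted(hits, key=lambda hit: abs(hit[1]))

# Illustrative in-house entries (masses calculated from molecular formulae)
inhouse_db = {
    "boldine": 327.147059,
    "isoboldine": 327.147059,
    "nuciferine": 295.157229,
}
hits = annotate_by_mass(328.1545, inhouse_db)
```

For the hypothetical feature at m/z 328.1545, both boldine and isoboldine are returned, which shows why mass matching alone cannot separate isomers and why MS/MS evidence is needed for higher-confidence annotation.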
The processed MS data were then exported in .xlsx format. The Mass Bank of North America (MoNA) (https://mona.fiehnlab.ucdavis.edu/) was used for manual spectral comparisons and fragment MS data matches. These manual annotations were listed as level 2 according to the current MSI standards [31,37,38,47]. The processing modules and algorithms chosen were the standard ones, although this software offers an array of modern tools that can be used to improve data processing results [6,11,18].

MS-DIAL data processing and analysis
The 'Analysis Base File' (.abf) format, generated using the Reifycs Abf Converter software (https://www.reifycs.com/AbfConverter), is the traditional data format for MS-DIAL processing of MSE data aimed at MN implementation. However, to ensure maximum compatibility with other software, we chose a more universal approach by converting the data into .mzML format, enabling the parallel use of MZmine and MS-DIAL and the comparison of MS data between them.
The converted .mzML data were loaded into MS-DIAL version 4.9.2 (http://prime.psc.riken.jp/compms/msdial/main.html) for data processing, following a procedure similar to the one used with MZmine 3. In MS-DIAL, we configured the project settings according to our specific data requirements: ionization type (soft ionization), separation type (chromatography), method type (conventional DIA all-ions method, AIF), the experiment file (available in the Supplementary Material) and data type (i.e. centroid MS1 and MS/MS data). Positive and negative ion mode data must also be processed separately. The data processing parameters are fully detailed in the Supplementary Material (SM-5). All other parameters not mentioned were left at the software defaults. Subsequently, the data were uploaded to the GNPS server using the open FTP tool WinSCP (https://winscp.net/eng/download.php).

Molecular networking and metabolite annotation analysis
Metabolite annotation in our study involved a combination of automated and manual approaches (detailed in SM-6). Post-processing was performed by exporting the results from MS-DIAL, i.e. the 'MS2 File' (.mgf), the 'Feature Quantification Table' (.txt) and the metadata, to the GNPS (https://gnps.ucsd.edu/) environment. The metadata file was built in .txt format with the filenames and the respective attributes of species, sample type, region/state of plant collection and endemic occurrence in Brazil. These files are available online on the Zenodo platform (https://zenodo.org/records/10383866). FBMN was generated using the respective workflow in the GNPS ecosystem [48] with the FBMN parameters described in SM-6. Most of the metabolites were annotated at levels 2 and 3 according to the MSI. All combined FBMN jobs with level 3 annotated metabolites are listed in Tables S3 and S4 and can be found on the Zenodo platform (https://zenodo.org/records/10383866).
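A metadata table of this kind can be generated programmatically. The sketch below assumes the usual GNPS convention of a tab-separated file with a 'filename' column followed by columns prefixed with 'ATTRIBUTE_'; the function name, file names and attribute values are invented for illustration and do not reproduce the deposited metadata file.

```python
import csv

def write_gnps_metadata(path, samples):
    """Write a tab-separated metadata table and return its header.

    samples: dict mapping data file name -> dict of attribute name -> value.
    Missing attributes are written as 'NA'.
    """
    attributes = sorted({key for attrs in samples.values() for key in attrs})
    header = ["filename"] + [f"ATTRIBUTE_{name}" for name in attributes]
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(header)
        for filename, attrs in samples.items():
            writer.writerow([filename] + [attrs.get(name, "NA") for name in attributes])
    return header

# Hypothetical entries following the attributes listed in the text
example_samples = {
    "sample_01.mzML": {
        "species": "Ocotea sp.",
        "sample_type": "extract",
        "region": "state_A",
        "endemic": "yes",
    },
    "blank_01.mzML": {"sample_type": "blank"},
}
# write_gnps_metadata("metadata.txt", example_samples)
```

Keeping the file names identical to those of the uploaded .mzML files matters, because the metadata rows are joined to the quantification table by file name.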

Molecular networking visualization and interpretation
The generated networks from GNPS were downloaded and visualized using Cytoscape network software (version 3.8.2). The metadata-rich GNPS table, when opened in Cytoscape, can be exported as a .csv file. This facilitates semi-automated integration with the 'in-house' annotation. Subsequently, the annotated table can be reimported into the software to perform MN investigation and analysis.
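The semi-automated integration step can be sketched as a simple table join. The example below assumes that the exported node table carries Cytoscape's default 'shared name' column holding the feature identifier and that the in-house annotation table has a 'feature_id' column; both the annotation column names and the function name are hypothetical and should be adapted to the actual files.

```python
import csv

def merge_annotations(node_csv, annotation_csv, out_csv,
                      node_key="shared name", annot_key="feature_id"):
    """Append in-house annotation columns to an exported node table.

    Returns the number of nodes that received an annotation.
    """
    # Index the in-house annotations by feature identifier.
    with open(annotation_csv, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        extra_cols = [c for c in (reader.fieldnames or []) if c != annot_key]
        annotations = {row[annot_key]: row for row in reader}

    # Load the node table exported from Cytoscape.
    with open(node_csv, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        node_cols = list(reader.fieldnames or [])
        nodes = list(reader)

    # Write the merged table; unmatched nodes get empty annotation cells.
    matched = 0
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=node_cols + extra_cols)
        writer.writeheader()
        for node in nodes:
            hit = annotations.get(node[node_key])
            if hit is not None:
                matched += 1
            for col in extra_cols:
                node[col] = hit[col] if hit is not None else ""
            writer.writerow(node)
    return matched
```

The merged file can then be reimported into Cytoscape as a node table, so that in-house annotations are available for styling and inspection of the molecular families.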

Key Points
• An open, integrated workflow is presented that leverages both universal data formats (.mzML) and open-source software tools (KNIME, MZmine, MS-DIAL and GNPS) for enhanced DIA-MSE data handling.
• The workflow demonstrated its applicability by characterizing Ocotea crude plant and marine actinobacterial extracts, revealing the chemical diversity of different natural product classes.
• By promoting open science, the pipeline provides a framework to advance DIA-MSE data handling, transparency, reproducibility and analysis through integrative approaches, overcoming the limitations of commercial solutions.
• We aim to propel the field forward, empowering researchers to achieve more accessible MSE data processing, with a reliable annotation process, leveraging the potential of DIA-MS to drive the community toward further improvements.

Figure 8. Molecular families of aporphine and benzylisoquinoline alkaloids as well as the glycosylated flavonoid cluster families derived from the FBMN. Different alkaloids and flavonoids were annotated with levels 2 and 3 of confidence using GNPS and MoNA spectral matches and the 'in-house' OcoteaDB. ESI+: representative MSE spectrum and fragmentation product ions of the alkaloid reticuline. ESI−: clustering of predominantly O-glycosylated flavonoids identified across Ocotea spp. and their respective aglycones. Each node represents an MSE-acquired mass spectrum, and the edges connecting them show MS/MS fragmentation similarity (cosine > 0.6). The pie charts show the relative abundance in each Ocotea plant species (n = 60). Node diameters reflect the sum of the MS1 peak areas of the precursor ion in both positive (upper) and negative (lower) modes of ionization.

Figure 9. Molecular families of the actinobacterial extract derived from the FBMN at high and low concentrations of spiked authentic chemical standards. Different chemical standards were annotated with level 2 confidence using GNPS spectral matches, including azithromycin, tetracycline and doxycycline. Each node represents an MSE-acquired mass spectrum, and the edges connecting them show MS/MS fragmentation similarity (cosine > 0.6). The pie charts show the relative abundance in each sample (AE-H, actinobacterial extract spiked with a high concentration of standards; AE-L, actinobacterial extract spiked with a low concentration of standards; STD-H, chemical standards at high concentration; STD-L, chemical standards at low concentration). Node diameters reflect the sum of the MS1 peak areas of the precursor ion in positive ionization mode.

Table 1: Ocotea metabolites annotated with level 2 confidence in ESI-MSE positive and negative modes, based on the in-house database. This table includes 41 alkaloids (pyrrolidine, proaporphine, noraporphine, aporphine, benzylisoquinoline, morphinandienone and protoberberine subclasses), 6 lignoids (1 lignan and 5 neolignans), 18 flavonoids (glycosylated quercetin, kaempferol and apigenin derivative subclasses) and a cyclic polyol. The last two columns indicate the MS2 spectral source used for manual annotation and whether spectra were matched automatically on the GNPS platform, respectively.