ODNA: identification of organellar DNA by machine learning

Abstract
Motivation: Identifying organellar DNA, such as mitochondrial or plastid sequences, within a whole-genome assembly remains challenging and requires biological background knowledge. To address this, we developed ODNA, which combines genome annotation and machine learning for this task.
Results: ODNA is a software that classifies organellar DNA sequences within a genome assembly by machine learning, based on a predefined genome annotation workflow. We trained our model with 829,769 DNA sequences from 405 genome assemblies and achieved high predictive performance on independent validation data (e.g. a Matthews correlation coefficient of 0.61 for mitochondria and 0.73 for chloroplasts), significantly outperforming existing approaches.
Availability and implementation: Our software ODNA is freely accessible as a web service at https://odna.mathematik.uni-marburg.de and can also be run in a Docker container. The source code can be found at https://gitlab.com/mosga/odna and the processed data at Zenodo (DOI: 10.5281/zenodo.7506483).


Supplementary Information Text

Data Retrieval, Encoding and Data Analysis
The data used for the machine learning approach was retrieved from NCBI. First, we downloaded all data from the NCBI Organelle Genome Resources Database, release number 214 (see Figure S2). Second, we downloaded all genome assemblies from NCBI GenBank that are referenced through that database. Then, we generated the machine learning data by performing 405 genome annotations via MOSGA 2 (see Figure S4), resulting in 829,769 annotated sequences. Subsequently, we developed ODNA (see Figure S5) as a modified fork of MOSGA v.2.1.5 with an integrated machine learning model.
To encode the data, we used multiple tools, filtered the results (see ODNA parameters), and computed the features listed in Table S4. A Spearman and Pearson correlation analysis did not reveal a clear correlation between the target variable Organelle and any single feature (see Figure S6), indicating that more than one feature is required to classify organellar DNA sequences.
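The correlation screen above can be sketched as follows; this is a minimal illustration, not the actual analysis code, and the column names ("Mito", "PD", "Organelle") merely stand in for the real feature table of Table S4.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# Toy feature table standing in for the real results.csv;
# the random values are placeholders, not real annotation features.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Mito": rng.random(200),
    "PD": rng.random(200),
    "Organelle": rng.integers(0, 2, 200),  # binary target variable
})

# Rank-based (Spearman) and linear (Pearson) correlation of each
# feature against the target, as in Figure S6.
for feature in ("Mito", "PD"):
    rho, _ = spearmanr(df[feature], df["Organelle"])
    r, _ = pearsonr(df[feature], df["Organelle"])
    print(f"{feature}: Spearman rho={rho:.2f}, Pearson r={r:.2f}")
```

A weak per-feature correlation in such a screen is exactly what motivates a multivariate classifier instead of a single-feature threshold.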
Although the Random Forest model performed well, it was ultimately not used for ODNA since AdaBoost performed better. Nevertheless, the visualization of the Random Forest classifier's mean decrease in impurity in Figure S7 shows convincing results regarding each feature's contribution. The features Mito and Plastid, together with their densities MD and PD, have the greatest effect. To analyze the influence of these features on the model's performance, we removed Mito, Plastid, MD, and PD in one approach, resulting in worse classification performance. We observed a similarly reduced performance when using only these features. Only the combination of all features resulted in the best-performing models. Interestingly, MitoFinder primarily uses a database of mitochondrial genes to identify the mitogenome, which is similar to our Mito feature.
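Mean decrease in impurity, as visualized in Figure S7, falls out of a fitted Random Forest directly. The sketch below uses synthetic data in place of the real annotation features; the hyper-parameters are illustrative, not those of the paper's model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the annotation feature table (Table S4).
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ is the normalized mean decrease in impurity,
# averaged over all trees; values sum to 1.
importances = forest.feature_importances_
for i in np.argsort(importances)[::-1]:
    print(f"feature {i}: {importances[i]:.3f}")
```

Ranking features this way is what reveals that Mito, Plastid, MD, and PD dominate, while the ablation described above shows they are not sufficient on their own.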
Furthermore, we analyzed the taxonomic association of each genome assembly used for machine learning (see Figure S8). Unsurprisingly, we can confirm an overrepresentation of available genome data for some taxa, such as Chordata. This imbalance in the availability of genomic data is already well described in the literature.

Machine Learning
All machine learning computations were carried out with the scikit-learn library within a Jupyter Notebook in Python 3. The data was split in a stratified fashion, resulting in 663,455 training (80%) and 165,864 test (20%) entries. A standard scaler was fitted for all data, and training was performed using 10-fold cross-validation. To evaluate the models' performance, we optimized for the best Matthews correlation coefficient (MCC) according to Equation 1.
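The split-scale-validate procedure can be sketched with scikit-learn as below; the synthetic data and the AdaBoost hyper-parameters are placeholders, and only the shape of the pipeline (stratified 80/20 split, standard scaling, 10-fold cross-validation scored by MCC) mirrors the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 829,769-sequence feature table.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# Stratified 80/20 train/test split, preserving the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Standard scaling: zero mean, unit variance per feature.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# 10-fold cross-validation scored with the MCC (Equation 1).
mcc = make_scorer(matthews_corrcoef)
scores = cross_val_score(AdaBoostClassifier(random_state=0),
                         X_train_s, y_train,
                         cv=StratifiedKFold(n_splits=10), scoring=mcc)
print(f"mean CV MCC: {scores.mean():.3f}")
```

For reference, the MCC of a binary confusion matrix is the standard (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)), which scikit-learn's `matthews_corrcoef` implements.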
Hyper-parameterization was performed within the cross-validation (see Figure S9) and realized through a halving grid search and 100 randomized searches. To avoid bias toward a specific kind of model, we tested multiple machine learning models, including AdaBoost, Bagging, Decision Tree, Extra Trees, Gradient Boosting, K-nearest neighbors, multi-layer perceptron (MLP), and stochastic gradient descent (SGD).
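Both search strategies are available in scikit-learn as `HalvingGridSearchCV` and `RandomizedSearchCV`. The sketch below shows the mechanics for AdaBoost only; the parameter grid, distributions, and budgets are illustrative, not the ones actually used for ODNA.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
# HalvingGridSearchCV is still experimental and must be enabled explicitly.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import HalvingGridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=600, n_features=8, random_state=0)
mcc = make_scorer(matthews_corrcoef)

# Successive halving over a small illustrative AdaBoost grid.
halving = HalvingGridSearchCV(
    AdaBoostClassifier(random_state=0),
    {"n_estimators": [25, 50, 100], "learning_rate": [0.5, 1.0]},
    scoring=mcc, cv=5, random_state=0).fit(X, y)

# Randomized search drawing candidates from a distribution
# (the paper used 100 draws; 10 here for brevity).
rnd = RandomizedSearchCV(
    AdaBoostClassifier(random_state=0),
    {"n_estimators": randint(10, 200)},
    n_iter=10, scoring=mcc, cv=5, random_state=0).fit(X, y)

print(halving.best_params_, rnd.best_params_)
```

Running both searches per model family is what produces the "grid" and "rnd" suffixes seen in Figure S9.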
The best model was an AdaBoost classifier derived from the halving grid search hyper-parameterization. It achieves an MCC of 0.90 and an F1 score of 0.90 on the test dataset (see Figure S3). We also tested our model in a real-world use case with multiple eukaryotic genome assemblies across different taxonomic clades, comprising 14,514 sequences for mitochondria and 98,882 sequences for chloroplasts, which we considered the validation data.
Additionally, to ensure that our model does not apply only to the sequences from already learned taxa, we successfully classified the mitochondrial DNA from the protist Cafeteria burkhardae (GCA_008330645.1), whose respective taxonomical class was not represented in the training and test data.

Benchmark
To compare the capability of ODNA and MitoFinder in identifying organellar DNA sequences, we randomly selected ten genome assemblies that were used neither in training nor testing (see Machine Learning section). Since we assume that both tools should be able to identify the organellar or, respectively, the mitochondrial DNA, we tried to minimize the parameterization. Sequences within a genome assembly were ignored during the evaluation if they belonged to organelles other than mitochondria. Similarly, we checked the performance of ODNA in classifying chloroplast DNA on different unseen genome assemblies. All used software is listed in Table S5.
We ran this command exemplarily for each genome assembly, here for GCF_000001215.4:

/MitoFinder/mitofinder -j GCF_000001215.4 -a ~/VALIDATION/genomes/GCF_000001215.4.fasta -r /VALIDATION/reference.gb -p 32 -o 1 --ignore --override

ODNA parameters. The beauty of ODNA lies in the fact that it is neither necessary nor possible to parameterize ODNA. It is pre-configured, and users can only provide a valid genome FASTA or GenBank flat file (GBFF) with sequences. However, the underlying tools are parameterized to the extent that MOSGA 2 allows. Red: repeats were only considered with a minimal length of 100 bases. tRNAscan-SE 2.0: transfer RNAs with a minimal score of 70. barrnap: only rRNAs from the eukaryote kingdom, with a length cut-off of 80, a rejection rate of 25, and an E-value cut-off of 10^-6.

Computational resources
We used two servers from the German governmental deNBI cluster to perform all MOSGA genome annotations, each with 32 CPU cores and 64 GB of main memory. The required genome annotation time for the dataset creation is plotted against the genome size in Figure S10. Additionally, the machine learning computations and the MitoFinder and ODNA executions were run independently on one of these machines.

Fig. S1. Visualization of the technical workflow. Based on the selected DNA extraction method, organellar DNA will be extracted for a sequencing procedure and requires special treatment during annotation and submission. However, depending on data quality and reference knowledge, it is challenging to differentiate which sequence derives from an organelle. ODNA presents software that can classify organellar DNA based on specific annotation patterns.

Fig. S2. Data Retrieval and Machine Learning Workflow. 1.) A bash script downloads all whole mitochondrial and plastid genomes from the NCBI Organelle Genome Resources Database, extracts the FASTA sequence headers, and writes them into the data/organelles_ids.csv file. 2.) The identified sequence headers were searched in the NCBI Assembly database for corresponding genome assemblies containing organellar sequences in the summary description. 3.) All identified genome assemblies with a minimal and maximal threshold of included sequences were downloaded and annotated with one prepared MOSGA 2 configuration, including the relevant tools and parameterization. The organellar analysis summaries for each annotation run were stored, and the corresponding organellar sequences were marked according to the previously identified organelle-to-assembly association. 4.) All results not failing during the annotation pipeline were summarized in one results.csv file, which is further used as the dataset for machine learning. 5.) Machine learning training and evaluation are performed. The best model is finally stored in the best_model.pick file.

Fig. S8. Taxonomic distribution of the used genomes, categorized into kingdoms (a), phyla (b), and classes (c). Portions with less than two percent are summarized in the category "Others", while genomes with missing taxonomic information for a specific level are collected in the category "Missing". Species from the SAR clade, for example, do not have an assignment for a kingdom, phylum, or class.

Fig. S9. Hyper-parameterization. This violin plot shows the performance of all models generated with 10-fold cross-validation through hyper-parameterization. The suffix in the model name indicates whether HalvingGridSearchCV (grid) or RandomizedSearchCV (rnd) was used.