iDeLUCS: a deep learning interactive tool for alignment-free clustering of DNA sequences

Abstract

Summary: We present an interactive Deep Learning-based software tool for Unsupervised Clustering of DNA Sequences (iDeLUCS), which detects genomic signatures and uses them to cluster DNA sequences, without the need for sequence alignment or taxonomic identifiers. iDeLUCS is scalable and user-friendly: its graphical user interface, with support for hardware acceleration, allows the practitioner to fine-tune the different hyper-parameters involved in the training process without requiring extensive knowledge of deep learning. The performance of iDeLUCS was evaluated on a diverse set of datasets: several real genomic datasets from organisms in the kingdoms Animalia, Protista, Fungi, Bacteria, and Archaea, three datasets of viral genomes, a dataset of simulated metagenomic reads from microbial genomes, and multiple datasets of synthetic DNA sequences. The performance of iDeLUCS was compared to that of two classical clustering algorithms (k-means++ and GMM) and two clustering algorithms specialized in DNA sequences (MeShClust v3.0 and DeLUCS), using both intrinsic cluster evaluation metrics and external evaluation metrics. In terms of unsupervised clustering accuracy, iDeLUCS outperforms the two classical algorithms by an average of ∼20%, and the two specialized algorithms by an average of ∼12%, on the datasets of real DNA sequences analyzed. Overall, our results indicate that iDeLUCS is a robust clustering method suitable for the clustering of large and diverse datasets of unlabeled DNA sequences.

Availability and implementation: iDeLUCS is available at https://github.com/Kari-Genomics-Lab/iDeLUCS under the terms of the MIT licence.


Supplementary Material
The purpose of this document is to provide the user with all the information required to understand the functionality of the software tool iDeLUCS. The document is divided into three main parts: Appendix A (Software Documentation) contains information on how to use the tool and should be treated as a user manual, as it describes the training parameters, input formatting, data preparation, and the general functionality of the main tabs in the GUI. Appendix B (Methodology) describes the underlying theoretical principles behind iDeLUCS. Lastly, Appendix C (Performance Evaluation) contains a description of the datasets used to benchmark the tool, the testing protocols, and the results.

A.1 Settings Tab

This is the first interaction between the user and the tool, and an example of its configuration is illustrated in Figure 1. Several hyper-parameters are required for iDeLUCS to obtain the clustering assignment. The user may use the default values or select a specific value depending on the amount of information that is available about the dataset. All these parameters can be specified in the Settings tab of the program or through the command line. The following is a brief description of each parameter.
• sequence file: iDeLUCS accepts input in multi-FASTA format, but it does not accept multiple FASTA files. All the sequences to be clustered must be joined into a single FASTA file, and the path to this file must be provided as input in both the command line (CLI) and the graphical user interface (GUI) versions of iDeLUCS (see the sketch after this parameter list for one way to prepare the input files). The header of each sequence in the sequence file must be a unique sequence identifier, and each sequence must follow the IUPAC nomenclature code for nucleic acids.
• GT file: Tab-separated file with a hypothesized labelling for the training dataset. It must contain the following columns: sequence id, with the sequence identifiers, which must correspond to the identifiers provided in the headers of each sequence in the sequence file; and cluster id, with the ground truth assignments for each sequence. Note that this is an optional parameter. The hypothesized labels will not be used during training, only for post-hoc analysis.
• n clusters: Expected or maximum number of clusters to find. This quantity should be greater than or equal to the true number of clusters when a GT file is provided. (Default: 5; Range: [2,100]). The value 0 is used for automatically finding fine-grained clusters through HDBSCAN.
• n epochs: Number of training epochs. An epoch is defined as a training iteration over all the training pairs (mimics). (Default: 50; Range: [50,150]).
• batch sz: Number of data pairs the network receives simultaneously during training. A larger batch may improve convergence, but it may harm accuracy. (Default: 256; Range: [1,1024]).
• lambda: Hyper-parameter to control cluster balance. (Default: 2.8; Range: [1,3]). Use a lower value for highly imbalanced datasets and a value closer to 3 when perfectly balanced clusters are expected.
• weight: Hyper-parameter to control the relative weight of the different loss functions. (Default: 0.25; Range: [0,1]). Use a higher value when low intra-cluster distance is expected and a lower value when high intra-cluster variability is expected.
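The following is a minimal Python sketch of one way to prepare these two inputs. It is illustrative only: the file paths are hypothetical, each input file is assumed to be a plain FASTA file whose name is used as a stand-in cluster label, and the column headers sequence_id and cluster_id are an assumption about how the "sequence id" and "cluster id" columns are spelled in practice.

    import glob

    def merge_fasta(input_pattern, out_fasta="all_sequences.fas", out_gt="GT.tsv"):
        """Join several FASTA files into one multi-FASTA file and write a GT file."""
        with open(out_fasta, "w") as fasta_out, open(out_gt, "w") as gt_out:
            gt_out.write("sequence_id\tcluster_id\n")  # assumed column headers
            for path in sorted(glob.glob(input_pattern)):
                label = path.split("/")[-1].rsplit(".", 1)[0]  # file name as label
                with open(path) as handle:
                    for line in handle:
                        if line.startswith(">"):
                            seq_id = line[1:].strip().split()[0]  # unique identifier
                            gt_out.write(f"{seq_id}\t{label}\n")
                        fasta_out.write(line)

    merge_fasta("data/*.fasta")  # hypothetical input location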

A.2 Training Tab
In this tab users can monitor the status of the training process through qualitative information provided in the interactive plots (Figure 2). The left panel displays a summary of the main training parameters, as well as some statistics about the dataset under study. The center panel contains a dynamic figure representing how the assignments evolve as training progresses. In this figure, each point represents a sequence, and its position indicates the probability of its assignment to the different clusters: points located near the center have approximately equal probability of belonging to all clusters, whereas a point located near one of the vertices has a high probability of belonging to the cluster represented by that vertex, and is shown in the same color. The dynamic figure in the right panel displays how the value of the contrastive loss changes as a function of the training epochs for each model used during training. These learning curves allow the user to visualize and assess the effects of the selected hyper-parameters on the training process.

A.3 Results Tab
This tab becomes available to the user either after the training is complete or after it has been manually paused by the user (Figure 3). The left panel of the tab contains a table with the numeric cluster assignment for each sequence in the dataset, and a confidence score for each assignment. In the right panel, the tab displays a visual representation of the lower-dimensional space that was learned by the network (model) with minimum loss during training. If the ground truth was provided, the tab also displays the confusion matrix calculated using the Hungarian matching algorithm, as well as the unsupervised clustering accuracy (ACC). If a ground truth is not provided, the Silhouette Coefficient and the Davies-Bouldin Index, two other clustering metrics, are displayed for qualitative assessment of the clustering process. (See Appendix C.2 for a detailed description of the evaluation process.)

A.4 iDeLUCS Output
After the user selects the 'Save Results' option on the Results Tab, or after the execution of the CLI version of the software is completed, a folder with the results is provided as output to the user. The name of the folder is given according to the format: Results/sequence file mmm dd hh:mm:ss. The folder will contain the following files:

• assignments.tsv: File containing tabular data describing the cluster assignment of each sequence and the respective confidence score.

• metrics.tsv: File containing the value of the different clustering measures (ACC, Silhouette, Davies-Bouldin). When no ground truth is provided, only internal measures are calculated.
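A minimal sketch of a post-hoc inspection of these files with pandas follows. The folder name and the column names used below (assignment, confidence_score) are illustrative assumptions; adjust them to match the files produced by your run.

    import pandas as pd

    results_dir = "Results/all_sequences.fas_Jun_01_12:00:00"  # hypothetical folder

    assignments = pd.read_csv(f"{results_dir}/assignments.tsv", sep="\t")
    metrics = pd.read_csv(f"{results_dir}/metrics.tsv", sep="\t")

    # Cluster sizes and low-confidence assignments.
    print(assignments.groupby("assignment").size())             # assumed column name
    print(assignments[assignments["confidence_score"] < 0.5])   # assumed column name
    print(metrics)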

A.5 Software Implementation
iDeLUCS is fully implemented in Python 3.9, and the source code is publicly available in the GitHub repository https://github.com/Kari-Genomics-Lab/iDeLUCS. For circumstances where the user only has access to a headless server, the CLI version of iDeLUCS provides similar functionality to the GUI version of the software. All the neural networks are implemented using PyTorch 1.12, and the additional libraries used in the software are: numpy, pandas, scikit-learn, umap-learn, scipy and matplotlib. For the CLI application, the training parameters hold the same identifiers as in the GUI but must be provided in a single line according to the following notation:

    idelucs --sequence_file=<FASTA> --GT_file=<Optional Labelling> --n_clusters=3 --n_epochs=100 --scheduler=None --lambda=2.8 --batch_sz=512 --n_mimics=3 --weight=0.25

B Methodology

B.1 The Contrastive Learning Framework
Various contrastive algorithms have been developed for unsupervised tasks such as clustering and representation learning. Although they all differ, most of these algorithms share the same principles, now known as the contrastive learning framework [2]. In this framework, which is at the core of iDeLUCS, there are three major components:

• The construction of a dataset of paired data. At this stage, given a dataset $X = \{x_1, \ldots, x_n\}$, the objective is to construct the paired set $\hat{X} = \{(x_i, x_i^j)\}$, where the data points in each pair $(x_i, x_i^j)$ are considered similar according to some criterion based on prior knowledge, e.g., invariance to distortions or spatial proximity. Note that the samples $x_i$ and $x_i^j$ may not be present in the original dataset, and may both be augmented versions of the original sample $x_i$. This module is often referred to as the data augmentation module.
• The definition of a neural encoder. The goal is to learn, from the paired dataset, a mapping that encodes only what is common between $x_i$ and $x_i^j$, while dropping all the irrelevant information. If such a mapping $\Phi$ is found, the image $Y = \Phi(X)$ becomes a lower-dimensional representation of the original space $X$ and can be used for downstream tasks. The best candidate for $\Phi$ is then a deep neural network $\Phi_\theta$, as its parameters $\theta$ can be optimized through backpropagation.
• The contrastive loss function. Depending on the unsupervised learning task of interest, any "pretext" task that attempts to minimize the distance between representations of pairs of samples can be used as inspiration for the loss function. A general form of a potentially successful contrastive loss is described in [1] as
$$\mathcal{L} = \mathcal{L}_{\mathrm{alignment}} + w \, \mathcal{L}_{\mathrm{distribution}},$$
where $\mathcal{L}_{\mathrm{alignment}}$ encourages representations of paired samples to be consistent, $\mathcal{L}_{\mathrm{distribution}}$ encourages representations to match a target distribution, and $w$ is the weighting parameter defining the importance of each term in the final loss.

B.2 iDeLUCS Pipeline
The methodology proposed in iDeLUCS builds upon the pipeline proposed in [6]. In this subsection we illustrate how it fits into the contrastive learning framework, describe its main components, and contrast them against the pipeline in [6].

B.2.1 Data Augmentation
The data augmentation module in [6] corresponds to the creation of m artificial mimic sequences per original sequence, using a probabilistic model based on transitions and transversions, while preserving the original sequences. The stochastic data augmentation model proposed in iDeLUCS transforms every DNA sequence in the dataset. This process produces two artificially created training samples, which are considered a positive training pair even though they are not present in the original training dataset. We apply three different augmentations sequentially: two random DNA substitution mutations, i.e., transitions and transversions with fixed independent substitution probabilities $p_{ts} = 10^{-3}$ and $p_{tv} = 5^{-3}$, respectively, and a random assignment of $r = 20$ nucleotides to the symbol "N" (representing an unidentified nucleotide). This composition of augmentations incorporates robustness into the model and allows the networks to learn the structure of more complex datasets. Each augmented training sequence is then converted into a numerical vector containing the counts of all of its k-mers, where a k-mer is defined as a subsequence of length k that does not contain the symbol "N". Finally, each k-mer count vector is converted into a k-mer frequency vector, by dividing its k-mer counts by the total length of the sequence minus the number of "N" symbols.
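The following Python sketch illustrates this augmentation and vectorization pipeline. It is a simplified illustration, not the iDeLUCS implementation: the value k = 6 is an assumption, and non-ACGT IUPAC symbols are simply left unmutated and excluded from the k-mer counts.

    import random
    from itertools import product
    import numpy as np

    TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
    TRANSVERSIONS = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}

    def augment(seq, p_ts=1e-3, p_tv=5**-3, r=20):
        """Apply transitions, transversions, and N-masking to one sequence."""
        bases = list(seq)
        for i, b in enumerate(bases):
            if b not in TRANSITION:
                continue
            u = random.random()
            if u < p_ts:
                bases[i] = TRANSITION[b]                    # transition
            elif u < p_ts + p_tv:
                bases[i] = random.choice(TRANSVERSIONS[b])  # transversion
        for i in random.sample(range(len(bases)), min(r, len(bases))):
            bases[i] = "N"                                  # unidentified nucleotide
        return "".join(bases)

    def kmer_frequencies(seq, k=6):
        """k-mer frequency vector; k-mers containing "N" are skipped."""
        index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
        counts = np.zeros(len(index))
        for i in range(len(seq) - k + 1):
            idx = index.get(seq[i:i + k])   # None for k-mers with N or other codes
            if idx is not None:
                counts[idx] += 1
        return counts / max(len(seq) - seq.count("N"), 1)

    seq = "ACGT" * 2000  # toy sequence for illustration
    x1, x2 = kmer_frequencies(augment(seq)), kmer_frequencies(augment(seq))  # positive pair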

B.2.2 Neural Network -Base Encoder
We divide our architecture into a base encoder $f_\theta(\cdot)$, which extracts a meaningful lower-dimensional representation, and a clustering layer that produces the soft cluster assignments. We employ the same architecture constructed in [6] as the backbone for iDeLUCS. For a mini-batch $X_{MB}$ of $2N$ augmented k-mer vectors (one pair $(x, \hat{x})$ per original sequence), all the k-mer vectors are passed through the encoder, which consists of two fully connected layers, Linear (512 neurons) and Linear (64 neurons), each one followed by a Rectified Linear Unit (ReLU) and a Dropout layer with dropout rate 0.5. The hidden representation $z_i = f_\theta(x_i)$ is then passed through the clustering layer Linear (c clusters), where c is a numerical parameter representing the upper bound on the number of clusters, followed by a Softmax activation function.
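The following PyTorch sketch follows the architecture described above. The input dimension assumes 6-mer frequency vectors ($4^6 = 4096$ features) and is an assumption; the exact value depends on the chosen k.

    import torch
    import torch.nn as nn

    class Backbone(nn.Module):
        def __init__(self, in_dim=4096, n_clusters=5):
            super().__init__()
            self.encoder = nn.Sequential(        # base encoder f_theta
                nn.Linear(in_dim, 512), nn.ReLU(), nn.Dropout(0.5),
                nn.Linear(512, 64), nn.ReLU(), nn.Dropout(0.5),
            )
            self.clustering = nn.Sequential(     # clustering layer
                nn.Linear(64, n_clusters), nn.Softmax(dim=1),
            )

        def forward(self, x):
            z = self.encoder(x)                  # 64-dim hidden representation z_i
            return z, self.clustering(z)         # (z_i, soft cluster assignments)

    model = Backbone()
    z, probs = model(torch.rand(256, 4096))      # one mini-batch of k-mer vectors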

B.2.3 Contrastive Loss Function
The pipeline in [6] uses the negative weighted mutual information as the loss function:
$$\mathcal{L}_{\mathrm{MI}} = -\,I_\lambda\big(\Phi(x), \Phi(\hat{x})\big) = H\big(\Phi(x) \mid \Phi(\hat{x})\big) - \lambda\, H\big(\Phi(x)\big). \quad (1)$$
Note that Eq (1) fits the description of the general contrastive loss introduced in B.1. With the minimization of the loss function, the conditional entropy term $H(\Phi(x) \mid \Phi(\hat{x}))$ should be as small as possible, resulting in the assignment of $\hat{x}$ being perfectly predictable from that of $x$. The entropy term can be interpreted as the Kullback-Leibler divergence between the output distribution and a uniform distribution over the cluster assignments.

Furthermore, we observe that it is possible to learn the cluster assignments simultaneously with a hidden representation that allows the computation of distances between samples in different clusters. For this purpose, iDeLUCS combines the negative of the weighted mutual information with a loss function that enforces the consistency of the intermediate representations computed during training. We use the Normalized Temperature-scaled Cross Entropy (NT-Xent) for this purpose:
$$\ell_{i,j} = -\log \frac{\exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\!\big(\mathrm{sim}(z_i, z_k)/\tau\big)}. \quad (2)$$
Here, $\mathrm{sim}(u, v) = u^\top v / (\lVert u \rVert \lVert v \rVert)$ denotes the cosine similarity between two representations, and $\tau$ is a "temperature" parameter used to restrict the range of the similarity scores; we use $\tau = 1$. The combined loss function in iDeLUCS can be written as:
$$\mathcal{L} = (1 - w)\, \mathcal{L}_{\mathrm{MI}} + w\, \mathcal{L}_{\mathrm{NT\text{-}Xent}}. \quad (3)$$
The hyper-parameter $w$ in Eq (3) controls the weight of the different loss functions. Figure 4 illustrates how iDeLUCS incorporates the additional contrastive term into the final loss to enforce the consistency of the hidden representations learned by the artificial neural networks. This provides robustness with respect to unbalanced datasets, as the learned representations of sequences in the same cluster are close to each other, but far from sequences in other clusters; see C.3 for more details.
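The following PyTorch sketch illustrates the two loss components. The NT-Xent function follows Eq (2); the mutual information term follows the IIC-style formulation of Eq (1); and the combination at the end is a sketch of Eq (3). The exact forms used by iDeLUCS may differ from this sketch.

    import torch
    import torch.nn.functional as F

    def nt_xent(z1, z2, tau=1.0):
        """NT-Xent over N positive pairs (z1[i], z2[i]); sketch of Eq (2)."""
        z = F.normalize(torch.cat([z1, z2]), dim=1)   # 2N unit-norm representations
        sim = z @ z.t() / tau                         # cosine similarity matrix
        sim.fill_diagonal_(float("-inf"))             # exclude self-similarity
        n = z1.shape[0]
        targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
        return F.cross_entropy(sim, targets)

    def neg_weighted_mi(p1, p2, lam=2.8):
        """Negative weighted mutual information over soft assignments; Eq (1) sketch."""
        P = (p1.t() @ p2) / p1.shape[0]               # joint over cluster pairs
        P = ((P + P.t()) / 2).clamp_min(1e-9)         # symmetrize, avoid log(0)
        Pi = P.sum(dim=1, keepdim=True)               # marginal, first view
        Pj = P.sum(dim=0, keepdim=True)               # marginal, second view
        return (P * (lam * torch.log(Pi) + lam * torch.log(Pj) - torch.log(P))).sum()

    # Sketch of the combined objective of Eq (3), with weight w in [0, 1]:
    # loss = (1 - w) * neg_weighted_mi(p1, p2) + w * nt_xent(z1, z2)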
Figure 4: iDeLUCS maximizes the mutual information between the soft assignments $\sigma, \hat{\sigma}$ of the augmentations $x, \hat{x}$ of each training sequence after the random mapping $t$, while maximizing the similarity of the hidden representations $z$ and $\hat{z}$.

B.2.4 Information Theoretic Clustering Ensemble
For a given dataset $X = \{x_1, \ldots, x_N\}$, a partition $\pi$ can be represented as a set of $K$ clusters $\pi = \{L_1, \ldots, L_K\}$, such that $\pi(x_i)$ denotes the cluster label assigned to $x_i$ by the partition. Suppose we are given a set $\Pi = \{\pi_1, \ldots, \pi_H\}$ of $H$ partitions of the dataset $X$. The problem of clustering combination is to find the consensus partition $\pi_C$ that best summarizes the information present in $\Pi$.
In general, the combination of multiple partitions in an unsupervised setting is a challenging problem, as each partition in the combination is represented as a set of labels assigned by an independent clustering algorithm, with no trivial mapping between the assignments. Here we provide a detailed explanation of the concepts used in iDeLUCS, which are a combination of the work presented in [7, 10, 11]. We use an information-theoretic approach to the problem that does not compute an explicit mapping between the assignments. In this framework, the quality of the consensus partition $\pi_C = \{C_1, \ldots, C_K\}$ is determined by the amount of information it shares with all the partitions in $\Pi$. The best possible partition is then determined by
$$\pi_C^{*} = \arg\max_{\pi_C} \sum_{i=1}^{H} I(\pi_C, \pi_i),$$
where $I(\cdot, \cdot)$ is the classical Shannon mutual information between partitions. The previous optimization problem represents a difficult combinatorial problem [10]. However, the work in [11] shows that it is possible to consider a generalized definition of mutual information to simplify the problem. The generalized entropy of degree $s$ for a discrete probability distribution $P = (p_1, \ldots, p_n)$ is defined as
$$H^{s}(P) = \big(2^{\,1-s} - 1\big)^{-1} \left( \sum_{i=1}^{n} p_i^{\,s} - 1 \right).$$
Hence, the generalized quadratic mutual information ($s = 2$) becomes:
$$I^{2}(\pi_C, \pi_i) = H^{2}(\pi_C) - H^{2}(\pi_C \mid \pi_i),$$
which can be computed with empirical estimates of the cluster probabilities [10]. This is relevant because in [7] Mirkin showed that a solution of the optimization problem of this utility function can be obtained through a transformation of the categorical labels into standardized binary features. iDeLUCS uses the same transformation: it replaces each partition $\pi_i$ by $K$ binary features and standardizes each binary feature to zero mean. More specifically, for each data point $x$ and each partition $\pi_i \in \Pi$, the values of the new features are calculated as
$$y_{ij}(x) = \mathbb{1}_{[\pi_i(x) = j]} - \bar{y}_{ij},$$
where $\mathbb{1}_{[x=y]} \in \{0, 1\}$ is an indicator function, evaluating to 1 if and only if $x = y$, and $\bar{y}_{ij}$ is the mean of the $j$-th binary feature of partition $\pi_i$ over the dataset. The final solution of the consensus partition problem can be obtained by a classic clustering algorithm operating over the new features $y_{ij}$. This clustering ensemble technique introduces robustness into the method and provides a better estimate of the confidence score of iDeLUCS for each sequence in the dataset.
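A minimal sketch of this consensus step follows, assuming the assignments of the H voters are available as integer label arrays; k-means over the standardized binary features stands in for the final clustering algorithm.

    import numpy as np
    from sklearn.cluster import KMeans

    def consensus(partitions, K):
        """Mirkin-style consensus: standardized binary membership features + k-means."""
        feats = []
        for labels in partitions:                    # one label array per voter
            labels = np.asarray(labels)
            for c in np.unique(labels):
                y = (labels == c).astype(float)      # binary indicator feature
                feats.append(y - y.mean())           # standardize to zero mean
        Y = np.stack(feats, axis=1)                  # data points x new features
        return KMeans(n_clusters=K, n_init=10).fit_predict(Y)

    voters = [np.random.randint(0, 3, 100) for _ in range(5)]  # five toy partitions
    final_labels = consensus(voters, K=3)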

B.3 iDeLUCS + HDBSCAN
It is possible that the number of expected clusters is unknown, in which case non-parametric clustering tools that automatically identify the number of clusters may be preferred. Fortunately, the contrastive learning framework can be seamlessly integrated with non-parametric clustering algorithms when some homology is expected. In this context, we have augmented iDeLUCS with an additional option to infer the number of clusters using the classical non-parametric clustering algorithm HDBSCAN [5]. To achieve this, we set the invariant information component of the loss in Equation 3, which is responsible for the network's cluster assignments, to zero, so that the predominant component becomes the SimCLR loss. The learned 64-dimensional latent features are then used as input to HDBSCAN to compute the final clustering. Since HDBSCAN is a density-based method, we recommend using this feature only when the resulting clusters are expected to correspond to the lowest possible taxonomic level.
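A minimal sketch of this option using the hdbscan package follows; the embedding matrix and the min_cluster_size value below are illustrative assumptions.

    import numpy as np
    import hdbscan

    Z = np.random.rand(1000, 64)            # stand-in for the learned 64-dim features
    clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
    labels = clusterer.fit_predict(Z)       # -1 marks points labelled as noise
    print(len(set(labels) - {-1}), "clusters found")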

C Performance Evaluation

C.1 Sample Datasets
Besides the datasets provided in [6], users can test the performance of iDeLUCS on 3 additional mitochondrial datasets obtained from NCBI in June 2022; one dataset of metagenomic reads simulated from eight microbial genomes using the Pacific Biosciences SMRT error model for long metagenomic reads; and 14 datasets of artificial DNA sequences created by first generating random template sequences and then mutating them using various identity scores, as provided by [4]. The dataset composition is summarized in Tables 1, 2, and 3.

C.2 Evaluation Metrics
Clustering results can be evaluated post hoc using both internal and external validation methods, depending on the information available at test time. For the application domain of this paper (genomic datasets to be clustered according to their taxonomy), external evaluation methods seem more appropriate, as agreement with the ground truth (taxonomic groups) is ultimately a more important measure than internal cluster properties such as separation or compactness. We select the unsupervised clustering accuracy (ACC) as the main external measure. It utilizes the optimal mapping $f$ between discovered clusters $c_i \in C = \{c_i \mid i = 1, \ldots, Q\}$ and ground truth cluster labels $l_i \in L = \{l_i \mid i = 1, \ldots, K\}$, as provided by the Hungarian algorithm, calculated using the contingency matrix $A = \{a_{ij}\}$, where $a_{ij}$ represents the number of sequences with label $l_i$ assigned to cluster $c_j$. The metric is formally defined as:
$$\mathrm{ACC} = \frac{1}{n} \sum_{i=1}^{n} O\big(l_i, f(c_i)\big),$$
where $n$ is the total number of sequences, and for each DNA sequence $x_i$, $1 \leq i \leq n$, $f(c_i)$ is the taxonomic label assigned to cluster $c_i$ by the optimal mapping $f$, and $O$ is a comparison operator returning 1 if the equality in the argument holds and 0 otherwise [6]. A value of 100% stands for a perfect match, and a value of 0% indicates that all samples were wrongly assigned; a larger value indicates a better match with the ground truth. That being said, we also provide two additional external clustering metrics and two internal clustering metrics to complement the evaluation process and to provide a quantitative assessment of the performance of iDeLUCS for cases where the ground truth of the training data is not available.

Table 1: Summary of the six mitochondrial DNA datasets of vertebrates, two bacterial datasets, and three viral datasets from [6] included in this study. (Columns: Dataset; Total no. sequences.)
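For reference, the ACC computation defined above can be sketched with the Hungarian solver available in scipy (a dependency of iDeLUCS):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def unsupervised_accuracy(true_labels, cluster_labels):
        """ACC via the Hungarian algorithm on the contingency matrix A."""
        classes, y = np.unique(true_labels, return_inverse=True)
        clusters, c = np.unique(cluster_labels, return_inverse=True)
        A = np.zeros((len(clusters), len(classes)), dtype=int)
        for ci, yi in zip(c, y):
            A[ci, yi] += 1                      # a_ij: label l_j in cluster c_i
        rows, cols = linear_sum_assignment(-A)  # optimal mapping f (maximization)
        return A[rows, cols].sum() / len(y)

    print(unsupervised_accuracy(["a", "a", "b", "b"], [1, 1, 2, 0]))  # 0.75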
• Homogeneity (higher is better): This external evaluation metric measures the extent to which each cluster contains only samples belonging to a single class [8]. It is defined as:
$$h = \begin{cases} 1 & \text{if } H(L) = 0, \\ 1 - \dfrac{H(L \mid C)}{H(L)} & \text{otherwise,} \end{cases}$$
where $H(L \mid C)$ is the conditional entropy of the class distribution given the proposed clustering and $H(L)$ is the entropy of the true class labels. The entropies are calculated as:
$$H(L \mid C) = -\sum_{c=1}^{|C|} \sum_{l=1}^{|L|} \frac{a_{cl}}{n} \log \frac{a_{cl}}{a_c}, \qquad H(L) = -\sum_{l=1}^{|L|} \frac{n_l}{n} \log \frac{n_l}{n},$$
where $a_{cl}$ is the number of sequences of true class $l$ in cluster $c$, $a_c$ is the number of sequences in cluster $c$, and $n_l$ is the number of sequences in true class $l$. Note that the homogeneity score ranges between 0 and 1, with 1 indicating perfect homogeneity. As noted by the authors in [8], in the perfectly homogeneous case $H(L \mid C) = 0$, since each cluster contains members of a single class.
• Completeness (higher is better): This external evaluation metric measures whether all data points that belong to a given class are assigned to the same cluster. In other words, a clustering result satisfies completeness if all data points from a single class are assigned to a single cluster. This metric, defined in [8], is symmetrical to homogeneity:
$$c = \begin{cases} 1 & \text{if } H(C) = 0, \\ 1 - \dfrac{H(C \mid L)}{H(C)} & \text{otherwise,} \end{cases}$$
where the entropies are calculated analogously to those for homogeneity, with $a_c$ the number of sequences in cluster $c$ and $n_l$ the number of sequences in true class $l$. The completeness score ranges from 0 to 1, with higher values indicating better clustering performance.
• Silhouette Coefficient (higher is better): This measure compares the cluster assignment of a sequence with the assignment of the closest sequence assigned to a different cluster. Specifically,
$$s = \frac{1}{N} \sum_{i=1}^{N} \frac{b_i - a_i}{\max(a_i, b_i)},$$
where $N$ is the number of sequences in the dataset, $b_i$ is the distance between the representation of sequence $i$ and the representation of the nearest sequence in a cluster it does not belong to, and $a_i$ is its mean intra-cluster distance. The best possible score is 1, and the score decreases as the overlap between the clusters increases. Negative scores indicate that most of the sequences have been placed in a wrong cluster, with the worst possible value being -1.
• Davies-Bouldin Index (lower is better): Defined as the average similarity of each cluster with its most similar cluster, it measures the average distance between clusters, relative to their sizes:
$$DB = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \frac{s_i + s_j}{d_{ij}},$$
where $s_i$ is the average distance between each point of cluster $i$ and the centroid $c_i$, and $d_{ij}$ is the distance between cluster centroids $c_i$ and $c_j$. Thus, an assignment with high inter-cluster distance and low intra-cluster distance will result in a value closer to zero, which indicates a better score. (A sketch computing all four auxiliary metrics follows this list.)
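All four of the above auxiliary metrics are available in scikit-learn, which is already a dependency of iDeLUCS; a minimal sketch with toy stand-in data:

    import numpy as np
    from sklearn.metrics import (homogeneity_score, completeness_score,
                                 silhouette_score, davies_bouldin_score)

    X = np.random.rand(200, 64)              # stand-in learned representations
    pred = np.random.randint(0, 5, 200)      # cluster assignments
    truth = np.random.randint(0, 5, 200)     # hypothesized ground-truth labels

    print("Homogeneity:   ", homogeneity_score(truth, pred))    # external, higher better
    print("Completeness:  ", completeness_score(truth, pred))   # external, higher better
    print("Silhouette:    ", silhouette_score(X, pred))         # internal, higher better
    print("Davies-Bouldin:", davies_bouldin_score(X, pred))     # internal, lower better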

C.3 Performance Evaluation
To assess the performance of iDeLUCS on the new datasets, we compare it against two classic clustering algorithms. The comparison is summarized in Table 4, and an example of the output confusion matrices is illustrated in Figure 5. In addition, Tables 5a and 5b summarize the comparison between iDeLUCS and all the other clustering algorithms on the eleven benchmarking datasets. Table 6 contains the comparison on the dataset of simulated metagenomic reads from [12]. Table 7 contains the comparison on the synthetic datasets from [3]. Finally, Table 8 contains the performance evaluation of iDeLUCS using a CPU cluster.
Although iDeLUCS performs better overall on balanced datasets, both the improved clustering ensemble and the new contrastive loss function provide robustness with respect to unbalanced datasets, as Eq (3) is not dominated by the entropy term in favor of a uniform output distribution.
Specific improvements of iDeLUCS over the previous pipeline were also explored. In particular, we first explored the impact of the improved contrastive loss. For that purpose, we trained a single network using the loss function in Eq (3) over the new datasets. Figure 6 illustrates how enforcing the consistency of the hidden representations provides robustness with respect to unbalanced datasets, as the network learns an embedding where the representations of sequences in the same cluster are close to each other, but far from sequences in other clusters. We then compared the results against a single network trained using Eq (1) over the benchmarking datasets provided in [6]. Figure 7 shows how the newly introduced loss leads to better results both in terms of accuracy and convergence. Note that, unlike the previous pipeline, iDeLUCS does not require the introduction of external noise on the parameters during training, as the model is less susceptible to getting trapped in local optima. Finally, we explored the impact of the clustering ensemble model introduced in iDeLUCS. For that purpose we trained a model with five voters, and compared the results of the final clustering ensemble with those of a majority voting scheme similar to the one used in [6]. Figure 8 shows how the newly introduced clustering ensemble produces better mean accuracy and lower variance for all the datasets.

Note: Reproducing machine learning experiments is a hard task. Completely reproducible results are not guaranteed across PyTorch releases, individual commits, or different platforms. Furthermore, results may not be reproducible between CPU and GPU executions, even when using identical seeds, as stated in PyTorch's official documentation. That being said, users may attempt to obtain results similar to the ones reported in this paper by using the default parameters. Additional information about extra hyper-parameters and test scripts can be found in the Examples folder of the paper repository. All of the tests were performed on one of the nodes of the Beluga cluster of the Digital Research Alliance of Canada (16 x Intel Gold 6148 Skylake @ 2.4 GHz CPU, 32 GB RAM) with an NVIDIA V100SXM2 GPU (16 GB memory).

Table 4: Comparison of the performance of iDeLUCS against the k-means++, GMM, DeLUCS and MeShClust v3.0 clustering algorithms on the new mtDNA datasets, using intrinsic cluster evaluation metrics (Davies-Bouldin Index, Silhouette coefficient) and external evaluation metrics (Homogeneity, Completeness, unsupervised accuracy ACC), as well as time and memory. Boldface indicates the best result, (↑)/(↓) indicate that higher/lower is better, and "balanced" indicates the balanced version of the datasets. "MeShClust -auto" denotes MeShClust v3.0 run with the option of automatic identification of the identity threshold parameter, and "MeShClust -p" denotes MeShClust v3.0 run with a manually optimized identity threshold p ∈ [0.5, 0.9].

Figure 1: Two snapshots of the Settings Tab. (a) displays the "Basic" settings panel, where the users may select the parameters as they would for any classic clustering algorithm. (b) displays the "Advanced" settings panel, where the users can select all the training hyper-parameters specific to iDeLUCS.

Figure 2: Snapshot of the training tab of iDeLUCS as it learns to cluster 9,027 mitochondrial genomes of insects into 7 different clusters. The left panel displays a summary of the main training parameters, as well as some statistics about the dataset under study. The center panel contains a qualitative assessment of the learning progress. The right panel contains a dynamic plot with the learning curves of the different models. Four models have been trained for thirty epochs each, and the training process of the fifth model is going through the third epoch.

Figure 5: Confusion matrices with maximal accuracy obtained from the consensus clustering assignment and calculated via the Hungarian algorithm. The predicted labels are numeric cluster assignments, but are omitted for readability.

Figure 6: Comparison of the learned embedding with the original representation of the newly introduced imbalanced mtDNA datasets. iDeLUCS maps each sequence into a lower-dimensional space that encodes the underlying taxonomy of the sequences for datasets with different compositions.

Figure 7: Comparison of the performance of iDeLUCS against the pipeline proposed in [6] on 11 benchmark datasets. (a) Box plot representing the distribution of the unsupervised clustering accuracy produced by 100 neural networks trained independently on each dataset. Although both DeLUCS and iDeLUCS produce an output with high variance, the mean ACC of the networks trained using the methodology of iDeLUCS is higher for ten out of eleven datasets. (b) Contrastive loss as a function of the training epoch for 100 runs of the training algorithm on the Vertebrata dataset. (c) Unsupervised clustering accuracy as a function of the training epoch for 100 runs of the training algorithm on the Vertebrata dataset.

Figure 8: Box plot representing the performance of the clustering ensemble of iDeLUCS against the majority voting in [6]. Fifty models with five voters were trained over the eleven benchmark datasets using both strategies. The clustering ensemble used by iDeLUCS outperforms the majority voting in [6] in all datasets.

Table 2: Summary of the three new mitochondrial DNA datasets and one dataset of simulated metagenomic reads from eight microbial genomes introduced by [12]. Note that there is a balanced version of each new dataset (Fungi, Protists, Insects), where the number of sequences per cluster in the balanced version was selected according to the number of sequences available in the smallest cluster.

Table 3: Summary of the twelve synthetic datasets from [3] included in the study. The number in the name of each dataset represents an identity score threshold, indicating that each sequence in a cluster is within this threshold from the cluster center.

Table 5a: Comparison of the performance of several clustering algorithms on the benchmark datasets introduced by [6], using both internal and external clustering evaluation measures. Boldface indicates the best result, (↑)/(↓) indicates higher/lower is better. "MeShClust -auto" denotes MeShClust v3.0 run with the option of automatic identification of the identity threshold parameter, and "MeShClust -p" denotes MeShClust v3.0 run with a manually optimized identity threshold p ∈ [0.5, 0.9].

Table 5b: Comparison of the performance of several clustering algorithms on the benchmark datasets introduced by [6], using both internal and external clustering evaluation measures. Boldface indicates the best result, (↑)/(↓) indicates higher/lower is better. "MeShClust -auto" denotes MeShClust v3.0 run with the option of automatic identification of the identity threshold parameter, and "MeShClust -p" denotes MeShClust v3.0 run with a manually optimized identity threshold p ∈ [0.5, 0.9].

Table 6: Comparison of the performance of iDeLUCS against the k-means++, DeLUCS and MeShClust v3.0 clustering algorithms on the dataset of simulated metagenomic reads from eight microbial genomes introduced by [12], using intrinsic cluster evaluation metrics (Davies-Bouldin Index, Silhouette coefficient), external evaluation metrics (Homogeneity, Completeness, unsupervised accuracy ACC), and time and memory. Boldface indicates the best result, (↑)/(↓) indicate that higher/lower is better. "MeShClust -auto" denotes MeShClust v3.0 run with the option of automatic identification of the identity threshold parameter, and "MeShClust -p" denotes MeShClust v3.0 run with a manually optimized identity threshold p ∈ [0.5, 0.9]. Note: GMM did not converge in this experiment.

Table 7: Comparison of the performance of iDeLUCS and iDeLUCS + HDBSCAN (iDeLUCS -auto) against the k-means++ and MeShClust v3.0 clustering algorithms on the synthetic datasets introduced by [4], using intrinsic cluster evaluation metrics (Davies-Bouldin Index, Silhouette coefficient), external evaluation metrics (Homogeneity, Completeness, unsupervised accuracy ACC), and time and memory. Boldface indicates the best result, where (↑) indicates that higher is better, and (↓) indicates that lower is better. "MeShClust -auto" denotes MeShClust v3.0 run with the option of automatic identification of the identity threshold parameter.

Table 8: Performance evaluation of iDeLUCS when running on a CPU cluster with 16 cores. We use intrinsic cluster evaluation metrics (Davies-Bouldin Index, Silhouette coefficient), external evaluation metrics (Homogeneity, Completeness, unsupervised accuracy ACC), and time and memory. Boldface indicates the best result, (↑)/(↓) indicate that higher/lower is better.