Investigating alignment-free machine learning methods for HIV-1 subtype classification

Abstract

Motivation: Many viruses are organized into taxonomies of subtypes based on their genetic similarities. For human immunodeficiency virus 1 (HIV-1), subtype classification plays a crucial role in infection management. Sequence alignment-based methods for subtype classification are impractical for large datasets because they are costly and time-consuming. Alignment-free methods instead create numerical representations of genetic sequences and apply statistical or machine learning methods to them. Despite their high overall accuracy, existing models perform poorly on less common subtypes. Furthermore, there is limited work investigating the impact of sequence vectorization methods, in particular natural language-inspired embedding methods, on HIV-1 subtype classification.

Results: We present a comprehensive analysis of sequence vectorization methods across machine learning methods. We report a k-mer-based XGBoost model with a balanced accuracy of 0.84, indicating good overall performance on both common and uncommon HIV-1 subtypes. We also report a Word2Vec-based support vector machine that achieves promising precision and balanced accuracy. Our study sheds light on the effect of sequence vectorization methods on HIV-1 subtype classification and suggests that natural language-inspired encoding methods show promise. Our results could inform improved HIV-1 subtype classification methods, leading to better individual patient outcomes and the development of subtype-specific treatments.

Availability and implementation: Source code is available at https://www.github.com/kwade4/HIV_Subtypes

$$\hat{\beta} = \underset{\beta}{\arg\min}\; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{p} X_{ij}\,\beta_j \right)^{2} + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \qquad (3)$$

where $y_i$ represents the target variable, $X_{ij}$ are the feature variables, $\beta_j$ are the coefficients, $n$ is the number of samples, $p$ is the number of features, and $\lambda$ is the regularization parameter. During the training process, LASSO reduces the coefficients of less important features to zero.
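As a minimal sketch of this shrinkage behaviour, the following scikit-learn example fits a Lasso model to placeholder data; `alpha` plays the role of $\lambda$ in Eq. (3), and the data and parameter values are illustrative assumptions rather than the settings used in our experiments.

```python
# Minimal LASSO sketch on placeholder data (not the paper's dataset).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.random((100, 20))          # 100 samples, 20 features (placeholder)
y = rng.random(100)                # continuous target (placeholder)

# alpha corresponds to the regularization parameter lambda in Eq. (3)
model = Lasso(alpha=0.1)
model.fit(X, y)

# Coefficients of less informative features are shrunk exactly to zero
print("non-zero coefficients:", np.sum(model.coef_ != 0))
```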

Section 1.4: Naïve Bayes
At their core, Naïve Bayes classifiers compute the posterior probability of each class based on the input features using the formula:

$$P(C_k \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid C_k)\, P(C_k)}{P(\mathbf{x})}$$

where $P(C_k \mid \mathbf{x})$ denotes the probability of class $k$ given a feature vector $\mathbf{x}$, $P(\mathbf{x} \mid C_k)$ is the likelihood of observing the features $\mathbf{x}$ in class $k$, $P(C_k)$ is the prior probability of class $k$, and $P(\mathbf{x})$ is the overall probability of the features.
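For illustration, here is a minimal Gaussian Naïve Bayes sketch with scikit-learn, assuming placeholder numeric encodings rather than our actual feature vectors; `predict_proba` exposes the posterior $P(C_k \mid \mathbf{x})$ described above.

```python
# Minimal Gaussian Naive Bayes sketch on placeholder data.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.random((100, 20))          # placeholder sequence encodings
y = rng.integers(0, 3, size=100)   # three placeholder classes

clf = GaussianNB()
clf.fit(X, y)

# predict_proba returns the posterior P(C_k | x) for each class k
posteriors = clf.predict_proba(X[:1])
print(posteriors, "predicted class:", clf.predict(X[:1]))
```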

Section 1.5: K-Nearest Neighbours
The K-Nearest Neighbours (KNN) algorithm is widely used in multi-class classification tasks. The core of the KNN model involves classifying each data point based on the majority label of its closest neighbours in the feature space. KNN has two key parameters: the number of neighbours (K) and the distance metric used for identifying neighbours. To classify a query point, the model identifies its K nearest neighbours according to the distance metric, and the class that appears most frequently within this subset is assigned to the point. In our work, we consider two distance metrics, Euclidean and Manhattan, which are defined after the sketch below.
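The following minimal KNN sketch uses scikit-learn and shows both distance metrics; K = 5 and the random data are placeholder assumptions, not values from our experiments.

```python
# Minimal KNN sketch comparing the two distance metrics on placeholder data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = rng.integers(0, 3, size=100)

for metric in ("euclidean", "manhattan"):
    clf = KNeighborsClassifier(n_neighbors=5, metric=metric)
    clf.fit(X, y)                   # KNN simply stores the training set
    # each query point takes the majority label of its 5 nearest neighbours
    print(metric, clf.predict(X[:3]))
```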
The Euclidean distance, also known as the straight-line distance, between two points $(x_1, y_1)$ and $(x_2, y_2)$ is calculated as:

$$d_E = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$

The Manhattan distance, also known as the city block distance, between two points $(x_1, y_1)$ and $(x_2, y_2)$ is calculated as:

$$d_M = \lvert x_2 - x_1 \rvert + \lvert y_2 - y_1 \rvert$$

Section 1.7: Support Vector Machines
Support Vector Machines (SVMs) are commonly used for classification tasks and involve finding a hyperplane that best separates classes in feature space. Mathematically, the decision function for an SVM in the binary classification setting is given by:

$$f(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{i=1}^{N} \alpha_i\, y_i\, K(\mathbf{x}_i, \mathbf{x}) + b \right)$$

where $\mathbf{x}$ is the input feature vector, $\mathbf{x}_i$ are the support vectors, $y_i$ are the labels of the support vectors, $\alpha_i$ are the learned weights, $K$ is the kernel function, and $b$ is the bias.
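Below is a minimal SVM sketch using scikit-learn's SVC; the RBF kernel and C value are illustrative assumptions, not the configuration tuned in this study.

```python
# Minimal SVM sketch on placeholder data (kernel and C are assumptions).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = rng.integers(0, 3, size=100)

clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)

# The decision function is a kernel-weighted sum over the support vectors
# plus a bias, as in the formula above (one-vs-one for multi-class).
print("support vectors per class:", clf.n_support_)
print(clf.predict(X[:3]))
```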

Section 1.8: One-dimensional Convolutional Neural Networks
One-dimensional Convolutional Neural Networks (1D-CNNs) have shown success for tasks involving sequential data, such as genetic sequences. Our 1D-CNN architecture is constructed using the Keras framework and begins with a 1D convolutional layer with a specified number of filters and kernel size. Each filter in this layer performs convolution operations on the input sequence, which can effectively capture local dependencies. The convolution operation is mathematically represented as:

$$S(i) = (I * W)(i) = \sum_{m} I(i + m)\, W(m)$$

where $S(i)$ is the output feature map, $I$ is the input sequence, and $W$ is the weight of the filter. Following the convolutional layer, a max-pooling layer with a pool size of 2 is used to reduce the dimensionality of the data, enhancing the network's ability to generalize and reducing the computational load. The network then flattens the pooled features and passes them through a dense layer with a specified number of units, each employing a ReLU activation function for non-linearity. The final layer is a softmax layer, which outputs the probability distribution across the HIV-1 subtypes.
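A minimal Keras sketch of the architecture described above follows; the sequence length, input channel count, filter number, kernel size, dense units, and number of subtypes are placeholder assumptions rather than the tuned hyperparameters from our experiments.

```python
# Minimal Keras sketch of the 1D-CNN architecture (dimensions are placeholders).
from tensorflow import keras
from tensorflow.keras import layers

SEQ_LEN, N_CHANNELS, N_SUBTYPES = 1000, 4, 12   # hypothetical dimensions

model = keras.Sequential([
    layers.Input(shape=(SEQ_LEN, N_CHANNELS)),
    layers.Conv1D(filters=64, kernel_size=7, activation="relu"),  # local motifs
    layers.MaxPooling1D(pool_size=2),   # halve the temporal dimension
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(N_SUBTYPES, activation="softmax"),  # subtype probabilities
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```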

Table 2: Feature Dimensionality Before and After PCA

Table 3: Results of PCA Ablation Study Using k-mer and Sub-sequence Natural Vector Encodings.

Table 4: Performance of Word2Vec-Based Encoding Techniques Across Machine Learning Models with Varying k-mer Token and Vector Sizes.