Mutate and observe: utilizing deep neural networks to investigate the impact of mutations on translation initiation

Abstract Motivation The primary regulatory step for protein synthesis is translation initiation, which makes it one of the fundamental steps in the central dogma of molecular biology. In recent years, a number of approaches relying on deep neural networks (DNNs) have demonstrated superb results for predicting translation initiation sites. These state-of-the-art results indicate that DNNs are indeed capable of learning complex features relevant to the process of translation. Unfortunately, most of the research efforts that employ DNNs only provide shallow insights into the decision-making processes of the trained models and lack highly sought-after novel biologically relevant observations. Results By improving upon the state-of-the-art DNNs and large-scale human genomic datasets in the area of translation initiation, we propose an innovative computational methodology to get neural networks to explain what was learned from data. Our methodology, which relies on in silico point mutations, reveals that DNNs trained for translation initiation site detection correctly identify well-established biological signals relevant to translation, including (i) the importance of the Kozak sequence, (ii) the damaging consequences of ATG mutations in the 5′-untranslated region, (iii) the detrimental effect of premature stop codons in the coding region, and (iv) the relative insignificance of cytosine mutations for translation. Furthermore, we delve deeper into the Beta-globin gene and investigate various mutations that lead to the Beta thalassemia disorder. Finally, we conclude our work by laying out a number of novel observations regarding mutations and translation initiation. Availability and implementation For data, models, and code, visit github.com/utkuozbulak/mutate-and-observe.

As described in the main text, we use TISRover, a carefully designed DNN for the task of TIS detection (see Figure 1) that is composed of four convolutional layers followed by a linear one. Our preliminary experiments with this model on the dataset described in the paper revealed that it overfits the training data very quickly, which leads to subpar results on the validation and test datasets. To overcome this challenge, as well as to allow experimentation with mutations, we distill the original model into a more lightweight one while incorporating base pair (bp) masking as a method of data augmentation. We detail our modifications to the original architecture and training methodology as follows:

• Convolutional kernels - Keeping the number of convolutional layers the same as in the TISRover architecture, we reduce the number of filters to the closest smaller power of 2: from 70 to 32 in the first, from 100 to 64 in the second, and from 150 to 128 in the third convolutional layer. We do not modify the convolutional kernel sizes, as we have observed this setup to be optimal for the data at hand. The reduction in the number of filters corresponds to a reduction of 36,440 trainable parameters, which is approximately half the trainable parameters of the convolutional layers in TISRover.

• Dropout positioning - Following [10] and [9], we remove the dropout layers from the convolutional stage of the model and instead apply dropout with a rate of 0.5 only once, before the final linear layer, similar to [9].

• Linear layers - We experiment with one, two, and three linear layers (also called dense or fully-connected layers) after flattening the convolutional features. We find that the two-layer setup yields the best results and therefore keep it unchanged compared to TISRover. However, we modify the number of neurons to be in line with the reductions in the number of convolutional filters. This operation reduces the number of trainable parameters in the linear layers from 462,338 to 98,690, a reduction of roughly 80%.

• Warm-up - Linear warm-up has proven beneficial for training in recent research efforts. We employ this technique to perform a single epoch of warm-up, during which the learning rate starts from 0 and increases linearly to the initial learning rate at each batch.

• Optimizer - We continue to use SGD as the optimizer and train our model for 50 epochs, but set the initial learning rate to 0.1 instead of 0.05 as in TISRover. Furthermore, following [5], instead of decaying the learning rate with a multiplier of 0.5 every 10 epochs, we decay it with a multiplier of 0.1 every 20 epochs.

• Class weights - We also experiment with class weights as a method to combat the data imbalance (a 1-to-26 positive-to-negative ratio) and find that their usage positively affects the training outcome, albeit by small margins. Although class weights do not significantly affect the results, we find that weighting each positive sequence 4 times as much as a negative one yields the best results.

• Batch size - Although we experimented with batch sizes in the range {32, 64, 128, 256} (with accordingly adjusted learning rates), we found that a batch size of 32 leads to the best results, similar to the original TISRover setup.

• Masking - To overcome the data imbalance, as well as to accommodate the subsequent experiments involving mutations, we employ base pair (bp) masking as a method of data augmentation. Given a sequence x composed of one-hot representations, we replace the bp at position m with the zero vector 0 = [0, 0, 0, 0], i.e., x̃_m = 0 and x̃_i = x_i for all i ≠ m. We evaluate a number of different scenarios for bp masking and find that randomized masking up to a certain number of positions (in this case, 40) improves the results the most. As such, during training, for each sequence we sample a random number k ∈ [0, 40] and mask the sequence at k different positions. Naturally, when k = 0, no bp is masked and the sequence is fed to the network as-is.
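The masking augmentation described above can be sketched as follows. This is a minimal illustration under our own assumptions (the function name and the NumPy-based (L, 4) one-hot layout are not taken from the released code):

```python
import numpy as np

def mask_sequence(x: np.ndarray, max_masks: int = 40, rng=None) -> np.ndarray:
    """Zero out up to `max_masks` randomly chosen bp positions of a
    one-hot encoded sequence x of shape (L, 4)."""
    rng = rng or np.random.default_rng()
    out = x.copy()
    k = int(rng.integers(0, max_masks + 1))  # k = 0 leaves the sequence as-is
    positions = rng.choice(out.shape[0], size=k, replace=False)
    out[positions] = 0.0  # replace each chosen one-hot bp with [0, 0, 0, 0]
    return out
```

Because k is resampled per sequence at every epoch, the network repeatedly sees differently masked variants of the same training example.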

B.1 Datasets and model performance
In Table 1, we provide additional details about the properties of datasets used in this work.
Human-Genome-M - Performance of our models on the newly created Human Genome-M dataset can be found in Table 2. As can be seen, on this newly curated dataset, our final model achieves an overall accuracy of .916 (.86 TPR and .98 TNR) when adopting the standard decision rule g(θ, x) ≥ .5. When using the best threshold calculated from an ROC curve, the accuracy further increases to .946 (.95 TPR and .94 TNR).
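The exact criterion behind the "best threshold" is not spelled out here; a common ROC-based choice is maximizing Youden's J statistic (TPR + TNR − 1), sketched below under that assumption (the function name is hypothetical):

```python
import numpy as np

def best_threshold(probs, labels):
    """Pick the decision threshold that maximizes TPR + TNR - 1
    (Youden's J) over all candidate thresholds on the ROC curve."""
    probs, labels = np.asarray(probs), np.asarray(labels)
    best_t, best_j = 0.5, -1.0
    for t in np.unique(probs):
        pred = probs >= t
        tpr = pred[labels == 1].mean()          # sensitivity on positives
        tnr = (~pred[labels == 0]).mean()       # specificity on negatives
        if tpr + tnr - 1 > best_j:
            best_j, best_t = tpr + tnr - 1, t
    return best_t
```

Other criteria (e.g., the point closest to the top-left ROC corner) would yield similar but not identical thresholds.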
Gencode - In order to maintain comparability with past research efforts, we used the dataset published in the work of [6], which dates from 2007. Naturally, this dataset lacks the most recent genomic sequences. In order to demonstrate the effectiveness of our model and to address this potential limitation, we employed the latest GENCODE dataset (GENCODE release 43, corresponding to GRCh38.p13), an extensive reference annotation resource for the human genome. In particular, we employed protein-coding transcript sequences from the most recent human reference chromosomes and reformatted the dataset by trimming the sequences to 203 nucleotides and relocating the start codon (ATG) to position 61, conforming to the input requirements of our model. We then removed all overlapping sequences between this dataset and our previous datasets (CCDS, Chromosome-21, Human Genome-M), which leaves us with a total of 73,590 novel sequences. We use this dataset to evaluate our models without any additional training, in order to observe whether our models can identify the most recently discovered transcript sequences.
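The trimming and relocation step can be sketched as follows; `format_transcript` is a hypothetical helper illustrating the 203-nt window (60 nt upstream + ATG + 140 nt downstream) described above:

```python
def format_transcript(seq: str, atg_index: int, up: int = 60, down: int = 140):
    """Trim a transcript to `up` nt upstream + ATG + `down` nt downstream
    (203 nt in total), so the start codon occupies positions 61-63 (1-based).
    Returns None when the transcript is too short on either side, or when
    no ATG sits at `atg_index` (0-based)."""
    start = atg_index - up
    end = atg_index + 3 + down
    if start < 0 or end > len(seq) or seq[atg_index:atg_index + 3] != "ATG":
        return None
    return seq[start:end]
```

Transcripts rejected by this check (too little flanking context) cannot be used with the fixed-size model input and are simply skipped.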
In Table 3, we provide results on the GENCODE dataset for our incrementally improved models. Note that the GENCODE dataset contains only TIS-positive samples obtained from the human genome; thus, the accuracy corresponds to ACC = TP/(TP + FN). We also experiment with thresholding, where the threshold is calculated based on the Human Genome-M dataset, and find that our final model can successfully identify 92.3% (without thresholding) and 98.5% (with thresholding) of all sequences in the GENCODE dataset as correctly belonging to the TIS-positive class, indicating the usefulness of the features learned by the model.

B.2 Generalization of Experimental Results
The bootstrap method (see [1]) is a resampling technique that is widely used in statistical analysis to estimate the uncertainty associated with a given estimate or model. It is a non-parametric approach that involves repeatedly resampling a dataset with replacement and using the resulting samples to estimate a statistic of interest. One of its key advantages is that it does not require any assumptions about the distribution of the data, which makes it useful in situations where traditional statistical methods may be inappropriate or too complex. This is especially relevant in the context of machine learning, where the data is often high-dimensional and the underlying distribution is unknown or non-normal.

Table 3: Accuracy of trained models on the GENCODE dataset. The threshold for the third column is calculated from the Human Genome-M dataset.
In order to assess the stability of our model and its generalization performance, which is crucial when dealing with real-world data, we employ the bootstrapping technique and train 50 additional models. The performance of the bootstrapped models is described in Table 4 and is generally comparable (though somewhat lower) to the performance of the model trained on the full dataset. This indicates that our model has good generalization performance and is likely to perform well on new data.
Note that the mean bootstrap accuracies are indeed somewhat lower than the accuracy of the full model. This is because the bootstrapped models are trained on subsets of the full dataset, obtained by sampling with replacement. Even though there are strategies to approximately correct this negative bias (see [1]), we chose not to employ them, as their appropriateness for deep neural networks may not be assured. Moreover, our conclusion regarding generalization performance is in fact sharpened by showing that the model continues to perform well even under slightly pessimistic assumptions.
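As a sketch of the underlying idea, the following shows resampling with replacement. Note that in our setup the resampling is applied to the training data (yielding 50 retrained models), whereas this toy function, with names of our own choosing, merely resamples (label, prediction) pairs to estimate the spread of an accuracy estimate:

```python
import numpy as np

def bootstrap_accuracy(y_true, y_pred, n_boot=50, rng=None):
    """Estimate the variability of an accuracy estimate by resampling
    (label, prediction) pairs with replacement `n_boot` times."""
    rng = rng or np.random.default_rng(0)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    accs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # a bootstrap resample of indices
        accs.append(np.mean(y_true[idx] == y_pred[idx]))
    return np.mean(accs), np.std(accs)
```

Resampling the training set instead of the evaluation pairs follows the same pattern: draw n indices with replacement and retrain on the resulting multiset.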

B.3 Prediction Confidence and Confidence Calibration
Confidence calibration refers to the ability of machine learning models, such as neural networks, to output predictions that are well-calibrated with their true probability of being correct. It is a critical aspect of model reliability, as it allows practitioners to assess the quality of model predictions and make informed decisions based on them. Moreover, confidence calibration is vital in detecting and mitigating model errors and biases. A poorly calibrated model may exhibit overconfidence in its predictions, leading to untrustworthy predictions. By ensuring proper calibration, machine learning practitioners can gain insights into the model's behavior and use this knowledge to refine the model and improve its performance.
In order to measure the calibration confidence of the models evaluated in this study, we employ three metrics: negative-log likelihood loss (NLL), estimated calibration error (ECE), and Brier score.
NLL - Negative log-likelihood, also known as cross-entropy loss, is a commonly used metric for evaluating the performance of a classification model. In the context of confidence calibration, negative log-likelihood measures the discrepancy between the predicted probabilities and the true probabilities of the events being classified. Given a probabilistic model f(Y|X), the NLL for N data points is defined as:

NLL = -(1/N) ∑_{i=1}^{N} log f(y_i | x_i).

ECE - Estimated calibration error (ECE) is a popular metric for assessing the calibration performance of a classification model. In the context of confidence calibration, ECE measures the difference between the predicted probabilities and the true probabilities of the events being classified, averaged over a set of confidence intervals. Following [4], we calculate ECE as follows:

ECE = ∑_{m=1}^{M} (|B_m| / N) |acc(B_m) - conf(B_m)|,

where the error is measured over M bins of confidence intervals, with B_m referring to the set of indices of samples whose prediction confidence falls into the interval ((m-1)/M, m/M], and acc(B_m) and conf(B_m) denoting the average accuracy and average confidence within bin B_m.

Brier score - The Brier score is a widely used metric for evaluating the performance of probabilistic predictions generated by a classification model. It is a proper scoring rule that measures the mean squared difference between the predicted probabilities and the true binary outcomes of the events being classified. Given probabilistic predictions g(θ, x_i) and labels y_i, the Brier score is calculated as follows:

BS = (1/N) ∑_{i=1}^{N} (g(θ, x_i) - y_i)².

Results on Confidence Calibration - In Table 5 and Table 6, we provide the calibration of the incremental models evaluated in the main text, measured with NLL, ECE, and Brier score. Apart from these results, we also experiment with a post-processing routine called temperature scaling [4] to probe the potential shortcomings of the models and to observe whether this post-processing step leads to dramatic improvements. As can be seen, all three metrics indicate small calibration errors (often < 0.1), showing that the models used have desirable properties in terms of calibration.
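The three metrics above can be computed as in the following sketch for the binary case, using equal-width confidence bins for ECE (the function name is our own):

```python
import numpy as np

def calibration_metrics(probs, labels, n_bins=10):
    """Compute NLL, ECE (with `n_bins` equal-width confidence bins), and
    the Brier score for binary predictions probs = P(y=1), labels in {0,1}."""
    probs, labels = np.asarray(probs, float), np.asarray(labels, int)
    eps = 1e-12  # numerical guard for log(0)
    nll = -np.mean(labels * np.log(probs + eps)
                   + (1 - labels) * np.log(1 - probs + eps))
    brier = np.mean((probs - labels) ** 2)
    conf = np.where(probs >= 0.5, probs, 1 - probs)  # confidence of predicted class
    correct = ((probs >= 0.5).astype(int) == labels)
    ece = 0.0
    for m in range(n_bins):
        lo, hi = m / n_bins, (m + 1) / n_bins
        mask = (conf > lo) & (conf <= hi)            # bin B_m: conf in (m-1/M, m/M]
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return nll, ece, brier
```

Well-calibrated, accurate predictions drive all three quantities toward zero, which is the regime reported in Table 5 and Table 6.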
Prediction Confidence - Apart from the calibration results discussed above, we also present the histogram of prediction confidences for correctly classified TIS-positive and TIS-negative samples, as well as for incorrect classifications, in Figure 2, Figure 3, and Figure 4 for Chromosome-21, Human Genome-M, and GENCODE, respectively. Prediction confidences on the aforementioned datasets are taken from the final model, on which we performed the mutations.
As can be seen, the majority of the TIS-positive and TIS-negative sequences in all three datasets are correctly classified with high confidence. Moreover, while misclassifications in the Human Genome-M dataset are spread roughly uniformly over confidence values, for the other two datasets misclassifications come with lower confidence values, indicating that the mistakes made by our model are, more often than not, made without overconfidence.

Table 5: Calibration properties of trained models on the Chromosome-21 dataset.
Table 6: Calibration properties of trained models on the Human Genome-M dataset.

C Discarding Unstable Sequences
In the main paper, we indicated that 505 sequences from the Human Genome-M dataset were not used for experiments involving mutations. Those sequences fall into two categories: (1) incorrectly classified sequences and (2) unstable sequences that are not suitable for mutations. In what follows, we explain how those sequences are discarded.
Discarding incorrectly classified sequences - The Human Genome-M dataset described in Section 2.1 of the main text contains 2,000 TIS-positive sequences that were left out of training in order to perform mutations on. Nonetheless, not all sequences in this dataset are suitable for experiments that involve mutations. Of the 2,000 sequences, 278 are incorrectly classified by the model as belonging to the TIS-negative class; we therefore discard them.
Discarding unstable sequences - In order to maintain the authenticity of experiments involving mutations, we take another step in identifying unsuitable sequences by investigating the validity of predictions when bps are masked one at a time. Our approach is quite straightforward: similar to mutations, we mask a single bp in the given sequence, one at a time, creating additional masked sequences (200 masked sequences for each sequence) using Eq. 3 in the main paper. If any of the 200 masked sequences has its prediction changed, we discard the (unmasked) sequence. Our rationale for this operation is simple: these sequences are already misclassified due to the masking of a single bp, before any mutations take place. As such, including them in experiments involving mutations may result in misleading observations when investigating the nature of mutations. Doing so, we remove 227 additional sequences. In Figure 5, we provide an example of this discarding operation: (left) a sequence whose prediction does not change when a single bp is masked and (right) a sequence whose prediction changes when a bp is masked. According to the discarding operation described in Section C, we retain the sequence on the left for experiments involving mutations while discarding the one on the right.

Figure 6: Frequency of various codons that can be mutated into a stop codon in the coding region, grouped by T, A, and G; original data taken from [3].
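The single-bp masking scan above can be sketched as follows; `predict` stands for the trained model's TIS-positive probability, and the assumption that the 3 start-codon positions are excluded (yielding 200 masked variants for a 203-nt input) is ours:

```python
import numpy as np

def is_stable(x, predict, threshold=0.5, skip=range(60, 63)):
    """Mask each bp of a one-hot (L, 4) sequence one at a time (zero vector)
    and check whether any single mask flips the model's decision.
    `predict` maps a (L, 4) array to a TIS-positive probability; positions in
    `skip` (assumed here to be the fixed start codon) are never masked."""
    base = predict(x) >= threshold
    for m in range(x.shape[0]):
        if m in skip:
            continue
        masked = x.copy()
        masked[m] = 0.0  # Eq. 3 in the main paper: zero out a single bp
        if (predict(masked) >= threshold) != base:
            return False  # one masked bp flips the prediction: unstable
    return True
```

Sequences for which this returns False are excluded from all mutation experiments.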

D Codon Frequency and the Impact of T Mutations
In the main paper, we noted that mutations in the first nt of each codon in the coding region were far more likely to cause a prediction change. Specifically, we observed that T mutations occurring in the first nt of codons were over-represented compared to A or G mutations (which occur in the second or the third nt). Upon detailed inspection, we noticed that this phenomenon is closely related to the frequency of codons that are readily mutable into stop codons in the coding region. In Figure 6, we provide the frequency of these codons (codon frequency per 1,000, obtained from [3]) and observe that the number of codons that are readily mutable into one of the stop codons via a T mutation is significantly higher than for the other two, which explains this phenomenon.
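The grouping of codons that are one substitution away from a stop codon can be reproduced with a short enumeration (names are our own); note that the over-representation reported above additionally depends on the codon-usage frequencies from [3], which this sketch does not include:

```python
from itertools import product

STOPS = {"TAA", "TGA", "TAG"}

def near_stop_codons():
    """Group non-stop codons that become a stop codon via a single
    substitution, keyed by the nucleotide introduced by the mutation.
    Only T, A, and G can appear as keys, since stop codons contain no C."""
    groups = {"T": set(), "A": set(), "G": set()}
    for codon in ("".join(p) for p in product("ACGT", repeat=3)):
        if codon in STOPS:
            continue
        for pos in range(3):
            for nt in "ACGT":
                if nt == codon[pos]:
                    continue  # not a substitution
                mutated = codon[:pos] + nt + codon[pos + 1:]
                if mutated in STOPS:
                    groups[nt].add(codon)
    return groups
```

Running this shows that introducing a T (always at the first codon position, since every stop codon starts with T) reaches a stop from 9 distinct codons, matching the grouping visualized in Figure 6.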

E Mutations to the Beta-globin Gene
In Section 4.2 of the main paper, we conducted a case study on the HBB gene (also known as the Beta-globin gene) in order to examine the properties of our model with regard to mutations on a particular gene. The HBB gene directs the production of a protein called Beta-globin, a component of hemoglobin, which makes it one of the fundamental building blocks of blood. Hemoglobin consists of four protein subunits, two of which are Beta-globin and the other two α-globin. Various mutations in the HBB gene may lead to a disease called Beta thalassemia, which can be characterized as a measurable deficiency of Beta-globin chains in hemoglobin. As a result, this disease may lead to severe complications such as anemia.

E.1 Pathological Mutations
The comprehensive work of [11] classified a variety of mutations that may lead to Beta thalassemia into three categories: mutations affecting transcription, mutations affecting RNA processing, and mutations affecting RNA translation. In order to evaluate the mutations detailed in [11], we take the functional HBB gene from the Human Gene Bank and format it according to the convention in our paper (60 nt upstream + ATG + 140 nt downstream). Naturally, we cannot evaluate all of those mutations due to limitations regarding the length of the sequence as well as positioning. The gene that we employ, after formatting, starts from the transcription initiation site and ends approximately around the middle of the second exon. Based on this sequence, we are able to perform 26 of the mutations detailed in [11].
Our model predicts that the functional HBB gene has a 95% likelihood of initiating translation. Based on this gene, we perform the aforementioned 26 mutations and display the results in Figure 7, which shows the reduction in the translation-initiation likelihood for the mutated sequences; we also display the origin as well as the precise position of each mutation. Note that, differently from the main text, we also conduct frame-shift, insertion, and deletion mutations that are documented in [11]. Similar to the results obtained in the main text, we observe that when mutations occur relatively far away from the start codon, their impact lessens. Of all the mutations evaluated, CAP+40 to +43 (-AAAC) and CAP+10 (-T) have the most significant effect, followed by a variety of other mutations. These observations confirm that our model identifies a decrease in the likelihood of translation initiation for the majority of the documented mutations, thus indicating the potential for gene-related issues. Nevertheless, not all known mutations that lead to Beta thalassemia lead to a TIS-negative prediction. This can be attributed to the fact that some of those mutations still lead to synthesized (albeit dysfunctional) proteins; in other words, translation initiation is not the issue. However, making a clear distinction between the correctness and incorrectness of predictions made under mutations is not a simple task, and it is a direction worthy of further exploration.

E.2 Benign single nucleotide polymorphisms
Above, we investigated a variety of HBB mutations that are known to carry pathological properties. While such mutations can cause disease or increase the risk of developing a certain condition, others are considered benign, meaning they do not have any significant impact on health. Among other things, this implies that they should not significantly affect translation initiation. Therefore, our goal in investigating these mutations is to observe whether our model's predictions change significantly or not.
dbSNP, also known as the NCBI database of genetic variation [8], is a comprehensive public database of common and rare genetic variations found in humans and other organisms. This database is maintained by the National Center for Biotechnology Information (NCBI) and serves as a valuable resource for researchers studying genetic variation and its implications for human health and disease. The database also includes information on the frequency and distribution of these variations in different populations, as well as their association with various health conditions. Using dbSNP (build 155), we investigate known benign mutations of the HBB gene and find 33 that fall within the specifications of our data (i.e., mutations within 140 bp of the translation initiation site).
Using the final model as well as the bootstrapped models, we provide predictions obtained with sequences that contain these mutations in Figure 9 and Figure 10. As can be seen, the predictions obtained with our models indicate that those mutations are not detrimental to translation initiation, confirming the validity of our observations from another angle.

E.3 Investigating Remaining Mutations
After performing the mutations documented in [11], we continue by evaluating all possible point mutations, similar to the experiment conducted in the main paper. Performing 600 point mutations on this gene, we identify 12 mutations that impact the likelihood of translation initiation, of which 4 are already documented in [11] (see Figure 8). Of the undocumented mutations, our model predicts that (HBB+25 A→T) is the most significant, with this mutation having an 86% likelihood of a TIS-negative prediction.
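The exhaustive scan can be sketched as follows; skipping the 3 start-codon positions of a 203-nt input yields exactly 200 × 3 = 600 single-nucleotide substitutions (`predict` is a stand-in for the trained model, and the function name is our own):

```python
def point_mutation_scan(seq, predict, skip=range(60, 63)):
    """Generate every single-nucleotide substitution of `seq` (a string over
    ACGT), skipping the positions in `skip` (assumed here to be the fixed
    start codon), and score each variant with `predict`.
    Returns (position, reference nt, alternative nt, score) tuples."""
    results = []
    for pos, ref in enumerate(seq):
        if pos in skip:
            continue
        for alt in "ACGT":
            if alt == ref:
                continue  # not a mutation
            mutated = seq[:pos] + alt + seq[pos + 1:]
            results.append((pos, ref, alt, predict(mutated)))
    return results
```

Sorting the returned tuples by score immediately surfaces the mutations with the largest predicted drop in translation-initiation likelihood.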

E.4 Reproducibility of Observations
Reproducibility is one of the most important principles in scientific research, and it is especially crucial in bioinformatics research where computational tools are employed in conjunction with genomic, transcriptomic, proteomic, and metabolomic data. In this context, bootstrapping is particularly useful since it provides a way to estimate the variability of the results and the uncertainty of the statistical measures. Furthermore, bootstrapping can also help to detect potential sources of bias and errors in the analysis. By comparing the results obtained from the resampled data sets, we can identify whether our findings are consistent across different subsets of the data or whether they are sensitive to particular outliers or subsets.
Recall from Section B.2 that 50 additional models were trained with bootstrapping. Now, we employ these models to determine whether the observations made with the final model can be replicated. Consequently, in Figure 7 and Figure 8, next to the predictions obtained from the final model, we provide the mean and the standard deviation of the predictions obtained from the 50 bootstrapped models. The data provided in Figure 7 are also visualized with boxplots in Figure 11. As can be seen, for the majority of the observations, the final model and the bootstrapped average align. However, it is crucial to note that in certain cases, the results from the final model are slightly different, due to the bias inherent in the (unadjusted) bootstrap method, as discussed in Section B.2.

F Mutations and the Reduction in the Likelihood of Translation Initiation
In Section 4.1 of the main paper, we quantified the impact of mutations on translation by measuring the change in the likelihood of translation initiation. In Figure 12, we provide a more detailed view of those changes. While Figure 12a shows the overall distribution of the change in likelihood, Figure 12b to Figure 12g provide a detailed overview for individual mutation categories. Note that different mutation categories have different distributions, indicating that not all mutations that cause a prediction change are the same in terms of the reduction in the translation-initiation likelihood. Overall, we observe that the mutations grouped in Figure 12c and Figure 12e have the highest impact.
• Mutations in the Kozak sequence - We present the mutations performed on the parts of the Kozak sequence that lie in the upstream region (uKozak mutations) as well as in the downstream region (dKozak mutations). As briefly shown in Figure 8 of the main paper, we observe that mutations occurring in the downstream part of the Kozak sequence have a higher impact on the likelihood of translation initiation than those occurring in the upstream Kozak region (see Figure 12b and Figure 12c).

• Upstream ATG mutations - We observe that mutations that create ATG triplets in the 5'UTR (uATG) have a greater effect on reducing the translation-initiation likelihood than other mutations occurring in the 5'UTR. This type of mutation also accounts for the majority of the impactful mutations in the upstream region.

• Downstream stop codon mutations - Mutations of this type create a stop codon (TAA, TGA, or TAG) in the downstream region, which may lead to premature termination of translation. We observe that mutations creating a stop codon in the downstream region have the most severe effect on translation initiation compared to all other mutations. Moreover, since the majority of impactful mutations are of this type, the distribution in Figure 12e and the overall distribution displayed in Figure 12a are extremely similar.

• Other mutations - Besides the aforementioned mutations, we also provide the distributions of prediction change for other mutations occurring in the upstream and the downstream regions in Figure 12f and Figure 12g, respectively. Note that these mutations are far less impactful on translation initiation than the others.

G Mutations in 5'UTR
In the main paper, we identified a select few non-ATG mutations in the 5'UTR that cause a TIS-positive sequence to be predicted as TIS-negative. In Figure 13 and Figure 14, we provide the frequency maps for the mutated bps and for the mutations themselves, respectively. In what follows, for those 149 non-ATG mutations in the 5'UTR, we list the original sequence, its mutated variant, and the predictions obtained from our model. Note that a number of original (unmutated) sequences appear multiple times, which means that multiple mutations lead to a prediction change for that particular sequence. In order to visualize the distribution of repetitions, we provide the histogram of sequence occurrences in Figure 15. We believe that the existence of these sequences indicates that certain sequences are inherently unstable.

G.1 List of Impactful Mutations in 5'UTR
For each gene, the first sequence is the original and the second is its mutated version, with the ATG denoting the translation initiation site.