Predicting evolutionary targets and parameters of gene deletion from expression data

Abstract

Motivation: Gene deletion is traditionally thought of as a nonadaptive process that removes functional redundancy from genomes, such that it generally receives less attention than duplication in evolutionary turnover studies. Yet, mounting evidence suggests that deletion may promote adaptation via the “less-is-more” evolutionary hypothesis, as it often targets genes harboring unique sequences, expression profiles, and molecular functions. Hence, predicting the relative prevalence of redundant and unique functions among genes targeted by deletion, as well as the parameters underlying their evolution, can shed light on the role of gene deletion in adaptation.

Results: Here, we present CLOUDe, a suite of machine learning methods for predicting evolutionary targets of gene deletion events from expression data. Specifically, CLOUDe models expression evolution as an Ornstein–Uhlenbeck process, and uses multi-layer neural network, extreme gradient boosting, random forest, and support vector machine architectures to predict whether deleted genes are “redundant” or “unique”, as well as several parameters underlying their evolution. We show that CLOUDe boasts high power and accuracy in differentiating between classes, and high accuracy and precision in estimating evolutionary parameters, with optimal performance achieved by its neural network architecture. Application of CLOUDe to empirical data from Drosophila suggests that deletion primarily targets genes with unique functions, with further analysis showing these functions to be enriched for protein deubiquitination. Thus, CLOUDe represents a key advance in learning about the role of gene deletion in functional evolution and adaptation.

Availability and implementation: CLOUDe is freely available on GitHub (https://github.com/anddssan/CLOUDe).


Figure S1 | Classification performance of the CLOUDe NN trained on unbalanced "redundant-skewed" and "unique-skewed" datasets. The "redundant-skewed" training set consisted of 16,000 observations of the "redundant" class and 4,000 observations of the "unique" class, whereas the "unique-skewed" training set consisted of 4,000 observations of the "redundant" class and 16,000 observations of the "unique" class. (A) Receiver operating characteristic curves zoomed in to show false positive rates ≤ 25% and true positive rates ≥ 75%. (B) Confusion matrices depicting classification rates for the two classes.
Heatmap generated from log-transformed expression values of D, S, and L genes for each deletion event. Labels eX1, eX2, eX3, eX4, eX5, and eX6 represent X = D, S, or L gene expression in the carcass, female head, ovary, male head, testis, and accessory gland, respectively. Rows represent triplets of D, S, and L genes for each of the 100 deletion events.

Figure S2 | Classification power of CLOUDe when varying degrees of Gaussian noise are added to simulated test data. Receiver operating characteristic curves zoomed in to show false positive rates ≤ 25% and true positive rates ≥ 75%. Noise with mean zero and standard deviation (sd) ∈ {0.001, 0.01, 0.1, 1} was added to simulated expression values for D, S, and L genes across the six conditions. Given that expression values are log-transformed here, adding noise with sd equal to 1 effectively represents adding error of a different order of magnitude to the raw, non-log-transformed expression values. Rejection sampling was applied, and final expression values (with added noise) that were higher than the maximum empirical expression value were rejected.
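The noise procedure described in this caption can be illustrated with a minimal sketch. It assumes that rejection is applied per value, by redrawing the noise until the noisy value no longer exceeds the maximum; the caption does not say whether rejection was per value or per observation. The function name `add_noise_with_rejection` and the constant `MAX_EMPIRICAL` are hypothetical, and the maximum used here is a placeholder rather than the empirical value from the study.

```python
import random

def add_noise_with_rejection(values, sd, max_value, rng, max_tries=10_000):
    """Add N(0, sd) noise to each log-expression value, redrawing the noise
    whenever the noisy value exceeds max_value (assumed rejection rule)."""
    noisy = []
    for v in values:
        for _ in range(max_tries):
            candidate = v + rng.gauss(0.0, sd)
            if candidate <= max_value:
                noisy.append(candidate)
                break
        else:
            raise RuntimeError("no acceptable noisy value found")
    return noisy

rng = random.Random(0)
MAX_EMPIRICAL = 5.0  # hypothetical placeholder, not the value from the study
# One hypothetical observation: D, S, and L genes in 6 conditions = 18 values.
simulated = [rng.uniform(0.0, 4.0) for _ in range(18)]
noisy_by_sd = {sd: add_noise_with_rejection(simulated, sd, MAX_EMPIRICAL, rng)
               for sd in (0.001, 0.01, 0.1, 1)}
```

Because the values are on a log scale, the largest noise level perturbs the raw expression values multiplicatively, which is why it is the most demanding test of the classifiers.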

Figure S3 | Classification power of CLOUDe when an alternative evolutionary scenario is considered. (A) In this scenario, the expression optima of the duplicate genes (θ₁ and θ₂) diverged from that of the ancestral gene (θ₀). (B) Receiver operating characteristic curves across the full range of false positive rates (left) and zoomed in to show false positive rates ≤ 25% and true positive rates ≥ 75% (right).

Figure S4 | Classification performance of the four optimal models of CLOUDe (Table S5) applied to balanced test data simulated under ranges for α ∈ [, ] and σ² ∈ [−, −]. (A) Power curves zoomed in at power ≥ 75%, in which each datapoint represents the true positive rate at a 5% false positive rate for a pair of ranges for α and σ². (B) Accuracy curves zoomed in at accuracy ≥ 82%, in which each datapoint represents the accuracy for a pair of ranges for α and σ².
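Power is defined here as the true positive rate at a 5% false positive rate. The sketch below shows one way such a value could be computed from classifier scores. It assumes that higher scores indicate the positive class and that labels are coded 1 (positive) and 0 (negative); which class is treated as positive is not stated in the caption, and the function name `power_at_fpr` is hypothetical.

```python
def power_at_fpr(scores, labels, max_fpr=0.05):
    """True positive rate at the most permissive score threshold whose
    false positive rate does not exceed max_fpr.

    scores: classifier scores, higher = more likely positive (assumption).
    labels: 1 for the positive class, 0 for the negative class (assumption).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best = 0.0
    for t in sorted(set(scores)):
        # An observation is called positive when its score is >= t.
        fpr = sum(s >= t for s in neg) / len(neg)
        if fpr <= max_fpr:
            tpr = sum(s >= t for s in pos) / len(pos)
            best = max(best, tpr)
    return best

# Toy, balanced example with integer scores (not data from the study).
neg_scores = list(range(100))      # 0..99
pos_scores = list(range(50, 150))  # 50..149
scores = neg_scores + pos_scores
labels = [0] * 100 + [1] * 100
power = power_at_fpr(scores, labels)  # threshold 95: FPR = 5/100, TPR = 55/100
```

Fixing the false positive rate makes power comparable across models and parameter ranges, which is what the curves in panel (A) rely on.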

Figure S6 | Parameter prediction performance of the four optimal models of CLOUDe (Table S5) applied to data simulated under ranges for α ∈ [, ] and σ² ∈ [−, −]. Each datapoint represents the mean squared error of a parameter estimate (row) for each pair of α (column) and σ² (x-axis) across the six conditions.

Figure S7 | Parameter prediction performance of the four optimal models of CLOUDe (Table S5) applied to data simulated under ranges for α ∈ [, ] and σ² ∈ [−, −]. Each datapoint represents the mean squared error of a parameter estimate (row) for each pair of α (column) and σ² (x-axis) across the six conditions.

Figure S8 | Shapley analysis of the NN classifier on the simulated training dataset. Points represent Shapley importance values for each feature.

Figure S9 |

Figure S10 | Distributions of gene expression values for 100 Drosophila deletions, as classified by the CLOUDe NN. Log-transformed expression values of D, S, and L genes are plotted across the six conditions.


Figure S5 | Confusion matrices depicting classification rates of the LRT for balanced data simulated under specific parameter ranges for α and σ². This method is skewed toward predicting the "redundant" class when α is large and σ² is small.

Table S1 | Performance of the four optimal models of CLOUDe (Table S5) applied to data simulated under ranges for α ∈ [, ] and σ² ∈ [, ]. Due to the implemented rejection sampling step for data generation, and the ranges of log-transformed expression values in the empirical data, the Ornstein–Uhlenbeck model of CLOUDe is conservative in finding acceptable simulated expression values for many combinations of log₁₀(α) ∈ [3, 10] and log₁₀(σ²) ∈ [3, 10], effectively failing to generate simulated data when log₁₀(σ²) is three or more orders of magnitude higher than log₁₀(α). Because it is not possible to explore the entire grid of values for log₁₀(α) and log₁₀(σ²), here we present only results for the combinations from which data were generated. Power represents the true positive rate at a 5% false positive rate. MSE1, MSE2, and MSEsv denote mean squared errors in predicting θ₁, θ₂, and the log-transformed stationary variance, respectively.
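For reference, a standard Ornstein–Uhlenbeck process with selection strength α and diffusion variance σ² has stationary variance σ²/(2α). The sketch below computes its base-10 logarithm from log₁₀(α) and log₁₀(σ²). It assumes that this standard formula and a base-10 transform are what the caption means by the log-transformed stationary variance, which the caption itself does not confirm; the function name is hypothetical.

```python
import math

def log10_stationary_variance(log10_alpha, log10_sigma2):
    """log10 of sigma^2 / (2 * alpha), the stationary variance of a standard
    Ornstein-Uhlenbeck process (assumed to match the quantity in the caption)."""
    return log10_sigma2 - log10_alpha - math.log10(2.0)

# When log10(sigma^2) exceeds log10(alpha) by 3, the stationary variance is
# 10^3 / 2 = 500 on the log-expression scale.
example = log10_stationary_variance(3.0, 6.0)
```

Under this reading, a σ² that is several orders of magnitude larger than α implies a very wide stationary distribution, so simulated values would often exceed the empirical maximum and be rejected. This is offered as a plausible interpretation of the failed combinations, not as a statement from the source.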

Table S3 | Statistically significant results of DAVID analysis for "redundant" genes.
* After Benjamini-Hochberg procedure.

Table S4 | Statistically significant results of DAVID analysis for "unique" genes.
* After Benjamini-Hochberg procedure.