Domain Expansion and Functional Diversification in Vertebrate Reproductive Proteins

Abstract The rapid evolution of fertilization proteins has generated remarkable diversity in molecular structure and function. Glycoproteins of vertebrate egg coats contain multiple zona pellucida (ZP)-N domains (1–6 copies) that facilitate multiple reproductive functions, including species-specific sperm recognition. In this report, we integrate phylogenetics and machine learning to investigate how ZP-N domains diversify in structure and function. The most C-terminal ZP-N domain of each paralog is associated with another domain type (ZP-C), which together form a “ZP module.” All modular ZP-N domains are phylogenetically distinct from nonmodular or free ZP-N domains. Machine learning–based classification identifies eight residues that form a stabilizing network in modular ZP-N domains that is absent in free domains. Positive selection is identified in some free ZP-N domains. Our findings support that strong purifying selection has conserved an essential structural core in modular ZP-N domains, with the relaxation of this structural constraint allowing free N-terminal domains to functionally diversify.


Multiple Sequence Alignment
Sequences for multiple ZP-N containing proteins were curated from the Ensembl database (release 104) (Howe et al. 2021). Sequences were preliminarily labelled as one of the ZP genes of interest based on PSI-BLAST e-value scores (Altschul et al. 1997). Sets of orthologous genes were aligned with MAFFT (Katoh and Standley 2013) and then trimmed to individual ZP-N domains. Groups of orthologous ZP-N domains were deemed "orthogroups". Sequences with ambiguous characters were removed, and then sets of orthologous ZP-N sequences were realigned with MAFFT. A full multiple sequence alignment was generated by concatenating orthogroup alignments together using a representative paralog alignment:individual representative sequences were selected from each orthgroup, and aligned using the structural based PROMALS tool (Pei et al. 2008). This approach was used because of the low sequence identity, but high structural similarity between paralogous Z-N domains. A custom script was used to algorithmically add gaps to orthogroup alignments to form a full multiple sequence alignment. For phylogenetics, CD-Hit was used to remove highly cluster highly similar sequences (>90% identity) (Li and Godzik 2006;Fu et al. 2012), in order to improve computing speed, and because this study was not concerned with very recent evolutionary splits. A full dataset was used for machine learning training because those methods are less computationally strained by large alignments and can gain greater sensitivity with a high depth of taxonomic sampling.

Phylogenetics
Maximum likelihood phylogenies were built using RAxML-NG(Kozlov et al. 2019), and multiple different amino acid substitution matrices were tested (LG+G, JTT+G,WAG+G), to evaluate the robustness of the deepest phylogenetic divide. The maximum likelihood tree was selected from 100 replicate runs using different starting trees. Nodal support was calculated with transfer bootstrap expectation (Lemoine et al. 2018), a modified form of bootstrapping that is more effective at detecting deep phylogenetic relationships in datasets with large number of taxa. Sequence labels were initially based on BLAST results but later refined based on phylogenetic clustering, (e.g. ZP1-N1, ZP2-N1, ZP4-N1).

Machine Learning
A basic machine learning algorithm using mean squared regression and regularization was coded in Python to distinguish the two free and modular groups of ZP-N domains. Logistic regression models are well suited for these classifications, because their outputs are bounded between 0 and 1, which can be interpreted as probability that a given domain is modular (Bewick et al. 2005). The multiple sequence alignment was identical to that used for phylogenetic analysis. The alignment was split into a testing (25%) and training set (75%), and logistic regression modelling with cross-validation was performed on the training set using fiveway cross validation. The final model scores were based on performance in the testing dataset.
For machine learning analysis, aligned ZP-N sequences were one-hot encoded: each position in the sequence was converted into a vector of twenty digits, corresponding to the twenty amino acids. The value was set to 1 for the entry in the vector corresponding to that residue, and all other values are set to 0. Gapped sites were set to a vector of twenty 0's. Thus, the classifier was trained using (1+20n) features (there is an additional intercept term), where n is the alignment length. Each of these features has a parameter associated with it and the value of the parameter indicates how informative that feature is, and whether it supports a modular ZP-N or free ZP-N classification. There are a large number of possible parameters in this model (9321 including the intercept), ,which introduces a risk for "overfitting"(Hawkins 2004), and motivates our regularization strategies.
To determine the minimal number of highly informative parameters, elastic net regularization was employed to penalize overparameterization and reduce overfitting (Zou and Hastie 2005). In our sci-kit learn implementation (Pedregosa et al. 2011), both the strength of regularization and the L1/L2 penalty ratio between the two penalty types were optimized by grid search. The highest scoring model was identified according to the negative mean-squared error scoring metric. In order to choose a suitable sparse model (i.e. fewest non-zero parameters), we adapted the one standard error rule common in machine learning (Hastie et al. 2009), where the sparsest model that is still within one standard error of the highest scoring model is selected. For this analysis we used 95% confidence intervals (~1.96 standard errors), to identify the sparsest model (fewest non-zero parameters) that is not statistically different from the highest scoring model sampled. Raw parameter values were plotted in the style of sequence LOGO plots (Schneider and Stephens 1990). The sum of the raw parameter values for matching amino acids in the alignment (and the intercept term), are equivalent to the log odds that a given sequence is classified as modular. For simplicity, each parameter is described as the log odds associated with a particular residue. In addition to the initial binary classification (free vs modular), our analysis was repeated using a three-way multiclassification (first N-terminal, internal, and modular). This procedure used alignments, hyperparameter grid searching, and regularization strategies in the same manner as the binary classification.

Sequence Divergence and Positive Selection Analyses
Our analyses of sequence divergence and positive selection was performed on a set of Boreutherian mammals., and we used the mammalian ZP-N domains coming from zp1, zp2, zp3, zp4, umod, tecta, and cuzd1. Boreoeutherian sequences were mined from Ensembl (Howe et al. 2021), and were included in these analyses if they were present in 10 or more of these ZP-N domain orthogroups. Phylogenetic distances both within and between orthogroups were calculated in MEGA using Poisson estimation with a gamma distribution of variation between sites (Kumar et al. 2016;Kumar et al. 2018).
Evidence of positive selection was measured using PAML analyses (Yang et al. 2005;Yang 2007) on the same sets of ZP-N domains from the sequence divergence estimation. A likelihood ratio test between a model allowing positive selection (M8) and a neutral model (M8a), was used to determine which domains showed evidence of positive selection. Likelihood ratio tests were performed by comparing M8 and M8a, using a chi-squared distribution with one degree of freedom. We also performed a Benjamini-Hochberg p-value correction to account for multiple testing (Benjamini and Hochberg 1995). Positively selected sites were visualized on a published crystal structure (ZP2-N1 ) (Raj et al. 2017), or the alpha-fold predicted structure (Jumper et al. 2021 Jul 15) when this did not exist (ZP2-N2). Sites were labelled if they had a posterior probability of being positively selected > 75% according Bayes Empirical Bayesian (BEB) analysis.

Visualization and Other methods
When protein structures were not available Alpha-Fold2 tertiary structure prediction was used (Jumper et al. 2021), and three-dimensional protein structures were visualized using either pymol (Schrödinger 2015) or ChimeraX (Pettersen et al. 2004). Docking simulations of homodimerization for ZP2-N1 and ZP3-N were performed using Rosetta 3.5 (Chaudhury and Gray 2008;Sircar et al. 2010). Briefly, each template structure was energy minimized in Rosetta using the relax function, each structure was duplicated, aligned to the dimeric ZP-N structure of uromodulin (PDB 4wrn), 10000 independent docking simulations performed, and interface scores analyzed for the top 5% lowest energy structures.