Understanding the Early Evolutionary Stages of a Tandem Drosophilamelanogaster-Specific Gene Family: A Structural and Functional Population Study

Abstract Gene families underlie genetic innovation and phenotypic diversification. However, our understanding of the early genomic and functional evolution of tandemly arranged gene families remains incomplete as paralog sequence similarity hinders their accurate characterization. The Drosophila melanogaster-specific gene family Sdic is tandemly repeated and impacts sperm competition. We scrutinized Sdic in 20 geographically diverse populations using reference-quality genome assemblies, read-depth methodologies, and qPCR, finding that ∼90% of the individuals harbor 3–7 copies as well as evidence of population differentiation. In strains with reliable gene annotations, copy number variation (CNV) and differential transposable element insertions distinguish one structurally distinct version of the Sdic region per strain. All 31 annotated copies featured protein-coding potential and, based on the protein variant encoded, were categorized into 13 paratypes differing in their 3′ ends, with 3–5 paratypes coexisting in any strain examined. Despite widespread gene conversion, the only copy present in all strains has functionally diverged at both coding and regulatory levels under positive selection. Contrary to artificial tandem duplications of the Sdic region that resulted in increased male expression, CNV in cosmopolitan strains did not correlate with expression levels, likely as a result of differential genome modifier composition. Duplicating the region did not enhance sperm competitiveness, suggesting a fitness cost at high expression levels or a plateau effect. Beyond facilitating a minimally optimal expression level, Sdic CNV acts as a catalyst of protein and regulatory diversity, showcasing a possible evolutionary path recently formed tandem multigene families can follow toward long-term consolidation in eukaryotic genomes.

situ hybridization on mitotic chromosomes finding a single signal, which indicated that the clustering of the Sdic copies at two different sites of the X chromosome is an assembly artifact ( Fig. S2).

Benchmarking of CNVnator
First, we examined under which conditions CNVnator (Abyzov, et al. 2011) can provide reliable CN estimates given the complexity of the Sdic region, i.e. the presence of multiple copies with high sequence identity among themselves, as well as with their flanking single-copy parental genes, AnxB10 and sw. To this end, we used arguably the most reliable assemblies so far generated in D. melanogaster: GCA_000778455 (Berlin, et al. 2015) for ISO-1; and GCA_002300595.1 for A4 ). First, we generated a set of synthetic X chromosomes for the A4 and the ISO-1 strains in which different ad hoc modifications were implemented, i.e. deleting all but one Sdic copy and the parental genes. A separate synthetic X chromosome was generated for each Sdic copy in both strains, five from A4 and six from ISO-1, which were used as references for read-depth analysis. The average read-depth values obtained with the sequencing data of the A4 strain were 6.18 and 6.16 when using the A4 and the ISO-1 synthetic reference chromosomes, respectively. Rounding off the average between both values to the closest integer and subtracting one because of the contribution of the reads from the parental genes, the estimated number of Sdic copies in A4 is 5. Following the same rationale with the ISO-1 strain, the estimated number of copies was 6 (average read-depth values were 7.38 and 6.73 when using the A4 and the ISO-1 synthetic reference chromosomes, respectively). These estimated numbers are identical to the number of copies found by annotating the indicated assemblies. For 13 strains of the Drosophila Synthetic Population Resources (King, et al. 2012b) and OR-R, the average read depth values across the five and six reference genomes from A4 and ISO-1, respectively, were highly correlated (r 2 = 0.73, P < 0.0001; Fig. 2B; Table S3). Further, we also examined whether sequence coverage could be positively correlated with CN estimates, finding no evidence. Specifically, we parsed this association in two sets of strains, with the first including 70 datasets from the Global Diversity Lines (Grenier, et al. 2015), and the second including 63 datasets from a Zambian population (Lack, et al. 2016a). For the first set, r 2 = 0.0008 (P = 0.8198) and for the second r 2 = 0.0055 (P = 0.5661).

Calibration of qPCR assays
Our control experiments with sw confirmed our ability to discern between 1, 2, and 3 copies (Table S2; Fig. 3C). This variation in copy number for sw is associated with differences between males and females of the ISO-1 strain, males carrying 2 copies of endogenous sw as a result of an induced duplication of the region (2T and 4M; this work), males carrying 2 copies of a sw transgene on chromosome 2 upon making it homozygous (Aand E -; (Clifton, et al. 2017), and heterozygous females possessing 3 copies as a result of carrying one chromosome with the wildtype configuration for the Sdic region and another chromosome with its duplicated version (II and IV in Fig. 3B). For these same genotypes, the estimates about the number of copies of Sdic were also identical to the expectation (Fig. 3D): 6 and 12 copies for the males and females of the ISO-1 strain, respectively; 0 copies for the males that carry the deletion of the Sdic region (Aand E -; (Clifton, et al. 2017)); 12 copies for the males that carry the duplication of the Sdic region (4M; this work); and 6, 12, or 18 copies in particular progenies from controlled crosses involving w 1118 , 2T, and 4M (I-IV in Fig. 3B). The only exception to this good agreement was the estimate for the males from the duplication strain 2T, for which the qPCR estimate was of 12.5 copies instead of 12. Collectively, these results are consistent with a suitable ability to infer the number of Sdic copies through our qPCR assay at least between 0 and 18.

Patterns of nucleotide variation across the Sdic repeat
For the fraction of each Sdic repeat that corresponds to the Sdic transcriptional unit, the magnitude of within-strain pairwise sequence identity at the nucleotide level was very similar across the strains considered, with median sequence identity values ranging from 98.62% (B3) to 99.53% (B2); 99.44% when all 31 copies are considered jointly (Table S11). Nevertheless, nucleotide differences in the two exons most proximal to the Sdic stop codon result in notable differences at the amino acid level, impacting the length of the putatively encoded variants as previously documented in the ISO-1 strain (Clifton, et al. 2017). These variants varied by up to 29% in length (388-544 residues; Table S13). Further, and also within the Sdic transcriptional unit, there are 112 nt corresponding to the presumed Sdic promoter (Nurminsky, et al. 1998).
We found two additional promoter sequences in relation to the two previously documented (Clifton, et al. 2017). Both additional promoters show nucleotide differences at the same two sites already known to vary among previously delineated promoter sequences of Sdic (Clifton, et al. 2017).
We also examined the level of nucleotide differentiation at other sequence intervals that are part of the Sdic repeat, i.e. ~1,800 nt corresponding to a combination of non-deleted intervals of the canonical sequence of the TE Rt1c, and a ~850 nt portion corresponding to the presumed pseudogene AnxB10-like (Fig. 3A). We found a striking degree of conservation (number of base differences per site assuming a Jukes-Cantor substitution model; TE Rt1c, d = 0.005; AnxB10-like, d = 0.002; Sdic exonic sequence, d = 0.007). For AnxB10-like, we examined the possibility that this strong nucleotide conservation could actually reflect functional constraints contrary to previous reports (Yeh, et al. 2012a). By using an in-home pipeline that tracks small sequence motifs to differentiate expression between very similar duplicated sequences (Clifton, et al. 2017), we screened RNA-seq datasets corresponding to 29 biological conditions (Material and Methods; Table S12), finding no evidence of AnxB10-like expression. In the absence of evidence for functionality, the high-level sequence conservation for these intervals of the Sdic repeat might be suggestive of structural constraints.

Detecting positive selection across the Sdic repeat
Several approaches were used to determine the pattern of sequence evolution across the Sdic repeat taking into account the presence of both coding and noncoding sequences. The first method was used to test if positive selection occurred on Sdic protein-coding sequences (i.e. whether there is proportion of sites with an excess of nonsynonymous substitutions in relation to the expectation under a neutral model) for each branch of the phylogeny. In this model, the number of site classes with a particular nonsynonymous to synonymous rate ratio ( ) in each branch is not fixed but estimated using a small sample AIC. Then, a likelihood-ratio test (LRT) was used to compare the positive selection to the null model (classes with > 1 are not allowed), and the p-value for each branch was corrected for multiple testing using the Holm-Bonferroni correction (Holm 1979 (Benjamini and Hochberg 1995). Figure S1. Annotation of the Sdic region across 13 populations of the DSPR panel and the wild-type stock OR-R. The most reliable organization of the region at 19C1 on the X chromosome in the ISO-1 is provided as a reference (Clifton, et al. 2017). The region is depicted from centromere (Cen) to telomere (Tel), including the flanking genes sw and AnxB10 (grey filled arrows). Population names are color-coded based on the broad continental region where they were collected: green, Africa; red, Americas; and blue, Eurasia. The number of annotated Sdic copies in reference-quality genome assemblies (Chakraborty, et al. 2019;) is indicated in parentheses next to the name of the population. Arrows filled with vertical lines are partial copies. Sdic copies in the ISO-1 strain are named as reported (Clifton, et al. 2017). In the rest of populations, the copy identifiers are roman numerals according to their relative order from sw to AnxB10. In the most reliable genome assemblies, copies are color coded, and a lower character (a-m) added to their identifier, both indicating the associated paratype. The size of the TE insertions (solid boxes), as well as their location, are indicated. Ns, assembly gap. e, exon. The apostrophe in the case of A7_IIIa indicates a no longer coding exon, as the STOP codon is upstream of the TE insertion. The Sdic region was found unfragmented except for the strains from Bogota (A2) and Georgia (A6). In the case of A2, 6 full Sdic copies form two different clusters ~1.6 Mb apart on the X chromosome. The distal cluster harbors 3 of the copies, which are flanked by sw and AnxB10. In contrast, the proximal cluster is flanked by gap assemblies, which in turn are adjacent to TEs. Within this cluster, we found 3 full copies, another copy almost in its entirety, and the remnants of 2 other copies, which are separated by a cluster of TEs. In the case of A6, only two complete Sdic copies were found, upstream of which a 1,190 nt long fragment corresponding to the 3' end of either sw or an Sdic copy is present. This fragment is separated from other genes further upstream of the parental gene sw, such as obst-A, which is not found in the assembly due to an assembly gap.  between the engineered TEs P{XP}d03903 and PBac{WH}f02348, following the same mating scheme used to previously generate the deletion of the Sdic region (Yeh, et al. 2012b). The TE P{XP}d03903 is located in the intergenic region between the genes AnxB10 and Sdic1 while PBac{WH}f02348 is between sw and obst-A. Therefore, the actual duplication spans from the Sdic copy adjacent to AnxB10 to sw, inclusive. To discern which females, out of 174 obtained in G3, were actual carriers of the duplication of the Sdic region, we used two approaches. First, we visually inspected and PCR-screened the male progeny of 174 females, classifying each female as a duplication or non-duplication bearer based on the eye color of their male progeny.
Male progeny was PCR-screened through four controls that provided complementary information (Table S4). Once females carrying the duplication were identified, the mating scheme was continued to make the duplication homozygous. (B) Gene and TE molecular organization along the original TE-bearing chromosomes (top) and those resulting from an ectopic recombination event (bottom). As a duplication event results into a hybrid TE carrying only the 3' ends of the two TEs, two controls (amplicons 1 and 2 respectively in Table S4) were designed to confirm their presence in the PCR screening. Male progeny of these females should give rise to two amplicons (one per 3' end), which were multiplexed in the same PCR reaction. After separating the females presumably carrying the duplication of the Sdic region from those carrying its deletion, the females were subjected to two additional PCR controls (amplicons 3 and 4 respectively in Table S4). Amplicon 3 allows to confirm that the X chromosome under examination does not carry the deletion or the original chromosome carrying P{XP}d03903; the amplicon corresponding to the downstream end of the duplicated Sdic region should not be detected. Lastly, a fourth amplicon that corresponds to the hybrid TE that preserves the 5' ends of the two original chromosomes, and should only result from a deletion event, should not be observed either (Yeh, et al. 2012b). The combination from these four PCR controls designated 36 females out of the initial 174 as carriers of a duplicated Sdic region. Two of them (2T and 4M) were used in downstream analyses. (C) Chromosomal location of the duplicated Sdic region. An extremely intense, single in situ hybridization signal can be detected on the X chromosome of one of the duplication strains (4M), denoting a local duplication of the Sdic region, which is in good agreement with qPCR results.     (black) that align with Sdic, is shown at the bottom. y-axis, probability of breakpoint occurrence. Figure S9. Phylogenetic relationships among Sdic copies. The copies considered are the 31 from the strains A4, A5, A7, B1, B2, B3, and B6, plus the six copies from the reference strain ISO-1. Copy nomenclature is as in Fig. 1; also copies that belong to the same paratype are shaded according to the color code in that same figure. The phylogeny shown was inferred with RAxML 8.1.2 under a GTRGamma model of sequence evolution. The composites, i.e. the constructs generated with the alignable stretches of DNA sequence between Sdic and the parental genes sw and AnxB10, from each strain were also included in the analysis. The equivalent composite was generated for D. simulans according to the available information in FlyBase (Hu, et al. 2013). The percentage of replicate trees in which the associated copies clustered together in the bootstrap test (1,000 replicates) is shown next to the branches when higher than the cut-off value of 50. Copies representing paratype e, the ones for which is found the strongest support for the most recent action of positive selection, form a very distinctive clade. The same conclusion and a very similar overall tree topology are found when inferring the phylogeny of the copies under a best-fit substitution model.   (King, et al. 2012a).
* From the first nucleotide at the 5'UTR of sw to the last nucleotide at the 3'UTR of AnxB10. † Rounded-off average read-depth values. ‡ Rounded-off qPCR estimates derived from the difference between the amplicons sw-Sdic and sw only. In males unless specified (M, males; F, females). DSPR, Drosophila Synthetic Population Resource. * Each reference genome used carries one single Sdic copy (either from the ISO-1 or A4 strain) and lacks the two parental flanking genes sw and AnxB10. The number in the ID of the reference genome denotes the order of the Sdic copy in the original strain between the flanking genes, specifically from sw to AnxB10. For example, A4_5 refers to the fifth copy starting from sw, i.e. the copy adjacent to AnxB10 in the A4 strain. † The CN estimate is calculated as the rounded off average minus one due to the contribution of reads from sw and AnxB10 to the read-depth estimates obtained with CNVnator. † (Parks, et al. 2004). § Reverse complementary of the XP5'-primer when combined with the primer WH5'+ (Parks, et al. 2004). * (Yeh, et al. 2012b).  (Chakraborty, et al. 2019). For the local analysis, the values used are those in Table S7. Size corrected or uncorrected refer to the size of the region as interpolated from CNVnator estimates.
N50 refers to the length of the smallest contig, after ranking them from longest to smallest, such that the sum of the contig lengths up to it spans 50% of the total assembly size. NR50s refers to the median read length above which half of the total coverage is contained.     (Grenier, et al. 2015). ZM and ZI, Zambia (Lack, et al. 2016b). † Each reference genome used carries one single Sdic copy from the A4 strain and lacks the two parental flanking genes sw and AnxB10.

30
The number in the ID of the reference genome denotes the order of the Sdic copy in the original strain between the flanking genes, specifically from sw to AnxB10. For example, A4_5 refers to the fifth copy starting from sw, i.e. the copy adjacent to AnxB10 in the A4 strain.
Missing values denote not considered values because their associated target size felt outside the expected range, i.e. 7.2-8.0 kb. ‡ The CN estimate is calculated as the rounded off average minus one due to the contribution of reads from sw and AnxB10 to the read-depth estimates obtained with CNVnator.  (Grenier, et al. 2015).
33 The Jukes-Cantor substitution model was assumed to calculate the level of differentiation between the sequences of each strain in MEGA X (Kumar, et al. 2018). All positions containing gaps and missing data were eliminated (complete deletion option). The strain ID is highlighted in yellow.    Fig. S8.
† The branch number in Hyphy and its associated node and subtree are relative to the PARTITION SPECIFIC gene tree.