Dissecting the Effects of Selection and Mutation on Genetic Diversity in Three Wood White (Leptidea) Butterfly Species

Abstract The relative role of natural selection and genetic drift in evolution is a major topic of debate in evolutionary biology. Most knowledge spring from a small group of organisms and originate from before it was possible to generate genome-wide data on genetic variation. Hence, it is necessary to extend to a larger number of taxonomic groups, descriptive and hypothesis-based research aiming at understanding the proximate and ultimate mechanisms underlying both levels of genetic polymorphism and the efficiency of natural selection. In this study, we used data from 60 whole-genome resequenced individuals of three cryptic butterfly species (Leptidea sp.), together with novel gene annotation information and population recombination data. We characterized the overall prevalence of natural selection and investigated the effects of mutation and linked selection on regional variation in nucleotide diversity. Our analyses showed that genome-wide diversity and rate of adaptive substitutions were comparatively low, whereas nonsynonymous to synonymous polymorphism and substitution levels were comparatively high in Leptidea, suggesting small long-term effective population sizes. Still, negative selection on linked sites (background selection) has resulted in reduced nucleotide diversity in regions with relatively high gene density and low recombination rate. We also found a significant effect of mutation rate variation on levels of polymorphism. Finally, there were considerable population differences in levels of genetic diversity and pervasiveness of selection against slightly deleterious alleles, in line with expectations from differences in estimated effective population sizes.

The Maker package was run twice; the first time with RNA-seq and protein evidence data, that were used to train the prediction algorithm for subsequent ab-initio prediction, and a second time with both RNA-seq and protein evidence data and the ab-initio gene profiles. The RNAseq libraries were first de-novo assembled using Trinity and the assembly was processed with an in-house developed pipeline (available at: https://github.com/LLN273/RNAseq_Lsinapis) including Trimmomatic (Bolger et al., 2014), TopHat (Trapnell et al., 2009) and StringTie (Pertea et al., 2015). Two external protein data sets were used; one consisting of protein sequences from Uniprot (https://www.uniprot.org/; accessed 2016-08-01) and the other was the Heliconius melpomene annotation (Heliconius_melpomene_melpomene_Hmel2_proteins.fa; 9,995 annotated coding sequences) available at LepBase (Challis et al., 2017). The Uniprot database (Magrane & UniProt-Consortium, 2011) protein sequences were from the Swiss-prot section and contained only manually annotated and reviewed proteins (551,705 proteins).
As a basis for the construction of gene models, ab-initio predictions from four sources were used; Augustus (Stanke et al., 2006), GeneMark_ES_ET (Ter-Hovhannisyan et al., 2008), SNAP (Korf, 2004) and EVM (Haas et al., 2008). The ab-initio annotation uses statistical models of genome composition to identify the most probable location of start/stop codons and splice sites. The ab-initio tools need to be trained in order to know how genes look like in the focal species (exon and intron lengths and numbers). Augustus version 2.7 (Stanke et al., 2006) was trained with the annotation created after the first run of Maker. GeneMark-ES_ET version 4.3 (Ter-Hovhannisyan et al., 2008), which has previously been shown to be efficient for unsupervised training on fungal and eukaryotic genomes (Lomsadze, 2005;Ter-Hovhannisyan et al., 2008), was trained with the genome assembly as input. SNAP (Korf, 2004) was trained with the genome sequence and a selected set of high-quality genes from the evidence based annotation. Maker (Holt & Yandell, 2011) was then run with all four predictors and combined with evidence based annotation. This strategy, using both evidence sequences (proteins and transcripts) and ab-initio predictions, allowed us to compute high-confidence gene models (Holt & Yandell, 2011).
Maker integrates an annotation quality-control called Annotation Edit Distance (AED) developed by the Sequence Ontology Project (http://www.sequenceontology.org/). The AED metric is a means to quantify the congruency between a gene annotation and its supporting evidence. We omitted all gene predictions with AED = 1 since this indicates a pure ab-initio gene model without evidence from previously available RNA-seq or protein sequence data.
Statistical evaluation of the final annotation was performed with an in-house developed perl script (available at: https://github.com/NBISweden/GAAS). The annotation identified 15,598 protein coding genes with an average coding sequence length (CDS) of 1,134 bp (Supplementary Table 1). BUSCO version v1.1b1 (Simao et al., 2015) was run again after the structural annotation with the set of proteins predicted by the annotation process. The result showed that around 69% (n = 1,860) of the genes in the arthropod gene set (n = 2,675) were complete in the genome, 14% (400) were fragmented, 15% (415) were missing and 11% (319) were duplicated. Part of the number of duplicated copies can be explained by isoforms that are sometimes counted as duplicated genes.
Functional inference for genes and transcripts was performed using the translated CDS features of each coding transcript. Each predicted protein sequence was blasted against the Uniprot/Swissprot reference data set (downloaded 2016-08) in order to retrieve the gene name and the protein function. The predicted gene sequences were also run with InterProscan version 5.7-48 (Jones et al., 2014) in order to retrieve Interpro (Hunter et al., 2012), PFAM (Finn et al., 2014), GO (Ashburner et al., 2000), MetaCyc (Caspi et al., 2018), KEGG (Kanehisa et al., 2014) and Reactome (Fabregat et al., 2018) data. Finally, we used outputs from those analyses using the Annie annotation tool (Tate et al., 2014) to extract and reconcile meta data into predictions for canonical protein names and functional predictions. The combined analyses of protein domains resulted in 10,857 genes with and 4,741 genes without functional annotation information (Supplementary Table 2). The algorithm implemented in tRNAscan version 1.3.1 (Lowe & Eddy, 1997) was used to predict tRNA genes in the L. sinapis genome assembly. We found a large set of tRNAs (n = 22,961) but only a fraction of these (n = 51) were supported by both evidence and ab-initio models.  The associations between regional recombination rate (ρ) estimates for all 15 inter-population comparisons. The correlations were generally positive but weak (Pearson's r = -0.02 -0.20, pvalues > 0.05) with the exception of LsSwe vs. LsKaz where we found a significant positive correlation (Pearson's r = 0.21, p-value = 1.0*10 -51 ).