Structural variants and short tandem repeats impact gene expression and splicing in bovine testis tissue

Abstract Structural variants (SVs) and short tandem repeats (STRs) are significant sources of genetic variation. However, the impacts of these variants on gene regulation have not been investigated in cattle. Here, we genotyped and characterized 19,408 SVs and 374,821 STRs in 183 bovine genomes and investigated their impact on molecular phenotypes derived from testis transcriptomes. We found that 71% STRs were multiallelic. The vast majority (95%) of STRs and SVs were in intergenic and intronic regions. Only 37% SVs and 40% STRs were in high linkage disequilibrium (LD) (R2 > 0.8) with surrounding SNPs/insertions and deletions (Indels), indicating that SNP-based association testing and genomic prediction are blind to a nonnegligible portion of genetic variation. We showed that both SVs and STRs were more than 2-fold enriched among expression and splicing QTL (e/sQTL) relative to SNPs/Indels and were often associated with differential expression and splicing of multiple genes. Deletions and duplications had larger impacts on splicing and expression than any other type of SV. Exonic duplications predominantly increased gene expression either through alternative splicing or other mechanisms, whereas expression- and splicing-associated STRs primarily resided in intronic regions and exhibited bimodal effects on the molecular phenotypes investigated. Most e/sQTL resided within 100 kb of the affected genes or splicing junctions. We pinpoint candidate causal STRs and SVs associated with the expression of SLC13A4 and TTC7B and alternative splicing of a lncRNA and CAPP1. We provide a catalog of STRs and SVs for taurine cattle and show that these variants contribute substantially to gene expression and splicing variation.


Introduction
Genome-wide association studies (GWAS) and expression and splicing QTL (e/sQTL) mapping establish links between genotype and (molecular) phenotype (Visscher et al. 2012;Yang et al. 2013;Littlejohn et al. 2016;Lopdell et al. 2017;Sinnott-Armstrong et al. 2021).These approaches typically rely on SNP and small insertion and deletion (Indel, smaller than 50 bp) markers because they can be genotyped easily and accurately with short sequencing reads using reference-guided approaches.Complex DNA variations such as structural variants (SVs, larger than 50 bp) or short tandem repeats (STRs) are often neglected for GWAS and e/sQTL mapping because they are challenging to genotype.However, it becomes increasingly apparent that SVs and STRs contribute substantially to trait variation (Baker 2012;Sudmant et al. 2015;Abel et al. 2020;Bertolotti et al. 2020;Collins et al. 2020).
SVs can be classified into deletions (DEL), duplications (DUP), insertions, inversions (INV), translocations, segmental DUP, mobile element insertions, or complex rearrangements, which may be a combination of multiple types (Ho et al. 2020;Belyeu et al. 2021).Tandem repeats are consecutive repeats of units ranging from 1 bp to several kb (Gymrek 2017).STRs specifically refer to repeats of a motif between 1 and 6 bp in length, e.g.AGC 7 indicates that a trinucleotide (3 bp) AGC motif is repeated 7 times, yielding a total length of 21 bp.Polymorphic STRs can vary in length due to a contraction or expansion of the repeat motif.These variants can arise due to recombination errors, insertions of mobile genetic elements, slippage during DNA replication, or imperfect DNA repair (Ellegren 2000;Sun et al. 2012;Escaramís et al. 2015).
Historically, SVs and STRs had been genotyped using microarray-and PCR-based methods.The resulting genotypes were used to validate parentage, construct genetic linkage maps, assess genetic diversity, and map QTL (Machugh et al. 1994;Ihara et al. 2004;Wang et al. 2007;McClure et al. 2012).However, these methods investigate only a small number of polymorphic SVs and STRs.Exhaustive genome-wide discovery and genotyping of SVs and STRs have become feasible through advancements in short-read sequencing and variant detection methods (Willems et al. 2014;Sudmant et al. 2015;Saini et al. 2018;Audano et al. 2019;Abel et al. 2020;Collins et al. 2020).Yet, there are only few studies that identified SVs using whole-genome sequencing data in cattle (Boussaha et al. 2015;Chen et al. 2017;Mesbah-Uddin et al. 2018;Kommadath et al. 2019;Lee et al. 2023).To the best of our knowledge, STRs have not been profiled systematically in different cattle breeds using whole-genome sequencing data, as there is only 1 study that characterized 60,106 STRs in 5 Holstein (HOL) cattle (Xu et al. 2017).
It is well known from investigations in species other than cattle that STRs and SVs contribute substantially to complex traits and diseases through mediating gene expression and splicing (Cao et al. 2020;Scott et al. 2021;Vialle et al. 2022).For instance, an intronic AAGGG expansion in the RFC1 gene encoding replication factor C1 is associated with cerebellar ataxia with neuropathy and bilateral vestibular areflexia syndrome in humans (Rafehi et al. 2019).Analyses of the human Genotype-Tissue Expression (GTEx) data showed that SVs were the lead variants in 2.66% cis-eQTL (Scott et al. 2021) and revealed many STRs affecting gene expression (Fotsing et al. 2019).A recent study by Hamanaka et al. (2023) showed that tandemly repeated motifs of up to 20 bp contribute substantially to alternative splicing and thereby phenotype variation (Hamanaka et al. 2023).Recent efforts from the cattle GTEx project (Liu et al. 2022) have revealed noncoding SNPs and Indels that impact splicing and expression variation in multiple tissues.However, they did not explore the roles of SVs and STRs.To date, the contributions of SVs and STRs to gene expression and splicing variation are largely unknown in cattle.Therefore, we generated a catalog of polymorphic STRs and SVs from 183 whole-genome sequenced cattle and assessed the impact of these variant types on gene expression and splicing in testis transcriptomes of 75 mature bulls.Finally, we pinpoint candidate causal STRs and SVs that modulate the expression and splicing of genes in testis tissue.

Alignment and variant calling (SNPs and Indels)
We used paired-end (2 × 150 bp) whole-genome sequencing data of 183 individual cattle (mean coverage 12×) from the Brown Swiss (BSW), Original Braunvieh (OB), Grauvieh (GV), HOL, and Fleckvieh (FV) breeds and their crosses.Reference-guided alignment and variant discovery were performed as described in Lloret-Villas et al. (2021).Breed information and coverage for all animals, along with their accession numbers, can be found in Supplementary Table 1 in File 2. In brief, we aligned reads that passed quality control to the ARS-UCD1.2reference genome using the MEM algorithm of the Burrows-Wheeler Alignment (BWA) software (Li 2013) with option -M.Read duplicates were marked with the MarkDuplicates module from the Picard Tools software suite (Broad Institute 2021).Subsequent discovery and genotyping of SNPs and Indels were performed with GATK HaplotypeCaller (version 4.1) (Depristo et al. 2011).We filtered the variants with hard filtration settings recommended by GATK to retain highquality SNPs and Indels.Finally, we imputed sporadically missing genotypes with Beagle (version 4.1) (Browning and Browning 2016) and retained variants with minor allele frequency > 0.05 for downstream analysis.

Building reference STRs
A previously proposed HipSTR (Willems et al. 2017) workflow (https://github.com/HipSTR-Tool/HipSTR-references/tree/master/mouse) was applied to compile a set of reference STRs from the soft-masked ARS-UCD1.2reference genome [available from Ensembl (v. 104)].Briefly, we ran the Tandem Repeat Finder (TRF) software for each chromosome with settings 2,7,7,80,10,5,500 -h -d -l 6 -ngs (Benson 1999).We retained repeats with a motif size between 1 and 6 bp, merged overlapping repeats, and finally kept sites with high scores according to motif size as implemented in the trf_parser.pyutility.STRs that are not within 10 bp from another STR were retained.

STR genotyping
The STRs were genotyped in the cohort of 183 cattle using the default mode of the HipSTR software tool (Willems et al. 2017).The resulting VCF file was filtered using the filter_vcf.pyscript from HipSTR, with options --min-call-qual 0.8, --max-call-flank-indel 0.20, and --min-loc-depth 5x.We kept only STRs with genotyping rate higher than 60% and at least 1 bp difference.Furthermore, the minimum STR length was established at 11 base pairs, which means that mononucleotide STR required a minimum of 11 copies.

SV calling
We applied the smoove workflow to discover and genotype SVs from short sequencing reads (Pedersen et al. 2020).This approach extracts split and discordant reads from each bam file using samblaster (Faust and Hall 2014).These reads are then further filtered using lumpy (Layer et al. 2014) based on several quality metrics.The filtered reads were subsequently used to genotype SVs in each sample separately.The sample-specific SV calls were merged to obtain a set of SVs that segregate in the cohort.Each sample was then regenotyped for the common set of SVs using SVTyper (Chiang et al. 2015), and Duphold (Pedersen and Quinlan 2019) was run to add depth fold change.A single joint VCF file was eventually generated that contained DEL, DUP, INV, and breakends (BND).We retained only SVs that were longer than 50 bp, for which the breakpoints were precisely known, and that were supported by at least 1 split read.We kept DUP based on average DHFFC (fold change of the variant depth relative to flanking regions) scores as het > 1.25 and homo alt > 1.3 and DEL with DHFFC het < 0.70 and DHFFC homo < 0.50.INV were kept if their quality score was above 100.If multiple SVs were reported for the same location, we kept the variant with the highest quality score.

Annotation of variants
Both STRs and SVs were annotated according to the Ensembl annotation (v.104) of the bovine genome in a hierarchical manner using BEDTOOLS intersect (Quinlan and Hall 2010) (exon > intron > promoter > intergenic).The promoter was defined as the region located within 5 kb upstream of the gene start (5′ end).We assessed if exonic SVs overlap the whole gene or if they overlap only partially as proposed by Collins et al. (2020).SNPs and Indels were annotated with the Variant Effect Predictor (VEP) tool (McLaren et al. 2016) based on the Ensembl annotation of the bovine genome (version 104).The most severe consequence for each variant was then assigned to exon, intron, promoter, and intergenic regions as above.

Population structure and linkage disequilibrium
We used Plink1.9(Purcell et al. 2007) to calculate the principal components of genomic relationship matrices constructed from SNPs/Indels, SVs, or STRs.We used Bcftools (Danecek et al. 2021) to extract all SNPs and Indels within 50 kb of SVs or STRs.For each SV and STR, we calculated linkage disequilibrium (LD) as the squared Pearson correlation coefficient (R 2 ) with the dosage of each surrounding SNP or Indel (maf > 0.05) where dosage is 0 for the 0/0, 1 for the 0/1, and 2 for the 1/1 genotype (Fotsing et al. 2019).

Preprocessing RNA-seq data and alignment
Total RNAs of testis tissue from 76 mature bulls that are a subset of the 183 bulls used to profile STRs and SVs were available from a previous study (Kadri et al. 2021).The stranded paired-end reads were trimmed for adapter sequences, low-quality bases, and poly-A and poly-G tails with fastp (Chen et al. 2018).The filtered reads were aligned to the ARS-UCD1.2reference genome and the Ensembl gene annotation (v.104) using STAR (version 2.7.9a) with options --twopassMode, --waspOutputMode, and --varVCFfile (Dobin et al. 2013).

Gene expression quantification
Gene-level expression [in transcript per million (TPM)] was quantified with the QTLtools quan function with default settings (Delaneau et al. 2017).Raw read counts were obtained with FeatureCounts (Liao et al. 2014).We retained genes that had expression values >0.2 TPM in at least 20% of samples and >6 reads in at least 20% of samples.A PCA was conducted using log 2 (TPM + 1) transformed expression values.One sample was excluded as it appeared as an outlier in the PCA.Finally, TPM values were quantile normalized and inverse normal transformed across samples per gene using the R package RNOmni (McCaw et al. 2020).

Splicing quantification
We used RegTools (Cotto et al. 2023) and LeafCutter (Li et al. 2018) to quantify intron excision ratios.First, we filtered the STAR-aligned bam files for uniquely aligned and wasp-filtered reads (tag vW:i:1) (van de Geijn et al. 2015).Next, exon-exon junctions were obtained using RegTools with option -a 8 -m 50 -M 500000 -s 1.Finally, introns were clustered with a modified version of the leafcutter_cluster.pyscript provided by the human GTEx Consortium (The GTEx Consortium 2020).The script additionally filters introns without any read counts in >50% of samples and insufficient variability.Finally, the filtered intron counts were normalized using the prepare_phenotype_table.py script from LeafCutter and converted to BED format with the start/end position corresponding to the first position of 5′ of intron cluster.

Covariates for e/sQTL analysis
To account for hidden confounders that might cause variance of gene expression or splicing, we applied the probabilistic estimation of expression residuals (PEER) (Stegle et al. 2012).The top 3 principal components of a genomic relationship matrix that was calculated based on LD-pruned (-indep-pairwise 50 10 0.1) SNPs using Plink1.9(Purcell et al. 2007) were used to account for population structure.The influence of covariates on gene expression and splicing was quantified with the variancePartition R package (Hoffman and Schadt 2016).

e/sQTL mapping
We used the difference in length between reference and alternate (computed from the sum of the GB format tag in the output VCF file from HipSTR) alleles as dosage for the STRs for eQTL mapping (Fotsing et al. 2019;Jakubosky et al. 2020).To minimize the effect of outlier STRs, we converted the genotypes to missing if they were not observed in at least 2 samples.We kept sites with >80% genotyping rate.To prevent the removal of multiallelic sites by QTLtools, we replaced the alternate allele field of the VCF file with the string "STR."Furthermore, the GT field (genotype) was substituted with dosage values.Genotypes of SVs, SNPs and Indels, were also converted to dosages (0/0 to 0, 0/1 to 1, and 1/1 to 2).Genotypes at each variant position were normalized so that the effect size can be compared across the different variant types.All these changes were implemented using custom Python scripts.We performed cis-eQTL mapping between expressed genes and all variants in cis (±1 Mb) with QTLtools using the cis permutation mode (1,000 permutations) and accounting for covariates (5 PEER factors, 3 principal components (PC), RNA integrity number (RIN), and age).To account for multiple testing per molecular phenotype (Genes), we used the Storey and Tibshirani false discovery rate procedure that was implemented with the R/qvalue package on beta-approximated P-values (eGene) as described by Delaneau et al. (2017).This approach resulted in genes (eGenes) that had at least 1 significant eVariant and threshold P-values for all genes.Finally, we performed conditional analyses using QTLtools with threshold P-values to identify all significant independent eVariants per eGene that were used for all subsequent comparison.cis-sQTL mapping was performed as described above with QTLtools using the cis permutation mode and accounting for covariates (5 PEER factors, 3 PC, RIN, and age).We employed grouped permutations (-grp_best option) to collectively calculate an empirical P-value across all introns within an intron cluster.Normalized intron excision ratios (the ratio of the reads defining an excised intron to the total number of reads of an intron cluster) were used as molecular phenotypes.We considered sQTL to be an sVariant per sIntron cluster pair.Significant intron clusters were annotated (candidate intron boundaries per cluster) based on the ARS-UCD1.2gene annotation and strand match (Ensembl release 104).Intron cluster coordinates that mapped to multiple genes were considered as unannotated although the number of such intron clusters was less than 100 in each sQTL analysis.

Properties of e/sVariants
From each sQTL/eQTL analysis, we annotated each e/sVariant type with their respective annotation category as described above.For all enrichment analyses, we used Fisher's exact test (2 sided).All plots were created in R (v 3.6.3)with ggplot2.Gene structure was plotted with the ggtranscript R package (Gustavsson et al. 2022), and plots were combined with patchwork (Pedersen 2023).

Results
We used paired-end whole-genome sequencing data of 183 cattle from 5 breeds to genotype SVs, STRs, SNPs, and Indels.The average sequencing coverage was 12.8-fold, and it ranged from 5.0-to 30.4-fold.

Reference-guided discovery of STRs
We identified 1,202,536 STRs with a motif size between 1 and 6 bp in the current Bos taurus taurus reference sequence (ARS-UCD1.2) spanning 24.9 Mb autosomal sequence (1.0%) (Supplementary Fig. 1 in File 1 and Table 2 in File 2).The number of STRs on each chromosome was correlated (r = 0.99) with chromosome length (Supplementary Fig. 2 in File 1).Mono-and hexanucleotide loci were the most and least frequent types of STRs, respectively, amounting to 35.9 and 9.8% of all identified STRs (Supplementary Fig. 1 in File 1).Repeats of A, T, and AT were most prevalent among mono-and dinucleotide STRs.GC-rich repeats (e.g.AGC) were most frequent among trinucleotide STRs (Supplementary Fig. 3 in File 1).The overall length of the STRs varied from 11 to 10,427 bp with a median size of 18 bp.The vast majority of the STRs (n = 1,199,357, 99.7%) were shorter than 100 bp, facilitating short-read-based genotyping.

Genotyping of STRs
We obtained genotypes for 794,300 autosomal STRs in 183 cattle using HipSTR (Willems et al. 2017), of which we retained 374,822 polymorphic loci after stringent filtering for downstream analyses (Supplementary Table 3 in File 2).We identified between 73,791 and 189,658 (average: 150,104) STRs in each cattle genome, and the number of STRs detected correlated (r = 0.94) with sequencing depth (Supplementary Fig. 4 in File 1).As expected, given their prevalence in the bovine reference genome, mono-(52.9%)and hexanucleotide STRs (2.5%) were respectively the most and least frequent types of polymorphic STRs (Fig. 1a).Pentanucleotide STRs were more frequent than tetranucleotide STRs.Approximately three-quarters of polymorphic STRs (n = 266,509, 71.1%) were multiallelic and had between 1 and 41 alternate alleles, but more than 20 alternate alleles were rarely seen (Supplementary Fig. 5 in File 1).Dinucleotide STRs had the highest number of alternative alleles among all STRs (Fig. 1b).Repeats of A and T were the most frequent classes among the mononucleotide STRs, whereas AGC and CTG repeats prevailed among trinucleotide STRs (Supplementary Fig. 6 in File 1).Heterozygosity and allelic diversity were higher for dinucleotide loci than any other type of STRs (Fig. 1b and c), possibly suggesting higher mutation rate and less purifying selection in this class.
We investigated the overlap between the STRs detected in our study and 16 STRs that had been frequently used for parentage testing [including those that had been approved by the International Society for Animal Genetics (ISAG)] ( Van De Goor et al. 2009) (Supplementary Table 4 in File 2).Forward and reverse primer sequences of the 16 STRs were subjected to a BLASTn search against the ARS-UCD1.2reference sequence to identify their positions, motifs, and flanking sequences.Twelve STRs from our STR reference panel had matching coordinates and motif sequences.Among them, 10 STRs were genotyped in 183 samples with allele counts ranging from a minimum of 5 to a maximum of 11, whereas the remaining 2 STRs were excluded during quality filtering.

SV discovery and genotyping
We applied the smoove workflow to discover and genotype 61,806 SVs in the 183 cattle genomes, of which we retained 19,408 polymorphic autosomal loci (12,899 DEL, 1,043 DUP, 224 INV, and 5,242 SVs with unspecified BND) after stringent filtering for downstream analyses (Supplementary Fig. 2a in File 1 and Table 5 in File 2).The number of polymorphic SVs identified per chromosome was correlated (r = 0.94) with chromosome length (Supplementary Fig. 9 in File 1).We found between 4,259 and 6,835 SVs in each cattle genome (mean: 5,915), and this number was correlated with sequencing coverage (r = 0.60) (Supplementary Fig. 10 in File 1).A total of 6,728 (34.6%)SVs had minor allele frequency below 0.05 (Fig. 2c).Inspecting the length of the different SV types suggested that most (n = 891, 85.4%) DUP were smaller than 1 kb, whereas 3,465 (23%) DEL were larger than 1 kb (Supplementary Fig. 11 in File 1).

LD and population structure of the cattle cohort
We also discovered and genotyped SNPs and Indels in the 183 animals using the GATK haplotype caller.We considered 12,222,397 SNPs and 1,317,363 Indels with minor allele frequency greater than 5% for the downstream analyses, of which 55,010 SNPs and 89,673 Indels overlapped with STRs, and 387,593 SNPs and 47,129 Indels overlapped with SVs.The large overlap between SNP, SV, and STR variants is possibly due to nested variation but can also indicate that short sequencing reads are unable to resolve complex DNA variation.
We calculated the principal components from genomic relationship matrices built with SNP, STR, and SV genotypes of the 183 cattle.All three analyses correctly separated the individuals by breed (Supplementary Fig. 12 in File 1).Due to variation in sample size, coverage, and insert size between breeds, we did not investigate within-and across-breed diversity in SVs and STRs.Next, we investigated if SVs and STRs can be tagged by SNPs/ Indels.We calculated the LD between SNPs/Indels within 100 kb of each SV and STR.We observed that 40.1% of STRs (n = 150,393) were in high LD (R 2 > 0.8) with at least 1 SNP or Indel while this fraction ranged from 3.1 to 52.2% for the different SV types (Supplementary Fig. 13a in File 1 and Table 6 in File 2).BND and DUP were poorly tagged, possibly indicating low genotyping accuracy for these loci.The LD between SNPs/Indels and STRs was consistent across the different STR types (Supplementary Fig. 13b and Table 6 in File 2).

Properties of STRs and SVs associated with gene expression
The impact of polymorphic SVs, STRs, SNPs, and Indels on gene expression was investigated in a subset of 75 sequenced bulls that also had testis RNA-sequencing data.We performed cis-eQTL mapping between 19,415 expressed genes and 12,093 SVs, 271,450 STRs, and 13,494,075 SNPs and Indels that had minor allele frequency greater than 5% in the 75 bulls.Five eQTL analyses were conducted, i.e. for SNPs and Indels, SVs, STRs, and jointly for SVs and STRs (SV-STR), and all (ALL) variants to assess the contribution of different types of DNA variation to gene expression.
Exonic eDUP predominantly increased gene expression, while exonic eDEL mostly decreased gene expression.All other eVariant types exhibited a bimodal effect size distribution.We then explored the LD between eSTRs and eSVs and nearby SNP/ Indel.More than three-quarters (78.4%) of the eDEL and two-thirds of the eSTR (65.9%) were in high LD (R 2 > 0.8) with surrounding SNP/Indel.In contrast, eDUP and eBND were poorly tagged by SNP/Indel.
We found that 92.2% of the eGenes were protein-coding genes, 4.5% were long noncoding RNA (lncRNA), 0.98% were pseudogenes, and 1.8% were other genes (i.e.genes that are not classified in the above 3 categories) (Supplementary Fig. 15 in File 1).We observed a similar distribution of eGenes across all eVariant types except for eINV.

cis-sQTL mapping
We calculated intron excision ratios of 241,427 introns assigned to 76,083 intron clusters.More than half (n = 135,342, 56.0%) of the introns overlapped with 14,583 genes, but the annotation-free splicing event identification by the LeafCutter software also detected many introns that did not overlap with annotated features.The intron excision ratios were normalized for each intron and subsequently used as input phenotypes for cis-sQTL mapping.We mapped cis-sQTL with an approach that was similar to the eQTL mapping, i.e. we separately considered SNPs and Indels, SVs, STRs, SV-STR, and ALL.

Variant properties of sSTR and sSV
The impact of STRs and SVs on alternative splicing was assessed based on the results from the SV-STR sQTL analysis (Supplementary Table 11 in File 2).We observed that DEL were more likely to be sVariants than STRs (4.9% of DEL, OR = 1.6, P = 1.7 × 10 −17 ) (Fig. 5a; Supplementary Table 12 in File 2).Conversely, BND (OR = 0.4, P = 1.3 × 10 −9 ) were less likely to be sVariants compared to STRs.We further examined how many intron clusters are affected by an sVariant.A similar proportion of sDEL (12.2%) and sSTRs (11.2%) were associated with multiple sIntron clusters whereas this fraction was considerably lower or negligible for all other types of sVariants (Fig. 5b).Between 62 and 81% of the sVariants were located within 100 kb of their sIntron cluster (Fig. 5c; Supplementary Table 13 in File 2).
Most of the sQTL overlapped with either introns (49.7%) or intergenic regions (41.0%) but only few with promoter (6.9%) and exons (2.3%).Interestingly, sDEL were enriched in exons and depleted in introns of other genes, whereas sDUP showed enrichment in exons of sGenes (Supplementary Table 14 in File 2).On the other hand, sSTRs were depleted in exons of other genes, but they were enriched in introns of other genes (Supplementary Table 14 in File 2).We observed a high proportion of trinucleotide sSTRs among those that overlapped exons.These trinucleotide sSTRs were GC rich (Fig. 5f).Most sQTL showed bimodal effects.A bimodal effect size distribution in splicing variation encompassing both positive and negative effects is associated with variation in the relative abundance of transcripts between different genotypes (Garrido-Martín et al. 2021).sDUP had slightly positive effects on splicing phenotypes, which may indicate a relatively higher abundance of the transcript associated with the DUP (Fig. 5g).The vast majority of sDEL (94.0%) and sSTRs (84.3%) were in high LD (R 2 > 0.8) with surrounding (±50 kb) SNP/Indel, but sDUP (31.5%) and sBND (39.5%) were less frequently tagged (Fig. 5h).
Finally, we compared genes and molecular QTL (eQTL and sQTL as gene-variant pair) from both the eQTL and sQTL SV-STR analyses (Supplementary Fig. 18 in File 1).This comparison revealed that 1,988 genes and 505 QTL overlapped between both analyses.Out of the 505 shared QTLs, 479 were due to STR, while 23 were due to DEL (Supplementary Fig. 19 in File 1).The eQTL that were also sQTL mainly regulated expression due to alterations in gene transcript level abundance, and these changes were mainly modulated by STR and DEL.
Among the sQTL, we identified a candidate causal sSTR (AACTG 5 ,Bos_Tau_STR_57388,Chr1:112,307,307,890 bp) upstream of the lncRNA ENSBTAG00000054182 (Fig. 6a).An expansion of Bos_Tau_STR_57388 (the inserted motif AAATG differed slightly from the reference motif AACTG) was associated with a splicing junction (Chr1:112,307,322,799, P = 4.5 × 10 −25 ) in both SV-STR and ALL sQTL analyses.This splicing junction extends from upstream the lncRNA to the first intron of the lncRNA (Fig. 6a), and its intron excision ratio increased with an expansion of the repeat motif (Fig. 6c and d).The expression of ENSBTAG00000054182 (mean TPM 5.1 ± 1.5) decreased with the insertion of an additional repeat unit (Fig. 6b).Bos_Tau_STR_57388 was also the top eVariant for ENSBTAG00000054182 in the SV-STR eQTL analysis (P = 2.3 × 10 −15 ) but not in ALL eQTL analysis where a SNP (Chr1:112,315,134 bp) in LD (R 2 = 0.93) was the top eVariant.

Discussion
We generated a catalog of bovine polymorphic STRs, which contain motifs that vary in size, but some may also contain variation between the repeat motifs.A large number of cattle from different breeds enabled us to genotype 6-fold more STRs compared to a previous study (374,821 vs 60,661) that considered only 5 HOL cattle genomes (Xu et al. 2017).Three-quarters of the STRs genotyped in our study were multiallelic, which agrees with previous studies in cattle, pigs, and humans (Willems et al. 2014;Xu et al. 2017;Wu et al. 2021).We also genotyped almost 20k SVs.The majority of both SVs and STRs were in introns and intergenic regions likely because coding regions are less tolerant to variants affecting several bases.We also detected SVs and STRs that overlapped exonic regions, but half of the exonic SVs were only present in the heterozygous state, which may indicate that some of them manifest deleterious phenotypes in the homozygous state.However, even deleterious SVs can persist and increase in frequency over time due to drift or pleiotropic effects and balancing selection, such as a 660-kb DEL in Nordic red cattle (Kadri et al. 2014).Deleterious SVs in less conserved genes may be evolutionarily less constrained (Mesbah-Uddin et al. 2018).We also observed a high proportion of tri-and hexanucleotide STR in exonic regions possibly suggesting that nontriplet STRs are less tolerated and might be under negative selection (Willems et al. 2014).
We observed more than twice the number of DEL compared to other SV types likely because they are easier to identify from short-read sequencing data (Mahmoud et al. 2019).Only half of the STRs and DELs are in high LD with SNPs and Indels (R 2 > 0.8).The LD between SNPs and other types of SVs such as BND, DUP, and INV is even lower, which could be due to incorrect genotyping, alignment error, or their occurrence in complex regions such as segmental DUP.Thus, the direct genotyping of these variants is required to enable powerful association studies.
In the cattle breeding industry, genomic prediction is routinely applied to estimate breeding values.This estimation typically relies on dense genotypes obtained from SNP arrays or imputed sequence variants but does not consider an exhaustive set of STRs and SVs. Lee et al. (2023) added 372 SVs to the custom content of a frequently applied microarray to enable genotyping at the population scale.Our study confirms that such efforts are warranted, as SVs and STRs are only partially tagged by adjacent SNPs.
Particularly intriguing are also SVs and STRs that are associated with gene expression and splicing variation, as they potentially impact phenotypes.Considering such SVs and STRs and those exhibiting limited or no LD with genotyped SNPs carries the potential to improve the accuracy of genomic predictions and offers insights into the molecular underpinnings of QTL (Lee et al. 2021;Leonard et al. 2023).
Our results confirm that sequencing coverage and insert size have profound impacts on the genotyping of SVs and STRs (Willems et al. 2017;Lee et al. 2023).We applied stringent filters to retain only high-confidence SVs.This approach likely removed some true large and complex SVs and STRs from our data.Long sequencing reads and pangenome integration enable to reliably detect large and complex SVs and STRs (Leonard et al. 2022;Talenti et al. 2022).However, long-read sequencing is still too costly when applied at population scale.Future studies could utilize a combination of long-read sequencing and pangenome integration with short-read sequencing data to identify and genotype the full spectrum of genetic variants at the population scale (Chaisson et al. 2019;Ebert et al. 2021).
Our study examines impacts of polymorphic SVs and STRs on gene expression and splicing variation in bovine testis tissue.Further research is needed to extend this analysis to additional tissues as gene regulation is predominantly tissue specific (Gamazon et al. 2018).Regardless, our eQTL and sQTL analyses showed that SVs and STRs have profound impacts on the transcriptional profile in testis.We found that each eSV affects on average 1.11 nearby genes with most of this contribution arising from DUP.However, this value is lower than the 1.82 genes in cis per eSV reported recently in humans, where major contributions were from multiallelic copy number variants (mCNV) and DUP (Scott et al. 2021).In our study, CNV are part of the DUP category.
This difference likely indicates that our study had less power to detect s/eQTL because our variant catalog (61,668 SVs in human vs 19,408 SVs here) and sample size (643 individuals with 48 tissues vs 75 individuals with 1 tissue) were considerably smaller.Our results confirm that e/sDUP in exonic and nonexonic regions mostly increase gene expression whereas e/sDEL decrease gene expression (Chiang et al. 2017;Scott et al. 2021).An increased expression associated with an e/sDUP is frequently due to either DUP of the entire gene or exon or its regulatory regions.It is possible that some of the eSVs and eSTRs detected in our study are associated with semen quality and male fertility.Further investigations are needed to explore the integration of molecular QTL cohorts and GWAS cohorts through transcriptome-wide association testing.This approach holds the potential to reveal SVs and STRs associated with specific traits (Mapel et al. 2023).
Our analysis showed that most e/sSTRs and e/sSVs were in intronic regions rather than intergenic regions, which contrasts SV and STR impact gene expression in bovine testis | 9 with their overall distribution along the genome.This pattern agrees with the position of human e/sSTRs and e/sSVs (Chiang et al. 2017;Fotsing et al. 2019;Jakubosky et al. 2020)  overlap of TSS to STR loci, which are unassigned to any known genic or enhancer regions in humans (Grapotte et al. 2021).Most of these TSS overlapping with STRs are responsible for initiating noncoding RNAs in humans.Similarly, a candidate causal sSTR detected in our study was associated with the splicing of the lncRNA ENSBTAG00000054182, which produces a transcript that is not included in the current Ensembl annotation.This further emphasizes the need for an improved bovine annotation, particularly with respect to noncoding elements of the genome such as lncRNAs.Although the association of expression and splicing variation with STRs and SVs in e/sQTL studies does not necessarily provide the underpinning molecular mechanism of action, these variants contribute significantly to complex trait variation (Xiang et al. 2022).

Fig. 1 .
Fig. 1.Properties of 374,822 polymorphic STRs in 183 taurine B. taurus taurus cattle genomes.a) Proportion and count of STRs for each motif size.b) Number of alternative alleles observed for each STR motif size.Numbers above the boxplots indicate the average number of alleles observed in the 183 cattle genomes for each motif size.c) Heterozygosity in each STR motif.d) Proportion of loci overlapping 4 annotation categories for each STR motif size.Numbers inside the stacked bars represent the total count of STRs for each annotation category.

Fig. 2 .
Fig. 2. Properties of 19,420 polymorphic SVs in 183 taurine cattle genomes.a) Count of polymorphic loci for each SV type.b) Proportion of loci overlapping 4 annotation categories.The numbers inside the stacked bars represent the count of SVs for each annotation category.c) Alternative allele frequency distribution for each SV type.

Fig. 3 .
Fig.3.Properties of eSVs and eSTRs from the SV-STR eQTL mapping.eSNPs and eIndels are added from the ALL eQTL mapping.a) Percentage of unique eVariants for each variant type.The count of eVariants per type is shown next to each bar.b) Number of eGenes affected by each type of eVariants.c) Distribution of the absolute distance between eVariants and eGenes (5′-UTR or TSS).d) Proportion of eQTLs from different annotation categories in each variant type, such as those overlapping intergenic regions or overlapping exons, promoters, or introns of their corresponding eGenes or other genes.e) Proportion of eQTLs from different annotation categories in each STR type such as those overlapping intergenic regions or overlapping exons, promoters, or introns of their corresponding eGenes or other genes.f) Total count of the most frequent STR motifs (>30 observations) among eSTR.g) Distribution of effect size of eQTL per type based on exonic [overlap with an exon (exonic) of their eGene or other genes] or nonexonic category.h) Distribution of maximum LD (R 2 ) per variant for each eVariant type.

Fig. 4 .
Fig. 4. Candidate eSTR a-c) and eSV d-g) associated with eGene expression.The eSTR Bos_Tau_STR_126581 is a GT dinucleotide that repeats 11 times in the reference sequence and between 8.5 and 13 times in the 75 genotyped bulls.eQTL mapping revealed association between Bos_Tau_STR_126581 and TTC7B mRNA abundance.a) Schematic overview of the exon/intron structure of bovine TTC7B gene and Bos_Tau_STR_126581.The vertical line indicates the position of Bos_Tau_STR_126581, and the vertical half lines indicate exons of TTC7B.b) Normalized gene expression of TTC7B in 75 genotyped bulls in each mean dosage of eSTR.The numbers above each boxplot are the total number of animals per genotype.The mean repeat count per genotype is determined by adding the total repeat number of both alleles belonging to a genotype and then dividing this sum by 2. c) Manhattan plot of −log 10 (P)-values for all variants surrounding Bos_Tau_STR_126581 (pink colour) from the nominal ALL eQTL analysis.Different colors indicate the pairwise LD (R 2 ) between Bos_Tau_STR_126581 and all other variants.d) Schematic overview of ENSBTAG00000015551 (turquoise color) and SLC13A41 (salmon color) that are associated with a 885-bp DEL on chromosome 4 (eDEL 4_99481913_DEL).The boxes represent exons.The vertical lines indicate the position of 4_99481913_DEL.e, f) Normalized mRNA expression of ENSBTAG00000015551 and SLC13A41 in 75 genotyped bulls for each genotype of eDEL.The numbers above each boxplot are the total number of animals per genotype.g) Manhattan plot of −log 10 (P)-values for all variants surrounding 4_99481913_DEL from the nominal ALL eQTL analysis as pink colour (2 points as same eDEL corresponds to 2 genes).Different colors indicate the pairwise LD (R 2 ) between 4_99481913_DEL and all other variants.

Fig. 5 .
Fig.5.Properties of sVariants (SVs and STRs) from the SV-STR sQTL analysis.SNPs and Indels are from ALL analyses in all panels.a) Percentage of unique sVariants for each variant type.The number of sVariants per category is shown next to the bars.b) Total number of sIntron clusters per sVariant for each variant type.c) Distance between sQTL and the start position of the associated sIntron cluster for each variant type.d) Fraction of sQTL from different annotation categories in each variant type such as those overlapping intergenic regions or overlapping exons, promoters, or introns of their corresponding sGenes or other genes.e) Fraction of sQTL from different annotation categories in each STR motif size such as those overlapping intergenic regions or overlapping exons, promoters, or introns of their corresponding eGenes or other genes.f) Prevalence of the most frequent (>50) STR motif among sSTR.g) Distribution of sQTL effects.Colors differentiate between exonic and nonexonic sQTL.The dots represent the overall mean.h) Distribution of maximum LD (R 2 ) between sQTL and SNP/Indel for different variant types.

Fig. 6 .
Fig. 6.Two candidate causal sSTR a-d) and sSV f-h).The sSTR Bos_Tau_STR_57388 on chromosome 1 is associated with ENSBTAG00000054182 splicing.a) Schematic overview of ENSBTAG00000054182.Bos_Tau_STR_57388 (vertical straight line) is upstream of lncRNA ENSBTAG00000054182, and it is associated with the splicing junction spanning from Chr1:112,307,410 to 112,322,799.Intron/splice junction boundaries are indicated with dotted lines.b) Normalized ENSBTAG00000054182 expression and c) intron excision ratio for different sSTR genotypes.The numbers above each boxplot are the total number of animals per genotype.The mean repeat count per genotype is determined by adding the total repeat number of both alleles belonging to a genotype and then dividing this sum by 2. d) Manhattan plot of nominal ALL sQTL result surrounding the sSTR (pink colour).Different colors indicate the pairwise LD (R 2 ) between Bos_Tau_STR_57388 and all other variants.A candidate sSV (8_17328959_DEL) on chromosome 8 is associated with alternative CAAP1 splicing.e) Schematic overview of CAAP1 gene.A promoter DEL "8_17328959_DEL" (vertical straight line) is associated with excision ratios of splicing junction Chr8:17,329,855-17,336,097.Intron boundaries are indicated with dotted lines.f) Normalized CAAP1 expression and g) intron excision ratio for the different sSV genotypes.The numbers above each boxplot are the total number of animals per genotype.h) Manhattan plot of nominal ALL sQTL result surrounding sDEL (pink colour).Different colors indicate the pairwise LD (R 2 ) between sDEL and all other variants.
(Fotsing et al. 2019;Ho et al. 2020;Vialle et al. 2022)Vs and STRs in regulating gene expression and splicing(Jakubosky et al. 2020).Intronic and intergenic regions can contain regulatory elements that modulate splicing and gene expression via change in nucleosome positioning, open chromatin structure, RNA-binding protein, or DNA methylation(Fotsing et al. 2019;Ho et al. 2020;Vialle et al. 2022).Nearly half of the intron clusters detected in our study could not be annotated with the current cattle annotation (Ensembl 104).The FANTOM5 consortium revealed significant