Genome-wide characterization of human minisatellite VNTRs: population-specific alleles and gene expression differences

Abstract Variable Number Tandem Repeats (VNTRs) are tandem repeat (TR) loci that vary in copy number across a population. Using our program, VNTRseek, we analyzed human whole genome sequencing datasets from 2770 individuals in order to detect minisatellite VNTRs, i.e., those with pattern sizes ≥7 bp. We detected 35 638 VNTR loci and classified 5676 as commonly polymorphic (i.e. with non-reference alleles occurring in >5% of the population). Commonly polymorphic VNTR loci were found to be enriched in genomic regions with regulatory function, i.e. transcription start sites and enhancers. Investigation of the commonly polymorphic VNTRs in the context of population ancestry revealed that 1096 loci contained population-specific alleles and that those could be used to classify individuals into super-populations with near-perfect accuracy. Search for quantitative trait loci (eQTLs), among the VNTRs proximal to genes, indicated that in 187 genes expression differences correlated with VNTR genotype. We validated our predictions in several ways, including experimentally, through the identification of predicted alleles in long reads, and by comparisons showing consistency between sequencing platforms. This study is the most comprehensive analysis of minisatellite VNTRs in the human population to date.


INTRODUCTION
Over 50% of the human genome consists of repetitive DNA sequence (1,2) and tandem repeats (TRs) comprise one such class. A TR consists of a pattern of nucleotides repeated two or more times in succession. TR loci are defined by their position on the genome, the sequence and length of their repeat unit, and copy number.
TRs are commonly divided into three classes based on pattern size: short tandem repeats (STRs), or microsatellites, with pattern lengths of six or fewer base pairs (bp), minisatellites with patterns ranging from seven to several hundred bp, and macrosatellites with patterns from hundreds to thousands of bp (3,4). This study focuses on the TR minisatellite class, which comprises more than a million loci in the human genome.
Many TRs appear to be monoallelic with regard to copy number. However a significant fraction exhibit copy number variability in the population and these variant TR loci are called Variable Number Tandem Repeats (VN-TRs). Changes in VNTR copy number have been proposed to arise by slipped strand mispairing (5-7), unequal crossover (8,9), and gene conversion (8,10).
More than half of previously identified human VNTR loci (22) are located near or within genes (23), and so their potential effects on gene expression or protein products are substantial. Indeed VNTRs have been associated with changes in levels of gene expression (24), including tissue specific expression (25,26).
Minisatellite VNTRs have been proposed to regulate genes in several ways. In promoters, they can carry binding sites for transcription factors such as NF-B and myc/HLH (27,28), permitting copy number to affect transcription factor binding and therefore level of transcrip-tion (29)(30)(31)(32). VNTRs in introns have been shown to contain enhancer sequences (33)(34)(35) or to cause differential splicing (36,37). VNTRs in exons can also affect transcription (38,39) and the stability or translation rate of the resulting proteins (40).
VNTRs have been proposed as drivers of phenotypic variation in evolution (85)(86)(87). For example, an examination of functional enrichment in genes containing TRs or VNTRs, in both humans and apes, found that genes with ape-specific TR variability were associated with the senses of taste and smell, while genes with human-specific TR variability or human-specific copy number were associated with neurogenesis and neural development. These results suggest that VNTR polymorphisms may help account for 'human-specific cognitive traits' (24). Additionally, the Eichler group (87) has examined TR loci on human and ape genome assemblies from PacBio sequencing data and identified 1584 human-specific VNTR loci with 52 as candidate regions associated with disease.
Despite their biological significance, until recently, relatively few human minisatellite VNTRs have been identified and studied in detail. The ever-increasing availability of accurate whole genome sequencing (WGS) data, however, provides extensive opportunity for high throughput, genome-wide VNTR genotyping. Further, the emergence of PCR-free WGS datasets is reducing locus selection bias and enabling better filtering of false positive VNTR variants.
Nonetheless, genotyping variability in repeat sites remains challenging (88,89). Although a number of tools have been designed to detect microsatellite copy number variability such as lobSTR (90), popSTR (91), hipSTR (92), GangSTR (93) and ExpansionHunter (94), only two highthroughput tools are available for minisatellite genotyping. The adVNTR tool (95) trains a Hidden Markov Model (HMM) for each VNTR locus of interest and has been used to predict variability in 2944 VNTRs intersecting coding regions.
VNTRseek (96), developed in our lab, uses the Tandem Repeats Finder (TRF) (97) to detect and characterize TRs inside reads and then maps read TRs to TRs in a reference set. Because it builds pattern profiles before mapping, VNTRseek is robust in the presence of SNPs and small indels. While VNTRseek has high precision, it has two ma-jor limitations. It only detects minisatellites which contain a minimum of 1.8-1.9 pattern copies (depending on pattern length), a limitation from the Tandem Repeats Finder (TRF) software, and the tandem array, plus short flanking sequences, must fit completely inside a read. This means that arrays longer than the read length cannot be detected. Given these limitations, the results we report underestimate the true polymorphic nature of minisatellite TRs.
In this paper, we present the most comprehensive catalog of minisatellite VNTRs in the human genome to date, pooling results for WGS datasets from 2770 individuals, processed with VNTRseek on the GRCh38 human reference genome. We report a large collection of previously unknown VNTR loci, find that many VNTR loci and alleles are common in the population, show that VNTR loci are enriched in gene and regulatory sequences, provide evidence of gene expression differences correlated with VNTR genotype and evidence of population-specific VNTR alleles.

Datasets
Datasets comprising 2801 PCR-free, WGS samples from 2,770 individuals were used in this study (Table 1): 30 individuals from the 1000 Genomes Project Phase 3 (98), including the Utah (CEU) and Yoruban (YRI) trios (motherfather-child); 2,504 unrelated individuals mostly overlapping with the 1000 Genomes Project, recently sequenced at >30× coverage by the New York Genome Center (NYGC); 253 individuals from the Simons Genome Diversity Project (SGDP) (43), seven individuals sequenced by the Genome in a Bottle (GIAB) Consortium (99), including the Chinese (HAN) and Ashkenazi Jewish (AJ) trios and NA12878 (with ID HG001); two 'haploid' hydatidiform mole cell line genomes, CHM1 (100) and CHM13 (101); tumor/normal tissues from two unrelated individuals with breast cancer (breast invasive ductal carcinoma cell line/lymphoblastoid cell line) from the Illumina Basespace public WGS datasets (102); and the AJ child sequenced with PacBio Circular Consensus Sequencing (CCS) reads (103). Duplicates of 27 genomes were present in two datasets, 1000 Genomes and NYGC. One of these, NA12878, was also included in the GIAB dataset.
Overall, read coverage ranged from approximately 27×, in the PacBio sample to 333×, in the GIAB Chinese child. Besides the PacBio data, reads consisted of three lengths, 100/101 bp (257 samples), 148/150 bp (2508 samples), and 250 bp (35 samples). All data were downloaded as raw fastq files, except for the PacBio data which were obtained as a BAM file with reads aligned to GRCh37. SRA links to the data are given in Table 4.
The majority of the analyses in this paper were performed on the 2504 genomes from NYGC. The 253 genomes from SGDP provided insight into under-represented populations. The 27 genomes duplicated in the 1000 Genomes and NYGC datasets were used to measure consistency across sequencing platforms. The trios from the 1000 Genomes (CEU and YRI) and GIAB (AJ and Chinese HAN) datasets were used for analyzing Mendelian inheritance. The cancer datasets were used to find possible changes in VNTRs in tumor tissues. The PacBio data were used for validation purposes only.

Curating the reference TRs
The 22 autosomes and sex chromosomes from the human reference genome GRCh38 were used to produce a reference set of TRs in TRDB (105) with the TRF software and four quality filtering steps as described in (96). In addition, centromere regions were excluded from the reference set. These filtering tools are available online in TRDB. Starting with 1 199 362 TRs found by TRF, we curated a filtered reference set with 228 486 TRs. Using VNTRseek, we classified the TRs into two subcategories, singletons and indistinguishables (Supplementary Section S2.1). A singleton TR appears to be unique in the genome based on a combination of its repeat pattern and flanking sequence. An indistinguishable TR belongs to a family of genomically dispersed TRs which share highly similar patterns and flanking sequence and may therefore produce misleading genotype calls. Indistinguishable TRs (total 37 200 or about 16% of the reference set) were flagged. Simulation testing revealed that some singletons produced false positive VNTRs. To minimize this issue, an additional filtering step was added to eliminate problematic singleton loci from the reference set (see Supplementary Section S2.2 and Supplementary Material Reference TRs.txt for the reference sets). We assumed that genotyping was possible for a reference TR locus, given a particular read length, if the TR array length plus a minimum 10 bp flank on each side, would fit within the read. The number of reference TR alleles that could be genotyped using each of the read lengths in our data is summarized in Supplementary Table S1.

Genotyping TRs and VNTRs
Each dataset was processed separately with VNTRseek using default parameters: minimum and maximum flanking sequence lengths of 10 and 50 bp, respectively, on each side of the array, and requiring at least two reads mapped with the same array copy number to make an allele call. Output from VNTRseek included two VCF files containing genotype calls, one reporting all detected TR and VNTR loci, and the other limited to VNTR loci only. (A locus was considered a VNTR if a non-reference allele was detected.) The VCF files contained two specialized FORMAT fields: SP, for number of reads supporting each allele, and CGL, to indicate genotype by the number of copies gained or lost with respect to the reference. For example, a genotype of 0 indicated detection of only the TR reference allele (zero copies gained or lost), while 0, +2 indicated a heterozygous locus with a reference allele and an allele with a gain of two copies.
To remove clear inconsistencies, for this study we filtered the VCF files to remove per sample VNTR loci with more alleles than the expected number of chromosomes. The filtering criteria for these loci, termed multis is detailed in Supplementary Section S2.3. After multi filtering, a TR locus was labeled as a VNTR if any remaining allele, different from the reference, was observed in any sample.

Experimental validation
Accuracy of VNTRseek genotyping was experimentally tested for 13 predicted VNTR loci in the Ashkenazi Jewish (AJ) trio. The following DNA samples were obtained from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research: NA24385, NA24149 and NA24143 (also identified as GIAB IDs HG002, HG003 and HG004). Selection criteria required that the PCR product was not contained in a repeat region, unique primers could be designed, the primer-defined allele length difference was between 10% and 20% of the longest allele, and the primerdefined GC content was between 40% and 60%. Given these criteria, we prioritized VNTRs in genes and regulatory regions which might be of interest to researchers. Primers were Nucleic Acids Research, 2021, Vol. 49, No. 8 4311 designed with Primer-BLAST (106) and used to amplify the VNTR loci from the genomic DNA of each individual using the following reagents: 0.2 l 5 U/l DreamTaq DNA Polymerase (ThermoFisher Scientific), 4.0 l 10× Dream-Taq Buffer (ThermoFisher Scientific), 0.8 l 10mM dNTP mix (ThermoFisher Scientific), 3.2 l primer mix at a final concentration of 0.5 M, 1.6 l genomic DNA (40 ng), and 30.2 l nuclease-free water. PCR cycling conditions were as follows: 30 s at 95 • C, 30 s at 56-60 • C, 20 s at 72 • C, for 30 cycles, with an initial denaturation of 3 min at 95 • C and a final extension of 7 min at 72 • C. The resulting amplicons were run on a 2% agarose gel at 100 V for 2 h and visualized with UV light using Ethidium Bromide. A complete list of loci and primers is given in the Supplementary Material as Experiment primers.txt.
In addition, VNTRseek predictions in the NA12878 genome were compared to experimental validations in the paper describing the adVNTR software (95). We had three datasets for NA12878: HG001 (148 bp) from GIAB, NA12878 (150 bp) from NYGC and NA12878 (250 bp) from 1000 Genomes. The adVNTR predictions used GRCh37 coordinates which were converted using the UCSC liftover tool (107) to coordinates to GRCh38.

Validation using long reads
Aligned PacBio reads for the AJ child (GIAB ID HG002) were processed to validate VNTRseek predictions. The read sequences were extracted from the BAM file and mapped back to the GRCh38 genome using BWA MEM default settings (108). Using bedtools (109), the reads aligning to each TR reference locus were extracted. For each read, a local wraparound dynamic programming alignment was performed using the reference pattern and the same scoring parameters used to generate the reference set (match = +2, mismatch = −5 and gap = −7). The number of copies of the pattern in the resulting alignment was then determined and compared with the VNTRseek predictions. If the difference between a PacBio copy number in at least one read and the VNTRseek copy number was within ±0.25 of a copy, we considered the VNTRseek allele to be validated.

Measuring consistency of Mendelian inheritance
A locus on an autosomal chromosome is consistent with Mendelian inheritance if the genotype of a child can be explained as one allele from the mother and one from the father. Genotype consistency was evaluated for all motherfather-child trios, i.e. the AJ, CEU, HAN and YRI trios. We evaluated loci defined by several increasingly stringent criteria: both parents heterozygous, all members of the trio heterozygous, all members of the trio heterozygous and with different genotypes. These criteria were selected to avoid false interpretations of consistency.
TR loci on the X and Y chromosome of male children were also selected for evaluation when both the son and the appropriate parent had a predicted genotype. In these cases, inheritance consistency means a son's X chromosome allele is observed on one of the mother's X chromosomes, and a son's Y chromosome allele is observed on the father's Y chromosome.

Measuring allele consistency across platforms
VNTR calls were compared for each of 27 genomes that were represented twice, once in the 1000 Genomes dataset, sequenced in 2015 on an Illumina HiSeq2500 with 250 bp read length and once in the NYGC dataset, sequenced in 2019 on an Illumina NovaSeq 6000 with 150 bp read length. The two platforms have different error profiles.
Because read length and coverage differed among datasets, for each pairwise comparison, we only considered loci that were genotyped in both samples and classified as VNTR in at least one. We extracted the non-reference VNTR alleles (detected in at least one sample) and computed consistency as the ratio of the size of the intersection set of those alleles (found in both platforms) over the size of their union (found in either). For alleles detected in the 250 bp reads, we only counted those that could have been detected in the shorter 150 bp reads. Reference alleles were excluded to avoid inflating the ratio.

Common and private VNTRs
To classify commonly polymorphic VNTRs (hereafter 'common VNTRs') and private VNTRs, we used results from the NYGC dataset (2504 individuals) as the read length and coverage were comparable across all genomes. Additionally, these genomes contain no related individuals and represent a wide set of populations (26 populations from five continents). Loci were classified as common VN-TRs if non-reference alleles were identified in at least 5% (126) of the individuals and classified as private VNTRs if non-reference alleles were identified in less than 1% (25).

Annotation and enrichment
Annotation based on overlap with functional genomic regions was performed for the reference TR loci. Genomic annotations for GRCh38 were obtained from the UCSC Table Browser (110) in BED format. Known gene transcripts from GENCODE V32 (111) were used along with tracks for introns, coding exons, and 5 and 3 exons. Regulatory annotations included transcription factor binding site (TFBS) clusters (112,113) and DNAse clusters (114) from ENCODE 3 (115), and CpG island tracks (116), comprising 25%, 15% and 1% of the genome, respectively. Bedtools (109) was used to find overlaps between TR loci and the annotation features. Any size overlap was allowed.
LOLAweb (117) was used to determine VNTR enrichment for genomic regions in comparison to the background TR annotations, and common and private VNTR enrichment in comparison to all VNTR annotations. TRs on the sex chromosomes were excluded in the background set. To identify gene and pathway functions that could be affected by common VNTR copy number change, genes with exons or introns overlapping with common VNTRs were collected and their enrichment computed using GSEA (118) for Gene Ontology (GO) terms (119) for biological process and KEGG pathways (120) with FDR P-value < 0.05.

Association of VNTR alleles with gene expression
To detect expression differences among individuals with different VNTR genotypes, mRNA expression counts from lymphoblastoid cell lines of 660 individuals by the Geuvadis consortium (Accession: E-GEUV-1) were downloaded (121). A total of 445 individuals overlapped with the 2504 NYGC genomes set. We paired VNTR loci with genes within 10 kb, and extracted the genotypes for each individual at those VNTRs. When no genotype was observed for an individual, we classified the genotype as other. We did this because we assumed that the alleles were outside the detection range, given that genotypes were observed in other individuals with similar coverage. VNTR loci were retained for analysis if at least two genotypes were detected for that VNTR across all individuals (at least three if other was one of the genotypes) and if each genotype was observed in at least 20 individuals. Genes were excluded from analysis if the median TPM (transcripts per kilobase million) expression value equaled zero.
To control for confounders we used covariates for sex and population structure and detected additional hidden covariates using Iteratively Adjusted Surrogate Variable Analysis (IA-SVA) (122) on the log 2 normalized TPM values. For population structure we used the top five principal components determined from a principal components analysis of the informative SNP genotypes from the 445 individuals as reported by the 1000 Genomes project (http://ftp.1000genomes.ebi.ac.uk/ vol1/ftp/release/20130502/supporting/hd genotype chip/). Using IA-SVA and observing that covariates sixteen and above were over 85% correlated with other covariates (Supplementary Figures S22-S24), we chose fifteen hidden factors to include in our model. Finally, we used a linear regression expr ession ∼ sex + populati on PC As + hidden f actors with the log 2 normalized TPM values to extract residuals to be used in the downstream association model.
For each gene-VNTR pair, we used a one-way ANOVA test as residuals ∼ genotype to detect if the mean of any genotype class was different from the others. The P-values of the ANOVA tests were adjusted using FDR. Any gene-VNTR pair with FDR <0.05 was reported. For significant eQTLs, we calculated, for reporting purposes, the maximum difference of the residual means over all pairs of genotype classes.
To associate eQTLs with histone marks or open chromatin, we downloaded narrow peaks data in GRCh38 in bed format from 14 experiments on histone marks and one on DNAse hypersensitive sites from the GM12878 (B-lymphocyte) cell line from the ENCODE project (123) (source IDs are given in Supplementary Table S13). Any overlaps of peaks with the eQTL VNTRs were reported.

Population-specific alleles
The 2504 genomes in the NYGC dataset consisted of 26 populations of individuals with ancestry from five superpopulations: African, American, East Asian, European, and South Asian. To investigate the predictive power of common VNTRs with regard to super-population membership, Principal Component Analysis (PCA) clustering was applied. For each sample, a vector of common loci alleles showing presence/absence (1/0) was produced. Uninformative alleles (that were not present in at least 5% of the sam- ples) were removed and principal components (PCs) calculated over the resulting vector set. Using a 70% training to 30% testing split of the data, a decision tree based on the first 10 PCs was trained using 10-fold cross validation and then validated on the testing data.
In order to find super-population markers among the common VNTRs, a one-sided Fisher's exact test was used to calculate the odds ratio and P-value of each allele being in one super-population versus being collectively in all the others. We only considered alleles over-represented rather than both over-and under-represented because of an interest in identifying alleles that have a phenotypic effect. Odds ratio values were log 2 transformed and P-values were adjusted for false discovery rate (FDR) (124). Any allele with FDR <0.05 and log 2 (odds ratio) >1 was chosen as a significant marker for that population.

RESULTS
In this section, we start with a summary and characterization of our VNTR predictions, followed by identification of commonly polymorphic VNTRs and an enrichment analysis of their association with genomic functional regions and genes sets. We next report on the effect of VNTRs on expression of nearby genes, and then identify population-specific VNTR alleles and show that they are predictive of ancestry. We conclude this section with evidence confirming the accuracy of our predictions using several validation methods.

About one in five minisatellite TRs are variable in the human population
WGS datasets from 2770 human genomes were analyzed with VNTRseek to detect VNTRs. Overall, 184 315 out of 191 286 singleton reference TR loci (∼96%) were genotyped across all samples ( Table 2) while 5% of the loci had TR arrays too long to fit within the longest reads and could only be genotyped if they lost a sufficient number of copies.
A total of 5 198 392 loci with non-reference alleles were detected, corresponding to 35 638 (∼19%) distinct VNTR loci, indicating wide occurrence of these variable repeats. Their occurrence within genes was common, totaling 7698 Nucleic Acids Research, 2021, Vol. 49, No. 8 4313 protein coding genes, and 3512 exons. The resulting genotypes were output in VCF format files (see Data Availability Section) and summarized for each genome (Supplementary Material Summary of results.txt). A website is under development to view the VNTR alleles (http://orca.bu. edu/VNTRview/).

The number of TRs genotyped and VNTRs detected depends on coverage and read length
To determine the effect of coverage and read length on genotyping, we measured two quantities: the percentage of reference singleton TRs that were genotyped, and the total number of singleton VNTRs that were detected in each genome. Only singleton loci were considered in all further analyses. Figure 1A shows that there was a strong positive correlation between coverage and the ability to genotype TRs. A strong correlation with read length was also apparent, however, the effect was larger, primarily due to the ability of longer reads to span, and thus allow VNTRseek to detect, longer TR arrays. These results suggest that our VNTR numbers are undercounts.
VNTR detection was similarly dependent on coverage and read length, as shown in Figure 1B. However, detection was also positively correlated with population, which seemed likely due to the evolutionary distance of populations from the reference genome, which is primarily European (125). For example, in the 250 bp trios with comparable coverage, the African Yoruban genomes (YRI) had the highest number of VNTRs detected, followed by the Ashkenazi Jewish genomes (AJ), and finally, the Utah genomes (CEU). Notably, within each trio, the VNTR counts were similar.
The 'haploid' genomes CHM1 (150 bp) and CHM13 (250 bp) had greatly reduced VNTR counts relative to genomes with similar coverage and read length. This was because in these genomes, which consist of two copies of an underlying haploid genome, the single allele represented at any VNTR locus would frequently be a reference allele and so the locus would not be called as a VNTR.

More than two alleles are common in VNTRs
Two alleles were detected in the majority of VNTR loci across all datasets ( Figure 1C). However at 10 698 loci (29%), three or more alleles were detected. In a substantial number of loci (5395), the reference allele was never seen, but in only 105 of these (2%) was the reference allele in the VNTRseek detectable range for the 150 and 100 bp reads, which made up the bulk of our data. Interestingly, in 1166 loci, the reference allele, although detectable, was not the major allele (Supplementary Material Major genotypes.txt).

Loss of VNTR copies relative to the reference is more common than gain
Overall, VNTRseek found approximately 1.8-fold more alleles with copy losses (3 444 128), with respect to the reference copy number, than gains (1 958 250). Loss of one copy (2 263 608) was the most common type of VNTR polymorphism ( Figure 1D). Although there were allele lengths that VNTRseek could not detect, this bias persisted even when restricting the loci to only those where gain and loss could both be observed (Supplementary Figure S5). The overabundance of VNTR copy loss may actually be an underestimate. Since VNTRseek required a read to span a TR array for it to be detected, gain of one copy would have been possible in approximately 68%, 82% and 92% of loci for samples with read lengths of 100, 150 and 250 bp, respectively. By contrast, the reference locus needed to have a minimum of 2.8 copies for a loss of one copy to be observed by TRF, and only 16% of the reference loci met this criterion. Higher observed copy loss could be explained by a bias in the reference genome towards including higher copy number repeats (126), or by an overall mutational preference for copy loss.

VNTRs have high heterozygosity
High heterozygosity in human populations suggests higher genetic variability and may have beneficial effects on a range of traits associated with human health and disease (127). Since calculating heterozygosity for VNTRs is not straightforward (because of limitations on discovering alleles, especially within shorter reads), we used the percentage of detected, per-sample heterozygous VNTRs as an estimate for heterozygosity. At read length 250 bp, per-sample heterozygous VNTR loci comprised approximately 46-55% of the total, which is comparable to previous theoretical estimates of 43-59% (19). At shorter read lengths, the bottom of the range extended lower (∼38-57% for 150 bp reads, ∼29-51% for 100 bp reads, Supplementary Figure S1), as expected, because longer alleles were undetectable if they did not fit within a single read.
Interestingly, despite the previous comment, within genomes that were comparable in read length and coverage, the fraction of heterozygous loci clustered within populations (Supplementary Figures S2-S4), with African genomes generally having more heterozygous calls and East Asians fewer. This result is consistent with previous findings of population differences in SNP heterozygosity among Yoruban and Ashkenazi Jewish individuals with respect to European individuals (128,129), and suggests higher genomic diversity among African genomes, as has been previously noted (130).

Loss of heterozygosity observed in tumor samples
A significant loss of heterozygosity (LOH) was observed in predicted VNTRs of one of the tumor tissues compared to its matching normal tissue (sample ID HC1187). The percentage of heterozygous VNTRs was roughly double in the normal tissue (∼38% versus ∼19%) (Supplementary Tables  S2 and S3). Extreme loss of heterozygosity in small variants has previously been reported in these samples by Illumina Basespace (131) with the number of heterozygous small variants in HC1187 being four times lower in the tumor tissue compared to the normal. Taken together, these results suggest that VNTR LOH could be linked to tumor progression.
Additionally, in both tumors a large number of loci exhibited loss of both alleles in comparison to the normal tissue (Supplementary Table S2   tumor samples was significantly higher than for the normal tissue, it is unlikely that these observations were due to artifacts. Also, the tumor samples did not show a higher percentage of filtered multi VNTRs (too many alleles) than the normal samples (1.37% and 1.23% in normal tissue versus 1.72% and 1.71% in tumor tissue).

Common versus private VNTRs
Following methodology used with SNPs (140), we classified VNTR loci in the 2504 healthy, unrelated individuals from the NYGC dataset (150 bp and coverage >30×) as commonly polymorphic, or 'common VNTRs,' if non-reference alleles occurred in at least 5% of a population (126 individuals) and as private VNTRs if they occurred in less than 1% (25 individuals).
We classified 5676 VNTRs as common (17% of the 33 403 VNTRs detected in this population) and 68% as private. In each sample, we detected, on average, 1951 VNTR loci, and among those, 1783 were common VNTRs (median 1,677) and 46 were private VNTRs (median 17). A total of 3627 common VNTRs overlapped with 2173 protein coding genes including 254 exons. Interestingly, increasing the threshold for common VNTRs did not reduce the number dramatically (Supplementary Figure S6), suggesting that these VNTRs have not occurred randomly, but rather have undergone natural selection. Widespread occurrence of common VNTRs indicates a fitness for use in Genome Wide Association Studies (GWAS). A list of common and private VNTRs can be found in Supplementary Material Common VNTRs.txt and Private VNTRs.txt.

Common VNTR enrichment in functionally annotated regions
To determine possible functional effects of the common VNTRs, we classified the overlap of reference TRs with various functionally annotated genomic regions: upstream and downstream of genes, 3 UTRs, 5 UTRs, introns, exons, transcription factor binding site (TFBS) clusters, CpG islands, and DNAse clusters (Supplementary Material Reference set.txt).
Our reference TR set comprised only 0.52% of the genome, however, 49% of human genes contained at least one TR and 5% of all the TFBS clusters overlapped with TRs. Moreover, high proportions of our TR reference set and common VNTRs intersected with genes (63% and 64%, respectively), TFBS clusters (38% and 51%), and DNAse clusters (21% and 28%) (Supplementary Table S5).
In comparison to TRs, VNTR loci were positively enriched in 1 kb upstream and downstream regions of genes, 5 and 3 UTRs, coding exons, TFBS clusters, DNAse clusters, and CpG islands (P-values < 0.05) (Supplementary Tables S4 and S5). The common VNTRs, on the other hand, compared to all VNTRs, were enriched in 1 kb upstream regions of genes, TFBS, and CpG islands, suggesting regulatory function. Private VNTRs were less likely to occur in 1 kb upstream or downstream regions, inside TFBS clusters, open DNAse clusters, or CpG islands.
Focusing on the common VNTRs, we used the LO-LAweb (141) online tool to perform enrichment analysis with various curated feature sets (Supplementary Figures  S7-S12 in Supplementary Section S5.2), Among the results, DNAse enrichments by tissue type (142) pointed to brain, muscle, epithelial, fibroblast, bone, hematopoietic, cervix, skin, and endothelial (Supplementary Figure S11) with brain showing up multiple times, consistent with findings in the literature (24). These results suggest that VNTR alleles may affect gene regulation in multiple tissues.

VNTR genotypes are correlated with gene expression differences
To detect association between VNTR genotypes and expression of nearby genes, we paired VNTRs to any gene within 10 kb and after removing genes with low expression and controlling for confounders, applied a oneway ANOVA test to determine if there was a significant difference between the average gene expression levels for the VNTR genotypes. A total of 1071 gene-VNTR pairs were tested and 193 pairs (187 genes, 188 VNTRs) exhibited significant expression differences at FDR <0.05 (Supplementary Figure S25 and Supplementary Material RNA VNTRs.txt)). The top 10 genes by mean expression difference were: DPYSL4, KLF11, B4GALNT3, PIP5K1B, DNAJA4, THNSL2, CD151, MXRA7, HEBP1 and FARP1. Of these eQTL VNTRs, 47 also exhibited population-specific alleles (next section and Supplementary Table S15) and 81 overlapped with peaks for histone marks and DNAse hypersensitive sites (see Supplementary Table S13 and Supplementary Figure S31), a significant level of enrichment in these marks (Supplementary Table S14).
Three of the top genes are shown in Figure 2. Gene MXRA7 is associated with a VNTR (id 182606303) in the 5' UTR exon, DPYSL4 is associated with a VNTR (id 182316137) in the first intron, and CSTB is associated with an upstream VNTR (id 182814480). The VNTR region in MXRA7 is a target site for transcription factors METTL23 and JMJD6. METTL23 is known to function as a regulator in the transcriptional pathway for human cognition (143) and has been associated with mental retardation and intellectual disability (144). JMJD6 is associated with pancreatitis (145) and tumorgenesis (146,147). Copy number expansions in the VNTR upstream of CSTB have been previously associated with progressive myoclonic epilepsy (EPM1) (148). For this VNTR, we observed the −1 and 0 alleles (2 and 3 copies, respectively), which are com-

One in five common VNTR loci have population-specific alleles
We further investigated whether VNTR alleles are population-specific and whether they can be used to predict ancestry. Understanding the occurrence of populationspecific VNTR alleles will be useful when controlling for population effects in GWAS, and more generally in interpreting gene expression differences among people of different ancestry.
A total of 4605 alleles from the common VNTR loci were classified as common if they were detected in at least 5% of the population (NYGC). We then constructed a matrix of presence/absence of each allele by sample and clustered the samples using Principal Component Analysis. We found that the first, second, fourth, and fifth principal components (PCs) separated the super populations as shown in Figure 3. Each PC captured a small fraction of the variation in the dataset, suggesting that there was substantial variation between individuals from the same population.
The first PC separated Africans, suggesting furthest evolutionary distance. The second PC separated East Asians. The third PC captured coverage bias. The fourth and fifth PCs separated South Asians and Americans, respectively. The American population had a sub-population of Puerto Ricans that clustered with the Iberian Spanish population, suggesting mixed ancestry (149). To show the power of these alleles to predict ancestry, we next trained a decision tree model (Supplementary Figure S19) using the top 10 PCs (11% of the total variation) and achieved a recall of >98% on every population when applied to the 30% test partition (Supplementary Table S10).
A one-sided Fisher's exact test was applied to determine the population-specific VNTR alleles that were overrepresented in one population versus all the others. A total of 3850 VNTR alleles were identified as populationspecific in one or more super-populations, corresponding to 1096 VNTR loci (Supplementary Figures S20  and S21). The complete list of population-specific alleles can be found in Supplementary Material Superpopulation VNTRs.txt. These loci overlapped with 689 genes and 51 coding exons. Africans had the highest number of population-specific alleles (266), followed by East Asians (65), while Americans had the lowest (13), suggesting more mixed ancestry. We observed 63 loci that had a population-specific allele in each population. Figure 4 illustrates seven of the top population-specific loci in a 'virtual gel' representation, mimicking the appearance of bands on an agarose gel for easier interpretation. Fortyeight genes that displayed expression differences correlated with VNTR genotype were associated with populationspecific VNTR loci (Supplementary Table S15), including the VNTR 182316137 associated with the gene DPYSL4, discussed in the previous section, which exhibited seven different alleles, five of which were population-specific.

Accuracy of VNTR predictions
To show the reliability of our results, we experimentally validated VNTR predictions at 13 loci in the three related AJ genomes, and also compared VNTRseek predictions to alleles experimentally validated in the literature. We additionally used accurate long reads on one genome (HG002) to find evidence of the predicted alleles. Separately, we showed the consistency of our predictions in two ways: first, we looked at inheritance consistency among four trios (mother, father, child), and second, we compared result for genomes sequenced on two different platforms.
Experimental validation. All but one of the 66 predicted VNTR alleles were confirmed at 13 loci in the three related AJ genomes (child HG002, father HG003 and mother HG004). In the remaining case, two predicted alleles were separated by only 15 nucleotides and could not be distinguished. At two loci, other bands were also observed. In one, all three family members contained an allele outside the detectable range of VNTRseek (longer than the reads).
In the other, one allele that was detectable was missed in two family members (see Table 3 for a summary of results, Supplementary Table S6 for details of the experiment, and Supplementary Figures S13-S17 for gel images).
We also compared VNTRseek predictions in three datasets from the NA12878 genome with VNTRs validated in the adVNTR paper (95). Out of the original 17 VNTR loci experimentally validated in that paper, four were not included in our reference set and for one, the matching TR could not be determined. In total, 11 out of 16 detectable alleles were correctly predicted, four were not found in the NA12878 sample with sufficient read size (250 bp), and one was incorrectly predicted in the HG001 sample and not found in the other two (Supplementary Table S7).
Validation of predicted VNTRs using long reads. PacBio Circular Consensus Sequencing reads from the HG002 genome (103), with an average length of 13.5 kb and an estimated 99.8% sequence accuracy, were computationally Numbers and labels at bottom are VNTR locus ids with nearby genes indicated and the population-specific allele expressed as copy number change (+1, −2, etc.) from the reference. For example, in the leftmost column, the +1 allele was over-represented in the African population. Note that the allele bias towards pattern copy loss relative to the reference allele is apparent and that at one locus (second from left) the reference allele was the population-specific allele since almost no reference alleles were observed in the four other populations. The details of these seven loci are given in Supplementary Table S11. tested to determine if they confirmed VNTRseek predicted alleles for the GIAB Illumina reads from the same genome. Overall, >97% of predicted alleles were confirmed, and at the predicted VNTR loci, >87% of alleles were confirmed (Supplementary Table S8).

VNTR predictions are consistent with Mendelian inheritance.
We compared the predicted alleles in four trios (CEU and YRI trios from 1000 Genomes; Chinese HAN and AJ from GIAB), testing loci on autosomes and X and Y chromosomes (see Materials and Methods). In all cases, only a handful of loci were inconsistent (Supplementary Table S9).
VNTR non-reference allele predictions showed substantial consistency across platforms. In 2015, the 1000 Genomes Phase 3 sequenced 30 genomes using Illumina HiSeq2500 at read length 250 bp. In 2020, 27 of those 30 genomes were resequenced by NYGC using Illumina Novaseq 6000 at read length 150 bp. Comparing VNTR loci genotyped in both platforms and only non-reference alleles detectable at both read lengths, agreement ranged from 76% to 91% (Supplementary Figure S18). When including reference alleles, the agreement increased to greater than 99%. Note, however, that read coverage was not the same for both datasets, causing variation in statistical power.

DISCUSSION
The current study represents, to our knowledge, the largest analysis of human whole genome sequencing data to detect copy number variable tandem repeats (VNTRs) and greatly expands the growing information on this class of genetic variation. The TRs genotyped consisted of some 184 000 minisatellites occupying the mid-range of pattern sizes, from seven to 126 bp. Our results reveal that nearly 20% (35 828) are variable, exhibiting at least one nonreference allele, a number much larger than has been generally understood. Moreover, we have classified a large subset of these (5676) as common VNTRs, with non-reference alleles occurring in >5% of the population. When considering the largest dataset in our study (2504 individuals), we found that, on average, each genome exhibited non-reference alleles at 1951 VNTR loci and among those, 1783 were common VNTRs.
In addition to their widespread occurrence, further evidence of minisatellite VNTR importance can be seen in the enrichment of these loci in genes and gene regulatory regions (promoters, transcription factor binding sites, DNAse hypersensitive sites, and CpG islands). Our entire set of VN-TRs overlapped with 7698 protein coding genes and 3512 exons. The common VNTRs occurred within or were proximal to over 2173 protein coding genes, including overlapping with 254 exons. Biological function enrichment among these genes included neuron development and differentiation, and behavior. Examples include low-density lipoprotein receptor-related protein 6 (LRP6), a co-receptor for Wnt signaling, which is down-regulated in Alzheimer's disease and which appears critical for maintaining synaptic integrity (158); Down syndrome cell adhesion moleculelike 1 (Dscaml1), which appears essential for development of GABAergic neurons in the entorhinal cortex (159); and ZMIZ1, a transcription factor co-activator, for which mutant allele over-expression leads to pyramidal neuron morphology abnormalities (160). These observations are consistent with the finding that VNTR expansions in humans compared to primates are associated with gain of cognitive abilities (24), and possible involvement of VNTRS with many neurodegenerative diseases and behavioral disorders (87).
The overabundance of VNTR proximity to genes suggests that variability at these loci could affect gene expression and indeed, we observed that the expression levels of 187 genes were significantly correlated with the presence of specific VNTR alleles in lymphoblastoid cell lines of 445 individuals. Others have also recently detected minisatellite VNTR eQTLs. In (25), expression levels in 46 tissues for 6802 genes were tested and 161 eQTLs were found. Thirteen of those genes coincided with ones detected in our study: BPTF, CCDC57, CCDC66, EPDR1, EPS8L2, LSS, MVB12A, PTPRVP, S100A10, SNAPC1, TMEM52, TRAPPC2L, ZNF736. Similarly, in (26) 21 VNTRs were found associated with expression in 38 genes. However, their VNTR alleles were larger and did not overlap the ones we tested in this study. eQTLs have also been found among the variant STR class. In (161), 2060 genes were found to be affected by eQTL STRs. In another study (162), 28 375 variant STR Thirteen VNTR loci were selected for experimental validation in the AJ trio. All but one of the 66 bands predicted by VNTRseek were validated. † For the remaining band, the results were questionable because the two predicted alleles for the father were only 15 nucleotides different in length, which was too close to distinguish in the image. * For all three individuals, the gel contained bands (bold) not predicted (or detectable) by VNTRseek. The extra band for the son corresponded to the −1 allele as found in the PacBio reads. The father's extra band appeared to match with the −1 allele. The mother's extra band appeared to be a +1 allele. # An extra band for the mother and son (bold) was not predicted by VNTRseek, although it seemed to match the +1 allele that was detectable. +These VNTRs overlapped regulatory sites that target the given genes. loci were associated with differential expression of 12 494 genes in at least one tissue (FDR < 10%) and 1420 of the loci were predicted, by CAVIAR (163), to be causal in 17 tissues. These findings are suggestive, but more study is required, both to determine if there is more evidence of tissue specific gene expression variation associated with VNTR genotypes (25,26,87) and if such correlational differences can be definitively tied to actions associated with specific VNTR alleles such as regulator binding affinity changes in regulatory regions. For more elaborate studies such as these, it will be essential that for each sample used to measure gene expression, the raw whole genome sequencing data be available, so that specialized software programs, such as VN-TRseek can be used to determine VNTR genotype.
The frequency of VNTR occurrence and possible effects on gene expression suggest that minisatellite VNTR loci could be useful in genome-wide association studies (GWAS). However, it is well known that hidden differences can lead to misinterpretation of GWAS results, and care is particularly important when those differences are tied to human ancestry. Relevant to this, we have determined that 1096 of the common VNTR loci contain alleles showing significant population specificity and that these loci inter-sect with 689 genes. Understanding such hidden variability will be essential for interpreting GWAS and future studies should investigate possible haplotype linkages between specific VNTR alleles and nearby SNP alleles.
Population-specific alleles also have the potential for use in tracing early human migration. We have shown through principal component analysis with common VNTR alleles that super-populations are easily separated. Further, we have constructed a decision tree based on common VNTR alleles that obtains nearly perfect classification of individuals at the super population level. It will be interesting to see whether, with more information, classification can be refined further to encompass specific subpopulations, whether a minimal minisatellite VNTR set can be established for high accuracy population classification, and whether VNTR alleles can be used to estimate mixed ancestry as is done now with SNP haplotyping.
Despite the high precision of VNTRseek, (see Supplementary Figure S32 and Supplementary Table S16-S20) our curation of VNTR loci has certainly produced an undercount. This is true because VNTRseek requires that the tandem array fit within a read. Longer reads will help, but the abundance of high-coverage. low-error, longread datasets is currently limited. Alternate methods exist (87,95), but these have not reported an ability to handle macrosatellite VNTRs where the arrays and patterns are hundreds to thousands of base pairs long. For this range of the tandem repeat spectrum, new tools must be developed.
Another limitation comes from use of the Tandem Repeats Finder, which requires that the array contain at least 1.9 copies to be detected. At read length 150 bp, which included the majority of our samples, a gain of one copy compared to the reference genome could be detected in 82% of the TR loci while loss of one copy could be detected in only 16%. Despite this imbalance, one copy loss was observed nearly 40% more often than one copy gain, an important observation with regard to potential tandem repeat copy number bias in the reference genome.
Previous studies on VNTR prevalence in the human genome have been limited to a subset of minisatellites inside the transcriptome and a limited number of genomes. Here, we have shown a broader prevalence of VNTR loci and suggested their importance with regard to gene function, population studies, and GWAS. Future research can be expected to further enhance our understanding of this important class of genomic variation.

DATA AVAILABILITY
The reference TR set files, output VCF files, and the preprocessed data files along with the code to create figures and tables are published at: DOI 10.5281/zenodo.4065850 VN-TRseek can be downloaded at: https://github.com/Benson-Genomics-Lab/VNTRseek.