Sequence-based mutation patterns at 41 Y chromosomal STRs in 2 548 father–son pairs

Abstract
A total of 2 548 unrelated healthy father–son pairs from a Northern Han Chinese population were genotyped at 41 Y chromosomal short tandem repeats (Y-STRs), including DYS19, DYS388, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS437, DYS438, DYS439, DYS444, DYS447, DYS448, DYS449, DYS456, DYS458, DYS460, DYS481, DYS518, DYS522, DYS549, DYS533, DYS557, DYS570, DYS576, DYS593, DYS596, DYS627, DYS635, DYS643, DYS645, Y-GATA-H4, DYF387S1a/b, DYF404S1a/b, DYS385a/b, and DYS527a/b. In the 2 548 father samples, 2 387 unique haplotypes were detected, with haplotype diversity and discrimination capacity values of 0.999 956 608 and 0.967 410 07, respectively. The average gene diversity (GD) value was 0.6934, ranging from 0.1051 at DYS645 to 0.9657 at DYS385a/b. When alleles at 24 overlapping Y-STRs were compared between the ForenSeq™ deoxyribonucleic acid (DNA) Signature Prep Kit on the MiSeq FGx® Forensic Genomics System and the Goldeneye® DNA ID Y Plus Kit on the Applied Biosystems™ 3730 DNA Analyzer in 308 father samples from mutational pairs, 258 alleles were detected by massively parallel sequencing (MPS) typing, including 156 length-based alleles that could be obtained by capillary electrophoresis (CE) typing, 95 repeat region (RR) variant alleles and seven flanking region variant alleles. Among these, we found 16 novel RR variant alleles and identified for the first time two SNPs (rs2016239814 at DYS19 and rs2089968964 at DYS448) and one 4-bp deletion (rs2053269960 at DYS439), all of which have been validated by the Database of Short Genetic Variation. Sanger sequencing or MPS was employed to confirm 356 mutations from 104 468 allele transfers generated from CE, of which 96.63% were one-step mutations, 2.25% two-step, and 1.12% multi-step; the overall ratio of repeat gains to losses was balanced (173 gains vs. 183 losses). Of the 308 father–son pairs, 268 carried mutations at a single locus, 33 at two loci, six at three loci, and one at four loci.
The average mutation rate across the 41 Y-STRs was ~3.4 × 10⁻³ (95% confidence interval: 3.1 × 10⁻³–3.8 × 10⁻³). The mutation rates at DYS576 and DYS627 were higher than 1 × 10⁻² in Northern Han Chinese, whilst the mutation rates at DYF387S1a/b, DYF404S1a/b, DYS449, DYS518, and DYS570 were lower than initially defined. In this study, the classical molecular factors (a longer STR region, a more complex motif and an older father) were confirmed to increase Y-STR mutation rates, but the length of the repeat unit did not conform to the convention. Lastly, StatsY, an interactive and installable graphical program, was developed to help forensic scientists automatically calculate allele and haplotype frequencies, forensic parameters, and mutation rates at Y-STRs.
Key points
In 308 of 2 548 father–son pairs from Northern Han Chinese, at least one mutation occurred across 41 Y-STRs.
Sanger sequencing or MPS was employed to confirm the mutations generated from CE.
A longer STR region, a more complex motif and an older father increased Y-STR mutation rates.
StatsY was developed to calculate allele and haplotype frequencies, forensic parameters and mutation rates at Y-STRs.


Introduction
The Y chromosomal short tandem repeat (Y-STR) markers are widely used in forensic casework and kinship analyses owing to their non-recombining nature and paternal transmission pattern [1]. Normally, males carry a single Y chromosome (ChrY), which is passed down from the father and transmitted to the sons unchanged as a haplotype, provided that the haplotype is located in the non-recombining region of ChrY and no mutational events occur. Analyses of Y-STRs are beneficial for examining mixtures with a minor male component and a major female component in sexual assault cases, determining the number of male suspects in "gang rape" investigations, and widening dragnets by familial searching for male question samples ("Q") in cases without suspects ("K"), missing persons cases, and mass disaster victim identifications [2].
Since Roewer et al. [3] reported the first Y-STR locus in 1992, thousands of Y-STR markers have been identified across the ChrY [4]. A core set termed the "minimal haplotype" loci was selected in 1997, including DYS19, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, and DYS385a/b [5,6]. In 2003, the Scientific Working Group on DNA Analysis Methods (SWGDAM) recommended a core "extended haplotype" set that included the "minimal haplotype" loci plus DYS438 and DYS439 [2]. On the basis of these recommendations, a series of commercial kits incorporating 12–38 Y-STR loci have been released by Promega [7,8] and Thermo Fisher Scientific [9–11]. In China, Peoplespot, AGCU, HEALTH Gene Technologies, and Microread have successively developed commercial Y-STR kits including 36–41 loci [12–15]. Although the number of loci seems more than sufficient, a Y-STR match between a question sample and a suspect does not carry the same discriminating power as an autosomal STR match, because Y-STRs identify a patrilineage rather than an individual. In 2012, Ballantyne et al. [16] identified a new panel, namely the rapidly mutating (RM) Y-STRs with mutation rates above 1 × 10⁻², which are expected to increase the resolution of patrilineage differentiation compared with conventional Y-STR markers. Besides length-based deoxyribonucleic acid (DNA) typing assays, massively parallel sequencing (MPS) analysis has the potential to provide more genetic information via repeat and flanking region sequences. Zhao et al. [17] and Kwon et al. [18] have successfully detected sequence-based Y-STRs using in-house panels on the Ion Torrent Personal Genome Machine (PGM) and the MiSeq System, respectively. In 2015, Verogen (formerly Illumina) launched the ForenSeq™ DNA Signature Prep Kit, in which 26 Y-STRs can be sequenced along with >200 forensically relevant markers [19,20]. The novel markers (i.e. RM Y-STRs) and cutting-edge technology (i.e. MPS) should help expand the application of ChrY in the forensic community.

Population
Liaoning is a province of the People's Republic of China, located in the northeast of the country, with the Yellow Sea in the south, North Korea in the southeast, Jilin to the northeast, Hebei to the southwest, and Inner Mongolia to the northwest. At the 2020 census, the population of Liaoning was mostly Northern Han Chinese (36 169 617, 84.92%) with minorities of Manchu (5 385 287, 11.94%), Mongolian (669 972, 1.59%), Hui (264 407, 0.59%), Korean (241 052, 0.54%), and Xibe (132 615, 0.30%) [21]. Bloodstain samples of 2 548 unrelated healthy father–son pairs were collected from Liaoning Province after informed consent and approval of the Ethical Committee of Jinzhou Medical University (No. 202001-6). All male germline transmissions were confirmed with a minimum paternity index of 10 000 using the Goldeneye® DNA ID 20A Kit (Peoplespot, Beijing, China) according to the manufacturer's recommendations [22].

CE-Y-STR typing
Samples were prepared as 1.2 mm discs punched from bloodstains on Whatman® FTA® cards (GE, Piscataway, NJ, USA) and were directly amplified without DNA extraction using the Goldeneye® DNA ID Y Plus Kit (Peoplespot) in a 10 μL reaction volume containing 2 μL of 5× polymerase chain reaction (PCR) Reaction Mix, 2 μL of 5× Y Plus Primer Mix, and 6 μL of ddH2O. Thermal cycling was performed on the GeneAmp™ System 9700 (Thermo Fisher Scientific, Foster City, CA, USA) under the following conditions: denaturation for 2 min at 95 °C; amplification for 30 cycles of 5 s at 94 °C, 45 s at 60 °C and 45 s at 72 °C; extension for 15 min at 60 °C; and hold at 15 °C. One microliter of PCR product or allelic ladder was diluted in 8.5 μL of Hi-Di™ Formamide mixed with 0.5 μL of ORG500 Size Standard. After denaturation for 3 min at 95 °C and quenching on ice for 3 min, all prepared samples were separated and detected on the Applied Biosystems™ 3730 DNA Analyzer (Thermo Fisher Scientific). Standard run parameters were: dye set J6, sample injection for 8 s at 1.2 kV, and electrophoresis in the 36 cm capillary using POP-7™ polymer as indicated in the GeneMapper36_POP7 run module. CE raw data were analyzed using GeneMapper™ ID-X Software v1.5 (Thermo Fisher Scientific) with the peak amplitude threshold set at 50 relative fluorescent units.

MPS-Y-STR typing
DNA from bloodstain samples of the mutational father–son pairs was extracted using magnetic beads on the Microlab STAR Liquid Handling System (Hamilton, Bonaduz, Switzerland). Genomic DNA was quantified on the Applied Biosystems™ QuantStudio™ 5 Real-Time PCR System (Thermo Fisher Scientific) using the Quantifiler® Trio DNA Quantification Kit (Thermo Fisher Scientific) according to the manufacturer's recommendations [23]. Library preparation was performed using the ForenSeq™ DNA Signature Prep Kit (Verogen, San Diego, CA, USA) according to the manufacturer's recommendations [24]. Here, one nanogram of purified DNA (5 μL input volume at a concentration of 0.2 ng/μL) was amplified using DNA Primer Mix A. MPS was performed on the MiSeq FGx® Forensic Genomics System (Verogen) using the MiSeq FGx® Reagent Kit following the manufacturer's instructions [25] with two modifications: (i) 52 libraries were pooled for a run, including 50 samples, a positive control, and a negative control; (ii) 7 μL of pooled libraries was added to 591 μL of Hybridization Buffer (HT1) and then mixed with 4 μL of Human Sequencing Control (HSC) mixture, rather than the 2 μL of HSC used in the previous study [20]. MPS raw data were processed by the ForenSeq™ Universal Analysis Software (UAS) v1.3 (Verogen) at default analysis thresholds [26].

Sanger sequencing
Samples were amplified using the primers listed in Supplementary Table S1. Amplicons were directly sequenced after purification or cloned using the TOPO® TA Cloning® Kit (Thermo Fisher Scientific) according to the manufacturer's recommendations. Sanger sequencing was performed on the Applied Biosystems® 3130xl Genetic Analyzer (Thermo Fisher Scientific) using the BigDye® Terminator v3.1 Cycle Sequencing Kit (Thermo Fisher Scientific) following the manufacturer's instructions. Raw data were analyzed using the Sequencing Analysis Software v5.3.1 (Thermo Fisher Scientific).

Statistical analysis
Allele and haplotype frequencies were estimated and shared haplotypes were screened from father samples by direct counting. The multi-copy markers (DYF387S1a/b, DYF404S1a/b, DYS385a/b, and DYS527a/b) were treated as allelic combinations. The fraction of unique haplotypes ($F_{\mathrm{UH}}$) was calculated as $F_{\mathrm{UH}} = N_{\mathrm{unique}} / N_{\mathrm{diff}}$, where $N_{\mathrm{unique}}$ is the number of unique haplotypes and $N_{\mathrm{diff}}$ is the number of different haplotypes. Haplotype diversity (HD) or gene diversity (GD) was determined as $\mathrm{HD\ or\ GD} = \frac{n}{n-1}\left(1 - \sum_{i} p_i^{2}\right)$, where $n$ is the population size and $p_i$ is the frequency of the $i$th haplotype or allele [27]. Discrimination capacity (DC) was determined as $\mathrm{DC} = N_{\mathrm{diff}} / n$, where $N_{\mathrm{diff}}$ is the number of different haplotypes and $n$ is the population size.
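The counting-based parameters defined above can be illustrated with a short Python sketch. It is not the StatsY implementation; the function name `forensic_parameters` and the toy haplotypes are hypothetical and serve only to show how F_UH, HD, and DC follow from haplotype counts.

```python
from collections import Counter


def forensic_parameters(haplotypes):
    """Compute N_diff, N_unique, F_UH, HD and DC from a list of haplotypes.

    Each haplotype must be hashable (e.g. a tuple of allele calls).
    """
    n = len(haplotypes)                      # population (sample) size
    counts = Counter(haplotypes)             # occurrences of each haplotype
    n_diff = len(counts)                     # number of different haplotypes
    n_unique = sum(1 for c in counts.values() if c == 1)  # seen exactly once
    sum_p2 = sum((c / n) ** 2 for c in counts.values())   # sum of p_i^2
    return {
        "N_diff": n_diff,
        "N_unique": n_unique,
        "F_UH": n_unique / n_diff,
        "HD": n / (n - 1) * (1.0 - sum_p2),
        "DC": n_diff / n,
    }


# Toy data (hypothetical): four fathers typed at two loci.
toy = [("14", "13"), ("14", "13"), ("15", "12"), ("16", "11")]
params = forensic_parameters(toy)
print(params)
```

The same function yields gene diversity at a single locus when the input list holds the alleles (or allelic combinations for multi-copy markers) instead of full haplotypes.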
The number of mutations, the numbers of one-, two-, and multi-step mutations, and the numbers of gains and losses were obtained by direct counting. Mutation rates were calculated as the number of mutations between father–son pairs divided by the number of allele transmissions at each Y-STR marker. Confidence intervals (CI) of the mutation rates were estimated using the exact binomial probability distribution. Here, the allele of DYS389II and the allele of DYS389II minus the value of DYS389I were each calculated; the number of allele transmissions for the multi-copy markers (DYF387S1a/b, DYF404S1a/b, DYS385a/b, and DYS527a/b) was counted as twice the number of meioses.
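The rate and interval calculation described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the StatsY code; it assumes that the "exact binomial" interval is the Clopper–Pearson interval, and the function names are hypothetical.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if k < 0:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)


def _bisect(f, lo=0.0, hi=1.0, iters=100):
    """Root of an increasing function f on (lo, hi) by bisection."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def mutation_rate_ci(mutations, transmissions, alpha=0.05):
    """Return (rate, lower, upper) with a Clopper-Pearson exact interval."""
    k, n = mutations, transmissions
    rate = k / n
    # Lower bound: P(X >= k | p) = alpha / 2 (zero when no mutation is seen).
    lower = 0.0 if k == 0 else _bisect(
        lambda p: (1.0 - binom_cdf(k - 1, n, p)) - alpha / 2.0)
    # Upper bound: P(X <= k | p) = alpha / 2 (one when every transfer mutates).
    upper = 1.0 if k == n else _bisect(
        lambda p: alpha / 2.0 - binom_cdf(k, n, p))
    return rate, lower, upper


# Pooled counts reported in the abstract: 356 mutations in 104 468 transfers.
rate, lower, upper = mutation_rate_ci(356, 104468)
print(f"rate = {rate:.2e}, 95% CI = {lower:.2e} to {upper:.2e}")
```

With the pooled counts, the sketch gives a rate of about 3.4 × 10⁻³ with bounds close to the reported 3.1 × 10⁻³ and 3.8 × 10⁻³. For a multi-copy marker, `transmissions` would be twice the number of meioses, as stated above.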
All allele and haplotype frequencies, forensic parameters, and mutation rates were automatically calculated using StatsY v1.0, which is freely available at https://www.researchgate.net/publication/359440373_StatsY_v10.
Chi-square tests, Fisher's exact tests, two-sample t-tests, and linear regressions were computed using R software version 4.0.5 [28], and figures were generated with the R packages "ggplot2" and "vcd".

Allele frequencies and GD values
Allele frequencies and GD values at each locus from the 2 548 father samples are listed in Supplementary Table S2A. A total of 358 alleles were observed at the 41 Y-STRs, and allele frequencies ranged from 0.0002 to 0.9447 (allele 8 at DYS645). The number of alleles at each locus ranged from 5 at DYS593 and DYS645 to 18 at DYS385a/b. Six intermediate alleles were observed: allele 19.2 at DYS448 and allele 30.2 at DYS449 have already been documented in the Y Chromosome Haplotype Reference Database (YHRD) [29]; alleles 12.2, 13.2, and 15.2 at DYF404S1a/b and allele 23.2 at DYS527a/b are not included in the YHRD and have not been reported elsewhere in the literature to date. Supplementary Table S2B shows details of the allelic combinations at multi-copy loci, where DYS385a/b was the most informative locus (85 allelic combinations from 18 separate alleles) and DYF404S1a/b was the least informative (37 allelic combinations from 11 separate alleles).
The HD value was approximately 0.999 956 608 and the DC was 0.9674. Table 1 summarizes the statistics of shared haplotypes and the forensic parameters for seven combined Y-STR systems. In general, more unique haplotypes and higher forensic parameters were observed as the number of Y-STRs increased. Compared with the AmpFℓSTR Yfiler Kit, N_diff, F_UH, HD, and DC increased by 10.00%, 7.65%, 0.01%, and 10.98%, respectively. The increases were mild or very slow when compared with systems of 25–29 Y-STRs. These improvements in haplotype resolution and forensic parameters are similar to those reported for Southeastern Han Chinese [31] and Shandong Han Chinese [32].
As shown in Figure 1 and Table 3, six Y-STRs presented flanking region (FR) variants, including four SNPs at DYS19, DYS437, DYS448, and DYS481 as well as two deletions at DYS439 and DYS385a/b, which together accounted for 2.71% of the polymorphisms. Two SNPs (rs2016239814 at DYS19 and rs2089968964 at DYS448) and one 4-bp deletion (rs2053269960 at DYS439) were newly found in this study, validated by the Database of Short Genetic Variation (dbSNP), and released in Build 155. One SNP (rs1603182940 at DYS437) and one 4-bp deletion (rs1248860842 at DYS385b) were first submitted to the dbSNP from a Korean population (ss3943499107) and a Swedish population (ss3020959847), respectively, but both are reported here for the first time in a Han Chinese population.
Although RR and FR variants existed, alleles carrying these variants were called accurately and in a backward-compatible manner by the ForenSeq UAS. Further, MPS typing increased the average GD value (0.7071) compared with that obtained by CE typing (0.6733), as shown in Supplementary Table S4. DYS437 showed the largest increment in GD (42.19%), as also reported by Kwon et al. [18]. DYF387S1a/b achieved the highest sequence-based GD value. However, the other sequence-based forensic parameters (F_UH = 1.0000, HD = 1.0000 and DC = 1.0000) remained the same as the length-based ones because all 308 haplotypes were unique with both methods.

Mutation rates
The mutation rates at the 41 Y-STR loci are shown in Table 4.

Mutation patterns
Mutation patterns were investigated by examining the correlation between mutation rates and the following molecular factors: (i) length of the STR region (i.e. allele repeat number); (ii) complexity of the repeat motif (i.e. simple, compound, and complex repeats); (iii) length of the repeat unit; (iv) age of the father at gametogenesis (i.e. the birth year of the father subtracted from that of the son).

Firstly, Figure 2A shows a statistically significant linear relationship (P = 1.76 × 10⁻⁴) between the average allele repeat number and the mutation rate (Supplementary Table S6), i.e. the mutation rate increased with the allele repeat number (slope = 0.2284). Of the variance in the mutation rate, 33.48% might be explained by the allele repeat number, and two outliers with remarkably high mutation rates were observed, namely the RM Y-STRs DYS576 and DYS627 (Section Mutation rates). If the nine initially defined RM Y-STR loci (DYF387S1a/b, DYF404S1a/b, DYS449, DYS518, DYS570, DYS576, and DYS627) were removed, 70.14% of the variance in the mutation rate would be explained by the allele repeat number. Following Ge et al. [46], the allele repeat numbers were classified, based on the allele frequencies in Supplementary Table S2, into 25%, 50%, and 25% categories for short, moderate, and long allele sizes, respectively. Supplementary Table S7 shows that the mutation rate of long alleles (10.1 × 10⁻³) was significantly greater (P = 2.35 × 10⁻¹², Fisher's exact test) than that of moderate (3.2 × 10⁻³) and short (1.6 × 10⁻³) alleles, which demonstrates that mutations were more common at longer alleles than at shorter ones and further supports the above relationship between the allele repeat number and the mutation rate. In addition, Figure 2B displays the relationship between the allele size and the direction of mutation. The ratio of repeat gains to losses in short, moderate, and long alleles demonstrated that longer alleles tended towards repeat losses whilst shorter alleles tended towards repeat gains.
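To make the link between the slope, R², and the "variance explained" statement concrete, the following sketch fits an ordinary least-squares line in plain Python. The per-locus values are hypothetical toy numbers, not data from Supplementary Table S6, and the function name `ols_fit` is illustrative only; the study itself used R for its regressions.

```python
def ols_fit(x, y):
    """Ordinary least-squares fit y = intercept + slope * x.

    Returns (slope, intercept, r_squared), where r_squared is the
    fraction of the variance in y explained by x.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical toy data: average repeat number and mutation rate (x 10^-3).
avg_repeats = [10, 12, 14, 16, 18]
rates = [1.0, 2.1, 2.9, 4.2, 4.8]
slope, intercept, r2 = ols_fit(avg_repeats, rates)
print(f"slope = {slope:.4f}, R^2 = {r2:.4f}")
```

An R² of 0.3348, as reported for Figure 2A, corresponds to 33.48% of the variance in the mutation rate being explained by the allele repeat number.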
Secondly, Y-STR markers can be divided into simple, compound, and complex repeats based on their repeat motifs (Supplementary Table S6). Generally speaking, simple repeats contain repeat units of identical length and sequence, compound repeats comprise two or more adjacent simple repeats that commonly differ by one nucleotide, and complex repeats contain multiple repeats with variable unit lengths and/or intervening sequences [47,48]. Statistically significant differences in mutation rates were observed between simple and compound repeats (χ² = 5.40, P = 0.0201), between compound and complex repeats (χ² = 5.02, P = 0.0250), and between simple and complex repeats (χ² = 24.77, P = 2.35 × 10⁻¹²), with complex repeats showing a higher average mutation rate (5.2 × 10⁻³) than compound (3.7 × 10⁻³) and simple ones (2.7 × 10⁻³). Moreover, Figure 2C shows that mutational events in compound and complex repeats depended on the location of the repeat units. A majority (95.79%) of mutational events occurred at the larger variable repeat units in compound repeats. Extremely rare mutations were observed at both larger and smaller variable repeat units in complex repeats (e.g. DYF387S1a/b). A statistically significant difference in the proportion of mutations occurring at larger versus smaller units was observed between compound and complex repeats (χ² = 86.78, P = 2.20 × 10⁻¹⁶).
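As a worked check of how a χ² statistic maps to a P value, the sketch below converts two of the statistics quoted above, assuming each pairwise comparison is a 2 × 2 table (motif class by mutated/not mutated) with one degree of freedom. That assumption is ours and is not stated in the text; the function name is hypothetical.

```python
import math


def chi2_pvalue_1df(chi2_stat):
    """Upper-tail P value of a chi-square statistic with 1 degree of freedom.

    For 1 df, chi-square is the square of a standard normal variable,
    so P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi2_stat / 2.0))


comparisons = [("simple vs compound", 5.40), ("compound vs complex", 5.02)]
for label, stat in comparisons:
    print(f"{label}: chi2 = {stat:.2f}, P = {chi2_pvalue_1df(stat):.4f}")
```

Under this assumption, the two statistics give P values of roughly 0.020 and 0.025, in line with the values quoted for these comparisons.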
Thirdly, the average mutation rate was 1.8 × 10⁻³ for the three trinucleotide Y-STRs, 4.1 × 10⁻³ for the 31 tetranucleotide Y-STRs, 1.5 × 10⁻³ for the five pentanucleotide Y-STRs and 0.4 × 10⁻³ for the two hexanucleotide Y-STRs. Figure 2D shows that no obvious linear correlation was found between the length of the repeat unit and the mutation rate (P = 0.1426), but the differences in average mutation rates amongst the repeat unit lengths were statistically significant (χ² = 43.02, P = 2.43 × 10⁻⁹). Unlike the results of Ballantyne et al. [34] but like those of Claerhout et al. [47], the sequence-based mutation rates did not conform to the convention that STR markers with shorter repeat units have higher mutation rates. In our study, as in that of Claerhout et al. [47], the average mutation rate for trinucleotide Y-STRs was significantly lower than that for tetranucleotide ones (χ² = 8.45, P = 0.0037). If the trinucleotide Y-STRs are removed, there is a small but statistically significant linear decrease in the mutation rate as the repeat unit length increases (slope = −2.1243, R² = 0.1603, P = 0.0190).
Lastly, the average age of the father at gametogenesis was 26.29 ± 4.21 years and the median was 25 years, which is concordant with the generation time (25 years) usually assumed when estimating the time to the most recent common ancestor in evolutionary studies [47,49]. When the ages of fathers with mutations (26.75 ± 4.10 years) and without mutations (26.23 ± 4.12 years) were compared, a statistically significant difference was observed (P = 0.0383, Welch two-sample t-test). Figure 2E shows a statistically significant positive linear correlation (slope = 0.3861) between the mutation rate and the age of the father (R² = 0.3048, P = 0.0077), which was also observed by Ballantyne et al. [34] and Claerhout et al. [47]. Further, ages were divided into 10-year intervals because the male generation time most frequently ranges between 20 and 30 years [50]. The number of mutations increased significantly with the age intervals (χ² = 12.93, P = 0.0048), with the average mutation rate ranging from 0.5 × 10⁻³ in the <20-year interval to 8.7 × 10⁻³ in the >40-year interval (Figure 2F).
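The Welch comparison of paternal ages can be sketched from the summary statistics alone. The group sizes used here (308 fathers with mutations and 2 240 without) are an assumption derived from the 308 mutational pairs among 2 548; because the means and standard deviations are rounded, the sketch is not expected to reproduce P = 0.0383 exactly, and the function name is hypothetical.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1 = sd1 ** 2 / n1          # squared standard error of group 1
    v2 = sd2 ** 2 / n2          # squared standard error of group 2
    t_stat = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, df


# Summary statistics from the text; group sizes are assumed (see lead-in).
t_stat, df = welch_t(26.75, 4.10, 308, 26.23, 4.12, 2240)
print(f"t = {t_stat:.3f}, df = {df:.1f}")
```

A t statistic slightly above 2 with several hundred degrees of freedom corresponds to a two-sided P value a little below 0.05, consistent with the reported significance.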

StatsY
StatsY v1.0 runs on Windows® 7–10 operating systems with a 64-bit processor architecture and offers Chinese and English display languages. The code is divided into three main modules: a "quality control" module that screens the format and content of an input file and provides QC information; a "haplotype" module that searches for shared haplotypes and calculates allele/haplotype frequencies and forensic parameters; and a "mutation" module that counts mutational events, directions, and steps at each locus for each father–son pair, estimates mutation rates and their binomial 95% CIs in a dataset, and displays the bar graph. All calculation methods and formulae are provided in Section Statistical analysis.
StatsY v1.0 only accepts a Microsoft Excel workbook (*.xlsx) as the input file. In the "haplotype" module, the first column must provide the user-defined sample identifier (e.g. NHC0001), so that a sample with missing data can be traced, and the first row must provide the locus names (e.g. DYS19). In the "mutation" module, the first column must provide the same pedigree identifier for each pair of rows (e.g. one P00001 for the father and the other P00001 for the son); the second column must designate "P" for the father and "O" for the son; and the first row must again provide the locus names. The number of loci to be calculated is unlimited. Allele data are entered with one column per locus, but alleles at multi-copy markers are treated as genotypes and separated by a comma (e.g. DYS385a/b: 13,13 for a homozygote and 13,20 for a heterozygote). Allele repeat numbers must be positive, and intermediate alleles can include one decimal place (e.g. 23.2). Missing data must be left as an empty cell. An allele entered as a non-positive number or as text (e.g. −1 or 0; NA or 23.x), or missing data, will be automatically ignored in the calculations. For convenience, two input file templates ("Template for Haplotype.xlsx" and "Template for Mutation.xlsx") are provided in the installation directory.
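The cell-level rules above (positive values, at most one decimal place, comma-separated genotypes at multi-copy markers, everything else ignored) can be expressed as a small validator. This is a hypothetical sketch of such a check, not the StatsY source code; the function name and the return convention (`None` for ignored cells) are our own.

```python
import re

# One allele: digits with an optional single decimal place (e.g. 13 or 23.2).
_ALLELE = re.compile(r"^\d+(\.\d)?$")


def parse_allele_cell(cell):
    """Parse one worksheet cell into a list of allele values.

    Returns None when the cell is empty or holds an invalid entry,
    meaning it would be ignored in downstream calculations.
    """
    if cell is None:
        return None
    text = str(cell).strip()
    if not text:
        return None                      # missing data: empty cell
    alleles = []
    for part in text.split(","):         # multi-copy markers: "13,20"
        part = part.strip()
        if not _ALLELE.match(part):
            return None                  # text, signs, or malformed decimals
        value = float(part)
        if value <= 0:
            return None                  # allele repeats must be positive
        alleles.append(value)
    return alleles


for raw in ["13,20", "23.2", "0", "NA", "23.x", ""]:
    print(repr(raw), "->", parse_allele_cell(raw))
```

Returning `None` rather than raising an error mirrors the described behaviour of skipping invalid or missing entries while the remaining loci are still processed.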
The user interface of StatsY v1.0 is shown in Figure 3. After an input file for haplotypes or mutations is imported, StatsY v1.0 returns a data table on the left panel, which can be inspected and exported as an Excel workbook, and displays QC messages on the right panel to indicate samples or pedigrees containing invalid or missing data that will be automatically ignored in subsequent calculations. In the "Haplotype" sheet, allele frequencies for all loci and for the multi-copy loci included in the input file, haplotype frequencies, and the corresponding forensic parameters can be checked on the left panel and exported as Excel workbooks after clicking the "Calculate Allele Frequency" and "Calculate Haplotype Frequency" buttons on the right panel. In the "Mutation" sheet, mutation counts (events, directions, and steps) and mutation rates can likewise be checked and exported after clicking the "Calculate Mutation Rate" button. Additionally, once the "Display" button on the right panel is clicked, allele frequencies based on the father samples in father–son pairs, together with mutation events and directions (pink for gain and purple for loss) shown directly above the corresponding alleles, can be plotted by locus to inspect hotspot distributions and saved as a Bitmap (BMP) file for archiving. Documentation for StatsY v1.0 is provided in the installation directory.
MH: minimal haplotype; SWGDAM: Scientific Working Group on DNA Analysis Methods; YHRD Ymax: the set of all markers available at the Y Chromosome Haplotype Reference Database; N_L: number of loci in the combined system; N_diff: number of different haplotypes; HD: haplotype diversity; F_UH: fraction of unique haplotypes; DC: discrimination capacity; Y-STR: Y chromosomal short tandem repeat.

Figure 1. Comparison of 24 overlapping Y chromosomal short tandem repeats between the ForenSeq™ deoxyribonucleic acid (DNA) Signature Prep Kit and the Goldeneye® DNA ID Y Plus Kit in 308 father samples from mutational pairs. (A) Number of sequence-based alleles versus length-based alleles at each locus. Here, sequence-based alleles obtained by massively parallel sequencing typing are composed of length-based alleles that can be generated by capillary electrophoresis typing, repeat region variant alleles and flanking region variant alleles. (B) Gene diversity of sequence-based alleles versus length-based alleles at each locus.

Figure 2. Sequence-based mutation patterns at Y chromosomal short tandem repeats (Y-STRs). (A) Positive linear relationship between the mutation rate and the average allele repeat number (slope = 0.2284; R² = 0.3348; P = 1.76 × 10⁻⁴). The initially defined rapidly mutating Y-STR markers (DYF387S1a/b, DYF404S1a/b, DYS449, DYS518, DYS570, DYS576, and DYS627) are annotated. (B) Correlation between the allele size (short, moderate, and long) and the direction of mutation (gain and loss). The average mutation rate of long alleles is significantly greater than that of moderate and short alleles (P = 2.35 × 10⁻¹²). Longer alleles tend towards repeat losses whilst shorter alleles tend towards repeat gains. (C) Average mutation rates of different repeat motif types (simple, compound, and complex). Significant differences are observed between simple and compound (P = 0.0201), simple and complex (P = 2.35 × 10⁻¹²), and compound and complex repeats (P = 0.0250), as well as in the proportion of mutations occurring at larger versus smaller units between compound and complex repeats (P = 2.20 × 10⁻¹⁶). (D) Average mutation rates of different repeat unit lengths. A small but significant negative linear relationship is observed when the repeat unit is ≥4 bp (slope = −2.1243; R² = 0.1603; P = 0.0190). (E) Positive linear relationship between the mutation rate and the age of the father at gametogenesis (slope = 0.3861; R² = 0.3048; P = 0.0077). (F) Average mutation rates in different paternal age intervals. The number of mutations increases significantly with the age intervals (P = 0.0048).

Table 1. Haplotype distributions and forensic parameters for different combined Y-STR systems (n = 2 548).

Table 2. Novel repeat region variant alleles found in this study (N = 308 from father samples).

Table 3. Flanking region (FR) variant alleles observed in this study (N = 308 from father samples).
The bold sequences represent repeat regions; the underlined bases in grey represent the mutations; the hyphens in sequences represent deletions.
a Newly found SNPs in this study have been validated by the dbSNP and released in Build 155.

Table 4. Estimated mutation rates at 41 Y chromosomal short tandem repeats in Northern Han Chinese.