Construction of the third-generation Zea mays haplotype map

Abstract Background Characterization of genetic variations in maize has been challenging, mainly due to deterioration of collinearity between individual genomes in the species. An international consortium of maize research groups combined resources to develop the maize haplotype version 3 (HapMap 3), built from whole-genome sequencing data from 1218 maize lines, covering predomestication and domesticated Zea mays varieties across the world. Results A new computational pipeline was set up to process more than 12 trillion bp of sequencing data, and a set of population genetics filters was applied to identify more than 83 million variant sites. Conclusions We identified polymorphisms in regions where collinearity is largely preserved in the maize species. However, the fact that the B73 genome used as the reference only represents a fraction of all haplotypes is still an important limiting factor.

Background Characterization of genetic variations in maize has been challenging, mainly due to deterioration of collinearity between individual genomes in the species. An international consortium of maize research groups combined resources to develop the maize haplotype version 3 (HapMap 3), built from whole genome sequencing data from 1,218 maize lines, covering pre-domestication and domesticated Zea mays varieties across the world. Results A new computational pipeline was set up to process over 12 trillion bp of sequencing data, and a set of population genetics filters were applied to identify over 83 million variant sites.

Conclusions
We identified polymorphisms in regions where collinearity is largely preserved in the maize species. However, the fact that the B73 genome used as the reference only represents a fraction of all haplotypes is still an important limiting factor.
information to allow them to be uniquely identified, should be included in the Methods section. Authors are strongly encouraged to cite Research Resource Identifiers (RRIDs) for antibodies, model organisms and tools, where possible.
Have you included the information requested as detailed in our Minimum Standards Reporting Checklist?
Availability of data and materials All datasets and code on which the conclusions of the paper rely must be either included in your submission or deposited in publicly available repositories (where available and ethically appropriate), referencing such data using a unique identifier in the references and in the "Availability of Data and Materials" section of your manuscript.
Have you have met the above requirement as detailed in our Minimum Standards Reporting Checklist? Yes 3 characterize genetic variations in maize at genomic scale. The previously released maize HapMap2 were constructed with whole genome sequencing data of 104 maize lines across pre-domestication and domesticated Zea mays varieties [1]. Since then, more maize lines have been sequenced by the international research community, and a consortium was formed to develop the next generation haplotype map. The maize HapMap 3 consortium includes, among others, BGI-Shenzen, Chinese Academy of Agricultural Sciences, China Agricultural University (CAU), International Maize and Wheat Improvement Center (CIMMYT). High-coverage data for 31 European and US Flint and Dent lines is also available in Ref. [2]. Altogether, in this work we used a total of 1218 maize lines sequenced with depth varying from below 1x to 59x.
A common approach in today's genetic diversity projects is to map the shotgun sequencing reads from each individual onto a common reference genome to identify DNA sequence variations, and the physical positions of the reference genome is used as a coordinate system for the polymorphic sites. A good example is the human 1000 genome project [3]. The computational data processing pipeline developed for the human project, GATK, has been widely adopted for identifying genetic variations in many other species [4].
As the sequencing technology is improved and sequencers' base calling error model gets more accurate, the computational challenges in genotyping by short-read sequencing have shifted from modeling sequencer machine artifacts errors to resolving genotyping errors derived from incorrect mapping of short reads to the reference genome. The problem is associated with the experimental design that uses the single-reference genome as coordinate system. Taking maize as an example, the reference being used is a 2.1 Gb assembly from an elite inbred line B73 that represents 91% of the B73 genome [5], and was estimated to capture only ~70% of the low-copy gene fraction of all inbred lines [6]. The sequence alignment software, however, can map 95-98% of the whole genome sequencing reads to the reference.
That suggests a high percentage of the reads were mapped incorrectly, either being mapped to the 4 paralogous loci or highly repetitive regions under-represented in the reference assembly. The genetic variants called from the miss-mapped reads need to be corrected computationally. The maize HapMap2 relied on linkage disequilibrium in the population to purge most of the bad markers caused by alignment errors. To construct maize HapMap 3, a new computational pipeline was developed from scratch to handle the sequencing data from 10 times more lines, and also took advantage of the high quality genetic map constructed from the GBS technology [7,8] which was not present when HapMap2 was constructed.

DATA DESCRIPTION
The sequencing data used in this work is comprised of 12,497 billion base pairs in a total of 113,702 billion Illumina paired-end reads, originating from 1,218 maize and teosinte lines. The data was collected from several sources over several years, and varies in quality, read length, and coverage. Basic information about various datasets and stages of the HapMap 3 project they were used in are summarized in Table 1.
Each of the 1,218 lines were sequenced at depth varying from below 1x to 59x, using reads of lengths ranging from 44 through 201 bp, averaging 110 bp. All reads were aligned to maize reference genome B73 5 version AGP v3 using BWA mem aligner [9]. Overall, 95-98% of the reads were mapped to the reference genome, although only about 50-60% with non-zero mapping quality. Taxa from sets "HapMap2", "HapMap2 extra", and "CAU" partially overlap. The "282" libraries, sequenced twice represent 271 taxa. A "+" means that the dataset was used in a given stage, "-" that it was not.
All sequence data used in this work is publically available. Collection and publishing of this data does not violate any local or international legislation or guidelines.

ANALYSIS Initial variant discovery
The HapMap 3 pipeline is summarized in Figure 1. First, polymorphic sites were called for a set of 916 taxa from datasets HapMap2 through CIMMYT/BGI (7,191 billion base pairs, 74,643 million reads). In the first step, a custom built software tool was used to determine genotypes for each taxon at each site of the genome based on allelic depths at that site. Bases counted towards depth had base quality score of at least 10 and originated from reads with mapping quality (MAPQ= − (10 ), where -calculated by the BWA mem aligner -is the probability of the reported read location being wrong) at or above 30.
This cut-off was chosen at mid-point between the highest MAPQ value reported by the aligner, 7 The HapMap 3.1.1 procedure involved checking for linkage disequilibrium of each site against the GBS anchor map [7,8], which consists of markers located in hypo-methylated and genetically stable regions.
Sites giving only very weak or only nonlocal (i.e., outside of 1 Mb radius) linkage Disequilibrium (LD) hits were eliminated, which resulted in the final set of 61,228,639 polymorphisms. For slightly less than 40% of these sites, LD could not be conclusively calculated due to small minor allele frequencies (MAF), whereas the remaining sites, confirmed to be in local LD with the GBS anchor, have been marked with flag "LLD". Among the sites surviving all filtering steps, 8.7 million are indels or are located near (within 5 bp) of an indel. These have been marked with the flag "NI5". Since a procedure to achieve consistent alignment across all reads covering the same indels -local realignment -is not computationally feasible at this scale and has not been performed, genotyping errors could occur, and, consequently, most such sites are tentative and should be treated with caution. Figure 2 shows overlaps between various classes of variants of HapMap 3.1.1. First, we notice a rather small overlap between sites in confirmed local LD ("LLD" flag) and those marked "IBD1". This is understandable, as the IBD1 sites represent mostly low MAF cases, where LD assessment could not be done. Indels and vicinity (labeled "NI5") constitute about 15% of sites in each of the LLD, IBD1, and the union of LLD and IBD1 sets. Only a very small fraction of sites does not carry LLD or IBD1 flag, i.e., they are strongly confirmed by the IBD filter, but could not be classified with LD. The subset of 29.8 million sites in local LD and away from indels should be considered the most reliable.
To check the sensitivity of the obtained variant set to the mapping quality threshold imposed on the reads counted towards allelic depths, we repeated the pipeline using the mapping quality threshold equal to 1.
Comparison of the variant set obtained this way (referred to as q1) with our recommended set (q30) is shown in Figure 3. While the overall number of variant sites is approximately independent of the mapping 9 quality threshold, the two pipelines produce significantly different sets of sites, with only 72% of all q30 sites reproduced by the q1 pipeline. Closer inspection shows that this variability is due primarily to the IBD1 sites, for which our filtering strategy was the least stringent. On the other hand, the LLD sites, confirmed to be in local LD with GBS anchor, are much more independent of the mapping quality threshold, which confirms high quality of such sites.
For a population of inbred lines considered here, insight into genotype quality may be obtained from the inbreeding coefficient, calculated here for each taxon using the VCFtools program [11] from the formula where is the observed number of homozygotes for a given taxon, is the number of sites at which the taxon was genotyped, and is the expected number of homozygotes given by = ∑ (1 − 2 ).
Summation in the latter formula runs over genotyped sites, is the minor allele frequency at site (computed from all taxa in the population with non-missing genotypes at this site), and = (1 − ).
Low values of the inbreeding coefficient, indicating high heterozygosity, are mostly due to genotyping errors. Importance of choosing a sufficiently tight mapping quality threshold for the quality of genotypes is apparent from Figure 4, where the distribution of inbreeding coefficient for chromosome 10 is shown for q1 and q30 variant sets. The lower MAPQ threshold results in a large number of miss-mapped reads being counted towards depth, producing overly heterozygous genotypes, especially for highly covered taxa (the peak below 0.8 is due mostly to CIMMYT lines with 10-15x coverage; these lines have higher heterozygosity than other lines which may also contribute to the peak) and thus shifting the curve to the left. Since most HapMap 3 taxa are inbred lines, one should expect the true distribution to be contained within peak around 0.95. In view of this, the q30 result is definitely an improvement over q1, although a longer than expected tail extending towards the value 0.8 indicates that the HapMap 3 variants may contain too many false heterozygotes.  Figure 1). On these sites, genotypes were called on the 263 taxa from the "282" panel of Ref. [10] using "282 2x" dataset, and on the 31 high-coverage (on average 17 x) "German" taxa [2], for the total of 1210 taxa. Some of the taxa present in the "282" and "German" sets carry the same names as the ones included in the 916-taxa HapMap 3.1.1 set. Since despite identical names such taxa often originate from different germplasm sources, they have been kept separate during genotyping, i.e., reads from different sources were not merged and separate genotypes were computed for each source. In the resulting VCF files, the names of the overlapping taxa have been prefixed by "282set_" and "german_". For example, in the case of B73, there are three columns representing different datasets for this taxon: "B73" (the original 916taxa set), "282set_B73" (sequence from the more recent "282" libraries), and "german_B73" (from Ref. [2]).
To further eliminate the false positives resulting from sequencing errors, an additional depth-based filter was applied to the 96.8 million sites. Referred to as ">1,>2" filter it accepts sites for which the read support of minor allele was greater than 1 in at least one taxon and greater than 2 across all taxa. Genotypes on the surviving 83,153,144 sites, referred to as "un-imputed HapMap 3.2.1", were then processed through 11 the LD KNN imputation procedure based on Ref. [12], where the "nearest neighbors" of a given line are selected based on sites in good local LD with the target site. Whenever possible, the procedure filled up missing genotypes with imputed ones, but the non-missing genotypes were left unchanged, even if imputation classified them differently. Non-imputable missing genotypes at the sites with (preimputation) MAF below 1% were assumed to be major allele homozygotes. Imputation reduces the fraction of missing genotypes from 50% to 7%. Most of the originally missing genotypes (about 85%) are imputed to major allele homozygotes. Accuracy of the genotype dataset can be assessed by comparing the original genotypes with imputed ones. As shown in Table 3

DISCUSSION
The maize genome, 2.3 GB in size [5], is smaller than the human genome. But some of its distinctive features makes it more challenging for variants identification. First, a recent whole genome duplication occurred 12 million years ago resulted in homologous segments that complicate the short read but also extraordinary structural variations within species [5,6]. In this study, the genome of the B73 maize line was used as the reference for variant calling from short sequencing reads. Structural variations between B73 and other individuals has been the major challenge for identification of true variants. In particular, short reads derived from regions missing in the reference genome could be mismapped to other paralogous regions, which lead to false positive genotypes. In the human 1000 genome project, a new HaplotypeCaller was used [4], which performs local de novo assembly to identify the most likely haplotypes for each individual and thus improve the genotyping results. However, HaplotypeCaller is computationally very expensive, and not always applicable in species like maize, where the single reference genome misses many haplotypes presents in the species and has a lot more mismapped paralogous reads that would disrupt the local assembly. To filter out these false positive variants called from the mismapped reads, we relied on the Zea GBS map [7,8], which was obtained from GBS markers located primarily in hypo-methylated chromosomal regions. GBS maps were used to identify IBD regions between the individual genomes, and 100 million markers with high percentage of mismatched genotype calls in the IBD regions were filtered out from the initial set of 196 million markers. The highly repetitive genomic regions derived from recent transposition activities are in general easier to identify, because the templates of these repeats are well represented on the reference genome, and sequencing reads mapped to these regions, flagged with low mapping quality, can be removed at the early stage of the analysis pipeline. For HapMap 3, reads with mapping quality lower than 30 were not included in the build.
In summary, besides the addition of more maize lines, the HapMap 3.2.1 release differs from the 3.1.1 release in three major aspects: 1) Improved rare allele calls: to increase the accuracy of the variants with rare allele, the HapMap 3.2.1 pipeline applied more stringent read depth thresholds instead of the population genetics based LD filter that could not be applied to sites with very low MAF; 2) The sites with high percentage of heterozygous calls were flagged in the VCF files; 3) Missing data was imputed using the LD KNN method. As summarized in Table 2, the VCF files of both datasets contain labels that flag the characteristics of each of the sites. To effectively use this resource, it is recommended to filter the sites based on the flags that are appropriate to the purpose of each project.

Plant material
Plant material used in this study was obtained mostly from maize inbred lines representing wide range of Zea mays diversity. 103 of these lines, used previously in the HapMap2 project [1], include 60 improved lines, including the parents of the maize nested association mapping (NAM) population [13], 23 maize landraces and 19 wild relatives (teosinte lines: 17 Z. mays ssp. parviglumis and 2 Z. mays ssp. mexicana).
Sequence datasets originating from these lines are referred to in Table 1  The sequence of 271 taxa from the libraries of the "282" panel [10] were added at a later stage (HapMap 3.2.1). DNA to construct these libraries was obtained from the collection that the North Central Regional Plant Introduction Station (NCRPIS) distributes all over the world. Additionally, the high-coverage data of Ref. [2], originating from 31 European and US inbreds was also included. The total number of taxa genotyped in the HapMap 3.2.1 build is 1218.
In this study, individuals with the same taxa name but contributed by different members of the consortium were kept as separate entries in genotyping pipeline -a decision prompted by comparison of genotyoes obtained from different datasets. For example, the newly sequenced CML103 is significantly more heterozygous from CML103 studied previously in the HapMap2 project. Also, the Mo17 sequence originating at CAU has been treated as taxon separate from Mo17 and CAUMo17. In most of those cases, 17 a prefix or suffix indicating the origin of the sequence data has been added to the taxa name (e.g., "282set_" or "german_", "-chin" 18 fragments, the remnants of Illumina adapter sequences were clipped using TRIMMOMATIC [14] and only one read was used and aligned as single-end. The bwa mem aligner is capable of clipping the ends of reads and splitting each read in an attempt to map its different parts to different location on the reference. As a result, typically over 95% of reads are reported as mapped. However, the fraction of reads with nonzero mapping quality (negative log of the probability that a read has been placed in a wrong location) is much lower -typically only 40-50%. Figure 7 shows a typical distribution of the mapping quality obtained from bwa mem alignment. In practice, we only used alignments with mapping quality of at least 30. A base was counted towards allele depth if its base quality score was at least 10.
It is well known that alignment may be especially ambiguous when reads contain indels with respect to the reference. In such cases, multiple-sequence realignment approaches have been proposed [4] to find the correct sequence and location of an indel and avoid spurious flanking SNPs. Since indels are not the primary focus of this work and since the realignment is computationally very expensive, it has not been performed by the HapMap 3 pipeline. Thus, although indels and SNPs in their immediate vicinity have been retained in the HapMap 3 VCF files, they are less reliable and have therefore been marked with "NI5" label for easy filtering.
Genotyping pipeline Raw genotypes were obtained using a custom-built multi-threaded java code. First, the code executes samtools mpileup command (thresholds on the base and mapping quality are imposed here) for each taxon individually, processing a certain portion of the genome. On a multi-core machine, several such pileup processes (i.e., for several taxa) can be run concurrently as separate threads. Since we are predominantly interested in calling SNPs, we use a simplified indel representation where insertions and deletions with respect to reference are treated as additional alleles "I" and "D", respectively, regardless of length and actual sequence of the indel. Read depths and average base qualities of all six alleles (A, C, 19 G, T, I, and D) are extracted from samtools mpileup output for each taxon at each genomic position and stored in an array shared between all threads. The amount of memory available on the machine along with the number of taxa determine the upper limit on the size of this array, and therefore -the maximum size of chromosome chunk which can be processed at one time. As base quality of I and D alleles we took the value corresponding to the base directly preceding the indel on the reference.
Extraction of allelic depths for all genomic positions is time consuming, which presents a major obstacle if joint genotyping needs to be re-run, for example, upon extending the taxa set. It is therefore advantageous to run the depth extraction only once for each taxon and save the obtained depths on disk to be retrieved (rather than re-calculated) during the genotyping step. This way, when the taxa set for genotyping is extended, mpileup step has to be run only for the newly added taxa. Thus, the program features an option to save allelic depths and average qualities in specially designed data structures stored in HDF5 files -one such file per taxon per chromosome. To save space, each allele depth and average quality is stored as one byte, which allows exact representation of integers from 0 to 182, while higher Once the allelic depths for all taxa and a given chunk of the genome are available in shared memory, each site is evaluated for presence of a tentative SNP. On a multi-core machine, the set of sites within the genome chunk is divided into subsets processed in parallel on different cores. Sites with less than 10 taxa with read coverage and those with only reference allele present are ignored. For all other sites, genotypes are called for all taxa using a simple likelihood model with a uniform error rate [15] assumed at 1%.
Alternative alleles are then sorted according to their allele frequencies and up to two most abundant alleles are kept, as decided by the segregation test described in the next Section. Sites for which all taxa this test is below 0.2, more exact p-value is obtained from a simulation procedure. The simulation procedure used here, implemented in Java, is the same as the one implemented in R package [16]. An alternative allele is kept if at least one contingency table involving this allele has p-value smaller or equal to 0.01. If none of the alternative alleles survive the ST filter, the site is skipped (not reported in output).
The ST filter tends to eliminate variant sites resulting from random sequencing errors.

GBS anchor map and IBD filter
Given a set of trustworthy SNPs and a diverse set of 916 taxa it is possible to identify, for an arbitrary region of the genome, a number of taxa pairs which are identical by descent (IBD) and are therefore 21 expected to have identical genotypes in this region. If known, these IBD pairs can be used as a powerful filter eliminating variant which violate IBD constraints.
To determine the IBD regions, we used the first step of our pipeline to call genotypes for our 916 taxa on the set of GBS v2.7 sites [7,8] which tend to concentrate in relatively well-conserved low-copy regions of the genome and can therefore be considered reliable. This set of 954,384 sites was filtered to include only SNP (not indel) sites for which the p-value from the segregation test was below 0.05 and which were more than 5 bp away from any indel. The set of genotypes at 475,272 sites obtained in this way, which will be referred to as GBS anchor, agree well with those from GBS on 167 taxa present in both sets. Alleles detected by the HapMap 3 pipeline agreed with those from GBS at 94% of the GBS sites. At 90% of the sites, fraction of (non-missing data) taxa with genotypes in agreement with those from GBS was at or above 85%. Genotypes different from GBS ones were observed for 82 taxa. These differences were most frequent (up to 19% of all sites) for teosinte lines.
The GBS anchor was used to compute the genetic distance (identity by state) between any two of the 916 lines in windows containing 2000 GBS sites each (about 8.5 Mbp on average). If the genetic distance within such a window was <= 0.02 (about 10 times smaller than the mean distance across all pairs), the two lines were considered to be in IBD. At least 200 comparable GBS sites (i.e., non-missing data simultaneously on both lines being compared) were assumed necessary to make the genetic distance calculation feasible.
This allowed for good distance estimate while keeping the number of detected IBD relationships large.
The number of taxa involved in IBD relationships in any given window were between 385 (start of chromosome 10) and 757 (middle of chromosome 7) and averaged 588, leading to large numbers of IBD contrasts, ranging from 3,710 (beginning of chromosome 4) to 42,890 (middle of chromosome 7), and averaging 13,500.

22
The tentative (ST-filtered) variant sites were confronted with the IBD information as follows: for each site, pairs of lines in IBD were determined as described above. Genotypes of IBD-related lines were compared and the numbers of allele matches and mismatches, summed over all IBD pairs, were counted for each allele present at the site. If the match/mismatch ratio was at least 2 for at least two alleles, or if only one allele was present in all IBD contrasts, the site is considered as passing the IBD filter. Such a filter is less powerful for sites where all bases in IBD lines are major allele homozygotes, i.e., the variant being evaluated occurs in lines not involved in IBD pairs. Formally, such a site passes our IBD filter, but the actual variant is not strongly confirmed. These uncertain sites, mostly with low minor allele frequency, are labeled "IBD1" in the HapMap 3 VCF files and constitute about 50% of all HapMap 3 sites.

Linkage Disequilibrium (LD) filter
Any true SNP should be in local linkage with other nearby SNPs. This observation is the origin of another filter used in this work, referred to as the LD filter. For each variable site surviving the ST and IBD filters, we evaluated LD with each site of the GBS anchor. As the LD measure we chose the p-value from a 2 by 2 contingency table of haplotype counts AB, Ab, aB, ab. For the purpose counting haplotypes, heterozygous genotypes were treated as homozygous in minor allele, so that each taxon only contributed at most one haplotype. This tends to somewhat strengthen the LD signal and simplify the calculation. For a pair of sites to be tested for LD, the following three conditions had to be satisfied to make the calculation meaningful: i) the two sites were at least 2,500 bp apart, ii) there were at least 40 taxa with non-missing genotypes at both sites being compared, and iii) at least 2 taxa with minor allele had to be present at each of the two sites.
Filtering procedure executed for each site is summarized in Figure 8. First, LD between the given site and all sites in GBS anchor was computed and up to 20 best LD hits (the ones with lowest p-values) were collected. If the p-value of the best hit exceeded 1E-6 (which roughly corresponds to the peak of the 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 23 overall distribution of p-values), the site was rejected. Otherwise, it was determined whether the set of best hits contained any local hits, i.e., hits to GBS sites on the same chromosome within 1 Mbp of the site in question and with the p-value smaller than 10 times the p-value of the best hit. If no such local hits were found, the site was rejected, otherwise it was kept and marked as a site in Local LD using the flag "LLD". Note that the procedure defined this way filters out sites with only non-local LD hits as well as those with only weak LD signal. Sites in local LD as well as those for which LD could not be assessed (because of low minor allele frequency or missing data) pass the filter.

Imputation
In the HapMap 3.2.1 pipeline, the ST-and IBD-filtered genotypes, after the application of the additional ">1,>2" depth-based filter, were processed through the LD KNN imputation procedure based on Ref. [12] to fill in the missing data. The procedure is a version of the "K nearest neighbors" routine where the "nearest neighbors" of a given taxon are selected based on genetic distance computed using variant sites in good local LD. Specifically, for a given target site, a list of up to 70 sites in best LD (as given by the 2 measure) with it is compiled by checking all surrounding sites within 600Kb characterized by heterozygosity lower than 3% and more than 50% taxa with non-missing genotypes. Capping this list at 70 sites leads to good compromise between distance accuracy and computation speed. Then, at the same target site, for each target taxon, up to 30 "nearest neighbor" taxa are selected, with lowest genetic distances from the target taxon. Genetic distances are computed using the set of local LD sites selected in the previous step. Taxa with more than 50% missing genotypes at LD sites, missing genotype at the target position, having distance from the current taxon larger than 0.1, or resulting in less than 10 common LD sites on which the distance can be calculated, are excluded from distance calculation process.
Genotypes of the selected nearest neighbor taxa at the target site are stored in memory along with the genetic distances from the target taxon. This information is used to compute a weight of each neighbor genotype as follows: 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63 64 65 where the summation index runs over all neighboring taxa with genotype at the target site, and is the distance of taxon from the target taxon. The genotype with the highest weight is considered the imputed genotype (of the target taxon at the target site) provided its weight is at least 10 times larger than that of the second-best candidate genotype. Otherwise the imputation is considered inconclusive and the imputed genotype is set to "unknown" (missing data), as it is in the case when no close neighbors of the current taxon could be found. If a genotype imputed to "unknown" occurs at a site where MAF<1%, it is automatically converted into major allele homozygote.
The imputation procedure is run for each genotype in the input file. However, in the output only the originally missing genotypes are updated to imputed ones, whereas all others are left unchanged, even if classified differently. On the other hand, all imputed genotypes are used during a run to collect imputation statistics. The "transition matrix" showing how many genotypes originally in a given class were imputed into other classes is an indication of the accuracy of the input genotypes. Error rates calculated from the transition data are given in Table 3.

LIST OF ABREVIATIONS
Since no local re-alignment was done, the NI5 sites are not reliable.

Results
A new computational pipeline was set up to process over 12 trillion bp of sequencing data, and a set of population genetics filters were applied to identify over 83 million variant sites.

Conclusions
We identified polymorphisms in regions where collinearity is largely preserved in the maize species.
However, the fact that the B73 genome used as the reference only represents a fraction of all haplotypes is still an important limiting factor. Center (CIMMYT). High-coverage data for 31 European and US Flint and Dent lines is also available in Ref. [2]. Altogether, in this work we used a total of 1218 maize lines sequenced with depth varying from below 1x to 59x.

KEYWORDS
A common approach in today's genetic diversity projects is to map the shotgun sequencing reads from each individual onto a common reference genome to identify DNA sequence variations, and the physical positions of the reference genome is used as a coordinate system for the polymorphic sites. A good example is the human 1000 genome project [3]. The computational data processing pipeline developed for the human project, GATK, has been widely adopted for identifying genetic variations in many other species [4].
As the sequencing technology is improved and sequencers' base calling error model gets more accurate, the computational challenges in genotyping by short-read sequencing have shifted from modeling sequencer machine artifacts errors to resolving genotyping errors derived from incorrect mapping of short reads to the reference genome. The problem is associated with the experimental design that uses the single-reference genome as coordinate system. Taking maize as an example, the reference being used is a 2.1 Gb assembly from an elite inbred line B73 that represents 91% of the B73 genome [5], and was estimated to capture only ~70% of the low-copy gene fraction of all inbred lines [6]. The sequence alignment software, however, can map 95-98% of the whole genome sequencing reads to the reference.
That suggests a high percentage of the reads were mapped incorrectly, either being mapped to the 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 4 paralogous loci or highly repetitive regions under-represented in the reference assembly. The genetic variants called from the miss-mapped reads need to be corrected computationally. The maize HapMap2 relied on linkage disequilibrium in the population to purge most of the bad markers caused by alignment errors. To construct maize HapMap 3, a new computational pipeline was developed from scratch to handle the sequencing data from 10 times more lines, and also took advantage of the high quality genetic map constructed from the GBS technology [7,8] which was not present when HapMap2 was constructed.

DATA DESCRIPTION
The sequencing data used in this work is comprised of 12,497 billion base pairs in a total of 113,702 billion Illumina paired-end reads, originating from 1,218 maize and teosinte lines. The data was collected from several sources over several years, and varies in quality, read length, and coverage. Basic information about various datasets and stages of the HapMap 3 project they were used in are summarized in Table 1.
All sequence data used in this work is publically available. Collection and publishing of this data does not violate any local or international legislation or guidelines.

ANALYSIS Initial variant discovery
The HapMap 3 pipeline is summarized in Figure 1. First, polymorphic sites were called for a set of 916 taxa from datasets HapMap2 through CIMMYT/BGI (7,191 billion base pairs, 74,643 million reads). In the first step, a custom built software tool was used to determine genotypes for each taxon at each site of the genome based on allelic depths at that site. Bases counted towards depth had base quality score of at least 10 and originated from reads with mapping quality (MAPQ= − (10 ), where -calculated by the BWA mem aligner -is the probability of the reported read location being wrong) at or above 30.
Only sites where at least 10 taxa had coverage of 1 or more were considered. Following Ref. [1], at each site the allelic read depths were subject to segregation test (ST -see METHODS section for details). For a population of inbred lines at true variant sites, one expects depths corresponding to minor and major alleles to be concentrated in roughly different subset of taxa rather than being randomly distributed. The purpose of the ST test is to find and eliminate sites for which allelic depth distribution appears random, as such randomness, indicating high heterozygosity, is likely caused by alignment and sequencing errors.
A measure of the randomness is the p-value of the ST test (the smaller the p-value, the less random the distribution). A p-value threshold of 0.01 was used in this study. This choice was somewhat arbitrary, aimed at reducing the number of tentative variant sites to manageable size before further, more stringent filters were applied. In this first, ST-based round of filtering, a total of 196 million tentative polymorphic sites were selected. In the second step, these sites were filtered using the identity by descent (IBD) information derived from about 0.5 million of high-quality polymorphisms obtained previously [8] using the Genotyping-By-Sequencing (GBS) approach [7]. These GBS variants (GBS anchor) were used to determine regions of IBD, where certain pairs of taxa are expected to have identical haplotypes. The tentative polymorphic sites violating these IBD constraints were then filtered out, leaving 96.8 million sites. At roughly half of the sites surviving this filter, minor allele was not present in taxa involved in the tested IBD relationships. At such sites (typically with low minor allele frequency), the satisfied IBD constraints do not confirm the existence of a variant. They are therefore less reliable and have been marked with "IBD1" flag in the VCF files (see Table 2 [7,8], which consists of markers located in hypo-methylated and genetically stable regions.  To check the sensitivity of the obtained variant set to the mapping quality threshold imposed on the reads counted towards allelic depths, we repeated the pipeline using the mapping quality threshold equal to 1. Comparison of the variant set obtained this way (referred to as q1) with our recommended set (q30) is shown in Figure 3. While the overall number of variant sites is approximately independent of the mapping 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 9 quality threshold, the two pipelines produce significantly different sets of sites, with only 72% of all q30 sites reproduced by the q1 pipeline. Closer inspection shows that this variability is due primarily to the IBD1 sites, for which our filtering strategy was the least stringent. On the other hand, the LLD sites, confirmed to be in local LD with GBS anchor, are much more independent of the mapping quality threshold, which confirms high quality of such sites.
For a population of inbred lines considered here, insight into genotype quality may be obtained from the inbreeding coefficient, calculated here for each taxon using the VCFtools program [11] from the formula where is the observed number of homozygotes for a given taxon, is the number of sites at which the taxon was genotyped, and is the expected number of homozygotes given by = ∑ (1 − 2 ).
Summation in the latter formula runs over genotyped sites, is the minor allele frequency at site (computed from all taxa in the population with non-missing genotypes at this site), and = (1 − ).
Low values of the inbreeding coefficient, indicating high heterozygosity, are mostly due to genotyping errors. Importance of choosing a sufficiently tight mapping quality threshold for the quality of genotypes is apparent from Figure 4, where the distribution of inbreeding coefficient for chromosome 10 is shown for q1 and q30 variant sets. The lower MAPQ threshold results in a large number of miss-mapped reads being counted towards depth, producing overly heterozygous genotypes, especially for highly covered taxa (the peak below 0.8 is due mostly to CIMMYT lines with 10-15x coverage; these lines have higher heterozygosity than other lines which may also contribute to the peak) and thus shifting the curve to the left. Since most HapMap 3 taxa are inbred lines, one should expect the true distribution to be contained within peak around 0.95. In view of this, the q30 result is definitely an improvement over q1, although a longer than expected tail extending towards the value 0.8 indicates that the HapMap 3 variants may contain too many false heterozygotes.  Figure 1). On these sites, genotypes were called on the 263 taxa from the "282" panel of Ref. [10] using "282 2x" dataset, and on the 31 high-coverage (on average 17 x) "German" taxa [2], for the total of 1210 taxa. Some of the taxa present in the "282" and "German" sets carry the same names as the ones included in the 916-taxa HapMap 3.1.1 set. Since despite identical names such taxa often originate from different germplasm sources, they have been kept separate during genotyping, i.e., reads from different sources were not merged and separate genotypes were computed for each source. In the resulting VCF files, the names of the overlapping taxa have been prefixed by "282set_" and "german_". For example, in the case of B73, there are three columns representing different datasets for this taxon: "B73" (the original 916taxa set), "282set_B73" (sequence from the more recent "282" libraries), and "german_B73" (from Ref. [2]).
To further eliminate the false positives resulting from sequencing errors, an additional depth-based filter was applied to the 96.8 million sites. Referred to as ">1,>2" filter it accepts sites for which the read support of minor allele was greater than 1 in at least one taxon and greater than 2 across all taxa. Genotypes on the surviving 83,153,144 sites, referred to as "un-imputed HapMap 3.2.1", were then processed through 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 11 the LD KNN imputation procedure based on Ref. [12], where the "nearest neighbors" of a given line are selected based on sites in good local LD with the target site. Whenever possible, the procedure filled up missing genotypes with imputed ones, but the non-missing genotypes were left unchanged, even if imputation classified them differently. Non-imputable missing genotypes at the sites with (preimputation) MAF below 1% were assumed to be major allele homozygotes. Imputation reduces the fraction of missing genotypes from 50% to 7%. Most of the originally missing genotypes (about 85%) are imputed to major allele homozygotes. Accuracy of the genotype dataset can be assessed by comparing the original genotypes with imputed ones. As shown in Table 3, 99.8% of major allele homozygotes are imputed back into the same class. While the accuracies of minor allele homozygotes and genotypes including indels are both above 90%, only 11% of heterozygotes are imputed back into the same class, while 47% of them fail imputation altogether. This reflects the inherent difficulty in calling heterozygotes.
Relationship between variant sites included in HapMap 3.1.1 and 3.2.1 is shown in Figure 5. Both pipelines start from the same set of IBD-filtered genotypes and subject them to different kinds of filtering, with that of

DISCUSSION
The maize genome, 2.3 GB in size [5], is smaller than the human genome. But some of its distinctive features makes it more challenging for variants identification. First, a recent whole genome duplication occurred 12 million years ago resulted in homologous segments that complicate the short read  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 13 alignments; second, the rampant activities of transposable elements within last 1-5 million year not only resulted in accumulation of large amount of relatively young repetitive elements in the intergenic regions, but also extraordinary structural variations within species [5,6]. In this study, the genome of the B73 maize line was used as the reference for variant calling from short sequencing reads. Structural variations between B73 and other individuals has been the major challenge for identification of true variants. In particular, short reads derived from regions missing in the reference genome could be mismapped to other paralogous regions, which lead to false positive genotypes. In the human 1000 genome project, a new HaplotypeCaller was used [4], which performs local de novo assembly to identify the most likely haplotypes for each individual and thus improve the genotyping results. However, HaplotypeCaller is computationally very expensive, and not always applicable in species like maize, where the single reference genome misses many haplotypes presents in the species and has a lot more mismapped paralogous reads that would disrupt the local assembly. To filter out these false positive variants called from the mismapped reads, we relied on the Zea GBS map [7,8], which was obtained from GBS markers located primarily in hypo-methylated chromosomal regions. GBS maps were used to identify IBD regions between the individual genomes, and 100 million markers with high percentage of mismatched genotype calls in the IBD regions were filtered out from the initial set of 196 million markers. The highly repetitive genomic regions derived from recent transposition activities are in general easier to identify, because the templates of these repeats are well represented on the reference genome, and sequencing reads mapped to these regions, flagged with low mapping quality, can be removed at the early stage of the analysis pipeline. For HapMap 3, reads with mapping quality lower than 30 were not included in the build.
One of the goals of HapMap 3.1.1 is to identify genetic markers in regions where collinearity is preserved in majority of maize lines. The LD filter in the pipeline was applied for this purpose. To do this, we genetically mapped the presence/absence of the minor alleles using the GBS genetic map, and these mapped genetic positions were compared to the physical positions on the B73 reference. Among the 96. 8   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 14 million sites surviving the IBD filter, 25% did not have enough non-missing data or sufficient minor allele frequency for genetic mapping to be meaningful. For 38% of sites, at least one genetically mapped position matching the physical positons on B73 reference was found, 24% have no significant hits from genetic mapping, probably due to no consensus Local LD filter based on a large, diverse population may be too stringent, as some markers, good within certain sub-populations, may be thrown out. Therefore, the LD filter was not used in the HapMap 3.2.1 release, which contains a total of 83 million variant sites, subject only to ST and IBD filters and an additional depth-based filter aimed to improve reliability of rare allele calls. Although those sites are likely to have higher misalignment rates, they are still likely to capture a true association with phenotypes.

15
In the un-imputed HapMap 3.2.1, at about 10% of all variant sites, fraction of heterozygous taxa exceeds 3%. Such sites are marked "DUP", as most likely originating from duplication misalignments. Figure 6 shows the distribution of fraction of heterozygous sites per taxon for different versions of HapMap 3.2.1 release. While for the un-imputed genotypes the distribution peaks slightly below 1%, imputation significantly shifts the peak to the left, down to about 0.5%. This is a consequence of most missing genotypes being imputed to homozygotes. Interestingly, considering only sites in good local LD (marked with the "LLD" flag) leads to distributions (both imputed and un-imputed) shifted towards higher heterozygosities. This is understandable, as the LLD sites are typically those with higher minor allele frequencies, where the chance of encountering a heterozygote is higher.
In summary, besides the addition of more maize lines, the HapMap 3.2.1 release differs from the 3.1.1 release in three major aspects: 1) Improved rare allele calls: to increase the accuracy of the variants with rare allele, the HapMap 3.2.1 pipeline applied more stringent read depth thresholds instead of the population genetics based LD filter that could not be applied to sites with very low MAF; 2) The sites with high percentage of heterozygous calls were flagged in the VCF files; 3) Missing data was imputed using the LD KNN method. As summarized in Table 2, the VCF files of both datasets contain labels that flag the characteristics of each of the sites. To effectively use this resource, it is recommended to filter the sites based on the flags that are appropriate to the purpose of each project.
When constructing the maize HapMap 3, the most serious problems we were facing can be attributed to the use of a genome from a single individual (B73) as a reference for other, often very different species. This is becoming the single limiting factor in the study of maize diversity, as well as breeding practice. The only remedy is to move away from a single genome-based reference coordinate and adopt a pan-genome based reference system that incorporates all major haplotypes of the species.
17 a prefix or suffix indicating the origin of the sequence data has been added to the taxa name (e.g., "282set_" or "german_", "-chin"  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 18 fragments, the remnants of Illumina adapter sequences were clipped using TRIMMOMATIC [14] and only one read was used and aligned as single-end. The bwa mem aligner is capable of clipping the ends of reads and splitting each read in an attempt to map its different parts to different location on the reference. As a result, typically over 95% of reads are reported as mapped. However, the fraction of reads with nonzero mapping quality (negative log of the probability that a read has been placed in a wrong location) is much lower -typically only 40-50%. Figure 7 shows a typical distribution of the mapping quality obtained from bwa mem alignment. In practice, we only used alignments with mapping quality of at least 30. A base was counted towards allele depth if its base quality score was at least 10.
It is well known that alignment may be especially ambiguous when reads contain indels with respect to the reference. In such cases, multiple-sequence realignment approaches have been proposed [4]  Once the allelic depths for all taxa and a given chunk of the genome are available in shared memory, each site is evaluated for presence of a tentative SNP. On a multi-core machine, the set of sites within the genome chunk is divided into subsets processed in parallel on different cores. Sites with less than 10 taxa this test is below 0.2, more exact p-value is obtained from a simulation procedure. The simulation procedure used here, implemented in Java, is the same as the one implemented in R package [16]. An alternative allele is kept if at least one contingency table involving this allele has p-value smaller or equal to 0.01. If none of the alternative alleles survive the ST filter, the site is skipped (not reported in output).
The ST filter tends to eliminate variant sites resulting from random sequencing errors.

GBS anchor map and IBD filter
Given a set of trustworthy SNPs and a diverse set of 916 taxa it is possible to identify, for an arbitrary region of the genome, a number of taxa pairs which are identical by descent (IBD) and are therefore 21 expected to have identical genotypes in this region. If known, these IBD pairs can be used as a powerful filter eliminating variant which violate IBD constraints.
To determine the IBD regions, we used the first step of our pipeline to call genotypes for our 916 taxa on the set of GBS v2.7 sites [7,8] which tend to concentrate in relatively well-conserved low-copy regions of the genome and can therefore be considered reliable. This set of 954,384 sites was filtered to include only SNP (not indel) sites for which the p-value from the segregation test was below 0.05 and which were more than 5 bp away from any indel. The set of genotypes at 475,272 sites obtained in this way, which will be referred to as GBS anchor, agree well with those from GBS on 167 taxa present in both sets. Alleles detected by the HapMap 3 pipeline agreed with those from GBS at 94% of the GBS sites. At 90% of the sites, fraction of (non-missing data) taxa with genotypes in agreement with those from GBS was at or above 85%. Genotypes different from GBS ones were observed for 82 taxa. These differences were most frequent (up to 19% of all sites) for teosinte lines.
The GBS anchor was used to compute the genetic distance (identity by state) between any two of the 916 lines in windows containing 2000 GBS sites each (about 8.5 Mbp on average). If the genetic distance within such a window was <= 0.02 (about 10 times smaller than the mean distance across all pairs), the two lines were considered to be in IBD. At least 200 comparable GBS sites (i.e., non-missing data simultaneously on both lines being compared) were assumed necessary to make the genetic distance calculation feasible.
This allowed for good distance estimate while keeping the number of detected IBD relationships large.
The number of taxa involved in IBD relationships in any given window were between 385 (start of chromosome 10) and 757 (middle of chromosome 7) and averaged 588, leading to large numbers of IBD contrasts, ranging from 3,710 (beginning of chromosome 4) to 42,890 (middle of chromosome 7), and averaging 13,500.

22
The tentative (ST-filtered) variant sites were confronted with the IBD information as follows: for each site, pairs of lines in IBD were determined as described above. Genotypes of IBD-related lines were compared and the numbers of allele matches and mismatches, summed over all IBD pairs, were counted for each allele present at the site. If the match/mismatch ratio was at least 2 for at least two alleles, or if only one allele was present in all IBD contrasts, the site is considered as passing the IBD filter. Such a filter is less powerful for sites where all bases in IBD lines are major allele homozygotes, i.e., the variant being evaluated occurs in lines not involved in IBD pairs. Formally, such a site passes our IBD filter, but the actual variant is not strongly confirmed. These uncertain sites, mostly with low minor allele frequency, are labeled "IBD1" in the HapMap 3 VCF files and constitute about 50% of all HapMap 3 sites.

Linkage Disequilibrium (LD) filter
Any true SNP should be in local linkage with other nearby SNPs. This observation is the origin of another filter used in this work, referred to as the LD filter. For each variable site surviving the ST and IBD filters, we evaluated LD with each site of the GBS anchor. As the LD measure we chose the p-value from a 2 by 2 contingency table of haplotype counts AB, Ab, aB, ab. For the purpose counting haplotypes, heterozygous genotypes were treated as homozygous in minor allele, so that each taxon only contributed at most one haplotype. This tends to somewhat strengthen the LD signal and simplify the calculation. For a pair of sites to be tested for LD, the following three conditions had to be satisfied to make the calculation meaningful: i) the two sites were at least 2,500 bp apart, ii) there were at least 40 taxa with non-missing genotypes at both sites being compared, and iii) at least 2 taxa with minor allele had to be present at each of the two sites.
Filtering procedure executed for each site is summarized in Figure 8. First, LD between the given site and all sites in GBS anchor was computed and up to 20 best LD hits (the ones with lowest p-values) were collected. If the p-value of the best hit exceeded 1E-6 (which roughly corresponds to the peak of the 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 23 overall distribution of p-values), the site was rejected. Otherwise, it was determined whether the set of best hits contained any local hits, i.e., hits to GBS sites on the same chromosome within 1 Mbp of the site in question and with the p-value smaller than 10 times the p-value of the best hit. If no such local hits were found, the site was rejected, otherwise it was kept and marked as a site in Local LD using the flag "LLD". Note that the procedure defined this way filters out sites with only non-local LD hits as well as those with only weak LD signal. Sites in local LD as well as those for which LD could not be assessed (because of low minor allele frequency or missing data) pass the filter.

Imputation
In the HapMap 3.2.1 pipeline, the ST-and IBD-filtered genotypes, after the application of the additional ">1,>2" depth-based filter, were processed through the LD KNN imputation procedure based on Ref. [12] to fill in the missing data. The procedure is a version of the "K nearest neighbors" routine where the "nearest neighbors" of a given taxon are selected based on genetic distance computed using variant sites in good local LD. Specifically, for a given target site, a list of up to 70 sites in best LD (as given by the 2 measure) with it is compiled by checking all surrounding sites within 600Kb characterized by heterozygosity lower than 3% and more than 50% taxa with non-missing genotypes. Capping this list at 70 sites leads to good compromise between distance accuracy and computation speed. Then, at the same target site, for each target taxon, up to 30 "nearest neighbor" taxa are selected, with lowest genetic distances from the target taxon. Genetic distances are computed using the set of local LD sites selected in the previous step. Taxa with more than 50% missing genotypes at LD sites, missing genotype at the target position, having distance from the current taxon larger than 0.1, or resulting in less than 10 common LD sites on which the distance can be calculated, are excluded from distance calculation process.
Genotypes of the selected nearest neighbor taxa at the target site are stored in memory along with the genetic distances from the target taxon. This information is used to compute a weight of each neighbor genotype as follows: where the summation index runs over all neighboring taxa with genotype at the target site, and is the distance of taxon from the target taxon. The genotype with the highest weight is considered the imputed genotype (of the target taxon at the target site) provided its weight is at least 10 times larger than that of the second-best candidate genotype. Otherwise the imputation is considered inconclusive and the imputed genotype is set to "unknown" (missing data), as it is in the case when no close neighbors of the current taxon could be found. If a genotype imputed to "unknown" occurs at a site where MAF<1%, it is automatically converted into major allele homozygote.
The imputation procedure is run for each genotype in the input file. However, in the output only the originally missing genotypes are updated to imputed ones, whereas all others are left unchanged, even if classified differently. On the other hand, all imputed genotypes are used during a run to collect imputation statistics. The "transition matrix" showing how many genotypes originally in a given class were imputed into other classes is an indication of the accuracy of the input genotypes. Error rates calculated from the transition data are given in Table 3.

Results
A new computational pipeline was set up to process over 12 trillion bp of sequencing data, and a set of population genetics filters were applied to identify over 83 million variant sites.

Conclusions
We identified polymorphisms in regions where collinearity is largely preserved in the maize species.
However, the fact that the B73 genome used as the reference only represents a fraction of all haplotypes is still an important limiting factor. Center (CIMMYT). High-coverage data for 31 European and US Flint and Dent lines is also available in Ref. [2]. Altogether, in this work we used a total of 1218 maize lines sequenced with depth varying from below 1x to 59x.

KEYWORDS
A common approach in today's genetic diversity projects is to map the shotgun sequencing reads from each individual onto a common reference genome to identify DNA sequence variations, and the physical positions of the reference genome is used as a coordinate system for the polymorphic sites. A good example is the human 1000 genome project [3]. The computational data processing pipeline developed for the human project, GATK, has been widely adopted for identifying genetic variations in many other species [4].
As the sequencing technology is improved and sequencers' base calling error model gets more accurate, the computational challenges in genotyping by short-read sequencing have shifted from modeling sequencer machine artifacts errors to resolving genotyping errors derived from incorrect mapping of short reads to the reference genome. The problem is associated with the experimental design that uses the single-reference genome as coordinate system. Taking maize as an example, the reference being used is a 2.1 Gb assembly from an elite inbred line B73 that represents 91% of the B73 genome [5], and was estimated to capture only ~70% of the low-copy gene fraction of all inbred lines [6]. The sequence alignment software, however, can map 95-98% of the whole genome sequencing reads to the reference.
That suggests a high percentage of the reads were mapped incorrectly, either being mapped to the 4 paralogous loci or highly repetitive regions under-represented in the reference assembly. The genetic variants called from the miss-mapped reads need to be corrected computationally. The maize HapMap2 relied on linkage disequilibrium in the population to purge most of the bad markers caused by alignment errors. To construct maize HapMap 3, a new computational pipeline was developed from scratch to handle the sequencing data from 10 times more lines, and also took advantage of the high quality genetic map constructed from the GBS technology [7,8] which was not present when HapMap2 was constructed.

DATA DESCRIPTION
The sequencing data used in this work is comprised of 12,497 billion base pairs in a total of 113,702 billion Illumina paired-end reads, originating from 1,218 maize and teosinte lines. The data was collected from several sources over several years, and varies in quality, read length, and coverage. Basic information about various datasets and stages of the HapMap 3 project they were used in are summarized in Table 1.
Each of the 1,218 lines were sequenced at depth varying from below 1x to 59x, using reads of lengths ranging from 44 through 201 bp, averaging 110 bp. All reads were aligned to maize reference genome B73 5 version AGP v3 using BWA mem aligner [9]. Overall, 95-98% of the reads were mapped to the reference genome, although only about 50-60% with non-zero mapping quality. Taxa from sets "HapMap2", "HapMap2 extra", and "CAU" partially overlap. The "282" libraries, sequenced twice represent 271 taxa. A "+" means that the dataset was used in a given stage, "-" that it was not.
All sequence data used in this work is publically available. Collection and publishing of this data does not violate any local or international legislation or guidelines.

ANALYSIS Initial variant discovery
The HapMap 3 pipeline is summarized in Figure 1. First, polymorphic sites were called for a set of 916 taxa from datasets HapMap2 through CIMMYT/BGI (7,191 billion base pairs, 74,643 million reads). In the first step, a custom built software tool was used to determine genotypes for each taxon at each site of the genome based on allelic depths at that site. Bases counted towards depth had base quality score of at least 10 and originated from reads with mapping quality (MAPQ= − (10 ), where -calculated by the BWA mem aligner -is the probability of the reported read location being wrong) at or above 30.
This cut-off was chosen at mid-point between the highest MAPQ value reported by the aligner, 6 corresponding to unambiguous alignments (60) and that of the most ambiguous ones (0). Analysis of the inbreeding coefficient (Section HapMap 3.1.1) and of MAPQ distributions shows that our choice of cut-off leads to decent quality genotypes while allowing for over 80% of alignments with MAPQ>0 to be included.
Only sites where at least 10 taxa had coverage of 1 or more were considered. Following Ref. [1], at each site the allelic read depths were subject to segregation test (ST -see METHODS section for details). For a population of inbred lines at true variant sites, one expects depths corresponding to minor and major alleles to be concentrated in roughly different subset of taxa rather than being randomly distributed. The purpose of the ST test is to find and eliminate sites for which allelic depth distribution appears random, as such randomness, indicating high heterozygosity, is likely caused by alignment and sequencing errors.
A measure of the randomness is the p-value of the ST test (the smaller the p-value, the less random the distribution). A p-value threshold of 0.01 was used in this study. This choice was somewhat arbitrary, aimed at reducing the number of tentative variant sites to manageable size before further, more stringent filters were applied. In this first, ST-based round of filtering, a total of 196 million tentative polymorphic sites were selected. In the second step, these sites were filtered using the identity by descent (IBD) information derived from about 0.5 million of high-quality polymorphisms obtained previously [8] using the Genotyping-By-Sequencing (GBS) approach [7]. These GBS variants (GBS anchor) were used to determine regions of IBD, where certain pairs of taxa are expected to have identical haplotypes. The tentative polymorphic sites violating these IBD constraints were then filtered out, leaving 96.8 million sites. At roughly half of the sites surviving this filter, minor allele was not present in taxa involved in the tested IBD relationships. At such sites (typically with low minor allele frequency), the satisfied IBD constraints do not confirm the existence of a variant. They are therefore less reliable and have been marked with "IBD1" flag in the VCF files (see Table 2 [7,8], which consists of markers located in hypo-methylated and genetically stable regions. Sites giving only very weak or only nonlocal (i.e., outside of 1 Mb radius) linkage Disequilibrium (LD) hits were eliminated, which resulted in the final set of 61,228,639 polymorphisms. For slightly less than 40% of these sites, LD could not be conclusively calculated due to small minor allele frequencies (MAF), whereas the remaining sites, confirmed to be in local LD with the GBS anchor, have been marked with flag "LLD". Among the sites surviving all filtering steps, 8.7 million are indels or are located near (within 5 bp) of an indel. These have been marked with the flag "NI5". Since a procedure to achieve consistent alignment across all reads covering the same indels -local realignment -is not computationally feasible at this scale and has not been performed, genotyping errors could occur, and, consequently, most such sites are tentative and should be treated with caution. Figure 2 shows overlaps between various classes of variants of HapMap 3.1.1. First, we notice a rather small overlap between sites in confirmed local LD ("LLD" flag) and those marked "IBD1". This is understandable, as the IBD1 sites represent mostly low MAF cases, where LD assessment could not be done. Indels and vicinity (labeled "NI5") constitute about 15% of sites in each of the LLD, IBD1, and the union of LLD and IBD1 sets. Only a very small fraction of sites does not carry LLD or IBD1 flag, i.e., they are strongly confirmed by the IBD filter, but could not be classified with LD. The subset of 29.8 million sites in local LD and away from indels should be considered the most reliable.
To check the sensitivity of the obtained variant set to the mapping quality threshold imposed on the reads counted towards allelic depths, we repeated the pipeline using the mapping quality threshold equal to 1.
Comparison of the variant set obtained this way (referred to as q1) with our recommended set (q30) is shown in Figure 3. While the overall number of variant sites is approximately independent of the mapping 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 9 quality threshold, the two pipelines produce significantly different sets of sites, with only 72% of all q30 sites reproduced by the q1 pipeline. Closer inspection shows that this variability is due primarily to the IBD1 sites, for which our filtering strategy was the least stringent. On the other hand, the LLD sites, confirmed to be in local LD with GBS anchor, are much more independent of the mapping quality threshold, which confirms high quality of such sites.
For a population of inbred lines considered here, insight into genotype quality may be obtained from the inbreeding coefficient, calculated here for each taxon using the VCFtools program [11] from the formula where is the observed number of homozygotes for a given taxon, is the number of sites at which the taxon was genotyped, and is the expected number of homozygotes given by = ∑ (1 − 2 ).
Summation in the latter formula runs over genotyped sites, is the minor allele frequency at site (computed from all taxa in the population with non-missing genotypes at this site), and = (1 − ).
Low values of the inbreeding coefficient, indicating high heterozygosity, are mostly due to genotyping errors. Importance of choosing a sufficiently tight mapping quality threshold for the quality of genotypes is apparent from Figure 4, where the distribution of inbreeding coefficient for chromosome 10 is shown for q1 and q30 variant sets. The lower MAPQ threshold results in a large number of miss-mapped reads being counted towards depth, producing overly heterozygous genotypes, especially for highly covered taxa (the peak below 0.8 is due mostly to CIMMYT lines with 10-15x coverage; these lines have higher heterozygosity than other lines which may also contribute to the peak) and thus shifting the curve to the left. Since most HapMap 3 taxa are inbred lines, one should expect the true distribution to be contained within peak around 0.95. In view of this, the q30 result is definitely an improvement over q1, although a longer than expected tail extending towards the value 0.8 indicates that the HapMap 3 variants may contain too many false heterozygotes.  Figure 1). On these sites, genotypes were called on the 263 taxa from the "282" panel of Ref. [10] using "282 2x" dataset, and on the 31 high-coverage (on average 17 x) "German" taxa [2], for the total of 1210 taxa. Some of the taxa present in the "282" and "German" sets carry the same names as the ones included in the 916-taxa HapMap 3.1.1 set. Since despite identical names such taxa often originate from different germplasm sources, they have been kept separate during genotyping, i.e., reads from different sources were not merged and separate genotypes were computed for each source. In the resulting VCF files, the names of the overlapping taxa have been prefixed by "282set_" and "german_". For example, in the case of B73, there are three columns representing different datasets for this taxon: "B73" (the original 916taxa set), "282set_B73" (sequence from the more recent "282" libraries), and "german_B73" (from Ref. [2]).
To further eliminate the false positives resulting from sequencing errors, an additional depth-based filter was applied to the 96.8 million sites. Referred to as ">1,>2" filter it accepts sites for which the read support of minor allele was greater than 1 in at least one taxon and greater than 2 across all taxa. Genotypes on the surviving 83,153,144 sites, referred to as "un-imputed HapMap 3.2.1", were then processed through 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 11 the LD KNN imputation procedure based on Ref. [12], where the "nearest neighbors" of a given line are selected based on sites in good local LD with the target site. Whenever possible, the procedure filled up missing genotypes with imputed ones, but the non-missing genotypes were left unchanged, even if imputation classified them differently. Non-imputable missing genotypes at the sites with (preimputation) MAF below 1% were assumed to be major allele homozygotes. Imputation reduces the fraction of missing genotypes from 50% to 7%. Most of the originally missing genotypes (about 85%) are imputed to major allele homozygotes. Accuracy of the genotype dataset can be assessed by comparing the original genotypes with imputed ones. As shown in Table 3, 99.8% of major allele homozygotes are imputed back into the same class. While the accuracies of minor allele homozygotes and genotypes including indels are both above 90%, only 11% of heterozygotes are imputed back into the same class, while 47% of them fail imputation altogether. This reflects the inherent difficulty in calling heterozygotes.
Relationship between variant sites included in HapMap 3.1.1 and 3.2.1 is shown in Figure 5. Both pipelines start from the same set of IBD-filtered genotypes and subject them to different kinds of filtering, with that of

DISCUSSION
The maize genome, 2.3 GB in size [5], is smaller than the human genome. But some of its distinctive features makes it more challenging for variants identification. First, a recent whole genome duplication occurred 12 million years ago resulted in homologous segments that complicate the short read  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 13 alignments; second, the rampant activities of transposable elements within last 1-5 million year not only resulted in accumulation of large amount of relatively young repetitive elements in the intergenic regions, but also extraordinary structural variations within species [5,6]. In this study, the genome of the B73 maize line was used as the reference for variant calling from short sequencing reads. Structural variations between B73 and other individuals has been the major challenge for identification of true variants. In particular, short reads derived from regions missing in the reference genome could be mismapped to other paralogous regions, which lead to false positive genotypes. In the human 1000 genome project, a new HaplotypeCaller was used [4], which performs local de novo assembly to identify the most likely haplotypes for each individual and thus improve the genotyping results. However, HaplotypeCaller is computationally very expensive, and not always applicable in species like maize, where the single reference genome misses many haplotypes presents in the species and has a lot more mismapped paralogous reads that would disrupt the local assembly. To filter out these false positive variants called from the mismapped reads, we relied on the Zea GBS map [7,8], which was obtained from GBS markers located primarily in hypo-methylated chromosomal regions. GBS maps were used to identify IBD regions between the individual genomes, and 100 million markers with high percentage of mismatched genotype calls in the IBD regions were filtered out from the initial set of 196 million markers. The highly repetitive genomic regions derived from recent transposition activities are in general easier to identify, because the templates of these repeats are well represented on the reference genome, and sequencing reads mapped to these regions, flagged with low mapping quality, can be removed at the early stage of the analysis pipeline. For HapMap 3, reads with mapping quality lower than 30 were not included in the build.
One of the goals of HapMap 3.1.1 is to identify genetic markers in regions where collinearity is preserved in majority of maize lines. The LD filter in the pipeline was applied for this purpose. To do this, we genetically mapped the presence/absence of the minor alleles using the GBS genetic map, and these mapped genetic positions were compared to the physical positions on the B73 reference. Among the 96. 8   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 14 million sites surviving the IBD filter, 25% did not have enough non-missing data or sufficient minor allele frequency for genetic mapping to be meaningful. For 38% of sites, at least one genetically mapped position matching the physical positons on B73 reference was found, 24% have no significant hits from genetic mapping, probably due to no consensus Local LD filter based on a large, diverse population may be too stringent, as some markers, good within certain sub-populations, may be thrown out. Therefore, the LD filter was not used in the HapMap 3.2.1 release, which contains a total of 83 million variant sites, subject only to ST and IBD filters and an additional depth-based filter aimed to improve reliability of rare allele calls. Although those sites are likely to have higher misalignment rates, they are still likely to capture a true association with phenotypes. 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 15 In the un-imputed HapMap 3.2.1, at about 10% of all variant sites, fraction of heterozygous taxa exceeds 3%. Such sites are marked "DUP", as most likely originating from duplication misalignments. Figure 6 shows the distribution of fraction of heterozygous sites per taxon for different versions of HapMap 3.2.1 release. While for the un-imputed genotypes the distribution peaks slightly below 1%, imputation significantly shifts the peak to the left, down to about 0.5%. This is a consequence of most missing genotypes being imputed to homozygotes. Interestingly, considering only sites in good local LD (marked with the "LLD" flag) leads to distributions (both imputed and un-imputed) shifted towards higher heterozygosities. This is understandable, as the LLD sites are typically those with higher minor allele frequencies, where the chance of encountering a heterozygote is higher.
In summary, besides the addition of more maize lines, the HapMap 3.2.1 release differs from the 3.1.1 release in three major aspects: 1) Improved rare allele calls: to increase the accuracy of the variants with rare allele, the HapMap 3.2.1 pipeline applied more stringent read depth thresholds instead of the population genetics based LD filter that could not be applied to sites with very low MAF; 2) The sites with high percentage of heterozygous calls were flagged in the VCF files; 3) Missing data was imputed using the LD KNN method. As summarized in Table 2, the VCF files of both datasets contain labels that flag the characteristics of each of the sites. To effectively use this resource, it is recommended to filter the sites based on the flags that are appropriate to the purpose of each project.
When constructing the maize HapMap 3, the most serious problems we were facing can be attributed to the use of a genome from a single individual (B73) as a reference for other, often very different species. This is becoming the single limiting factor in the study of maize diversity, as well as breeding practice. The only remedy is to move away from a single genome-based reference coordinate and adopt a pan-genome based reference system that incorporates all major haplotypes of the species .   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 genotyped in the HapMap 3.2.1 build is 1218.
In this study, individuals with the same taxa name but contributed by different members of the consortium were kept as separate entries in genotyping pipeline -a decision prompted by comparison of genotyoes obtained from different datasets. For example, the newly sequenced CML103 is significantly more heterozygous from CML103 studied previously in the HapMap2 project. Also, the Mo17 sequence originating at CAU has been treated as taxon separate from Mo17 and CAUMo17. In most of those cases, 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 17 a prefix or suffix indicating the origin of the sequence data has been added to the taxa name (e.g., "282set_" or "german_", "-chin"  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 18 fragments, the remnants of Illumina adapter sequences were clipped using TRIMMOMATIC [14] and only one read was used and aligned as single-end. The bwa mem aligner is capable of clipping the ends of reads and splitting each read in an attempt to map its different parts to different location on the reference. As a result, typically over 95% of reads are reported as mapped. However, the fraction of reads with nonzero mapping quality (negative log of the probability that a read has been placed in a wrong location) is much lower -typically only 40-50%. Figure 7 shows a typical distribution of the mapping quality obtained from bwa mem alignment. In practice, we only used alignments with mapping quality of at least 30. A base was counted towards allele depth if its base quality score was at least 10.
It is well known that alignment may be especially ambiguous when reads contain indels with respect to the reference. In such cases, multiple-sequence realignment approaches have been proposed [4] to find the correct sequence and location of an indel and avoid spurious flanking SNPs. Since indels are not the primary focus of this work and since the realignment is computationally very expensive, it has not been performed by the HapMap 3 pipeline. Thus, although indels and SNPs in their immediate vicinity have been retained in the HapMap 3 VCF files, they are less reliable and have therefore been marked with "NI5" label for easy filtering.
Genotyping pipeline Raw genotypes were obtained using a custom-built multi-threaded java code. First, the code executes samtools mpileup command (thresholds on the base and mapping quality are imposed here) for each taxon individually, processing a certain portion of the genome. On a multi-core machine, several such pileup processes (i.e., for several taxa) can be run concurrently as separate threads. Since we are predominantly interested in calling SNPs, we use a simplified indel representation where insertions and deletions with respect to reference are treated as additional alleles "I" and "D", respectively, regardless of length and actual sequence of the indel. Read depths and average base qualities of all six alleles (A, C,   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 19 G, T, I, and D) are extracted from samtools mpileup output for each taxon at each genomic position and stored in an array shared between all threads. The amount of memory available on the machine along with the number of taxa determine the upper limit on the size of this array, and therefore -the maximum size of chromosome chunk which can be processed at one time. As base quality of I and D alleles we took the value corresponding to the base directly preceding the indel on the reference.
Extraction of allelic depths for all genomic positions is time consuming, which presents a major obstacle if joint genotyping needs to be re-run, for example, upon extending the taxa set. It is therefore advantageous to run the depth extraction only once for each taxon and save the obtained depths on disk to be retrieved (rather than re-calculated) during the genotyping step. This way, when the taxa set for genotyping is extended, mpileup step has to be run only for the newly added taxa. Thus, the program features an option to save allelic depths and average qualities in specially designed data structures stored in HDF5 files -one such file per taxon per chromosome. To save space, each allele depth and average quality is stored as one byte, which allows exact representation of integers from 0 to 182, while higher Once the allelic depths for all taxa and a given chunk of the genome are available in shared memory, each site is evaluated for presence of a tentative SNP. On a multi-core machine, the set of sites within the genome chunk is divided into subsets processed in parallel on different cores. Sites with less than 10 taxa with read coverage and those with only reference allele present are ignored. For all other sites, genotypes are called for all taxa using a simple likelihood model with a uniform error rate [15] assumed at 1%.
Alternative alleles are then sorted according to their allele frequencies and up to two most abundant alleles are kept, as decided by the segregation test described in the next Section. Sites for which all taxa 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  this test is below 0.2, more exact p-value is obtained from a simulation procedure. The simulation procedure used here, implemented in Java, is the same as the one implemented in R package [16]. An alternative allele is kept if at least one contingency table involving this allele has p-value smaller or equal to 0.01. If none of the alternative alleles survive the ST filter, the site is skipped (not reported in output).
The ST filter tends to eliminate variant sites resulting from random sequencing errors.
To determine the IBD regions, we used the first step of our pipeline to call genotypes for our 916 taxa on the set of GBS v2.7 sites [7,8] which tend to concentrate in relatively well-conserved low-copy regions of the genome and can therefore be considered reliable. This set of 954,384 sites was filtered to include only SNP (not indel) sites for which the p-value from the segregation test was below 0.05 and which were more than 5 bp away from any indel. The set of genotypes at 475,272 sites obtained in this way, which will be referred to as GBS anchor, agree well with those from GBS on 167 taxa present in both sets. Alleles detected by the HapMap 3 pipeline agreed with those from GBS at 94% of the GBS sites. At 90% of the sites, fraction of (non-missing data) taxa with genotypes in agreement with those from GBS was at or above 85%. Genotypes different from GBS ones were observed for 82 taxa. These differences were most frequent (up to 19% of all sites) for teosinte lines.
The GBS anchor was used to compute the genetic distance (identity by state) between any two of the 916 lines in windows containing 2000 GBS sites each (about 8.5 Mbp on average). If the genetic distance within such a window was <= 0.02 (about 10 times smaller than the mean distance across all pairs), the two lines were considered to be in IBD. At least 200 comparable GBS sites (i.e., non-missing data simultaneously on both lines being compared) were assumed necessary to make the genetic distance calculation feasible.
This allowed for good distance estimate while keeping the number of detected IBD relationships large.
The number of taxa involved in IBD relationships in any given window were between 385 (start of chromosome 10) and 757 (middle of chromosome 7) and averaged 588, leading to large numbers of IBD contrasts, ranging from 3,710 (beginning of chromosome 4) to 42,890 (middle of chromosome 7), and averaging 13,500.  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 22 The tentative (ST-filtered) variant sites were confronted with the IBD information as follows: for each site, pairs of lines in IBD were determined as described above. Genotypes of IBD-related lines were compared and the numbers of allele matches and mismatches, summed over all IBD pairs, were counted for each allele present at the site. If the match/mismatch ratio was at least 2 for at least two alleles, or if only one allele was present in all IBD contrasts, the site is considered as passing the IBD filter. Such a filter is less powerful for sites where all bases in IBD lines are major allele homozygotes, i.e., the variant being evaluated occurs in lines not involved in IBD pairs. Formally, such a site passes our IBD filter, but the actual variant is not strongly confirmed. These uncertain sites, mostly with low minor allele frequency, are labeled "IBD1" in the HapMap 3 VCF files and constitute about 50% of all HapMap 3 sites.

Linkage Disequilibrium (LD) filter
Any true SNP should be in local linkage with other nearby SNPs. This observation is the origin of another filter used in this work, referred to as the LD filter. For each variable site surviving the ST and IBD filters, we evaluated LD with each site of the GBS anchor. As the LD measure we chose the p-value from a 2 by 2 contingency table of haplotype counts AB, Ab, aB, ab. For the purpose counting haplotypes, heterozygous genotypes were treated as homozygous in minor allele, so that each taxon only contributed at most one haplotype. This tends to somewhat strengthen the LD signal and simplify the calculation. For a pair of sites to be tested for LD, the following three conditions had to be satisfied to make the calculation meaningful: i) the two sites were at least 2,500 bp apart, ii) there were at least 40 taxa with non-missing genotypes at both sites being compared, and iii) at least 2 taxa with minor allele had to be present at each of the two sites.
Filtering procedure executed for each site is summarized in Figure 8. First, LD between the given site and all sites in GBS anchor was computed and up to 20 best LD hits (the ones with lowest p-values) were collected. If the p-value of the best hit exceeded 1E-6 (which roughly corresponds to the peak of the 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 23 overall distribution of p-values), the site was rejected. Otherwise, it was determined whether the set of best hits contained any local hits, i.e., hits to GBS sites on the same chromosome within 1 Mbp of the site in question and with the p-value smaller than 10 times the p-value of the best hit. If no such local hits were found, the site was rejected, otherwise it was kept and marked as a site in Local LD using the flag "LLD". Note that the procedure defined this way filters out sites with only non-local LD hits as well as those with only weak LD signal. Sites in local LD as well as those for which LD could not be assessed (because of low minor allele frequency or missing data) pass the filter.

Imputation
In the HapMap 3.2.1 pipeline, the ST-and IBD-filtered genotypes, after the application of the additional ">1,>2" depth-based filter, were processed through the LD KNN imputation procedure based on Ref. [12] to fill in the missing data. The procedure is a version of the "K nearest neighbors" routine where the "nearest neighbors" of a given taxon are selected based on genetic distance computed using variant sites in good local LD. Specifically, for a given target site, a list of up to 70 sites in best LD (as given by the 2 measure) with it is compiled by checking all surrounding sites within 600Kb characterized by heterozygosity lower than 3% and more than 50% taxa with non-missing genotypes. Capping this list at 70 sites leads to good compromise between distance accuracy and computation speed. Then, at the same target site, for each target taxon, up to 30 "nearest neighbor" taxa are selected, with lowest genetic distances from the target taxon. Genetic distances are computed using the set of local LD sites selected in the previous step. Taxa with more than 50% missing genotypes at LD sites, missing genotype at the target position, having distance from the current taxon larger than 0.1, or resulting in less than 10 common LD sites on which the distance can be calculated, are excluded from distance calculation process.
Genotypes of the selected nearest neighbor taxa at the target site are stored in memory along with the genetic distances from the target taxon. This information is used to compute a weight of each neighbor genotype as follows: 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65   24   = ∑  1  1 + 70  , where the summation index runs over all neighboring taxa with genotype at the target site, and is the distance of taxon from the target taxon. The genotype with the highest weight is considered the imputed genotype (of the target taxon at the target site) provided its weight is at least 10 times larger than that of the second-best candidate genotype. Otherwise the imputation is considered inconclusive and the imputed genotype is set to "unknown" (missing data), as it is in the case when no close neighbors of the current taxon could be found. If a genotype imputed to "unknown" occurs at a site where MAF<1%, it is automatically converted into major allele homozygote.
The imputation procedure is run for each genotype in the input file. However, in the output only the originally missing genotypes are updated to imputed ones, whereas all others are left unchanged, even if classified differently. On the other hand, all imputed genotypes are used during a run to collect imputation statistics. The "transition matrix" showing how many genotypes originally in a given class were imputed into other classes is an indication of the accuracy of the input genotypes. Error rates calculated from the transition data are given in Table 3.
Additionally, files c*_hmp311_q1.vcf.gz (in the same location) contain test results obtained with mapping quality threshold equal to 1.