## Abstract

Single-nucleotide polymorphism (SNP) tagging is widely used as a way of saving genotyping costs in association studies. A number of different tagging methods have been developed to reduce the number of markers to be genotyped while maintaining power for detecting effects on non-assayed SNPs. How the different methods perform in different settings, the degree to which they overlap and share common tags and how they differ are important questions. We investigated these questions by comparing three widely used tagging methods/algorithms: one haplotype r2 -based method, one pair-wise r2 -based method and one method which was based on haplotype diversity but focused on major haplotypes. Tagging efficiency was defined as the number of genotyped markers divided by the number of tagging SNPs. Tagging effectiveness was defined as the proportion of un-genotyped or ‘hidden’ SNPs being detected (having a pair-wise or haplotype r2 with a set of tagging SNPs over a threshold, e.g. haplotype r2 ≥0.80). The ENCODE regions genotyped on the HapMap CEPH individuals were examined in this study. Tagging effectiveness was generally poorer for rare SNPs than for common SNPs, for all three tagging methods. Inclusion of rare SNPs in the initial HapMap scheme could enhance the performance of tags on rare hidden SNPs at the expense of increased genotyping cost. At a moderate tagging efficiency, more than 90% of hidden SNPs detected by tagging SNPs selected by one method were also detected by tagging SNPs selected by another method, and this figure could be increased to 100% if tagging efficiency was allowed to drop. These results indicate that the tagging space is highly concordant between different tagging methods, despite the fact that they often involve different sets of tagging SNPs.

## INTRODUCTION

Single-nucleotide polymorphism (SNP) tagging is an important strategy to save genotyping costs in disease association studies ( 1–8 ), and it is becoming even more important with the availability of the high-density HapMap ( 9 ). The basic principle behind tagging is that densely spaced SNP markers usually carry redundant information, so that a subset of markers can often retain all or most of the information. The concept of tagging was first introduced by Johnson et al. ( 1 ). In their study, haplotype information was used to select the so-called haplotype-tag SNPs (htSNPs). Since then, many different methods and algorithms have been developed ( 4 , 5 , 10–20 ). These methods differ from one another mainly in the following aspects: (1) the source of genetic information, i.e. haplotype ( 1 , 4 , 11 , 14 ) or genotype ( 5 , 19 ); (2) the measure and criteria of evaluation used in tagging SNP selection, e.g. haplotype diversity ( 1 , 11 ), haplotype r2 ( 4 ), pair-wise r2 ( 5 , 19 ), entropy ( 13 , 21 ) or a variant form of them ( 20 ); and (3) computing algorithms, e.g. exhaustive search ( 1 , 4 ), analytical and heuristic operations ( 14 , 16 ), hierarchical clustering and graph methods ( 19 ), greedy algorithm ( 5 ) and dynamic programming ( 11 ). With so many methods available, it is very important to understand their common features in practice, as well as where they differ in terms of tagging SNP selection and the regions being tagged. Such information would help users in choosing the appropriate tagging methods in their own applications.

The behaviour of a tagging method can be defined in terms of its computing performance or the tagging space of tagging SNP sets that are selected by the method. Computing performance includes speed and the amount of data being handled and is mainly determined by the computing algorithms applied, whereas tagging space is defined on the basis of how well a set of tagging SNPs captures the information of all the genotyped markers and how well they perform on un-genotyped markers (‘hidden SNPs’). For each set of tagging SNPs, we consider three levels of tagging space: (1) the set of tagging SNPs themselves, i.e. the composition of SNPs in terms of their spacing and allele frequency; (2) their coverage of the genotyped markers, i.e. the ‘known’ space; and (3) their coverage of the un-genotyped markers, i.e. the ‘hidden’ space. Here we are primarily interested in comparing the tagging space among different tagging methods. In particular, we are interested in the following questions: What is the relationship between tagging efficiency and the effectiveness, on hidden SNPs, of the tagging SNPs selected by individual methods? How often are the same tagging SNPs selected by different methods? What proportion of hidden SNPs is detected by these tagging SNP sets? What proportion of hidden SNPs is commonly detected by them?

A comparison of all the available tagging methods is difficult and also unnecessary. The tagging selection algorithm, rather than the computing algorithm, is the primary determinant of how and which SNPs are selected as tagging SNPs. It is the selection algorithm we focus on in this study. Almost all available methods use either haplotypes ( 1 ) or genotypes ( 5 ) as the information source. To examine whether a set of SNPs are selected as tagging SNPs, objective criteria are also needed. Earlier methods used haplotype diversity ( 1 ) as the measure for selection, but some of the recent developments ( 4 , 5 , 20 ) use either pair-wise r2 or haplotype r2 because of their direct relevance to association studies, although the mathematical expression of haplotype r2 is very similar to that of haplotype diversity ( 4 ). In this study, therefore, we choose one haplotype-based tagging method that uses haplotype r2 as the selection criterion and one genotype-based method that uses pair-wise r2 as the selection criterion. Besides these two, we have also included a third method for comparison, which focuses on major haplotypes and is intentionally biased towards selecting common markers as tagging SNPs, and thus increases the potential for cost savings on genotyping at the possible expense of general effectiveness.

The three tagging methods were based on three existing software programs—TagIT ( 4 ) for haplotype r2 , ldSelect ( 5 ) for pair-wise r2 and SNPTagger ( 14 ) for haplotype diversity. It should be emphasized again, however, that the idea was to compare the underlying selection algorithms rather than the software programs per se. We believe that the comparison of these three methods can shed light on the tagging space problem of other related methods, although specific values of particular measures may differ according to individual implementations and heuristics. Many other methods were either a variant or a combination of the above three. For example, the block-free tagging method ( 20 ) uses haplotype information for SNP selection with a measure similar to haplotype r2 . The more recent Tagger program (developed by Paul de Bakker; http://www.broad.mit.edu/mpg/tagger/ ) incorporates a novel combination of both genotype and haplotype information.
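To make the idea of a greedy pair-wise r2 selection concrete, the following is a minimal sketch and not the ldSelect implementation. The function name `greedy_tags`, the dictionary of precomputed r2 values and the alphabetical tie-breaking rule are all assumptions made for this illustration. At each step, the SNP that exceeds the threshold with the largest number of still-untagged SNPs is chosen as a tag, and the SNPs it covers are removed.

```python
def greedy_tags(snps, r2, threshold=0.80):
    """Greedy tag selection on pair-wise r2 (illustrative sketch only).

    snps: list of SNP identifiers.
    r2:   dict mapping frozenset({snp_i, snp_j}) -> pair-wise r2 (hypothetical input format);
          missing pairs are treated as r2 = 0.
    """
    untagged = set(snps)
    tags = []

    def covered(snp):
        # SNPs (including the candidate itself) that this candidate would tag.
        return {t for t in untagged
                if t == snp or r2.get(frozenset((snp, t)), 0.0) >= threshold}

    while untagged:
        # Sorting first makes tie-breaking deterministic (an assumption of this sketch).
        best = max(sorted(untagged), key=lambda s: len(covered(s)))
        tags.append(best)
        untagged -= covered(best)
    return tags


# Toy example with made-up r2 values.
toy_r2 = {
    frozenset(("A", "B")): 0.90,
    frozenset(("B", "C")): 0.85,
    frozenset(("C", "D")): 0.30,
}
print(greedy_tags(["A", "B", "C", "D"], toy_r2))
```

In this toy case B covers A and C, and D remains as a singleton tag. Lowering the threshold generally yields fewer tags and therefore a higher tagging efficiency.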

We investigated the tagging space problem of the three methods using 9673 polymorphic SNP markers genotyped in the ENCODE regions on 90 CEPH individuals by the International HapMap Project ( 9 ). Markers genotyped for 190 CEPH individuals in the human major histocompatibility complex (MHC) region, known for its relatively unusual linkage disequilibrium (LD) structure ( 22 ), were also used as a fine-scale comparison. Haplotype r2 ( 4 ) and in some cases pair-wise r2 between tagging SNPs and un-genotyped SNPs were used as the measures of the reduction in power to detect effects at un-assayed hidden SNPs.
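As an illustration of the pair-wise measure mentioned above, the sketch below computes r2 between two biallelic SNPs from phased haplotypes coded as 0/1. The function name and the input encoding are assumptions made for this example and are not taken from any of the three programs. Returning 0 for a monomorphic SNP is a convention chosen here, since r2 is undefined in that case.

```python
def pairwise_r2(a, b):
    """Pair-wise r2 between two biallelic SNPs.

    a, b: equal-length sequences of 0/1 alleles, one entry per phased haplotype.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    p_a = sum(a) / n                               # frequency of allele 1 at the first SNP
    p_b = sum(b) / n                               # frequency of allele 1 at the second SNP
    p_ab = sum(x * y for x, y in zip(a, b)) / n    # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0                                 # monomorphic SNP: reported as 0 by convention
    d = p_ab - p_a * p_b                           # linkage disequilibrium coefficient D
    return d * d / denom


print(pairwise_r2([0, 0, 1, 1], [0, 0, 1, 1]))  # perfectly correlated SNPs
print(pairwise_r2([0, 1, 0, 1], [0, 0, 1, 1]))  # independent SNPs
```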

## RESULTS

### Tagging efficiency and tagging effectiveness

Tagging efficiency is influenced by the selection criteria and thresholds of a particular method, whereas tagging effectiveness is a measure of how large a tagging space a tagging SNP set possesses. Tagging SNP sets with similar tagging efficiency may have different tagging effectiveness. In other words, they can differentially cover not only different ‘known’ marker space, but also different ‘hidden’ space. From the 7285 SNP markers with minor allele frequency (MAF) ≥5%, markers were randomly selected to construct a 1 SNP/5 kb map modelling the phase I end product of the International HapMap Project ( 9 ). SNPs that were not selected into the 1 SNP/5 kb map were regarded as hidden SNPs. This process was repeated three times for each ENCODE region. Figure 1 gives the result of tagging efficiency and tagging effectiveness in both the high LD and low LD regions.
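The two summary measures used throughout reduce to simple ratios. The sketch below restates them in code. The function names and the dictionary of per-SNP r2 values are hypothetical, and the numbers in the example are made up for illustration only.

```python
def tagging_efficiency(n_genotyped, n_tags):
    """Number of genotyped markers divided by the number of tagging SNPs."""
    if n_tags <= 0:
        raise ValueError("at least one tagging SNP is required")
    return n_genotyped / n_tags


def tagging_effectiveness(hidden_r2, threshold=0.80):
    """Proportion of hidden SNPs whose r2 with the tag set reaches the threshold.

    hidden_r2: dict mapping hidden SNP id -> r2 (pair-wise or haplotype) with the tag set.
    """
    if not hidden_r2:
        raise ValueError("no hidden SNPs supplied")
    detected = sum(1 for value in hidden_r2.values() if value >= threshold)
    return detected / len(hidden_r2)


# Made-up example: 30 genotyped markers tagged by 10 SNPs, four hidden SNPs.
print(tagging_efficiency(30, 10))
print(tagging_effectiveness({"h1": 0.95, "h2": 0.81, "h3": 0.40, "h4": 1.00}))
```

A higher efficiency means fewer tags per genotyped marker and thus greater genotyping savings, whereas effectiveness depends on the evaluation threshold chosen.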

It is apparent that the trend of change was very similar in regions of different LD profiles for the three different tagging methods (Fig.  1 ). The absolute scale of the two figures ( y -axis) in the upper panel, however, was very different, indicating a 2–3-fold higher tagging efficiency in the high LD regions than in the low LD regions. Tagging effectiveness was also always higher in the high LD regions than in the low LD regions at a particular threshold for all the three methods (Fig.  1 , lower panel). The haplotype-diversity-based method, which focused on common and major haplotypes, had the potential of providing very high genotyping savings, but these savings were achievable only at the cost of tagging effectiveness (Fig.  1 , lower panel). Comparatively, the haplotype r2 -based method and the pair-wise r2 -based method were more conservative, with the latter the most conservative of the three. For the pair-wise r2 -based tagging method, tagging effectiveness came close to the optimum for the region at hand when the threshold was increased to 0.90 in high LD regions and to 0.70 in low LD regions. For the haplotype r2 -based method, maximum effectiveness was similarly achieved at 0.90 for both high and low LD regions. For the haplotype-diversity-based method, however, there was always a clear benefit in increasing the threshold until all haplotypes were included.

### Effects of allele frequencies on tagging effectiveness

It should be noted, however, that the x -axis values of Figure 1 , i.e. the pair-wise r2 , haplotype r2 and haplotype diversity thresholds of the three methods, are not numerically equivalent, and comparisons based on them are thus not strictly appropriate. Rather, to compare their tagging effectiveness, tagging efficiency should be fixed at the same level. In other words, the same number of tagging SNPs should be selected by different methods to evaluate and compare their respective abilities to detect hidden SNPs. For example, in high LD regions, the same tagging efficiency of 3.0 was achieved by the haplotype-diversity-based method at a threshold of 0.93 (i.e. 93% of all the major haplotypes), by the haplotype r2 -based method at a threshold of 0.78 (i.e. minimum haplotype r2 =0.78) and by the pair-wise r2 -based method at a threshold of 0.50 (Fig.  1 A). Similarly, we set different threshold points for high and low LD regions, where the comparison of tagging effectiveness would be carried out with tagging efficiency fixed at a similar level.

In high LD regions (Fig.  2 A–C), the maximum tagging effectiveness (Fig.  2 C) was achieved by requiring that the tagging method explained 100% of the haplotypic diversity (or haplotype r2 =1.00 or pair-wise r2 =0.90). Subtle changes occurred when the three thresholds were decreased in order to increase tagging efficiency and savings on genotyping. When tagging efficiency was increased to 2.5 (Fig.  2 B), the overall tagging effectiveness was the highest for the haplotype r2 -based method and the lowest for the pair-wise r2 -based method; threshold settings were 0.72 for pair-wise r2 , 0.9 for haplotype r2 and 0.97 for haplotype diversity. For all the five categories of SNPs, classified on the basis of their MAF values, tagging effectiveness of the pair-wise r2 -based method was lower than that of the two haplotype-based tagging methods at this level of tagging efficiency. Compared with the haplotype r2 -based method, the major haplotype-focused method was equally or slightly more effective with very common SNPs (MAF≥0.30) but slightly less effective with less common ones (MAF<0.30). We increased the tagging efficiency further to 3.0 (Fig.  1 A), at which point the three thresholds were haplotype diversity=93%, haplotype r2 =0.78 and pair-wise r2 =0.50, and observed that the overall advantage of the haplotype r2 -based method was even more evident when compared with the pair-wise r2 -based method. The major haplotype-based method was less effective than the haplotype r2 -based method for less common SNPs (MAF<0.20) and the most effective of the three tagging methods for the most frequent SNPs (Fig.  2 A and B). It was also noted that for the commonest SNPs (MAF=0.40–0.50), the maximum tagging effectiveness was already achieved at diversity 93% by the major haplotype-based method and there was no benefit in increasing the threshold further. Increasing threshold values had the benefit of increasing tagging effectiveness across the board for the pair-wise r2 -based method and was particularly important for tagging rare SNPs with the major haplotype-based method.

Tagging effectiveness in low LD regions, for all the three tagging methods, was much lower than that in high LD regions. For example, at a tagging efficiency of 1.2 in the low LD regions (Fig.  2 D), the tagging effectiveness for the three methods and for SNPs of different MAF categories was always lower than what was obtained at a tagging efficiency of 2.5 in the high LD regions (Fig.  2 B). Rare hidden SNPs were associated with the lowest tagging effectiveness although a similar level of effectiveness was also observed for SNPs whose MAF values were between 0.30 and 0.40 (Fig.  2 D). This was, however, very different from what was observed in the low LD regions of MHC, where rare hidden SNPs were associated with the highest tagging effectiveness (Supplementary Material, Fig. S 1 ). This discrepancy perhaps reflected the nature of low LD regions, which were often not well defined compared to high LD regions, and, therefore, could be region-specific in SNP and haplotype compositions.

### Shared detection of genotyped SNP markers

When tagging SNP sets are selected by different methods using different selection criteria, some of the important questions include: Do they have similar power to detect non-tagging SNP markers in the genotyped set? How many tagging SNPs are shared between the different tagging SNP sets? How many shared SNPs (including tagging SNPs themselves) can be detected by the various tagging SNP sets? Does maximizing the proportion of shared tags increase the proportion of genotyped SNPs being detected?

To address these questions, we compared the two haplotype-based methods, as well as the two r2 -based methods (Table 1 ). The first sets of tagging SNPs selected by all methods were regarded as the ‘best’ sets on the basis of the genotyped markers only (see Materials and Methods). The evaluation criterion was set as haplotype r2 ≥0.80. When tagging SNPs were selected below this threshold (i.e. Hap r2 =0.78), tagging efficiency was 3.0 and ∼62–73% of tags in the first tagging SNP set by one method were also contained in the first tagging SNP set by the other method. The genotyped SNPs detected by both methods (‘shared detection’ in Table 1 ) were all higher than 99%. When tagging SNPs were selected above the evaluation threshold (i.e. Hap r2 =0.90), the tagging efficiency was 2.5 and ∼68–82% of tags were shared between methods (Table 1 ). The ‘shared detection’ now became 100%.

Often, more than one set of tagging SNPs can be selected by the same tagging approach, yielding equivalent degrees of effectiveness at selection and usually similar degrees of effectiveness at evaluation, which may use a different r2 -value. By comparing each set of tagging SNPs produced by two different tagging methods, a pair of tagging SNP sets sharing a maximum proportion of tags could be obtained. We were interested in whether these sets of commonly shared tag SNPs reflect the underlying LD structure of genotyped SNPs and thus maximize the shared detection for all genotyped markers. Table 1 (column 8) shows the maximum proportion of shared tag SNPs. For example, in the first row (haplotype r2 =0.78 versus haplotype diversity=93%), the percentage of shared tags increased from 73.1% when considering the first tagging SNP sets of both methods (column 4) to 95.4% when the tags shared by the two methods were maximized (column 8). This increase led to a higher percentage of shared detection of hidden SNPs (from 99.5 to 99.9%). There was no gain for the haplotype r2 -based tagging set, but a slightly larger absolute number of genotyped SNPs was detected by the haplotype-diversity-based tagging SNP set (from 93.6 to 95.1%), suggesting a fine-tuning effect of the haplotype r2 -based method on the haplotype-diversity-based method when tagging efficiency was high and equal for both methods. This small gain was achieved at the expense of reduced tagging efficiency (from 3.00 to 2.96) and could be partly offset by purely increasing the number of tagging SNPs to reach the same tagging efficiency (column 11). As tagging efficiency was decreased, such gains were diminished (rows 2–4). The situation was a bit more complicated when the two r2 -based methods were compared. With Hap r2 =0.78 versus r2 =0.50, the percentage of genotyped SNPs being detected decreased for the haplotype r2 -based method from 99.3% (column 5) to 99.2% (column 9), although there was a small increase for the pair-wise r2 -based method (from 91.6 to 92.0%). With Hap r2 =0.78 versus r2 =0.72, a small increase of detected SNPs was observed with the haplotype r2 -based method (from 99.3 to 99.5%). However, this gain was not due to the quality of the tagging SNPs selected through maximizing common tags, but rather to the quantity of tagging SNPs contained in the resulting set, because the gain was equally obtainable by adding random markers to the set to reach the same number (column 11). In general, there was no consistent gain from maximizing commonality between tagging set pairs selected by different methods, and such gains, when observed, tended to be negligibly small.

### Shared detection of hidden SNP markers

The previous results indicate that the shared detection of observed, genotyped markers is not exceptionally difficult to predict and control. In contrast, it is more difficult to predict the performance of tags on SNPs that are not initially genotyped, i.e. ‘hidden’ SNPs. Comparisons between tagging methods (Table 2 ) were carried out to address questions such as: How many of the same hidden SNPs can each method detect? Does maximizing the proportion of common tags increase the proportion of hidden SNPs detected by multiple methods? More importantly, does it influence the absolute number of hidden SNPs being detected?

When tagging efficiency was fixed at 3.0, although the proportion of shared tag SNPs was only 61–73% between any two methods, the proportion of hidden SNPs detected in common was higher than 96% (Table 2 ). Lowering the tagging efficiency (i.e. increasing the number of tagging SNPs) yielded a higher proportion of shared tags and commonly detected hidden SNPs. Specifically, when tagging efficiency was dropped to 2.5, almost all hidden SNPs detected by one method were also detected by tagging SNPs selected by another. This indicated that the three different methods shared a highly common tagging space, despite the differences in their selection algorithms. This is encouraging for users who might use different tagging methods in specific applications.
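The distinction between shared tags and shared detection can be expressed with plain set operations. The sketch below is illustrative only: the function name is hypothetical, the SNP identifiers are made up, and the directional definition (the fraction of one method's set that also appears in the other's) is an assumption about how the proportions in the tables are computed.

```python
def shared_fraction(set_a, set_b):
    """Fraction of the items in set_a that are also present in set_b."""
    a, b = set(set_a), set(set_b)
    if not a:
        raise ValueError("set_a must not be empty")
    return len(a & b) / len(a)


# Made-up tag sets chosen by two methods at the same tagging efficiency.
tags_method1 = {"s1", "s2", "s3", "s4"}
tags_method2 = {"s2", "s3", "s5", "s6"}

# Made-up sets of hidden SNPs detected (r2 over the evaluation threshold) by each tag set.
detected_method1 = {"h1", "h2", "h3", "h4", "h5"}
detected_method2 = {"h1", "h2", "h3", "h4", "h6"}

print(shared_fraction(tags_method1, tags_method2))          # shared tags
print(shared_fraction(detected_method1, detected_method2))  # shared detection
```

In this toy case only half of the tags coincide, yet four of the five hidden SNPs detected by the first method are also detected by the second, which mirrors the qualitative pattern reported here.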

Both the proportion of shared tags and the proportion of commonly detected hidden SNPs could be further increased if one of the methods was allowed to have a lower tagging stringency and thus lower tagging efficiency (Table 2 ). This effectively generated a larger pool of tags for comparison.

We then investigated whether maximizing shared tags had any benefit on revealing the underlying LD structure of hidden SNPs and thus maximizing the common tagging space. For the tagging SNPs selected by the two haplotype-based methods at Hap r2 =0.78 versus HapDiv=0.93 (first row, Table 2 ), there was an increase in the percentage of commonly detected hidden SNPs (96.4–99.1%) and also a higher absolute number of hidden SNPs being detected by both the haplotype r2 -based method (79.9–80.3%) and the haplotype-diversity-based method (77.3–77.8%). However, these gains were obtained only after decreasing the tagging efficiency (columns 3 and 7, Table 2 ). Therefore, it is of interest to understand the true gain. By selecting random markers from non-tagging SNPs and adding them to the first tagging SNP set (column 3, Table 2 ), equal values of tagging efficiency could be reached. In this case, the percentage of hidden SNPs being detected by the haplotype-based methods was 80.4 and 77.3%, respectively (column 11, Table 2 ) and the proportion of commonly detected hidden SNPs was 97.6% (column 12, Table 2 ). Therefore, comparison and use of LD information revealed by the two haplotype-based tagging methods yield a genuine but small gain for the haplotype-diversity-based method and no gain at all for the haplotype r2 -based method. Similar gains were also observed with the Hap r2 =0.78 versus HapDiv=0.94 and r2 =0.72 (Table 2 ).

Tagging efficiency was much lower in low LD regions than in high LD regions (Fig.  1 ). At a high selection threshold, e.g. pair-wise r2 ≥0.70, almost all genotyped SNPs need to be selected as ‘tags’. It is, therefore, not surprising to observe a higher proportion of shared tags in low LD regions than in high LD regions (Supplementary Material, Table S1). As a consequence of highly shared tagging SNPs, the shared detection of hidden SNPs was also very high (Supplementary Material, Table S1).

Because the shared proportion of detected hidden SNPs was high in both low and high LD regions, a high value of shared detection of hidden SNPs was expected, in general, for any region regardless of its LD structure. We re-analysed the ENCODE regions using fixed windows of 10 SNPs for the 1 SNP/5 kb data sets. As shown in Table 3 , the shared proportion of detected hidden SNPs was indeed very high, all above 98%, when the same evaluation criterion (haplotype r2 ≥0.80) was used as in the high LD (Table 2 ) and low LD (Supplementary Material, Table S1) regions.

Because haplotype r2 was chosen as the evaluation measure in this study, a small bias in favour of the haplotype r2 -based tagging method was observed in ENCODE regions. For example, in high LD (Tables 1 and 2 ), low LD (Supplementary Material, Table S1) and regions of fixed-sized windows (Table 3 ), the haplotype r2 -based method was always associated with the highest tagging effectiveness at the same level of tagging efficiency among all methods. For the same 10-SNP windows across the ENCODE regions, the shared detection of hidden SNPs was re-evaluated with pair-wise r2 , rather than haplotype r2 , set to 0.80 (Table 3 , lower part). As expected, the overall tagging effectiveness was much lower for all the three tagging methods. Perhaps also as expected, the pair-wise r2 -based method now became the most effective in detecting hidden SNPs. The shared detection of hidden SNPs between methods, however, was close to 90% at reasonably high tagging efficiency and could reach 99% when tagging efficiency was further reduced (Table 3 ).

In addition to the ENCODE regions, we also performed similar analyses in the human MHC region, which is known to have a relatively unusual structure of LD ( 22 ). As shown in Supplementary Material, Table S2, detected hidden SNPs were found again to be highly shared (≥90%) among different tagging methods. Interestingly, a small bias against haplotype r2 -based method (i.e. smaller tagging effectiveness) was observed when tagging efficiency was around 3.1 and haplotype r2 was used for evaluation. Also of interest was that in the MHC a small gain could be obtained with the haplotype r2 -based method when the number of shared tags with any other method was maximized. These results demonstrate that shared detection of hidden SNPs is consistently high across different genomic regions by all tagging methods tested irrespective of evaluation measures and criteria.

### Applying tagging SNPs from HapMap to rare SNPs

The ENCODE regions comprised a large proportion of rare SNPs (MAF<5%), many of which were obtained by re-sequencing. This category of SNPs has not been pursued in the current International HapMap Project ( 9 ). We evaluated the general utility of the HapMap data to detect rare variants by examining the effectiveness of tagging SNPs produced by all the three tagging methods in the HapMap context. When these rare SNPs were excluded from the initial 1 SNP/5 kb map (Fig.  3 , white and dotted bars), at a tagging efficiency of 3.0, <7% of rare hidden SNPs could be detected by tagging SNPs in high LD regions (dotted bars), and tagging effectiveness was generally higher in low LD regions (Fig.  3 , white bars). When tagging SNPs were required to capture the full haplotype variability (haplotype diversity=100% or haplotype r2 =1.00), the selected tags became more effective with rare hidden SNPs in high LD regions than in low LD regions. Even at this point, however, the tagging effectiveness was significantly lower than that for the hidden SNPs whose MAF values were between 0.05 and 0.10 (Fig.  2 ). Was such a low tagging effectiveness mainly due to the rare nature of the SNPs or due to the fact that they were excluded from the type of markers in the phase I HapMap?

We reconstructed a 1 SNP/5 kb map, this time including the rare SNPs (MAF<5%) in the scheme. We then re-evaluated the tagging effectiveness of tagging SNPs on such rare hidden SNPs (Fig.  3 , dashed and dark bars). There was a substantial increase of tagging effectiveness for all the three tagging methods and at all thresholds, in comparison to the initial scheme where rare SNPs were excluded. The extent of the correlation between a hidden SNP and the set of tagging SNPs depends on whether the hidden SNP is associating or subdividing the haplotypes defined by the set of tagging SNPs ( 23 ), i.e. whether the alleles reside on previously tagged haplotypes or whether they delineate new haplotypes from the existing tags. If a hidden SNP were in complete association with existing haplotypes, the haplotype r2 -value would be 1.0. Otherwise, it would be below 1.0. When rare SNPs were excluded, a maximum of 29.0 and 14.4% of rare hidden SNPs were in complete association with tagging SNP-defined haplotypes in high LD and low LD regions, respectively. In contrast, these figures jumped to 38.2 and 36.3% when rare SNPs were part of the initial map. This alone explains well the increase of tagging effectiveness for rare hidden SNPs by inclusion of rare SNPs into the HapMap scheme. It should be noted though that this increase was achieved at the expense of a lower tagging efficiency and, therefore, higher genotyping costs. For example, in the high LD regions, the average tagging efficiency dropped from 3.0 to 2.2, from 2.5 to 2.0 and from 2.0 to 1.7 when the haplotype r2 -based method was used with the r2 threshold at 0.78, 0.90 and 1.00, respectively.
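The contrast between an associating and a subdividing hidden SNP can be illustrated numerically. The sketch below computes a haplotype r2 as the proportion of the variance in the hidden allele that is explained by the tag-defined haplotype classes. This between-class over total variance formulation is an assumption of the example and is not confirmed here to be identical to the computation used in ( 4 ). The function name and the toy haplotypes are likewise hypothetical.

```python
from collections import defaultdict


def haplotype_r2(tag_haplotypes, hidden_alleles):
    """Proportion of variance in a hidden SNP explained by tag-defined haplotypes.

    tag_haplotypes: one tuple of tag alleles per chromosome.
    hidden_alleles: one 0/1 allele of the hidden SNP per chromosome (same order).
    """
    n = len(hidden_alleles)
    if n == 0 or n != len(tag_haplotypes):
        raise ValueError("inputs must be non-empty and of equal length")
    q = sum(hidden_alleles) / n              # frequency of allele 1 at the hidden SNP
    total_var = q * (1 - q)
    if total_var == 0:
        return 0.0                           # monomorphic hidden SNP: reported as 0 by convention
    counts = defaultdict(lambda: [0, 0])     # haplotype -> [chromosomes, chromosomes carrying allele 1]
    for hap, allele in zip(tag_haplotypes, hidden_alleles):
        counts[tuple(hap)][0] += 1
        counts[tuple(hap)][1] += allele
    between_var = sum((n_h / n) * (n_1 / n_h - q) ** 2 for n_h, n_1 in counts.values())
    return between_var / total_var


# Associating hidden SNP: its allele is fully determined by the tag haplotype.
print(haplotype_r2([(0, 0), (0, 0), (0, 1), (1, 1)], [0, 0, 1, 1]))
# Subdividing hidden SNP: it splits the (0, 0) haplotype into two new haplotypes.
print(haplotype_r2([(0, 0), (0, 0), (1, 1), (1, 1)], [0, 1, 1, 1]))
```

In the first call every tag haplotype carries a single hidden allele, so the value is 1.0. In the second call the hidden SNP subdivides an existing haplotype and the value falls below 1.0.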

## DISCUSSION

We analysed 9673 polymorphic SNP markers genotyped across the 10 ENCODE regions at an average marker spacing of ∼0.5 kb to compare the behaviour of three different tagging methods and investigated the tagging space problem of the three methods. Tagging SNP sets were significantly more effective when they were evaluated using haplotype information than using only pair-wise LD information. At a reasonable level of tagging efficiency, tagging SNPs selected by the two haplotype-based methods were usually more effective on hidden SNPs than those selected by the pair-wise r2 -based method. Among them, haplotype r2 -based method was the most effective. Large genotyping savings could be achieved by reducing the selection thresholds for the three tagging methods and the haplotype-diversity-based method, in particular.

In high LD regions, the higher the frequency of a hidden SNP, the more effective are the tagging SNPs selected by the three tagging methods. This allele effect has been previously observed ( 4 , 23 ), and noted to be mainly due to the fact that common SNPs have a tendency to reside on existing haplotypes rather than to break them into different component haplotypes. This allele effect was generally true for low LD regions in the ENCODE regions but not for those in the human MHC region. Compared with high LD regions (haplotype blocks in this case), low LD regions are usually not well defined (non-blocks in this case). The allele effect of tagging effectiveness in low LD regions, therefore, could be subject to regional variations.

We consider that a set of tagging SNPs has three levels of tagging space: (1) the SNP composition of the tagging set, (2) how well the tagging set covers or tags other genotyped markers and (3) how well the tagging set covers/tags un-genotyped markers. Level 2 was straightforward to predict from level 1, but level 3 was more difficult. At a relatively high tagging efficiency (e.g. tagging efficiency=3.0), over 60% of tagging SNPs were shared by different tagging methods. At this level, more than 99% of genotyped markers and, remarkably, over 96% of un-genotyped markers (haplotype r2 ≥0.80) which were predictable by one method were also predicted by another. This revealed a high concordance of tagging space between different methods. This concordance could be further enhanced until reaching 100% if the number of tagging SNPs was allowed to increase in a region (i.e. tagging efficiency decreased).

The high concordance of tagging space was observed in all regions irrespective of their LD structure and measures of evaluation. There was a delicate difference, however, between high and low LD regions. In low LD regions, high concordance of tagging space was primarily due to a high concordance of tagging SNP composition, whereas in high LD regions, a high concordance of tagging space did not necessarily require a similarly high concordance of SNP composition between tagging SNP sets selected by different tagging methods. The message is encouraging for users who might easily be puzzled by so many different choices of tagging methods.

Another question regarding tagging space was whether using multiple tagging methods and using commonly shared tag SNPs would better reflect the underlying LD structure of a region, and thus increase the tagging effectiveness. In haplotype blocks of the ENCODE regions, the haplotype-diversity-based method and pair-wise r2 -based method were found to be tuneable by the haplotype r2 -based method (Tables 1–3 ). In the MHC region, however, the haplotype r2 -based tagging method was found to be tuneable by the haplotype-diversity-based method and by the pair-wise r2 -based method (Supplementary Material, Table S2). Overall, using multiple tagging methods and maximizing the proportion of shared tagging SNPs seemed to have little and inconsistent benefit, if there was any. This suggested that at a given level of tagging efficiency, individual tagging algorithms may have already utilized the LD information of a given region very efficiently.

In general, rare SNPs are known not to be amenable to tagging strategies ( 4 , 8 , 23 ). This was reflected here by the low tagging effectiveness on rare SNPs (MAF<5%) when such SNPs were initially excluded from a HapMap scheme. If, however, such rare SNPs were included in the scheme from the start, tagging effectiveness on rare hidden SNPs was significantly improved at the expense of higher genotyping cost. Although in common practice ( 24–26 ) rare SNPs are often not considered in population association studies, the result in this study implied that inclusion of them could be beneficial if rare causal variants were suspected.

## MATERIALS AND METHODS

### Samples and SNP genotyping

Genotype data of the 30 CEPH trios in the 10 ENCODE regions were downloaded from the International HapMap Project Web site ( http://www.hapmap.org , March 2005 freeze). A total of 19 680 SNP markers were obtained, which included 9622 ‘rs’ SNPs and 10 058 ‘non-rs’ SNPs. Among the 19 680 SNP markers, 9673 were polymorphic, which contained 7285 common SNPs (MAF≥5%) and 2388 rare SNPs (MAF<5%). These 7285 common SNP markers were used throughout this study unless specifically described otherwise.

Besides the HapMap ENCODE regions, a total of 2335 common SNP markers (MAF≥5%) genotyped in 190 CEPH individuals across a 4.46 Mb region spanning the MHC region of human chromosome 6 ( 22 ) were also used in the present study as a supplement and for comparison.

### Simulation of 1 SNP/5 kb maps and definition of ‘genotyped’ and ‘hidden’ SNPs

We first constructed a 1 SNP/5 kb map by randomly selecting markers from the full set of 7285 SNPs with MAF ≥5% in the 10 ENCODE regions. Three such random marker sets of 1 SNP/5 kb were produced for each of the 10 ENCODE regions. SNPs selected into each of the three 1 SNP/5 kb maps were regarded as ‘genotyped’ SNPs (a total of 2889 ‘genotyped’ SNPs were included in the three maps), whereas the rest were regarded as ‘un-genotyped’ or ‘hidden’ SNPs (a total of 18 956 ‘hidden’ SNPs were left out of the three maps). Haplotype block analysis (described below) was then carried out on each of these 1 SNP/5 kb maps; all block regions were defined as high LD regions and the remainder as low LD regions. Only regions with five or more markers were used for further analyses, which effectively included 1457 and 169 ‘genotyped’ markers in the high and low LD regions, respectively, across the three 1 SNP/5 kb maps. Corresponding to these ‘genotyped’ markers, a total of 9202 and 1041 ‘hidden’ SNPs were assayed in high and low LD regions, respectively. To examine tagging behaviour on regions regardless of LD structure, the 1 SNP/5 kb maps were also divided into consecutive, non-overlapping windows of 10 SNPs. The last window of each map was allowed to contain ≤10 but ≥5 SNPs. This effectively included 2863 ‘genotyped’ SNPs from the three separate maps, with 14 851 ‘hidden’ SNPs being assayed. For the MHC region, three 1 SNP/5 kb maps, containing 2691 common SNPs, were also created, with high LD (haplotype blocks) and low LD (non-blocks) regions identified for analysis.
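The windowing step can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name is hypothetical, and it assumes that a trailing window with fewer than five SNPs is simply discarded, which the text does not state explicitly.

```python
def split_into_windows(snps, size=10, min_last=5):
    """Split an ordered list of SNPs into consecutive, non-overlapping windows.

    Assumption (not stated in the text): a final window with fewer than
    `min_last` SNPs is dropped rather than merged into the previous window.
    """
    windows = [snps[k:k + size] for k in range(0, len(snps), size)]
    if windows and len(windows[-1]) < min_last:
        windows.pop()
    return windows


# Hypothetical maps of 27 and 23 ordered SNP positions.
sizes_27 = [len(w) for w in split_into_windows(list(range(27)))]
sizes_23 = [len(w) for w in split_into_windows(list(range(23)))]
```

With 27 SNPs the last window keeps its seven markers; with 23 SNPs the three leftover markers are excluded under the stated assumption.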

### Delimiting regions based on LD structure

LD structure affects tagging SNP selection ( 23 ). The effectiveness of tagging SNPs on un-genotyped markers in a region is also likely to be influenced by the local LD level and pattern. In an attempt to make the results more comparable, and the comparison between different tagging methods more straightforward, we dichotomized the three 1 SNP/5 kb maps created for the 10 ENCODE regions and the MHC region into high LD segments and low LD segments. This was done by defining haplotype blocks using HaploView ( 27 ). Although marker density could influence haplotype block boundaries, regions delimited by haplotype blocks were found to be generally in high LD ( 28 ).

### Tagging SNP selection

The pair-wise r2 -based tagging method was based on ldSelect ( 5 ), with the modifications necessary to allow the ‘best’ set of tagging SNPs to be output first. The ‘best’ sets were defined and selected as follows. There were usually multiple candidate tagging SNPs in a bin, any of which could serve as the chosen tagging SNP for all SNPs in that bin. Although each of these alternative tagging SNPs had a pair-wise r2 -value over a defined threshold with all other SNPs in the same bin, their average r2 -values could differ. For example, suppose there was a bin with SNPs A, B, C and D, and A had r2 -values of 0.83, 0.81 and 0.82 with B, C and D, respectively. Similarly, B had r2 -values of 0.83, 0.82 and 0.83 with A, C and D; C had r2 -values of 0.81, 0.82 and 0.79 with A, B and D; and D had r2 -values of 0.82, 0.83 and 0.79 with A, B and C. If the r2 threshold were 0.80, SNPs A and B would be selected as alternative tagging SNPs for the bin. The average r2 of A with the other three SNPs in the bin was 0.820, whereas for B this value was 0.827. Therefore, B could be regarded as a better tag than A and was output first. In this study, for bins with multiple alternative tagging SNPs, the average pair-wise r2 between each of them and all other SNPs in the same bin was calculated. The tagging SNP with the highest average r2 -value was selected from each bin to create the best tagging SNP set. This was done to rule out the possibility that any gain in tagging effectiveness, when the number of shared tags between different tagging methods was maximized, was due to the most effective tags not having been used in the first place.
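The worked example with SNPs A to D can be reproduced with a few lines of code. The sketch below is only an illustration of the ranking rule described above; it is not the modified ldSelect code, and the function names are hypothetical. It assumes that the bin contains at least two SNPs.

```python
# Pair-wise r2 values from the worked example (symmetric, stored once per pair).
R2 = {
    ("A", "B"): 0.83, ("A", "C"): 0.81, ("A", "D"): 0.82,
    ("B", "C"): 0.82, ("B", "D"): 0.83, ("C", "D"): 0.79,
}


def pair_r2(x, y):
    """Look up the r2 value of an unordered SNP pair."""
    return R2[(x, y)] if (x, y) in R2 else R2[(y, x)]


def rank_alternative_tags(bin_snps, threshold):
    """Return (SNP, average r2) for every SNP that exceeds the threshold with
    all other SNPs in the bin, sorted so that the 'best' tag comes first."""
    ranked = []
    for snp in bin_snps:
        values = [pair_r2(snp, other) for other in bin_snps if other != snp]
        if min(values) >= threshold:
            ranked.append((snp, sum(values) / len(values)))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


ranked = rank_alternative_tags(["A", "B", "C", "D"], threshold=0.80)
best_tag = ranked[0][0]
```

C and D are excluded because their mutual r2 of 0.79 falls below the threshold, leaving B (average about 0.827) ranked ahead of A (average 0.820).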

The haplotype r2 -based tagging method was based on the minimum haplotype r2 option implemented in TagIT ( 4 ). In a similar way, this option was also modified to select tagging SNPs based on minimum haplotype r2 between a non-tagging SNP and a candidate set of tagging SNPs and allow the ‘best’ set of tagging SNPs to be output first. The major haplotype-based tagging method was a standalone version of SNPTagger ( 14 ) without any modification. For these two methods, haplotypes were inferred for each individual region, and this was carried out by using Merlin and fugue ( 29 ).

The three tagging methods have different mechanisms and criteria for selecting tagging SNPs. For simplicity, we started tagging SNP selection with all three methods from a threshold of 0.5. This starting threshold had a different meaning for each method, representing pair-wise r2 =0.5, haplotype r2 =0.5 and haplotype diversity=50% (i.e. top 50% most common haplotypes), respectively. Afterwards, the threshold was increased until it reached 1.00 (either complete correlation or full haplotype diversity).

### Measures of tagging efficiency and tagging effectiveness

‘Tagging efficiency’ was used to assess the savings in genotyping ( 23 ). In short, it was defined as $n/n_h$, where $n$ is the total number of markers genotyped in a region and $n_h$ the number of htSNPs selected to cover the region. ‘Tagging effectiveness’ was used to evaluate how well tag SNPs performed on hidden SNPs in a region. In this study, haplotype r2 ( 4 ) was used to define the effectiveness of tagging SNPs, unless stated otherwise. More specifically, ‘tagging effectiveness’ was defined as the percentage of hidden SNPs in a region that had a haplotype r2 -value equal to or greater than a given threshold with the tagging SNPs. To calculate the value of haplotype r2 for the $i$th SNP, the following formula was used:

$r_i^2 = 1 - \frac{2m^2 \sum_g f_{gi}(h_g - f_{gi})/h_g}{2m^2 f_i(1 - f_i)}$

where $m$ is the total number of chromosomes observed, $f_i$ the frequency of allele 1 at locus $i$, $f_{gi}$ the frequency of that allele on the $g$th haplotype and $h_g$ the haplotype frequency of the $g$th htSNP-defined group.
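The two measures and the haplotype r2 formula can be sketched in code as follows. This is a minimal illustration rather than the implementation used in the study. It assumes phased haplotypes coded as tuples of 0/1 alleles, interprets $f_{gi}$ as the joint frequency of allele 1 at locus $i$ within the $g$th htSNP-defined group (so that the $f_{gi}$ sum to $f_i$), and cancels the common factor $2m^2$. All function names and the toy haplotypes are hypothetical.

```python
from collections import defaultdict


def haplotype_r2(haplotypes, tag_idx, i):
    """Haplotype r2 between SNP i and the groups defined by the tagging SNPs."""
    m = len(haplotypes)
    group_size = defaultdict(int)      # chromosomes per htSNP-defined group
    group_allele1 = defaultdict(int)   # of those, how many carry allele 1 at i
    for hap in haplotypes:
        key = tuple(hap[j] for j in tag_idx)
        group_size[key] += 1
        group_allele1[key] += hap[i]
    f_i = sum(group_allele1.values()) / m
    if f_i in (0.0, 1.0):
        raise ValueError("SNP i is monomorphic; r2 is undefined")
    unexplained = 0.0
    for key, count in group_size.items():
        h_g = count / m
        f_gi = group_allele1[key] / m
        unexplained += f_gi * (h_g - f_gi) / h_g
    return 1.0 - unexplained / (f_i * (1.0 - f_i))


def tagging_efficiency(n_genotyped, n_tags):
    """n / n_h: markers genotyped divided by tagging SNPs selected."""
    return n_genotyped / n_tags


def tagging_effectiveness(haplotypes, tag_idx, hidden_idx, threshold=0.80):
    """Percentage of hidden SNPs with haplotype r2 >= threshold."""
    detected = sum(
        1 for i in hidden_idx if haplotype_r2(haplotypes, tag_idx, i) >= threshold
    )
    return 100.0 * detected / len(hidden_idx)


# Toy data: four chromosomes, three SNPs; SNP 0 is the only tagging SNP.
haps = [(0, 0, 0), (0, 0, 0), (1, 1, 0), (1, 1, 1)]
r2_snp1 = haplotype_r2(haps, tag_idx=[0], i=1)
r2_snp2 = haplotype_r2(haps, tag_idx=[0], i=2)
effectiveness = tagging_effectiveness(haps, tag_idx=[0], hidden_idx=[1, 2])
efficiency = tagging_efficiency(n_genotyped=3, n_tags=1)
```

In the toy data SNP 1 is perfectly predicted by the tag and SNP 2 only partly, so one of the two hidden SNPs passes the 0.80 threshold. With a single tagging SNP the haplotype r2 reduces to the ordinary pair-wise r2.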

## SUPPLEMENTARY MATERIAL

Supplementary Material is available at HMG Online.

## ACKNOWLEDGEMENTS

This work was supported by the Wellcome Trust and by NIH grant EY-15652. We wish to thank the International HapMap Project for the ENCODE data.

Conflict of Interest statement. None declared.

Figure 1. Tagging efficiency and tagging effectiveness in high and low LD regions using three different tagging methods. Tagging efficiency was defined as the total number of markers genotyped divided by the number of htSNPs required. Tagging effectiveness was defined as the percentage of hidden SNPs which had haplotype correlations with tagging SNPs over a threshold (haplotype r2 =0.80). The results indicate average efficiency and effectiveness obtained across all high and low LD regions in the 1 SNP/5 kb marker sets of the ENCODE regions. ( A ) Tagging efficiency in high LD regions; ( B ) tagging effectiveness in high LD regions; ( C ) tagging efficiency in low LD regions; ( D ) tagging effectiveness in low LD regions. Lines with solid squares denote the haplotype-diversity-based tagging method, lines with solid diamonds denote the minimum haplotype r2 -based method and lines with white circles denote the pair-wise r2 -based method.

Figure 2. Tagging effectiveness and effects of minor allele frequency (MAF) in high LD ( A–C ) and low LD ( D ) regions using three different tagging methods at the same level of tagging efficiency. Tagging effectiveness was defined as the percentage of hidden SNPs which had haplotype correlations with tagging SNPs over a threshold (haplotype r2 =0.80). The results indicate average efficiency and effectiveness obtained across all high LD regions (≥5 SNPs) in the 1 SNP/5 kb marker sets of the ENCODE regions. White bars denote the haplotype-diversity-based tagging method, grey bars denote the minimum haplotype r2 -based method and dark bars denote the pair-wise r2 -based method. ( A ) Tagging efficiency at 3.0 (haplotype diversity=93%, haplotype r2 =0.78 and pair-wise r2 =0.50); ( B ) tagging efficiency at 2.5 (haplotype diversity=97%, haplotype r2 =0.90 and pair-wise r2 =0.72); ( C ) tagging efficiency at 2.0 (haplotype diversity=100%, haplotype r2 =1.00 and pair-wise r2 =0.90); ( D ) tagging efficiency at 1.18 in low LD regions (haplotype diversity=100%, haplotype r2 =1.00 and pair-wise r2 =0.70).

Figure 3. Applying tagging SNPs from a HapMap to rare hidden SNPs (MAF<5%) in the ENCODE regions. Tagging SNPs were selected using all three tagging methods at different thresholds from a 1 SNP/5 kb HapMap, which either included or excluded such rare SNPs. White bars with dashed borders denote high LD regions and the exclusion of rare SNPs; white bars with solid borders denote low LD regions and the exclusion of rare SNPs; dashed bars denote high LD regions and the inclusion of rare SNPs; dark bars denote low LD regions and the inclusion of rare SNPs.

Table 1.

Tagging space of different tagging methods on genotyped SNPs in high LD regions of ENCODE regions

| Tag set selected by | | Tagging efficiency of first sets from both methods | Tags shared (%) a | Genotyped SNPs detected (%) b | Detected SNPs shared (%) c | Tagging efficiency of ‘maximum shared’ sets (%) d | Tags shared (%) a | Genotyped SNPs detected (%) b | Detected SNPs shared (%) c | Genotyped SNPs detected by (first+ran) (%) e | Detected SNPs shared (%) c |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Hap r2 0.78 | HapDiv 0.93 | 3.0 vs. 3.0 | 73.1 | 99.3 vs. 93.6 | 99.5 | 2.88 vs. 2.96 | 95.4 | 99.3 vs. 95.1 | 99.9 | (99.4±0.2) vs. (93.8±0.1) | 99.8±0.1 |
| Hap r2 0.78 | HapDiv 0.97 | 3.0 vs. 2.5 | 81.0 | 99.3 vs. 98.5 | 99.5 | 2.88 vs. 2.49 | 98.1 | 99.2 vs. 98.6 | 99.5 | (99.4±0.2) vs. 98.5 | 99.5 |
| Hap r2 0.90 | HapDiv 0.97 | 2.5 vs. 2.5 | 82.1 | 100.0 vs. 98.5 | 100.0 | 2.44 vs. 2.48 | 95.2 | 100.0 vs. 98.6 | 100.0 | 100.0 vs. 98.5 | 100.0 |
| Hap r2 0.90 | HapDiv 1.00 | 2.5 vs. 2.0 | 94.1 | 100.0 vs. 100.0 | 100.0 | 2.44 vs. 2.0 | 98.8 | 100.0 vs. 100.0 | 100.0 | 100.0 vs. 100.0 | 100.0 |
| Hap r2 0.78 | r2 0.50 | 3.0 vs. 3.0 | 61.7 | 99.3 vs. 91.6 | 99.7 | 2.98 vs. 3.0 | 88.7 | 99.2 vs. 92.0 | 99.8 | (99.4±0.2) vs. (91.6±0.1) | 99.7 |
| Hap r2 0.78 | r2 0.72 | 3.0 vs. 2.5 | 76.0 | 99.3 vs. 97.0 | 99.6 | 2.93 vs. 2.5 | 95.3 | 99.5 vs. 97.0 | 99.8 | (99.4±0.2) vs. 97.0 | 99.7 |
| Hap r2 0.90 | r2 0.72 | 2.5 vs. 2.5 | 68.5 | 100.0 vs. 97.0 | 100.0 | 2.47 vs. 2.5 | 89.3 | 100.0 vs. 96.7 | 100.0 | 100.0 vs. 97.0 | 100.0 |
| Hap r2 0.90 | r2 0.90 | 2.5 vs. 2.0 | 85.1 | 100.0 vs. 99.9 | 100.0 | 2.43 vs. 2.0 | 98.5 | 100.0 vs. 100.0 | 100.0 | 100.0 vs. 100.0 | 100.0 |

a The percentage of tags in the smaller set shared by the larger set.

b ‘Detected’ means haplotype r2 ≥0.80 between a genotyped SNP and a set of tagging SNPs. Both non-tagging and tagging SNPs were evaluated together.

c Each tagging SNP set produced a set of ‘detected’ SNPs. The value here indicated the percentage of SNPs in the smaller ‘detected’ set shared by the larger ‘detected’ set.

d By comparing each set of tagging SNPs produced by two different tagging methods, the pair of tagging SNP sets sharing maximum proportion of tags was obtained.

e By selecting random markers from non-tagging SNPs and adding them to the first tagging SNP set (column 3), values of tagging efficiency equal to those in column 7 could be reached. Values shown were the averages of five assessments. Standard deviations were shown only where results differed among the five assessments. Both tagging and non-tagging SNPs were included in the evaluation.

Table 2.

Tagging space of different tagging methods on hidden SNPs in high LD regions of ENCODE regions

| Tag set selected by | | Tagging efficiency of first sets from both methods | Tags shared (%) a | Hidden SNPs detected (%) b | Detected SNPs shared (%) c | Tagging efficiency of ‘maximum shared’ sets (%) d | Tags shared (%) a,d | Hidden SNPs detected (%) b,d | Detected SNPs shared (%) c,d | Hidden SNPs detected by (first+ran) (%) e | Detected SNPs shared (%) c,e |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Hap r2 0.78 | HapDiv 0.93 | 3.0 vs. 3.0 | 73.1 | 79.9 vs. 77.3 | 96.4 | 2.88 vs. 2.96 | 95.4 | 80.3 vs. 77.8 | 99.1 | (80.4±0.2) vs. (77.3±0.1) | 97.6±0.1 |
| Hap r2 0.78 | HapDiv 0.97 | 3.0 vs. 2.5 | 81.0 | 79.9 vs. 81.9 | 98.0 | 2.88 vs. 2.49 | 98.1 | 80.7 vs. 82.0 | 98.2 | (80.4±0.2) vs. 81.9 | 98.0 |
| Hap r2 0.90 | HapDiv 0.97 | 2.5 vs. 2.5 | 82.1 | 83.6 vs. 81.9 | 99.6 | 2.44 vs. 2.48 | 95.2 | 83.6 vs. 82.0 | 99.7 | (83.6±0.1) vs. 81.9 | 99.6 |
| Hap r2 0.90 | HapDiv 1.00 | 2.5 vs. 2.0 | 94.1 | 83.6 vs. 84.0 | 100 | 2.44 vs. 2.0 | 98.8 | 83.7 vs. 84.0 | 100 | (83.6±0.1) vs. 84.0 | 100 |
| Hap r2 0.78 | r2 0.50 | 3.0 vs. 3.0 | 61.7 | 79.9 vs. 73.4 | 97.2 | 2.98 vs. 3.0 | 88.7 | 79.7 vs. 73.5 | 98.3 | (80.4±0.2) vs. 73.4 | 97.4 |
| Hap r2 0.78 | r2 0.72 | 3.0 vs. 2.5 | 76.0 | 79.9 vs. 79.7 | 96.5 | 2.93 vs. 2.5 | 95.3 | 80.5 vs. 80.4 | 97.8 | (80.4±0.2) vs. 79.7 | 96.8 |
| Hap r2 0.90 | r2 0.72 | 2.5 vs. 2.5 | 68.5 | 83.6 vs. 79.7 | 99.8 | 2.47 vs. 2.5 | 89.3 | 83.6 vs. 79.5 | 99.7 | (83.6±0.1) vs. 79.7 | 99.8 |
| Hap r2 0.90 | r2 0.90 | 2.5 vs. 2.0 | 85.1 | 83.6 vs. 83.5 | 99.6 | 2.43 vs. 2.0 | 98.5 | 83.7 vs. 83.5 | 99.7 | (83.6±0.1) vs. 83.5 | 99.6 |

a The percentage of tags in the smaller set shared by the larger set.

b ‘Detected’ means haplotype r2 ≥0.80 between a hidden SNP and a set of tagging SNPs.

c Each tagging SNP set produced a set of ‘detected’ hidden SNPs. The value here indicated the percentage of SNPs in the smaller ‘detected’ set shared by the larger ‘detected’ set.

d By comparing each set of tagging SNPs produced by two different tagging methods, the pair of tagging SNP sets sharing maximum proportion of tags was obtained.

e By selecting random markers from non-tagging SNPs and adding them to the first tagging SNP set (column 3), values of tagging efficiency equal to those in column 7 could be reached. Values shown were the averages of five assessments. Standard deviations were shown only where results differed among the five assessments.

Table 3.

Tagging space of different tagging methods on hidden SNPs in 10-SNP windows of ENCODE regions

| Tag set selected by | | Tagging efficiency of first sets from both methods | Tags shared (%) a | Hidden SNPs detected (%) b | Detected SNPs shared (%) c | Tagging efficiency of ‘maximum shared’ sets (%) d | Tags shared (%) a,d | Hidden SNPs detected (%) b,d | Detected SNPs shared (%) c,d | Hidden SNPs detected by (first+ran) (%) e | Detected SNPs shared (%) c,e |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Evaluation criterion: haplotype r2 ≥0.80 | | | | | | | | | | | |
| Hap r2 0.85 | HapDiv 0.94 | 2.05 vs. 2.05 | 83.4 | 81.4 vs. 78.8 | 98.5 | 1.99 vs. 2.03 | 95.7 | 81.3 vs. 79.6 | 99.1 | (81.5±0.1) vs. (79.1±0.1) | 98.7 |
| Hap r2 0.85 | HapDiv 1.00 | 1.643 vs. 2.05 | 94.0 | 81.4 vs. 83.5 | 100 | 1.638 vs. 1.98 | 99.3 | 81.7 vs. 83.5 | 100.0 | (81.5±0.1) vs. 83.5 | 100 |
| Hap r2 0.85 | r2 0.51 | 2.05 vs. 2.05 | 72.4 | 81.4 vs. 75.9 | 98.4 | 2.01 vs. 2.05 | 91.5 | 81.3 vs. 77.0 | 99.0 | (81.5±0.1) vs. (75.9±0.1) | 98.1 |
| Hap r2 0.85 | r2 0.81 | 2.05 vs. 1.643 | 88.8 | 81.4 vs. 82.7 | 99.3 | 1.96 vs. 1.643 | 98.9 | 81.7 vs. 82.8 | 99.7 | (81.5±0.1) vs. 82.7 | 99.3 |
| Hap r2 1.00 | r2 0.81 | 1.643 vs. 1.643 | 82.2 | 83.5 vs. 82.7 | 100 | 1.623 vs. 1.643 | 91.3 | 83.4 vs. 82.8 | 100 | 83.5 vs. 82.7 | 100 |
| Evaluation criterion: pair-wise r2 ≥0.80 | | | | | | | | | | | |
| Hap r2 0.85 | HapDiv 0.94 | 2.05 vs. 2.05 | 83.4 | 47.2 vs. 46.3 | 87.2 | 1.99 vs. 2.03 | 95.7 | 47.5 vs. 45.7 | 98.0 | (48.5±0.3) vs. (46.7±0.2) | 87.0±0.1 |
| Hap r2 0.85 | HapDiv 1.00 | 1.643 vs. 2.05 | 94.0 | 47.2 vs. 56.7 | 96.5 | 1.638 vs. 1.98 | 99.3 | 49.2 vs. 56.4 | 99.3 | (48.5±0.3) vs. 56.8 | 96.0 |
| Hap r2 0.85 | r2 0.51 | 2.05 vs. 2.05 | 72.4 | 47.2 vs. 54.4 | 89.0 | 2.01 vs. 2.05 | 91.5 | 48.3 vs. 51.7 | 94.3 | (48.5±0.3) vs. (54.4±0.1) | 89.1±0.1 |
| Hap r2 0.85 | r2 0.81 | 2.05 vs. 1.643 | 88.8 | 47.2 vs. 63.9 | 99.3 | 1.96 vs. 1.643 | 98.9 | 51.0 vs. 63.8 | 99.9 | (48.5±0.3) vs. 63.9 | 99.2 |
| Hap r2 1.00 | r2 0.81 | 1.643 vs. 1.643 | 82.2 | 56.8 vs. 63.9 | 99.1 | 1.623 vs. 1.643 | 91.3 | 59.4 vs. 63.8 | 99.4 | 57.5 vs. 63.9 | 99.1 |

a The percentage of tags in the smaller set shared by the larger set.

b ‘Detected’ means haplotype or pair-wise r2 ≥0.80 between a hidden SNP and a set of tagging SNPs.

c Each tagging SNP set produced a set of ‘detected’ hidden SNPs. The value here indicated the percentage of SNPs in the smaller ‘detected’ set shared by the larger ‘detected’ set.

d By comparing each set of tagging SNPs produced by two different tagging methods, the pair of tagging SNP sets sharing maximum proportion of tags was obtained.

e By selecting random markers from non-tagging SNPs and adding them to the first tagging SNP set (column 3), values of tagging efficiency equal to those in column 7 could be reached. Values shown were the averages of five assessments. Standard deviations were shown only where results differed among the five assessments.

## References

1. Johnson, G.C., Esposito, L., Barratt, B.J., Smith, A.N., Heward, J., Di Genova, G., Ueda, H., Cordell, H.J., Eaves, I.A., Dudbridge, F. et al. (2001) Haplotype tagging for the identification of common disease genes. Nat. Genet., 29, 233–237.
2. Chapman, J.M., Cooper, J.D., Todd, J.A. and Clayton, D.G. (2003) Detecting disease associations due to linkage disequilibrium using haplotype tags: a class of tests and the determinants of statistical power. Hum. Hered., 56, 18–31.
3. Goldstein, D.B., Ahmadi, K.R., Weale, M.E. and Wood, N.W. (2003) Genome scans and candidate gene approaches in the study of common diseases and variable drug responses. Trends Genet., 19, 615–622.
4. Weale, M.E., Depondt, C., Macdonald, S.J., Smith, A., Lai, P.S., Shorvon, S.D., Wood, N.W. and Goldstein, D.B. (2003) Selection and evaluation of tagging SNPs in the neuronal-sodium-channel gene SCN1A: implications for linkage-disequilibrium gene mapping. Am. J. Hum. Genet., 73, 551–565.
5. Carlson, C.S., Eberle, M.A., Rieder, M.J., Yi, Q., Kruglyak, L. and Nickerson, D.A. (2004) Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am. J. Hum. Genet., 74, 106–120.
6. Lowe, C.E., Cooper, J.D., Chapman, J.M., Barratt, B.J., Twells, R.C., Green, E.A., Savage, D.A., Guja, C., Ionescu-Tirgoviste, C., Tuomilehto-Wolf, E. et al. (2004) Cost-effective analysis of candidate genes using htSNPs: a staged approach. Genes Immun., 5, 301–305.
7. Zhang, K., Qin, Z.S., Liu, J.S., Chen, T., Waterman, M.S. and Sun, F. (2004) Haplotype block partitioning and tag SNP selection using genotype data and their applications to association studies. Genome Res., 14, 908–916.
8. Ahmadi, K.R., Weale, M.E., Xue, Z.Y., Soranzo, N., Yarnall, D.P., Briley, J.D., Maruyama, Y., Kobayashi, M., Wood, N.W., Spurr, N.K. et al. (2005) A single-nucleotide polymorphism tagging set for human drug metabolism and transport. Nat. Genet., 37, 84–89.
9. Gibbs, R.A., Belmont, J.W., Hardenbol, P., Willis, T.D., Yu, F., Yang, H., Ang, L., Huang, W., Liu, B., Shen, Y. et al. (2003) The International HapMap Project. Nature, 426, 789–796.
10. Barratt, B.J., Payne, F., Rance, H.E., Nutland, S., Todd, J.A. and Clayton, D.G. (2002) Identification of the sources of error in allele frequency estimations from pooled DNA indicates an optimal experimental design. Ann. Hum. Genet., 66, 393–405.
11. Zhang, K., Deng, M., Chen, T., Waterman, M.S. and Sun, F. (2002) A dynamic programming algorithm for haplotype block partitioning. 99, 7335–7339.
12. Cousin, E., Genin, E., Mace, S., Ricard, S., Chansac, C., del Zompo, M. and Deleuze, J.F. (2003) Association studies in candidate genes: strategies to select SNPs to be tested. Hum. Hered., 56, 151–159.
13. Hampe, J., Schreiber, S. and Krawczak, M. (2003) Entropy-based SNP selection for genetic association studies. Hum. Genet., 114, 36–43.
14. Ke, X. and Cardon, L.R. (2003) Efficient selective screening of haplotype tag SNPs. Bioinformatics, 19, 287–288.
15. Meng, Z., Zaykin, D.V., Xu, C.F., Wagner, M. and Ehm, M.G. (2003) Selection of genetic markers for association analyses, using linkage disequilibrium and haplotypes. Am. J. Hum. Genet., 73, 115–130.
16. Sebastiani, P., Lazarus, R., Weiss, S.T., Kunkel, L.M., Kohane, I.S. and Ramoni, M.F. (2003) Minimal haplotype tagging. 100, 9900–9905.
17. Wiuf, C., Laidlaw, Z. and Stumpf, M.P. (2003) Some notes on the combinatorial properties of haplotype tagging. Math. Biosci., 185, 205–216.
18. Zhang, K., Sun, F., Waterman, M.S. and Chen, T. (2003) Haplotype block partition with limited resources and applications to human chromosome 21 haplotype data. Am. J. Hum. Genet., 73, 63–73.
19. Ao, S.I., Yip, K., Ng, M., Cheung, D., Fong, P.Y., Melhado, I. and Sham, P.C. (2004) CLUSTAG: hierarchical clustering and graph methods for selecting tag SNPs. Bioinformatics, 21, 1735–1736.
20. Halldorsson, B.V., Bafna, V., Lippert, R., Schwartz, R., De La Vega, F.M., Clark, A.G. and Istrail, S. (2004) Optimal haplotype block-free selection of tagging SNPs for genome-wide association studies. Genome Res., 14, 1633–1640.
21. Ackerman, H., Usen, S., Mott, R., Richardson, A., Sisay-Joof, F., Katundu, P., Taylor, T., Ward, R., Molyneux, M., Pinder, M. and Kwiatkowski, D.P. (2003) Haplotypic analysis of the TNF locus by association efficiency and entropy. Genome Biol., 4, R24.
22. Miretti, M.M., Walsh, E.C., Ke, X., Delgado, M., Griffiths, M., Hunt, S., Morrison, J., Wittaker, P., Rioux, J.D., Lander, E.S. et al. (2005) A high resolution linkage disequilibrium map of the human major histocompatibility complex in Caucasians and first generation of tag SNPs. Am. J. Hum. Genet., 76, 634–646.
23. Ke, X., Durrant, C., Morris, A.P., Hunt, S., Bentley, D.R., Deloukas, P. and Cardon, L.R. (2004) Efficiency and consistency of haplotype tagging of dense SNP maps in multiple samples. Hum. Mol. Genet., 13, 2557–2565.
24. Abecasis, G.R., Cookson, W.O. and Cardon, L.R. (2001) The power to detect linkage disequilibrium with quantitative traits in selected samples. Am. J. Hum. Genet., 68, 1463–1474.
25. Carlson, C.S., Eberle, M.A., Rieder, M.J., Smith, J.D., Kruglyak, L. and Nickerson, D.A. (2003) Additional SNPs and linkage-disequilibrium analyses are necessary for whole-genome association studies in humans. Nat. Genet., 33, 518–521.
26. Zondervan, K.T. and Cardon, L.R. (2004) The complex interplay among factors that influence allelic association. Nat. Rev. Genet., 5, 89–100.
27. Barrett, J.C., Fry, B., Maller, J. and Daly, M.J. (2005) Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics, 21, 263–265.
28. Ke, X., Hunt, S., Tapper, W., Lawrence, R., Stavrides, G., Ghori, J., Whittaker, P., Collins, A., Morris, A.P., Bentley, D. et al. (2004) The impact of SNP density on fine-scale patterns of linkage disequilibrium. Hum. Mol. Genet., 13, 577–588.
29. Abecasis, G.R., Cherny, S.S., Cookson, W.O. and Cardon, L.R. (2003) Merlin—rapid analysis of dense genetic maps using sparse gene flow trees. Nat. Genet., 30, 97–101.