Abstract

Summary: SNPLINK is a Perl script that performs full genome linkage analysis of high-density single nucleotide polymorphism (SNP) marker sets. The presence of linkage disequilibrium (LD) between closely spaced SNP markers can falsely inflate linkage statistics. SNPLINK removes LD from the marker sets in an automated fashion before carrying out linkage analysis. SNPLINK can compute both parametric and non-parametric statistics, utilizing the freely available Allegro and Merlin software. Graphical outputs of whole genome multipoint linkage statistics are provided allowing comparison of results before and after the removal of LD.

Availability: SNPLINK is freely available for non-commercial research institutions. For full details see www.icr.ac.uk/cancgen/molgen/MolPopGen_Bioinformatics.htm

Contact: richard.houlston@icr.ac.uk

The recent availability of dense single nucleotide polymorphism (SNP) maps, coupled with technological and cost developments in highly parallel genotyping, now makes it practical to use these markers for genomewide linkage searches (Matise et al., 2003). The advantages of conducting searches using SNPs over conventional microsatellite scans based on markers spaced at 10 Mb intervals have been well documented (Evans and Cardon, 2004). Specifically, SNPs are far more abundant than microsatellites throughout the genome and therefore yield a higher information content, giving superior power to detect linkage when typed at a high density (Evans and Cardon, 2004). Furthermore, the density of SNP marker maps offers the potential for improved localization of disease genes (Kruglyak, 1997; John et al., 2004).

A number of analytical platforms capable of simultaneously scoring large numbers of SNPs (such as the Affymetrix GeneChip® Human Mapping 10K Array) have recently become available to mainstream genetic researchers. The performance of these SNP arrays, both in terms of efficiency and precision, indicates that they are likely to become the dominant technology for performing genomewide linkage searches (Sellick et al., 2004).

The linkage software packages currently in popular use, such as Genehunter (Kruglyak et al., 1996), Merlin (Abecasis et al., 2002) and Allegro (Gudbjartsson et al., 2000), generate multipoint statistics under the assumption of linkage equilibrium between markers. While this assumption nearly always holds for more widely and evenly spaced microsatellite markers, linkage disequilibrium (LD) declines with increasing distance between markers (Dawson et al., 2002; Ke et al., 2004) and is therefore likely to exist between tightly spaced markers. In the Affymetrix GeneChip® Human Mapping 10K Array, SNPs are often found in clusters, with distances as small as 1 × 10⁻⁶ Mb between consecutive markers.

The very property of SNP maps that affords the advantages detailed above therefore leads to the potential problem of LD between markers affecting linkage statistics (Evans and Cardon, 2004; Schaid et al., 2002). This is not an issue if the genotypes of all family members are included in the analysis. However, if data are missing, for example, if parental genotypes are unknown, existing linkage software programs estimate the probability distribution of unknown genotypes by assigning equal probabilities to all possible inheritance vectors. Therefore, if markers are in LD (and are thus more likely to be inherited together), certain haplotypes may occur more frequently than expected under linkage equilibrium, thereby falsely inflating linkage statistics. While this type of bias can be reduced by including the genotypes of additional unaffected pedigree members in the analysis (Huang et al., 2004), it cannot be entirely eliminated and may even increase as the number of consecutive markers in LD increases.

Since the use of dense SNP marker sets in linkage analysis is a very recent development, no general approach has been proposed to deal with the existence of LD. Ideally, programs should incorporate LD between markers into the likelihood calculations so that expected haplotype frequencies are correctly estimated. In the absence of such software, other approaches are required to address the issue in the interim.

To this end, we have developed a Perl script, SNPLINK, that interfaces with the linkage software Merlin and Allegro to undertake automated analyses of genomewide linkage scans and, unlike other programs, addresses the issue of LD. SNPLINK accepts genotype data in standard LINKAGE format as generated by the GeneChip® DNA Analysis Software (GDAS v3.0.2) from Affymetrix or by the pedigree database program ProgenyLab 6.0 (Progeny Software Inc, IN). Non-Mendelian errors are identified and removed using the Merlin option error. SNPLINK then implements full-genome multipoint linkage analysis using Allegro for parametric analyses and Merlin for non-parametric analyses. Both Allegro and Merlin are capable of analysing large numbers of markers (tested to a maximum of 945 SNPs), whereas other linkage programs, such as Genehunter (version 2.1), are restricted to fewer markers than are available per chromosome in the 10K array design. LD is then removed by considering each set of markers in LD (defined as a set in which each consecutive marker pair is found to be in LD) and retaining one SNP from each set, chosen as the middle SNP of the set. Linkage analysis is then re-run using the new LD-free set of markers. SNPLINK performs this process in a fully automated fashion, and no user input is required at runtime. Users may specify the basis on which LD is estimated and the criteria by which markers in LD are excluded. Graphical outputs of the whole genome multipoint NPL and LOD scores, comparing results before and after the removal of LD, are produced using the R statistical software (www.r-project.org) (Fig. 1).
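The marker-thinning rule described above can be illustrated with a short sketch. This is not the SNPLINK source (which is written in Perl); the function name `prune_ld`, the input layout and the tie-breaking rule for sets of even size (the lower of the two middle markers is kept) are assumptions made for illustration only.

```python
def prune_ld(markers, pair_ld, threshold):
    """Keep one marker per run of consecutive markers in LD.

    markers   : marker names ordered along one chromosome
    pair_ld   : pair_ld[i] is the LD value between markers[i] and markers[i + 1]
    threshold : consecutive pairs with LD above this value are treated as "in LD"

    Assumption: for a run of even length the lower middle marker is retained.
    """
    if len(pair_ld) != max(len(markers) - 1, 0):
        raise ValueError("pair_ld must have one value per consecutive marker pair")
    if not markers:
        return []

    kept = []
    run = [markers[0]]  # current set of consecutive markers in LD
    for i, ld in enumerate(pair_ld):
        if ld > threshold:
            run.append(markers[i + 1])          # extend the current LD set
        else:
            kept.append(run[(len(run) - 1) // 2])  # close the set, keep its middle
            run = [markers[i + 1]]
    kept.append(run[(len(run) - 1) // 2])       # close the final set
    return kept


# Hypothetical marker names and LD values, for illustration only.
example_markers = ["m1", "m2", "m3", "m4", "m5", "m6"]
example_ld = [0.1, 0.9, 0.8, 0.2, 0.3]
print(prune_ld(example_markers, example_ld, threshold=0.7))
# -> ['m1', 'm3', 'm5', 'm6']
```

In this example m2, m3 and m4 form a single LD set, so only m3 is retained; markers that are not in LD with either neighbour form sets of size one and are always kept.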

The two most commonly used measures of LD are D′ and r². The properties of both statistics have previously been discussed extensively (Hedrick, 1987; Devlin and Risch, 1995). The behaviour of these statistics is affected by a number of factors, which can bias the accuracy of LD estimation. In particular, D′ is more robust to small minor allele frequencies and r² is more robust to small sample size. Both measures are available in SNPLINK and are calculated using the EM algorithm. Because the linkage phase of the SNPs is not directly observed, the approach is simplified by ignoring relationships among family members. Although this simplification may result in a loss of efficiency, it is conservative.
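As an illustration of how D′ and r² can be obtained from unphased genotypes, the following is a minimal sketch of a standard two-locus EM estimate of haplotype frequencies that treats all individuals as unrelated. It is not taken from SNPLINK; the function names, the 0/1/2 genotype coding (copies of an arbitrary reference allele at each SNP) and the reporting of D′ as an absolute value are assumptions.

```python
def em_haplotype_freqs(genotypes, n_iter=100, tol=1e-10):
    """EM estimate of two-SNP haplotype frequencies from unphased genotypes.

    genotypes: iterable of (g1, g2), each 0, 1 or 2 copies of the reference
    allele at SNP 1 and SNP 2 (assumed coding). Individuals are treated as
    unrelated. Returns a dict keyed by (allele at SNP 1, allele at SNP 2).
    """
    base = {(1, 1): 0.0, (1, 0): 0.0, (0, 1): 0.0, (0, 0): 0.0}
    n_people = 0
    n_double_het = 0  # only double heterozygotes have ambiguous phase
    for g1, g2 in genotypes:
        n_people += 1
        if g1 == 1 and g2 == 1:
            n_double_het += 1
            continue
        a = [1, 1] if g1 == 2 else [0, 0] if g1 == 0 else [1, 0]
        b = [1, 1] if g2 == 2 else [0, 0] if g2 == 0 else [1, 0]
        # At most one locus is heterozygous here, so the pairing is unambiguous.
        for x, y in zip(a, b):
            base[(x, y)] += 1
    if n_people == 0:
        raise ValueError("no genotypes supplied")

    total = 2 * n_people
    p = {h: 0.25 for h in base}
    for _ in range(n_iter):
        # E-step: probability that a double heterozygote carries 11/00
        cis = p[(1, 1)] * p[(0, 0)]
        trans = p[(1, 0)] * p[(0, 1)]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: update frequencies from expected haplotype counts
        new = {
            (1, 1): (base[(1, 1)] + n_double_het * w) / total,
            (0, 0): (base[(0, 0)] + n_double_het * w) / total,
            (1, 0): (base[(1, 0)] + n_double_het * (1 - w)) / total,
            (0, 1): (base[(0, 1)] + n_double_het * (1 - w)) / total,
        }
        converged = max(abs(new[h] - p[h]) for h in p) < tol
        p = new
        if converged:
            break
    return p


def ld_measures(p):
    """Return (|D'|, r^2) from haplotype frequencies; (0, 0) if a SNP is monomorphic."""
    p_a = p[(1, 1)] + p[(1, 0)]
    p_b = p[(1, 1)] + p[(0, 1)]
    d = p[(1, 1)] - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0, 0.0
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max, d * d / denom


# Hypothetical genotypes in which the two SNPs always agree.
example = [(0, 0), (2, 2), (1, 1), (1, 1), (0, 0), (2, 2)]
d_prime, r2 = ld_measures(em_haplotype_freqs(example))
print(round(d_prime, 3), round(r2, 3))
```

Only double heterozygotes contribute uncertainty about phase, so the E-step reduces to a single weight; all other genotypes contribute fixed haplotype counts.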

SNPLINK allows the user to specify the definition of high LD to be used in the analysis. Various authors have advocated that values greater than 0.7 and 0.4 for D′ and r², respectively, be used to define significant LD (John et al., 2004; Schaid et al., 2004). However, since the distribution of LD statistics varies between populations, we suggest that the choice of disequilibrium measure and the cut-off point for the presence of LD should be determined after preliminary assessment of the data. Specifically, we recommend that the cut point should be chosen as the value at which the minimum of the density of the LD measure is attained.
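One possible way to locate such a cut point is sketched below. It assumes that the pairwise LD values lie in [0, 1] and have a roughly bimodal distribution, estimates their density with a Gaussian kernel on a regular grid, and returns the interior local minimum with the lowest density. The kernel, the bandwidth of 0.05, the grid size and the function names are illustrative assumptions, not part of SNPLINK.

```python
import math


def gaussian_kde(values, grid, bandwidth):
    """Gaussian kernel density estimate of `values` evaluated at each grid point."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]


def density_minimum_cutoff(values, bandwidth=0.05, n_grid=101):
    """Return the LD value in (0, 1) at which the estimated density has its
    lowest interior local minimum, or None if no interior minimum exists
    (for example when the distribution is unimodal)."""
    if not values:
        raise ValueError("no LD values supplied")
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    dens = gaussian_kde(values, grid, bandwidth)
    minima = [
        i for i in range(1, n_grid - 1)
        if dens[i] < dens[i - 1] and dens[i] <= dens[i + 1]
    ]
    if not minima:
        return None
    best = min(minima, key=lambda i: dens[i])
    return grid[best]


# Hypothetical LD values: one cluster of low values and one of high values.
low = [0.05, 0.10, 0.10, 0.15] * 5
high = [0.85, 0.90, 0.90, 0.95] * 5
print(density_minimum_cutoff(low + high))
```

The result depends on the bandwidth, so in practice the estimated density should be inspected visually before the cut point is accepted.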

In conclusion, the advantages of using SNPs for linkage analysis are clearly apparent, but care must be taken to avoid false linkage results due to LD between markers. SNPLINK provides a user-friendly, fully automated program for the systematic removal of LD between SNP markers and performs full genomewide linkage analysis both before and after LD is removed. SNPLINK has been extensively tested on real data and successfully used to perform several linkage scans using the Affymetrix 10K array.

Fig. 1

Example of the output obtained from SNPLINK showing a plot of the multipoint LOD score across chromosome 14 without (solid line) and with (dashed line) correction for LD between markers. An apparent linkage peak on the q arm disappears when LD is removed so that the maximum linkage peak is now on the p arm. The corresponding loss in information content (IC) was small throughout the chromosome (graphs available in SNPLINK) indicating that the reduction of the linkage peak was due to the removal of high LD between markers and not a consequence of reduced IC.

This work was supported by grants from Cancer Research UK. G.S.S. is in receipt of a Post-Doctoral Research Fellowship from Leukaemia Research. We are grateful to two reviewers for their helpful comments.

REFERENCES

Abecasis, G.R., et al. (2002) Merlin--rapid analysis of dense genetic maps using sparse gene flow trees. Nat. Genet., 30, 97–101.
Dawson, E., et al. (2002) A first-generation linkage disequilibrium map of human chromosome 22. Nature, 418, 544–548.
Devlin, B. and Risch, N. (1995) A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics, 29, 311–322.
Evans, D.M. and Cardon, L.R. (2004) Guidelines for genotyping in genomewide linkage studies: single-nucleotide-polymorphism maps versus microsatellite maps. Am. J. Hum. Genet., 75, 687–692.
Gudbjartsson, D.F., et al. (2000) Allegro, a new computer program for multipoint linkage analysis. Nat. Genet., 25, 12–13.
Hedrick, P.W. (1987) Gametic disequilibrium measures: proceed with caution. Genetics, 117, 331–341.
Huang, Q., et al. (2004) Ignoring linkage disequilibrium among tightly linked markers induces false-positive evidence of linkage for affected sib pair analysis. Am. J. Hum. Genet., 75, 1106–1112.
John, S., et al. (2004) Whole-genome scan, in a complex disease, using 11,245 single-nucleotide-polymorphisms: comparison with microsatellites. Am. J. Hum. Genet., 75, 54–64.
Ke, X., et al. (2004) The impact of SNP density on fine-scale patterns of linkage disequilibrium. Hum. Mol. Genet., 13, 577–588.
Kruglyak, L. (1997) The use of a genetic map of biallelic markers in linkage studies. Nat. Genet., 17, 21–24.
Kruglyak, L., et al. (1996) Parametric and nonparametric linkage analysis: a unified multipoint approach. Am. J. Hum. Genet., 58, 1347–1363.
Matise, T.C., et al. (2003) A 3.9-centimorgan-resolution human single-nucleotide-polymorphism linkage map and screening set. Am. J. Hum. Genet., 73, 271–284.
Schaid, D.J., et al. (2002) Caution on pedigree haplotype inference with software that assumes linkage equilibrium. Am. J. Hum. Genet., 71, 992–995.
Schaid, D.J., et al. (2004) Comparison of microsatellites versus single-nucleotide polymorphisms in a genome linkage screen for prostate cancer-susceptibility loci. Am. J. Hum. Genet., 75, 948–965.
Sellick, G.S., et al. (2004) Genomewide linkage searches for Mendelian disease loci can be efficiently conducted using high-density SNP genotyping arrays. Nucleic Acids Res., 32, e164.
