Shan-Shan Dong, Wei-Ming He, Jing-Jing Ji, Chi Zhang, Yan Guo, Tie-Lin Yang, LDBlockShow: a fast and convenient tool for visualizing linkage disequilibrium and haplotype blocks based on variant call format files, Briefings in Bioinformatics, Volume 22, Issue 4, July 2021, bbaa227, https://doi.org/10.1093/bib/bbaa227
Abstract
The triangular correlation heatmap, which visualizes the linkage disequilibrium (LD) pattern and haplotype block structure of SNPs, is a ubiquitous component of population-based genetic studies. However, current tools are slow and memory-intensive. Here, we developed LDBlockShow, an open-source tool for visualizing LD and haplotype blocks from variant call format files. It saves both time and memory: in a test dataset with 100 SNPs from 60 000 subjects, it was at least 10.60 times faster than other tools and used only 0.03–13.33% of the physical memory they required. In addition, it can generate figures that simultaneously display additional statistical context (e.g. association P-values) and genomic region annotations. It can also compress SVG files containing a large number of SNPs and supports subgroup analysis. This fast and convenient tool will facilitate the visualization of LD and haplotype blocks for geneticists.
Introduction
Due to genetic linkage, nearby single nucleotide polymorphisms (SNPs) are often highly correlated. In genetic studies, understanding the linkage disequilibrium (LD) pattern is helpful in selecting representative SNP subsets and interpreting other statistical results, such as association P-values [1]. However, for multiple SNPs, it is difficult to interpret results from the summary statistics of pairwise LD measurements since the number of measurements increases rapidly with the number of SNPs. Therefore, the triangular correlation heatmaps aiming to visualize the LD pattern and haplotype block structure of SNPs have become a ubiquitous component of population-based genetic studies since the completion of the international HapMap project.

Figure 1. The LDBlockShow workflow. It consists of two units: the data processing unit and the plot unit. The data processing unit includes sample selection, SNP filtering and data structure transformation. The plot unit generates the LD heatmap from the calculated LD measurement statistics.
Haploview [2], LDheatmap [3] and gpart [4] are the most popular tools for visualizing LD heatmaps. However, with next-generation sequencing (NGS) now producing files that contain large numbers of individuals and SNPs, these tools have become slow and memory-intensive. In addition, to display regional association statistics or genomic annotation results in the context of LD, researchers must manually align two separate figures. For example, previous studies [5–7] merged association statistics, recombination rates or genomic region annotation results with LD plots generated by Haploview, which is inconvenient. Of the three tools, only gpart can generate a genomic annotation plot, and it does not support plotting additional statistics. Moreover, with a large number of SNPs, the resulting SVG or PDF vector diagram may be too large to open on a personal computer. For example, with 1000 SNPs, about 500 000 grids (1000 × 1000/2) are generated and the plot file can reach tens of MB. Haploview and gpart can output PNG files, but PNG files are hard to edit further and lose quality when scaled. Furthermore, researchers working with variant call format (VCF) [8] files generated from NGS data analysis must first convert them to the ‘PED’ format before using Haploview or LDheatmap to obtain the LD plot. Meanwhile, gpart accepts only uncompressed VCF files as input, which is inconvenient for compressed files because decompression can produce files that consume substantial storage.
To address these problems, we present LDBlockShow, a software tool that allows biologists to generate LD and haplotype maps quickly and directly from compressed or uncompressed VCF files. LDBlockShow can generate the LD heatmap together with regional association statistics or genomic annotation results in a single figure. It can also compress SVG files containing a large number of SNPs and supports subgroup analysis. This fast and convenient tool will facilitate the visualization of LD and haplotype blocks for geneticists.
Methods
Overview of LDBlockShow workflow
LDBlockShow consists of two units: the data processing unit and the plot unit (Figure 1). The data processing unit takes compressed or uncompressed VCF files as input. Users can also supply files in PLINK [9] or genotype format with the options ‘-InPlink’ and ‘-InGenotype’, respectively. Next, subgroup samples can be selected (‘-Subgroup’ flag), and SNPs with a minor allele frequency (MAF) below 0.05, a missing sample rate above 0.25 or a heterozygosity ratio above 0.9 are filtered out. Custom criteria can be defined with ‘-MAF’, ‘-Miss’ and ‘-Het’. The genotype of each individual is stored in a specific data structure to facilitate pairwise LD statistics calculation. Using the calculated LD measurement statistics, the plot unit generates the final LD heatmap. Users can also generate the LD plot combined with additional statistical context or genomic region annotations. With a large number of SNPs, LDBlockShow compresses the output SVG file.
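The default SNP filters described above can be illustrated with a minimal Python sketch. This is not LDBlockShow's own code (the tool is written in C++). The function name, the genotype coding (alternate-allele counts 0, 1 or 2, with `None` for a missing call) and the exact definitions of the missing rate and heterozygosity ratio are assumptions made for illustration. Only the three thresholds come from the text.

```python
from typing import Optional, Sequence


def passes_default_filters(
    genotypes: Sequence[Optional[int]],
    min_maf: float = 0.05,      # default MAF threshold quoted in the text
    max_missing: float = 0.25,  # default missing sample rate quoted in the text
    max_het: float = 0.9,       # default heterozygosity ratio quoted in the text
) -> bool:
    """Return True if a biallelic SNP survives the three filters.

    genotypes: one entry per sample, the alternate-allele count (0, 1, 2)
    or None when the call is missing (hypothetical coding).
    """
    n_total = len(genotypes)
    called = [g for g in genotypes if g is not None]
    if n_total == 0 or not called:
        return False

    # Assumed definitions: missing rate over all samples,
    # allele frequency and heterozygosity over called samples only.
    missing_rate = 1.0 - len(called) / n_total
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1.0 - alt_freq)
    het_ratio = sum(1 for g in called if g == 1) / len(called)

    return maf >= min_maf and missing_rate <= max_missing and het_ratio <= max_het
```

For example, `passes_default_filters([0, 1, 2, 0, 1, None, 0, 0])` keeps the SNP, whereas a SNP that is heterozygous in every sample is rejected by the heterozygosity filter.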
Data structure
LDBlockShow is implemented in C++ under the open-source MIT license for Linux/Unix and Mac operating systems. The genotype of each individual will be stored with the following data structure to facilitate the fast pairwise LD statistics calculation:
LD measurement
Users can choose to display r² or D’ in the heatmap with the option ‘-SeleVar’.
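For readers unfamiliar with the two measures, the following Python sketch computes r² and |D’| for one pair of biallelic SNPs using the standard textbook definitions. It assumes phased haplotypes coded 0/1, which is a simplification. The function name is hypothetical, and the sketch does not reproduce how LDBlockShow estimates these values from VCF genotypes.

```python
from typing import Sequence, Tuple


def pairwise_ld(hap_a: Sequence[int], hap_b: Sequence[int]) -> Tuple[float, float]:
    """Return (r2, d_prime) for two SNPs from phased haplotypes coded 0/1."""
    if len(hap_a) != len(hap_b) or len(hap_a) == 0:
        raise ValueError("haplotype vectors must be non-empty and of equal length")

    n = len(hap_a)
    p_a = sum(hap_a) / n  # frequency of allele 1 at SNP A
    p_b = sum(hap_b) / n  # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n

    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        # At least one SNP is monomorphic, so LD is undefined; 0 is returned
        # here purely as a convention of this sketch.
        return 0.0, 0.0

    r2 = d * d / denom
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    return r2, d_prime
```

Two SNPs with identical haplotype vectors give r² = 1 and |D’| = 1. A pair such as `[1, 1, 1, 0]` and `[1, 0, 0, 0]` gives |D’| = 1 but a much lower r², which shows why the two measures can produce different heatmaps.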

Figure 2. Comparison of computing cost for LDBlockShow, LDheatmap and Haploview. (A) CPU time and (B) memory cost for different methods are shown with a fixed SNP number of 100 and sample size ranging from 2000 to 60 000. (C) CPU time and (D) memory cost for different methods are shown with a fixed sample size of 2000 and SNP number ranging from 100 to 1200. When testing the datasets in A–D, both LDBlockShow and gpart finished the analyses within reasonable time and memory. We further tested their performance on large datasets. (E) CPU time and (F) memory cost for these two methods are shown with a fixed sample size of 100 000 and SNP number ranging from 300 to 2500. Computation was performed with one thread of an Intel Xeon CPU E5-2630 v4.
LD blocks
LDBlockShow supports four ways of defining blocks. By default, PLINK (Version 1.9, www.cog-genomics.org/plink/1.9/) [9] is called to generate the blocks as defined by Gabriel et al. [12]. The ‘solid spine of LD’ method [2] is also supported. Users can also define their own r² and D’ cutoffs for blocks with the option ‘-BlockCut’ or supply their own block region definition with the option ‘-FixBlock’.

Figure 3. An example output figure of LDBlockShow. (A) The association statistics. In addition to the association statistics, users can choose other statistical measurements for SNPs to display. (B) The genomic region annotation results. CDS, intron, UTR and intergenic regions are shown in yellow, light blue, pink and orange, respectively. Colors can be user-defined. (C) The LD heatmap. Colors can be user-defined. The LD measurement values are shown by using the ‘-ShowNum’ flag.
Results
LDBlockShow is time and memory saving
We examined the computing time and memory requirements of LDBlockShow, Haploview [2], LDheatmap [3] and gpart [4] using genotype data from the UK Biobank population (all SNPs on chromosome 22). As shown in Figure 2A, with a fixed SNP number of 100 and sample size ranging from 2000 to 60 000, LDBlockShow was faster than the other tools. For example, with a sample size of 60 000, LDheatmap, Haploview and gpart took 98.68, 123.87 and 0.53 min, respectively, to generate the LD plot. In contrast, LDBlockShow took only 0.05 min to analyze the same dataset, representing 1973.60-, 2477.40- and 10.60-fold speed gains over these three methods. In addition, LDBlockShow required only a small amount of physical memory (Figure 2B). For example, with a sample size of 60 000, LDheatmap, Haploview and gpart required 0.15, 75.00 and 1.03 GB of memory, respectively, to analyze 100 SNPs. In contrast, LDBlockShow used only 0.02 GB, which is 13.33, 0.03 and 1.94% of that required by the three methods. With a fixed sample size of 2000 and SNP number ranging from 100 to 1200, LDBlockShow showed similar performance (Figure 2C and D). For example, with 1200 SNPs, LDheatmap, Haploview and gpart took 529.13, 1289.88 and 1.45 min, respectively, to generate the LD plot. In contrast, LDBlockShow took only 0.13 min to analyze the dataset, representing 4070.23-, 9922.15- and 11.15-fold speed gains over these three methods. On the same dataset, LDheatmap, Haploview and gpart required 0.18, 11.00 and 1.07 GB of memory, respectively. LDBlockShow used 0.06 GB, which is 33.33, 0.55 and 5.60% of that required by the three methods.
When testing the above datasets (Figure 2A–D), both LDBlockShow and gpart finished the analyses within reasonable time and memory. We further tested their performance on large-scale datasets using a fixed sample size of 100 000 individuals. The results (Figure 2E and F) showed that LDBlockShow is more suitable for analyzing large-scale datasets. For example, with 2500 SNPs, gpart took 218.35 min to generate the LD plot. In contrast, LDBlockShow took only 16.23 min to analyze the dataset, representing a 13.45-fold speed gain over gpart. Moreover, gpart required 20 GB of memory (not feasible for most personal laptops) to analyze the dataset, while LDBlockShow used only 9.56% (1.91 GB) of that amount.
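The fold-change and memory-percentage figures quoted in this section follow from simple ratios of the reported timings and memory footprints. The short Python sketch below reproduces that arithmetic. The helper names are illustrative, and because the published inputs are themselves rounded, a recomputed value can differ from the quoted one in the last digit.

```python
def fold_gain(other_minutes: float, ldblockshow_minutes: float) -> float:
    """Speed gain of LDBlockShow over another tool, rounded to two decimals."""
    return round(other_minutes / ldblockshow_minutes, 2)


def memory_percent(ldblockshow_gb: float, other_gb: float) -> float:
    """LDBlockShow memory as a percentage of another tool's memory."""
    return round(100.0 * ldblockshow_gb / other_gb, 2)


if __name__ == "__main__":
    # 100 SNPs, 60 000 samples: timings (min) and memory (GB) from the text.
    for tool, minutes, gigabytes in [
        ("LDheatmap", 98.68, 0.15),
        ("Haploview", 123.87, 75.00),
        ("gpart", 0.53, 1.03),
    ]:
        print(tool, fold_gain(minutes, 0.05), memory_percent(0.02, gigabytes))
```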
LDBlockShow is convenient in combining additional statistics or genomic annotations
LDBlockShow can generate the plots of LD heatmap and additional statistical context (provided with the ‘-InGWAS’ flag) or genomic annotation results (provided with the ‘-InGFF’ flag) simultaneously. An example plot of LDBlockShow is shown in Figure 3.
LDBlockShow supports the compression of SVG files with large number of SNPs
In addition to converting the SVG to a PNG file (‘-OutPng’ flag), we offer another option to compress the SVG file. When the SNP number exceeds 50 (the cutoff can be set with the ‘-MerMinSNPNum’ flag), the SVG file is automatically compressed by using a small number of color gradients and merging adjacent grids of the same color into a single grid. In a test dataset of 1000 SNPs, without compression, the vector diagrams generated by LDBlockShow, Haploview and LDheatmap were 26, 91 and 23 MB, respectively. After compression with LDBlockShow, the 26 MB SVG was reduced to 8.6 MB (33.08% of the original size). The overall comparison between LDBlockShow and other tools is shown in Table 1.
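The idea behind this compression can be sketched in a few lines of Python: LD values are first binned into a small number of color levels, and runs of adjacent cells that share a bin are then drawn as one shape instead of many. The sketch below merges runs within a single heatmap row only. The function names, the default of five levels and the row-wise strategy are assumptions for illustration, not a description of LDBlockShow's actual merging algorithm.

```python
from typing import List, Sequence, Tuple


def quantize(value: float, levels: int) -> int:
    """Map an LD value in [0, 1] to a color bin in 0 .. levels - 1."""
    if levels < 1:
        raise ValueError("levels must be at least 1")
    clipped = min(max(value, 0.0), 1.0)
    # A value of exactly 1.0 falls into the top bin.
    return min(int(clipped * levels), levels - 1)


def merge_row(values: Sequence[float], levels: int = 5) -> List[Tuple[int, int, int]]:
    """Collapse one heatmap row into (start_index, run_length, color_bin) runs."""
    runs: List[Tuple[int, int, int]] = []
    for index, value in enumerate(values):
        color_bin = quantize(value, levels)
        if runs and runs[-1][2] == color_bin:
            # Same color as the previous cell: extend the current run.
            start, length, _ = runs[-1]
            runs[-1] = (start, length + 1, color_bin)
        else:
            runs.append((index, 1, color_bin))
    return runs
```

With fewer color levels, neighboring cells are more likely to share a bin, so each row collapses into fewer shapes and the SVG becomes smaller, at the cost of a coarser color scale.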
Table 1. Comparison between LDBlockShow and other tools

Performance | LDBlockShow | Haploview | LDheatmap | gpart |
---|---|---|---|---|
Input | | | | |
Compressed VCF file | √ | × | × | × |
Uncompressed VCF file | √ | × | × | √ |
Support subgroup analysis | √ | × | × | × |
Output | | | | |
Visualizing additional statistics | √ | × | × | × |
Visualizing genomic annotation | √ | × | × | √ |
Compressed SVG | √ | × | × | × |
PNG file | √ | √ | × | √ |
Block region | √ | √ | × | √ |
LD measurement | D’/r² | D’/r² | r² | D’/r² |
Discussion
In this study, we developed LDBlockShow for visualizing LD and haplotype blocks based on VCF files. Compared with current tools, LDBlockShow has the following advantages. Firstly, it saves time and memory and supports analyses directly from compressed or uncompressed VCF files. With the advances of NGS, genomic data for large-scale populations are steadily accumulating. For example, the numbers of human exomes and whole genomes in the genome aggregation database (gnomAD) consortium have reached 125 748 and 15 708, respectively [13]. LDBlockShow can therefore help NGS researchers in a time- and resource-efficient manner. Secondly, LDBlockShow complements the common triangular correlation heatmap by providing additional statistical context and genomic region annotations. For example, with association analysis statistics, users can easily locate the SNPs with the most significant association signal, which is especially useful for genomic fine-mapping studies. Thirdly, with a large number of SNPs, LDBlockShow can compress the original vector diagram to about 33% of the original file size, facilitating visualization on personal computers. In addition, LDBlockShow supports subgroup analysis, which makes it convenient to compare LD patterns across subgroups.
In conclusion, LDBlockShow is a fast and convenient tool for visualizing LD and haplotype blocks based on VCF files. It can generate the LD heatmap together with regional association statistics or genomic annotation results in a single figure. It can also compress SVG files containing a large number of SNPs and supports subgroup analysis.
Key Points
Visualizing the LD pattern and haplotype block structure of SNPs has become a ubiquitous component of population-based genetic studies. However, current tools are slow and memory-intensive.
Here, we developed LDBlockShow for visualizing LD and haplotype blocks directly from compressed or uncompressed variant call format files. Tests on real data confirmed that it saves both time and memory.
LDBlockShow can generate figures that simultaneously display additional statistical context and genomic region annotations. It can also compress SVG files containing a large number of SNPs and supports subgroup analysis.
Web resources
LDBlockShow is freely available for non-commercial research institutions. Details can be obtained from https://github.com/BGI-shenzhen/LDBlockShow.
Data Availability Statement
Any required links or identifiers for the data used in this manuscript are present in the Methods section.
Authors’ contributions
T.-L.Y., W.-M.H. and S.-S.D. conceived and designed the study. S.-S.D. wrote the paper. W.-M.H. implemented the software with the help of S.-S.D., J.-J.J. and C.Z. Y.G. edited the paper. The authors read and approved the final manuscript.
Acknowledgement
We thank the LDBlockShow community members who have offered great suggestions for improving the software. This research has been conducted using the UK Biobank Resource under Application Number 46387.
Funding
National Natural Science Foundation of China (31871264 and 31970569); Natural Science Foundation of Zhejiang Province (LWY20H060001); Fundamental Research Funds for the Central Universities.
Shan-Shan Dong is currently working as an Associate Professor at the Key Laboratory of Biomedical Information Engineering of Ministry of Education, School of Life Science and Technology, Xi’an Jiaotong University.
Wei-Ming He is currently working as a Senior Bioinformatics Engineer in BGI Genomics.
Jing-Jing Ji is currently working as a Junior Bioinformatics Engineer in BGI Genomics.
Chi Zhang is currently working as a Senior Bioinformatics Engineer in BGI Genomics.
Yan Guo is currently working as a Professor at the Key Laboratory of Biomedical Information Engineering of Ministry of Education, School of Life Science and Technology, Xi’an Jiaotong University.
Tie-Lin Yang is currently working as a Professor at the Key Laboratory of Biomedical Information Engineering of Ministry of Education, School of Life Science and Technology, Xi’an Jiaotong University.
Author notes
Shan-Shan Dong and Wei-Ming He contributed equally to this work.