cgmisc: enhanced genome-wide association analyses and visualization

Summary: High-throughput genotyping and sequencing technologies facilitate studies of complex genetic traits and provide new research opportunities. The increasing popularity of genome-wide association studies (GWAS) leads to the discovery of new associated loci and a better understanding of the genetic architecture underlying not only diseases, but also other monogenic and complex phenotypes. Several softwares are available for performing GWAS analyses, R environment being one of them. Results: We present cgmisc, an R package that enables enhanced data analysis and visualization of results from GWAS. The package contains several utilities and modules that complement and enhance the functionality of the existing software. It also provides several tools for advanced visualization of genomic data and utilizes the power of the R language to aid in preparation of publication-quality figures. Some of the package functions are specific for the domestic dog (Canis familiaris) data. Availability and implementation: The package is operating system-independent and is available from: https://github.com/cgmisc-team/cgmisc Contact: marcin.kierczak@imbim.uu.se Supplementary information: Supplementary data are available at Bioinformatics online.


Synopsis
cgmisc is an R package for enhanced data analyses and visualisation in genome-wide association studies (GWAS). This document will guide you through the installation process and to demonstrate package capabilities in a series of practical examples based on a showcase data included in the package.

Package installation
The cgmisc package is available on GitHub repository and can be installed with the help of the devtools package:

install.packages('devtools') # Install devotools if not installed library(devtools) # Load the library install_github('cgmisc-team/cgmisc') # Install cgmisc
In order to use the package functions, it is necessary to load it into environment:

Loading data
Whenever possible, the cgmisc package works with data structures implemented and used by the GenABEL (Aulchenko et al. 2007) package. In particular, the gwaa.data-class and the gwaa.scan-class structures are used. The package is shipped with an example dataset called cgmisc_data. The example dataset contains genotyping data (Illumina CanineHD array, canFam2 assembly) for N=207 German shepherds originally collected for the project described in (Tengvall et al. 2013). However, to illustrate various features of the cgmisc package, the phenotypes included in the example dataset have been simulated. Use the following command to load the example dataset:

data('cgmisc_data')
Now, once the data have been loaded, we can start analyses.

Example analyses
In order to illustrate how to use particular functions, we will perform a much simplified GWAS analysis. We begin by performing initial quality control.

Quality control
First, we will prune the data with per marker or per individual call rates below 95%. Based on 2000 randomly selected autosomal markers, we remove one (the one with lower call rate) from each pair of too similar (more than 95% similarity) individuals. We also set very low (10 −3 ) threshold for pruning on minor allele frequency (in practise only the monomorphic markers will be removed) and turn off checks based on the departure form Hardy-Weinberg equilibrium (p.level=10e-18)

Detecting population structure
Now, we will analyse population structure using genomic-kinship information. Using all autosomal chromosomes, we will calculate the Identity By State (IBS) matrix which captures pairwise genetic similarity between individuals. Then, in order to visualize population structure, we will use the multidimensional scaling (MDS) technique to reduce the IBS matrix to two dimensions.

MDS1 MDS2
We can see that there is a chance of population structure here: a somewhat tighter cluster of individuals to the left. We should investigate this further, but for the purpose of this tutorial, let's just run simple K-means clustering with the number of clusters a priori set to K = 2.

Comparing subpopulations
We can compare subpopulations by, e.g. looking at the differences in the reference allele frequency using the pop.allele.counts and the plot.pac functions. Here, we will focus on the chromosome 2 only.

Comparing of allele counts
We can perform Fisher's exact test on the observed per-population reference allele counts and the counts expected assuming null hypothesis of no population structure (all individuals drawn from the same population). Significant deviations from this expected frequency mark divergent regions.

Association analyses
Having defined and briefly analysed the subpopulations, we can proceed to association analyses using, e.g. mixed model with genomic kinship as random effect.

Visualising and analysing linkage structure
Say, we would like to zoom in on chromosome 2 and visualise linkage disequilibrium (LD) of every marker in the region to the top-associated marker. First, we need the name and coordinates of the top-associated marker: The top-associated marker is BICF2S2365880 and its position is 38256927bp. We will zoom in on a 2 Mbp region centered on the marker using the plot.manhattan.ld function. The function produces a standard so-called Manhattan plot with color-coded linkage disequilibrium (LD) to a specific reference marker measured by r 2 . Given genotyping data (as GenABEL gwaa.data-class object) and GWAS result in the form of p-values vector together with genetic coordinates of a region to be plotted (up to entire chromosome), plot.manhattan.ld produces a plot with genomic coordinates on the x-axis and −log 10 (p−value) on the y-axis. Linkage disequilibrium to a reference marker is represented by specific colors.

Visualising gene annotations
The plot.manhattan.genes extends the plot.manhattan.ld function by enabling visualisation of genes provided in a BED file. In the cgmisc package, we provide a BED file containing information on protein coding genes in dog canFam3.1 assembly. The file was prepared using Broad Improved Canine Annotation v.1 available at the UCSC Genome Browser. Below, we use this file to visualise genes in a specified region:

Working with genomic regions
To extract markers from a given genomic region that are in high LD to a given index marker the choose.top.snps function can be used.

Clumping procedure
We can also use the clumping procedure as outlined in PLINK documentation (http://pngu.mgh.harvard. edu/purcell/plink/clump.shtml) to single out regions of interest. As we can see, there are two clumps represented by grey and violet points respectively. The clumps are shown on both the standard Manhattan plot (upper panel) and, for improved clarity, also on a dedicated clump panel (lower panel). In short: a clump contains markers in high LD that are also significantly associated with the examined trait.

Improved quantile-quantile plots
A quantile-quantile plot (QQ plots) is a graphical way of comparing two probability distributions. In GWAS studies, QQ plots are commonly used to compare computed per marker p-values with the ones expected under the null hypothesis of no association. This comparison is then used to compute genomic-inflation factor λ which is a good measure of the degree of confounding caused by population structure. The cgmisc package provides the plot.qq function for better visualization and improved interpretation of standard QQ plots. The function plots expected vs. observed distribution of p-values and shows theoretical confidence interval computed using approach outlined in (Casella and Berger 2002). Apart from showing the theoretical confidence interval, it can also perform a number of randomization tests (shuffling the phenotype) to determine empirical confidence interval which, due to LD, is often narrower than the theoretical one. Empirical p-values can be supplied by the user, otherwise a randomisation test on Grammar gamma-transformed residuals is performed as outlined in (Belonogova et al. 2013).

Interacting with the UCSC genome browser
It is often convenient to display p-values from a genome-wide association scan directly in UCSC Genome Browser (Kent et al. 2002) to easily align annotations with the signals of interest. This can be done with gwaa.to.bed function that exports coordinates and p-values from gwaa.data-scan into a BED file that can easily be used to set a custom path in the UCSC Genome Browser.

Detecting runs of homozygosity
The get.overlap.windows function divides the selected chromosome (or the whole genome) into overlapping chunks of given size and overlap. The function returns a list containing window coordinates along with a logical matrix where each window is represented by a row and the logical value per-marker is set to true if the marker is contained within the window and to false otherwise. One can specify (in bp) size of a window as well as overlap between windows. If overlap is set to 0, non-overlapping windows will be used.
We will divide chromosome 2 into overlapping windows and then use them to calculate mean heterozygosity and identify runs of homozygosity: my.LW <-get.overlap.windows(data = data.qc1, chr = 2, size = 125e3, overlap = 2500)

Computing average heterozygosity using overlapping windows approach
Heterozygosity is evaluated based on allelic frequencies of markers in particular overlapping windows and the basic Hardy-Weinberg theorem equation. The values range from 0 to 1 and correspond to the probability that the given set of loci, represented by the window, is heterozygous. We can use the calculated heterozygosity to detect runs of homozygosity across the selected chromosome or region. We will use the get.roh function to check if we have any stretches of reduced heterozygosity on chromosome 2. All windows with the average heterozygosity below a given threshold, here 0.30, be deemed homozygous. As a result we get a matrix with runs coordinates, length (in windows) and first window that starts a stretch.

Examining allele/genotype effect using phenotype boxplots
For quantitative traits, the boxplot.snp function can be used to visually examine allele or genotype effect by plotting phenotype boxplots for the individuals in every genotypic class. The function works for both outbreed (three boxes) and inbreed (two boxes) data.

Simple epistasis scan
To gain further understanding of the genetic architecture underlying our trait, we might want to search for potential epistasis between pairs of SNPs. In the function epistasis.scan, we implement a simple way of doing this by fitting linear models including two SNPs: where y is the phenotype, SNP and SN P i are the genotypes at the two SNPs, and e is the residual. i = 1...n, where n is the number of SNPs in the input to the function. The function takes a SNP, phenotypes, and a gwaa.data-class object as input. It then fits linear models between SNP and all markers/SNPs in the gwaa.data-class object. To visualize a potential two-SNP interaction, one can use the function boxplot.snp.twoWay. This function shows a two-locus genotype-phenotype map by plotting phenotype boxplots for the individuals in every genotypic class, as defined by the two SNPs jointly. This gives nine or four boxes, for outbred and inbred data respectively.

Data conversion functions
Our package provides a number of functions which enable navigating between various data formats such as FASTA or the input format used by the PHASE software.
gwaa.to.bigrr The bigRR package implements a computationally-efficient generalized ridge regression (RR) algorithm for fitting models with a large number of parameters (Shen et al. 2013). gwaa.to.bigrr function exports gwaa.data-class object to bigRR-compatible input structure.
gwaa.to.vgwas In a standard approach to GWAS, associations are detected based on differences in mean phenotypic values across genotype classes. Shen et al. (2012) have proposed an extended approach where the associations are detected based on differences in variances, not means only. They have implemented the approach in the vGWAS R package. The gwaa.to.vgwas function converts gwaa.data-class object to a format readable by the vGWAS package.
gwaa.to.phase enables the user convert a GenABEL gwaa.data-class object to the intrnal PHASE input format.
phase.to.fasta PHASE is a software for haplotype reconstruction and our phase.to.fasta function can be used to parse a part of PHASE output to a customised FASTA format: '>haplo_1_count_176' 'TCGGGCTC' In the above example, the first line contains the name of the haplotype along with its count, the second line provides the haplotype sequence.
phase.to.haploview Converts PHASE output format to the Haploview input format. Haploview is a popular software which faciliates haplotype and LD analyses.
gwaa.to.bed produces a BED file containing p-values from a pre-defined region of dataset. The file can be used to, e.g. visualize GWAS p-values as the UCSC Genome Browser custom track.

Dog-specific utilities
Current approaches to finding genome-wide associations in diploid organisms often encounter difficulties when analysing non-autosomal parts of the genome, e.g. the sex chromosomes (Young, Kirkness, and Breen 2008). In cgmisc, the chr.x.fix.canfam function partially addresses this issue by creating an artificial autosome which consists of pseudo-autosomal regions of the X chromosome. This additional autosome can be analysed in a way similar to all other autosomes. Currently, the function is specific to data coming from the domestic dog (Canis familiaris, assemblies canFam2 and canFam3.1).

Endogenous retroviral sequences (ERV)
The get.erv function returns information about endogenous retroviral sequences (ERV) in the analysed region. At first, we need to obtain a list of ERV sequences in the defined genomic region using the get.erv function. In the cgmisc package, we provide collection of canine ERVs identified in canFam3.1 assembly but you can also supply any other database.
Let's search for ERVs on chromosome 2 NOTE! In this example, we are using ERV database for the canFam3.1 assembly on canFam2 data which may not be the optimal way. We shall perhaps use the Liftover software to map coordinates between the assemblies.