RegScan: a GWAS tool for quick estimation of allele effects on continuous traits and their combinations

Genome-wide association studies are becoming computationally more demanding as the amount of data grows. Combinatorial traits can increase the data dimensions beyond the computational capabilities of current tools. We addressed this issue by creating an application for quick association analysis that is ten to hundreds of times faster than the leading fast methods. Our tool (RegScan) is designed to perform basic linear regression analysis with continuous traits as fast as possible on large data sets. RegScan specifically targets association analysis of combinatorial traits in metabolomics. It can both generate and analyze the combinatorial traits efficiently. RegScan can analyze any number of traits together without the need to specify each trait individually. The main goal of the article is to show that RegScan can be the preferred analytical tool when large amounts of data need to be analyzed quickly using the allele frequency test. Availability: Precompiled RegScan (all major platforms), source code, user guide and examples are freely available at www.biobank.ee/regscan. Requirements: Qt 4.4.3 or newer for dynamic compilations.


Reference tools
For reference, we selected commonly used tools that can perform linear regression analysis with allele frequency and continuous traits and can output the p-value, effect size (beta) and standard error (se): SNPTEST (2.4.1) and QuickTest (0.97).
There are also tools that perform linear regression analysis but do not output p-value (e.g. ProbABEL 0.3.0). Of all four tools tested, RegScan 0.1 always performed the fastest.
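
The per-marker test that all of these tools perform can be illustrated with a minimal sketch. It regresses a continuous trait on allele dosage for a single marker and returns beta, se and a p-value. This is not RegScan's implementation: the function name and input layout are hypothetical, and the p-value uses a normal approximation to the t distribution, which is only reasonable for large sample sizes.

```python
import math

def linear_regression(dosages, trait):
    """Simple linear regression of a continuous trait on allele dosage.

    dosages: per-individual allele dosages (assumed to lie in 0..2)
    trait:   per-individual trait values, same order as dosages
    Returns (beta, se, p), where p is a two-sided p-value from a
    normal approximation (an assumption of this sketch).
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(trait) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("monomorphic marker: dosage has no variance")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, trait))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    # Residual sum of squares with n - 2 degrees of freedom
    rss = sum((y - intercept - beta * x) ** 2 for x, y in zip(dosages, trait))
    se = math.sqrt(rss / (n - 2) / sxx)
    z = abs(beta / se)
    p = math.erfc(z / math.sqrt(2))
    return beta, se, p
```

An exact p-value would use the t distribution with n - 2 degrees of freedom instead of the normal approximation.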

Speed as a function of the number of individuals tested
The analysis time of RegScan did not change relative to QuickTest when the number of individuals was varied. RegScan remained about 10 times faster with one trait. The computational time ranged from 0.08 to 0.34 msec/marker/trait for RegScan and from 0.79 to 3.4 msec/marker/trait for QuickTest (figures are in the main article).

Speed as a function of the number of markers
We tested the computational speed of RegScan and QuickTest with 38.02 million markers, 1 trait, and 750 individuals. RegScan performed 10.6 times faster than QuickTest, showing the same relative speed as with 1 million markers. The computational speed of RegScan was 0.073 msec/marker/trait under these conditions. When the number of individuals was increased to 3315 (4.42 times higher), the analysis time increased 4.6 times, again showing an approximately linear relationship between the number of individuals and the analysis time.

Speed per trait as a function of the number of traits
The time spent analyzing each trait is expected to decrease as the number of traits increases. This was tested with 5 million markers, 750 individuals, and a variable number of traits. The analysis time decreased from 114 sec/trait with 112 traits to 56 sec/trait with 6212 traits; the latter corresponds to 0.011 msec/marker/trait.
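
The conversion from seconds per trait to msec/marker/trait is simple arithmetic; the short sketch below reproduces it for the two reported timings. The function name is hypothetical and only the figures quoted above are used.

```python
def msec_per_marker_per_trait(seconds_per_trait, n_markers):
    """Convert a per-trait analysis time into msec per marker per trait."""
    return seconds_per_trait * 1000.0 / n_markers

N_MARKERS = 5_000_000  # number of markers used in this test

# 56 sec/trait (6212 traits) and 114 sec/trait (112 traits)
fast = msec_per_marker_per_trait(56, N_MARKERS)
slow = msec_per_marker_per_trait(114, N_MARKERS)
print(round(fast, 3), round(slow, 3))
```

With 5 million markers, 56 sec/trait gives roughly 0.011 msec/marker/trait and 114 sec/trait gives roughly 0.023 msec/marker/trait, so the per-trait cost is about halved at the larger trait count.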

Speed as a function of memory allocation
RegScan analysis can be further accelerated by allocating more memory for data reading.
The user can allocate 1 to n Mb of RAM for data reading. With typical data sets, 1 Mb is sufficient to achieve most of the speed gain that RegScan features. We compared the relative analysis speed when allocating 1 Mb vs. 1 Gb of RAM using the '-buffer' switch. The relative analytical speed (1 Gb / 1 Mb) was 13% with 750 individuals, 38.02 million markers and 1 trait.

Eliminating less informative markers
RegScan analysis can be accelerated by excluding markers that have a low MAC (minor allele count) from the analysis. This is achieved by setting the MAC limit (-maclimit) higher. Fig. S2 shows the relative processing time as a function of the MAC threshold. Analysis time can be shortened by up to 11% by using the MAC filter ('-macfilter').
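
The idea behind MAC filtering can be sketched as follows. This is an illustration only, not RegScan's code: it assumes diploid allele dosages in the range 0..2, and whether RegScan treats the threshold as inclusive is not stated here, so the `>=` comparison is an assumption. The function names and the input layout are hypothetical.

```python
def minor_allele_count(dosages):
    """Minor allele count for one marker, assuming diploid dosages in 0..2."""
    total = sum(dosages)
    # The minor allele is whichever of the two alleles is rarer
    return min(total, 2 * len(dosages) - total)

def filter_by_mac(markers, mac_limit):
    """Keep markers whose MAC is at least mac_limit.

    markers: dict mapping marker name -> list of per-individual dosages
    """
    return {
        name: dosages
        for name, dosages in markers.items()
        if minor_allele_count(dosages) >= mac_limit
    }
```

Markers dropped this way never reach the regression step, which is where the time saving comes from.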

Speed of analyzing gzip files
Gzip files are typically analyzed about 6.5-6.8 times slower than uncompressed files because decompression takes time. In practice, if the input file is already gzipped, it is better to analyze it directly than to unzip it first and then analyze it, because unzipping and writing the file to disk take a significant amount of time.
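
Reading a gzipped file directly means decompressing it as a stream instead of writing an uncompressed copy to disk first. The following sketch shows the general pattern with Python's standard `gzip` module; it does not reflect how RegScan reads its input, and the function names and the `.gz` suffix check are assumptions made for illustration.

```python
import gzip

def open_genotype_file(path):
    """Open a plain-text or gzip-compressed file transparently for reading."""
    if path.endswith(".gz"):
        # Decompress on the fly; no intermediate file is written
        return gzip.open(path, "rt")
    return open(path, "r")

def count_lines(path):
    """Stream through the file once and count its lines."""
    with open_genotype_file(path) as handle:
        return sum(1 for _ in handle)
```

The same reading loop then works for both kinds of input, and the compressed case pays only the decompression cost.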

Practical example to demonstrate RegScan analysis
We tested RegScan with a 1000 Genomes imputed dataset of 38.02 million markers, 873 random individuals of European descent and 44 clinical traits to illustrate how RegScan functions. The traits were adjusted for gender and age and inverse-normally transformed. The ratio traits were created with RegScan's "combitable" function. All marker-trait pairs (single or ratio values) with an association p-value under 10^-3 were written into the main output by the "gwas" function for further analysis by the other functions of RegScan. This threshold was chosen arbitrarily for this example to ensure that the lowest p-values were not missed in decision-making (see below). All p-values could have been written out instead, but that would have yielded very large files. Since we are generally interested only in the top hits, the value of 10^-3 works well.
The two examples below serve only as proof of principle; they do not represent a scientific study.
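
To make the notion of ratio traits concrete, the sketch below builds all pairwise ratios from a set of single traits. It is not the "combitable" implementation: whether RegScan generates one ratio per unordered pair or both orderings is not stated here, so using unordered pairs is an assumption, as are the function name, the input layout and the requirement that denominators are nonzero.

```python
from itertools import combinations

def ratio_traits(traits):
    """Build pairwise ratio traits from single traits.

    traits: dict mapping trait name -> list of per-individual values
            (values used as denominators are assumed to be nonzero)
    Returns a dict mapping "a/b" -> list of per-individual ratios,
    with one entry per unordered pair of traits.
    """
    ratios = {}
    for a, b in combinations(sorted(traits), 2):
        ratios[f"{a}/{b}"] = [x / y for x, y in zip(traits[a], traits[b])]
    return ratios
```

Under this unordered-pair assumption, n single traits yield n(n-1)/2 ratio traits, which is why the number of traits to analyze grows quickly.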

1) Test if known trait-associated markers were identified by RegScan
We used bilirubin levels as the trait and compared the top RegScan-identified hits with the published bilirubin-associated markers to test whether RegScan was able to detect any of the published markers. Our five top markers (rs111741722, rs887829, rs6742078, rs4148324, rs4148325) had a p-value under 10^-50, and they included the topmost hit of each of the published bilirubin-related association studies: Datta S. et al, November 28, 2011, Ann Hum Genet; Chen G. et al, November 16, 2011, Eur J Hum Genet; Bielinski S.J. et al, June 06, 2011, Mayo Clin Proc; Sanna S. et al, May 06, 2009, Hum Mol Genet; Johnson A.D. et al, May 04, 2009. The results were also confirmed with QuickTest. The p-values, effect sizes and standard errors computed by QuickTest and RegScan agreed completely.

2) Identify markers for the trait ratios
We used blood plasma total iron concentration as the lead trait (one of the two traits in the trait ratio) in this example. This serves as a practical example to illustrate how RegScan can be used to identify significant markers.
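
The ranking procedure that the next paragraph describes can be sketched in a few lines: for each trait ratio that passes the p-value limit, take the smaller of the two single-trait p-values (SSTP), divide it by the ratio p-value to obtain the Reliability Score (RS), and sort by RS. This is an illustrative reimplementation, not the "combifilter" code; the function name, input layout and any example values used with it are hypothetical.

```python
def reliability_scores(ratio_hits, single_p, p_limit=5e-8):
    """Rank trait ratio-marker pairs by Reliability Score (RS).

    ratio_hits: list of (marker, trait_a, trait_b, p_ratio) tuples
    single_p:   dict mapping (marker, trait) -> single-trait p-value
    p_limit:    only ratios with p_ratio below this limit are scored
    Returns a list of (rs, marker, "trait_a/trait_b"), highest RS first.
    """
    scored = []
    for marker, a, b, p_ratio in ratio_hits:
        if p_ratio >= p_limit:
            continue  # ratio is not significant enough to be a candidate
        # Smaller single-trait p-value (SSTP) for this marker
        sstp = min(single_p[(marker, a)], single_p[(marker, b)])
        scored.append((sstp / p_ratio, marker, f"{a}/{b}"))
    scored.sort(reverse=True)
    return scored
```

An RS well above 1 means the ratio is associated with the marker far more strongly than either single trait alone.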
We identified trait ratio candidates with RegScan's "combifilter" function by setting the trait ratio p-value limit at < 5x10^-8. For each marker, the p-values of the two single traits that make up the trait ratio were compared, and the smaller of the two was taken as the smaller single trait p-value (SSTP). Next, a Reliability Score (RS) was computed for each trait ratio-marker pair by dividing the SSTP by the p-value of the trait ratio (if the trait ratio p-value was < 5x10^-8). All these steps are performed automatically by the "combifilter" function. The RS indicates how much lower the p-value of the trait ratio is compared to that of the "best" single trait. Higher RS values were considered more significant, and all trait ratio-marker pairs were ranked according to the RS. The top hits based on the RS can be extracted as candidates. This method allows one to report a relatively short list of candidates for each trait. Here is an example for iron concentration: