Summary: While the R software is becoming a standard for the analysis of genetic data, classical population genetics tools are being challenged by the increasing availability of genomic sequences. Dedicated tools are needed for harnessing the large amount of information generated by next-generation sequencing technologies. We introduce new tools implemented in the adegenet 1.3-1 package for handling and analyzing genome-wide single nucleotide polymorphism (SNP) data. Using a bit-level coding scheme for SNP data and parallelized computation, adegenet enables the analysis of large genome-wide SNP datasets using standard personal computers.
Availability: adegenet 1.3-1 is available from CRAN: http://cran.r-project.org/web/packages/adegenet/. Information and support, including a dedicated discussion forum, can be found on the adegenet website: http://adegenet.r-forge.r-project.org/. adegenet is released with a manual and four tutorials totalling over 300 pages of documentation, and is distributed under the GNU General Public Licence (≥2).
Supplementary Information: Supplementary data are available at Bioinformatics online.
The free software R (R Development Core Team, 2011) is becoming a standard for the analysis of genetic data, offering a wealth of packages dedicated to population genetics (Jombart, 2008; Paradis, 2010), phylogenetics (Paradis et al., 2004; Schliep, 2011) or genome-wide association studies (Aulchenko et al., 2007; Clayton and Leung, 2007). Until recently, classical genetic marker data such as microsatellites could be analyzed using standard tools and personal computers. However, the increasing availability of genomic sequence data has challenged both the tools and the resources needed to carry out such analyses. While some specific packages have been developed for human association studies (Aulchenko et al., 2007; Clayton and Leung, 2007), more general tools for the analysis of the genetic structure of biological populations are needed. In this article, we introduce new tools implemented in the R package adegenet (Jombart, 2008) which allow large genomic datasets (e.g. hundreds of individuals typed for hundreds of thousands of SNPs) to be analyzed using standard personal computers. As an illustration, we show how a new implementation of the discriminant analysis of principal components (DAPC) (Jombart et al., 2010) can be used to identify structuring alleles from genomic data with minimal computing resources.
The sheer size of genomic sequence data often precludes their analysis using standard personal computers. While studies focusing on genetic diversity can reduce the size of the analyzed datasets by considering biallelic SNPs only, the subsequent amount of data often remains considerable and can require prohibitive amounts of random access memory (RAM). To address this issue, we implemented a new data representation which codes each biallelic SNP using a single bit. While such coding is not readily possible in R, the new class
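The idea of coding each biallelic SNP on a single bit can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the adegenet implementation: the function names `pack_snps` and `unpack_snps` are hypothetical, and the sketch assumes haploid genotypes coded 0/1 with no missing data.

```python
def pack_snps(genotypes):
    """Pack a sequence of 0/1 allele codes into bytes, 8 SNPs per byte.

    Hypothetical helper: assumes haploid, biallelic data without missing values.
    """
    packed = bytearray((len(genotypes) + 7) // 8)  # round up to whole bytes
    for i, allele in enumerate(genotypes):
        if allele not in (0, 1):
            raise ValueError("only biallelic 0/1 codes are supported")
        if allele:
            # SNP i is stored in bit (i % 8) of byte (i // 8)
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_snps(packed, n_snps):
    """Recover the first n_snps allele codes from the packed bytes."""
    return [(packed[i // 8] >> (i % 8)) & 1 for i in range(n_snps)]


snps = [0, 1, 1, 0, 0, 0, 1, 0, 1, 1]
packed = pack_snps(snps)
# The round trip must return the original genotypes
assert unpack_snps(packed, len(snps)) == snps
print(len(snps), "SNPs stored in", len(packed), "bytes")
```

Under these assumptions, one byte holds eight SNPs, so a genome of 100 000 biallelic SNPs occupies 12 500 bytes per individual before any overhead. The number of SNPs must be stored alongside the bytes, since the last byte may be only partly used.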
While the bit-level coding of SNP data is undoubtedly memory efficient, it also makes the internal structure of the objects far more complex. Considerable efforts have been made to simplify the handling and analysis of
Beyond the need for efficient data storage, the analysis of genome-wide SNP data also requires significant computing power. Fortunately, most computers now possess processors with multiple cores, which can be used to partition important tasks into several smaller operations executed simultaneously by the different cores. This approach can lead to appreciable reductions in computational time and is most useful for analyzing large datasets. By default, most procedures implemented for
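Partitioning a task across cores can be sketched as follows. This is a generic Python illustration rather than the mechanism used in adegenet: the helpers `split_blocks`, `allele_counts` and `parallel_allele_counts` are hypothetical, and the data layout (one list of 0/1 codes per SNP) is an assumption made for the example.

```python
from concurrent.futures import ProcessPoolExecutor


def allele_counts(block):
    """Count copies of the second allele for each SNP in one block.

    block: list of SNP columns, each a list of 0/1 codes (one per individual).
    """
    return [sum(column) for column in block]


def split_blocks(columns, n_blocks):
    """Split the SNP columns into at most n_blocks contiguous blocks."""
    if not columns:
        return []
    size = -(-len(columns) // n_blocks)  # ceiling division
    return [columns[i:i + size] for i in range(0, len(columns), size)]


def parallel_allele_counts(columns, n_cores=2):
    """Process each block in a separate worker, then concatenate in order."""
    blocks = split_blocks(columns, n_cores)
    with ProcessPoolExecutor(max_workers=n_cores) as pool:
        results = pool.map(allele_counts, blocks)  # map preserves block order
        return [count for block in results for count in block]


if __name__ == "__main__":
    # Five SNPs typed in three individuals (toy data)
    columns = [[0, 1, 1], [1, 1, 1], [0, 0, 0], [1, 0, 1], [0, 0, 1]]
    print(parallel_allele_counts(columns, n_cores=2))
```

Because each block of SNPs is processed independently, the only coordination needed is reassembling the per-block results in their original order. The gain depends on the task being large enough to offset the cost of starting workers and transferring data to them.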
Data interoperability can be a critical issue when large datasets are considered. Therefore, we made sure that genome-wide SNP data could be imported from standard formats into
We illustrate how a new implementation of DAPC for
We then apply DAPC to these data, choosing to retain 20 principal components in the prior dimension-reduction step.
Despite an unfavourable signal-to-noise ratio, DAPC neatly discriminates the two groups of individuals (Fig. 1a). Interestingly, it also clearly identifies the structuring SNPs (Fig. 1b). Despite its simplicity, this example suggests that DAPC could be a useful tool for identifying structuring alleles from genome-wide SNP data.
adegenet 1.3-1 provides new tools for the analysis of genome-wide SNP data using standard personal computers. As the availability of genomic data increases faster than computing resources, efficient data representation and parallel computation represent viable alternatives to the mere increase of raw computing power. As such, we hope that the new class
We thank David Aanensen, Lucy Weinert, Christophe Knecht and Lee Li-Foh for interesting discussions about genomic data, and two anonymous reviewers for their useful comments.
Funding: ERC Grant (P33585) and NIGMS MIDAS Programme to Neil Ferguson.
Conflict of Interest: none declared.