LASER server: ancestry tracing with genotypes or sequence reads

Abstract
Summary: To enable direct comparison of ancestry background in different studies, we developed LASER to estimate individual ancestry by placing either sequenced or genotyped samples in a common ancestry space, regardless of the sequencing strategy or genotyping array used to characterize each sample. Here we describe the LASER server to facilitate application of the method to a wide range of genetic studies. The server provides genetic ancestry estimation for different geographic regions and user-friendly interactive visualization of the results.
Availability and Implementation: The LASER server is freely accessible at http://laser.sph.umich.edu/
Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
Advancing genetic studies of rare variants will require very large sample sizes. Achieving these sample sizes is challenging both because of the need to combine samples and data across multiple sources and because of the need to guard against population structure, which can lead to spurious signals in genetic association tests. Typically, large studies estimate genetic ancestry of study participants and use the results to control for population structure or focus analyses on matched subsets of the data (Price et al., 2010). With large amounts of genetic data from many studies, there is a pressing need for tools that can provide comparable ancestry estimates using different types of genetic data and different sets of variants. We have developed the LASER method, which infers ancestry by placing array-genotyped or sequenced individuals in a predefined reference ancestry space (Wang et al., 2014, 2015). The resulting ancestry estimates are directly comparable across studies, as long as the same reference space is used in the LASER analysis.
Here, we develop a web server that allows researchers to estimate and compare genetic ancestry of genotyped and sequenced samples from different studies without pooling raw data, facilitating ancestry matching and collaboration across studies. The ancestry information can be useful for deciding which samples to include in joint association analysis or in further sequencing or genotyping experiments.

Implementation
The server is based on the LASER method, which can estimate ancestry using either genotypes or sequence reads (Supplementary Data).
A key component of LASER is the ancestry reference panel: a densely genotyped dataset of diverse populations. LASER applies principal components analysis (PCA) on the ancestry reference panel to construct a K-dimensional ancestry space S, which defines a common ancestry coordinate system for samples from different studies. To assign coordinates to a single study individual, LASER uses variants shared between this individual and the N reference panel members to perform a PCA of the N + 1 individuals and obtains the K′-dimensional (K′ ≥ K) PC space S′. LASER then performs a projection Procrustes analysis (Gower and Dijksterhuis, 2004) to find a set of transformations that project the N reference individuals from S′ to S. The transformations maximize the Procrustes similarity between the projected coordinates and the coordinates of the reference samples in S. Finally, LASER uses these transformations to place the study individual from S′ into S. The accuracy of the placement is partly reflected by the Procrustes similarity t, a score specific to each study individual. This procedure repeats until all study individuals are mapped to the same space S, regardless of differences in data types and variant sets. Importantly, the LASER method avoids the shrinkage of projected coordinates that is common in other projection PCA analyses.
© The Author 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
The LASER server currently includes three built-in ancestry reference panels: a worldwide panel to estimate continental ancestry (the HGDP dataset, including 938 individuals from 53 populations; Li et al., 2008), a European panel to estimate fine-scale ancestry within Europe (the POPRES dataset, including 1385 individuals from 37 populations; Novembre et al., 2008), and an Asian panel aggregated from five studies (Li et al., 2008; Teo et al., 2009; The 1000 Genomes Project Consortium, 2015; Xing et al., 2010, 2013) to estimate fine-scale ancestry within Asia (836 individuals from 43 populations). To improve ancestry estimation, we expanded each of these panels to millions of SNPs by imputation (Das et al., 2016; The 1000 Genomes Project Consortium, 2015).
The ancestry reference coordinates for each panel are pre-computed using only the directly genotyped SNPs to avoid potential artifacts introduced by imputation. Selecting an appropriate ancestry reference panel is critical for LASER. When an individual's ancestry is not represented in the reference panel, LASER may cluster the individual with reference populations of a distant genetic background, yielding misleading results (Wang et al., 2015). A good practice is to start with a worldwide reference panel and gradually focus on relevant regional panels. To help diagnose whether a reference panel is appropriate, we propose a novel statistic Z that compares each study individual's genetic variance with that of its nearest neighbors in the reference space (Supplementary Data). We showed that the proposed Z score is highly informative when a European reference panel is mistakenly used for non-European samples (Supplementary Fig. S1).
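As an illustration of the placement procedure described above (joint PCA of the N + 1 individuals followed by a Procrustes transformation into S), the following Python sketch may be helpful. It is not the LASER implementation: it handles genotypes only, it assumes K′ = K so that the ordinary closed-form Procrustes solution can stand in for the projection Procrustes analysis, and it computes t as sqrt(1 − D), where D is the normalized residual sum of squares, which is assumed here to correspond to the Procrustes similarity. The function names and the simulated toy data are hypothetical.

```python
import numpy as np

def pca_coords(G, k):
    """Top-k PC coordinates for the rows of a samples x SNPs genotype matrix."""
    G = np.asarray(G, dtype=float)
    Z = G - G.mean(axis=0)                  # center each SNP
    sd = Z.std(axis=0)
    Z = Z[:, sd > 0] / sd[sd > 0]           # drop monomorphic SNPs, then scale
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :k] * s[:k]

def procrustes_fit(Y, X):
    """Scale rho, orthogonal matrix A and shift b minimizing ||rho*Y@A + b - X||."""
    y_mean, x_mean = Y.mean(axis=0), X.mean(axis=0)
    Yc, Xc = Y - y_mean, X - x_mean
    U, s, Vt = np.linalg.svd(Yc.T @ Xc)
    A = U @ Vt                              # rotation/reflection
    rho = s.sum() / (Yc ** 2).sum()         # scaling
    b = x_mean - rho * y_mean @ A           # translation
    # Similarity sqrt(1 - D), D = normalized residual sum of squares
    t = s.sum() / np.sqrt((Yc ** 2).sum() * (Xc ** 2).sum())
    return rho, A, b, t

def place_individual(G_ref, g_new, X_ref):
    """Place one study individual into the reference space (assumes K' = K)."""
    k = X_ref.shape[1]
    coords = pca_coords(np.vstack([G_ref, g_new]), k)    # PCA of N + 1 samples
    rho, A, b, t = procrustes_fit(coords[:-1], X_ref)    # fit on the N references
    return rho * coords[-1] @ A + b, t                   # transform the newcomer

# Toy data (simulated, for illustration only): 3 populations, 20 samples each
rng = np.random.default_rng(0)
freqs = rng.uniform(0.1, 0.9, size=(3, 500))
labels = np.repeat([0, 1, 2], 20)
G_ref = rng.binomial(2, freqs[labels])
X_ref = pca_coords(G_ref, 2)                # reference space S with K = 2
g_new = rng.binomial(2, freqs[1])           # study individual from population 1
coord, t = place_individual(G_ref, g_new, X_ref)
```

Because the transformation is fitted on the reference individuals only and then applied to the study individual, each study sample can be processed independently of the others.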
The LASER server has a user-friendly web interface based on the Cloudgene platform (Schönherr et al., 2012) where users can select a relevant ancestry panel and upload their data. The server accepts standard VCF files for genotype data and a matrix format to store read counts and estimated per-base error rates from BAM files for sequence data; a companion utility is available for users to generate the input files from their BAM files. To facilitate quick exploration of ancestry, the LASER server generates both tabular summaries and interactive 2D/3D visualizations of the estimated coordinates. The interactive features include zooming, rotating, panning and displaying in a dynamic pie chart the ancestry composition of the k nearest neighbors for any selected individual.
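The nearest-neighbor summary behind the pie chart can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `knn_composition` is hypothetical, Euclidean distance in the ancestry space is assumed, and the coordinates and population labels are made-up toy values rather than server output.

```python
from collections import Counter
import numpy as np

def knn_composition(query, ref_coords, ref_labels, k=10):
    """Fraction of each population among the k reference samples nearest to query."""
    ref_coords = np.asarray(ref_coords, dtype=float)
    # Euclidean distance in the ancestry space (an assumption of this sketch)
    dist = np.linalg.norm(ref_coords - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(dist)[:k]
    counts = Counter(ref_labels[i] for i in nearest)
    return {pop: n / len(nearest) for pop, n in counts.items()}

# Toy reference coordinates and population labels (illustrative only)
ref_coords = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [1.1, 1.0]]
ref_labels = ["A", "A", "A", "B", "B"]
composition = knn_composition([0.05, 0.05], ref_coords, ref_labels, k=4)
```

Here the four nearest neighbors are three samples labeled "A" and one labeled "B", so the composition is 75% "A" and 25% "B".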

Example
We tested the LASER server on 12 940 exomes sequenced at 80X depth (WES) from the T2D-GENES and GoT2D studies (Fuchsberger et al., 2016). These data include five ancestry groups: European, East Asian, South Asian, Hispanic and African American. After uploading a VCF file of genotypes, the LASER server automatically identified 12 719 SNPs overlapping between the T2D-GENES/GoT2D data and the non-imputed HGDP panel, which defines a worldwide ancestry space. LASER analysis (K′ = 20, K = 4) suggested this was sufficient to accurately estimate continental ancestry (average t = 0.998). We observed five clusters in a 3D visualization of the top PCs, corresponding to the five ancestry groups (Fig. 1).
Among the 12 940 individuals, we also have whole genome sequence data (WGS, 5X) for 2335 Europeans from the GoT2D study, including British, Finnish, German and Swedish individuals. We placed these individuals on a European ancestry map based on the POPRES panel. The results based on genotypes from WGS data and sequence reads from WES data are highly similar (Procrustes similarity t₀ = 0.9198, Pearson correlation 0.9424 for PC1 and 0.9056 for PC2; Fig. 2), with GoT2D samples clustering with populations from their geographic regions. This example demonstrates that LASER can provide comparable ancestry estimates based on different types of data. The WES-based results are noisier than the WGS-based results due to the small number of targeted SNPs and low coverage across off-target regions in the WES data; the concordance between WES- and WGS-based results increases for samples with a higher individual-specific Procrustes score t (Fig. 2). In practice, users can filter out samples with insufficient data for ancestry estimation based on t. We note that by using a reference panel, LASER is more robust to the sampling distribution than standard PCA, for which uneven sampling of populations can distort the top PCs (McVean, 2009). In our example, standard PCA cannot separate British, German and Swedish samples by PC1 and PC2 because the Finnish sample is much larger than the others and thus drives the first two PCs (Supplementary Fig. S2).
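The filtering on t and the per-PC comparison mentioned above might be done along the following lines once the coordinates and scores are available as arrays. This is a hedged sketch: the function names are hypothetical, the cutoff of 0.9 is an arbitrary illustrative value rather than a recommendation from this work, and the sample IDs and coordinates are toy values.

```python
import numpy as np

def filter_by_t(sample_ids, t_scores, min_t):
    """Keep samples whose individual-specific Procrustes score t is at least min_t."""
    keep = np.asarray(t_scores, dtype=float) >= min_t
    return [sid for sid, ok in zip(sample_ids, keep) if ok]

def pc_correlations(coords_a, coords_b):
    """Pearson correlation, per PC, between two coordinate sets for the same samples."""
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    return [float(np.corrcoef(a[:, j], b[:, j])[0, 1]) for j in range(a.shape[1])]

# Toy inputs (illustrative only)
sample_ids = ["s1", "s2", "s3"]
t_scores = [0.99, 0.80, 0.95]
kept = filter_by_t(sample_ids, t_scores, min_t=0.9)   # 0.9 is a hypothetical cutoff

coords_wgs = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 1.0]])
coords_wes = 2.0 * coords_wgs + 1.0                   # same pattern, rescaled
r_per_pc = pc_correlations(coords_wgs, coords_wes)
```

An appropriate cutoff for t depends on the data type and the number of overlapping SNPs, so it is best chosen by inspecting the distribution of t in the study at hand.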
The LASER server parallelizes ancestry estimation, and the total runtime for each job depends on the number of available CPUs. Ancestry estimation for a single study individual takes from a few seconds to several minutes, depending on the input data type (genotypes or sequence reads), the sample size of the ancestry reference panel, and the number of SNPs used in the analysis (Supplementary Table S2).

Conclusion
With a unified analysis framework and preprocessed ancestry reference panels, the LASER server allows users to map genotyped or sequenced samples from different studies into a common ancestry space without pooling the raw data. The ancestry estimates are directly comparable across studies, and thus can facilitate collaborations and help identify ancestry-matched external controls to boost power in disease studies.