Christopher Heffelfinger, Christopher A Fragoso, Mathias Lorieux, Constructing linkage maps in the genomics era with MapDisto 2.0, Bioinformatics, Volume 33, Issue 14, July 2017, Pages 2224–2225, https://doi.org/10.1093/bioinformatics/btx177
Abstract
Genotyping by sequencing (GBS) generates datasets that are challenging to handle with current genetic mapping software that offers a graphical interface. Geneticists need new user-friendly computer programs that can analyze GBS data on desktop computers. This requires improvements in computational efficiency, both in terms of speed and use of random-access memory (RAM).
MapDisto v.2.0 is a user-friendly computer program for construction of genetic linkage maps. It includes several new major features: (i) handling of very large genotyping datasets like the ones generated by GBS; (ii) direct importation and conversion of Variant Call Format (VCF) files; (iii) detection of linkage, i.e. construction of linkage groups, in case of segregation distortion; (iv) data imputation on VCF files using a new approach, called LB-Impute; (v) QTL detection via a new R/qtl graphical interface. Features (i) to (iv) rely on new Java modules that MapDisto calls transparently.
The program is available free of charge at mapdisto.free.fr.
Supplementary data are available at Bioinformatics online.
1 Introduction
MapDisto is a user-friendly computer program for construction of genetic linkage maps in diploid species (Lorieux, 2012). It has been used in numerous publications to construct maps of various plant, fungal or animal organisms (see http://mapdisto.free.fr/MapDisto/Refs for a list of references that use MapDisto). In this note, we present a new major implementation that can handle very large marker datasets obtained by genotyping-by-sequencing (GBS) (Elshire et al., 2011; Heffelfinger et al., 2014), and build linkage groups more efficiently in case of segregation distortion (Supplementary Material S1).
2 Main new features
MapDisto now offloads a variety of operations to an included ‘MapDistoAddons.jar’ Java package. The offloaded operations include Variant Call Format (VCF) file conversion, linkage group calculation and data imputation. Similarly, data imputation and correction are performed by LB-Impute (Fragoso et al., 2016) through the ‘LB-Impute.jar’ Java module. QTL detection is performed via a new graphical interface, which runs R/qtl commands in the background.
2.1 Large dataset handling and speed
The maximum dataset size depends on the amount of available memory. On modern desktop computers equipped with 16 or 32 GB of RAM, hundreds of thousands of markers could theoretically be processed. In the previous version (v.1.7), based on Visual Basic for Applications (VBA) 32-bit code, the maximum dataset that MapDisto could handle was ∼8900 or ∼16 000 markers on the Microsoft Windows and Apple OS X platforms, respectively, and ∼32 000 on the 64-bit version of Microsoft Windows. This was due to internal limitations on array size in the VBA virtual machine. We also observed a 20–60x increase in computation speed (e.g. for LOD scores or recombination fractions) compared to v.1.7.
As an example, we were able to process a rice population consisting of 181 RILs scored with 44 398 GBS markers on a microcomputer equipped with as little as 4 GB of RAM and running OS X 10.10.5 and Excel 2011. It took only ∼10 minutes to compute a complete LOD score and recombination fraction matrix and to determine the linkage groups. This performance is achieved through multi-threading in the Java module.
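To make concrete what one cell of the LOD score and recombination fraction matrix holds, the following is a minimal sketch for a single marker pair. It assumes the simplest case, a backcross or doubled-haploid population with genotypes coded A/B and missing data coded ‘-’. The function name and data layout are illustrative assumptions, not MapDisto's Java implementation, and RIL or F2 populations require different estimators.

```python
import math

MISSING = "-"  # assumed missing-data symbol for this sketch


def pairwise_linkage(marker1, marker2):
    """Return (recombination fraction, LOD) for two markers.

    Each marker is a string of 'A'/'B'/'-' codes, one per individual.
    Assumes a backcross or doubled-haploid population, where every
    A/B mismatch between the two markers is one recombinant.
    """
    pairs = [(a, b) for a, b in zip(marker1, marker2)
             if a != MISSING and b != MISSING]
    n = len(pairs)
    if n == 0:
        return None, 0.0  # no informative individuals
    k = sum(1 for a, b in pairs if a != b)  # recombinant count
    r = k / n
    if r >= 0.5:
        return r, 0.0  # no evidence for linkage
    # LOD = log10 of L(r) / L(0.5) under a binomial model
    lod = n * math.log10(2)
    if k > 0:
        lod += k * math.log10(r)
    lod += (n - k) * math.log10(1 - r)
    return r, lod


r, lod = pairwise_linkage("AAAABBBB", "AAAABBBA")
print(f"r = {r:.3f}, LOD = {lod:.2f}")
```

A full matrix repeats this calculation for every marker pair, so the work grows with the square of the marker count. Because the pairs are independent, the calculation is a natural candidate for multi-threading.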
2.2 VCF conversion
MapDisto can now convert v.4.1 VCF files into MapDisto format and automatically import the converted file. If the parents' identifiers in the VCF file are specified, A (homozygous parent 1), B (homozygous parent 2) and H (heterozygous) alleles will be assigned to calls based on the parental genotypes. If the parents' identifiers are not specified, alleles at a given marker will be assigned based on the first allele that the algorithm observes in the dataset.
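The parent-based assignment can be sketched as follows. This is a minimal illustration, not the code in ‘MapDistoAddons.jar’: the function names, the ‘-’ missing symbol, and the rule of discarding markers whose parents are heterozygous or identical are all assumptions. It relies only on the VCF convention that GT is the first sub-field of each sample column.

```python
def _alleles(sample_field):
    """Extract the two allele indices from a VCF sample field, or None."""
    gt = sample_field.split(":")[0]          # GT is the first sub-field
    parts = gt.replace("|", "/").split("/")  # treat phased calls as unphased
    if len(parts) != 2 or "." in parts:
        return None
    return parts


def call_to_code(sample_field, parent1_field, parent2_field):
    """Translate one genotype call to 'A', 'B', 'H' or '-' (missing)."""
    s = _alleles(sample_field)
    p1 = _alleles(parent1_field)
    p2 = _alleles(parent2_field)
    if s is None or p1 is None or p2 is None:
        return "-"
    # Assumption: only markers with homozygous, contrasting parents are used.
    if p1[0] != p1[1] or p2[0] != p2[1] or p1[0] == p2[0]:
        return "-"
    a, b = p1[0], p2[0]
    if s[0] == s[1] == a:
        return "A"
    if s[0] == s[1] == b:
        return "B"
    if set(s) == {a, b}:
        return "H"
    return "-"


def convert_record(line, samples, parent1, parent2):
    """Convert one tab-separated VCF data line to (marker name, codes)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos = fields[0], fields[1]
    calls = dict(zip(samples, fields[9:]))  # sample columns start at index 9
    codes = {name: call_to_code(field, calls[parent1], calls[parent2])
             for name, field in calls.items()
             if name not in (parent1, parent2)}
    return f"{chrom}_{pos}", codes
```

For example, `convert_record` applied to a data line whose sample columns are `0/0`, `1/1`, `0/1` and `./.`, with the first two columns named as the parents, assigns H to the third sample and the missing symbol to the fourth.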
2.3 Marker and individual filtering
Recombinant populations contain a limited number of recombination breakpoints. In highly saturated genotyping experiments, like GBS assays, markers often completely co-segregate and the information they provide about recombination is thus redundant. The ‘Filter loci’ function identifies the minimum number of genetic markers that provides the full recombination information in the population.
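The idea of removing redundant, co-segregating markers can be sketched as below. This is a simplified illustration under the assumption that two markers are redundant only when their genotype strings are exactly identical. How ‘Filter loci’ treats missing data is not described here, and the function name is hypothetical.

```python
def bin_cosegregating(markers):
    """Group markers that share exactly the same genotype vector.

    markers: dict mapping marker name -> genotype string (one code per
    individual). Returns a list of bins (lists of marker names) in
    first-seen order; the first name in each bin can act as its
    representative.
    """
    bins = {}
    for name, genotypes in markers.items():
        bins.setdefault(genotypes, []).append(name)
    return list(bins.values())


markers = {
    "m1": "AABB",
    "m2": "AABB",  # co-segregates with m1
    "m3": "ABBB",
    "m4": "AABB",  # co-segregates with m1
}
bins = bin_cosegregating(markers)
representatives = [names[0] for names in bins]
print(representatives)
```

In this toy dataset the four markers collapse to two representatives, which together carry all the recombination information present.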
A new ‘compute statistics on genotypes’ function allows filtering of individuals for number of recombination breakpoints (or transitions), percentage of missing data and percentage of genotypic configuration (homozygous parent 1 or 2, and heterozygous).
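The per-individual statistics named above can be computed along these lines. This is a minimal sketch that assumes genotypes are given in map order and coded A/B/H with ‘-’ for missing data. The function name, the returned keys, and the choice to count transitions only between called genotypes are assumptions.

```python
def individual_stats(genotypes):
    """Summarize one individual's genotype string (markers in map order).

    Returns the number of genotype transitions (apparent breakpoints),
    the percentage of missing calls, and the percentage of each
    genotype class among called markers.
    """
    called = [g for g in genotypes if g != "-"]
    # A transition is any change of genotype between consecutive calls.
    transitions = sum(1 for x, y in zip(called, called[1:]) if x != y)
    total = len(genotypes)
    missing_pct = 100.0 * (total - len(called)) / total if total else 0.0
    stats = {"transitions": transitions, "missing_pct": missing_pct}
    for code in "ABH":
        share = 100.0 * called.count(code) / len(called) if called else 0.0
        stats[f"pct_{code}"] = share
    return stats


print(individual_stats("AAA-BBHH"))
```

Individuals could then be filtered by comparing these values against user-chosen thresholds, for instance excluding any individual with an implausibly high transition count.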
2.4 Segregation distortion and detection of linkage
We developed and implemented in MapDisto a new approach to identify linkage groups (LG) in case of segregation distortion (SD), a common phenomenon that can alter genetic maps (Lorieux, 1995a, b). Several patterns of SD can occur and will lead to a variety of possible effects on the sensitivity and specificity of statistical tests for detection of linkage. The method is detailed in the accompanying Supplementary Material S2.
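As background, segregation distortion at a single marker is commonly detected with a chi-square goodness-of-fit test against the expected Mendelian ratio. The sketch below shows that standard test for a 1:1 expectation, as in backcross, DH or RIL populations. It is not the linkage-group method of Supplementary Material S2, and the function name is an assumption.

```python
import math


def segregation_chi2(n_a, n_b):
    """Chi-square test of a 1:1 segregation ratio at one marker.

    n_a, n_b: counts of the two homozygous (or parental) classes.
    Returns (chi-square statistic, p-value) with 1 degree of freedom.
    """
    n = n_a + n_b
    if n == 0:
        return 0.0, 1.0
    chi2 = (n_a - n_b) ** 2 / n
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


chi2, p = segregation_chi2(60, 40)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

A marker with a small p-value deviates from the expected ratio. Ordinary linkage tests between such markers may then gain or lose power, which is the problem that a distortion-aware approach addresses.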
2.5 Data imputation and correction
MapDisto employs the new LB-Impute (for Low coverage, Biallelic Impute) algorithm to impute data in biallelic populations with residual heterozygosity typed via low-coverage sequencing (Fragoso et al., 2016). See Supplementary Material S3 for a brief explanation of how LB-Impute works. An additional imputation algorithm, Breakpoint Imputation (BP-Impute), is designed to further resolve missing markers surrounding recombination breakpoints (Fragoso et al., 2017). Details of the algorithm are also outlined in Supplementary Material S3.
2.6 QTL search
Although MapDisto v.1.x already had built-in features to perform F-test and distribution-free Kruskal-Wallis quantitative trait locus (QTL) search, this new version provides a graphical interface to perform more advanced analyses, powered by the R/qtl package (Arends et al., 2010). It provides access to the ‘scanone’ and ‘scantwo’ (for two-dimensional interaction analysis) scan types using different methods: interval mapping, Haley & Knott, extended Haley & Knott, and simple marker regression. The user can choose to work on all traits or a specific one, and on all chromosomes or a specific one. Graphical results (one- and two-dimensional scans, trait summary) are displayed as portable document format (PDF) files. In addition, one of the most tedious tasks in using R/qtl is the preparation of input data files; MapDisto transparently formats and exports input data as R/qtl-compatible files.
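To illustrate the kind of file preparation involved, the sketch below writes a dataset in the comma-delimited layout that R/qtl's `read.cross` accepts with `format="csv"`. The first row holds trait and marker names, the second row holds chromosome identifiers, the third row holds map positions, and each later row holds one individual. The function name and the input structures are assumptions, and the file MapDisto actually exports may be laid out differently.

```python
import csv
import io


def write_rqtl_csv(handle, trait_names, markers, individuals):
    """Write an R/qtl 'csv'-format cross file.

    trait_names: list of phenotype column names.
    markers: list of (name, chromosome, position_cM) tuples in map order.
    individuals: list of (trait_values, genotype_codes) pairs, with
        genotype codes such as 'A', 'H', 'B' and '-' for missing.
    """
    writer = csv.writer(handle, lineterminator="\n")
    blanks = [""] * len(trait_names)  # phenotype columns are empty in rows 2-3
    writer.writerow(list(trait_names) + [m[0] for m in markers])
    writer.writerow(blanks + [m[1] for m in markers])
    writer.writerow(blanks + [m[2] for m in markers])
    for trait_values, genotype_codes in individuals:
        writer.writerow(list(trait_values) + list(genotype_codes))


buffer = io.StringIO()
write_rqtl_csv(
    buffer,
    trait_names=["height"],
    markers=[("m1", "1", 0.0), ("m2", "1", 5.2)],
    individuals=[([12.5], "AH"), ([10.1], "B-")],
)
print(buffer.getvalue())
```

A file written this way could then be loaded in R with `read.cross(format="csv", file=...)`, after which `scanone` or `scantwo` can be run on the resulting cross object.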
2.7 Population types handled
MapDisto handles F2, backcross (BC), doubled haploids (DH), recombinant inbred lines (RILs) obtained by single-seed descent, and highly recombinant inbred lines (HRILs). All the new features of v.2 are available for these population types, with the exception of the linkage detection in case of segregation distortion, which is still in development for F2 populations.
Acknowledgements
We thank Dr Jean-François Rami for his valuable advice and tips on VBA-R/qtl integration. We are grateful to the MapDisto users for their constant feedback. We also thank two anonymous reviewers for their valuable inputs and comments to improve the manuscript.
Funding
This study was supported in part by NSF Awards 1444478 and 1419501, and by the Biomedical Informatics Research Training at Yale, project T15 LM 007056 (co-Directors of the grant: Cynthia Brandt and Michael Krauthammer). Computational analyses were performed on the Yale University Biomedical High Performance Computing Cluster, which is supported by National Institutes of Health grants RR19895 and RR029676-01.
Conflict of Interest: none declared.
References