statgenMPP: an R package implementing an IBD-based mixed model approach for QTL mapping in a wide range of multi-parent populations

Abstract Motivation Multi-parent populations (MPPs) are popular for QTL mapping because they combine wide genetic diversity in parents with easy control of population structure, but a limited number of software tools for QTL mapping are specifically developed for general MPP designs. Results We developed an R package called statgenMPP, adopting a unified identity-by-descent (IBD)-based mixed model approach for QTL analysis in MPPs. The package offers easy-to-use functionalities of IBD calculations, mixed model solutions and visualizations for QTL mapping in a wide range of MPP designs, including diallele, nested-association mapping populations, multi-parent advanced genetic inter-cross populations and other complicated MPPs with known crossing schemes. Availability and implementation The R package statgenMPP is open-source and freely available on CRAN at https://CRAN.R-project.org/package=statgenMPP Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
Multi-parent population (MPP) designs capture wide genetic diversity and overcome the drawbacks of low minor-allele frequencies and population structures (Arrones et al., 2020;Scott et al., 2020). General MPPs, e.g. diallele (Coles et al., 2010;Giraud et al., 2017), NAM (Yu et al., 2008) and MAGIC (Gardner et al., 2016;Huang et al., 2015) are nowadays widely used in genetic studies and plant breeding programs. Statistical models that can be used for MPP designs are either family-based (linkage mapping) or populationbased (linkage disequilibrium) methods (Giraud et al., 2014;Wü rschum et al., 2012;Xu et al., 2017), but a limited number of software tools is specifically developed for MPP designs.
In this Application Note, we present an easy-to-use R package, statgenMPP, adopting an identity-by-descent (IBD)-based mixed model approach for QTL mapping. Compared to other tools, statgenMPP integrates a framework of IBD calculation (Boer and van Rossum, 2021b;Zheng et al., 2014Zheng et al., , 2015 with linear mixed models (Boer and van Rossum, 2021a). The IBD-based mixed model approach estimates random QTL effects in relation to IBD probabilities of parental origins across the offspring genome (Li et al., 2021;Wei and Xu, 2016) while accounting for polygenic and family background genetic variation, which has been proven to increase the mapping power and resolution of QTLs for simulated and empirical MPPs (Li et al., 2021). Single crosses can also be analyzed with statgenMPP, for a full list of available population types, see Boer and van Rossum (2021b). Figure 1a demonstrates the workflow of IBD-based QTL mapping using statgenMPP that comprises two main steps: IBD calculation and QTL mapping.

IBD calculation
The framework of hidden Markov models (HMM) and inheritance vectors (Zheng et al., 2014(Zheng et al., , 2015Boer and van Rossum, 2021b) is employed for the IBD calculation. For a wide range of MPP designs,

5134
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Applications Note IBD probabilities are calculated by the function calcIBDMPP() at a customizable grid (cM) for the specified population type. For complex MAGIC designs and designs with complicated pedigree structures, the IBD probabilities can be first calculated using RABBIT (Zheng, 2019) and then imported by the function readRABBITMPP().

QTL mapping
IBD probabilities between parents and offspring, as design vectors, or genetic predictors, are fitted in a mixed model for QTL mapping using the function selQTLMPP(). The IBD-based mixed model approach is described by Li et al. (2021). To test each position on a 1D grid along the genome, a single locus QTL model is fitted whose effects are modeled as random in a mixed effects model: Y is a vector with phenotypes; X is the design matrix indicating to which family each individual belongs; b is a vector of fixed family intercepts; M q is the design matrix containing the expected number of parental alleles as a function of IBD probabilities; a q is the vector of random parental effects with the variance-covariance (VCOV) structure I P r 2 q , in which I P is the identity matrix for P parents and r 2 q is the genetic variance of the QTL effects; it is optional to include a polygenic term g whose VCOV is described by the kinship matrix K; the residual term e has a family-specific VCOV structure F k¼1 I n k r 2 e k in which r 2 e k is the residual variance of individuals in the kth family ðk ¼ 1; 2; :::; FÞwith family size n k . The linear mixed model is fitted and variance components are estimated based on restricted maximum likelihood. Variance components corresponding to putative QTLs (r 2 q ¼ 0 versus r 2 q 6 ¼ 0) are evaluated by likelihood ratio tests (LRT) that approximate a mixture of v 2 distributions (Self and Liang, 1987). Multiple rounds of genome QTL scans can be performed until either (i) no new QTL outside of a certain window size is found with a Àlog10(p) value below a predefined threshold or (ii) a predefined maximum number of QTLs is reached.

Applications
We demonstrate the main functionalities of statgenMPP using two publicly available MPP designs-a soybean NAM design and a barley complex design (datasets are provided in the supplementary material with R codes). Other examples with full details are available in the vignette of statgenMPP.
All computations were performed in (64-bit) R 4.2.1 (R Core Team, 2022) and a 3.10 GHz Intel Core i5 processor computer with 16GB of RAM and Windows 10 operating system. We used the parallel option of statgenMPP using four cores, and using the default values, not including a kinship matrix.
We selected six families (Fig. 1b, upper panel) with a total population size of 732 genotypes from the soybean NAM project (https:// soybase.org/SoybeanNAM/index.php) (Xavier et al., 2018) for analysis to demonstrate the functionalities of statgenMPP. The consensus map containing 4289 markers (Song et al., 2017) was used to calculate IBD probabilities on a regular 5 cM grid of evaluation points. We map QTLs for the trait of seed weight ('mean_seedWT') as an example. The total computation time was 1.03 min. The second example is a complex barley MPP design (Liller et al., 2017), with a total population size of 916 genotypes, for QTL mapping for awn length ('Awn_length') ( Fig. 1c, upper panel). The total computation time was 1.43 min. IBD calculation for the complex design was performed by RABBIT, and then the output was imported by the readRABBITMPP() function in statgenMPP.
Results of QTL mapping for soybean NAM and barley complex MPP designs are shown in Figure 1b and c. For example in the barley complex design, all QTLs for awn length in the barley complex design can be confirmed from the previous study where the strong QTL on chromosome 7 was successfully fine mapped (Liller et al., 2017). Further details on how to use the package for visualization can be found in the statgenMPP vignette.

Conclusion
An increasing range and number of MPP designs become available for QTL mapping in breeding programs and genetic studies. We introduce the R package statgenMPP that covers the demands for versatile QTL analysis in MPP designs. statgenMPP contains a unified IBD-based mixed model framework for mapping multi-allelic QTLs. We analyzed multiple MPP data sets with statgenMPP, whose data are available within the package, and demonstrated the theoretical and practical advantages of our approach. Extensions of statgenMPP will deal with epistasis, pleiotropy and QTL-byenvironment analysis, to allow the user to explore the full potential of MPP designs.