wgd—simple command line tools for the analysis of ancient whole-genome duplications

Abstract Summary Ancient whole-genome duplications (WGDs) have been uncovered in almost all major lineages of life on Earth and the search for traces or remnants of such events has become standard practice in most genome analyses. This is especially true for plants, where ancient WGDs are abundant. Common approaches to find evidence for ancient WGDs include the construction of KS distributions and the analysis of intragenomic colinearity. Despite the increased interest in WGDs and the acknowledgment of their evolutionary importance, user-friendly and comprehensive tools for their analysis are lacking. Here, we present an easy to use command-line tool for KS distribution construction named wgd. The wgd suite provides commonly used KS and colinearity analysis workflows together with tools for modeling and visualization, rendering these analyses accessible to genomics researchers in a convenient manner. Availability and implementation wgd is free and open source software implemented in Python and is available at https://github.com/arzwa/wgd. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
Here we supply additional methodological details with regard to the wgd software package for the analysis of whole genome duplications (WGDs) in genome data. In section 5, we provide a detailed step-by-step protocol to acquire the results shown in Figure 1 of our main paper. In that section, we also show exactly which parameter settings were used to generate the relevant data.

K S distribution construction
We describe the workflow for paralogous families, as the approach for one-vs.-one ortholog K S distributions is essentially the same. For every gene family, a multiple sequence alignment (MSA) is inferred using one of the supported aligners (currently MUSCLE (Edgar 2004) and MAFFT (Katoh et al. 2013) are supported for protein MSAs (which are back-translated), and PRANK (Löytynoja & Goldman 2008) for codon-level MSAs). Subsequently, the K S values for all gene pairs in the family are estimated by maximum-likelihood (ML) using the model of Nielsen & Yang (1998) as implemented in the codeml program from the PAML package (Yang 2007). Codon frequencies are determined using the F3X4 method based on the average nucleotide frequencies at the three codon positions. Codon model 0 is used for pairwise K S , K A and ω estimation, assuming a constant ω across sites and branches. K S estimates are subsequently node-weighted to correct for the redundancy in K S estimates when the family has undergone multiple duplication events. To this end the user can choose to apply average linkage clustering (e.g. Maere et al. 2005) or phylogenetic tree construction (e.g. Vanneste et al. 2015) using FastTree (Price et al. 2010) or PhyML (Guindon et al. 2010). We note that the modular design of the package allows to easily provide support for other aligners and phylogenetic tree inference programs than those currently supported. The full K S analysis workflow can be executed in parallel to allow efficient computation on multi-core systems.

Kernel density estimation & Gaussian mixture modeling
Kernel density estimates (KDEs) are used frequently for the visualization of empirical distributions as density curves, and have been used for K S distributions as well. One problem that is rarely accounted for however when fitting KDEs to K S distributions is the boundary effect at K S = 0 (or any other lower bound when some filtering step is used). When not accounted for the boundary, a KDE will strongly underestimate the density around the boundary, where the size of this region of underestimation is dependent on the bandwidth. This underestimation may lead to spurious peaks in low K S regions. One simple approach to account for the boundary effect is to reflect the data around the boundary and generate a new data set from the combination of the original data and the reflected data. One can then fit a KDE to the resulting data set, which is of course visualized in the original K S range of interest. We note that this approach effectively amounts to the assumption that the derivative of the density curve at the boundary is equal to 0. This was implemented both in wgd kde and wgd viz. Gaussian mixture modeling has been used frequently in the literature to study K S distributions (e.g. Barker et al. 2008, Vanneste et al. 2015, Devos et al. 2006, Li et al. 2018. However there has been a widespread misconception that peaks in the K S originating from WGDs are expected to show a Normal distribution (e.g. Barker et al. 2008, Devos et al. 2006, Li et al. 2018. Simple molecular evolutionary arguments however indicate otherwise. If we consider synonymous substitution as a Poisson process, and synonymous substitution rates for different duplicate pairs sampled from some Gamma distribution, the expected distribution of number of synonymous substitutions will follow a Negative binomial distribution (which is well approximated in the continuous case by the Gamma or log-normal distribution). WGD peaks in the K S distribution will therefore have positive skew, and this effect will be stronger the more recent the WGD. Normal GMMs are not able to account for this effect, and neither can they cope with the background exponential decay from SSDs. We provide tools for fitting mixtures of log-normal components to K S distributions using either an expectation-maximization (EM) algorithm or a variational Bayes (VB) inference algorithm (Blei & Jordan 2006). The latter is of particular interest here, as it allows to mitigate to some extent the common overfitting problems encountered with GMMs by means of regularization. The parameter γ governs the strength of regularization, with a small γ leading to stronger regularization, making less components likely to be active in the mixture. Therefore this strategy allows to some degree to automatically select the number of components, as for strong enough regularization the weights of spurious components will be shrunk towards 0. However in practice, overfitting can still be a problem, and active components should be interpreted with caution (see for example Tiley et al. (2018) for a recent study on mixture modeling for WGD inference). Table 1: Comparison of wgd with some frequently used tools for studying WGDs. We note that of these tools, only wgd is specifically designed for the purpose of providing an integrative tool for WGD analysis, whereas the other tools provide some of the analyses that are often performed when studying WGDs. This list may be inexhaustive but focuses on tools that have been used in recent studies. An 'x' marks a feature as available. Notes: (i) Uses I-AdHoRe 3.0, only for intra-genomic co-linearity analysis in the current version. . We note that evopipes.net (Barker et al. 2010) also provides tools for K S distribution construction and gene family inference, however this web-based platform has been unavailable for some time at the time of writing.

Arabidopsis thaliana example recipe
This is a recipe for performing the analyses to obtain the results presented in Figure 1 of the wgd application note. The analyses are presented for a test data set, but the exact same commands can be used for the full data set to acquire the full results. The results (gene families, K S distributions and anchor pair K S distributions) can also be found in the example directory of the wgd repository. grep ">" cds.ath_filtered.fasta | shuf | head -n 1000 > ids grep -A 1 -f ids cds.ath_filtered.fasta | sed "/--/d" > sample.fasta rm ids Run an all-vs.-all Blastp analysis and cluster using MCL, assuming you're working with sample.fasta (replace with cds.ath_filtered.fasta for a full analysis). We use the default parameters, which are at the moment of writing an e-value cut-off of 10 −10 and an inflation factor of 2.0.
wgd mcl -s sample.fasta --cds --mcl A directory named wgd_blast will appear that will contain some gene families in the file sample.fasta.blast.tsv.mcl. We move the file to the working directory and give it a shorter name for convenience mv wgd_blast/sample.fasta.blast.tsv.mcl ./sample.mcl Now let's compute a K S distribution (use -n to set the number of cores to use, defaults to 4). We again use default methods, being the family-wise mode (as opposed to the pairwise mode, where codeml is run for every pair instead of the whole family), MUSCLE for multiple sequence alignment and FastTree for node-weighting. Mixture models can be fit using the following command: wgd mix --method bgmm wgd_ksd/sample.fasta.ks.tsv which will fit models using the variational Bayes algorithm with up to four components. You can find plots and output data with component-wise probabilities in the directory wgd mix.
The Carica papaya distribution was obtained with identical methods, whereas a one-vs.-one ortholog distribution can be obtained by following the same approach but with the one-vs.-one flag in wgd mcl and wgd ksd. When all K S distributions are computed, wgd viz can be used to interactively visualize these together. To do so, put all distributions in one directory (I assume it is named ks_dir) and run a bokeh server instance: