Genetics and population analysis Inferring the heritability of bacterial traits in the era of machine learning

Quantiﬁcation of heritability is a fundamental desideratum in genetics, which allows an assessment of the contribution of additive genetic variation to the variability of a trait of interest. The traditional computational approaches for assessing the heritability of a trait have been developed in the ﬁeld of quantitative genetics. However, the rise of modern population genomics with large sample sizes has led to the development of several new machine learning-based approaches to inferring heritability. In this article, we systematically summarize recent advances in machine learning which can be used to infer heritability. We focus on an application of these methods to bacterial genomes, where heritability plays a key role in understanding phenotypes such as antibiotic resistance and virulence, which are particularly important due to the rising frequency of antimicrobial resistance. By designing a heritability model incorporating realistic patterns of genome-wide linkage disequilibrium for a frequently recombining bacterial pathogen, we test the performance of a wide spectrum of different inference methods, including also GCTA. In addition to the synthetic data benchmark, we present a comparison of the methods


Introduction
Heritability is a fundamental quantity in genetic applications (Falconer, 1960;Lynch and Walsh, 1998) which specifies the contribution of additive genetic factors to the variation of a phenotype. In the narrow-sense, heritability is defined as the proportion of the variance of a phenotype explained by the additive genetic factors. Heritability can be used to compare the relative importance between genes and environment to the variability of traits, within and across populations. Together with GWAS (genomewide association studies), the primary tool for discovering the genetic basis of a phenotype of interest, heritability has been playing as a more and more critical role in exploring the genetic architecture of complex traits.
Current investigations of heritability in the quantitative genetics literature have focused on using the linear mixed-effect model framework (Bonnet, 2016;Bulik-Sullivan et al., 2015;Golan et al., 2014;Speed et al., 2012Speed et al., , 2017Yang et al., 2010;Zhou, 2017). In this framework, the effect sizes of genetic markers, usually SNPs (single nucleotide polymorphisms), are assumed to be independent and identically distributed random variables, and often the normal Gaussian distribution is used for computational reasons. The genomic restricted maximum likelihood (GREML) and method of moments are the most widely used methods for heritability inference in this model, and some corresponding popular software are GCTA (Yang et al., 2011), LDSC (Bulik-Sullivan et al., 2015) and LDAK (Speed et al., 2012;Speed and Balding, 2019). Although the linear mixedeffect model provides various ways to interpret correlations among covariates and traits, and is computationally tractable, it makes assumptions which do not necessarily accurately reflect the underlying genetics (Gorfine et al., 2017;Holmes et al., 2019;Janson et al., 2017;Lee et al., 2018;Li et al., 2019;Speed et al., 2017;Speed and Balding, 2019). Some comparisons of different methods in this direction for estimating heritability have been recently conducted, for example, in Zhou (2017), Evans et al. (2018), Weissbrod et al. (2018), Gorfine et al. (2017) and Holmes et al. (2019).
The study of heritability estimation in the statistical machine learning community has been started relatively recently and it has not yet received wider attention. Current machine learning approaches often focus on the high-dimensional linear regression model where some sparsity regularizations could be used on the number of the covariates. This is a natural model for GWAS in modeling the whole-genome level contributions of genetic variation (Falconer, 1960;Lynch and Walsh, 1998). The benefit of this model over the classical univariate approach in GWAS has been demonstrated for example in Wu et al. (2009) and Brzyski et al. (2017). Several machine learning methods for making heritability inference have been studied: a method of moments approach is proposed in Dicker (2014); a convex optimization strategy is investigated in Janson et al. (2017) through a singular value decomposition; maximum likelihood estimation is studied in Dicker and Erdogdu (2016); some adaptive procedures have also been theoretically studied in Verzelen and Gassiat (2018); two-step procedures based on high-dimensional regularized regression have been introduced in Gorfine et al. (2017) and Li et al. (2019); and, a strategy for aggregating heritabilities through multiple sample splitting is introduced in Mai et al. (2021). However, up to our knowledge, a systematic numerical comparison of these different methods for estimating heritability has not yet been conducted.
In this study, we provide a systematic summary of recent advances in machine learning methods for estimating heritability. More especially, we review and discuss the six above-mentioned different methods and compare them with GCTA method, a state-of-the-art method in quantitative genetics. The application is carried in a bacterial GWAS context for estimating the heritability of antibiotic resistant phenotypes. While estimating heritability in human GWAS has been studied in numerous works, the topic has not yet been considered widely in bacteria, for the only prominent examples see Lees et al. (2017Lees et al. ( , 2020 and Mallawaarachchi et al. (2022). This is partly because bacterial GWAS poses unique challenges compared to studies with human or eukaryotic DNA in general, originating from highly structured populations and more limited recombination that produce in considerable linkage disequilibrium across whole chromosomes.
Our article is structured as follows. In Section 2, the problem of heritability estimation is introduced and a systematic review of machine learning methods for estimating heritability is given. In Section 3, we briefly introduce the test datasets used in our evaluation. Results and discussion of different methods on test datasets are presented in Section 4.

Heritability inference using machine-learning methods
The following notations are used in the work. The ' q;0 < q < þ1 norm of a vector x 2 R d is defined by jjxjj q ¼ ð P d i¼1 jx i j q Þ 1=q . For a matrix A 2 R nÂm , A iÁ denotes its ith row and A Áj denotes its jth column. For any index set S f1; . . . ; dg, x S denotes the subvector of x containing only the components indexed by S, and A S denotes the submatrix of A forming by columns of A indexed by S.

The problem of heritability estimation
Let y i 2 R be the measured phenotype of subject i such that where X iÁ ¼ ðX i1 ; . . . ; X ip Þ 2 R p is the vector of genotypes of subject i and p is the total number of variants; e 1 ; . . . ; e n 2 R are unobserved independent and identically distributed (iid) errors with . . . ; b p Þ > 2 R p is an unknown p-dimensional parameter. We assume that X iÁ ; i ¼ 1; . . . ; n are iid random vectors and independent of e i with EðX iÁ Þ ¼ 0 and p Â p positive definite covariance matrix covðX iÁ Þ ¼ R.
Under the model (1), for the ith observation it follows that Varðy i Þ ¼ VarðX iÁ bÞ þ r 2 e ¼ b > Rb þ r 2 e : Our main focus is in estimating the narrow-sense heritability for the phenotype y defined as (2) In other words, this quantity computes the proportion of genetic differences present in the population variability of a trait. It can benefit modeling the underlying genetic architecture of a trait, because a heritability close to zero means that environmental factors cause most of the variability of the trait, while a heritability close to 1 indicates that the variability of the trait is nearly solely caused by the differences in genetic factors. It is noted that as E½jjyjj 2 2 =n ¼ VarðyÞ; one can use jjyjj 2 2 =n as an unbiased estimator for the denominator of the heritability. Further, (2) can also be written as And thus, an estimate of the noise-variancer 2 e [see e.g. Reid et al. (2016)] can be used to estimate h 2 rather than directly estimating the genetic variance b > Rb.
Hereafter, we provide details for different methods for making heritability inference. We especially focus on methods that enable confidence intervals (CIs) to be computed.

Direct methods
We first recall some methods that can directly estimate heritability from the data without identifying the genetic basis of the phenotype.

Convex optimization approach
Using a singular value decomposition (s.v.d.) transformation, the work in Janson et al. (2017) proposes a method, called Eigenprism, to estimate the squared signal b > Rb and thus heritability by solving a convex optimization problem. They also prove the asymptotic normality of their estimator.
More specifically, let X ¼ UDV > be a singular value decomposition (s.v.d.) and put z ¼ U > y. Let k i;i¼1:n denote the eigenvalues of XX > =p. The authors of Janson et al. (2017) consider the following convex optimization problem, denoted by P 1 , Let w Ã be the solution to the problem P 1 , then the heritability estimator is given byĥ With P Ã 1 being the minimized objective function value, the ð1 À aÞ-CI is given by ½ĥ 2 Eprism 6z 1Àa=2 ffiffiffiffiffiffiffiffi 2P Ã 1 p ; where z 1Àa=2 is the ð1 À a=2Þ quantile of the standard normal distribution.
The cost of this method stems mainly from the calculation of the s.v.d. of X. This will be expensive and slow for a genotype matrix with large dimensions. Moreover, we note that in practice the optimization in P 1 can fail sometimes and, in addition, this method only works with high-dimensional data where p > n.

Maximum likelihood estimation
Another direct method for estimating heritability is based on using the maximum likelihood method. In the paper (Dicker and Erdogdu, 2016), the authors derive consistency and asymptotic normality of the maximum likelihood estimation (MLE) under additional Gaussian assumptions. More specifically, the maximum likelihood problem is defined as ) and the heritability estimate is given bŷ VarðyÞ : The authors also studied the consistency and asymptotic normality of this MLE estimator. As a result, the ð1 À aÞ-CI is given by ; where z 1Àa=2 is the ð1 À a=2Þ quantile of the standard normal distribution.
This method appears to be quite efficient computationally as it only requires handling a matrix inversion of dimension n Â n.

Moments method
Heritability estimation based on method-of-moment has been proposed and studied in Dicker (2014) and Verzelen and Gassiat (2018). Several estimators have been proposed in these works. However, when the covariance matrix R is non-estimable or expensive to estimate (as often is the case in practice), the reference (Dicker, 2014) proposed an estimator as follows, with S ¼ X > X=n;m 1 ¼ traceðSÞ=p;m 2 ¼ 1 p traceðS 2 Þ À p nm 2 1 ; and put and the heritability estimate is then h 2 Moment ¼s 2 s 2 þr 2 : The asymptotic properties of the moment estimator were proven in Dicker (2014) and some non-asymptotic results were derived in Verzelen and Gassiat (2018). These results allow to obtain the ð1 À aÞ-CI approximately as ½ĥ 2 Moment 6logð1=2Þ ffiffiffi p p =n: It is noted that this method requires to compute a p Â p matrix S which is very costly for data with large dimensions.

Plug-in Lasso type approaches
We now discuss some naive plug-in methods that are based on sparsity penalized regression methods. These methods typically assume that there is a small subset of biomarkers (in the genotype matrix) that will be important to the phenotype and influence its variability.

Scaled Lasso
The paper (Verzelen and Gassiat, 2018) studied the problem of heritability estimation through using a variance estimation, see formula (3), in high-dimensional sparse regression from the scaled Lasso method (Sun and Zhang, 2012). The scaled Lasso (also known as square-root Lasso) is defined as where k > 0 is the tuning parameter. The heritability estimate isĥ 2 Slasso ¼ 1 Àr

SL
VarðyÞ ; and its honest ð1 À aÞ-CI, given in Verzelen and Gassiat (2018), is given by ½ĥ 2 Slasso 6logð1=2Þðk ffiffiffi p p =n þ 1= ffiffiffi n p Þ; where k :¼ jjb SL jj 0 the number of non-zero components inb SL . It is noted that this CI is rather honest in the sense that its width tends to be quite large. A sharp confidence interval for this estimator has not yet been constructed.
We note that the scaled Lasso method shares the same spirit as Lasso which returns a very sparse model. Therefore, heritability estimation for a phenotype with a polygenic basis by this method tends to lead to underestimation.

Elastic Net
From (2), a direct heritability estimate can be obtained by using a Lasso type method. More precisely, let S ¼ fj :b 6 ¼ 0g whereb is an estimate from a Lasso-type method, we can calculate the heritability as in equation (2) VarðyÞ : Here, we focus on using the Elastic Net estimate for the regression parameters:b ¼b Enet . The Elastic Net has been shown to be especially useful when the variables are dependent (Zou and Hastie, 2005) (LD structure), which is particularly relevant in bacterial genome data (Earle et al., 2016). The corresponding estimator is defined aŝ where 'ðÁÞ is the negative log-likelihood for an observation. Elastic Net is tuned by a 2 ½0; 1; that bridges the gap between Lasso (a ¼ 1) and ridge regression (a ¼ 0). The tuning parameter k > 0 controls the overall strength of the penalty and can be chosen by using cross-validation. For example, 10-fold cross-validation is often used in practice. The paper (Lees et al., 2020) showed that Elastic Net is a promising method for bacterial GWAS data where the authors suggest using a small value of a, e.g. a ¼ 0:01. However, we would like to note that the CI for heritability estimated by this plug-in method is not available yet. It is worth noting that a GWAS analysis using highdimensional sparse regression, such as the Elastic net discussed above, would already provide the estimated effect sizes corresponding to the selected covariates. Therefore, estimating the heritability by using these effect sizes would bring insight on understanding both the genetic architecture as well as the genetic contribution to a trait.

Boosting heritability method
We now present some advanced approaches in machine learning for making inference about heritability. The core idea of these types of methods is to combine a covariate selection step with an estimation step, known as the selective inference approach. The selection step aims at either reducing the dimension of the problem or removing irrelevant biomarkers, after which a heritability estimation step is applied. These approaches have been shown not only to improve the computational aspects of the estimation procedure, but to also yield more accurate results.
In this approach, the original data ðy; XÞ is randomly divided into two disjoint datasets ðy ð1Þ ; X ð1Þ Þ and ðy ð2Þ ; X ð2Þ Þ with equal sample sizes.
The HERRA method introduced in Gorfine et al. (2017) is first based on a screening method [e.g. as in Fan and Lv (2008)] to reduce the number of covariates below the sample size. With the remaining covariates, the data are randomly split into two equally sized subsets. Then, a Lasso-type estimator is employed on ðy ð1Þ ; X ð1Þ Þ to select a small number of important variables. After that, the authors use the least squares estimator on ðy ð2Þ ; X ð2Þ Þ with only the selected covariates (from the Lasso-type estimator) to obtain an estimate of the noisevariance. The procedure is repeated where the role of ðy ð1Þ ; X ð1Þ Þ and ðy ð2Þ ; X ð2Þ Þ is switched to obtain another estimate of the noisevariance. Finally, heritability is calculated as in the formula (3) where the noise-variance is the mean of the two estimated noise-variances.
The work in Li et al. (2019) has also proposed a 'two-stage' approach with sample splitting. More particularly, the data is also randomly split into two disjoint datasets with equal sizes. On the first dataset ðy ð1Þ ; X ð1Þ Þ, they use a sparse regularization method based on Elastic net to first reduce the model by selecting the relevant variables. Then, on the second dataset ðy ð2Þ ; X ð2Þ Þ, only the selected variables are used to estimate the heritability through a method of moments approach (Dicker, 2014) or GCTA method.
These post-selection approaches guarantee that sparse regularization and variance estimation are carried out on independent datasets and thus the heritability will not be overestimated. However, both methods in Gorfine et al. (2017) and Li et al. (2019) heavily depend on the way data is split. One can avoid this dependence by performing the sample splitting and inference procedure many times (e.g. 100 times) and aggregating the corresponding results. This is to make sure that the different latent structures possibly residing in the sample are properly taken into account in both the selection and estimation steps. This is the core idea of the 'Boosting heritability' strategy in Mai et al. (2021).
The work in Mai et al. (2021) recently introduces a generic strategy for heritability inference, termed as 'Boosting Heritability', which generalized the ideas from the post-selection approaches by Gorfine et al. (2017) and Li et al. (2019). Boosting heritability, detailed in Algorithm 1, uses in particular a multiple sample splitting strategy which is shown to lead in general to a stable and more reliable heritability estimate.More importantly, this procedure also provides an informative interval of estimated heritabilities that shows the range of the target heritability would belong to. We call this interval a 'reliable' interval.

Test datasets
We investigate performance of the different methods using four public bacterial datasets suitable for GWAS and simulated phenotypes based on the genetic architecture present in the data. We focus on estimating heritability of the antibiotic resistance phenotypes. Table 1 summarizes our test datasets. The heritability of the antibiotic resistance phenotype is expected to be high, meaning that the variability stems primarily from the observed genetic differences among these bacteria.

Streptococcus pneumoniae: MA data
Streptococcus pneumoniae (the pneumococcus) is a common nasopharyngeal commensal that can cause invasive pneumococcal disease. Here, we consider two datasets for this bacterium, abbreviated as the MA and Maela data.
The MA dataset consists of 616 S.pneumoniae genomes from isolates collected from healthy children in an asymptomatic nasopharyngeal colonization survey in Massachusetts between 2001 and 2007. The genomic data and phenotypes are publicly available through the publication (Croucher et al., 2015). After initial data filtering using a minor allele frequency (5%) and removing missing data greater than 10%, we obtain a genotype matrix of 603 samples with 89 703 SNPs (Chewapreecha et al., 2014). We consider resistance to penicillin antibiotics as the phenotype (Croucher et al, 2013). The genome-wide association studies of this phenotype were conducted in Chewapreecha et al. (2014). The results are given in Figure 3. It is noted that the main mechanism of resistance to penicillin can be explained by the causal SNPs in the penicillin binding proteins pbp2x, pbp2b and pbp1a, see Chewapreecha et al. (2014).

Streptococcus pneumoniae: Maela data
The Maela dataset is a large S.pneumoniae dataset which consists of 3069 whole genomes produced from randomly selected isolates from a longitudinal nasopharyngeal colonization study of infants and a subset of their mothers, performed between 2007 and 2010 in a rural refugee camp on the Thailand-Myanmar border (Chewapreecha et al., 2014;Lees et al., 2016). The genomic data and penicillin MICs are publicly available from Chewapreecha et al. (2014). Using a minor allele frequency threshold (5%) and removing missing data greater than 10%, we obtain a genotype matrix with 121 014 SNPs.
We use a continuous phenotype corresponding to the inhibition zone diameters measured in the lab. These inhibition zone diameters are in practice used to define whether a sample is 'Sensitive' or 'Resistant' to an antibiotic, for some antibiotics, an 'Intermediate' designation is also given which we treat as resistant. We consider resistances to three different antibiotics as the phenotypes: tetracycline, penicillin and co-trimoxazole. The results are given in Figure 4. The genetic loci associated with these antibiotic resistances have been examined in genome-wide association studies in Lees et al. (2016). The tetracycline resistance is conferred by the tetM gene and Algorithm 1 Boosting heritability (Mai et al., 2021) Step 0: Using a screening method, such as the sample correlation (Fan and Lv, 2008), to remove 25% least associated covariates.
Repeat B times from step 1 to step 4, Step 1: With the remaining covariates, divide the sample uniformly at random into two equal parts.
Step 2: On the first part of the data, use Elastic net to select the important covariates.
Step 3: On the second subset with only selected covariates from Step 2, estimate the heritability by using (3) where the noise-variance is estimated using the least square method.
Step 4: Repeat Step 2 and Step 3 by changing the role of the first and second subset. Final ! The final heritability estimate is the mean of the estimated heritabilities at each repeat. the co-trimoxazole resistance is conferred by the SNPs in the dyr gene (Maskell et al., 2001).

Escherichia coli: E.coli data
Escherichia coli is a common colonizer of the human gut but is also a leading cause of blood stream infections, in which antibiotic resistance is increasing. The E.coli data from Kallonen et al. (2017) consists of 1509 isolates from a systematic survey of blood stream infections conducted in England between 2001 and 2012 with an alignment of 121 779 SNPs (after initial data filtering with a minor allele frequency threshold 5% and removing missing data greater than 10%).

Simulation settings
As the basis for systematically evaluating the performance of the different methods, we use a subset of the Maela dataset (see Section 3) to create a semi-synthetic dataset that incorporates levels of population structure and LD closely reflecting those present in natural populations (see Fig. 1). This subset corresponds to a genotype matrix of 3051 samples and 5000 SNPs. Using this real genotype matrix, we simulate the responses/phenotypes through the linear model defined in (1). For choosing the causal SNPs (non-zero effect sizes), we follow the penicillin resistance-like setting (Dewé et al., 2019;Lees et al., 2016): to select all SNPs from three genes (pbpX, pbp1A, penA) as causal.
Given the chosen SNPs, regression coefficients b 0 are either drawn from the normal distribution N ð0; 1Þ or Student t 3 distribution. Because the true covariance of the genotype matrix is not given, we need to re-normalize the coefficient b to assure that the true corresponding heritability is approximating our target. Here, h 2 ¼ 0:8 is our target heritability and R is the sample covariance matrix of the genotype matrix and the noise variance is fixed as r 2 e ¼ 1. In simulations, the true covariance matrix of the genotype matrix is unknown, so phenotypes are approximated based on a given heritability using model (1). To establish a benchmark for comparison, the noise variance is set to r 2 e ¼ 1, and an estimator is calculated using formula (3). This estimator, denoted by 'oracle', is based on the true simulated values and cannot be used with real data. The methods are tested using the following settings: 'wholegenes' (whole genome analysis), 'subsample1500' (1500 randomly selected samples), 'subsample500' (500 randomly selected samples), 'causalgenes' (only true causal genes, Eprism does not work in this setting) and 't-effect' (effect sizes simulated from Student t 3 distribution). Additionally, the GCTA (mixed-effect) model is used to simulate phenotypes and is denoted as 'GCTA.model' (oracle does not work in this setting). Results from 50 replications are shown in Figure 2.

Simulation results
As seen in the simulation results presented in Figure 2, the 'oracle' estimator consistently demonstrates the highest level of accuracy and therefore serves as an effective benchmark for comparison. It is also evident that both the Elastic net and Lasso methods tend to underestimate the target heritability. However, the Elastic net method does provide a more reliable lower bound for the underlying heritability. The moment method, on the other hand, is found to be particularly unreliable in this context, failing to produce acceptable results across all settings. The Eprism approach, which is based on  convex optimization, is also observed to be quite unstable in its performance.
On the other hand, both the maximum likelihood estimation (MLE) and Boosting methods consistently provide accurate approximations of the target heritability for bacterial genome data across all the simulation settings that were considered. The GCTA method, however, is observed to consistently underestimate the target heritability, with a high degree of variability. This may be due to the fact that the GCTA method was specifically designed for mixed-effect models with different underlying assumptions about the effect sizes.
It has been observed that using the Elastic net (Enet) method to estimate heritability provides a reliable lower bound for the heritability and serves as a benchmark for comparison. However, studies such as Qian et al. (2020) and Mai et al. (2021) have shown that Enet tends to underestimate the true heritability due to a bias known to affect Lasso-type approaches. This is because coefficients associated with weak effects are shifted toward zero, even though these weak effects may still play a significant role in the overall genetic variation of a trait. On the other hand, the scaled Lasso (SLasso) approach often results in even lower heritability estimates than Enet. This is because it only selects a small number of covariates, resulting in a model that can only explain a limited amount of variation in the phenotype.

Results with real data
The results on real datasets are given in Figures 3, 4 and Tables 2, 3.
Overall, we see that direct approaches like convex optimization (Eprism) and moment method (Moment) are not able to deal with these data and return unstable results quite often. On the other hand, the maximum likelihood method (MLE) yields consistent results in line with the outcome from Elastic Net (Enet) and GCTA methods.
The approach by combining selection and estimation steps together with multiple sample splitting as done in the Boosting heritability (BoostHer) method always returns reliable results. More specifically, its results are always at a higher value than the lower bounds given by the Enet method, and it can still work well when either MLE or GCTA method fails, as seen from Table 2. We also consider the estimation with respect to the causal genes for MA and Maela data. We observe that for MA data the estimation of heritability with only causal genes is slightly improved, see Figure 3. While considering causal genes for Maela data does not gain an improvement, and in the case of Co-trimozaxole, it even lower downs the estimation, see Figure 4. Note that this is desirable, since in practice we may not know all the causal genes, and thus we would like to obtain good results from using all predictors.
Regarding the uncertainty quantification, we can see that the CIs of MLE and the 'reliable' intervals of BoostHer are stable. More particularly, their widths are similar to those from the GCTA method. In contrast, other methods come with wider intervals and thus they can be harder to interpret. As an example, the CIs for penicillin resistance heritability in Maela data (Fig. 4) are as follows: the width for CI of GCTA is 7.91%; of MLE is 5.02%; and of BoostHer is 2.64%; while the width of CI of Eprism is 28.16% and of SLasso is 52.16%. Since heritability is between 0 and 1, the latter two intervals are of limited value for interpretation.

Running time
The indicative running times of all considered methods on four tested datasets are given in Table 4. The codes were executed on a Linux Redhat 64-bit operating system using a CPU with Intel-E7-4850v3 processor and 3TB of RAM, with the splitting step utilizing 10 CPU cores for parallelization. Overall, the maximum likelihood (MLE) method is the fastest method and also returns trustworthy results. The moment method seems to be computationally expensive, while its results are highly unstable and unreliable.

Conclusions
In this study, we have conducted a thorough examination of multiple techniques for estimating heritability in bacteria, focusing on their precision and calibration of uncertainty. We have compared a diverse set of methods, including both traditional and newer techniques. Our findings revealed that, as anticipated, the maximum likelihood method is the fastest of all the methods we evaluated, while the method of moments is generally the slowest. However, it is important to note that none of the methods we tested had running times that would be considered prohibitive for practical use.
In our simulations, the maximum likelihood method demonstrated consistently strong performance. However, when applied to real data cases, its behavior was more mixed. This is likely caused by sensitivity to model assumptions, which tend to lead to a lack of robustness of MLE in general when the data deviate from these assumptions.
Our analysis also revealed that certain methods, such as Eprism, method of moments and SLasso, consistently displayed poor performance and are not recommended for estimating heritability in bacteria. Conversely, the multiple sample splitting technique used in the BoostHer method emerged as the most reliable and accurate approach in our experiments. Overall, our findings provide useful insights for researchers and practitioners looking to determine heritability in bacteria.
It is important to note that our findings have implications for the estimation of heritability in other organisms as well. The patterns of linkage disequilibrium (LD), population structure and existing studies of heritability are all quite different in humans and other organisms compared to bacteria. Therefore, it is crucial to consider the unique characteristics of the organism and the data when choosing a method for heritability inference. Furthermore, it is worth noting that the results of our computational experiments may not generalize to all possible scenarios and further research is needed to fully understand the applicability of these methods in different contexts.
Another investigation of heritability estimation for bacteria recently appeared, where linear mixed models were compared with the Elastic net and LD-score regression (Mallawaarachchi et al., 2022). The study found that linear mixed models showed poor correlation with the ground truth and typically overestimated heritability to a large degree, while Elastic net and LD-score regression methods were found to perform well. This observation is consistent with our findings, where we found that multiple sample splitting as in BoostHer appears overall as the most reliable and accurate approach in our experiments. It is worth noting that this recent study highlights the need for further research in the field of heritability estimation for bacteria, as it is clear that there is a lack of consensus on the best approach to use. The combined results of our study and the aforementioned study call for more research in the field of heritability estimation in bacteria to facilitate future studies of genetic architectures in bacteria.