Hierarchically modelling Kepler dwarfs and subgiants to improve inference of stellar properties with asteroseismology

With recent advances in modelling stars using high-precision asteroseismology, the systematic effects associated with our assumptions of stellar helium abundance ($Y$) and the mixing-length theory parameter ($\alpha_\mathrm{MLT}$) are becoming more important. We apply a new method to improve the inference of stellar parameters for a sample of Kepler dwarfs and subgiants across a narrow mass range ($0.8<M<1.2\,\mathrm{M_\odot}$). In this method, we include a statistical treatment of $Y$ and the $\alpha_\mathrm{MLT}$. We develop a hierarchical Bayesian model to encode information about the distribution of $Y$ and $\alpha_\mathrm{MLT}$ in the population, fitting a linear helium enrichment law including an intrinsic spread around this relation and normal distribution in $\alpha_\mathrm{MLT}$. We test various levels of pooling parameters, with and without solar data as a calibrator. When including the Sun as a star, we find the gradient for the enrichment law, $\Delta Y / \Delta Z = 1.05^{+0.28}_{-0.25}$ and the mean $\alpha_\mathrm{MLT}$ in the population, $\mu_\alpha = 1.90^{+0.10}_{-0.09}$. While accounting for the uncertainty in $Y$ and $\alpha_\mathrm{MLT}$, we are still able to report statistical uncertainties of 2.5 per cent in mass, 1.2 per cent in radius, and 12 per cent in age. Our method can also be applied to larger samples which will lead to improved constraints on both the population level inference and the star-by-star fundamental parameters.


INTRODUCTION
The inference of stellar ages, masses, and radii has improved through the use of asteroseismology in recent years (e.g. see the review by Chaplin & Miglio 2013). Measuring the oscillation modes in stars using photometric time series data, from missions such as CoRoT (Baglin et al. 2006), Kepler (Borucki et al. 2010), and TESS (Ricker et al. 2015) has given us new insights into the structure and evolution of stars (García & Ballot 2019). Recent examples include a deeper understanding of stellar structure (Verma et al. 2017), chronology of a Milky Way merger (Chaplin et al. 2020), and classifying exoplanetary systems (Huber et al. 2019;Tayar et al. 2020). Several studies used grids of stellar models with constraints from asteroseismology to produce catalogues of precise stellar parameters Silva Aguirre et al. 2017). However, with increasing precision on fundamental parameters inferred from stellar models with ★ E-mail: ajl573@student.bham.ac.uk (AJL); g.r.davies@bham.ac.uk (GRD) asteroseismology, extra care should be taken to ensure that we are accounting for uncertainty in our choice of stellar physics.
Typically, stars are modelled on a star-by-star basis with estimates of stellar properties including some assumptions handled by simple scaling relations and solar calibrations. For instance, a helium ( ) to heavy-element ( ) enrichment ratio (Δ /Δ ) and mixing-length theory parameter ( MLT ) are often assumed. However, there has been little effort to account for the population distribution of such quantities. Assuming values for and MLT can result in inaccurate inference and systematics on the inferred stellar parameters (Valle et al. 2015). Independently modelling and MLT can also be computationally demanding and requires high-precision observations in order to return meaningful stellar properties.
In this work, we apply the method of Davies et al. (in prep.) to determine stellar properties for a sample of Kepler dwarfs and subgiants using a hierarchical Bayesian model (HBM). With an HBM, we introduce population-level distributions for and MLT to encode prior information throughout the sample. We will show that when an HBM is used, we can increase the precision of inferred masses, radii, and ages despite increasing the number of free parameters.
The use of HBMs has been demonstrated in other areas of astrophysics to reduce individual parameter uncertainties by encoding prior information about the distribution of the parameter in a population. For example, HBMs have been used with data from Gaia, improving distance measures Anderson et al. 2018) and calibrating the red clump as a standard candle (Hawkins et al. 2017;Chan & Bovy 2020) using asteroseismology (Hall et al. 2019). In other instances, HBMs have been used to infer binary-star and exoplanet eccentricities (Hogg et al. 2010), obliquity of transit systems (Morton & Winn 2014), stellar inclination with asteroseismology (Campante et al. 2016;Kuszlewicz et al. 2019), and the age-metallicity relation of the solar neighbourhood (Feuillet et al. 2016).
To describe the distribution of in this work, we assume a linear helium enrichment law characterised by freely varied population parameters: the gradient given by Δ /Δ , the primordial helium abundance at = 0 ( ) and an intrinsic spread in helium ( ). There have been many studies to determine a linear enrichment law, from modelling eclipsing binaries (Ribas et al. 2000) to spectroscopy of galactic H regions (Balser 2006). Values of Δ /Δ have also been determined for samples of main sequence (MS) stars (Casagrande et al. 2007), open clusters (Brogaard et al. 2012), and more recently with asteroseismology using individual oscillation frequencies (Silva ) and the glitch due to the second helium ionisation zone (Verma et al. 2019). In these studies the value of the enrichment ratio was typically inferred to be 1 Δ /Δ 3. However, helium enrichment is unlikely to be exactly linear with metallicity and may depend on other chemical abundances (West & Heger 2013) or the location of the star in the Milky Way (Frebel 2010). Therefore, we account for some deviation from the linear law by introducing . Moreover, our method may be adapted to different helium enrichment priors in future work.
The widely used mixing-length theory (MLT) of convection, parametrised by MLT , has been tested throughout the Hertzsprung-Russell (HR) diagram with 3D hydrodynamical simulations (Trampedach et al. 2014;Magic et al. 2015) and asteroseismology (Tayar et al. 2017;Viani et al. 2018;Li et al. 2018) with values in the range 0.8 MLT / MLT, 1.2 where MLT, is the value calibrated to the Sun. However, in many grids of stellar models, a constant value calibrated to reproduce the Sun is assumed. This can lead to systematic uncertainties in stellar ages due to the effects of variable mixing depending on the mixing length. The MLT approximates convective mixing and thus we expect the value of MLT to vary from star-to-star due to various effects, from changes in chemical composition to other sources of mixing described poorly by MLT. The investigation of more complex MLT distributions is beyond the scope of this work. Instead, we experiment with two prior assumptions for MLT in the population. The first assumes MLT is normally distributed in our sample with a spread ( ) to account for the aforementioned variation in MLT . The second assumes MLT is constant throughout the sample.
Our HBM requires a way to map from the stellar initial (or bulk) properties to predict observables. We can achieve this with a large grid of stellar evolutionary models. However, a discrete grid can produce inaccurate posterior distributions, limited to the grid resolution. Increasing the resolution is computationally demanding, especially when scaling to higher input dimensions. One solution is to interpolate the stellar models, as is common in the isochrone-fitting method (see e.g. Berger et al. 2020). However, interpolation can become computationally expensive at high input dimensions and grid size, and evaluating the likelihood using modern Bayesian sampling techniques is slow. Therefore, we use machine learning to map stellar model inputs to observables to provide a fast way to sample the HBM. In this work, we train an artificial neural network (ANN) on a large grid of stellar models. There have been similar applications of ANNs in asteroseismology (Verma et al. 2016;Bellinger et al. 2016;Hon et al. 2017Hon et al. , 2018Hendriks & Aerts 2019) but not yet in the context of an HBM. Using the machine learning speed-up, we demonstrate a scalable method for obtaining fundamental stellar parameters.
In Section 2, we describe the data for the sample of 81 Kepler dwarfs and subgiants studied in this work. We then present the methods in Section 3 for which we produced a large grid of stellar evolutionary models to use as training data for an ANN. We use the ANN as an emulator in a series of statistical models described in Section 3.2 to model the sample and present our results in Section 4. Finally, in Section 5 we discuss the results from each model and compare them to results for the sample of stars in the literature.

DATA
For this study, we selected the sample of 415 stars from the first APOKASC catalogue of dwarfs and subgiants (Serenelli et al. 2017, hereafter S17). This sample provides an extensive set of dwarfs and subgiant stars with asteroseismic detections observed by the Kepler mission. S17 used grid-based modelling to determine the ages ( ), masses ( ), radii ( ), and surface gravity (log ) of stars in the sample, using global asteroseismic parameters, effective temperature ( eff ), and metallicity ([M/H]) as inputs.
Using five independent pipelines, S17 determined values for global asteroseismic parameters -the large frequency separation Δ and the frequency at maximum power, max with median uncertainties of 1.7 per cent and 4 per cent respectively. We chose to adopt the Δ determined in their work as inputs for our method. They also used [M/H] published in Data Release 13 (DR13; Albareti et al. 2017) of the APOGEE stellar abundances pipeline (ASPCAP; García Pérez et al. 2016) with uncertainties of 0.1 dex. For their preferred set of results, they adopted eff from the Sloan Digital Sky Survey (SDSS) griz-band photometry (Pinsonneault et al. 2012) with a median uncertainty of 70 K.
We removed more evolved stars from the APOKASC sample by cutting those with log < 3.8 dex. We then kept stars within 1-of −0.5 < [M/H] < 0.5 dex to remove metal-poor and -rich stars. Main sequence stars with 1.2 M are understood to have a convective, hydrogen-burning core, with some dependence on the chemical composition and choice of stellar physics (Appourchaux et al. 2015). Stellar models with a convective core require the treatment of extra stellar physics such as overshooting, which is beyond the scope of this work. Therefore, we keep only stars with masses within 1-of 0.8 to 1.2 M from the preferred set of results of S17.
Following cuts to the sample, we adopted updated ASPCAP spectroscopic metallicities, [M/H], from Data Release 14 (DR14; Blanton et al. 2017) which had a median uncertainty of 0.07 dex. We also chose to adopt eff from the same catalogue to be internally consistent. We note that our chosen effective temperature scale is offset from the photometric temperature scale of S17 by approximately −170 K, with a dispersion of ∼ 120 K corresponding to typical uncertainties on the individual temperatures. The median uncertainty in our adopted ASPCAP eff was 125 K.
To provide a means of calculating luminosities, we used Gaia Data Release 2 (DR2) parallaxes (Gaia Collaboration et al. 2016Collaboration et al. , 2018. We cross-matched the remaining sample with the DR2 catalogue, taking the nearest neighbours within a 4 arcsec radius. Although DR2 parallaxes have improved upon the DR1 values at the time of S17, there was still evidence for a zero-point offset (Lindegren et al. 2018). We adopted a global offset of 0.05 mas, in the sense that DR2 parallaxes were underestimated, representative of values obtained in the literature using Kepler field stars (see e.g. Zinn et al. 2019a;Hall et al. 2019) and through other methods (see e.g. Riess et al. 2018;Chan & Bovy 2020). We then cross-matched our sample with the Two-Micron All Sky Survey (2MASS; Skrutskie et al. 2006) to obtain K S -band (2.16 µm) photometry.
We determined luminosities, for the sample using the direct method of (Huber et al. 2017;Berger et al. 2020) with K S -band photometry, Gaia DR2 parallaxes, ASPCAP [M/H] and eff , and asteroseismic log as inputs. This involved computing absolute K S -band magnitudes using the Gaia DR2 parallaxes and extinctions determined by the 3D galactic reddening maps of Green et al. (2018). We determined absolute bolometric magnitudes by interpolating the MIST bolometric correction tables as a function of eff , log and [M/H] (Dotter 2016;Choi et al. 2016). We adopted the uncertainty of 0.02 mag assumed by for both the extinctions and bolometric corrections, corresponding to typical systematic errors in extinction maps and bolometric fluxes (e.g. Zinn et al. 2019b;Tayar et al. 2020). We obtained luminosities for the sample with a median uncertainty of 3.4 per cent.
The final sample comprised 81 stars for which we had complete data for eff , [M/H], Δ , and to use as inputs for our stellar modelling method -see Table 1. In Fig. 1, we show the HR diagram for the sample plot in context with a series of stellar evolutionary tracks at solar metallicity.

METHODS
Firstly, we used a stellar evolutionary code to compute a grid of models to predict observable quantities (see Section 3.1). Subsequently, we trained an ANN on the grid of stellar models to map input parameters to output observables (see Appendix A for further details). We then constructed three Bayesian models in Section 3.2 which each sampled the trained ANN to estimate stellar fundamental parame-ters as described in Section 3.4. Evaluation of the ANN gradient is required during training. Consequently, estimating the gradient of the model likelihood is fast and simple when the observables are generated by an ANN. Hence, we open up the possibility of using a Hamiltonian Monte Carlo (HMC) algorithm -for example, using the No-U-Turn Sampler (NUTS; Hoffman & Gelman 2014) -which requires the gradient to sample the model posterior. Once we had tested the accuracy of the model on a sample of synthetic stars, we evaluated each model on the subset of the APOKASC catalogue selected in Section 2.

Grid of stellar models
We built a stellar model grid to use in training the ANN. The grid includes four independent model inputs: stellar mass ( ), initial helium fraction ( init ), initial metallicity ([M/H] init ), and the mixing-length parameter ( MLT ). Ranges and grid steps of the four model inputs are summarised in Table 2. We increased the resolution at higher [M/H] to give a more consistent resolution in init which we used as an ANN input in Appendix A. We computed each stellar evolutionary track from the Hayashi line to the base of red-giant branch defined here by log = 3.6 dex. We also computed an additional set of evolutionary tracks, with input values chosen randomly within the range of the regular grid, to use when testing the accuracy of the ANN.

Stellar models and input physics
We used Modules for Experiments in Stellar Astrophysics (MESA, version 12115) to establish a grid of stellar models. MESA is an open-source stellar evolution package which is undergoing active development. Descriptions of input physics and numerical methods can be found in Paxton et al. (2011Paxton et al. ( , 2013Paxton et al. ( , 2015Paxton et al. ( , 2018Paxton et al. ( , 2019. We adopted the solar chemical mixture, ( / ) = 0.0181, provided by Asplund et al. (2009). The initial chemical composition was calculated by: We used the MESA − tables based on the 2005 update of OPAL EOS tables (Rogers & Nayfonov 2002) and OPAL opacity supplemented by low-temperature opacity (Ferguson et al. 2005). The grey Eddington − relation is used to determine boundary conditions for modelling the atmosphere. The MLT of convection was implemented, where MLT = ℓ MLT / is the mixing-length parameter. We also applied the MESA predictive mixing scheme (Paxton et al. 2018(Paxton et al. , 2019 in the model computation. Atomic diffusion of helium and heavy elements was also taken into account. MESA calculates particle diffusion and gravitational settling by solving Burger's equations using the method and diffusion coefficients of Thoul et al. (1994). We considered eight elements ( 1 H, 3 He, 4 He, 12 C, 14 N, 16 O, 20 Ne, and 24 Mg) for diffusion calculations, and had the charge calculated by the MESA ionization module, which estimates the typical ionic charge as a function of , , and free electrons per nucleon from Paquette et al. (1986).

Oscillation models and asteroseismic quantities
Theoretical stellar oscillations were calculated with the GYRE code (version 5.1), which was developed by Townsend & Teitler (2013). We computed radial modes (for ℓ = 0) for 42 radial orders by solving the adiabatic stellar pulsation equations with the structural models generated by MESA. We determined the asteroseismic large separation (Δ ) for each model with theoretical radial modes to avoid the systematic offset of the scaling relation. We derived Δ with the approach given by White et al. (2011), which is a weighted least-squares fit to the radial frequencies as a function of . We chose to ignore the well known, yet poorly characterised impact of modelled oscillation mode inaccuracies in the near-surface region of the star (Kjeldsen et al. 2008;Ball & Gizon 2014;Sonoi et al. 2015). This typically presents only a small effect compared to observational uncertainties when considering the average large frequency spacing, Δ . Additionally, there may be further inaccuracies in the modelled Δ because of variations in p mode frequencies with stellar activity (Chaplin et al. 2007;García et al. 2010;Kiefer et al. 2017). Therefore, a thorough treatment of systematic uncertainties in Δ is instead left to future work (Carboneau et al. in preparation).

Statistical models
We devised three Bayesian models, each with varying levels of parameter sharing (pooling) between stars in the population. Initially, we tested the models and demonstrated reduction of statistical uncertainties in the stellar fundamental parameters by analysing a random sample of 100 synthetic stars modelled using MESA. Then, we applied the models to the sample of stars in Table 1 (with and without solar data for two of the models) and compared the results with that of S17.
Our first model was equivalent to modelling each star individually and featured no pooling; henceforth, we refer to it as the no-pooled (NP) model (see Section 3.2.1). We then derived two hierarchical Bayesian models (HBMs) which use population-level parameters to describe their distribution in the sample. Both of these models partially-pooled helium using a linear enrichment law. We drew the initial helium fraction for each star from a normal distribution with a mean described by the enrichment law and standard deviation representing its spread. Similarly, we partially-pooled the MLT parameter, MLT in one model, whereas we maximally-pooled MLT in the other, such that it assumes the same value for the entire sample. Hence, we refer to the former as the partial-pooled (PP) model and the latter as the max-pooled (MP) model, described in Sections 3.2.2 and 3.2.3 respectively.

No-pooled model
Firstly, we constructed a model comprising independent parameters corresponding to the ANN inputs = { evol, , , MLT, , , } for a given star, . The parameter evol, describes the evolutionary stage of the -th star (see Equation A4 in the Appendix). Using Bayes' theorem, the posterior probability density function (PDF) of the model parameters given a set of observed data, is, where ( ) is the prior PDF of the model parameters and ( | ) is the likelihood of observing the data given the model. We chose weakly-informative, bounded priors for the independent parameters, restricting them to their respective ranges in the ANN training data. Although the neural network is able to make predictions outside the training data range, these have not been tested and may be unreliable. Therefore, we used a beta distribution with = = 1.2 as the prior PDF on the independent parameters, transformed such that the probability is null outside the chosen range, where the beta distribution is defined as, and, is the transformed parameter where ,min and ,max are the upper and lower bounds for each parameter. The beta distribution was preferred over a bounded uniform distribution because our sampler evaluates the gradient of the posterior and is thus sensitive to discontinuities (see e.g. Fig. B1 in the Appendix).
Using notation which represents a given random variable ∼ as equivalent to being drawn from a probability distribution ( ) ∝ ( ) where ( ) is a non-normalised probability density function, we may write the priors for as, where each beta distribution is scaled to cover the boundaries of the grid of stellar models computed in Section 3.1. We made predictions for each star using the trained ANN, using the Stefan-Boltzmann law. Out of the model parameters, those which may be observed are denoted by = ( ). Therefore, we write the likelihood that we observe any with known uncertainty, given our model as, where obs is the number of observed variables. We chose to use observed eff , , Δ , and [M/H] collated for our sample as described in Section 2. It follows that the posterior PDF for a population of stars stars for the NP model is, where is the matrix of model parameters and is the matrix of observables. A probabilistic graphical model (PGM) of the NP model can be seen inside the grey box of Fig. 2, without the arrow connecting init to init . We ignore the nodes outside the box because these correspond the the PP model described next.

Partial-pooled model
Sharing, or pooling parameters between stars in a population can improve the uncertainties on stellar fundamentals by encoding our prior knowledge of their distribution in a population. We constructed a hierarchical model, which builds upon the NP model by introducing population-level hyperparameters. Specifically, we chose to describe initial helium and MLT by partially-pooling them.
We constructed the PP model such that each of the initial helium, init and MLT parameter, MLT are drawn from a common distribution characterised by the set of hyperparameters, . Thus, Bayes' theorem becomes, is the same as in the NP model, i.e. each object-level parameter, = { , } stars =1 , and = {Δ /Δ , , , , }. The hyperparameters for init comprise the helium enrichment ratio (Δ /Δ ), primordial helium abundance fraction ( ), and the spread in helium ( ). The remaining hyperparameters for MLT comprise the mean ( ) and spread, ( ).
We assumed the initial helium and the mixing-length parameter are each drawn from a normal distribution characterised by a population mean and standard deviation. The probability of init and MLT given is, where and is the mean initial helium fraction as described by the linear helium enrichment law, Therefore, we may write the prior PDF of initial helium given its population-level hyperparameters as, Similarly, for the second component of Equation 9, we chose to partially-pool MLT . We assume that convection in stars of a similar mass, evolutionary stage and area of the HR diagram may be approximated using a similar value of MLT , but the accuracy of the MLT may vary from star-to-star. Given the small range of our sample, any such variation will be absorbed by the spread parameter, . Therefore, we decided to describe the prior on MLT as, We gave all of the hyperparameters weakly informative priors, with the exception of for which we adopt a recent measurement of the primordial helium abundance from big bang nucleosynthesis (BBN) as the mean (Pitrou et al. 2018), with a standard deviation representative of the range of values in the literature (Aver et al. 2015;Peimbert et al. 2016;Cooke & Fumagalli 2018). Hence, we assumed priors on the hyperparameters as follows, We produced a PGM for the model, depicted in Fig. 2. The hyperparameters are shown outside of the grey box containing the individual stellar parameters to represent the hierarchical aspect of the model.

Max-pooled model
We built another hierarchical model similar to the PP model except that MLT is max-pooled (MP). In this model, we assumed that MLT must be the same value for every star in the sample, but still allowed it to freely vary on a population-level. Thus the hyperparameters are now, = {Δ /Δ , , , MLT }. The posterior distribution of the model takes the same form as in Equation 8 except that the MLT parameter for the -th star is, where, chosen such that MLT is confined to the boundaries of the grid of stellar models (1.5 < MLT < 2.5).

The Sun as a star
Pooling parameters in an HBM allows us to use the Sun as a calibrator in a unique way. Rather than calibrating our model physics to the Sun and then assuming the calibrated parameters across our sample, we can include the Sun as a part of the same population as our sample of stars. If we assume init and MLT for the Sun are a part of the same prior distribution as for the rest of the sample, then we can simply add solar observables to our model inputs.
For both the PP and MP models, we iterated with and without data for the Sun included in the population, referred to as PPS and MPS respectively. We adopted the solar data in Table 3 with uncertainties conservatively limited to the accuracy of the ANN for , and  Table 3. Solar input data. The references correspond to the central values and the uncertainties are chosen to either be representative of the ANN accuracy or the spread of values in the literature (see text for details).

Input
Value Reference 0.00 ± 0.01 Asplund et al. (2009) representative of variation in the literature for eff . We also adopted Δ = 135.1 ± 0.2 µHz with a central value from Huber et al. (2011) and a standard deviation representative of variations in measurements of the solar Δ (Broomhall et al. 2011).

Sampling
We obtained results for each of the models described above by sampling their posterior distributions using a Markov chain Monte Carlo (MCMC) algorithm. In particular, we used the NUTS algorithm implemented in T F P (TFP; Abadi et al. 2016;Dillon et al. 2017) 1 . For each model, we produced 20 000 samples 1 We interacted with TFP using the now deprecated P MC4 package, developed as a successor to P MC3 (Salvatier et al. 2016). split across 10 MCMC chains and computed summary statistics for the marginalised posteriors of each parameter in the model. We removed stars with problems during tuning using the Gelman-Rubin diagnostic (ˆ; Gelman & Rubin 1992). We used results from each model onceˆ< 1.04 for all parameters, indicating good model convergence.
Initially, we created a random synthetic population of stars using MESA to test the ability of the method to recover stellar properties according to our choice of model physics and population priors. We tested the NP, PP, and MP models. Since our sample was fictitious, it would not have been appropriate to include real solar data. We summarise the results for the synthetic stellar parameters and hyperparameters in Appendix C. We found that the models were able to recover the true synthetic properties accurately, with increased precision when pooling parameters.
Once we had tested the method on synthetic stars, we obtained results for the sample of 81 dwarfs and subgiants described in Section 2. Here, we included the PPS and MPS to test the effects of adding the Sun as a star in our population. For the purpose of comparison, we fit the hyperparameters of the PP model (Δ /Δ , , , , ) to the results from the NP model.
Since our initial sample was chosen based on masses from S17, we expected some stars to lie outside (or near the boundary) of the observational parameter space provided by our grid of stellar models. We used an initial run of the NP model to catch and remove these stars. During the initial run, we dropped 16 of the 81 stars from the sample. Of the removed stars, we found the posteriors in for 6 skewed towards the prior upper mass limit of 1.2 M . The remaining 10 removed stars suffered poor convergence during sampling (ˆ>> 1.04) which could be because of poor step-size tuning and sampling at the prior boundary.
Out of the remaining 65 stars with results from the NP model, 2 stars were dropped from the PP model. A consequence of partially pooling parameters is that a population spread, allows for individual parameters to vary outside of the prior range given to the population mean, . In this case, individual stellar MLT was allowed to vary outside the range for which the ANN was valid if was large. The 2 removed stars happened to have high likelihoods outside of the valid MLT range. We found that removing the same 2 stars from the other models made negligible difference to the results, so we leave a solution to this problem to future work. Naturally, we did not see the same issue in the MP model, so we proceeded with modelling the same 65 stars as with the NP model.

RESULTS
In this section, we present the results for each of the NP, PP, and MP models with the sample of 81 APOKASC dwarfs and subgiants as inputs. We also present the results for the PPS and MPS models which include the Sun as a star in the population. Firstly, we show the reduction in age, mass and radius uncertainty with the addition of pooling in Section 4.1. We then show the results for model hyperparameters in Section 4.2 where we infer the initial helium abundance and mixing-length parameter distribution in the sample.

Stellar parameter results
In Table 4, we present results for the 65 APOKASC stars from the NP model. Running the NP model with synthetic stars resulted in unreliable uncertainties (see Appendix C). This was because the boundary of the priors in init and MLT truncated the posterior distribution leading to underestimated uncertainties and skewing their posterior means towards the centre of their priors. Therefore, we present the NP results only for comparison purposes, but we exclude them from further discussion. In Tables 5 and 6 we present the results for the 63 stars from the PP and PPS model respectively. In Tables 7 and 8 we also present results for the 65 stars from the MP and MPS models respectively. We note that for the MP models, there is no column for MLT because this parameter is the same across the population and hence is given in Section 4.2.

Population parameter results
We obtained values for the hyperparameters for each of the models and present them in Table 9 along with their upper and lower 68 per cent credible regions. We omit the results for because its posterior is the same as the prior, = 0.247 ± 0.001 for all the models. We fit the same hyperparameters from the PP model to the NP model results for init , init , and MLT for the purpose of comparison. However, the NP model results suffer from boundary effects which makes the resulting fit unreliable, pushing the population mean to the centre of the priors and underestimating the uncertainties. We leave the NP results here only for completeness. Fig. 3 shows the joint and marginal distributions (corner plot) output by the PP and PPS model. We see an anti-correlation between Δ /Δ and , expected due to the degeneracy between the two parameters in the stellar evolutionary models. In Fig. 4, we also show the corner plot for the MP and MPS model output. Similarly, we see an anti-correlation between Δ /Δ and MLT .
We present the helium enrichment relation resulting from the PPS model in Fig. 5. In this figure, we plot the individual results for init and init for each of the stars from the NP and PPS models. This is an example of shrinkage in the HBM; the estimates for individual stellar parameters move towards the mean of the population when they are pooled.

DISCUSSION
So far, we have shown that we can add parameters to stellar models without sacrificing statistical uncertainties through the application of an HBM. We freed the init and MLT using pooling to encode our prior knowledge of their distribution in the population. We also tested the impact of including the Sun as a star in our population. We first discuss the impact of pooling and our choice of population priors for init and MLT in Section 5.1 and 5.2. To assess the accuracy of our model with respect to the literature, we compare our results to those of S17 in Section 5.3. We found good agreement between this work and their results, despite some differences in observables and stellar model physics which we discuss further. Then, we discuss sources of systematic uncertainties in Section 5.4. Although we have accounted for uncertainties in 5.1 and 5.2 in our model, there are still differences between stellar modelling codes and other model physics which should be considered. Finally, in Section 5.5, we highlight a possible outlier in our dataset.

Helium enrichment
We found the value for the helium enrichment ratio, Δ /Δ to be the same in both the PP and MP models, Δ /Δ = 1.6 +0.5 −0.4 . This is consistent with values of ∼ 1.4 in the literature (when heavy element diffusion is included) albeit obtained through different methods: e.g. measuring the metallicity and helium abundance of an open cluster and comparing with the primordial helium abundance (Brogaard et al. 2012), and fitting to helium abundances determined for Kepler field stars using asteroseismology (Verma et al. 2019).
When we added the Sun to the pooled models, PPS and MPS, obtained Δ /Δ up to 2-lower than the models without the Sun. In both models, the resulting Δ /Δ of approximately 0.8 to 1.0 was consistent with the initial helium fraction expected from solar models with our choice of Asplund et al. (2009) abundances (Serenelli & Basu 2010). However, such solar models have been shown to not recover helioseismic measurements of helium in the Sun (Basu & Antia 2004;Serenelli et al. 2009;Villante et al. 2014). Solar models with the older Grevesse & Sauval (1998) abundances typically yield higher helium fractions more in-line with helioseismology. The Δ /Δ from the PP and MP models are higher than those including the Sun. We could extend our model to include asteroseismic indicators of helium to improve the uncertainties on and test whether this difference becomes more significant.
Our models assumed a prior of = 0.247 ± 0.001 for the primordial helium fraction which dominated its posterior. This was a sensible assumption to make when using a linear enrichment law, because measurements of the primordial helium correspond to the abundance at the epoch of BBN according to current cosmological theory (Cyburt et al. 2016). However, if we used a less informative prior for we might yield more uncertain results, or even a different value for . In previous work fitting a linear enrichment law, some results for suggested a value below the BBN value (Casagrande et al. 2007;Silva Aguirre et al. 2017). It is more probable that the assumption of a linear enrichment law is inaccurate than a sample of stars could contradict independent from cosmology. We justify our prior on as in-line with the assumption of a linear enrichment law, but highlight the need to investigate other ways of describing helium in a population of stars.

Mixing-length theory
To a greater degree than chemical composition, the best-fitting MLT depends on the choice of model physics and stellar modelling code. The MLT is an approximation of convection which is often calibrated to the Sun and then assumed for all stars in a model. However, studies of 3D hydrodynamical simulations suggest that the degree to which MLT approximates convection varies across the HR diagram (Magic et al. 2015) and this is confirmed when modelling stars with asteroseismology (Tayar et al. 2017).
The PP model (without the Sun) favoured a mean mixing-length parameter of 1.7. Whereas, the PPS model yielded a higher value of 1.9 by ∼ 2-. We found this was attributed to the addition of the Sun. The individual solar results for the PPS model yielded a value of MLT = 2.12 ± 0.03 which was considerably higher than the MLT obtained for the other stars in the sample (see Appendix D). This result agrees with the 3D simulations (e.g. Trampedach et al. 2014), which predict lower MLT for stars with lower eff and higher log . However, the solar value also exceeds the reference solar calibrated values of ≈ 1.92 for the same stellar evolution code (Paxton et al. 2011). This is caused by the differences in adopted solar mixture, atmospheric boundary conditions and the treatment of convective mixing between this work and typical reference values.
Despite the difference in , the resulting spread in mixing-length for the PPS model ≈ 0.13 was double that of the PP model to cope with the high solar value. This implies that a large population spread in MLT could explain the difference we see. In other words, if we assume that the best-fitting MLT is normally distributed in our   Viani et al. (2018) for stellar models including diffusion, predict MLT in the range 1.5 to 2.3 across our sample. This dispersion would be more compatible with the larger spread obtained by our PPS model. However, in future work we should further investigate how MLT varies with stellar parameters, as our assumption of a normal distribution may not be accurate.
We found a greater difference in MLT between the models with and without the Sun when we max-pooled MLT . The MP models yielded a global MLT in-line with from the PP model. However, when we added the Sun, the model yielded MLT ≈ 2.1 which is in common with the solar results (see Appendix D). This had a similar affect as assuming a solar calibrated value, because the model favoured fitting to data with the best observational precision. The change in MLT between the MP and MPS models resulted in a mean difference of ∼ 20 per cent between the individual stellar ages. This is an example of how adopting a solar calibrated value can bias stellar ages. We argue that carefully including the Sun as a part of the population with an intrinsic spread is a better way to calibrate the stellar models than assuming as solar MLT across the sample.
In all observables except for , the Sun is near the centre of our distribution of stars. However, we found no relationship between and MLT in both our NP and PP models. A possible explanation for the difference in MLT with and without the Sun could be some systematic offset in our observational data for the sample. Here, we point to our choice of spectroscopic eff which typically underestimates eff compared to photometric scales, as noted in S17. We ran a solar model with an additional parameter, Δ eff = eff,obs − eff which represents a bias in the observed effective temperature. The estimated covariance between Δ eff and MLT was 0.452 K 2 (with a correlation of 0.517). Therefore, underestimating eff by about 100 K could underestimate MLT by about 0.1. If we extend this result to the other stars in the sample, the lower MLT obtained without including the Sun as a star could be caused by underestimating eff relative to the Sun. Alternatively, the MLT of the Sun could have been higher than the rest of the sample to compensate for neglecting additional sources of mixing required to reproduce the higher precision solar observables. A deeper quantification of systematic uncertainties is left to future work.

Comparison with APOKASC results
Before we compare our results to S17, we should highlight some key differences between our data and methodology. The results from S17 were determined using a grid-based-modelling technique, which estimates the likelihood across a dense grid of stellar models. They used results from several pipelines to estimate the systematic uncertainties. For the central values of their results, they used the Bayesian stellar algorithm (BASTA; Silva Aguirre et al. 2015) using a grid computed with GARSTEC (Weiss & Schlattl 2008). Their choice of stellar physics was similar to this work, except for two major differences.
Firstly, the results of S17 were determined using stellar models calculated without heavy-element diffusion. The inclusion of diffusion when modelling the Sun has been commonplace over the last few decades, with good agreement between models and helioseismic observations (Christensen-Dalsgaard et al. 1993;Bahcall et al. 1995). More recent work explored the diffusion in cluster stars (Korn et al. 2007;Önehag et al. 2014) and another demonstrated the impact   Grevesse & Sauval (1998) mixtures adopted by S17. The former leads to a solar heavy-element to hydrogen ratio of ( / ) = 0.0181, and the latter, ( / ) = 0.0230. Typically, Grevesse & Sauval (1998) abundances are favoured in asteroseismic modelling because they are better able to reproduce measurements of helium in the Sun from helioseismology (Serenelli et al. 2009). An effect of using the Asplund et al. (2009) abundances, is that it favours lower init for a given [M/H] surf . As a result, models using Grevesse & Sauval (1998) abundances on average underestimate radii and mass compared to those without by about 1 and 0.5 per cent respectively (Nsamba et al. 2018).
Although updated, much of our observable data is comparable to that of S17, with the exception of eff . The preferred results from S17 were determined using a photometric eff scale which we found to be on average ∼ 170 K greater than our spectroscopic scale from DR14. In S17, they saw a similar offset between the DR13 eff available at the time. They found a median difference in mass, radius, and age of approximately −6, −2, and +35 per cent respectively with results from the photometric eff scale subtracted from the spectroscopic scale.
In Fig. 6 we compare our statistical uncertainties for , , and with those for the equivalent stars from S17. We found that the NP model yielded comparable uncertainties to S17 but note that these are likely underestimated due the influence of the prior boundaries for init and MLT . We expected larger uncertainties because we included additional free parameters ( init and MLT ) over the work of S17. However, when we treat these parameters hierarchically, we saw a reduction in uncertainties from all of the pooled models. This is because our prior assumptions about the population allows for the sharing of information between the stars. This uncertainty reduction scales with the number of stars in our sample, demonstrated by the results for the synthetic stars in Fig. C1. Thus, hierarchically modelling our population resulted in improved statistical uncertainties in stellar fundamental parameters.
In the following subsection, we compare the results between our PPS model with that of S17 for mass, radius, and age with reference to Fig. 7. We preferred the PPS model for comparison because it utilised the high-precision data available for the Sun as a star to help calibrate the sample, while partially pooling both init and MLT to allow for small variations within the population.

Mass, radius, and age
In the left panel of Fig. 7, we compare the masses obtained by the PPS model with S17 and found a dispersion of around 2 per cent. Our masses were on average 1 per cent above the results from S17. Although we might expect the lower eff scale in this work to underestimate the mass, we attribute this overall effect to our choice of stellar model physics. As previously discussed, the use of Asplund et al. (2009) solar abundances and heavy-element diffusion has the cumulative effect of overestimating stellar masses compared to the physics adopted by S17. We also found that the results from all the pooled models returned similar masses, with or without the Sun.
In the central panel of Fig. 7, we show that our radii are similar to S17 with a spread of 1 per cent. We also found radii on average 1 per cent greater than the APOKASC results. Similarly to with mass, this contradicts what would be expected from a lower eff scale and could also be explained by model physics. Our radii also varied little between models with and without the Sun.
Our ages were also consistent with those from S17. The right-most  panel of Fig. 7 shows the spread in the relative age differences to be about 18 per cent, slightly underestimated by 4 per cent. We would expect the lower eff scale to overestimate the ages as found in S17, but instead they are comparable. However, as discussed previously, including diffusion has been shown to reduce age estimates compared to those without. Since we have included diffusion, this could explain the similar ages despite the difference in eff scale.
Including the Sun in our pooled models affected the resulting ages more than mass and radius. Including the Sun typically overestimated the ages compared to models without the Sun. This is expected given the higher MLT for the models including solar data, because a larger mixing-length leads to more efficient nuclear burning and more time spent during the MS phase.

Systematic uncertainties
We have already accounted for systematics due to the choice of helium enrichment and mixing-length parameter by marginalising over their uncertainties assuming their population distributions. However, there are other model physics which we have not freely varied, including diffusion and choice of solar mixture. Although our method can be adapted to different stellar evolutionary codes and choice of physics, an in-depth analysis of systematic uncertainties is left to future work.
In previous work studying stars in the APOKASC sample, several pipelines used a range of stellar evolutionary codes and model physics are employed to evaluate systematic uncertainties from the models (Serenelli et al. 2017;Silva Aguirre et al. 2017). Using a hierarchical model in this work enabled us to reduce median statistic uncertainties to 2.5 per cent in mass, 1.2 per cent in radius, and 12 per cent in age. The systematic uncertainty analysis of S17 found median systematic uncertainties of 3, 1, and 13 per cent in mass, radius, and age respectively. Reducing statistical uncertainties highlights the importance of understanding systematics uncertainties.
Other systematics could arise from observational data. For example, we chose the ASPCAP DR14 eff scale which was systematically lower than the photometric scale of choice in S17. However, our method was still able to recover similar masses, radii, and ages. This could be explained by our choice of stellar model physics, as discussed previously.

Outliers
We identified KIC 9025370 as a possible outlier. Consistent across all our models, its output effective temperature, eff = 5934 ± 50 K was about 4-greater than its observed eff , and its modelled was about 2-dimmer than its observed luminosity. Only Δ and [M/H] surf were consistent between modelled and observed values. The difference was also apparent in our comparison of ages with S17 where we obtained an age of 1.5 +0.7 −0.6 Gyr compared to their value of 7.0 +2.0 −1.6 Gyr. KIC 9025370 turned out to be a double-lined spectroscopic binary (Nissen et al. 2017), discovered after S17 and hence included in the original sample. The higher observed luminosity from I and inconsistent spectroscopic eff compared with our model posteriors were compatible with a spectroscopic binary. We calculated a photometric eff using the IRFM method (Casagrande et al. 2010) with the available 2MASS photometry for the target and obtained eff = 5983 ± 120 K, more consistent with our modelled effective temperature and inconsistent with its spectroscopic eff . Thus, our inferred eff was within the dispersion between different observed eff scales. Running the model without KIC 9025370 did not affect the resulting inferred hyperparameters, demonstrating the robustness of our model. Therefore, we present KIC 9025370 in our results but suggest that further investigation should be carried out.

CONCLUSION
We have shown that modelling init and MLT to improve inference of fundamental parameters can be done through the use of an HBM, whilst still improving statistical uncertainties. Our results were in good agreement with S17 with small changes in mass and radii expected from our choice of model physics and updated observables. Taking our partially-pooled model including the Sun (PPS) as our preferred set of results, we obtained median statistical uncertainties on , , and of 2.5, 1.2, and 12 per cent respectively. Furthermore, we demonstrated that the uncertainties reduced with increasing sample size in a population of synthetic stars, giving scope to further improve our inference on larger sample sizes from TESS.
We found that the gradient, Δ /Δ of the linear helium enrichment law ranged from 0.8 to 1.6 depending on the level of parameter pooling and the inclusion of the Sun in our sample, with Δ /Δ = 1.1 +0.3 −0.3 from our preferred PPS model. Consistent across our models was the spread in initial helium about the enrichment law, = 0.005 +0.004 −0.003 . The mean MLT in the population was = 1.90 +0.10 −0.09 for the PPS model, with values from 1.7 to 2.1 depending on the level of pooling and whether or not solar data was included. We also found the spread in MLT doubled to = 0.13 +0.06 −0.05 to account for the addition of the Sun in our sample. We conclude that there are still discrepancies between the best-fitting MLT in our population and that of the Sun which need to be investigated further. Perhaps, the addition of asteroseismic signatures of helium abundance (see e.g. Verma et al. 2017) would improve our constraints on init and thus reduce star-by-star uncertainties in MLT .
Using HBMs has allowed us to introduce more free parameters without sacrificing statistical uncertainties. We used an ANN to approximate stellar models, a method which can be extended to higher input dimensions with little impact on training and evaluation time. Our model also scales well with the number of stars, making use of GPU parallel processing when sampling the posterior.
As shown in tests with synthetic stars (Appendix C) and apparent in Fig. 6, increasing the number of stars decreases the statistical uncertainties when parameters are pooled. The theoretical limit to this improvement is √︁ 1 / 2 for two populations of size 1 and 2 . For example, if we increase our sample to 300 stars, we would expect the uncertainties to reduce by up to a factor of 2. Naturally, the uncertainty is still limited by observational precision. However, hierarchical modelling as demonstrated in this work, allows us to get the most out of our data and paves the way for a data-driven analysis of model systematics.  Including all-sky data from TESS and in anticipation of PLATO (Rauer et al. 2014) we can expect our sample size of asteroseismic dwarfs and subgiants only to increase. There is also scope to extend our grid of models to include red giants, for which there are vast catalogues of stars already studied with Kepler (Pinsonneault et al. 2018).
Once we constructed our grid of models, we needed a way in which we could continuously sample the grid for use in our statistical model. We opted to train an ANN. The ANN is advantageous over interpolation because it scales well with dimensionality, training and evaluation is fast, and gradient evaluation is easy due to its roots in linear algebra (Haykin 2007). We trained an ANN on the data generated by the grid of stellar models to map fundamentals to observables. Firstly, we split the grid into a train and validation dataset for tuning the ANN, as described in Appendix A1. We then tested a multitude of ANN configurations and training data inputs, repeatedly evaluating them with the validation dataset in Appendix A2. In Appendix A3, we reserved a set of randomly generated, off-grid stellar models as our final test dataset to evaluate the approximation ability of the best-performing ANN independently from our train and validation data. Here, we briefly describe the theory and motivation behind the ANN.
An ANN is a network of artificial neurons which each transform some input vector, based on trainable weights, and a bias, . The weights are represented by the connections between neurons and the bias is a unique scalar associated with each neuron. A multi-layered ANN is where neurons are arranged into a series of layers such that any neuron in layer − 1 is connected to at least one of the neurons in layer .
In this work, we considered a fully-connected ANN, where each neuron in layer − 1 is connected to every neuron in layer . The output of the -th neuron in layer is, where is the activation function for the -th layer, , are the weights connecting all the neurons in layer −1 to the current neuron, and , is the bias. This generalises such that the output of the -th layer is, where is the matrix of weights leading to all neurons in the -th layer. For a regression ANN, we typically have a linear activation function applied to the final output layer. Layers of neurons between the input and output layers are called hidden layers. Therefore, the output of a network of hidden layers with initial input X is, We also restricted our configuration to an ANN with the same number of neurons, in each hidden layer. Hereafter, we refer to our choice of neurons per layer, and hidden layers, as the architecture (see Fig. A1).
To fit the ANN, we used a set of training data, D train = {X , Y } train =1 comprising train input-output pairs. We split the training data into pseudo-random batches, D batch because this has been shown to improve ANN stability and computational efficiency (Masters & Luschi 2018). The set of predictions made for each batch is evaluated using a loss function which primarily comprises an error function, (D batch ) to quantify the difference between the training data outputs (Y) and predictions ( Y). We also considered an additional term to the loss called regularisation which helps reduce over-fitting (Goodfellow et al. 2016). During fitting, the weights are updated after each batch using an algorithm called the optimizer, back-propagating the error with the goal of minimising the loss such that Y ≈ Y (see e.g. Rumelhart et al. 1986).
We trained the ANN using T F (Abadi et al. 2016). We varied the architecture, number of batches, choice of loss function, optimizer, and regularisation during the optimisation phase. For each set of ANN parameters, we initialised the ANN with a random set of weights and biases and minimized the loss over a given number of epochs. An epoch is defined as one iteration through the entire training dataset, D train . We tracked the loss for each ANN using an independent validation dataset to determine the most effective choice of ANN parameters (see Appendix A2).

A1 Train, validation, and test data
We built the train and validation datasets from the outputs of the grid of stellar models in Section 3.1. This included the input parameters: , MLT , init , and the initial heavy-elements fraction, init . We also included the eff , log , Δ , stellar age ( ), radius ( ), surface metallicity ([M/H] surf ), and other chemical composition information generated by the models. We determined the fractional MS lifetime, MS = / MS , of each evolutionary track by taking MS as the age when the central hydrogen fraction, < 0.01. We then cut data where MS < 0.01 to remove points on the grid prior to the MS. Once we had refined the data from the grid of models, we randomly sampled 7.736 × 10 6 points to use as the training dataset, with the remaining ∼ 2 × 10 6 points given to the validation dataset. We varied our choice of ANN input and output parameters among those available in the dataset during tuning (see Appendix A2).
Additionally, we produced a test dataset of ∼ 2 × 10 6 stellar models evolved using MESA. Values for the initial , [M/H], , and MLT were chosen randomly within the range of grid parameters described in Table 2 such that they spanned the breadth of the grid in an unbiased manner. We prepared this dataset in the same way as the training set, but also constrained it to < 15 Gyr because we consider ages above ∼ 15 Gyr unphysical and such points are sparse in the training data. The test dataset was set aside and evaluated on the final ANN.

A2 Tuning
We trained an ANN to reproduce stellar observables according to our choice of physics with greater accuracy than typical observational precisions. We experimented with a variety of ANN parameter choices, such as the architecture, activation function, optimization algorithm, and loss function. We tuned the ANN parameters by varying them in both a grid-based and heuristic approach, each time evaluating the accuracy using the validation dataset.
During initial tuning, we found that having stellar age as an input was unstable, because it varied heavily with the other input parameters. We mitigated this by introducing an input to describe the fraction of time a star had spent in a given evolutionary phase, evol .
where end is the age of the star at the end of the track, and MS is the MS lifetime. A star with 0.01 < evol ≤ 1.0 is in its MS phase, burning hydrogen in its core, and 1.0 < evol ≤ 2.0 has left the MS. Consequently, evol gives the ANN information about the internal state of the star which affects the output observables.
Otherwise, evol has little physical meaning, although it could be interpreted as a measure of the evolutionary phase of the star. We also observed that the ANN trained poorly in areas with a high rate of change in observables, likely because of poor grid coverage in those areas. To bias training to such areas, we calculated the gradient in eff and log between each point for each stellar evolutionary track and used them as optional weights to the loss during tuning. These weights multiplied the difference between the ANN prediction and the training data in our chosen loss function.
After preliminary tuning, we chose the ANN input and output parameters to be X = { evol , , MLT , init , init } and Y = {log( ), eff , , Δ , [M/H] surf } respectively. A generalised form of our neural network is depicted in Fig. A1. The inputs corresponded to initial conditions in the stellar modelling code and the outputs corresponded to surface conditions throughout the lifetime of the star, with the exception of age which is mapped from evol .
We standardised the training dataset by subtracting the median, 1/2 and dividing by the standard deviation, for each input and output parameter. We found that the ANN performed better when the training data was scaled in this way as opposed to no scaling at all. We present the parameters used to standardise the training dataset in Table A1.
We found that the optimal choice of architecture ( and ) varied depending on our choice of other ANN parameters. Therefore, each time we explored a new parameter, we trained an ANN with a grid of ( , ) ranging from (32, 2) to (512, 10).
We evaluated the performance of three activation functions: the hyperbolic-tangent, the rectified linear unit (ReLU; Hahnloser et al. 2000;Glorot et al. 2011) and the exponential linear unit (ELU; Clevert et al. 2015). Although the ReLU activation function outperformed the other two in speed and accuracy, the resulting ANN output was not smooth. The discontinuity in the ReLU function, ( ) = max(0, ) in turn caused the output to be discontinuous. This made the ANN difficult to sample for our choice of statistical model (see Section 3.2). Out of the remaining activation functions, ELU performed the best, providing a smooth output which was well-suited to our probabilistic sampling methods.
We compared the performance of two optimisers: Adam (Kingma & Ba 2014) and stochastic gradient descent (SGD; see e.g. Ruder 2016) with and without momentum (Qian 1999). Both optimizers required a choice of learning rate which determined the rate at which the weights were adjusted during training. We found that Adam performed well but the validation loss was noisy as a function of epochs as it struggled to converge. The SGD optimizer was less noisy than Adam, but it was difficult to tune the learning rate. However, SGD with momentum allowed for more adaptive weight updates and outperformed the other configurations.
There are several ways to reduce over-fitting, from minimising the complexity of the architecture, to increasing the size and coverage of the training dataset. One alternative is to introduce weight regularisation. So-called L2 regularisation adds a term, ∼ 2 , to the loss function for a given hidden layer, which acts to keep the weights small. We varied the magnitude of and found that if it was too large it would dominate the loss function, but if it was too small then performance on the validation dataset was poorer.
We compared the choice of two error functions: mean squared error (MSE) and mean absolute error (MAE). The former is widely used among ANNs because it is more sensitive to large errors. However, we tracked both metrics regardless of which was added to the loss function and found that MAE converged faster. Although MAE is less effective at large errors, we found that these were typically at the edges of the grid and the accuracy was good enough everywhere else.
After extensive tuning, we opted for an ANN with = 128 neurons in each of = 6 hidden layers. Each of the hidden layers used an ELU activation function and L2 weight regularisation with = 1 × 10 −6 . We trained the ANN for 50 000 epochs with a 500 training data batches each containing 15 472 input-output pairs. To fit the ANN, we used an SGD optimiser with an initial learning rate of 1 × 10 −4 and momentum of 0.999 with an MAE loss function. Training took ∼ 48 h on an NVidia Tesla V100 graphics processing unit (GPU). In Fig. A2 we show the training and validation MAE as a function of epochs for the final ANN configuration. The training and validation loss were comparable throughout training.

A3 Testing
The test dataset contained ∼ 2 × 10 6 stellar models evolved in the same way as the training dataset, but with initial conditions chosen randomly across the range of the grid. We made predictions for the test dataset, deriving luminosity from the output radius and effective temperature, using the final trained ANN as described in Appendix A2. We then evaluated the accuracy of the ANN by taking the difference between the test truth and ANN prediction, true − pred . We found good agreement between the test dataset and ANN predictions, within typical observational uncertainties. We noted that the largest errors lay at the boundaries of the training data and in areas sparsely populated by the grid. This is apparent in Fig. A3 where we plot the test error against each parameter. For example, the spread in error increases for [M/H] surf < −0.5 where training data is sparse at the edge of the grid. However, the accuracy is very good within the observed range covered by our sample of 81 dwarfs and subgiants. Hence, we chose the median absolute deviation (MAD) as an estimator of the spread in error because it is less sensitive to large errors at the grid boundary than the standard deviation.
To represent the accuracy of the ANN, we present the median, 1/2 and MAD estimator, MAD = 1.4826 · median(| ( ) − 1/2 |) of the error ( ( )) in Table A2. The median is close to zero for all parameters, showing little systematic bias in the ANN. The MAD is also lower than observational uncertainties quoted in Section 2. The spread in error for Δ of 0.06 Hz is comparable to a small number of observations with the best signal-to-noise. However, the error in Δ predictions is also comparable to other systematic uncertainties in Δ discussed in Section 3.1.2. Therefore, a robust model which takes account of systematic uncertainties pertaining to Δ , including those from the ANN, will be explored in future work (Carboneau et al. in preparation).

APPENDIX B: PRIOR DISTRIBUTIONS
We chose a transformed beta distribution (see Equation 4) as the prior for the non-pooled stellar parameters as an alternative to a uniform distribution. Fig. B1 shows the beta distribution compared with a uniform distribution for some parameter from 0 to 1. We found that the continuously differentiable nature of the beta distribution was preferred by the NUTS over the uniform distribution.

APPENDIX C: THE SYNTHETIC POPULATION
In this section, we present the results for the NP, PP, and MP models run on a synthetic sample of 100 stars with the following initial conditions. We randomly generated initial and [M/H] init uniformly. We drew initial values for init from a normal distribution centred on the helium enrichment law from Equation 10 with Δ /Δ = 1.8 and = 0.247, and scaled by = 0.008. We also generated initial values for MLT from a normal distribution centred on = 2.0 and scaled by = 0.08. We evolved the synthetic stars to randomly chosen ages using MESA. We then took the output , eff , , Δ , and [M/H] surf from the models and used these as true values for each of the stars. We added random noise to the observed quantities centred on the true values with a standard deviation of 2.2 per cent in eff , 3.5 per cent in , 0.9 µHz in Δ , and 0.07 dex in [M/H] surf chosen to be representative of the APOKASC sample.

C1 Stellar parameters
We found that the NP model recovered the true values for the individual stellar parameters, but the uncertainties were unreliable. The observational quantities alone were not good enough to constrain init and MLT . As a result, their distributions were truncated at the bounds of their priors. These boundary effects skewed the marginalised posterior means for init and MLT towards the centre of the prior (0.28 and 2.0 respectively).
The PP model recovered true values for the synthetic stars with more reliable uncertainty than the NP model. The addition of pooling init and MLT between the stars improved their uncertainty which reduced the effects of the prior as seen in the NP model. We found little difference between the results of the PP and MP models.
We reran the PP model with 10 and 50 stars chosen randomly from the sample of synthetic stars. In Fig. C1, we show the uncertainties in the several parameters from the results of each of the models. For the two pooled parameters, init and MLT , the uncertainty reduction due to pooling is most obvious. We see the PP model repeatedly improves on the uncertainties from the NP model when stars is increased.
In Fig. C1 we also see a similar reduction in uncertainty for , , and , with all models improve upon the NP model. However, we do not see the same effect in init for which the uncertainty appears dominated by observations of [M/H] surf .

C2 Population parameters
In Fig. C2, we see that the PP model also recovers the hyperparameter truths well, with some noise due to random realisation error. Fitting the model this way has the added benefit over the NP model of improving the inference of the individual stellar parameters, as shown in the previous two sections. We also found that when we ran the PP model with 10 and 50 stars, the uncertainties on the hyperparameters also shrank with increasing stars . Fig. C3 shows the hyperparameter results for the MP model. Here, MLT was assumed to be the same for all stars. This model also recovers the true hyperparameters for helium well, and the assumed value for MLT is within uncertainty of the true .

APPENDIX D: THE SUN AS A STAR
Our model consistently recovers the Sun when modelled in each of the NP, PP, and MP models. In Table D1 we present the results for the Sun as a star from the NP model to show what we obtain without the influence of any other stars in the sample. We show the marginal Table A2. The median error, 1/2 and median absolute deviation of the error, MAD = 1.4826 · median( | (Y) − 1/2 |) for a given ANN output parameter, Y from the test dataset. The error, (Y), is given in the  Figure A3. Left: the rolling error between a given parameter in the test dataset (Y) and the ANN prediction for that parameter ( Y) where Y = Y − Y. Right: a kernel density estimate (KDE) of the error and a normal distribution centred on the median, 1/2 with an estimator for the standard deviation from the median absolute deviation, MAD . and joint posterior distributions for the solar parameters from the NP model in the corner plot in Fig. D1. We found some differences between MLT from our solar model and solar calibrations in the literature produced using MESA with similar input physics. For example, the solar calibration in Stancliffe et al. (2016) using Asplund et al. (2009) abundances yields compatible initial abundances, init = 0.0149 and init = 0.266 but MLT = 1.783 which differs from our results by about 10-. This is likely because of a few differences in observed values used for the calibration. Stancliffe et al. (2016) used observed helium abundance and convection zone depth measurements from helioseismology. Furthermore, solar calibrations typically include convective envelope overshooting. Presuming overshooting increases mixing in the star, we might expect a lower MLT to compensate this. Therefore, we stress that the addition of the Sun as a star in our model is with the assumption of our choice of input physics. This paper has been typeset from a T E X/L A T E X file prepared by the author.  Figure B1. A beta distribution (B) with = = 1.2 for some parameter and a uniform distribution (U) from 0 to 1.