The Horizon-AGN simulation: evolution of galaxy properties over cosmic time

We compare the predictions of Horizon-AGN, a hydro-dynamical cosmological simulation that uses an adaptive mesh refinement code, to observational data in the redshift range 0<z<6. We study the reproduction, by the simulation, of quantities that trace the aggregate stellar-mass growth of galaxies over cosmic time: luminosity and stellar-mass functions, the star formation main sequence, rest-frame UV-optical-near infrared colours and the cosmic star-formation history. We show that Horizon-AGN, which is not tuned to reproduce the local Universe, produces good overall agreement with these quantities, from the present day to the epoch when the Universe was 5% of its current age. By comparison to Horizon-noAGN, a twin simulation without AGN feedback, we quantify how feedback from black holes is likely to help shape galaxy stellar-mass growth in the redshift range 0<z<6, particularly in the most massive galaxies. Our results demonstrate that Horizon-AGN successfully captures the evolutionary trends of observed galaxies over the lifetime of the Universe, making it an excellent tool for studying the processes that drive galaxy evolution and making predictions for the next generation of galaxy surveys.


INTRODUCTION
In recent years, an explosion of multi-wavelength survey data has enabled us to probe the evolution of galaxy properties over ∼90% of cosmic time. To interpret these observations in the context of the ΛCDM paradigm (e.g. Rees & Ostriker 1977;White & Rees 1978), and understand the processes that underpin galaxy formation, wellcalibrated models are required, that reproduce the broad evolution of galaxy properties over the lifetime of the Universe.
In this paradigm, initial density perturbations, that are gravitationally amplified, collapse to form dark matter halos. Smaller halos form first and merge under the influence of gravity to form progressively larger ones (e.g. Peebles  (Pichon et al. 2010), in the u, r and z filters. The resolution is 0.15"/pixel and the image is computed using star particles in the redshift range 0.1 < z < 5.8. Dust extinction and non-stellar sources are not taken into account in this mock image. 2012). However, observational evidence for this process remains mixed. While there is strong evidence that AGN regulate cooling from hot gas (e.g. Tabor & Binney 1993;McNamara & Nulsen 2007), the AGN couple only weakly to gas acquired via mergers and accretion, at least at low redshift (e.g. Kaviraj et al. 2015;Sarzi et al. 2016). Nevertheless, both theoretical and observational work indicates that AGN are likely to play an important role in shaping galaxy evolution (e.g. Silk & Rees 1998;Bower et al. 2006;Croton et al. 2006;Cattaneo et al. 2009;Kimm et al. 2012;Dubois et al. 2012).
While the interplay between star formation and feed-back processes shapes stellar-mass and black-hole growth, the observed changes in the morphological mix of the Universe (e.g. Mortlock et al. 2013;Conselice 2014) is postulated to be driven largely by merging. Both major mergers (e.g. Toomre & Toomre 1972;White 1978;Barnes 1992;Hernquist 1992;Springel et al. 2005) and minor interactions (which can trigger disc instabilities, see e.g. Dekel et al. 2009;Kaviraj et al. 2013;Welker et al. 2015a), can convert rotation-dominated (disky) systems into dispersiondominated spheroids.
Over the last two decades, semi-analytical models (e.g. White & Rees 1978;White & Frenk 1991;Kauffmann et al. 1993;Hatton et al. 2003;Baugh 2006;Benson 2012) have been successful in reproducing many of the bulk properties of galaxies over a significant fraction of cosmic time. By employing approximations derived from more detailed numerical simulations, and empirical calibrations from data, the semi-analytical approach has offered a computationally inexpensive route to probing the phenomenology of galaxy formation and, in particular, the theoretical analysis of today's large survey datasets (e.g. Cole et al. 2000;Hatton et al. 2003;Bower et al. 2006;Somerville et al. 2012;Overzier et al. 2013;Bernyk et al. 2016). A significant recent advance has been the advent of full hydro-dynamical simulations in cosmological volumes which, while more computationally demanding than the semi-analytical approach, evolve the dark matter and baryons self-consistently. These simulations provide predictions for the gas and baryonic components of galaxies at high spatial resolution (on ∼kpc scales) and, while sub-grid prescriptions are still needed, these are applied on ∼kpc scales rather than at ∼ 100 kpc scales as in the semi-analytical approach.
The cosmological box sizes of the current generation of hydro-dynamical simulations offer, for the first time, detailed survey-scale predictions that can be compared to contemporary datasets across a large fraction of cosmic time (e.g. Devriendt et al. 2010;Schaye et al. 2010;Dubois et al. 2014;Vogelsberger et al. 2014;Schaye et al. 2015;Khandai et al. 2015). In concert with current and forthcoming observational data, these simulations will, over the next few years, play a central role in advancing our understanding of the key processes that drive galaxy evolution, particularly in the (still poorly understood) high-redshift Universe. It is, therefore, important to study the reproduction of galaxy properties in such models to establish whether they provide a reliable framework for interpreting current and future observational datasets.
In this paper, we explore the predicted redshift evolution of galaxy properties in the Horizon-AGN cosmological simulation (Dubois et al. 2014), the first such simulation that uses an adaptive mesh refinement (AMR) code and reaches z = 0. Recent work has used this simulation to probe the intrinsic alignments of galaxies for calibrating weak-lensing analyses (Dubois et al. 2014;Welker et al. 2015a,b;Codis et al. 2015;Chisari et al. 2016), the evolution of galaxy morphology (Welker et al. 2015a) and the role of merging at high redshift (Kaviraj et al. 2015). This study is part of a series of papers that will explore the cosmic evolution of black holes (Volonteri et al. 2016, Beckmann et al. in prep), the morphological transformation of galaxies (Dubois et al. in prep) and dark-matter cusp-core modulation by AGN activity (Peirani et al. in prep).
In this work, we study the predicted evolution, in Horizon-AGN, of quantities that are sensitive to the aggregate star formation history of the galaxy population: luminosity and stellar mass functions, the star formation main sequence, rest-frame UV-optical-near infrared colours and the cosmic star formation history. By comparing these predictions to an array of corresponding observational data in the redshift range 0 < z < 6, we explore how well the simulation captures the evolutionary trends of observed galaxies, and probe its usefulness as a tool to investigate the processes that drive that evolution. This paper is organized as follows. In Section 2, we de-scribe the simulation and the methodology used for the prediction of galaxy luminosities, stellar masses and rest-frame colours. In Section 3, we study the reproduction of luminosity functions, the star formation main sequence and restframe UV-optical-near infrared colours in Horizon-AGN. In Section 4, we probe the reproduction of stellar-mass functions and the cosmic star formation history in the model. We summarize our findings in Section 5.

THE HORIZON-AGN SIMULATION
Horizon-AGN is a cosmological hydro-dynamical simulation (Dubois et al. 2014) that employs the adaptive mesh refinement Eulerian hydrodynamics code, RAMSES (Teyssier 2002). While the simulation is described in Dubois et al. (2012) and Dubois et al. (2014), we briefly revisit key aspects of the model here. The size of the simulation box is 100 h −1 Mpc (comoving), which contains 1024 3 dark matter particles and uses initial conditions corresponding to a WMAP7 ΛCDM cosmology (Komatsu et al. 2011). The dark matter mass resolution is 8 × 10 7 M⊙. A quasi Lagrangian criterion is used to refine the initially uniform 1024 3 grid, when 8 times the initial total matter resolution is reached in a cell, down to a minimum cell size of 1 kpc in proper units. Gas cools via H, He and metals (following Sutherland & Dopita 1993) down to 10 4 K, and a uniform UV background is switched on at z = 10, following Haardt & Madau (1996).

Star formation and stellar feedback
Star particles are created using a standard 2% efficiency per free fall time (Kennicutt 1998), when the gas hydrogen density reaches a critical threshold of 0.1 H cm −3 . Star formation is assumed to follow a Schmidt-Kennicutt law (Kennicutt 1998), with a Poissonian random process (Rasera & Teyssier 2006;Dubois & Teyssier 2008) that has a stellar mass resolution of ∼2 × 10 6 M⊙.
We implement a subgrid model for stellar feedback that probes all processes that may impart thermal and kinetic feedback on the ambient gas. Many previous works that implement stellar feedback employ a single supernova explosion per star particle to minimize computational cost (e.g. Dubois & Teyssier 2008). However, this is an oversimplification, particularly from the point of view of chemical enrichment. A significant fraction of stellar mass is, in fact, lost through various phases of stellar evolution, such as Wolf-Rayet stars or the asymptotic giant branch (Leitherer et al. 1992). Thus, stellar feedback should be modelled more realistically by taking into account stellar winds and both Type II and Type Ia supernovae (SNe; e.g. Kobayashi & Nakasato 2011;Hopkins et al. 2012).
To this end, Horizon-AGN implements continuous stellar feedback that includes momentum, mechanical energy and metals from Type II SNe, stellar winds, and Type Ia SNe. For stellar winds and Type II SNe, Starburst99 (Leitherer et al. 1999(Leitherer et al. , 2010 is used to generate look-up tables as a function of metallicity and age. Specifically, we use the Padova model (Girardi et al. 2000) with thermally pulsating asymptotic branch stars (Vassiliadis & Wood 1993), with the kinetic energy of stellar winds calculated via the 'Evolution' model of Leitherer et al. (1992).
We implement Type Ia SNe following Matteucci & Greggio (1986), assuming a binary fraction of 5% (Matteucci & Recchi 2001). The chemical yields for Type Ia explosions are taken from the W7 model of Nomoto et al. (2007). Although the energy input from this source is minor compared to that of Type II SNe (≈ 10% of the total kinetic energy), they provide a significant fraction (≈ 50%) of the iron for the chosen parameters. In order to mimic the propagation of bubbles as realistically as possible, we allow for the injection of energy, mass and momentum only if a blast wave from star particles in each cell propagates to rB ≥ 2 ∆ x, where ∆ x is the size of the host cell and rB is the radius of the shock front at ∆t: (1) If the energy released from each cell at ∆t ≡ t last − tnow is not large enough to push the blast wave to 2 ∆ x, we accumulate the energy, momentum, and metals until the next time step, where t last is the time at which the last blast wave is launched. This produces a more realistic evolution of bubbles and prevents them expanding too rapidly.
To reduce computational cost, we model the stellar feedback as a heat source after 50 Myr, while the energy liberated before 50 Myr from star particles is modelled as kinetic feedback, as described above. This is a reasonable choice, given that, after 50 Myr, almost all of the energy is liberated via Type Ia SNe that have time delays between several hundred Myrs to a few Gyrs (e.g. Maoz et al. 2012). These systems are less prone to excessive radiative losses, as stars are likely to disrupt or move away from their dense birth clouds after around a few tens of Myrs (e.g. Blitz & Shu 1980;Hartmann et al. 2001).

Feedback from black holes
Seed black holes (BHs) with a mass of 10 5 M⊙ are assumed to form in dense star forming regions where both the gas and stellar densities are above ρ0, and where the stellar velocity dispersion is larger than 100 km s −1 . The growth of the BH is tracked self consistently, based on a modified Bondi accretion rate at high gas densities (Booth & Schaye 2009). The accretion rate is capped at Eddington, with a standard radiative efficiency of 0.1.
The central BH impacts ambient gas in two possible ways, depending on the gas accretion rate. For Eddington ratios > 0.01 (high accretion rates), 1.5% of the accretion energy is injected as thermal energy (a quasar-like feedback mode), whilst for Eddington ratios < 0.01 (low accretion rates), bipolar jets are employed with a 10% efficiency. The parameters are chosen to produce agreement with the local cosmic black-hole mass density, and the MBH -M * and MBH -σ * relations (Dubois et al. 2012). An explicit dynamical drag force is exerted from the gas onto the BHs (Ostriker 1999;Chapon et al. 2013) in order to stabilize BH motions into galaxies and suppress limited resolution effects (Dubois et al. 2013). Finally, BHs are allowed to merge when they are closer than 4 kpc and when their relative velocity is smaller than the escape velocity of the binary.
We note that, apart from choosing the BH-feedback parameters to match the MBH -M * and MBH -σ * relations at z = 0, Horizon-AGN is not otherwise tuned to reproduce the bulk properties of galaxies such as stellar mass and luminosity functions, galaxy sizes etc. at z ∼ 0.

Galaxy identification and prediction of observables
We identify galaxies using the AdaptaHOP structure finder (Aubert et al. 2004;Tweed et al. 2009), applied to the distribution of star particles. Structures are selected using a local threshold of 178 times the average matter density, with the local density of individual particles calculated using the 20 nearest neighbours. Only structures that have more than 50 particles are considered. We compute galaxy fluxes and magnitudes using the stellar models of Bruzual & Charlot (2003, BC03 hereafter), using a Chabrier (2003) initial mass function (IMF). We assume that each star particle behaves as a simple stellar population (SSP), and compute its contribution to the total spectral energy distribution (SED) by logarithmically interpolating the models in metallicity and age and multiplying by the initial mass of the particle. Figure 1 shows a 14 arcmin 2 mock u, r, z composite image from the Horizon-AGN lightcone (a simulated box where one axis is redshift), constructed using star particles in the redshift range 0.1 < z < 5.8, via the BC03 models as described above. The resolution of the image is 0.15"/pixel and the lightcone was produced on-the-fly at every time step of the simulation. We direct readers to Pichon et al. (2010) for details of the lightcone construction. Note that dust extinction is not taken into account and that non-stellar sources (e.g. AGN) are not included in this image. We calculate attenuation by dust using the SUNSET code. The gas density and metallicity are extracted, under the assumption that the dust mass scales with the gas metal mass, with a dust-to-metal ratio of 0.4 (e.g. Dwek 1998;Draine et al. 2007). We compute the column density of dust, and thus the line-of-sight optical depth for each star particle, using the R = 3.1 Milky Way dust grain model of Weingartner & Draine (2001). The gas is assumed to be transparent beyond 1 virial radius (but we note that relaxing this assumption leaves our conclusions unchanged). This dust implementation assumes that all the dust is placed in a screen in front of each star particle. Nevertheless, the geometry of the metals, and therefore the spatial distribution of dust within the galaxy, is taken into account. The total dust-attenuated SED is computed by summing the contribution of all star particles. Magnitudes are computed after convolving this SED through the appropriate filtercurves. Galaxy magnitudes are extracted within Petrosian apertures (Petrosian 1976;Blanton et al. 2001;. While many observational studies use Petrosian apertures, other variants, such as Kron (e.g. Kron 1980) or modified Sersic (e.g. Bernardi et al. 2013) apertures can also be employed, leading to small shifts in galaxy luminosities (e.g. . However, these offsets are typically lower than the statistical uncertainties. We note that, ideally we would like to quantify the effect of dust on galaxy SEDs via a full radiative transfer approach. While processing all model galaxies using this approach is prohibitively time-consuming, we check, in Section 3 below, whether using a dust screen is a reasonable assumption to make in our calculations. We perform this check by using SUNRISE (Jonsson 2006;Jonsson et al. 2010), a Monte-Carlo radiative transfer code applicable to arbitrary dust geometries, on a small random sample of galaxies. In a similar vein to SUNSET, SUNRISE employs the gas metallicity distribution as a proxy for the dust distribution, with a metal-to-dust ratio of 0.4 and the R = 3.1 Milky Way dust grain model of Weingartner & Draine (2001) to set the dust grain composition and size. A sub-resolution model for photo-dissociation regions (MAPPINGSIII; Dopita et al. 2005, Groves et al. 2008) is used to account for obscuration of young stars by their birth clouds. SUNRISE then performs radiative transfer ray tracing to calculate the dust absorption and scattering. The attenuated SED is convolved with the relevant filtercurves to extract rest-frame magnitudes.
As shown below, good agreement is found between the predicted magnitudes from the screen (SUNSET) and radiative transfer (SUNRISE) approaches in the optical filters, with small offsets in the UV wavelengths. Overall, the dust screen assumption appears sufficient for the purposes of this study. However, as we note in the following sections, the uncertainties in the modelling of dust are large and are, in particular, difficult to quantify for high-redshift galaxies without future datasets. The comparison of rest-frame UV-optical-near infrared colours, while useful, may not offer the best test of the overall performance of the model.

LUMINOSITIES, STAR FORMATION RATES AND COLOURS
We begin by comparing the predictions of Horizon-AGN to observed luminosity functions, since these are one of the basic quantities delivered by observations. Since the ages and metallicities of star particles are known precisely in the simulation, galaxy luminosities can be constructed, the main uncertainty in the predicted luminosity being the derivation of the dust mass from the gas and metal content of the model galaxy (e.g. Guiderdoni & Rocca-Volmerange 1987;Devriendt et al. 1999). We note that the simulated gas-phase metallicity in Horizon-AGN is under-estimated by a factor of ∼ 2 to 4 compared to observations (e.g. Maiolino et al. 2008;Mannucci et al. 2009). This happens because our blast wave model allows for the propagation of energy and metals only when it reaches 2∆x, where ∆x is the size of a host cell of SNe. Given the relatively low resolution adopted in Horizon-AGN, we find that this tends to delay the metal enrichment of star-forming clouds in the simulation, particularly when the specific star formation rate is low (i.e. at lower redshift).
To correct for these lower metallicities, we calibrate the gas-phase metallicities by multiplying a redshift-dependent renormalisation factor (fno) that brings the simulated metallicity in agreement with the observed mass-gas phase metallicity relations at z = 0, 0.7, 2.5 and 3.5 (Maiolino et al. 2008;Mannucci et al. 2009 Kewley & Dopita (2002) for higher metallicities (which are similar to the ones from Kewley & Ellison (2008)). We do not attempt to make the calibration stellar mass-dependent, but simply adjust the normalization as a function of redshift, following fno. This is because the shape of the predicted mass-gas phase metallicity relation is reasonably consistent with the shape of their observed counterparts at redshifts where data is available, although it should be noted that the data do not extend across the entire mass range spanned by the model galaxies.
In Appendix A we show the mass-gas phase metallicity relations predicted by the simulation, the observational data to which we perform the calibrations and the corrected relations that are used in calculating the properties of model galaxies in the simulation.
In Figure 2, we compare the predicted rest-frame K and r-band luminosity functions in Horizon-AGN to observational data in the redshift range 0 < z < 2. Observational estimates of rest-frame K-band luminosities are largely restricted to this redshift range because observational data currently extends into the observed mid-infrared, from facilities like Herschel (e.g. Pilbratt et al. 2010). Rest-frame K corresponds to the longest wavelengths at which stellar light still dominates the SED, the inter-stellar medium becoming increasingly important longward of this filter (e.g. Fazio et al. 2004;Eales et al. 2010). In addition, its negligible sensitivity to dust and young stars makes the K band a good tracer of the underlying stellar mass of the galaxy (e.g. Kauffmann & Charlot 1998) 1 . The r-band, being a shorter wavelength filter, is more sensitive to the mass-to-light ratio of the galaxy, which in turn depends on its star formation history.
In Figure 2, we compare the predicted luminosity functions with and without dust extinction to their observed counterparts. Given its low sensitivity to dust, the rest-frame K-band luminosity function with and without extinction are almost identical. Thus, we only show the dust-reddened Kband luminosities. The shaded regions indicate the uncertainties in the observed luminosity functions, based on the errors in the fitted Schechter-function (Schechter 1976) parameters (but not including the effect of cosmic variance). In our redshift range of interest, the slope and normalisation of the predicted K-band luminosity function from Horizon-AGN agrees reasonably well with its observational counterparts, within the observational uncertainties. However, some points of tension with the data are worth noting. In the local Universe (z ∼ 0.1), the simulation does not reproduce all observational datasets equally well, overshooting the Eke et al. (2005) data at both the low and high luminosity ends. While the predicted data points generally fall within the observed ranges, the predicted luminosity functions are slightly steeper than their observed counterparts Figure 2. Comparison of the predicted K (left-hand column) and r-band (right-hand column) luminosity functions from Horizon-AGN to observational data. The solid lines show dust-attenuated luminosity functions predicted by Horizon-AGN, while the greydotted curves show their unattenuated counterparts. Note that, given its low sensitivity to dust, the rest-frame K-band luminosity function with and without extinction are almost identical -we only show the dust-reddened K-band luminosities here. Observational data is shown using the coloured hatched regions (see legend for the individual datasets used).  at all redshifts. The difference in steepness is most apparent at z ∼ 1.7 where it causes the predictions to overshoot the data at high luminosities and undershoot them at low luminosities. Indeed, the tendency to overshoot at low luminosities mirrors a trend seen in the mass-function analysis for low-mass galaxies, and we return to this point in Section 4 below. Nevertheless, the generally good reproduction of the evolving K-band luminosity function, within the observational uncertainties, indicates that, on average, the aggregate mass growth of galaxies over cosmic time is well reproduced by the simulation (this is also borne out by the mass function analysis presented later in this study). Similar trends are seen in the comparison of the predictions to the observed r-band luminosity functions. While the model predictions agree with data within observational uncertainties, the overproduction of galaxies at the low-luminosity end is also apparent in the r-band filter.
We proceed by comparing the 'star-formation main sequence' (i.e. the star formation rate plotted against stellar mass) predicted by Horizon-AGN to observational data in the redshift range 0 < z < 6 ( Figure 3). Comparison to the observed main sequence probes whether the instantaneous star-formation activity in the simulation is consistent with observations. Performing this exercise over a large redshift range then offers insights into how well Horizon-AGN predicts stellar mass growth over cosmic time, compared to what is observed in the real galaxy population. Figure 3 indicates that, within the dispersion between the observed main sequences, the simulation produces good agreement with the observations, across our redshift range of interest (0 < z < 6). The simulated main sequence appears to fall slightly below the observed ones at z ∼ 1.7 and the predictions are inconsistent, at this redshift, with the Karim et al. and Whitaker et al. datasets at high stellar masses, while matching the Tasca et al. data well. However, we note that the observed loci shown are median values and that the observed main sequences typically have spreads of ∼0.5 dex (e.g. Whitaker et al. 2012), suggesting that the overlap between the theoretical and observational values is also reasonable at this redshift. Overall, our results indicate that Horizon-AGN predicts the observed main sequence with good accuracy between z = 0 and z = 6, suggesting that the aggregate stellar mass growth of galaxies with stellar masses greater than ∼10 9 M⊙ is generally well reproduced by the simulation.
We next compare the predicted evolution of rest-frame colours in Horizon-AGN to observational data. To perform this exercise in a consistent way across redshift, we use observational data from a new homogeneous multi-wavelength catalogue of a single area of sky: COSMOS2015 (Laigle et al. submitted). The COSMOS2015 catalog contains 30 bandphotometry, from UV to IR (0.25-7.7µm) of more than half a million objects in the 2 deg 2 COSMOS field. It includes the optical datasets from previous releases (Capak et al. 2007;Ilbert et al. 2009), the new Y -band data taken with the Subaru Hyper-Suprime-Cam (PI: Guenther), new nearinfrared (NIR) data from the UltraVISTA-DR2 survey and IR data from the SPLASH program (P. L. Capak et al. in prep.). The photometry for the COSMOS2015 catalog is extracted using SExtractor (Bertin & Arnouts 1996) in dual image mode. The detection image is the chi-squared sum of four NIR images from UltraVISTA-DR2 (Y, J, H, Ks) Figure 5. Comparison of the predicted magnitudes in the N U V (top), r (middle) and J (bottom) band filters, using SUNRISE (full radiative transfer) and SUNSET (dust screen in front of each star particle), on a small random sample of Horizon-AGN galaxies at z ∼ 0.1. The offset in these predicted magnitudes (∆) is shown in the top section of each panel, with ∆ ≡ MSUNRISE -MSUNSET. and the optical z ++ band image from Subaru Suprime-Cam. Photometric redshifts, rest-frame magnitudes and stellar masses are computed using SED fitting via the LePhare code (Arnouts et al. 1999(Arnouts et al. , 2002Ilbert et al. 2006), following the methodology described in Ilbert et al. (2013). Comparison to published spectroscopic redshifts of galaxies in the COSMOS field indicates a precision of 0.007 and a catastrophic failure rate of 0.5% for bright galaxies (i+ < 22.5). The deepest region of the COSMOS2015 catalogue reaches a 90% completeness limit for galaxies with masses greater than 10 10 M⊙, out to z = 4. Note that, since COSMOS2015 is a deep field, it does not contain many galaxies in the local Universe. Our lowest redshift bin in this analysis is therefore z = 0.3.
In Figure 4, we compare the predicted evolution of restframe N U V − R − J colours in Horizon-AGN to observed galaxies in COSMOS2015. The UV (N U V −R) colour traces very recent star formation (stars with ages < 0.5 Gyr), with even residual (< 1%) mass fractions of young stars capable of driving galaxies into the UV blue cloud (Kaviraj et al. 2007a,b). The optical (R − J) colour, on the other hand, traces stellar mass growth over several Gyrs in the past (Kaviraj et al. 2007a). Taken together, these colours probe the formation history of the galaxy population over the last few Gyrs. Figure 4 indicates that the model galaxies occupy similar parts of the N U V − R − J colour space as their observed counterparts. While the simulated galaxies agree well with the observed optical colours (although note that the model occupies a narrower locus at z ≥ 1.6), the predicted bimodality in the UV colour is weaker than in observational data, especially at low redshift. However, the good reproduction of the star formation main sequence and the optical colours indicates that this is largely due to small amounts of residual star formation in the model galaxies (recall that even negligible mass fractions of young stars can produce blue UV colours). While, this residual star formation has little bearing on the bulk stellar-mass growth of the galaxy population as a whole, the weakness of the bimodality in the predicted UV colours suggests that the feedback recipes employed by the model do not quench star formation as completely as is the case in real galaxies at low redshift.
As noted before, a large uncertainty in this exercise is the way the attenuation by dust is estimated for simulated galaxies. In particular, it is worth exploring the robustness of the dust screen assumption in SUNSET, by comparing these results to a full radiative transfer treatment using SUNRISE. In Figure 5 we compare magnitudes computed using SUNSET to their counterparts using SUNRISE. As noted above, processing all simulated galaxies using this approach is prohibitively time-consuming. Therefore, we explore the potential differences between these two approaches for a small, random sample of galaxies at z ∼ 0.3 that spans our mass range of interest.
We find that, while the optical magnitudes are almost identical using the two approaches, the predicted UV magnitudes are somewhat fainter when they are calculated via full radiative transfer and show a dependence on the UV magnitude itself ( Figure 5). The difference is due to the fact that the optical depth due to both absorption and scattering is, on average, larger in the shorter wavelengths. A simple screen absorption model (as employed by SUNSET) is likely to progressively underestimate the attenuation at increasingly shorter wavelengths. In addition, our results suggest that the contribution of scattering to the optical depth depends on the dust geometry. Figure 6 indicates that applying the average values of the offsets in Figure 5 enhances the bimodality slightly, but not enough to achieve reason- Figure 6. The predicted rest-frame N U V − R − J colours in Horizon-AGN at z = 0.3, colour-coded by the absolute N U V magnitude. The arrows indicate how the predicted colours are likely to move if the effect of dust were estimated using SUNRISE (full radiative transfer) rather than SUNSET (dust screen in front of each star particle).
able agreement with the observational data. Our conclusions above therefore remain unchanged -the feedback recipes in the model appear unable to quench star formation to the extent that is required to produce the bimodality in rest-frame UV-optical-near infrared colours.
It is worth noting here that there are additional uncertainties in the treatment of dust that may further complicate the comparison between theory and observation. For example, while the dust-to-metal ratio is assumed to be a fixed (Milky-Way-like) value in our analysis, this value may not be applicable to all galaxies. Studies of damped Lyman-α absorbers (e.g. Vladilo 2004) and the UV/optical afterglow spectra of gamma ray burst host-galaxies at high redshift (e.g. De Cia et al. 2013) indicate that dust-to-metal ratios may vary with both metallicity and the metal column density (see also Fisher et al. 2014;Herrera-Camus et al. 2012). In a similar vein, the extinction law (assumed in this study to be Milky-Way-like), may also vary as a function of galaxy properties like age and metallicity (e.g. Buat et al. 2012). The potentially large unknowns in the treatment of dust makes the prediction of colours, especially those involving shorter wavelengths, uncertain. In particular, the current dearth of observational data at high redshift makes it difficult to ascertain whether the properties of dust in our local neighbourhood can be blindly extrapolated to early epochs. In that sense, the rest-frame UV-optical-near infrared colour space may not be the best test of the reliability of the model.
Nothwithstanding the dust-related uncertainties outlined above, our analysis indicates that Horizon-AGN produces good agreement with the redshift-evolution of galaxy luminosity functions, the star-formation main sequence and the bulk of the rest-frame UV-optical-near infrared colour space. This suggests that the aggregate star-formation history predicted by the model successfully reproduces the trends in the observed galaxy population.
We conclude this section by briefly noting how well other hydrodynamical cosmological simulations reproduce the observables that we have studied in this section. While Figure 7. Comparison of the predicted stellar mass function to observational data in the redshift range 0 < z < 6. The grey shaded region shows the prediction from Horizon-AGN (with the width of the region indicating Poisson uncertainties). The pink dashed curves indicate predictions from Horizon-noAGN, a twin simulation without BH feedback. Vertical error bars indicate observational uncertainties due to cosmic variance (Ilbert et al. 2013). The horizonal error bar (0.3 dex) indicates typical observational uncertainties in stellar masses derived from SED fitting (e.g. Muzzin et al. 2009). rest-frame luminosities and UV-optical-NIR colours have not been tested in other similar simulations over a large redshift range as in this study, the reproduction of rest-frame optical colours and u to K-band luminosities in the local Universe is found to be generally good in such models (e.g. Trayford et al. 2015). In a similar vein, other simulations that are comparable to Horizon-AGN generally show consistency with the observed star formation main sequence (e.g. Furlong et al. 2015;Sparre et al. 2015), although in some cases offsets of ∼ 0.2 − 0.4 dex are seen in the normalization of the predicted sequences compared to the observational ones (Furlong et al. 2015).

STELLAR MASS FUNCTIONS AND THE COSMIC STAR FORMATION HISTORY
We proceed by comparing the predicted stellar mass functions in Horizon-AGN to an array of observational data at 0 < z < 6 ( Figure 7). The grey shaded region in this plot indicates the Horizon-AGN predictions (with the width of the region indicating Poisson uncertainties). The pink dashed line indicates predictions from Horizon-noAGN, a twin simulation without BH feedback. Not unexpectedly, low-mass galaxies are largely unaffected by BH feedback. However, agreement between theory and observation at the high-mass end of the mass function depends strongly on implementing feedback from BHs, confirming the results found in the literature (e.g. Oppenheimer & Davé 2008;Sales et al. 2010;McCarthy et al. 2010;Vogelsberger et al. 2014;Khandai et al. 2015;Sijacki et al. 2015;Crain et al. 2015). The role of BHs is most important after the epoch of peak cosmic star formation (z ∼ 2), as the BHs help keep star formation rates low in massive galaxies (e.g. by maintaining the temperature of their hot gas reservoirs) and prevent them from becoming too massive. A forthcoming paper in this series (Beckmann et al. in prep) will perform a detailed study of the role of AGN in regulating the inflow of gas into galaxies and shaping the evolving stellar mass function across cosmic time.
In a similar vein to the analyses presented above, the predictions from Horizon-AGN show good consistency with the slope and normalisation of observed mass functions within the observational uncertainties. The agreement is particularly good for galaxies that are more massive than the knee of the mass function at all epochs. However, two points of tension between theory and observation are worth noting here: the overproduction of galaxies at the low-mass end and the general underproduction of galaxies regardless of galaxy stellar mass at z > 5. Similarly to what is seen in the luminosity-function analysis above, Horizon-AGN tends to overproduce galaxies less massive than the knee of the stellar mass function at low and intermediate redshift (z < 2). While the predictions are consistent with observations if uncertainties due to cosmic variance are also taken into account (see the vertical error bars in Figure 7), the systematic nature of the overproduction at all epochs indicates that this disagreement may be due to missing physics and that the sub-grid SN feedback prescription employed by Horizon-AGN is not strong enough to regulate star formation in smaller haloes, making the galaxies embedded in them too massive. Figure 8. Comparison of the predicted cosmic star formation history in Horizon-AGN to observational data (Hopkins & Beacom 2006) in the redshift range 0 < z < 6. The red curve shows predictions from Horizon-AGN while the grey shaded region indicates the parameter space covered by the observational data.
Indeed, recent work (Kimm et al. 2015) has shown that a more realistic treatment of the (inherently clumpy) interstellar medium and the momentum injection from SN populations in the snowplough phase (see also Thornton et al. 1998;Hopkins et al. 2014;Geen et al. 2015;Kim & Ostriker 2015;Martizzi et al. 2015) can produce strong SN-driven outflows that regulate star formation more efficiently than classical treatments. Alternatively, the simulated galaxy stellar masses can be reduced if the star formation efficiency is locally enhanced to the level that we observe in star clusters (∼ 10 % per free fall time), as clustered star formation drives stronger winds (Agertz & Kravtsov 2015) -enhanced efficiencies are indeed plausible, given the large fraction of field stars that are thought to come from the dissolution of star clusters (see e.g. Whitmore et al. 2007). Such improved prescriptions for star formation are able to bring simulated galaxies in line with key observables, such as the stellar-tohalo mass relation and the mass-metallicity relation. While their implementation in Horizon-AGN is beyond the scope of this particular paper, such revised recipes are likely to be a solution to the disagreement observed at the low-mass end, and will be explored in forthcoming papers.
We now turn to the second point of tension between theory and observation. While the reproduction of the stellar mass function is good across cosmic time, it appears to break down at z ∼ 5. At this epoch, model galaxies appear to be less massive than their observed counterparts across the entire mass range probed by our study. We note here that the mass and spatial resolution of a simulation can play an important role in influencing the predicted stellar mass growth of galaxies (Kimm et al. 2012). Adopting a higher resolution enables us to resolve smaller haloes in the early universe (z > 5), leading to earlier star formation. This can result in an order of magnitude enhancement in the star formation activity in typical galaxies at this epoch (e.g. Rasera & Teyssier 2006). It is, therefore, likely that a higher-resolution simulation can reduce the disagreement between the theoretical and observed mass functions at this epoch, without changing the baryonic physics currently implemented in Horizon-AGN.
We proceed by comparing the cosmic star formation history (SFH) predicted by Horizon-AGN to observations (Figure 8). The good reproduction of the luminosity and stellar-mass functions, the star formation main sequence and rest-frame colours already suggests that the simulation captures the general trends in the evolution of galaxies over cosmic time. This agreement is reflected and summarized in the comparison of the cosmic star formation history. Not unexpectedly, Horizon-AGN shows agreement with the observations, tracking both the shape and normalization of the observational data (Hopkins & Beacom 2006), although we note that the predictions are close to the upper bound of the parameter space defined by the observational uncertainties at very low redshift (z ∼ < 0.1).
We conclude this section by briefly discussing the performance of similar models in reproducing the observed stellar mass function and cosmic SFH. In the redshift range studied here, most simulations that are similar to Horizon-AGN produce reasonably good agreement with data (e.g. Furlong et al. 2015;Genel et al. 2014;Khandai et al. 2015), although a detailed comparison between models is difficult, because all models are typically not compared to the same observational datasets at the same redshifts. However, similar patterns are seen when most simulations are confronted with observational data. While galaxies beyond the knee of the luminosity/stellar-mass functions are well reproduced (due mainly to implementation of AGN feedback, as noted above), models typically overshoot the data at low stellar masses/luminosities at low and intermediate redshifts (z < 3), with the level of the discrepancy varying with the model in question. It is worth noting that the undershoot of the model compared to data seen in Horizon-AGN at high redshift (z > 5) is also seen in other simulations (e.g. Furlong et al. 2015;Genel et al. 2014). Predictions made by these models for the cosmic SFH are generally consistent with observations -while all models reproduce the shape of the cosmic SFH, small systematic offsets are seen in some comparisons (e.g. Furlong et al. 2015).

SUMMARY
We have compared the predictions of Horizon-AGN, a hydro-dynamical cosmological simulation that uses an adaptive mesh refinement code, to an extensive array of observational data across cosmic time. Our study has focussed on confronting the predicted evolution of luminosity functions, stellar-mass functions, the star formation main sequence, rest-frame UV-optical-near infrared colours and the cosmic star formation history to observational data in the redshift range 0 < z < 6 (around 95% of the lifetime of the Universe). These observables, which are functions of the evolving stellar mass growth of the galaxy population, represent a significant constraint on the methodologies employed by the simulation. We note that, apart from choosing BH feedback parameters that reproduce the local MBH -σ * relations, the simulation is not otherwise calibrated to the local Universe.
Since they are sensitive to the aggregate star formation history of galaxies, comparing the simulation to these quantities indicates how well Horizon-AGN captures the evolu-tionary trends of observed galaxies over cosmic time, and thus its usefulness as a tool for understanding galaxy evolution. Our analysis shows that Horizon-AGN produces good agreement with the quantities mentioned above across our redshift range of interest, from the present day all the way to the epoch when the Universe was ∼ 5% of its present age. This indicates that model galaxies in the simulation broadly reproduce the cosmic star formation history of their counterparts in the real Universe.
Notwithstanding the reproduction of key observations by Horizon-AGN, two main points of tension are worth noting. First, the model tends to overproduce galaxies that are less massive than the knee of the luminosity function at all epochs. While agreement can be achieved by considering observational uncertainties due to cosmic variance, it is possible that the SN feedback prescriptions in the model do not sufficiently quench star formation in small halos. More accurate modelling of the clumpy inter-stellar medium, combined with higher star-formation efficiencies that correspond to star-cluster formation may offer a solution to this disagreement. Secondly, at very early epochs (z ∼ 5) the predicted galaxy stellar masses are too low, across the mass range of interest in our study. Higher-resolution simulations, that resolve smaller haloes in the early Universe and their associated star formation, are likely to reduce the disagreement between theory and observation at these epochs.
Nevertheless, the good reproduction of the array of key observations presented here indicates that the sub-grid recipes that drive the baryonic evolution in the model are reasonably accurate representations of the processes governing galaxy evolution in the real Universe. Overall, we find that Horizon-AGN offers an excellent tool, both for studying galaxy evolution over cosmic time, and for making predictions for the next generation of galaxy surveys. 10 11 log M star z~3.5 Figure 9. Calibration of the predicted mass-gas phase metallicity relations to osbervational data. The mass-gas phase metallicity relations predicted by Horizon-AGN are shown in grey. The observational data (Maiolino et al. 2008;Mannucci et al. 2009) to which we perform the calibrations to correct the simulated metallicities are shown in orange. Best fits to the observational datasets are shown using the orange lines. For the z ∼ 3.5 dataset we also show the individual points to give an indication of the typical scatter around the best-fit lines (we only show one set of points for clarity and because the scatter is similar at all epochs). The corrected metallicities that are used for the analysis in this study are shown using the black points. ties in agreement with the observed mass-gas phase metallicity relations at z = 0, 0.7, 2.5 and 3.5 (Maiolino et al. 2008;Mannucci et al. 2009), where fno = 4.08 − 0.21z − 0.11z 2 . As noted before, the observed mass-metallicity relations are calculated using strong line diagnostics (e.g. [OIII] In Figure 9 we show the mass-gas phase metallicity relations predicted by the simulation using the grey points. The observational data (Maiolino et al. 2008;Mannucci et al. 2009) to which we perform the calibrations to correct the simulated metallicities are shown in orange. Best fits to the observational datasets are shown using the orange lines. For the z ∼ 3.5 dataset, we also show the individual points to give an indication of the typical scatter around the best-fit lines (we only show one set of points for clarity, and because the scatter is similar at all epochs). The corrected metallicities that are used in calculating the properties of simulated galaxies in the model are shown using the black points. Since the shape of predicted mass-gas phase metallicity relation is reasonably consistent with the shape of their observed counterparts at redshifts where data is available, we do not attempt to make the calibration stellar mass-dependent -the normalization is simply adjusted as a function of redshift, following fno.