MIGHTEE: multi-wavelength counterparts in the COSMOS field

In this paper we combine the Early Science radio continuum data from the MeerKAT International GHz Tiered Extragalactic Exploration (MIGHTEE) Survey with optical and near-infrared data and release the cross-matched catalogues. The radio data used in this work cover $0.86$ deg$^2$ of the COSMOS field, reach a thermal noise of $1.7$ $\mu$Jy/beam and contain $6102$ radio components. We visually inspect and cross-match the radio sample with optical and near-infrared data from the Hyper Suprime-Cam (HSC) and UltraVISTA surveys. This allows the properties of active galactic nuclei (AGN) and star-forming populations of galaxies to be probed out to $z \approx 5$. Additionally, we use the likelihood ratio method to automatically cross-match the radio and optical catalogues and compare this to the visually cross-matched catalogue. We find that $94$ per cent of our radio source catalogue can be matched with this method, with a reliability of $95$ per cent. We proceed to show that visual classification will remain an essential process for the cross-matching of complex and extended radio sources. In the near future, the MIGHTEE survey will be expanded in area to cover a total of $\sim$20~deg$^2$; thus the combination of automated and visual identification will be critical. We compare the redshift distributions of star-forming galaxies (SFG) and AGN to the SKADS and T-RECS simulations and find more AGN than predicted at $z \sim 1$.


INTRODUCTION
In order to truly understand the astrophysical processes that occur in our Universe, a multi-wavelength approach is necessary. This requires combining data from a number of different instruments operating across the full range of the electromagnetic spectrum. At the longest wavelengths, radio observations of extragalactic sources are invaluable; not only do they provide a dust-free view of star-forming galaxies (SFG), but they are also crucial for understanding Active Galactic Nuclei (AGN), which are powered by the supermassive black holes that reside in the centre of all massive galaxies, and are thought to play a key role in their evolution.
New radio facilities such as the Meer-Karoo Array Telescope (MeerKAT; Jonas 2018; Mauch et al. 2020), the Low-Frequency Array (LOFAR; e.g. van Haarlem et al. 2013) and the Australian Square Kilometre Array Pathfinder (ASKAP; e.g. Johnston et al. 2007; Hotan et al. 2021) are able to probe faint radio sources down to thermal noise levels of just a few μJy, which means we are no longer limited to observing the radio properties of only the brightest and most massive galaxies detected at optical wavelengths (e.g. Smolčić et al. 2017a; Heywood et al. 2020; Best et al. 2023). Cross-matching radio and multi-wavelength data for these objects is necessary to build up a panchromatic view of the processes taking place in galaxies, which in turn allows us to determine their redshifts and other physical quantities such as luminosities and stellar masses.
★ E-mail: imogen.whittam@physics.ox.ac.uk
The MeerKAT International GHz Tiered Extragalactic Exploration (MIGHTEE) survey is one of the Large Survey Projects (LSPs) carried out with the MeerKAT telescope array. It will observe a number of well-studied extragalactic fields, which have a wealth of multi-wavelength data available. These are the COSMOS, XMM-LSS, E-CDFS and ELAIS-S1 fields (Jarvis et al. 2016). MeerKAT is being used to observe 20 deg$^2$ of sky, over a total of ∼ 1000 hours of observation time, at L-band radio frequencies between 856 and 1712 MHz. The Early Science data release covering part of the COSMOS and XMM-LSS fields is described in Heywood et al. (2022). As well as providing radio continuum images, MIGHTEE will also produce spectral line (Maddox et al. 2021) and polarisation information (Sekhar et al. in prep.), allowing a range of science cases to be investigated. These include studying the evolution of star-forming galaxies and AGN, the role of AGN feedback in the quenching of star formation, the evolution of neutral hydrogen in the Universe and measuring cosmic magnetic fields in large-scale structures.
Here we describe the process of cross-matching a subset of the Early Science MIGHTEE radio observations with multi-wavelength data in the COSMOS field. This paper is structured in the following way: in Section 2 we describe the initial radio and multi-wavelength datasets that we cross-match. In Section 3 we lay out the method used to cross-match these two datasets using visual identification. Our visually inspected cross-matched catalogue is compared with those produced from the likelihood ratio method in Section 4. In Section 5 we highlight the properties of the sample and discuss the reliability of the photometric redshifts of the radio sources. In Section 6 we divide our sample into active galactic nuclei and star-forming populations and compare to predictions from simulations. We conclude in Section 7.

Radio Data
This work is based on the MIGHTEE Early Science continuum data in the COSMOS field. These data are described fully in Heywood et al. (2022) and summarised briefly below. The observations consist of a single pointing with the MeerKAT telescope centred on RA 10h 00m 28.6s, Dec +02° 12′ 21″. The full Early Science image covers 1.6 deg$^2$, but for this work we restrict ourselves to the central region with an area of 0.86 deg$^2$, where the radio data are deepest and of approximately uniform depth. The observations were taken between 2018 and 2020 with the L-band receiver (bandwidth 856-1712 MHz) and include 17.45 hours on source.
The MIGHTEE Early Science release contains two versions of the image, processed with different Briggs (1995) robust weighting values. The first, 'high-resolution' image is produced using a Briggs robust weighting of −1.2, which down-weights the short baselines in the core of the array. This results in a higher resolution of 5 arcsec, but comes at the expense of sensitivity, resulting in a $1\sigma$ thermal noise level of 6 μJy beam$^{-1}$. The second image uses a robust weighting of 0.0, resulting in better sensitivity (thermal noise level of 1.7 μJy beam$^{-1}$) but a lower resolution of 8.6 arcsec. It should be noted that, unlike the high-resolution image, this lower-resolution image is limited by classical confusion at the centre, meaning the actual measured noise is 4-5 μJy beam$^{-1}$.
Source extraction on both images was conducted using the Python Blob Detection and Source Finder (pybdsf, Mohan & Rafferty 2015), as fully described in Heywood et al. (2022). The primary catalogue we use in the cross-matching process here is the low-resolution (Level 0) catalogue, which contains 9 915 radio Gaussian components with peak brightnesses that exceed the local background noise by $5\sigma_{\rm local}$. In this paper we crop the catalogue to remove sources near the edge of the field, restricting the area to the region within which the primary beam gain is at least 0.5. This results in a catalogue of 6338 radio source components. We also remove 236 radio source components located within masked regions of the near-infrared image used for cross-matching (see Section 2.2). This results in a radio catalogue containing 6102 source components over an area of ∼ 0.86 deg$^2$. A similar catalogue using the high-resolution image contains 3116 radio source components over the same area. Heywood et al. (2022) also release a Level 1 catalogue based on the low-resolution image, which has been visually inspected to remove artefacts and includes additional information. This work is based on the Level 0 catalogue, but we make use of the 'resolved' flag in the Level 1 catalogue in Section 4.
Complementary to our MIGHTEE observations are those of the VLA-COSMOS 3 GHz Large Project (Smolčić et al. 2017a). Here the COSMOS field was observed in the S-band (2-4 GHz) for a total of 384 hours, in both the VLA's A and C-array configurations. The resulting image has a resolution of 0.78″ with a sensitivity of 2.3 μJy beam$^{-1}$. This is equivalent to a flux density of ∼ 4 μJy beam$^{-1}$ at the mean effective frequency of 1.34 GHz for the lower-resolution MIGHTEE Early Science data (Heywood et al. 2022). A total of 3949 of the 6102 components in the initial catalogue used in this work have a match within 8.6 arcsec in the VLA-COSMOS catalogue. This is discussed in more detail in Whittam et al. (2022).
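The frequency scaling quoted above can be checked with a short calculation. The sketch below assumes a power-law spectrum $S_\nu \propto \nu^{-\alpha}$ with $\alpha = 0.7$; this spectral index is our assumption for the illustration (it is the value adopted for the luminosity calculation in Section 5), and the function name is hypothetical.

```python
def scale_flux_density(s_in, nu_in_ghz, nu_out_ghz, alpha=0.7):
    """Scale a flux density between two frequencies.

    Assumes a power law S_nu ∝ nu^(-alpha); alpha = 0.7 is an assumed,
    typical synchrotron spectral index, not a measured value.
    """
    return s_in * (nu_out_ghz / nu_in_ghz) ** (-alpha)


# VLA-COSMOS sensitivity (2.3 uJy/beam at 3 GHz) scaled to the MIGHTEE
# mean effective frequency of 1.34 GHz.
rms_at_1p34 = scale_flux_density(2.3, 3.0, 1.34)
print(f"{rms_at_1p34:.1f} uJy/beam")
```

Under this assumption the 3 GHz sensitivity corresponds to roughly 4 μJy beam$^{-1}$ at 1.34 GHz, in line with the value quoted in the text.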

Multi-wavelength data
A wealth of multi-wavelength data for the COSMOS field has already been collated, and here we use the dataset fully described in Bowler et al. (2020) and Adams et al. (2020, 2021). Covering ∼ 2 deg$^2$ of the sky centred on the J2000 coordinates of RA = 10h 00m 28.6s, Dec = +02° 12′ 21.0″, this compilation includes $u^*$-band data from the Canada-France-Hawaii Telescope Legacy Survey (CFHTLS, Cuillandre et al. 2012), $grizy$-band Hyper Suprime-Cam (HSC) imaging (Aihara et al. 2018) and near-infrared $YJHK_s$-band data from the UltraVISTA Survey (McCracken et al. 2012). Infrared data at 3.6 and 4.5 microns were obtained from the Spitzer Extended Deep Survey (Ashby et al. 2013). Source finding was conducted using SExtractor (Bertin & Arnouts 1996). We adopt a flux-limited sample selected in the $K_s$ band with $K_s < 25$. We then carried out forced photometry in all other bands with the same fixed aperture, and applied an aperture correction to determine the total flux of each object. Full details can be found in Adams et al. (2021).
We use a compilation of spectroscopic redshifts from the following observing campaigns: the VIMOS VLT Deep Survey (VVDS, Le Fèvre et al. 2013), z-COSMOS (Lilly et al. 2009), the Sloan Digital Sky Survey (SDSS DR12, Alam et al. 2015), 3D-HST (Momcheva et al. 2016), Primus (Coil et al. 2011), and the Fiber-Multi Object Spectrograph (FMOS, Silverman et al. 2015). Utilising the flag system provided by each survey, we ensure we only use spectroscopic redshifts which have a > 95 per cent confidence of being correct.
Photometric redshifts for the dataset were determined using a hierarchical Bayesian combination of two different techniques, as conducted by Duncan et al. (2018). The photometric redshifts were determined using a traditional template fitting technique carried out by the Le Phare Spectral Energy Distribution (SED) fitting code (Arnouts et al. 1999; Ilbert et al. 2006), along with machine learning using the GPz algorithm (Almosallam et al. 2016a,b). This method weights the combinations of photometric redshifts for both active galaxies and normal galaxies from the template fitting, and then combines this with the solutions determined from the more empirical machine learning approach with GPz. Full details and the catalogues can be found in Hatfield et al. (2020, 2022). The photometric redshifts of the sources in our radio sample are discussed further in Section 5.

VISUAL CROSS-MATCHING
The cross-matching of the radio and near-infrared datasets was carried out via visual inspection in a similar way to Prescott et al. (2018). Overlays for each of the 6102 radio components in the low-resolution pybdsf catalogue were produced using the Astronomical Plotting Library in Python (APLpy, Robitaille & Bressert 2012). These overlays consist of radio contours produced from the MIGHTEE and 3 GHz images overlaid on top of an UltraVISTA $K_s$-band image. The locations of known sources from the near-infrared catalogue described in Section 2.2 are also highlighted on top of the overlays. The cross-matching process is aided by the use of two different radio images with different resolution and sensitivity. The high resolution (0.78 arcsec) of the 3 GHz VLA images allows a counterpart to be identified more easily, whereas the high-sensitivity MeerKAT image reveals more diffuse radio sources. As in Prescott et al. (2018), two sets of overlays are produced for each source to aid the visual classification. One overlay set has a size of 0.5′ × 0.5′ whilst the other covers a larger area of 3′ × 3′. The smaller overlay ensures we can assign the radio source the correct counterpart for galaxies in crowded fields, and the larger overlay allows us to identify sources that are extended.
In order to ensure we have a robust set of cross-matches, the radio sources were divided into batches of 100 and inspected by three separate people from a team of 6 classifiers. This was conducted using an improved version of the Xmatchit code (Prescott et al. 2018), which now makes extensive use of Jupyter notebooks (Kluyver et al. 2016). When inspecting the overlays we classify the cross-matches as one of the following:
• Single component: a single-component match, where the near-infrared counterpart to an isolated radio source is unambiguous.
• Multiple-component: where multiple radio components are associated with a single near-infrared counterpart.
• No visible optical counterpart: where the radio emission is not associated with a multi-component source and has no apparent near-infrared counterpart.
• Confused source: where the resolution of the radio data is insufficient to identify an unambiguous counterpart. A subset of these sources are subsequently split into separate sources using the higher-resolution VLA 3 GHz data, as described below.
The output from each classifier was then compared to find mismatches. When mismatches occurred the overlays were re-inspected by a team of three experts and re-classified. Despite visual classification being a subjective and time-consuming process, it is still necessary, as we show when comparing it to the likelihood ratio technique in Section 4, and it is recognised as being more reliable than automated techniques (Fan et al. 2015). With visual classification, imaging and source detection errors can be noticed easily, and rare and interesting objects such as giant radio galaxies (e.g. Delhaize et al. 2021) can be identified.
Peak and integrated radio flux densities for the single component sources in the cross-matched catalogue are directly taken from the low-resolution Level 0 MIGHTEE pybdsf catalogue.Integrated flux densities for multi-component sources are the sum of integrated flux densities of the individual components, and the peak fluxes for multi-component sources are taken as the peak flux of the component with the highest peak flux.
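The rule for combining multi-component sources can be written compactly. The following is a minimal sketch, assuming each component is represented as an (integrated flux, peak flux) pair in consistent units; the function name and data layout are hypothetical and are not taken from the released catalogue or code.

```python
def combine_components(components):
    """Combine radio components belonging to one multi-component source.

    components: list of (integrated_flux, peak_flux) tuples (hypothetical
    layout). The integrated flux is the sum over components; the peak flux
    is that of the component with the highest peak.
    """
    if not components:
        raise ValueError("at least one component is required")
    total_integrated = sum(integrated for integrated, _ in components)
    peak = max(peak for _, peak in components)
    return total_integrated, peak


# Three illustrative components (values in uJy and uJy/beam).
print(combine_components([(120.0, 80.0), (45.0, 40.0), (30.0, 12.0)]))
```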
For confused radio sources, if the radio source clearly consists of two or more radio sources that are separate sources in the VLA 3 GHz catalogue and each have a separate host galaxy, we split the MIGHTEE radio source into two or more sources with separate near-infrared counterparts. We estimate the 1.3-GHz peak and integrated flux densities of these split sources by dividing the flux of the original MIGHTEE source into two (or more) according to the ratio of the fluxes of the VLA 3 GHz sources as follows:
$$ S^{1.3}_{i} = S^{1.3}_{\rm orig} \times \frac{S^{3}_{i}}{\sum_{j=1}^{N} S^{3}_{j}}, $$
where $S^{1.3}_{i}$ is the estimated 1.3 GHz flux density of the $i$-th split source, $S^{1.3}_{\rm orig}$ is the original MIGHTEE flux density of the confused source, $S^{3}_{i}$ is the VLA 3 GHz flux density of the $i$-th split source and $N$ is the total number of sources the source is being split into. We note that this assumes that all of the confused components have a similar spectral index between the 3 GHz data in VLA-COSMOS and the 1.3 GHz MIGHTEE data. As these are generally faint radio sources, and are thus likely star-forming galaxies (see Whittam et al. 2022; Smolčić et al. 2017b), this assumption would not produce a large systematic offset in flux density, as star-forming galaxies tend to have similar spectral indices of $\alpha \sim 0.7$. We note, in particular, that the peak fluxes scaled in this way should be used with caution. Confused MIGHTEE sources which cannot be clearly separated in this way are flagged as being too confused.
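The flux-splitting step described above amounts to sharing the MIGHTEE flux in proportion to the 3 GHz fluxes. A minimal sketch follows; the function and argument names are hypothetical, and the numbers in the usage line are illustrative rather than taken from the catalogue.

```python
def split_confused_flux(s_orig_1p3, s_3ghz):
    """Share a confused 1.3 GHz flux density among N split sources.

    s_orig_1p3: flux density of the original (confused) MIGHTEE source.
    s_3ghz: list of VLA 3 GHz flux densities of the N split sources.
    Each split source receives the fraction S3_i / sum(S3) of the original
    flux, which implicitly assumes a common spectral index for all of them.
    """
    total_3ghz = sum(s_3ghz)
    if total_3ghz <= 0:
        raise ValueError("3 GHz flux densities must sum to a positive value")
    return [s_orig_1p3 * s3 / total_3ghz for s3 in s_3ghz]


# A 100 uJy confused source resolved into two 3 GHz sources of 30 and 10 uJy.
print(split_confused_flux(100.0, [30.0, 10.0]))
```

By construction the split flux densities sum to the original MIGHTEE value, so no flux is created or lost in the splitting.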
A full breakdown of all the possible cross-matching outcomes and their flags can be seen in Table 1.
Examples of the different classifications from the cross-matching process can be seen in Fig. 1. The green and blue contours show the MIGHTEE and VLA-COSMOS 3 GHz Survey imaging data respectively, overlaid on a grey-scale UltraVISTA $K_s$-band image. The upper left and right panels display two large extended AGN residing in host galaxies at redshifts of $z = 0.349$ and $z = 0.219$, that are made up of multiple radio components ($N_{\rm comp} = 47$ and $N_{\rm comp} = 17$). The bottom left panel displays a nearby ($z = 0.078$) star-forming galaxy, comprising a single radio component. The bottom right panel highlights a confused source, where two objects are contributing flux to a single MIGHTEE radio component. This source has been split into two separate sources in the resulting visual cross-matched catalogue, with 1.3-GHz flux densities estimated from the 3-GHz flux densities as described above.
A total of 5 282 of the radio components in the initial pybdsf catalogue could be visually matched to 5 223 $K_s$-band counterparts. Note that there is not a direct mapping between sources in the input and cross-matched catalogues, as components which form part of multi-component sources have been combined and some blended sources have been split. The percentage of the initial radio components we can cross-match is therefore 87 per cent (5 282 out of 6 102). This appears to be an improvement over previous studies; for example, Prescott et al. (2018) found that only 57 per cent of their initial radio catalogue from the VLA Stripe 82 Snapshot Survey (Heywood et al. 2016) could be cross-matched to an optical source, and Williams et al. (2019) found that 73 per cent of their radio sources from the LOFAR Two-metre Sky Survey (LoTSS) have optical/IR identifications from Pan-STARRS and/or WISE. However, due to the shallower radio and multi-wavelength datasets used by these studies, the samples are not directly comparable, as the ability to identify counterparts to radio sources is influenced by the depth of both the radio and the multi-wavelength imaging. A more useful comparison is to the recent LOFAR Deep Fields work by Kondapally et al. (2021). They cross-match the LOFAR deep field data to a wealth of multi-wavelength imaging data and achieve a successful identification for 97 per cent of the radio sources over the three deep fields using a combination of visual identification and automated cross-matching, and we return to this in Section 4. The numbers of radio components that have been assigned to a single optical counterpart can be seen in Fig. 2. This shows that the vast majority of objects (99 per cent) consist of a single radio component, while a small number are very extended, with 10 sources consisting of > 5 components. These extended, multi-component sources are particularly challenging to match automatically and demonstrate the benefit of identifying counterparts by eye. This not only allows us to identify an appropriate host galaxy for the radio source, but also enables us to combine all detected components into one source, meaning we can produce a reliable estimate of the total source flux. The fraction of cross-matched sources in each radio flux density bin is shown in Fig. 3. This shows that, although our ability to visually cross-match sources does not depend strongly on flux density, we are more successful at identifying counterparts for the brightest sources. When we consider only sources with $S_{1.4\,{\rm GHz}} > 0.4$ mJy, the match fraction rises to 97 per cent. The positional offsets between the radio and $K_s$-band coordinates of our single radio component cross-matches can be seen in Fig. 4. The mean offset between the radio and $K_s$-band cross-matches is 0.24 arcsec in RA and 0.40 arcsec in Dec. As these offsets are significantly less than the resolution of the radio data, we do not correct for them in the cross-matching analysis.
In order to test the robustness of our visual cross-matching process, we employ a similar method to Prescott et al. (2016) and Prescott et al. (2018). We measure the separation between each component in our input radio catalogue and the nearest object in the near-infrared catalogue. We then repeat this process with a catalogue of random radio positions, generated to have the same source density as the real radio catalogue. The resulting distribution of separations between the real and random radio sources and the nearest near-infrared source is shown in Fig. 5. If we only consider cases where the separation between the radio source and the match in the near-infrared catalogue is less than 1 arcsec, there are 4501 matches identified to the real radio catalogue and 456 to the random catalogue, giving a reliability of 90 per cent and a completeness of 71 per cent. Setting the separation limit at 2 arcsec raises the completeness to 92 per cent, but this is at the expense of reliability, which drops to 73 per cent. Thus, use of the visually cross-matched catalogue should be tailored to the science that is being carried out, by choosing the appropriate balance between reliability and completeness. Our final catalogue contains 4 881 matched sources comprising a single radio component and 62 matched multi-component radio sources. There are a further 280 split matches, giving a total of 5223 sources in the visually matched catalogue.
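The random-catalogue test described above can be sketched in a few lines. The code below is illustrative only: it uses a brute-force nearest-neighbour search with a small-angle approximation, and it assumes that reliability is defined as $1 - N_{\rm random}/N_{\rm real}$ within the chosen separation limit. That definition is our assumption; it reproduces the quoted 90 per cent from the counts given in the text, but the exact definitions of reliability and completeness are not stated there. All function names are hypothetical.

```python
import math


def nearest_separation_arcsec(ra_deg, dec_deg, catalogue):
    """Separation (arcsec) to the nearest catalogue position.

    Brute-force search using the small-angle approximation, adequate for
    separations of a few arcsec. catalogue: list of (ra_deg, dec_deg).
    """
    cos_dec = math.cos(math.radians(dec_deg))
    best = float("inf")
    for ra_c, dec_c in catalogue:
        d_ra = (ra_deg - ra_c) * cos_dec
        d_dec = dec_deg - dec_c
        best = min(best, math.hypot(d_ra, d_dec))
    return best * 3600.0


def count_within(positions, catalogue, radius_arcsec):
    """Number of positions with a catalogue neighbour within the radius."""
    return sum(
        1
        for ra, dec in positions
        if nearest_separation_arcsec(ra, dec, catalogue) <= radius_arcsec
    )


def random_match_reliability(n_real, n_random):
    """Assumed definition: fraction of real matches not expected by chance."""
    return 1.0 - n_random / n_real


# Counts quoted in the text for a 1 arcsec separation limit.
print(round(random_match_reliability(4501, 456), 3))
```

In practice `count_within` would be evaluated once for the real radio positions and once for the random positions, and the two counts passed to `random_match_reliability`.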
A description of the columns of the visually cross-matched (Level 2) catalogue, released with this work, can be seen in Appendix A. A catalogue of source classifications based on these visually cross-matched sources and their multi-wavelength data (the Level 3 catalogue) was released with Whittam et al. (2022).

THE LIKELIHOOD RATIO
In this section we show how our visually inspected cross-matched catalogue compares to the result of an automated method and highlight the advantages and disadvantages of both methods.
The likelihood ratio (LR) describes the ratio of the probability that a given radio source is related to a particular optical/infrared counterpart to the probability that it is unrelated (Sutherland & Saunders 1992), and is given by:
$$ LR = \frac{q(m)\,f(r)}{n(m)}, $$
where $q(m)$ is the expected distribution of the true counterparts as a function of optical/infrared magnitude, $f(r)$ is the radial probability distribution function of the offsets between the radio and optical/infrared positions, and $n(m)$ is the magnitude distribution of the entire catalogue of optical/infrared detected objects. This has been used by a number of studies to identify the multi-wavelength counterparts to radio catalogues, and can be very effective for single, isolated sources (Smith et al. 2011; McAlpine et al. 2012; Kondapally et al. 2021). Following the method described in McAlpine et al. (2012) (which contains a detailed description of how each of the terms in the equation above is calculated), we use the likelihood ratio to identify the host galaxies of the radio sources in both our high and low resolution catalogues, and use our visually cross-matched catalogue to evaluate the success of this method for the MIGHTEE COSMOS field. The ultimate aim is to determine whether the likelihood ratio can be used to match a sub-sample of the MIGHTEE sources automatically, thereby reducing the total number of sources which need to be matched by eye for the rest of the survey. This will be important given the much larger area which is yet to be cross-matched (this paper concerns less than 1 deg$^2$ out of a total ∼ 20 deg$^2$).
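The likelihood ratio defined above can be illustrated with a short sketch. Here we assume, for illustration only, that $f(r)$ is a two-dimensional Gaussian with a single positional uncertainty $\sigma$, which is a common choice but is not stated in the text; the values of $q(m)$ and $n(m)$ are taken as given inputs, and the function names are hypothetical.

```python
import math


def radial_pdf(r_arcsec, sigma_arcsec):
    """Assumed form of f(r): a circular 2-D Gaussian in the positional offset.

    Normalised so that the integral of f(r) * 2*pi*r dr over all r is 1.
    """
    return math.exp(-0.5 * (r_arcsec / sigma_arcsec) ** 2) / (
        2.0 * math.pi * sigma_arcsec**2
    )


def likelihood_ratio(q_m, n_m, r_arcsec, sigma_arcsec):
    """LR = q(m) f(r) / n(m) for one radio source / candidate pair.

    q_m: expected magnitude distribution of true counterparts at magnitude m.
    n_m: surface density of all catalogue objects at magnitude m.
    """
    return q_m * radial_pdf(r_arcsec, sigma_arcsec) / n_m


# The same candidate magnitude at two different offsets: the closer
# candidate receives the higher likelihood ratio.
print(likelihood_ratio(0.5, 0.1, 0.5, 1.0), likelihood_ratio(0.5, 0.1, 2.0, 1.0))
```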
The input radio catalogues for the likelihood ratio method are the Level-0 pybdsf source catalogues produced from both the high and low resolution MIGHTEE Early Science radio images, cut to the same 0.86 deg$^2$ area as used for the visual cross-matching (see Section 2.1). Although the visual cross-matching described in the previous section is based on only the low-resolution catalogue, here we apply the LR method to both the low and high resolution catalogues. This is because a cross-matched high-resolution catalogue has useful science applications, and because it allows us to inform our cross-matching strategy for different resolution images for the full MIGHTEE survey. We search for counterparts in an UltraVISTA $K_s$-band selected catalogue with $K_s < 25$. For those sources detected in all four of the relevant bands, using magnitude limits of 25.0, 27.4, 26.9 and 26.6 respectively, we find stars using the stellar locus defined in Jarvis et al. (2013). Our final IR catalogue contains all objects in the initial IR catalogue with stars removed and with $K_s < 25$. For each radio source we select the object with the highest LR, and retain this match provided the LR value is above our defined threshold, $L_{\rm thr}$. To determine the most appropriate LR threshold to use, we calculate the completeness and reliability for a given $L_{\rm thr}$ in a similar way to Williams et al. (2019):
$$ C(L_{\rm thr}) = 1 - \frac{1}{Q_0 N_{\rm radio}} \sum_{LR_i < L_{\rm thr}} \frac{Q_0\,LR_i}{Q_0\,LR_i + (1 - Q_0)}, $$
$$ R(L_{\rm thr}) = 1 - \frac{1}{Q_0 N_{\rm radio}} \sum_{LR_i \geq L_{\rm thr}} \frac{1 - Q_0}{Q_0\,LR_i + (1 - Q_0)}, $$
where $C(L_{\rm thr})$ is the completeness for a given $L_{\rm thr}$ (i.e. the fraction of real matches which are accepted) and $R(L_{\rm thr})$ is the reliability (i.e. the fraction of accepted matches which are correct). $Q_0$ represents the fraction of radio sources which have a counterpart, $Q_0 = N_{\rm matched}/N_{\rm radio}$, which we calculate following the method outlined in Fleuren et al. (2012). Following Williams et al. (2019) we set our LR threshold to the point where the $C(L_{\rm thr})$ and $R(L_{\rm thr})$ curves intersect. This gives us LR threshold values of 0.22 and 0.36 for the high and low resolution MIGHTEE catalogues respectively. The completeness and reliability curves as a function of $L_{\rm thr}$ are shown in Fig. 6.
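The threshold selection can be sketched as follows. The expressions for completeness and reliability used here are the ones we understand Williams et al. (2019) to adopt and should be treated as an assumption, as should the simple grid search used to locate the crossing point; the function names and the example LR values are hypothetical.

```python
def lr_completeness(lrs, q0, l_thr):
    """Completeness C(L_thr): fraction of real matches that are accepted.

    lrs: LR of the best candidate for each radio source.
    q0: fraction of radio sources with a counterpart in the catalogue.
    """
    lost = sum(q0 * lr / (q0 * lr + (1.0 - q0)) for lr in lrs if lr < l_thr)
    return 1.0 - lost / (q0 * len(lrs))


def lr_reliability(lrs, q0, l_thr):
    """Reliability R(L_thr): fraction of accepted matches that are correct."""
    spurious = sum(
        (1.0 - q0) / (q0 * lr + (1.0 - q0)) for lr in lrs if lr >= l_thr
    )
    return 1.0 - spurious / (q0 * len(lrs))


def crossing_threshold(lrs, q0, grid):
    """Grid value at which the C and R curves are closest to intersecting."""
    return min(
        grid,
        key=lambda t: abs(lr_completeness(lrs, q0, t) - lr_reliability(lrs, q0, t)),
    )


# Illustrative LR values only; real inputs would come from the LR catalogue.
example_lrs = [0.01, 0.1, 1.0, 10.0, 100.0]
grid = [0.05, 0.5, 5.0, 50.0]
print(crossing_threshold(example_lrs, 0.9, grid))
```

Raising the threshold can only lower the completeness and raise the reliability, which is why the crossing point offers a natural compromise between the two.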

The likelihood ratio for all sources
Table 2 shows the performance of the likelihood ratio method on our radio source catalogues. With the likelihood ratio we are able to identify counterparts for 93.6 and 94.2 per cent of the initial high and low resolution radio component catalogues respectively. Figs. 7 and 8 show the flux density distribution of the sources in the MIGHTEE catalogue we are able to match using the LR method, and the fraction of matches in each flux bin. This demonstrates that the LR method is less successful at higher flux densities, due to the larger fraction of complex sources as discussed above. This is in contrast to the match fraction for the visually cross-matched catalogue shown in Fig. 3, which increases at larger flux densities.
Notes: (1) The high-resolution image is less sensitive, so the resulting catalogue contains fewer sources than the low-resolution catalogue. (2) In the final visual cross-matched catalogue, components of multi-component sources have been combined and some blended sources have been split, resulting in 5223 sources in the final catalogue.
For sources with $S_{1.4\,{\rm GHz}} > 100$ μJy, by matching visually we are able to identify a counterpart for 93 per cent of sources, while the LR method is only able to cross-match 61 per cent of the same sample. This highlights the benefit of combining the two methods; by using the LR we are able to automatically match a large number of the fainter sources, but it is still necessary to match the more complex sources, which tend to have larger flux densities, by eye.
For the sources which also have a good match in the visually cross-matched catalogue, the two methods identify the same counterpart for 95.5 and 94.3 per cent of sources in the high and low resolution catalogues respectively. Note that when an input radio source has been split into two or more sources with separate near-infrared counterparts during visual cross-matching (see Section 3), this is automatically counted as a disagreement with the LR method, as the LR method does not identify both counterparts. This highlights one important way in which the LR method can be misleading: it will produce a high-LR counterpart to a "single source" and be seen as successful, whereas the source itself is confused and has two optical/NIR counterparts. Such sources are readily identified in the visual classification. On the other hand, if higher-resolution radio data were available then the radio source itself would have been split into separate components and the LR could have been successful in assigning two optical/NIR counterparts.
However, this shows that the likelihood ratio method can be used to successfully identify counterparts for a large fraction of the MIGHTEE radio sources, and that the performance on the high and low resolution MIGHTEE catalogues is similar. For the sources with a good LR match, the two methods identify the same counterpart for 81.0 and 81.7 per cent of sources in the low and high resolution catalogues respectively. The likelihood ratio as a function of separation between the radio and near-infrared source positions can be seen in Fig. 9. The upper panel displays the likelihood ratio for the low resolution catalogue and the lower panel displays the same for the high resolution catalogue. The number of sources where the two methods disagree is higher when the separation between the radio and near-infrared positions is larger, and when the LR is lower, as expected.
We release the full likelihood ratio matched catalogues with this work and details can be found in Appendix A.

The likelihood ratio for unresolved sources
We expect the likelihood ratio method to be more successful at identifying the correct counterpart for single, isolated sources than for extended sources; we therefore investigate whether excluding extended sources can increase the reliability of this method. As described in Section 3.3.3 of Heywood et al. (2022), sources in the MIGHTEE Early Science catalogue are flagged as resolved if their deconvolved major axis size exceeds the full-width half maximum of the restoring beam by a margin set by the uncertainty on the deconvolved major axis (see that work for the exact criterion). There are 5572 sources in the low resolution catalogue which are not flagged as resolved, and 5429 (97.4 per cent) of these have a match identified by the likelihood ratio method described above. This demonstrates that the likelihood ratio method is able to cross-match a higher fraction of compact sources, as expected. 4725 of these sources also have a match identified in our visual classification catalogue, and for 4483 of these sources the counterparts identified by the two methods are the same object (this is 82.5 per cent of the 5428 unresolved components with a good LR match). The agreement of these matches with the visual classifications is therefore very similar to when we consider the full sample. Although the likelihood ratio on its own is not sufficient to identify multi-wavelength counterparts for every one of the MIGHTEE sources, it can be used successfully to produce a subsample of matched MIGHTEE sources and therefore dramatically reduce the total number of sources which need to be cross-matched by eye. In any method there will be mismatches between the radio and the optical identifications due to the plethora of different structures seen in the radio, e.g. jets, lobes and hotspots from active galactic nuclei, and automating such cross-matching is extremely difficult. Thus, the need to use a combination of LR and visual cross-matching will remain, and the adopted threshold to "eyeball" sources will necessarily change depending on the science which is being carried out, e.g. a balance between completeness and reliability. We will use this analysis to inform the cross-matching strategy for the remaining MIGHTEE fields. For the cases where the visual cross-matches and the LR matches disagree, we would require additional information to be able to associate these sources, e.g. higher-resolution radio data or spectroscopy.

REDSHIFTS FOR THE CROSS-MATCHED SAMPLE
The sample presented in this paper contains 5223 visually cross-matched sources, which is 86 per cent of the parent radio sample. Spectroscopic redshifts are available for 2427 sources, and for the remaining 2796 sources we use photometric redshifts.

Figure caption. The likelihood ratio as a function of separation between the radio source and multi-wavelength counterpart. The upper panel shows the likelihood ratio matches from the low-resolution radio catalogue, whilst the lower panel shows the high-resolution radio catalogue. Sources where the counterpart identified using the LR method agrees with that identified visually are shown as blue circles, and those where it does not are shown as red crosses. The black dashed line shows where LR $= L_{\rm thr}$.

To calculate rest-frame 1.4 GHz radio luminosities of our radio sample, we assume a spectral index of $\alpha = 0.7$ (where $S_\nu \propto \nu^{-\alpha}$). Due to the wide bandwidth of the MeerKAT L-band receivers and the varying response of the primary beam with frequency, the effective frequency of the MIGHTEE data varies across the image. This is discussed in detail in Heywood et al. (2022), and we use the effective frequency map released with that work to scale the MIGHTEE flux densities and luminosities to 1.4 GHz. The luminosity-redshift plot of the objects in our sample is shown in Fig. 11. This shows that we are able to investigate the evolution of faint ($L_{1.4} \sim 10^{24}$ W Hz$^{-1}$) AGN out to the epoch of re-ionisation and, assuming the correlation between SFR and radio luminosity (e.g. Yun et al. 2001; Bell 2003; Delvecchio et al. 2021), star-forming (SFR $\sim 50$ M$_\odot$ yr$^{-1}$) and starburst (SFR $> 100$ M$_\odot$ yr$^{-1}$) galaxies to $z \sim 1$ and $z \sim 5$, respectively, if the optical and near-infrared data are deep enough to measure redshifts. It tends to be more difficult to produce accurate photometric redshift estimates for radio sources, due to the prevalence of bright emission lines in both star-forming galaxies and AGN, and the possible AGN contribution to the continuum. We therefore assess the accuracy of the photometric redshifts of our sample by comparing sources that have both spectroscopic ($z_{\rm Spec}$) and photometric ($z_{\rm Photo}$) redshifts available.
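As an illustration of the rest-frame luminosity calculation described in this section, the following minimal sketch scales a flux density measured at a local effective frequency to 1.4 GHz and applies the power-law K-correction. The cosmological parameters (flat $\Lambda$CDM with $H_0 = 70$ km s$^{-1}$ Mpc$^{-1}$ and $\Omega_m = 0.3$) are an assumption made for this example and are not taken from the text; the function names are hypothetical.

```python
import math

C_KM_S = 299792.458                 # speed of light in km/s
MPC_IN_M = 3.0856775814913673e22    # metres per megaparsec


def luminosity_distance_mpc(z, h0=70.0, omega_m=0.3, n_steps=2000):
    """Luminosity distance in Mpc for a flat LCDM cosmology (assumed parameters)."""
    omega_l = 1.0 - omega_m
    dz = z / n_steps
    total = 0.0
    # Trapezoidal integration of dz / E(z) from 0 to z.
    for i in range(n_steps + 1):
        zi = i * dz
        weight = 0.5 if i in (0, n_steps) else 1.0
        total += weight / math.sqrt(omega_m * (1.0 + zi) ** 3 + omega_l)
    comoving = (C_KM_S / h0) * total * dz
    return (1.0 + z) * comoving


def rest_frame_l14(flux_ujy, z, nu_eff_ghz, alpha=0.7):
    """Rest-frame 1.4 GHz luminosity in W/Hz, assuming S_nu proportional to nu^-alpha."""
    # Scale the observed flux density from the effective frequency to 1.4 GHz.
    flux_14_ujy = flux_ujy * (nu_eff_ghz / 1.4) ** alpha
    flux_si = flux_14_ujy * 1e-6 * 1e-26  # uJy -> W m^-2 Hz^-1
    d_l_m = luminosity_distance_mpc(z) * MPC_IN_M
    # K-correction for a power-law spectrum: (1 + z)^(alpha - 1).
    return 4.0 * math.pi * d_l_m ** 2 * flux_si * (1.0 + z) ** (alpha - 1.0)
```

In practice `nu_eff_ghz` would be read from the effective frequency map at the position of each source; here it is simply a number supplied by the caller.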
The spread between the two redshift estimates can be defined as $\Delta z/(1 + z_{\rm Spec})$, where $\Delta z = z_{\rm Spec} - z_{\rm Photo}$. As in Ilbert et al. (2006) and Jarvis et al. (2013), we calculate the normalized median absolute deviation (NMAD) and find NMAD = 0.023, which implies that there is good agreement between the two quantities. Defining outliers as cross-matches that have $|z_{\rm Spec} - z_{\rm Photo}|/(1 + z_{\rm Spec}) > 0.15$, we find that only 115 objects, or 4.94 per cent of the sample, have poorly determined photometric redshifts, showing that the photometric redshifts are fairly robust. In the future, spectroscopic redshifts for further MIGHTEE sources will become available from the Deep Extragalactic VIsible Legacy Survey (DEVILS; Davies et al. 2018), the Multi-Object Optical and Near-infrared Spectrograph (MOONS; Cirasuolo et al. 2012) and the 4-metre Multi-Object Spectroscopic Telescope (4MOST; de Jong et al. 2019), and in particular the Optical, Radio Continuum and HI Deep Spectroscopic Survey (ORCHIDSS; Duncan et al. 2023).
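The two photometric-redshift quality statistics quoted above can be sketched as follows. The explicit NMAD formula is not given in this section; the example assumes one common definition, $1.48 \times {\rm median}(|\Delta z|/(1 + z_{\rm Spec}))$, which may differ in detail from the one adopted here. The function name is hypothetical.

```python
from statistics import median


def photoz_quality(z_spec, z_phot, outlier_cut=0.15):
    """Return (NMAD, outlier fraction) for paired spec-z and photo-z values.

    NMAD is taken as 1.48 * median(|dz| / (1 + z_spec)), an assumed definition.
    """
    scaled = [abs(zs - zp) / (1.0 + zs) for zs, zp in zip(z_spec, z_phot)]
    nmad = 1.48 * median(scaled)
    # Outliers are sources whose scaled offset exceeds the chosen cut.
    outlier_fraction = sum(1 for d in scaled if d > outlier_cut) / len(scaled)
    return nmad, outlier_fraction


# Toy example with made-up redshifts.
nmad, f_out = photoz_quality([0.5, 1.0, 1.5, 2.0], [0.52, 0.97, 1.5, 3.0])
```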

COMPARISONS WITH SIMULATIONS
In this section we compare the radio flux densities and redshift distributions of the AGN and star-forming galaxies (SFG) in our visually cross-matched sample to those from the Square Kilometre Array Design Study (SKADS; Wilman et al. 2008, 2010) and the more recent Tiered Radio Extragalactic Simulation (T-RECS; Bonaldi et al. 2019). We use the AGN and star-forming galaxy classifications from Whittam et al. (2022), which make use of the abundance of multi-wavelength data available in the COSMOS field to classify sources as AGN and SFG. As these classifications are only available for the visually cross-matched sample, we restrict our analysis to that sample for the remainder of this section. The classification scheme is described in detail in Whittam et al. (2022) and outlined briefly here. The classifications are based on five criteria which are then combined to give an overall classification. The first diagnostic makes use of the far-infrared-radio correlation to identify objects with significantly more radio emission than would be expected from star formation alone. Following Delvecchio et al. (2021), sources with radio emission $> 2\sigma$ above the correlation are classified as AGN. The second diagnostic identifies AGN from their X-ray emission: objects with X-ray luminosities of $L_X > 10^{42}$ erg s$^{-1}$ are classified as AGN. Third, AGN are identified from their mid-infrared colours using a colour-colour diagram as described in Donley et al. (2012). For the fourth diagnostic, sources detected by Very Long Baseline Interferometry (VLBI) observations of the COSMOS field by Herrera Ruiz et al. (2017) are labelled as AGN. Finally, objects that have point-like morphologies at optical wavelengths (using Hubble ACS I-band data) are classified as AGN. A source is classified as an AGN if it meets any one (or more) of the five AGN criteria. Sources which we can securely classify as not being an AGN using all five criteria are classified as star-forming galaxies. The depth of the X-ray data used means that we can only rule out AGN-related X-ray emission at redshifts $z < 0.5$, meaning that we are only able to securely classify objects as star-forming galaxies in this redshift range. We therefore introduce an additional classification of 'probable SFG' for sources which have redshifts $z > 0.5$, and so are unable to fulfil the 'not X-ray AGN' criterion, but which are classified as 'not AGN' using the other four criteria. For the remainder of this work we combine the SFG and 'probable SFG' classes and refer to the combination simply as 'SFG'. The AGN are further classified as radio-loud and radio-quiet. All AGN which meet the 'radio excess' criterion are considered to be radio-loud, while those which do not have excess radio emission, but are classified as an AGN using one of the other criteria, are classified as radio-quiet AGN.
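The way the five diagnostics are combined can be summarised in a short sketch. This is only an illustration of the logic outlined above, not the implementation used in Whittam et al. (2022). It assumes each diagnostic is encoded as `True` (AGN), `False` (securely not AGN) or `None` (no secure statement possible, e.g. the X-ray diagnostic at $z > 0.5$). That encoding, the function name and the handling of AGN whose radio-excess status is unknown are our own assumptions.

```python
from typing import Optional


def classify_source(
    radio_excess: Optional[bool],
    xray: Optional[bool],
    mid_ir: Optional[bool],
    vlbi: Optional[bool],
    point_like: Optional[bool],
) -> str:
    """Combine five AGN diagnostics into an overall class (illustrative)."""
    criteria = [radio_excess, xray, mid_ir, vlbi, point_like]
    # Meeting any one AGN criterion is enough to be classified as an AGN.
    if any(c is True for c in criteria):
        if radio_excess is True:
            return "radio-loud AGN"
        if radio_excess is False:
            return "radio-quiet AGN"
        # Assumption: radio-excess status unknown, so no radio class is given.
        return "AGN"
    # Secure SFG: every diagnostic securely says 'not AGN'.
    if all(c is False for c in criteria):
        return "SFG"
    # Probable SFG: only the X-ray diagnostic cannot be ruled out.
    others = [radio_excess, mid_ir, vlbi, point_like]
    if xray is None and all(c is False for c in others):
        return "probable SFG"
    return "unclassified"
```

For example, a source that fails the four non-X-ray AGN tests but lies at $z > 0.5$ would be passed in with `xray=None` and returned as 'probable SFG'.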

Flux distribution
Fig. 12 shows the fraction of AGN and star-forming galaxies as a function of their total radio flux density, compared to the SKADS and T-RECS simulations. The MIGHTEE flux densities have been scaled to 1.4 GHz using the effective frequency map, assuming a spectral index of $\alpha = 0.7$. So as not to be affected by incompleteness due to the variation in noise across the MIGHTEE image, we cut all three catalogues at $S_{1.4\,{\rm GHz}} = 50$ μJy, as the MIGHTEE sample is complete above this flux density (see Hale et al. 2023). With this flux density cut applied, the MIGHTEE sample contains 3294 sources, of which 2824 (86 per cent) have a multi-wavelength counterpart identified in the visually cross-matched catalogue. Of the 3294 sources, 2467 (75 per cent) are classified as an AGN or SFG; the remaining sources do not have enough multi-wavelength information available to be securely classified. The top panel of Fig. 12 shows the fraction of classified MIGHTEE sources which are identified as AGN or SFG as a function of 1.4-GHz flux density. This demonstrates that the AGN and star-forming fractions in the MIGHTEE sample are in good agreement with the SKADS simulations. Both show that the AGN fraction increases with increasing radio flux density, from $\sim 40$ per cent at $\sim 50$ μJy to $\sim 95$ per cent at 1 mJy. Both SKADS and our sample show equal fractions for SFG and AGN at $\sim 100$ μJy, below which SFG become the dominant population. This is consistent with the findings of Padovani et al. (2015), who find that SFG become the dominant population below $\sim 100$ μJy using radio observations of the Extended Chandra Deep Field South (E-CDFS) Very Large Array sample, as well as Smolčić et al. (2017b), using 3 GHz observations of the COSMOS field.
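A binned AGN fraction of the kind plotted in the top panel of Fig. 12 can be computed as in the sketch below. The error estimate, $\sqrt{N_{\rm AGN}}/N_{\rm total}$, is a simple Poisson error on the numerator and is an assumption about how the uncertainties were propagated; the bin edges and function name are hypothetical.

```python
import math


def agn_fraction_by_flux(fluxes_ujy, is_agn, bin_edges_ujy):
    """Return (fraction, poisson_error) per flux bin; None where a bin is empty."""
    n_bins = len(bin_edges_ujy) - 1
    n_total = [0] * n_bins
    n_agn = [0] * n_bins
    for flux, agn in zip(fluxes_ujy, is_agn):
        for i in range(n_bins):
            # Half-open bins [low, high); sources outside all bins are dropped,
            # which also applies the lower flux density cut.
            if bin_edges_ujy[i] <= flux < bin_edges_ujy[i + 1]:
                n_total[i] += 1
                n_agn[i] += bool(agn)
                break
    results = []
    for total, agn in zip(n_total, n_agn):
        if total == 0:
            results.append(None)
        else:
            results.append((agn / total, math.sqrt(agn) / total))
    return results


# Toy example: one bin from 50 to 100 uJy and one from 100 uJy to 1 mJy.
fractions = agn_fraction_by_flux(
    [30, 60, 70, 150, 400, 900],
    [True, False, True, True, True, True],
    [50, 100, 1000],
)
```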
In contrast, T-RECS significantly over-predicts the fraction of SFGs, and therefore under-predicts the fraction of AGN, when compared to the MIGHTEE sample. However, this plot does not include the sources we are unable to classify: both those with an optical match but without enough information to classify as AGN or SFG, and those without an optical match. The middle panel of Fig. 12 shows the proportion of MIGHTEE SFG and AGN in the full MIGHTEE sample, with the fraction of sources without a classification shown by the grey line. This shows that even if none of the unclassified MIGHTEE sources are AGN, the fraction of AGN at radio flux densities less than 1 mJy is higher in the MIGHTEE sample than predicted by the T-RECS simulation. At $S_{1.4\,{\rm GHz}} \sim 50$ μJy around 30 per cent of the MIGHTEE sample are AGN (this should be considered a lower limit on the fraction of AGN, as it is very possible that some of the unknown sources are AGN), while T-RECS predicts that less than 10 per cent of this sample should be AGN. Note that despite their faint radio flux densities the majority of these AGN are not radio-quiet; they have an excess over what would be expected from star formation alone. This can be seen in Fig. 13.
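The lower-limit argument above can be made explicit: if none of the unclassified sources are AGN, the AGN fraction takes its minimum value, and if all of them are AGN it takes its maximum. A minimal sketch, using hypothetical counts rather than values from the catalogue:

```python
def agn_fraction_bounds(n_agn, n_sfg, n_unknown):
    """Bounds on the AGN fraction when some sources are unclassified."""
    n_total = n_agn + n_sfg + n_unknown
    lower = n_agn / n_total                # no unclassified source is an AGN
    upper = (n_agn + n_unknown) / n_total  # every unclassified source is an AGN
    return lower, upper


# Hypothetical counts in a single flux density bin.
lower, upper = agn_fraction_bounds(n_agn=30, n_sfg=50, n_unknown=20)
```

The comparison with T-RECS in the text uses the lower bound, which is the conservative choice when testing for an excess of AGN.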
However, the T-RECS work does not include radio-quiet AGN (which are instead included in the SFG class), which could account for some of this difference. To test this, in the bottom panel of Fig. 12 we show the MIGHTEE radio-loud AGN (in yellow) and all other MIGHTEE sources not classified as radio-loud AGN (green line; this includes radio-quiet AGN, SFG and unclassified sources). This shows that even when radio-quiet AGN are included and all unclassified sources are assumed to be SFG, T-RECS still significantly over-predicts the fraction of SFG in the observed sample, by $\sim 10$ per cent at $S_{1.4} \lesssim 0.5$ mJy.
In Fig. 13 we show the fraction of radio-loud and radio-quiet AGN in the MIGHTEE sample compared to what is predicted by SKADS as a function of flux density. There is reasonable agreement between the two samples, although the MIGHTEE sample contains fewer radio-quiet AGN than predicted at $S_{1.4\,{\rm GHz}} < 200$ μJy. There have been several studies investigating the process responsible for radio emission in radio-quiet AGN. For example, Kimball et al. (2011) and Kellermann et al. (2016) find, using radio observations of radio-quiet quasars, that a significant fraction of the radio emission could be attributed to star formation. On the other hand, White et al. (2015, 2017) use multi-wavelength data to fit the spectral energy distribution of a sample of radio-quiet AGN from blank surveys and targeted surveys to determine the contribution of star formation to the radio luminosity, and find that the AGN is responsible for the bulk of the radio emission. More recent work (e.g. Macfarlane et al. 2021; Xiao et al. 2022) also attributes the bulk of the radio emission in radio-quiet quasars to jet-production processes similar to those occurring in their radio-loud counterparts. It is therefore clear that more work is needed in this area, and classifying such faint radio sources as AGN requires very good ancillary data. For example, past work has concentrated on the specific class of radio-quiet quasars, where the nuclear point source at optical wavelengths is dominant, whereas the classifications here include mid-infrared and X-ray data.

Redshift distribution
Fig. 14 shows the redshift distributions for the MIGHTEE AGN and star-forming galaxies in the visually cross-matched sample, compared with those from the SKADS and T-RECS simulations. We use the spectroscopic redshifts where available, and the photometric redshifts for all other sources. The distributions from the Smolčić et al. (2017b) VLA-COSMOS 3 GHz work are also shown for reference. As above, the MIGHTEE SFG class shown here is a combination of the 'SFG' and 'probable SFG' classes described in Whittam et al. (2022). All distributions are normalised to the MIGHTEE area used in this work, 0.86 deg$^2$. The visually cross-matched MIGHTEE sample in the top panel only includes the 2824 sources that we are able to identify a host for (86 per cent of the full sample with $S_{1.4\,{\rm GHz}} > 50$ μJy). As we are able to cross-match a higher proportion of sources using the Likelihood Ratio method (3088 sources, 94 per cent of the full sample with $S_{1.4\,{\rm GHz}} > 50$ μJy), we also show the redshift distribution of the LR-matched sample for reference in the top panel of Fig. 14. While they are more complete, these identifications are not as robust as those from the visually cross-matched catalogue. However, their distribution gives a good indication of the potential distribution of the sources missing from the visually cross-matched sample, and so provides a useful reference. We note that the AGN and SFG classifications are not currently available for the LR-matched catalogue, so we are not able to include this catalogue in the bottom two panels of Fig. 14. The differences between our observed distribution of sources and the simulated distributions are highlighted when the source populations are split into SFG and AGN. There are more AGN in the MIGHTEE sample at $z \sim 1$ than predicted by either simulation. T-RECS under-predicts the number of AGN to a greater extent than SKADS; this is probably because the T-RECS 'AGN' class only includes radio-loud AGN, as discussed in Section 6.1. To account for this, in Fig. 15 we show the T-RECS AGN and SFG compared to the MIGHTEE radio-loud AGN and all other MIGHTEE sources (i.e. radio-quiet AGN, SFG and unclassified sources), which should be more directly comparable classifications. This shows that T-RECS still under-predicts the number of AGN at $z \sim 1$, even when only radio-loud AGN are considered.
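Normalising a redshift distribution to the MIGHTEE area, as done for all distributions in Fig. 14, amounts to weighting each source by the ratio of the two areas. The sketch below illustrates this; the catalogue area passed in and the bin edges are hypothetical inputs, and the function name is our own.

```python
def normalised_redshift_histogram(redshifts, bin_edges, catalogue_area_deg2,
                                  target_area_deg2=0.86):
    """Histogram of redshifts, rescaled to the number expected in target_area_deg2."""
    # Each source counts for (target area / catalogue area) sources.
    scale = target_area_deg2 / catalogue_area_deg2
    counts = [0.0] * (len(bin_edges) - 1)
    for z in redshifts:
        for i in range(len(counts)):
            # Half-open bins [low, high).
            if bin_edges[i] <= z < bin_edges[i + 1]:
                counts[i] += scale
                break
    return counts


# Toy example: a simulated catalogue covering twice the MIGHTEE area.
hist = normalised_redshift_histogram([0.1, 0.2, 0.7, 1.5], [0.0, 0.5, 1.0, 2.0],
                                     catalogue_area_deg2=1.72)
```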
We note that as the MIGHTEE sample shown in Fig. 14 only includes sources we are able to securely classify, the number of AGN (and SFG) shown here should be considered a lower limit. The prescriptions for simulating the AGN population in both simulations are based on observations at higher fluxes and extrapolated down to the fluxes reached by the MIGHTEE survey. For example, SKADS uses the Fanaroff and Riley type I and II (FRI and FRII; Fanaroff & Riley 1974) evolution models from Willott et al. (2001), along with the observed relationship between X-ray and radio luminosity for radio-quiet quasars (Brinkmann et al. 2000), then extrapolates these to fainter flux densities. This work demonstrates that there are more AGN than predicted by these extrapolations. The AGN that are missing from the simulations are predominantly low-excitation radio-loud AGN (LERGs, see e.g. Heckman & Best 2014), which show an excess of radio emission but do not display the other indicators of AGN emission typically present in more highly-accreting nuclei, such as strong nuclear emission and mid-IR emission from a dusty torus. It is only due to the combination of deep radio data and excellent multi-wavelength data in the MIGHTEE fields that we are able to identify these very faint AGN. This has implications for the role of radio galaxies in galaxy evolution, as it suggests that mechanical feedback could play a significant role even at faint flux densities. This is discussed further in Whittam et al. (2022).
In terms of SFG, both the MIGHTEE and VLA-COSMOS observed samples show good agreement with T-RECS at $z < 1$. The SKADS simulation, however, under-predicts the number of SFG observed at $z \lesssim 0.6$. This is in agreement with the growing evidence in the literature that the SKADS simulation underestimates the number of SFG at faint flux densities ($S_{1.4\,{\rm GHz}} \lesssim 0.1$ mJy); see e.g. Smolčić et al. 2017a; Prandoni et al. 2018; Mauch et al. 2020; Matthews et al. 2021; Hale et al. 2023. While the absolute numbers of SFG and AGN in SKADS do not agree well with the observations, as discussed in this section, the fractions of AGN and SFG in SKADS are in good agreement with the observed fractions, as shown in Fig. 12 and discussed in Section 6.1. This is because SKADS does not include a significant number of faint radio AGN at similar redshifts to the SFGs. As the MIGHTEE observations only cover a relatively small area (0.86 deg$^2$), it is possible that cosmic variance has an impact on the absolute number of sources in the field. However, Heywood et al. (2013) show that cosmic variance is only expected to amount to around 5 per cent of the number density at $S_{1.4\,{\rm GHz}} \sim 100$ μJy. Additionally, Hale et al. (2023) demonstrate that the MIGHTEE source counts in the COSMOS field (used in this work) are consistent with those from the XMM-LSS field.
On the other hand, the VLA-COSMOS sample appears to contain more SFGs and fewer AGN at $z \sim 1$ than the MIGHTEE sample. This is probably primarily due to differences in the methods used to classify the sources, particularly the different criteria used to identify radio-excess AGN. When comparing the sources in common, there are a number of radio-excess AGN in the MIGHTEE sample which are identified as SFG in the VLA-COSMOS work. This is in part because the VLA-COSMOS team require a source to have a $3\sigma$ radio excess to be classified as radio-loud (Smolčić et al. 2017b), while we follow the more recent work by Delvecchio et al. (2021) and only require a $2\sigma$ radio excess. This results in a higher completeness, but could cause a $3-4$ per cent contamination of SFG in the radio-loud AGN sample. This is discussed in detail in Whittam et al. (2022), where the classification schemes for the two studies are compared. In the lowest redshift bins ($z < 0.3$) MIGHTEE detects more SFG (and more sources in total) than VLA-COSMOS. As discussed in Hale et al. (2023), there are a number of extended galaxies detected in MIGHTEE which are not detected in VLA-COSMOS despite having total flux densities above their detection limit. This is because the configuration of the VLA used for the VLA-COSMOS observations lacks short baselines, so while it provides excellent resolution, it is not sensitive to extended emission, resulting in these sources being missed.

CONCLUSIONS
In this work we have cross-matched the MIGHTEE Early Science radio catalogue in the central part of the COSMOS field with a multi-wavelength catalogue of objects selected in the near-infrared $K_s$ band, both by eye and by using an automated Likelihood Ratio method. The cross-matched catalogues are released with this work. Our main results can be summarised as follows:
• From an initial pybdsf catalogue of 6 102 radio sources, we find that 5 223 radio sources can be successfully assigned to a multi-wavelength counterpart via visual inspection.
• We compare our visually cross-matched sample to samples obtained using the likelihood ratio method.With the automated LR method we are able to identify counterparts for 94 per cent of radio components in the low-resolution MIGHTEE catalogue, and these matches agree with those identified visually in 95 per cent of cases.The fraction we are able to match rises to 97 per cent when we consider sources which are unresolved only.
• Visual inspection is still crucial for cross-matching extended and multi-component radio sources, and for identifying confused sources. The LR method is only able to match 61 per cent of sources with $S_{1.4\,{\rm GHz}} > 100$ μJy, while using visual inspection we are able to identify counterparts for 93 per cent of the same sample. This highlights the benefits of combining the two methods; by using the LR we are able to automatically match a large number of the fainter, compact sources, but visual inspection is necessary to match the extended, complex sources. A dual approach of automated and visual inspection will be implemented for future MIGHTEE observations of the remainder of the COSMOS field and the XMM-LSS, E-CDFS and ELAIS-S1 fields.
• Our sample contains a mixture of AGN and star-forming galaxies, which can be probed out to $z \sim 5$. We show that the fractions of AGN and star-forming galaxies as a function of radio flux density agree well with the SKADS simulations, with star-forming galaxies becoming the dominant population below flux densities of $\sim 100$ μJy. The T-RECS simulation, however, seems to under-predict the fraction of AGN and over-predict the fraction of SFG below $S_{1.4\,{\rm GHz}} \sim 1$ mJy.
• The MIGHTEE sample contains more AGN at $z \sim 1$ than predicted by either simulation (although SKADS is closer to matching the observed distribution than T-RECS). The majority of these AGN are low-excitation radio galaxies (LERGs) and it is only due to the combination of deep radio data and excellent multi-wavelength data in the MIGHTEE field that we are able to detect these faint AGN.

(Robitaille and Bressert, 2012), hosted at http://aplpy.github.com. We also acknowledge the IDL Astronomy User's Library, and IDL code maintained by D. Schlegel (IDLUTILS) as valuable resources.

Figure 1. Examples of overlays examined in the cross-matching process. Radio contours from MIGHTEE Early Science (Heywood et al. 2022, green) and the VLA-COSMOS 3 GHz Large Project (Smolčić et al. 2017a, blue) are overlaid on an UltraVISTA $K_s$-band background grey-scale image (McCracken et al. 2012). The contours represent 7 levels evenly spaced in log space between 1.5 times the local rms noise and half the maximum pixel value in the image. The green stars indicate radio components in the pybdsf catalogue. The red circles indicate the host galaxies of these radio sources. The upper two panels show two large extended, multi-component AGN. The bottom left panel shows a single-component star-forming galaxy, and the bottom right panel shows a radio source that is confused in MIGHTEE. The size of the MeerKAT radio beam is highlighted by the solid green circle.

Figure 4. The positional offsets between the radio and $K_s$-band coordinates for each of the single-component radio sources in the cross-matched catalogue.

Figure 5. Separation between each radio source and the nearest object in the $K_s$-band near-infrared catalogue for the real radio catalogue and a random set of positions with the same source density as the radio catalogue.

Figure 7. The distribution of total fluxes of all components in the low-resolution MIGHTEE catalogue (white), with those with a counterpart in the LR-matched catalogue shown in black. The bottom panel shows the fraction of matched components in each flux density bin.

Figure 10. Redshift distribution of the 5 223 objects in the cross-matched sample. The distribution of objects with spectroscopic redshifts (2 427 objects) is shown as a blue dashed line, whereas the distribution of those with photometric redshifts (2 796 objects) is shown as the red dotted line.

Figure 11. The rest-frame 1.4 GHz radio luminosity-redshift distribution of the cross-matched sample. Objects with spectroscopic redshifts are shown as blue crosses and those with photometric redshifts as red dots. The dotted black line represents a flux limit of 20 μJy. The y axis on the right-hand side shows an estimate of the star-formation rate, scaled from the radio luminosity using the Bell (2003) relation.

Figure 12. Fraction of AGN and SFG as a function of 1.4-GHz flux density in the visually cross-matched MIGHTEE sample compared to the SKADS simulated skies and T-RECS. The top panel shows the number of AGN and SFG (including 'probable SFG' for MIGHTEE) as a fraction of the classified radio sources, i.e. only MIGHTEE sources which we are able to classify as either AGN or SFG are included. The middle panel shows the MIGHTEE SFG and AGN as a fraction of all MIGHTEE sources (including unmatched and unclassified sources). The unclassified and unmatched MIGHTEE sources are shown as the pale grey line (labelled 'MIGHTEE unknown'). The bottom panel shows the fraction of MIGHTEE radio-loud AGN and all sources not classified as radio-loud AGN, compared to the AGN and SFG in T-RECS. Note that the 'MIGHTEE not AGN' class includes all unmatched and unclassified sources, as well as those sources classified as SFG and radio-quiet AGN. MIGHTEE fluxes are scaled to 1.4 GHz assuming a spectral index of 0.7. Uncertainties shown are Poisson errors.

Figure 13. The fraction of sources classified as radio-loud and radio-quiet AGN in MIGHTEE and SKADS. MIGHTEE fluxes are scaled to 1.4 GHz assuming a spectral index of 0.7. Uncertainties shown are Poisson errors.

Figure 14. Comparison between the redshift distribution of simulated radio sources from SKADS (dashed lines), T-RECS (dotted lines) and the MIGHTEE visually cross-matched sample (solid lines). The Smolčić et al. (2017b) VLA-COSMOS 3 GHz sample is also shown (dot-dashed line). All distributions are normalised to the MIGHTEE area of 0.86 deg$^2$. The top panel shows all sources, the middle panel shows sources classified as SFG, and the bottom panel shows AGN. See text for details of classifications. The distribution of sources in the LR-matched MIGHTEE catalogue, which contains more sources, is also shown by the red solid line in the top panel.

Figure 15. Comparison of the redshift distribution of simulated radio sources from T-RECS and the MIGHTEE visually cross-matched sample. For the T-RECS sample, AGN and SFG are shown separately (T-RECS does not include radio-quiet AGN, so these will be included in the 'SFG' class). For the MIGHTEE sample, radio-loud AGN are shown in green, and all other sources (i.e. radio-quiet AGN, SFG and unclassified sources) are shown in magenta.
acknowledges the support of the LMU Faculty of Physics.LM and MV acknowledge support from the Italian Ministry of Foreign Affairs and International Cooperation (MAECI Grant Number ZA18GR02) and the South African Department of Science and Innovation's National Research Foundation (DSI-NRF Grant Number 113121) under the ISARP RADIOSKY2020 Joint Research Scheme.MGS acknowledges support from the South African Radio Astronomy Observatory and National Research Foundation (Grant No. 84156).This work was supported by the Medical Research Council [MR/T042842/1].This research made use of aplpy, an open-source plotting package for Python

Table 1. Breakdown of the classifications from the visual inspection of the MIGHTEE radio sources.

Table 2. Summary of the performance of the likelihood ratio to identify counterparts for radio sources in the high and low resolution MIGHTEE images.