A self-supervised, physics-aware, Bayesian neural network architecture for modelling galaxy emission-line kinematics

In the upcoming decades large facilities, such as the SKA, will provide resolved observations of the kinematics of millions of galaxies. In order to assist in the timely exploitation of these vast datasets we explore the use of a self-supervised, physics aware neural network capable of Bayesian kinematic modelling of galaxies. We demonstrate the network's ability to model the kinematics of cold gas in galaxies with an emphasis on recovering physical parameters and accompanying modelling errors. The model is able to recover rotation curves, inclinations and disc scale lengths for both CO and HI data which match well with those found in the literature. The model is also able to provide modelling errors over learned parameters thanks to the application of quasi-Bayesian Monte-Carlo dropout. This work shows the promising use of machine learning, and in particular self-supervised neural networks, in the context of kinematically modelling galaxies. This work represents the first steps in applying such models for kinematic fitting and we propose that variants of our model would seem especially suitable for enabling emission-line science from upcoming surveys with e.g. the SKA, allowing fast exploitation of these large datasets.


INTRODUCTION
In studying galaxy evolution, astronomers often use the atomic Hydrogen (H ) 21-cm line to trace the outermost regions of galactic discs (e.g. Warren et al. 2004;Begum et al. 2005;Sancisi et al. 2008;Heald et al. 2011;Koribalski et al. 2018). This region can mark the continuous boundary between galaxies and their surrounding environments, including the dark matter halos within which galaxies are thought to reside. The rotation curves of extended H discs can be used to begin probing the properties of dark matter halos as well as allow the detailed modelling of galaxies' mass distributions when coupled with ancillary observations (e.g. van Albada et al. 1985;de Blok et al. 2008). In the local Universe, H discs are useful in determining the gaseous content of a galaxy as well as allowing astronomers to probe kinematic properties ranging from substructures such as bars, warps, counter-rotating discs, and spiral arms (e.g. Józsa et al. 2007;Spekkens & Sellwood 2007;Kamphuis et al. 2015;Di Teodoro & Fraternali 2015). Molecular gas observations (typically of the CO molecule) can provide a complimentary view of these regions at high resolution, revealing the interplay between these gas phases. H is typically more extended than molecular gas, however, allowing it to trace environmental properties such as extended tidal features and the existence of dwarf companions (Hibbard et al. 2001;Sancisi et al. 2008;Heald et al. 2011;Serra et al. 2013;Bosma 2016;Koribalski et al. 2018).
The evolution of H gives astronomers insight into the method by which galaxies accrete material from surrounding environments ★ E-mail: dawsonj5@cardiff.ac.uk and how the mass of galaxies builds and evolves through star formation. The next generation of H survey instruments (e.g. the Square Kilometre Array, Dewdney et al. 2009, Australian Square Kilometre Array Pathfinder, Johnston et al. 2007Johnston et al. , 2008, the South African Meer-Karoo Array Telescope, Jonas & MeerKAT Team 2016, the Chinese Five-hundred metre Aperture Spherical Telescope, Li & Pan 2016) are poised to collect observations spanning a large look-back time, advancing our H driven science as well as pushing this field of astronomy firmly into the Big Data era.
Currently it is estimated that the Square Kilometre Array (SKA) will collect data on the order of hundreds of petabytes per year. Given that amount of data is not only too much to fully exploit by hand but also too large to store, astronomers should be looking to develop realtime models that can perform efficient science on incoming data. In an ideal world, physical information would be extracted from incoming data automatically, leaving the work of unravelling the prevailing science to astronomers. However, with such large data volumes and time-intensive techniques how are astronomers to begin moving in a direction in which we can fully exploit the data quality promised by the SKA?
In previous work we sought to begin addressing this challenge via the application of machine learning (Dawson et al. 2019), and in particular neural networks, to extract kinematic properties of cold gas in galaxies. Models and tools exist to do this kind of work already. With the upcoming data releases from surveys such as the Widefield ASKAP L-Band Legacy All-Sky Blind Survey (WAL-LABY), it comes as no surprise that kinematic modelling tools (e.g. 3D-BAROLO 1 Di Teodoro & Fraternali 2015, 2DBAT 2 Oh et al. 2017, FAT 3 Kamphuis et al. 2015, and KinMS 4 Davis et al. 2013;Davis et al. 2020) have been in use and ongoing development for some time. Yet these models typically require several minutes or more to provide a full kinematic model of a single object, and longer if errors are required, which may prove problematic for kinematic analyses at SKA survey speeds.
In the past decade machine learning (ML) has become a popular solution to many Big Data challenges in galaxy evolution studies (e.g. Dieleman et al. 2015;Domínguez Sánchez et al. 2018a,b;Ackermann et al. 2018, Bekki 2019, but remains an under-utilised resource among the galaxy kinematics community. Computer vision, which often utilises ML techniques, has been successfully applied to kinematic characterisation (e.g. Stark et al. 2018). Yet, there is a distinct absence of directly exploiting ML (with the notable exception of a few recent works, e.g. Shen & Bekki 2020). Recently our group has made attempts to exploit the use of ML in this field, featuring the use of convolutional autoencoders to identify disturbed cold gas in galaxies using data from both simulations and observations (see Dawson et al. 2019). We still have a long way to go in fully exploring the application of ML to galaxy kinematic characterisation but it appears to be a promising avenue of research and one which we explore further in this work.
While conventional ML models are capable of high empirical accuracy and low testing time (e.g. Breiman 2001;Krizhevsky et al. 2012), they are often highlighted for their slow training times (Lim et al. 2000) and, in some cases, reluctance to generalise to unseen datasets (Dinh et al. 2017;Kawaguchi et al. 2017). These qualities are unsuitable for survey tasks proposed for the SKA and therefore we are required to look at alternative methods that incorporate the benefits of ML, without the drawbacks associated with standard ML practice.
Such an approach may exist in the form of self-supervised learning (Liu et al. 2020), whereby models train themselves without the need for an isolated training set. This has huge benefits in that one does not require long training times on a throw-away-dataset, essentially eliminating data wastage. As with all machine learning approaches, self-supervised learning does have its disadvantages including requiring fixed analytical functions to perform training, as well as results which change depending on when one wishes to evaluate test data throughout the model training procedure. Few pilot tests of these networks exist in astronomy (and even fewer utilising physics-aware capabilities, e.g. Aragon-Calvo 2019) and none exist in the modelling of galaxy kinematics. In this paper we present the current results from our first attempts at creating a self-supervised neural network with the primary goal of inferring the kinematic properties of gas discs in galaxies and an emphasis on extracting (simplistic) characteristics of their rotation curves.
The paper is divided into 3 main sections. §2 gives an in depth description of the model architecture used throughout this work, with emphasis lying on the decoder subnet described in §2.4. §3 presents the results from testing the network using synthetic and real interferometric observations, and §4 summarises the main outcomes of the work presented in this paper as well as proposed avenues for future work. A simplified pictorial representation of the neural network used throughout this work. The model features two convolutional encoder subnets which concatenate learned features before passing them to a decoder subnet. The model receives moment maps as inputs and minimises the loss between decoder-generated moment map outputs and the inputs throughout training. In the diagram grey squares indicate convolutional layers, blue rectangles depict linearly connected layers, and the grey cube represents the auxiliary 3D cube containing the coordinate axes passed into the network.

Input data
A typical interferometric observation returns visibilities in a complex plane from which one can obtain a 3D datacube consisting of 2D spatial flux observations separated into discrete channels which correspond to observed frequency. It is this channelisation that allows astronomers to measure the line of sight velocities and hence the kinematic properties of galaxies' gas reservoirs. In practice, one can collapse these datacubes further to create 2D maps that reflect the mean properties of the gas in galaxies. A moment zero (integrated intensity) map is simply a summation along a cube's frequency/velocity dimension: and a moment one (velocity) map is an intensity weighted averaging of the line of sight velocities Working directly with the datacubes, or in fact the complex visibilities, would be optimal for any fast pipeline kinematic modelling tool. However, we have chosen to work with moment maps in this work as a first step and to avoid the problems associated with channelised inputs as explained further in §4. It should be noted that, because of our choice to use moment maps, the models described in this work are also suitable to analyse optical IFU maps, as they will be handled similarly by the model described in this work and have been shown to encode kinematic information which can be extracted using both analytical and ML approaches (e.g. Stark et al. 2018;Hansen et al. 2020). This will be explored further in future work (Dawson et al., in prep).
It should be noted that in this work, we are not making attempts to mitigate the effects of "beam smearing" (Swaters 1999;Blais-Ouellette et al. 1999). During the recovery of datacubes from complex visibilities, the raw observational datacubes are convolved with a restoring beam which effectively encodes the complex visibility plane coverage and is, in some ways, analogous to resolution. It is this convolution step which gives rise to "beam smearing", the effects of which are discussed further in §3.1.2 along with implications for interpreting the model results discussed in this work. Counteracting "beam smearing" will need to be tackled in future work to maximise the effectiveness of models of this type.

Model aim
An autoencoder (Rumelhart et al. 1986) is a model composed of two subnets, an encoder and a decoder. In an undercomplete autoencoder the encoder subnet extracts features and reduces input images to a constrained number of nodes. This so-called bottleneck forces the network to embed useful information about the input images into a nonlinear manifold from which the decoder subnet reconstructs the input images and is scored against the input image using a loss function.
The aim of the model used in this work is to extract semantically meaningful information from observational data. Typical approaches using a convolutional autoencoder (CAE, Masci et al. 2011) (such as that presented in Dawson et al. 2019) are powerful for extracting arbitrary (hyperparametric) features that define dataset characteristics. During training, a CAE learns to minimise the difference between input and output tensors rather than the difference between an output and target label (whether this be a continuous or categorical set of target classes). A CAE works similarly to a powerful nonlinear generalisation of principle component analysis (PCA, Plaut 2018) whereby it finds a continuous nonlinear latent surface on which input data best lies. In this work, however, we would like to extract semantically meaningful parameters of observed systems. In order to achieve this we have combined a convolutional autoencoder with a set of analytical, gradient trackable, functions which approximate the functional forms of observed kinematics of galaxies.
The model, known as a semantic autoencoder (SAE, Kodirov et al. 2017), is a modified CAE created using PyTorch 5 0.4.1, an open source ML library capable of GPU accelerated tensor computation and automatic differentiation (Paszke et al. 2017). The model has a neural network architecture suited to self-supervised learning, with additional Bayesian capabilities. Figure 1 shows a simplified pictorial representation of the model architecture.
The encoder subnets extract lower dimensional feature representations from input images (here the integrated intensity and mean velocity maps as described in §2.1) using a combination of convolutional and linearly connected layers; the decoder then reconstructs the input images from the learned feature representations. In a standard convolutional autoencoder, the decoder would make use of transposed convolution operations, however in this network the decoder is composed of analytical functions written using native PyTorch. This imposes a constraint on the CAE by forcing the network to generate a semantic encoding of the input images. As highlighted by Aragon-Calvo (2019), the decoder function can take any possible form, no matter how representative of the true underlying functions being modelled. In this way, we can be assured that the encoders are learning semantically meaningful properties of the input images and are no longer tied to traditional training methods, instead allowing the network to train on all available data (including test data) in a self-supervised manner. An SAE becomes physics-aware once the assumption is made that the decoder function can be used to reveal physically meaningful information about the input. In this paper, the Table 1. The SAE encoder subnet architecture used throughout this paper. The first column lists the name of each layer/operation, the second column describes the type of layer/operation, the third column shows the dimensions of each layer's output tensors (hence the input shape to the next layer). The dimensions follow the PyTorch convention (batch size, number of channels, height, width). The filter column shows the dimensions (height, width) of kernels used to perform the convolution and pooling operations. The convolutional and linearly connected layer groups are separated by a blank row for clarity.

Name
Layer Linear (64,1,1,256) -Htanh Hard tanh activation --Output -(64,1,1,2) physics-awareness of the model refers to our main focus of approximating parameterisations for rotation curves, intensity profiles and recovering galaxy inclinations (see §2.4). For a more in-depth background to the use of autoencoders we refer the reader to Bourlard & Kamp (1988) and Hinton & Salakhutdinov (2006). For both a concise and thorough introduction to the use of self-supervised, physics aware, neural networks in astronomy we recommend Aragon-Calvo (2019).

The encoder subnets
Within the network, the encoders are two convolutional-classifierlike subnets. Each comprises a series of 4 convolutional and 2 fully connected layers, interspersed with pooling layers and activation functions. The encoders are used to extract and dimensionally reduce features from input images. The two subnets independently receive a moment zero map (a 2D intensity profile, normalized in the range 0-1) and a moment one map (a 2D velocity profile, normalized into the range -1-1) respectively. Throughout this work, we ensure that the input maps have size of 64×64 pixels. All input maps whose sizes are larger or smaller, like those discussed in §3.2 and §3.3, are subsequently up/down-sampled to a size of 64×64 using PyTorch's torch.nn.Upsample class, in bilinear mode. Each moment map carries valuable information for the decoder functions as described in §2.4. With this in mind, the output of the encoders are two vectors which are concatenated before passing to the decoder subnet. For an in depth look at the encoder subnet structure see Table 1.
The encoders learn the following properties: subnet 1: observed galaxy inclination (i) and free parameters of the intensity profile which make up 1 in Figure 1; subnet 2: the parameters of the velocity profile of the galaxy which make up 2 in Figure 1.

The decoder subnet
Here we detail the functions required for reconstructing the moment zero and moment one input maps from the concatenated feature representations 1 and 2 as shown in Figure 1. In recovering the moment maps, we are primarily interested in modelling two profiles. Firstly, the intensity: where I 0 is the intensity normalisation factor (set to 1 throughout, due to the global normalisation described above), r x,y is the radius in the plane, in arcseconds, r scale is the intensity scale length in the plane, is the position in the axis, and r z-scale is the intensity scale length in the axis set to a value of 1 spaxel throughout this work, to emulate a thin disk. Intensity values are determined by combining the integrals of Equation 3 across each spaxel in the and planes.
Secondly, the rotational velocity: where V max is the asymptotic line of sight velocity, r is the radius in arcseconds, and r turn is the velocity profile scale length.
Here, our choice of exponential intensity profile and arctan velocity profile are entirely arbitrary (i.e. not driven by any physical theory), but are choices motivated by some of the simplest forms that can approximately fit the typical disks and rotation curves found in the Universe. Clearly objects that do not follow these functional forms will not be appropriately fit by this network and we discuss this further in §3.5. However, it should be noted that this analytical-style decoder implementation would be equally valid for other functional forms. For example, one could choose to fit bulge-disk models with such an architecture, or include the influence of central point masses or the effects of dark matter halos. These more realistic networks will be explored in future works.
An auxiliary 3D tensor of radii (labelled r in Figure 1) is passed into the network, cloned, and evaluated using Equations 3 and 4. The 2D moment maps are then created using Equations 1 and 2. The velocity profile is later converted into line of sight velocity map via an inclination projection and velocity weighting based on the pixel angles about the line of sight axis.

Model training procedure
The network is trained with minimal optimisation of hyperparameters in order to demonstrate the simple nature of this architecture. At all times the network utilises a PyTorch's MSELoss function which computes the mean squared error: between the model outputs, y i , and inputs, x i , for every forward pass of a batch of size N. In this case, this is the squared difference between the moment zero and moment one inputs and decoder generated outputs. It is worth noting here that all synthetically generated moment maps have the same position angle and consequently any observational data used for training and testing have been de-rotated using published position angle measurements. We do this as position angle is a non-physical parameter which we can easily account for in preprocessing (with e.g. the fit_kinematic_pa routine of Krajnović et al. 2006). We use an adaptive Adam learning rate optimiser (Kingma & Ba 2014) which begins with a value of 10 −4 and reduces via multiplication of 0.975 every 2 epochs. We find that the model converges well after 300 epochs for all training runs presented in this paper.
Where synthetic training data is used, the network receives batches of 64 input moment map pairs. Initial tests showed the network to be largely unaffected by batch size and so 64 is arbitrarily chosen to increase training speed.
The models and Python training scripts used for the work presented in this paper are publicly available on GitHub 6 .

Model testing procedure
Testing the network can be done in three distinct ways, depending on the situation at hand. In order to test data, one can choose whether to train the network on the test data alone (we call this testing procedure solo testing), to train on the test data alongside other examples (we call this testing procedure combined), or to use the network in full test mode having only trained on examples not including those data that we wish to test (called blind testing).
One can imagine the case where sufficient training data has been passed through the network in a survey, such that in order to return rapid kinematic modelling of new observations one simply passes the new observations through the network with no prior exposure to the training procedure. This blind testing has the advantage of rapid testing speed but at the potential cost of lowered predictive accuracy, in an epistemic uncertainty dominated regime. One can also imagine the case whereby initial survey data has been collected and some sample of the dataset the network used to train is also in need of testing. As the network has seen these data during the training procedure, combined testing has the advantage of potentially higher accuracy at the expense of time needed to train the model. It should come as no surprise that the ideal testing scenario for this network is combined, with a sufficiently large training set in an aleatoric uncertainty dominated regime. However, there are cases (such as at first light of a survey) where the only test data available is that which the network was trained on. It is in this scenario that solo testing will occur and although this testing regime lacks the benefits afforded by combined testing, it has the potential advantage of predictions not being influenced by anomalous data whose population increases with training set size.

Monte Carlo dropout
In this section we summarise the use of Monte Carlo dropout (henceforth MC dropout; Gal & Ghahramani 2016) to provide quasistatistical modelling uncertainties over learned parameters within the model.
In conventional neural network training circumstances dropout may be interpreted as permuting a trained model (Srivastava et al. 2014) via the probabilistic zeroing of weights in linearly connected layers. Traditionally, dropout layers are used throughout training in order to force the network to behave as an ensemble of architectures increased testing accuracy and generalisation power. In the case of MC dropout, after training, dropout is reapplied to the network in evaluation mode and inputs are passed through the model many times, effectively sampling a posterior where the model architecture is marginalised out. Gal (2016) first proposed the idea of approximating distributions over parameters learned in neural networks in this way and has since been used in astronomy (e.g. for the probabilistic labelling of galaxy morphologies, Walmsley et al. 2019).
For an input x (comprised of a moment 0 and moment one map), training data D, model weights w, T forward-pass evaluations, and encoder output k, the predicted parameter means and standard deviations are given by Equations 6 and 7 respectively.
For a comprehensive derivation of Equations 6 and 7, as well as the implications for using an arbitrary dropout probability, we refer the reader to Walmsley et al. (2019). Examples of the posterior distributions, p(k|w,D), over learned parameters using MC dropout for a randomly selected synthesised galaxy are described further in §3.1.1.
It should be noted that, as the network does not use dropout to zero weights in the convolutional layers, does not represent a complete error over learned parameters. Instead one should consider as a lower limit error over parameters whose use becomes immediately obvious for pipeline flagging purposes or to generate relative errors within a test set. The errors produced through this technique are strictly errors due to the modelling technique, and will underestimate Table 2. Parameter values and ranges for all synthetically generated galaxies using the KinMS package. The units for r scale and r turn are absent due to both quantities being fractions of the input map size. The position angle of each galaxy is fixed at 0 as it is not a physically meaningful parameter. Throughout model training, parameters are drawn uniformly in the ranges listed.

Parameter
Size/range Units Position angle 0 deg Inclination 10-90 deg r scale 0.1-0.35 r turn 0.01-0.8 -V max sin(i) 50-500 km s −1 the true error in any parameter, which arises due to both modelling and observational uncertainties.

RESULTS AND DISCUSSION
In this section we present exemplar test results for highly spatially resolved galaxy observations. In each case we have trained new networks using the procedures described in §2.5.

Input-output
In order to explore the limitations of the network, we tested the model using synthetic galaxies generated using the Python based kinematic Corner plot showing the level of covariance between learned parameters for one randomly generated, synthesised galaxy (discussed further in §3.1). The accompanying histograms represent quasi-probabilistic distributions thanks to the use of Monte Carlo dropout. This galaxy was passed through the network in test mode 10 000 times in order to build the distributions. We observe well constrained learned parameters with Gaussian like profiles, allowing for quasi-probabilistic modelling errors for the parameters. The only strong covariance observed is that between the maximum line of sight velocity and the velocity profile scale length, which is entirely expected and present in traditional kinematic analyses.
simulator KinMS 7 (KINematic Molecular Simulation, Davis et al. 2013;Davis et al. 2020). Figure 2 shows the inputs and outputs as well as both known and predicted profiles for a galaxy generated using the same analytical functions described in §2.4 with inclination, maximum velocity, and scale lengths drawn randomly in the ranges shown in Table 2, and a fixed beam size of 2 resolution elements. It is clear that the model is able to recover the galaxy's rotation curve (and other parameters) well in blind testing mode, whereby the model has not yet trained on the test data. The quasi-probabilistic distributions for each learned parameter for this galaxy are shown in Figure 3, highlighting the Gaussian-like nature of the learned parameter distributions as well as an expected covariance between r turn and V max sin(i). As seen in Figure 4 the model is able to recover the desired physical parameters of synthesised galaxies well, heuristically. For the 1739 7 https://github.com/TimothyADavis/KinMSpy test galaxies shown in Figure 4 we measure the average deviation of parameters: i, r scale , r turn , and V max sin(i), from the 1:1 line as i = 0.98 • , r scale = 0.003, V max sin(i) = 3.48 km s −1 , and r turn = 0.017 respectively.
It is clear from Figure 4 that the error estimates do not represent the total errors over the parameters and only encode the modelling error. This makes the presented errors strictly lower limit estimates, and mostly useful for comparing reliability within the dataset, rather than external use. This can be seen by the fact that on average only ∼35% of the data points in Figure 4 have errorbars which overlap with the 1:1 true-versus-predicted line. For the presented dataset these errors likely underestimate the total error by a factor of ∼2.5. Including errors in the observations themselves will help to narrow this gap and will be explored further in future work.  1. Those galaxies whose projected r turn fell below 1.5 times the restoring beamsize were removed in order to mimic the automated flagging of poorly resolved galaxies at high inclination in a survey. Of the 2000 synthesised galaxies tested, 261 (13%) were removed using this cut.

The effect of resolution
One expects r scale,pred to artificially increase with beam size for a fixed r scale . However, r scale is not known for observations of galaxies whose values r scale fall below some fraction of the beamsize. We see this effect happening as shown in Figure A1 in a non-complex manner. Therefore, we recommend enforcing flagging based on inclination which appears to be strongly linked with those galaxies whose r scale is under predicted (along the minor axis). In the edge-on galaxy case, the minor axis is no longer well resolved resulting in a poor recovery of the intensity profile. However, this is a well-known issue in moment based kinematic modelling, in which the intensity profiles and kinematics can never be fully derived in edge-on galaxies due to line of sight effects. As we have included no method for mitigating the changes induced by varying beam size, it comes as no surprise that the network will behave differently given a sufficiently large ratio of beam size to galaxy extent. Given that we do not have a mechanism for dealing with "beam smearing" in the current network architecture, we expect to see its influence, lowering the apparent line of sight velocities close to the center of galaxies where the iso-velocity contours are closest together. For minimising the effects of varying beam size we recommend convolving the 3D spatial cube (see Figure 1), evalu-ated using Equation 3, with the restoring beam before creating the output maps. The advantage of this approach being that the restoring beam is often included in data-product header units, and so should be readily available for creating kernels with which to perform the aforementioned convolution. We consider this approach as beyond the scope of the work presented in this paper, but will be included in future work focusing on retrieving the properties of marginally resolved galaxies.

Fill factor
In previous work we showed that the fill factor (i.e. the number of zeroed pixels) in a velocity map's field of view, impacts the behaviour of NN models which take them as inputs (Dawson et al. 2019). With the NN model presented in this work, we have seen little evidence that this has an effect on the galaxies' predicted parameters. We attribute this behaviour to the nature of the training procedure, whereby in combined and solo testing, the network does not rely solely upon inference of unseen data.  In this way we are directly observing the input and output maps of the model. The right column has undergone an -axis rescaling to match observational scales found in the literature. The black dashed lines and grey areas show the mean and 1 modelling errors respectively for profiles predicted by the neural network. The blue dashed line shows a major axis cut of the input intensity map. The red dashed line and filled area show the best fit and associated errors modelled using BBarolo on the datacube. In order to make a direct comparison between the network's and BBarolo's derived rotation curves, the network's velocity profile has been corrected for by the predicted inclination term. The network predicted parameters are shown as text in the upper-middle, upper-right, and lower-right subplots. We see that this galaxy has a velocity profile which can be roughly approximated by an arctan function meaning the kinematic parameters are well recovered by the model.

H examples
The primary goal of developing a network like that presented in this paper is to demonstrate the applicability of machine learning to SKA science. As such, in this section we show the network performs well with H observational data. In order to do this we present two example test galaxies, NGC 2403 and NGC 3198, observed using the Very Large Array (VLA) as part of The H Nearby Galaxy Survey (THINGS) , and showing a diversity of rotation curve shapes. These galaxies are two of 17 THINGS galaxies used for mixed training and testing using the network and chosen heuristically for the appearance of their well defined rotating H disks. The names and publications for the galaxies used in this sample are shown in Table A1. Figure 5 shows the derived intensity profile and rotation curve for NGC 2403. We include the rotation curve modelled using BBarolo (Di Teodoro & Fraternali 2015) on the datacube (Di Teodoro & Lelli, private communication). In comparison, we see that the neural network's predicted rotation curve matches closely and so we are convinced that the network is able to recover physical information well. Although the galaxy's intensity profile does not strictly exhibit an exponential form, this has little impact in the recovery of the rotation curve which is the networks primary objective. Figure 6 shows the derived intensity profile and rotation curve for NGC 3198. This galaxy exhibits a mild warp and a flat rotation curve (Gentile et al. 2013) with a slight rise at ∼ 200 . Warped H discs are not uncommon in the outer regions of galaxies. At present our network architecture is not set up to model these (however one could easily extend the model in order to do so). Again, we include the rotation curve modelled using BBarolo on the datacube (Di Teodoro & Lelli, private communication) in Figure 6. Crucially, although this warping behaviour is not included in our model, in this case the network still returns reasonable parameter estimations, showing that it could still be usable for parameter estimations across a broadly diverse population of galaxies.

CO examples
In order to demonstrate the flexibility of this network architecture, we trained a model to recover the kinematic properties of galaxies observed in the CO line using the Atacama Large Millimeter/submillimeter Array (ALMA). Our samples are drawn from the mm-Wave Interferometric Survey of Dark Object Masses (WISDOM) project (see Table A2 for more information) and have high spatial resolution. Due to the nature of these objects being targeted for their evidence of black hole influence on the gas kinematics, we expect to see small values of a V for the sample. As seen in Figure 7, this effect is clearly visible, highlighting the predictable behavioural nature of the network. It is also clear in Figure 7, that NGC 1387 (FCC184, Zabel et al. 2020, Boyce et al., in prep), an exemplar galaxy from the WISDOM sample, exhibits an exponential intensity profile which the network can easily recover.
Such an example demonstrates the transferable nature of this network architecture and training style but without the difficulties often associated with traditional transfer learning tasks. This means that such architectures and training styles can be applied to a multitude  Figure 6. An example galaxy, NGC 3198, observed in H and evaluated using the network in combined testing mode. Maps in the left and middle columns share and axis sizes of 64×64 pixels. In this way we are directly observing the input and output maps of the model. The right column has undergone an -axis rescaling to match observational scales found in the literature. The black dashed lines and grey areas show the mean and 1 modelling errors respectively for profiles predicted by the neural network. The blue dashed line shows a major axis cut of the input intensity map. The red dashed line and filled area show the best fit and associated errors modelled using BBarolo on the datacube. In order to make a direct comparison between the network's and BBarolo's derived rotation curves, the network's velocity profile has been corrected for by the predicted inclination term. The network predicted parameters are shown as text in the upper-middle, upper-right, and lower-right subplots. We see that this galaxy has a velocity profile which can be roughly approximated by an arctan function meaning the kinematic parameters are well recovered by the model. of different datasets with the possibility of architectural modifications suiting other types of data outside of interferometry and even astronomy.

Testing speed
The network can retrieve a mean field approximation for all learnable parameters, of a single galaxy observation, in 0.0025 seconds on a single Intel(R) Core(TM) i7-6700 CPU core. This time scales linearly with the number of MC dropout samples one wishes to collect (i.e. for a set of 1000 MC dropout samples, a typical test on an individual galaxy would take 2.5 seconds) to generate pseudoprobabilistic distributions. However as the batch throughput size is limited only by the available device memory, it is possible to retrieve values for learnable parameters, and hence MC dropout samples, in the same time frames as listed above for multiple observations. This means that one could potentially return hundreds to thousands of parameterisations and associated pseudo-errors in a matter of seconds.

Caveats
There are a few caveats pertaining to the use of the model described in this work. These caveats may impact the way in which users handle the network and the confidence levels associated with parameter estimations.
A key factor in recovering sensible parameterisations using the network is the choice of decoder functions (see §2.4). In this work we have used simple, general, functions in the form of an exponential (see Equation 3) and an arctan (see Equation 4). However, should one wish to model specific emission line components of galaxies, it would be prudent to adopt more tailored functional forms. For example, it has been shown that H discs can display depressions in their intensities in their central regions, typically filled by molecular gas (Wong & Blitz 2002), for which a truncated Gaussian intensity profile (Martinsson et al. 2013) would be more appropriate when reconstructing the intensity maps. Additionally, when modelling the very outer regions of H discs, one might consider adopting a more complex multi-parameter function capable of encoding the sharpness of the turnover at r turn and the behaviour of the curve after this point (e.g. Rix et al. 1997), or even declining velocities in the central regions (Lelli et al. 2016). A declining rotation curve would be challenging for the current model to fit (and impossible to fully retrieve). However, due to the nature of the loss function chosen in this work (see Equation 5), the network will prioritise fitting to the higher velocity regions of galaxies.
As described in §3.1.2, the resolution of input images impacts the ability of the network to correctly predict a scale , particularly in the high inclination regime. This places constraints on the user's confidence in parameter estimations when working in both the large-beam and high inclination cases combined. Additionally, we can see in Figure 4 that the network struggles to accurately recover inclinations at the very low inclination range. This is a predictable effect caused by the loss of line of sight velocity information for face on disks but again, in the case of survey pipelines, these low inclined galaxies will require additional flagging. In both the aforementioned caveat . An example WISDOM galaxy, NGC 1387, observed in CO and evaluated using the network in combined testing mode. The left and middle columns share and axis sizes of 64×64 pixels. In this way we are directly observing the input and output maps of the model. The right column has undergone an -axis rescaling to match observational scales found in the literature. The black dashed lines and grey areas show the mean and standard deviation respectively for profiles predicted by the neural network model. The blue dashed line shows a major axis cut of the input intensity map. The red dashed line shows the KinMS reconstructed rotation curve. The network predicted parameters are shown as text in the upper-middle, upper-right, and lower-right subplots. We easily see that this galaxy has an intensity profile and velocity profile which can be roughly approximated by an exponential and an arctan function respectively, meaning the kinematic parameters are well recovered by the model. cases it is worth noting that traditional kinematic modelling methods also struggle to accurately estimate parameters, in particular when working with moment maps. Extensions of the network's framework presented here to kinematically model datacubes may alleviate these issues and will be explored in future work.

CONCLUSIONS
We have demonstrated the performance of a neural network model architecture which can be used to recover rotation curves of galaxies from their kinematics. The model was tested on synthetically generated galaxies as well as observations using both H and CO emission lines.
Testing on synthetically generated galaxies has highlighted the powerful performance of the network as well as areas where the network's performance is sub optimal. For the latter areas we have discussed solutions including: an additional convolution with the restoring beam to counteract the effects of "beam smearing", and flagging high inclination data in a large beam and high inclination regime.
Testing observational H data from THINGS has shown that this style of network is well suited to work with data like that expected from the SKA in the near future. We have shown that the network is capable of estimating velocity curves for discs exhibiting a variety of profiles. In order to do this, we have directly compared the rotation curves estimated by the network to those modelled directly from the cubes using kinematic modelling tools. The network is able to perform adequate recovery of parameters even in cases where it would not be possible to reproduce the true rotation curves. These promising results give us confidence that adopting more flexible decoder functions will extend the applicability of the model for more specific use cases should one wish to model H discs exclusively.
Testing observational CO data from the WISDOM project has shown that the network is suitable for a range of emission line observations. Unlike traditional ML models, the network architecture and training styles outlined in this work prevent the need for transfer learning which is often time consuming and fraught with ungainly challenges associated with systematic properties of training sets. We have shown that the model outlined in this work can recover rotation curves which heuristically match rotation curves extracted from ALMA observations using more time-consuming approaches.
As previously stated, improvements to the model architecture in this work include but are not limited to: adapting the model to use more complex intensity and velocity profiles in the decoder subnet, automatically accounting for large beam effects such as beam smearing and information loss either via systematic offsets in model predictions or via the incorporation of an extra convolutional layer in the decoder subnet, and reintroducing a position angle estimation step. An idealised improvement on the model would be to work directly with interferometric datacubes themselves, or even visibilities, without the need to generate moment maps prior to training and testing. However, we have found that the discretised nature of channels in interferometric datacubes presents a non-gradient-trackable step in the decoder's reconstruction of datacube inputs. This discon-tinuity in the gradient tree prevents back propagation via gradient descent and consequently halts model training. We propose adapting this self-supervised approach to work with datacubes as a lucrative avenue of research for challenging current kinematic modelling tools in preparation for the SKA and other upcoming large facilities. Table A1. Information regarding the THINGS sample galaxies used throughout his work. Columns give the following information: Object, the target name as given in THINGS project publications, Publication, records the relevant publication in which the THINGS targets appear.

DATA AVAILABILITY
The data and scripts underlying this article are available via GitHub, at https://github.com/SpaceMeerkat/Corellia.

APPENDIX A: EXTRA MATERIAL
This paper has been typeset from a T E X/L A T E X file prepared by the author. . The effects of varying the ratio of beam size to galaxy extent. It is clear to see that an increased beam size results in an artificial lengthening of the intensity profile scale length. It can also be seen that the spread in median offset increases with r scale , which occurs due to information loss as the convolved flux is "smeared" out beyond the field of view. The value of r scale at which this effect begins to take hold is clearly inversely proportional to the beamsize.  Davis et al. (2017a)