E(2)-Equivariant Features in Machine Learning for Morphological Classification of Radio Galaxies

With the growth of data from new radio telescope facilities, machine-learning approaches to the morphological classification of radio galaxies are increasingly being utilised. However, while widely employed deep-learning models using convolutional neural networks (CNNs) are equivariant to translations within images, neither CNNs nor most other machine-learning approaches are equivariant to additional isometries of the Euclidean plane, such as rotations and reflections. Recent work has attempted to address this by using G-steerable CNNs, designed to be equivariant to a specified subset of 2-dimensional Euclidean, E(2), transformations. Although this approach improved model performance, the computational costs were a recognised drawback. Here we consider the use of directly extracted E(2)-equivariant features for the classification of radio galaxies. Specifically, we investigate the use of Minkowski functionals (MFs), Haralick features (HFs) and elliptical Fourier descriptors (EFDs). We show that, while these features do not perform equivalently well to CNNs in terms of accuracy, they are able to inform the classification of radio galaxies, requiring ~50 times less computational runtime. We demonstrate that MFs are the most informative, EFDs the least informative, and show that combinations of all three result in only incrementally improved performance, which we suggest is due to information overlap between feature sets.


INTRODUCTION
Radio galaxies are galaxies with strong radio emission, typically forming jets and lobes on either side of the host galaxy, which originate from the synchrotron emission of active galactic nuclei (AGNs). The morphologies of these radio galaxies were first classified into two types by Fanaroff & Riley (1974), known as FR type I (FR I) and type II (FR II). These two classes are based on the relative locations of highest surface brightness: in FR I, the brightest area lies towards the centre of the AGN, known as edge-darkened, while in FR II the brightest regions lie further away from the centre of the galaxy, known as edge-brightened. These are classified based on their Fanaroff-Riley ratio R_FR, defined as the ratio of the distance between the two brightest regions located on opposing sides of the galaxy to the total extension of the galaxy at the lowest threshold. FR I galaxies are those with R_FR < 0.5, and FR II are those with R_FR > 0.5. In the original investigation by Fanaroff & Riley (1974), these two types were shown to be divided in power: FR I radio galaxies typically have powers at 1.4 GHz below P_1.4 = 10^25 W Hz^-1, while FR II galaxies typically have powers above this. Understanding the formation of these different radio galaxies, the accretion process, and the interactions of the jets with the surrounding environment remains an active area of research, and requires large numbers of FR-type-labelled radio galaxies (Hardcastle & Croston 2020).
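The R_FR criterion above can be sketched in a few lines (an illustrative helper only; the function names and the idealised inputs, peak separation and total extent in matching units, are ours):

```python
def fr_ratio(peak_separation, total_extent):
    """Fanaroff-Riley ratio: distance between the two brightest regions
    on opposing sides of the core, divided by the total extension of the
    source at the lowest threshold."""
    return peak_separation / total_extent

def fr_class(r_fr):
    # R_FR < 0.5 -> edge-darkened FR I; R_FR > 0.5 -> edge-brightened FR II
    return "FR I" if r_fr < 0.5 else "FR II"
```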
★ Equal contribution: natalie.lines@port.ac.uk
† Equal contribution: joan.fontquer.roset@gmail.com

Upcoming telescopes are expected to detect radio galaxies in the millions, with the Square Kilometre Array (SKA) expected to detect upwards of 500 million radio sources (Norris et al. 2015), and the Evolutionary Map of the Universe (EMU) project expected to find 70 million radio sources (Norris et al. 2011), of which around 10% are estimated to be complex objects that will require cross-identification. Though attempts have been made to manually sort large datasets of galaxies, such as the Radio Galaxy Zoo project (Banfield et al. 2015), automated radio galaxy classification is becoming more essential to allow for a full exploitation of the data available.

Machine Learning for Radio Galaxy Classification
Many different machine learning models have been implemented in astronomy for the task of radio galaxy classification. In a review of recent work on radio galaxy morphological classification by Ndung'u et al. (2023), convolutional neural networks (CNNs) are shown to be the most popular machine learning method used. CNNs use convolutional layers to extract features from an image, which are integrated into a neural network. CNNs were first developed by Fukushima (1980), and are the standard tool for computational vision due to their ability to capture information stored in complex images. First examples of using neural networks for classification in astrophysics date back decades to works such as Odewahn et al. (1992) and Weir et al. (1995), and they were first applied to the task of FR classification by Aniyan & Thorat (2017). Since then, works such as those by Lukic et al. (2019) and Ma et al. (2019) have built on this, developing models with accuracies of 93-94%.
In addition to CNNs, other machine learning models have been used for the problem of FR classification. A review of different conventional machine learning methods has shown that random forests were the best of the tested models¹ (Becker & Grobler 2019), achieving an accuracy of 95% using features such as lobe size and brightness as inputs to the model. Gradient boosting models such as XGBoost have been found to perform similarly to random forests (Darya et al. 2023).
However, many of these models have drawbacks. CNNs extract information from images through the use of kernels, which contain learnable weights, allowing CNNs to adapt to learning different features. This requires the training of more parameters, which is less computationally efficient, and although recent work has reduced the number of parameters needed while maintaining similar performance levels (see e.g. Bowles et al. 2021), there is also wider interest in developing more computationally efficient approaches, such as the trainable COSFIRE descriptor (Ndung'u et al. 2024). Another drawback to CNNs is that although they are translationally equivariant, the convolutional steps mean that they are not intrinsically equivariant to rotations and reflections, and as such two images of the same galaxy at different orientations are not necessarily classified consistently. When training models on augmented datasets, in which the dataset includes multiple transformed copies of the same image, the model is required to independently classify the same object at different orientations. Data augmentation has been used in works such as that by Becker et al. (2021), and has been shown to increase performance under E(2) symmetries (Maslej-Krešňáková et al. 2021).
One data space solution to the non-equivariance of CNNs, used in the works by Polsterer et al. (2019) and Brand et al. (2023), involves normalising the orientation of the galaxies, which is shown to improve both the accuracy and the training time of the model (Brand et al. 2023). A model space solution is the use of G-steerable CNNs (Cohen & Welling 2016), in which the desired symmetries are encoded into the weight sharing of the kernel, ensuring mathematical equivariance under these symmetries. This has been shown to be more effective than data augmentation, and has been applied to the problem of FR classification by Scaife & Porter (2021), where the accuracy improves from 94% for a conventional CNN to 97% for a G-steerable CNN, though at the cost of increased computational time.
In this work we test alternative methods of feature extraction that are equivariant under rotations, reflections and translations to replace the convolutional steps of a standard CNN when classifying radio galaxies according to their FR type. The structure of this paper is as follows: in Section 2 we define the concepts of isometric invariance and equivariance, including how these can be built into convolutions; in Section 3 we describe the equivariant feature sets used in this work; in Section 4 we introduce the data set that will be used, including the pre-processing steps that are applied; and in Section 5 we outline the machine learning methods that are employed, including model training and hyperparameter tuning details. In Section 6 we report on the performance of the trained models for different feature set combinations, and in Section 7 we evaluate these results and discuss the importance of different features, with additional investigation of the representation space for each feature set using dimensionality reduction. In Section 8 we draw our conclusions.

¹ These were: Nearest Neighbours, Linear Support Vector Machine (SVM), Radial Basis Function SVM, Gaussian Process Regression, Random Forest, Multi-layered Perceptron Neural Network, AdaBoosted Decision Tree, Naive Bayes and Quadratic Discriminant Analysis.

ISOMETRIC EQUIVARIANCE AND INVARIANCE
Here we introduce a simplified explanation of the concepts of equivariance and invariance for Euclidean transformations, including the dihedral and cyclic subgroups of the orthogonal group O(n), following Weyl (1980). For the n-dimensional real space R^n, equivariance is described as follows: let f be a mapping f : X → Y, where X and Y are subsets of the real space, X, Y ⊆ R^n. Suppose G is a group of transformations acting on X and Y. If f commutes with each element g of G for all elements x of X, i.e.

f(ρ(g) · x) = ρ(g) · f(x)  for all g ∈ G, x ∈ X,

then f is said to be equivariant with respect to G, where ρ(g) is the representation of the action g in R^n. If instead

f(ρ(g) · x) = f(x)  for all g ∈ G, x ∈ X,

then f is said to be invariant with respect to G.
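These two definitions can be checked numerically for the simple case of 90-degree rotations acting on an image array (a minimal numpy sketch; the choice of example maps is ours):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))          # an element of X, an 8x8 image
rho = lambda a: np.rot90(a)          # representation of a 90-degree rotation

# An elementwise map commutes with the rotation, f(rho(x)) = rho(f(x)),
# so it is *equivariant* with respect to the group of 90-degree rotations.
f_equi = lambda a: np.maximum(a, 0.0)
assert np.allclose(f_equi(rho(x)), rho(f_equi(x)))

# The total intensity is unchanged by the rotation, f(rho(x)) = f(x),
# so it is *invariant*.
f_inv = lambda a: a.sum()
assert np.isclose(f_inv(rho(x)), f_inv(x))
```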
The n-dimensional Euclidean group, E(n), is the set of transformations that preserve Euclidean distance, and comprises translations, rotations and reflections. Given t ∈ R^n, a translation by t is defined by the map x ↦ x + t. The set of n-dimensional vectors with real coefficients under vector addition, (R^n, +), is the group of all translations. The group of non-translational distance-preserving transformations of the n-dimensional Euclidean space is the orthogonal group, O(n), defined as O(n) = {A ∈ GL(n) | A^T A = A A^T = I}, where GL(n) is the set of invertible n × n matrices. In order to preserve distances, the magnitude of the determinant of these matrices has to equal 1, and as such the orthogonal group is split into two components: matrices with determinant 1, and matrices with determinant −1.
The set of matrices in O(n) with a determinant of 1 forms a subgroup, known as the special orthogonal group, SO(n) = {A ∈ O(n) | det(A) = 1}. In two dimensions, the special orthogonal group is the group of all rotations of the plane. For example, a rotation by an angle θ of a set of points (x, y) ∈ R^2 is expressed as

(x, y) ↦ (x cos θ − y sin θ, x sin θ + y cos θ).

The set of orthogonal matrices of determinant −1, which is given by O(n) \ SO(n) = {A ∈ O(n) | det(A) = −1}, represents reflections in two dimensions. A reflection about the line at an angle θ of a set of points (x, y) ∈ R^2 is written as

(x, y) ↦ (x cos 2θ + y sin 2θ, x sin 2θ − y cos 2θ).

Consequently, every element of O(2) is either a rotation or a reflection, and the Euclidean group E(n) can be expressed as the semidirect product E(n) = (R^n, +) ⋊ O(n). Therefore, for a map to be equivariant to all Euclidean isometries, it must be equivariant with respect to (R^n, +) and O(n). Commonly used subgroups of E(2) are the cyclic group C_n, which contains all rotations by multiples of 2π/n, and the dihedral group D_n, which contains all the rotations of C_n in addition to n reflections.
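A small numpy check of these definitions (the function names are ours) confirms that both families of matrices are orthogonal, and are distinguished by the sign of their determinant:

```python
import numpy as np

def rotation(theta):
    # element of SO(2): det = +1
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

def reflection(theta):
    # reflection about the line at angle theta: det = -1
    return np.array([[np.cos(2 * theta),  np.sin(2 * theta)],
                     [np.sin(2 * theta), -np.cos(2 * theta)]])

for M in (rotation(0.7), reflection(0.7)):
    # both are orthogonal, M^T M = I, so Euclidean distances are preserved
    assert np.allclose(M.T @ M, np.eye(2))

assert np.isclose(np.linalg.det(rotation(0.7)), 1.0)
assert np.isclose(np.linalg.det(reflection(0.7)), -1.0)
```

Note that applying the same reflection twice returns the identity, as expected for an order-2 element of O(2).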

Group-Equivariant Convolutions
The convolutional layers of a CNN convolve the input image with a kernel of weights, which are parameters learned by the CNN. Convolutions involve taking the dot product between the kernel and every subsection of the image of the same shape as the kernel, producing a feature map. As this process involves applying the kernel systematically to every subsection of the image, the weights of the kernel are shared, meaning this process is equivariant to translations in the input image up to edge effects. This translational equivariance is expressed as

[ρ(t_trans) · f] ★ ψ = ρ(t_trans) · [f ★ ψ]

for a kernel ψ, translation ρ(t_trans) and image f : R^2 → R, where ★ represents convolution. Convolutions are linear operations, so an activation function is applied to the feature map to introduce nonlinearity. Following this, a pooling layer is typically used to decrease the size of the feature maps. This is done by dividing the feature map into equally sized subsections and representing each subsection by a single value such as the mean or the maximum of the pixels in the subsection. This allows the representation to be invariant to small translations in the input, and also increases computational efficiency due to the downsampling. Though equivariant to translations, convolutions are not equivariant to rotations and reflections. Instead, the convolution of a rotated image, ρ(t_rot) · f, with a kernel is equivalent to convolving the original image with the inverse-rotated kernel, ρ(t_rot^−1) · ψ, and then rotating the output, expressed as

[ρ(t_rot) · f] ★ ψ = ρ(t_rot) · [f ★ (ρ(t_rot^−1) · ψ)]

(Cohen & Welling 2016). Equivariance of convolutions under other symmetry transformations can be imposed by restricting the form of the convolution kernel. Inspection of the equation above shows that this satisfies the definition of equivariance when the kernel, ψ, also encompasses ρ(t_rot^−1) · ψ. Equivalently, we can speak of equivariance not just as convolving with one kernel but with multiple, i.e.
if instead of convolving an image with a single kernel, we convolve it with a stack of kernels that correspond to every possible rotation of the original kernel to produce a stack of feature maps, then this stack convolution will be equivariant with respect to rotation (Cohen & Welling 2016). This motivates the extension of the definition of the convolution operation for a group action g ∈ G to

[f ★_G ψ](g) = Σ_{h ∈ X} f(h) ψ(g^−1 h),

where X is the input channel, X = R^2 for the first layer (lifting convolutions) and X = G for the following layers (G-convolutions). The result of this is a feature map of higher dimensionality than the input image, due to the dependence of the definition of the convolution operation, ★_G, on the group action, g. When G is the set of all translations, this simplifies to the standard definition of the convolution operation due to the commutativity of g and h. Furthermore, to allow for equivariance, the kernel should satisfy the steerability constraint

ψ(g · x) = ρ_out(g) ψ(x) ρ_in(g)^−1

for all g ∈ G and x ∈ R^2.
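The lifting convolution for the cyclic group C_4 can be sketched directly in numpy (an illustrative toy implementation, not the G-steerable machinery used in the works cited; all names are ours). Rotating the input rotates every feature map and cyclically permutes the rotation channel, which is exactly the equivariance property described above:

```python
import numpy as np

def corr2d(x, k):
    """Valid cross-correlation of image x with kernel k."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def lift_c4(x, k):
    """Lifting convolution for C4: one feature map per kernel rotation."""
    return [corr2d(x, np.rot90(k, j)) for j in range(4)]

rng = np.random.default_rng(1)
x, k = rng.normal(size=(6, 6)), rng.normal(size=(3, 3))

# Rotating the input rotates each feature map and shifts the
# rotation channel by one: the stack is C4-equivariant.
f, f_rot = lift_c4(x, k), lift_c4(np.rot90(x), k)
for j in range(4):
    assert np.allclose(f_rot[j], np.rot90(f[(j - 1) % 4]))
```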

Minkowski Functionals
Minkowski functionals (MFs) are a class of morphological descriptors of continuous fields, developed by Minkowski (1903). They provide a generalisation of the smooth curvature integrals for the number of dimensions of the data, quantifying its shape and connectivity. Their previous applications primarily consist of statistical descriptors of non-Gaussian fields, such as spatial patterns in the large-scale structure of the universe (Mecke et al. 1994), weak lensing convergence maps (Munshi et al. 2012), galaxy clustering (Grewal et al. 2022), and the cosmic microwave background (Schmalzing & Gorski 1998), and they are often used in the context of providing an alternative to the n-point correlation function. They are useful in these contexts due to their ability to capture information on larger scales than is typical of 2-point correlation functions, and can encompass correlation at arbitrary orders. Though previous works typically use MFs as topological descriptors of random fields, their applicability extends to any set whose boundary is smooth. MFs are invariant under E(2) symmetries, and for n dimensions there are n + 1 MFs. Furthermore, Hadwiger (1951) showed that an additive functional in n dimensions is continuous and both translationally and rotationally invariant if and only if it is a linear combination of the n + 1 MFs in that number of dimensions, making them a natural tool for our work. These n + 1 MFs depend on a threshold, ν, used to define the excursion set of points above the threshold, converting the input to binary form. Consequently, in two dimensions there are three MFs, proportional to the area, perimeter, and Euler characteristic respectively. A full mathematical definition can be found in Appendix A1. The values of the MFs evaluated at three different thresholds for an FR I and FR II galaxy are shown in Figure 1.
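A pixel-counting sketch of the three 2D MFs for a binary excursion set (an illustration only; in this work the MFs are computed with QuantImPy, and the Euler characteristic below uses 4-connectivity):

```python
import numpy as np

def minkowski_2d(excursion):
    """Area, perimeter and Euler characteristic of a binary excursion
    set, treating each True pixel as a unit square."""
    s = excursion.astype(int)
    n = s.sum()                                   # filled pixels
    h = (s[:, :-1] & s[:, 1:]).sum()              # horizontal adjacencies
    v = (s[:-1, :] & s[1:, :]).sum()              # vertical adjacencies
    q = (s[:-1, :-1] & s[:-1, 1:] &
         s[1:, :-1] & s[1:, 1:]).sum()            # filled 2x2 blocks
    area = n
    perimeter = 4 * n - 2 * (h + v)               # unshared pixel edges
    euler = n - h - v + q                         # 4-connectivity Euler number
    return area, perimeter, euler

# A 3x3 ring of pixels encloses one hole, so its Euler characteristic is 0.
ring = np.ones((3, 3), dtype=bool)
ring[1, 1] = False
assert minkowski_2d(ring) == (8, 16, 0)
```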

Haralick Features
Haralick features (HFs) are a set of features proposed by Haralick et al. (1973) as mathematical descriptors of image texture, with applicability to classification problems in mind. They are calculated from the Grey Level Co-occurrence Matrix (GLCM), which encodes information about the spatial distribution of variations in pixel brightness of an image.

[Figure 2 caption: Top row shows galaxies of increasing contrast, the second HF. Middle row shows galaxies of increasing sum of squares variance, the fourth HF. Bottom row shows galaxies of increasing sum entropy, the eighth HF.]

HFs have previously been implemented in astronomy by Ntwaetsile & Geach (2021), using these features to sort radio galaxies in an unsupervised setting, though their previous applications in the context of machine learning have primarily been in medical imaging for tasks such as cellular dose distributions (Mansour & Thomson 2023) and tumour classification (Chen et al. 2007). There are 14 HFs that can be calculated from the GLCM; however, previous studies have found the 14th feature to be numerically unstable, so only the first 13 are typically used.
The GLCM contains textural information through encoding the relative pixel brightness of adjacent pixels. Unlike in convolutions, the resulting matrix does not preserve any topology of the image, but rather describes the frequency at which two pixels of certain brightness appear relative to each other. For an image where each pixel (x, y) has a value I(x, y) ∈ {0, 1, ..., N_max}, the resulting GLCM, P(i, j), will have shape (N_max + 1) × (N_max + 1). Two pixels are said to be related, (x_1, y_1) ∼ (x_2, y_2), if they have a (user-)specified relative positioning. For example: horizontally adjacent, in which case (x_1, y_1) ∼ (x_2, y_2) ⟺ (y_1 = y_2 and x_1 = x_2 ± 1).
The matrix P(i, j) is then defined as

P(i, j) = #{ (x_1, y_1) ∼ (x_2, y_2) : I(x_1, y_1) = i, I(x_2, y_2) = j },

where for each matrix entry, the index is the pair of pixel brightnesses being considered, and the value is the number of times two pixels of those brightnesses appear with the specified relative positioning. The process of calculating the matrix elements, P(i, j), is invariant under translations and equivariant under reflections and rotations (e.g. the number of pixels horizontally adjacent in the image is the same as the number of pixels vertically adjacent in the image after rotation by 90°). Rotational and reflectional invariance of the features extracted from this matrix can be imposed by restricting the form the relation can take: reflectional invariance can be imposed by requiring the relation to be symmetric, (x_1, y_1) ∼ (x_2, y_2) ⟹ (x_2, y_2) ∼ (x_1, y_1), which is an a priori assumption for any metric-like relation; rotational invariance can be imposed by requiring the relation to be independent of angle, or by calculating the HFs for all possible angles at a set distance and computing the average. From the GLCM, 14 different textural features can be extracted, each quantifying a different textural property such as contrast, variance, and entropy, as shown in Figure 2. A list of all HFs along with their mathematical definitions is given in Appendix A2.
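The construction above can be sketched in numpy, including the four-direction averaging that imposes rotational invariance (an illustrative toy version; the HFs in this work are computed with Mahotas, and only the contrast feature is shown here):

```python
import numpy as np

def glcm(img, offset, levels):
    """Symmetric grey level co-occurrence matrix for one pixel offset."""
    di, dj = offset
    p = np.zeros((levels, levels))
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            i2, j2 = i + di, j + dj
            if 0 <= i2 < h and 0 <= j2 < w:
                p[img[i, j], img[i2, j2]] += 1
                p[img[i2, j2], img[i, j]] += 1   # symmetric relation
    return p / p.sum()

def contrast(p):
    """Haralick contrast: sum over (i - j)^2 * p(i, j)."""
    i, j = np.indices(p.shape)
    return ((i - j) ** 2 * p).sum()

def mean_contrast(img, levels):
    # average over the four adjacency directions -> rotational invariance
    offsets = [(0, 1), (1, 0), (1, 1), (1, -1)]
    return np.mean([contrast(glcm(img, o, levels)) for o in offsets])

rng = np.random.default_rng(2)
img = rng.integers(0, 8, size=(16, 16))
# the direction-averaged feature is unchanged by a 90-degree rotation
assert np.isclose(mean_contrast(img, 8), mean_contrast(np.rot90(img), 8))
```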

Elliptical Fourier Descriptors
Elliptical Fourier analysis was formalised by Kuhl & Giardina (1982), and provides a tool to describe complex contours by approximating a contour as sums of sine and cosine functions. Previous uses include morphometry (Chitwood & Otoni 2017), describing lake morphologies on Titan (Dhingra et al. 2019), and the Hubble classification of galaxies (Dutta et al. 2013). In normal Fourier analysis, a contour in two dimensions is converted into a single curve in the frequency domain which is then approximated, treating the two dimensions simultaneously. This simultaneous parametrisation means the analysis is unable to fit any contour which folds back on itself with respect to the polar angle, limiting its applicability to polar convex shapes. In elliptical Fourier analysis, the x and y coordinates are reconstructed in the frequency domain independently, where the harmonics are ellipses. It is these Fourier coefficients, {a_n, b_n, c_n, d_n}, that are referred to as the elliptical Fourier descriptors (EFDs) of order n.
In two dimensions, the x-coordinates and y-coordinates are separately decomposed as

x(t) = a_0 + Σ_{n=1}^{N} [a_n cos(nt) + b_n sin(nt)],
y(t) = c_0 + Σ_{n=1}^{N} [c_n cos(nt) + d_n sin(nt)],

where the maximum order N sets a limit on how close the approximation is to the true contour, and t ∈ [0, 2π). More mathematical detail on these features can be found in Appendix A3. Due to the coordinate-dependence of these functions, EFDs are not invariant under E(2) symmetries. They are rotationally and reflectionally equivariant, and the translational information is captured in the constants a_0 and c_0, about which the contour is centred. Many implementations of EFDs impose invariance, to allow for direct comparison of the EFDs, by rotating the image to align the major axis of the n = 1 ellipse with the x-axis, as shown in Figure 3. They can further be made scale invariant by normalising the length of the major axis. As with MFs, calculation of EFDs requires a threshold, ν, to be defined, such that the curve being approximated is the boundary of the excursion set, Σ(ν).
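Assuming a closed contour sampled uniformly in the parameter t (the full Kuhl & Giardina formulation also handles unevenly spaced points), the coefficients reduce to discrete Fourier sums; a numpy sketch, with names of our choosing:

```python
import numpy as np

def efd(xy, order):
    """Elliptical Fourier coefficients {a_n, b_n, c_n, d_n} of a closed
    contour, assuming points uniformly spaced in the parameter t."""
    n_pts = len(xy)
    t = 2 * np.pi * np.arange(n_pts) / n_pts
    coeffs = []
    for n in range(1, order + 1):
        a_n = 2 / n_pts * (xy[:, 0] * np.cos(n * t)).sum()
        b_n = 2 / n_pts * (xy[:, 0] * np.sin(n * t)).sum()
        c_n = 2 / n_pts * (xy[:, 1] * np.cos(n * t)).sum()
        d_n = 2 / n_pts * (xy[:, 1] * np.sin(n * t)).sum()
        coeffs.append((a_n, b_n, c_n, d_n))
    return coeffs

# The n = 1 harmonic of an axis-aligned ellipse recovers its semi-axes.
t = 2 * np.pi * np.arange(256) / 256
ellipse = np.column_stack([3.0 * np.cos(t), 1.5 * np.sin(t)])
(a1, b1, c1, d1), = efd(ellipse, 1)
assert np.allclose([a1, b1, c1, d1], [3.0, 0.0, 0.0, 1.5], atol=1e-9)
```

Rotating the contour mixes the coefficients within each order, which is the equivariance (rather than invariance) noted above.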

MiraBest
MiraBest is a catalogue of labelled radio galaxies compiled by Miraghaei & Best (2017). It comprises data from Data Release 7 (Abazajian et al. 2009) of the Sloan Digital Sky Survey (SDSS; York et al. 2000), the NRAO Very Large Array (VLA) Sky Survey (NVSS; Condon et al. 1998), and the Faint Images of the Radio Sky at Twenty centimetres (FIRST) survey (Becker et al. 1995), cross-matched by Best & Heckman (2012). With the aim that these galaxies be morphologically classifiable, the dataset is restricted to extended galaxies containing AGN, and a 40 mJy flux cut-off is imposed to exclude extended galaxies that are too noisy to classify. Miraghaei & Best (2017) manually classified all of these galaxies into types FR I, FR II, hybrid, and unclassifiable. The hybrid class contains radio galaxies that would be classified as FR I according to one side of the galaxy but as FR II according to the opposite side, while the unclassifiable galaxies showed no strong characteristics of either FR I or FR II. In addition to the morphological type, each galaxy was given a label of either 'confident' or 'uncertain' based on the certainty of the classification, and a sub-morphology out of standard, double-double, wide-angle tail, diffuse, or head-tail. The final catalogue contains 1329 sources.
In this work we use a version of the MiraBest dataset processed by Porter & Scaife (2023) for the context of machine learning. As part of this processing, the unclassifiable radio galaxies were removed from the dataset, as well as any galaxies larger than a 150 × 150 pixel grid, any images that contained no data, and one galaxy that was in a morphological subclass of its own. The images had background noise removed, were centred, cropped to a 150 × 150 pixel grid (corresponding to 270 arcsec for images from the FIRST survey), and the pixel values were normalised to the range 0 to 255. Additionally, the data was organised into 8 batches², with one labelled as a reserved test batch. The final dataset contains 1256 radio galaxies, of which 833 are labelled as confidently classified as either FR I (397 samples) or FR II (436 samples) with any sub-morphology.

Data Pre-processing
In this study, we use the subset of 833 confidently classified FR I and FR II galaxies, known as 'MBFRConfident', which is split into training (87.5%) and testing (12.5%) data. Including only FR I and FR II and ignoring sub-morphologies simplifies the classification problem to a binary one, and as the FR I and FR II types make up the majority of the data, this avoids the complexities of adding extra classes with an unbalanced split. Using only confidently labelled galaxies further simplifies the problem, as it means that during the training process performance on each galaxy can be treated equally. If we included uncertain data, either the model would have to learn to classify confident and uncertain galaxies equivalently, which is undesirable especially where the uncertainty of the label potentially corresponds to a misclassification, or we would have to introduce a mechanism through which its predictions for confident and uncertain galaxies are treated differently. Though the full dataset of both confident and uncertain galaxies provides a better representation of radio galaxies as a whole, we want to ensure that our model is able to deal with the simplified case before extending its adaptability.
In machine learning, an increased convergence rate can be obtained by normalising the input data such that the values are close to zero with a standard deviation of 1 (LeCun et al. 2012). Following this, we normalise the images using z-score normalisation:

normalised pixel = (pixel − mean) / standard deviation.

For the MBFRConfident dataset, the mean is 0.0031 and the standard deviation is 0.0350 (Porter & Scaife 2023). The only other pre-processing applied is for the calculation of HFs, as this requires an image of integer pixel values, which we obtain through rounding.
As the range of this background-removed data is −0.09 to 28.48, this results in a GLCM of size 29 × 29, which stores adequate information to calculate the HFs at relatively high precision while not impeding computational efficiency.
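The normalisation and the integer quantisation used for the GLCM can be sketched as follows (an illustration; the clipping of any slightly negative z-scored values to zero is our assumption about the rounding step):

```python
import numpy as np

# dataset-wide statistics quoted for MBFRConfident (Porter & Scaife 2023)
MEAN, STD = 0.0031, 0.0350

def preprocess(image):
    """z-score normalise an image, and produce the integer grey-level
    copy used for the GLCM (rounded, clipped to be non-negative)."""
    z = (image - MEAN) / STD
    grey = np.clip(np.rint(z), 0, None).astype(int)
    return z, grey

rng = np.random.default_rng(3)
image = rng.uniform(0.0, 1.0, size=(8, 8))
z, grey = preprocess(image)
assert np.allclose(z * STD + MEAN, image)   # normalisation is invertible
assert grey.min() >= 0                      # valid GLCM indices
```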

Neural Network
To allow for a comparison, our neural network architecture follows that of Scaife & Porter (2021), which is based on a LeNet (LeCun et al. 1998) structure, consisting of two convolutional and pooling layers followed by three fully connected layers. In our model, we replace the two convolutional layers with the feature calculation process, which contains no learnable parameters. Each of the three fully connected layers is followed by an activation function: ReLU in the case of the two hidden layers and softmax for the output layer.
The input size of the first layer is the number of features being used, the final fully connected layer has size 2, and the two hidden layers have an unspecified number of neurons, which is varied during the hyperparameter tuning. Dropout is implemented before the final layer, with the dropout probability also being a hyperparameter. We use the cross-entropy loss function, and an Adam (Kingma & Ba 2014) optimiser to modify the learning rates, with the values of initial learning rate and weight decay as hyperparameters. This optimiser has been shown to be one of the best performing gradient descent optimisation algorithms (Ruder 2016). The test data is pre-defined, making up 12.5% of the MBFRConfident dataset, and we split the remaining data such that the training data is 70% of the full dataset and the validation data is the remaining 17.5%, as this has been found to produce the best performance (Nguyen et al. 2021). We run each of the features individually as well as together to gain an understanding of the information content of the features and to allow for a comparison. Each run has 1000 epochs, with early stopping implemented to save the best model, which is used in testing, and we shuffle the input data each epoch to avoid the model memorising the data. Following hyperparameter selection, as described further in Section 5.3, we used 10 realisations of the model, each trained using a randomised train/validation split and randomised initial parameter values, evaluated on the reserved test set. From these runs, the mean and standard deviation of the performance of the model are calculated.
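A PyTorch sketch of this classifier head (the layer sizes, dropout probability and optimiser values below are placeholders for the tuned hyperparameters; nn.CrossEntropyLoss applies the softmax internally, so the module returns raw logits):

```python
import torch
from torch import nn

class FeatureClassifier(nn.Module):
    """Three fully connected layers acting on a precomputed feature
    vector (MFs/HFs/EFDs); hidden size and dropout are hyperparameters."""
    def __init__(self, n_features, hidden=64, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Dropout(p_drop),            # dropout before the final layer
            nn.Linear(hidden, 2),          # FR I vs FR II logits
        )

    def forward(self, x):
        # nn.CrossEntropyLoss applies the softmax internally,
        # so raw logits are returned here
        return self.net(x)

model = FeatureClassifier(n_features=39)   # e.g. 3 MFs at 13 thresholds
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
logits = model(torch.randn(4, 39))
assert logits.shape == (4, 2)
```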
Seven unique model types were trained: three using each individual feature type, three using each possible pair of feature types, and a final model that combines all three feature extraction methods. All neural network models were built using the PyTorch (Paszke et al. 2019) machine learning framework.

XGBoost
Decision trees form the basis of many machine learning algorithms; however, while individual decision trees are conceptually simple and can be applied to non-linear problems, they are very unstable to changes in training data and are strongly prone to overfitting (see e.g. Louppe 2014). To mitigate these effects, decision trees can be combined through ensemble learning to produce more powerful models.
One such ensemble learning algorithm is the random forest. This involves creating an ensemble of decision trees, each of which is constructed on a bootstrapped sample of the datapoints in the dataset, referred to as bagging. Random forests extend this by adding a level of bootstrap sampling to the features, from which the decision nodes are constructed. This allows for better generalisation, as it prevents overfitting and forces the model not to rely heavily on any one feature. Random forests often outperform neural networks for classification tasks on lower dimensional, less complex data. They have the benefit over neural networks that they tend to overfit less, are less computationally expensive, and perform reasonably well without the requirement of large training datasets. Their simplicity compared to neural networks also allows for better interpretability, which can provide insight into the features of the data.
The alternative to bagging is boosting, in which decision trees are constructed sequentially, each learning from the mistakes of the previous decision trees. The first instances of boosting used an algorithm known as adaptive boosting (AdaBoost; Freund & Schapire 1997), from which the more flexible gradient boosting algorithm was developed by Friedman (2001).
Extreme Gradient Boosting (XGBoost; Chen & Guestrin 2016) is a tree ensemble machine learning model that can be trained to predict class labels. XGBoost is a type of gradient boosting that is optimised for speed, computational efficiency, and performance through the integration of features such as parallelisation. The XGBoost algorithm follows the same structure as the one previously described, with an additional regularisation term added to the loss, similar to the weight decay parameter for neural networks. XGBoost tends to outperform random forests, but is prone to overfitting if hyperparameters are not tuned properly. Some of the key hyperparameters are the maximum depth of the decision trees, the number of iterations, the learning rate, and the loss function.
The training set-up for XGBoost is similar to that of the neural network, but the model is trained across a varying number of decision trees instead of a set number of epochs, and is tested on all seven of the possible combinations of features. We again use the cross-entropy loss, with no added class weights due to the balance of the dataset. The training, validation and test splits are the same as those for the neural network, and for testing we again train the model with a different initial setup 10 times. XGBoost provides feature importances, which we average over the 10 test runs.

Hyperparameter Tuning
The performance of neural networks and XGBoost depends on the values of their hyperparameters, and hence optimisation of these is essential. The neural network hyperparameters are the number of thresholds, order, layer size, dropout probability, initial learning rate and weight decay. The XGBoost hyperparameters are the number of thresholds, order, number of estimators and learning rate. Details of priors and optimal values are given in Appendix B.
The space of all possible hyperparameters is very large, and as the optimal values of hyperparameters are not independent, careful consideration is needed to ensure an optimal set of hyperparameters is found. For this, we implement Bayesian optimisation using the developer platform Weights & Biases (Biewald 2020), with at least 100 runs being executed before a set of hyperparameters was chosen. Bayesian optimisation searches the space of all possible hyperparameters in a way that is informed by the performance of previous runs. It constructs a probability distribution of the objective function in the hyperparameter space, and then tests the set of hyperparameters with the highest probability of being optimal according to Bayes' theorem. This speeds up the process of finding optimal hyperparameters compared to alternative search algorithms such as grid search or random search, and has been shown to find a better set of hyperparameters in the same number of runs (Snoek et al. 2012). When hyperparameter tuning, the performance of each set of hyperparameters is averaged over 10 runs to mitigate the effect of fluctuations from different initialisations and data splits.

Feature Calculation
In addition to these hyperparameters, there are other variables we can control. As well as considering which of these features to include, the features themselves depend on the threshold in the case of MFs and EFDs, and on the order, n, in the case of EFDs. These two parameters were therefore added to the list of hyperparameters varied during the tuning stage. The thresholds can vary in two ways: the number of thresholds, N, from which the features are calculated, and the values of each of these thresholds. We keep the number of thresholds as a hyperparameter and, for simplicity, define the threshold values to be linearly spaced from the minimum pixel value to the maximum pixel value.
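The threshold scheme described above can be sketched in a few lines; the function name is illustrative, not from the paper's code.

```python
# Minimal sketch of the threshold scheme: threshold values linearly
# spaced from the minimum to the maximum pixel value of the image.
def linear_thresholds(image, num_thresholds):
    """Return `num_thresholds` values evenly spaced over the pixel range."""
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)
    step = (hi - lo) / (num_thresholds - 1)
    return [lo + i * step for i in range(num_thresholds)]

image = [[0.0, 0.2], [0.6, 1.0]]
thresholds = linear_thresholds(image, 5)   # [0.0, 0.25, 0.5, 0.75, 1.0]

# A binary image at a given threshold is then simply:
binary = [[p >= thresholds[2] for p in row] for row in image]
```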
By imposing a threshold, the galaxy images are converted to binary images, from which the three MFs are calculated using QuantImPy (Boelens & Tchelepi 2021). When testing the MFs, we choose to investigate not only the values of the MFs but also how they vary with threshold. This was motivated by the hypothesis that information about FR type could be contained in the relative values of features at different thresholds. For example, as the threshold increases, we might find that in FR I galaxies the regions above the threshold tend to form one connected region, while in FR II galaxies they tend to form two connected regions. This difference would be quantified in MF 2, the Euler characteristic, as shown in Figure 1. For N thresholds, this results in 3 × N data points to feed into the model.
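The FR I/FR II hypothesis above can be illustrated on toy binary masks: for hole-free regions, the Euler characteristic (MF 2) equals the number of connected foreground regions, which a simple flood fill can count. The masks and names below are hypothetical illustrations, not data from the paper.

```python
from collections import deque

def count_regions(mask):
    """Count 4-connected foreground regions in a binary mask.
    For hole-free regions this equals MF 2, the Euler characteristic."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                regions += 1
                queue = deque([(r, c)])
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return regions

# Toy 1-row masks at a high threshold (1 = pixel above threshold):
fr1_like = [[0, 1, 1, 1, 0]]   # one connected central bright region
fr2_like = [[1, 1, 0, 1, 1]]   # two separated bright lobes
```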
HFs do not depend on threshold or order, and are calculated from an image of integer pixel values. As explained in Section 3.3, we calculate only the first 13 of the 14 HFs, using the Mahotas library (Coelho 2012). To make these features rotationally invariant, the 13 HFs are calculated four times, once for each of the four ways in which two adjacent pixels can be related (vertically adjacent, horizontally adjacent, and adjacent on both diagonals). As we are interested in the differences between adjacent pixels, we use a distance offset of 1, which allows the Haralick features to capture information at the same resolution as the image. The four sets are then averaged to produce the 13 HFs that we use.
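The directional averaging described above can be sketched without any dependencies: build a grey-level co-occurrence matrix at distance offset 1 for each of the four adjacency directions, compute a Haralick-style feature from each, and average. This is a simplified illustration using one feature (contrast, HF 2); the paper itself uses the Mahotas implementation of all 13.

```python
def glcm(img, offset, levels):
    """Symmetric, normalised grey-level co-occurrence matrix for one offset."""
    rows, cols = len(img), len(img[0])
    dy, dx = offset
    counts = [[0.0] * levels for _ in range(levels)]
    total = 0
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                i, j = img[y][x], img[ny][nx]
                counts[i][j] += 1   # count the pair in both orders so
                counts[j][i] += 1   # the matrix is symmetric
                total += 2
    return [[c / total for c in row] for row in counts]

def contrast(p):
    """Haralick contrast (HF 2) from a normalised co-occurrence matrix."""
    return sum(p[i][j] * (i - j) ** 2
               for i in range(len(p)) for j in range(len(p)))

# Average over the four adjacency directions (distance offset 1) so the
# result is invariant under 90-degree rotations and reflections:
img = [[0, 0, 1], [0, 2, 2], [1, 2, 3]]
offsets = [(0, 1), (1, 0), (1, 1), (1, -1)]
hf2 = sum(contrast(glcm(img, o, levels=4)) for o in offsets) / len(offsets)
```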
Similarly to the MFs, the number of EFDs calculated depends on the number of thresholds, N, and also on the order, n, to which the Fourier sum is expanded, giving N × (4n − 3) EFDs in total. At n = 1, the Fourier analysis results in an ellipse with its major axis extending in the longest direction of the contour, which we align with the coordinate axes as described in Section 3.3. This makes the coefficients, which we extract using the PyEFD library (Blidh 2013), rotationally invariant.
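The feature-vector sizes implied by the text can be checked with simple counting functions; these are a sanity check on the arithmetic, not code from the paper.

```python
# Feature-vector sizes for a given number of thresholds and EFD order,
# as implied by the text above.
def num_mf_features(num_thresholds):
    return 3 * num_thresholds        # three MFs per threshold

def num_hf_features():
    return 13                        # threshold- and order-independent

def num_efd_features(num_thresholds, order):
    # 4 coefficients per order, minus the 3 fixed by normalising
    # the order-1 ellipse against rotation.
    return num_thresholds * (4 * order - 3)
```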

MODEL PERFORMANCE
When assessing the performance of the supervised learning models, we use the metrics of recall, precision, F1 score and accuracy, calculated against a reserved test set. Tables 1 and 2 give the values of these metrics for the XGBoost and neural network models respectively, for all combinations of the three feature sets. These are tested using the best set of hyperparameters, averaged over 10 runs with different random initialisations, from which the standard deviation is calculated. The computational times reported are the sums of the times taken for training and testing using an Apple M2 chip (CPU), including the time taken to calculate features. Furthermore, we include the training-to-validation loss ratio as an indicator of the generalisability of the model. All performance metrics are above 50%, demonstrating that these features contain information about the FR type of the galaxies. XGBoost consistently outperforms the neural network in every case, with a mean difference in accuracy of ∼5%. The uncertainties in accuracy and F1 score between the neural network and XGBoost runs are generally comparable, suggesting that although XGBoost may perform better, the methods are similar in terms of stability. When applying the neural network to MFs alone, we find a large variation in the recall of both FR I and FR II, indicating that similar accuracies can be achieved through different distributions of predictions. For FR I classification, the recall is higher than the precision for every run, while for FR II classification the opposite is the case, suggesting a disproportionate labelling of galaxies as FR II. In both machine learning models, the best feature set combination is MFs and HFs, producing an accuracy of 74.2 ± 1.3% for the neural network and 77.3 ± 2.0% for XGBoost. The validation loss as a function of the number of epochs for all neural network runs is shown in Figure 4.
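The per-class metrics used above follow the standard definitions, sketched here for one class; the labels in the example are hypothetical.

```python
def binary_metrics(y_true, y_pred, positive):
    """Recall, precision, F1 and accuracy for one class (e.g. 'FRI')."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = correct / len(y_true)
    return recall, precision, f1, accuracy

# Hypothetical labels for four test galaxies:
y_true = ["FRI", "FRI", "FRII", "FRII"]
y_pred = ["FRI", "FRII", "FRII", "FRII"]
recall, precision, f1, accuracy = binary_metrics(y_true, y_pred, "FRI")
```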

DISCUSSION
When tested on their own, MFs outperform HFs and EFDs, with an accuracy of 76.5 ± 2.2% compared to 70.2 ± 1.7% and 69.7 ± 2.4% respectively using XGBoost. Combining EFDs and HFs produces the biggest increase over the individual performances, resulting in an accuracy of 74.7 ± 3.8%. However, combining EFDs and MFs using XGBoost results in a slightly worse accuracy than the MFs on their own, suggesting a significant overlap in the information that these features contain, and also an inability of these models to ignore superfluous data. Combining MFs and HFs produces the model with the best accuracy and F1 scores for both the neural network and XGBoost, with the XGBoost algorithm using these features having the highest overall accuracy. Compared to using XGBoost with MFs alone, adding HFs increases the FR II recall and FR I precision, but at the expense of a decrease in FR I recall and FR II precision, suggesting an increase in the number of galaxies being labelled as FR I.
Although this model has the best performance, the accuracies of all the XGBoost runs that combine two or more feature types lie within their common uncertainty bounds, while there is a larger variation in performance using the neural network.
In the study by Scaife & Porter (2021), the accuracy of a non-equivariant CNN is found to be 94.0 ± 1.4%, and that of a D16-equivariant CNN is 96.6 ± 1.3%, which our models are far from achieving. However, the computational complexity of convolutional layers means that the models developed in this work are significantly faster, as shown in Table 2. Using a CNN requires nearly 50 times the computation time of the non-convolutional machine learning algorithms, meaning the latter can provide significantly faster classification of large amounts of data when high accuracy is not required. Furthermore, as shown in Table 2, all feature sets besides EFD and HF+EFD have validation-training loss ratios similar to that of the CNN, indicating no particular advantage for either approach with respect to generalisation. Fluctuations in computation times for the models developed in this study are primarily due to the different order and number of thresholds from which the features were calculated, as these hyperparameters were tuned individually for each model.
We note that the improved performance of the CNN compared with the direct feature extraction methods is not expected to be robust to reductions in training data volume (see e.g. Lin et al. 2020), as CNNs require relatively large data volumes to train effectively.

Feature Importance
In addition to the performance of the models trained on the individual features, we can investigate their suitability for FR classification by inspecting their relative importance according to XGBoost. Shown in Figure 5 are the most important features according to XGBoost from the run of highest accuracy (MFs+HFs). MFs make up the majority of these most important features, with 40% of them being the MF corresponding to the Euler characteristic evaluated at different thresholds. Of the 13 HFs, the most important were HF 8 (sum entropy), HF 2 (contrast) and HF 4 (sum of squares variance), listed in decreasing importance.
During the model tuning process, the number of thresholds and the order of the EFDs are varied as hyperparameters. Order is found to be very weakly correlated with the performance of the model, and when EFDs were combined with any other feature, lower orders were favoured. The number of thresholds was also found to be only weakly correlated with performance in comparison with the other hyperparameters. Having very few thresholds generally results in worse performance, but beyond ∼5 thresholds, adding more did not increase the accuracy. This is unsurprising: the thresholds always span the entire range of possible values, and the cumulative nature of increasing the pixel cutoff value means that adding more thresholds provides no new information beyond increasing the resolution at which the change in the values of the MFs and EFDs can be calculated. The only Haralick feature that is not robust to noise is the correlation (HF 3; see e.g. Brynolfsson et al. 2017). Our results show that for our best-performing model, which uses a combination of the Haralick features and Minkowski functionals, HF 3 (Equation A7) is not in the top 30 most important features, as can be seen in Figure 5. However, when using Haralick features on their own, this feature ranks in the top three most important, and therefore a model based on Haralick features alone is expected to be less robust to noise.

Representation Space
To investigate the distribution of features in more detail, we use the Uniform Manifold Approximation and Projection (UMAP) algorithm, a non-linear dimensionality reduction technique developed by McInnes et al. (2018). The aim of UMAP is to project high-dimensional data into a lower-dimensional space (typically two dimensions to allow for visualisation), while preserving topological relationships between the datapoints. This is not a supervised machine learning algorithm and does not aim to classify the data according to the given labels, but instead groups the data based on proximity in parameter space. We can gain an understanding of how the features intrinsically contain information about galaxy type by investigating which galaxies are mapped to similar areas of the manifold. This allows a comparison between the features, and allows us to analyse their suitability in the context of radio galaxy FR classification, especially in the context of neural networks, which are typically hard to interpret. UMAP has been found to provide a useful tool for visualising the separation of galaxies from stars and quasars (Clarke et al. 2020), and performs better on large scales than t-distributed Stochastic Neighbour Embedding (t-SNE), another tool commonly used in dimensionality reduction (McInnes et al. 2018).
Input data to the UMAP algorithm, which typically are not uniformly distributed in the high-dimensional parameter space, are mapped to a Riemannian manifold on which the data are uniformly distributed by construction. Riemannian geometry tells us that a hypersphere of unit radius on this manifold maps to a hypersphere in the parameter space that stretches to the k-th nearest neighbour according to a chosen metric. The value of k is a hyperparameter used to define a notion of locality, resulting in each datapoint having its own sized hypersphere of locality in the parameter space. Two datapoints are assigned a probability of being connected based on the overlap of these hyperspheres, and these probabilities are used as the weights of the edges of a connected graph with datapoints as nodes. This results in a topological representation of the data which is constructed to preserve the topology of the original manifold, and which is easier to manipulate onto a new lower-dimensional manifold. This lower-dimensional manifold is constructed by creating a generic lower-dimensional manifold and then manipulating it to optimise its information overlap with the connected graph, quantified by the cross-entropy. The optimisation is performed using a force-directed graph layout algorithm, in which datapoints in the same clump are attracted and datapoints in different clumps are repelled until an optimal representation is reached.
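The local fuzzy-membership step of this construction can be sketched for 1-D points under a Euclidean metric. This is a simplified illustration of the idea, not UMAP's actual implementation: real UMAP solves for a per-point normalisation so that the weights sum to log2(k), whereas this sketch simply uses the k-th neighbour distance as the local scale.

```python
import math

def edge_probabilities(points, k):
    """Simplified UMAP-style fuzzy edge weights (1-D points, Euclidean metric).

    For each point, rho is the distance to its nearest neighbour, and the
    distance to the k-th nearest neighbour sets the local scale sigma, so
    each datapoint gets its own notion of locality.
    """
    n = len(points)
    probs = {}
    for i in range(n):
        dists = sorted((abs(points[i] - points[j]), j)
                       for j in range(n) if j != i)
        rho = dists[0][0]
        sigma = max(dists[k - 1][0], 1e-12)
        for d, j in dists:
            probs[(i, j)] = math.exp(-max(d - rho, 0.0) / sigma)
    return probs

# Three clustered points and one outlier:
points = [0.0, 0.1, 0.2, 5.0]
p = edge_probabilities(points, k=2)
# Edges within the cluster get weight ~1; edges to the outlier decay fast.
```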
We construct a two-dimensional UMAP projection of the equivariant feature space using the top 10 most important features from the feature importance analysis in Section 7.1, which is presented in Fig. 6 with the true labels of the galaxies indicated. There is no distinct separation of the two types into groups, although there are regions denser in FR I (near the area marked C) and in FR II (near the region marked A). The lack of obvious clustering of FR I and FR II is to be expected given the relatively low accuracy of the supervised machine learning models, but the distinction between the two types in certain areas is of interest. Examples of the galaxies in each of the six regions of Figure 6 are shown in Figure 7.
The largest distance between regions in Figure 6 is that between group A and groups E and F. As shown in Figure 7, going from group A to group E corresponds to the galaxies becoming more extended and containing a higher total image flux. UMAP tends to group datapoints that are close in the parameter space and spread out datapoints that are further apart, and the fact that this transition from compact to extended sources aligns with the principal axis of the embedding suggests that this is the primary information held in these features.
Another feature of the galaxies that appears to be represented in the embedding is the messiness of the image. Galaxies in groups A, C and F, on the upper side of the UMAP embedding, tend to be more organised and show more homogeneity within the group. Group A consists exclusively of compact sources with two distinct areas of luminosity and is almost entirely FR II. Group C consists of sources that are thinly extended in one direction, most of which are FR I. Group F has a more even mix of FR I and FR II, but its members are consistent in containing two large, bright, well-separated regions, though they may contain a third bright but compact region at the centre of the galaxy. In comparison, groups B, D and E on the lower side of the UMAP embedding typically contain more distinct regions, many of which are less well-defined. Group B contains compact bright sources that are separated from compact dimmer regions. The large empty regions between the multiple compact areas of brightness make it hard to tell where the centre of the galaxy is, and it is often unclear whether these areas of brightness are from the same source. Groups D and E both contain galaxies with multiple components that tend to be less straight than those in group C, though the galaxies in group D tend to have one brighter region and multiple dimmer regions, while galaxies in group E are more uniformly bright and display high contrast.

Restriction of Invariance
Both MFs and HFs are E(2) invariant. As invariant features are a subset of equivariant features, there are inherent limits to the information they contain about galaxy morphology. Despite the E(2) equivariance of EFDs, challenges arose in the presence of multiple contours. Analysis of Equation A32 reveals a potential issue with concatenating multiple contours, as the gradient is undefined at the point of concatenation. This effect limits the functionality of the elliptical Fourier analysis and leads to a loss of information about the distances between contours.
Analysis of Figures 6 and 7 reveals that MFs and HFs inherently capture information about galaxy extension. This information is built into the MFs, as both area and perimeter are measures of size. How HFs sort by extension is more enigmatic, as they measure the relationships between the pixels in an image regardless of whether or not those pixels are part of the background. As a result, the normalisation constant from Equation A4 is always the same for all galaxies, and as such HFs also encode information about galaxy extension. It is important to note that while total extension is a useful metric for classifying FR galaxies, its dependence on telescope resolution and cosmological distance introduces a caveat: the accuracy of models based on feature sets such as those presented here will be affected when applied to new data at different telescope resolutions.
We suggest that the limited performance of the E(2) equivariant features is due to the features being invariant rather than equivariant to the symmetry of the problem, which limits their flexibility in the context of FR classification. These features have demonstrated their strength as useful tools for classification in image-processing tasks less complex than those tackled with CNNs, such as in medical imaging (Lin et al. 2020), indicating that they do provide viable alternatives to CNNs in some contexts. However, as the task of FR classification for radio galaxies is inherently complex due to the visual variation found within each FR type, these features alone are unable to extract all the information from the image that is relevant to FR classification, explaining their limited performance in the context of this research.

CONCLUSIONS
In this work we have compared MFs, HFs and EFDs as tools for the task of FR classification of radio galaxies. These features are equivariant under E(2) symmetries, a desirable property for this task. When used as inputs to both a neural network and XGBoost, we find that all three feature sets contain morphological information about the galaxies, with MFs being the most informative and EFDs the least. We demonstrate that although combining these features can result in an increase in performance, this increase is only incremental, due to the limitations of these features when applied to complex real-world data and the information overlap between the different feature types.
We show that, compared to equivalent group-equivariant CNNs, the models developed in this work are around 50 times faster at classifying radio galaxies, though their accuracy is found to be limited to ∼ 80%.
Similarly to Ntwaetsile & Geach (2021), our analysis shows that these features primarily capture morphological information such as extension and entropy, which are only weakly correlated with FR type. We therefore suggest that these features may be more applicable to the unsupervised clustering of data where higher computational efficiency is required.

Figure 1 .
Figure 1. MFs of the binary images created by imposing relative thresholds of 0.1, 0.3 and 0.7 for an FR I and an FR II galaxy. Values of the MFs are shown in the subplots, with the original galaxy images displayed on the left. As the threshold increases, MF 2 tends to a value of 1 for the FR I galaxy and 2 for the FR II galaxy.

Figure 2 .
Figure 2. Example galaxies corresponding to increasing values of a specific HF. The top row shows galaxies of increasing contrast, the second HF. The middle row shows galaxies of increasing sum of squares variance, the fourth HF. The bottom row shows galaxies of increasing sum entropy, the eighth HF.

Figure 3 .
Figure 3. Elliptical Fourier analysis of the contour of an FR I galaxy at a threshold of 0.2. Shown in purple is the true contour at the threshold, and shown in blue are the contours using the EFDs at orders 1, 3 and 10. To allow for direct comparison of the values of the individual EFDs, these are calculated with respect to the coordinates shown in the left-most image.

Figure 4 .
Figure 4. Left: validation loss as a function of epoch for the neural network architecture described in Section 5.1 for different combinations of equivariant features. Right: the validation loss for the CNN from Scaife & Porter (2021), shown for comparison.

Figure 5 .
Figure 5.The relative importance of each type of feature for the top 30 most important features from the best run according to XGBoost.

Figure 6 .
Figure 6. UMAP embedding of the top 10 features from the most accurate XGBoost run. Labels of FR I and FR II types are given in orange and blue respectively, with labelled regions of interest. A Euclidean metric is used with 8 nearest neighbours and a minimum distance of 0.1.

Figure 7 .
Figure 7. Random galaxies from each of the six regions labelled in Figure 6. The FR type of each galaxy is labelled.

Figure A1 .
Figure A1. Co-occurrence matrix (right) calculated for a random matrix (left) according to a horizontal relation between pixels. In red, we show the calculation for the i = 8 and j = 9 entry. As this relation is symmetric, the co-occurrence matrix is also symmetric.

Table 1 .
Performance metrics for classification of the MiraBest dataset using XGBoost. Column [1] lists the type of features used for classification in each case. Run times were obtained on a laptop with an Apple M2 chip (CPU). The best values for each metric are shown in bold.

Table 2 .
Performance metrics for classification of the MiraBest dataset using the neural network described in Section 5.1. The CNN metrics are taken from Scaife & Porter (2021). Column [1] lists the type of features used for classification in each case. Run times were obtained on a laptop with an Apple M2 chip (CPU). The best values for each metric are shown in bold.

Table B1 .
Hyperparameter priors for the neural network.

Table B4 .
Optimal hyperparameters for every neural network run.