Radio Galaxy Zoo EMU: Towards a Semantic Radio Galaxy Morphology Taxonomy

We present a novel natural language processing (NLP) approach to deriving plain English descriptors for science cases otherwise restricted by obfuscating technical terminology. We address the limitations of common radio galaxy morphology classifications by applying this approach. We experimentally derive a set of semantic tags for the Radio Galaxy Zoo EMU (Evolutionary Map of the Universe) project and the wider astronomical community. We collect 8,486 plain English annotations of radio galaxy morphology, from which we derive a taxonomy of plain English tags. The result is an extensible framework which is more flexible, more easily communicated, and more sensitive to rare feature combinations which are indescribable using the current framework of radio astronomy classifications.


INTRODUCTION
Language is often difficult to define or use. When new concepts arise and demand their own terminology, terms can be adopted from similar ideas (e.g. 'entropy' in information theory and physics; Natal et al. 2021), invented (e.g. 'utopia'; Romm 1991), or named after the discoverers (e.g. 'Newtonian physics'). Individual terms often have in the 1930s (Southworth 1956), the language used to describe celestial objects has been developed almost entirely in tandem with the instruments and corresponding scientific understanding. Consequently, some terms are limited by our physical understanding (e.g. Little Green Man 1; Hewish et al. 1968) or the sample inspected at the time (e.g. FRI / FRII; Fanaroff & Riley 1974).
In the case of radio galaxy morphologies, language is becoming increasingly difficult to use, especially as technological and scientific advancements provide deeper insight into the vast range of radio morphologies. The gap between the diverse range of observed radio galaxy morphologies and the classification schemes used is widening. The current morphological classifications carry information which cannot be quantified under current frameworks, meaning the use of non-numeric features, i.e. language-based schemes, is unavoidable. Accordingly, the current classification schemes fall victim to obfuscated language. Additionally, the terms used tend to describe abstract classes, which lack the ability to capture the increasingly complex features of radio galaxies observed with the newest generation of instruments. Rudnick (2021) urges the radio astronomy community to develop a tagging system rather than forcibly attempting to create classes which neatly separate objects. Such a tagging system would allow an object to be assigned plain English descriptors capturing the semantics of the object's features through tags rather than be assigned a distinct class to which it belongs. Additionally, this tagging system would be able to consolidate instrument-specific morphologies within the same framework without producing conflicts. As an example, a source could be tagged as 'compact' in a low resolution survey while having specific morphological features captured by tags referring to observations made by higher resolution instruments. This work aims to build the framework for such a tagging system for the first time.
The newest radio instruments in operation are producing maps of sources which are so deep, resolved, and with such high dynamic range that our existing classification schemes are failing. An updated, and extensible, radio morphology taxonomy of tags would be a tremendous benefit moving forward because deeper and wider surveys are expected to be a massive driver of scientific development in the coming decades. If the scientific community had a framework and terminology which were not intrinsically limited by sensitivity or resolution, it would mean that we could work with the same framework regardless of the technological improvements to the instruments in the field. We therefore expect this work could have major implications in various scientific contexts, including population studies and rare object searches in observations made by current and future radio instruments including the Australian Square Kilometre Array Pathfinder (ASKAP; Johnston et al. 2008).
This work uses data from the Evolutionary Map of the Universe (EMU; Norris et al. 2011), a radio survey being conducted with the ASKAP telescope. ASKAP's large field of view means that it can map a large portion of the sky at once. Because of this, EMU is currently planned to map three quarters of the sky, the first two thirds of which are planned to be completed in the first five years. EMU is estimated to catalogue 40 million sources (estimate made using the Tiered Radio Extragalactic Continuum Simulation method, T-RECS; Bonaldi et al. 2019). In an effort to classify these sources at scale, we are launching 'Radio Galaxy Zoo EMU' (RGZ EMU).
RGZ EMU is a citizen science project designed to allow the public to provide valuable and essential insights into these sources, including host identification, source assembly, and source classification (full details on RGZ EMU in Tang, Vardoulaki, et al. in prep.). While designing this project, we discussed what classifications we would ask the citizen scientists to use. It became clear that there was no consensus on the terms to use. In part as a response to this dilemma, we collect plain English annotations on radio galaxies and implement a novel framework to derive semantic plain English tags.
The proposed process uniquely combines existing natural language processing (NLP) methods. NLP has garnered significant research interest over the last twenty years (see for instance Mishra & Kumar 2020). Thomas et al. (2022) use NLP and a form of topic modelling called latent Dirichlet allocation (LDA; Vayansky & Kumar 2020) with the aim of guiding the planning process of research priorities by analysing trends in previous publications. Grezes et al. (2021) use deep learning based NLP techniques with the aim of improving the SAO/NASA Astrophysics Data System (ADS 1 ).
The method we present is not bound to radio astronomy. It can be applied to any domain. The code used in this work is publicly available at https://github.com/mb010/Text2Tag and is written to be transferable to other fields.
Our work is structured as follows. In Section 2 we detail the data used in and collected through our experiments. We present the proposed method in full in Section 3 before presenting the details of its application and the resulting taxonomy in Section 4. Initial physical results using the semantic taxonomy are presented in Section 5. The results and impact are discussed in Section 6 and conclusions are made in Section 7.

DATA
Two experiments were designed and executed. Both made use of early versions of cutouts prepared for the RGZ EMU project as described in Section 2.1. The intent and design are detailed for the Plain English Annotations experiment as well as the Expert Classification experiment in Sections 2.2 and 2.3 respectively. An anonymised version of the data is publicly available 2 .

Image Data
To produce the data analysed in this work, users were asked to consider individual images in turn. An example of one of these images is presented in Figure 1. These images are early versions of the data to be used in the RGZ EMU project. This version consists of three panels containing a 6 arcmin by 6 arcmin cutout from the EMU pilot survey. The panels show EMU contours with the false colour EMU image, a Digitized Sky Survey (DSS; Lasker et al. 1990) cutout, and a Wide-field Infrared Survey Explorer (WISE; Wright et al. 2010) cutout. Each image is centred on an EMU Selavy catalogue component (Norris et al. 2011, 2021b). The cutouts are subject to a number of criteria designed to select a small number of sources for early testing. Components which had an angular extent of less than 27 arcsec (1.5 beam widths) were removed. Components which were within 45 arcsec (2.5 beam widths) of another catalogued component were also removed, as these are largely simple doubles with little to no morphological features and can be classified algorithmically. This resulted in a list of 306 sources, for which cutout images were made. Our final sample consists of 299 of these cutouts because an undetected upload error caused seven cutouts to not be uploaded to the Zooniverse platform.
1 See: https://ui.adsabs.harvard.edu
2 https://zenodo.org/record/7254123#.Y7VvGtLP3Lp

Plain English Annotations
To derive the desired plain English taxonomy, we started with plain English descriptions (annotations) of the given object or phenomenon. Using the 299 sets of images outlined in Section 2.1, we built a private Zooniverse 3 project, where we presented users with an empty text box and a prompt reading:
Please describe the source:
• in the middle of the frame and any associated emission.
• use simple English.
These data were intentionally collected to be relatively unconstrained to encourage the annotations to cover diverse ideas of source features. Users were thus free to highlight and describe whatever caught their attention within the image. The trade-off in this unstructured description approach is that the resulting data are unwieldy and noisy (the consistency and formatting of phrases is not constrained). As such, the method outlined in Section 3 contains a significant overhead of data cleaning, which is common with any unstructured natural language data.
The data collection for this experiment ran from the 17th of December 2021 to the 27th of January 2022. We offered users who processed more than 100 sources co-authorship on this publication, which is a direct result of their efforts. In total, we had 19 users annotate an average of 154 sources each, resulting in a total of 2,920 descriptions consisting of 8,486 comma-separated annotations. Almost all of these users are astronomers, and more than three quarters of them have at least some academic experience of radio morphologies.

Expert Classification
We conducted a second experiment to collect expert classifications on the same sets of images. This experiment was conducted with the aim of extracting ideas represented by annotations which are relevant to the experts' science cases.
3 See: https://www.zooniverse.org/
We established a separate private Zooniverse project and invited a number of experts to participate in classifying the radio morphologies of the objects in the images described in Section 2.1. To classify objects with predefined classes, participants were prompted with:
Radio Morphology: Please describe the source:
• in the middle of the frame and any associated emission
• select one or more tags that fit object radio morphology.
We presented the participants with 22 classes, which they could use as they wished, including assigning none or all of them to the subject. The abstract classes listed were selected from a compiled list of radio morphology classes presented in Rudnick (2021).

METHOD
To the best of our knowledge, there is no existing process or NLP approach which produces a semantic taxonomy from a corpus of short annotations. The closest approaches are widely used topic modelling approaches. These approaches capture topics within a corpus through distributions of terms in documents. Vayansky & Kumar (2020) present a helpful review of topic modelling variants. These models are designed to return a distribution of terms which belong to each discovered topic. They are not designed to return terms which communicate what a given topic is. We explicitly want to build a taxonomy on a certain subject. Therefore the terms which effectively capture the meaning of a topic are essential.
Although a panel of experts may be able to manually define a semantic set of terms for a given problem, the success of such an approach would depend on whether the panel agree, the backgrounds of the experts, and their ability to distil complex ideas into simple plain English effectively. This manual approach would likely also lack the reproducibility and tractability that is expected by the physical sciences.
We therefore propose a method through which short annotations are distilled into semantic tags in accordance with a specific science case and its respective features (including classes). The workflow of the method is presented in Figure 2. The derived taxonomy should provide wide coverage of objects of interest, have the ability to distinguish features, be clear in what semantic feature it describes, and be appropriate to the science cases.
Conceptually, in the framework similar annotations are aggregated to produce a single term which we call a 'tag'. We rank how important tags are based on the impact they have in classifying the existing abstracted science classes. The most important tags are then selected to form a taxonomy.
A technical outline of the method is presented in Section 3.1. The implementation details for our data and project are presented in Section 3.2.

Technical Outline
Sequences of words are processed where $w_l$ represents the $l^{\mathrm{th}}$ of $m$ words in a given annotation, $\mathbf{a}_i = (w_1, w_2, \ldots, w_m)$. Here $\mathbf{a}_i$ is the $i^{\mathrm{th}}$ of the $n$ annotations in our corpus. Note that in the NLP literature, the equivalent of annotations would be 'documents'. This method is expected to work best on extremely short annotations (documents), where each annotation contains a single idea. The annotations are embedded into a $d$-dimensional vector through a pre-trained model, $f_{\mathrm{emb}}$:
$$\mathbf{v}_i = f_{\mathrm{emb}}(\mathbf{a}_i) \in \mathbb{R}^d .$$
This is currently implemented such that the order of the words does not affect the encoding (i.e. in a bag-of-words paradigm). We embed each word within an annotation through
$$f_{\mathrm{emb}}(\mathbf{a}_i) = \frac{1}{m} \sum_{l=1}^{m} f_{\mathrm{emb}}(w_l) .$$
For pairs of annotations, $(i, j) \in [1, n]^2$, a similarity value is calculated using the cosine similarity, $\mathrm{cs}_{\mathrm{sim}} : \mathbb{R}^d \times \mathbb{R}^d \to [-1, 1]$, which takes the dot product of two vectors scaled by the inverse of the product of the Euclidean norms of those vectors:
$$\mathrm{cs}_{\mathrm{sim}}(\mathbf{v}_i, \mathbf{v}_j) = \frac{\mathbf{v}_i \cdot \mathbf{v}_j}{\lVert \mathbf{v}_i \rVert \, \lVert \mathbf{v}_j \rVert} .$$
According to a similarity threshold, $\tau$, averaged vectors are then calculated through:
$$\bar{\mathbf{v}}_i = \frac{1}{N_i} \sum_{j=1}^{n} \mathbb{1}\left[ \mathrm{cs}_{\mathrm{sim}}(\mathbf{v}_i, \mathbf{v}_j) \geq \tau \right] \mathbf{v}_j ,$$
where $N_i$ is the number of non-zero elements being summed over. The model, $f_{\mathrm{emb}}$, used to embed the annotations is then used to produce the token which is closest to $\bar{\mathbf{v}}_i$:
$$T_k = \underset{w \in V}{\arg\max} \; \mathrm{cs}_{\mathrm{sim}}\left( f_{\mathrm{emb}}(w), \bar{\mathbf{v}}_i \right) ,$$
where $V$ is the set of tokens embedded by the model and $T_k$ is the $k^{\mathrm{th}}$ entry of all $K$ unique derived tags. As tags are derived from the annotations, $K \leq n$.
For each annotated object, we define $\mathbf{t} = (t_1, \ldots, t_K)$ as a vector encoding of the tags, where $t_k$ is 1 if the tag $T_k$ was present in an annotation associated with that object, or 0 if that tag is not associated with it. As each object has multiple individual annotations associated with it, it can be described through its derived tags $\mathbf{t} \in \{0, 1\}^K$.
We consider each science class $c$ in the set of science classes $C$. Using the encoded tag vector, $\mathbf{t}$, for each object, we fit a model, $g_c : \{0, 1\}^K \to \{0, 1\}$, to predict the presence of each science class $c \in C$. For each model, $g_c$, and tag representation, $t_k$, we calculate an importance, $I(g_c, t_k)$, where a larger value of $I(g_c, t_k)$ means that $t_k$, and subsequently $T_k$, are more important to the classification output.
To recover the importance of the $k^{\mathrm{th}}$ tag, we take an average across all models, $g_c$, for a given tag, $T_k$. We take the support weighted average of the importance of each model:
$$I_k = \frac{\sum_{c \in C} s_c \, I(g_c, t_k)}{\sum_{c \in C} s_c} .$$
Here, $s_c$ is the number of (positive) class entries. Note that other weightings may be preferable depending on the available data and purpose. For instance, if a multi-objective regression task were used instead to calculate $I(g_c, t_k)$, then a uniform weighting across tasks may be more appropriate. We normalise the importance values, $I_k$, such that
$$\sum_{k=1}^{K} I_k = 1 .$$
Finally, the tags, $T_k$, are sorted by their $I_k$. Although tags are all expected to have some non-zero $I_k$, a majority of the information is contained within the top tags. Additionally, the tags ranked lowest in this scheme are expected to be noisy (e.g. annotations which contain incorrect spellings, reference otherwise irrelevant features, or are only impactful on a given prediction through the random association of its small sample size). We set an importance threshold to select the top $K' < K$ tags. These most important tags, consisting of strings, $T_k \in \mathrm{Taxonomy}$, form the derived taxonomy.
Some of the tags, $T_k$, may require clarification to allow the tag to be clear upon first reading. To do so, the raw annotations from which that tag was derived are taken into consideration, in order to verify what it represents.
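The aggregation step of the outline can be illustrated with a minimal sketch. It is not the published implementation: the two-dimensional word vectors are invented toy values standing in for a pre-trained embedding model, the function names are hypothetical, and the use of "greater than or equal to" at the similarity threshold is an assumption.

```python
import math

# Toy 2-D word vectors standing in for a pre-trained embedding model.
# The values are invented for illustration only.
TOY_VECTORS = {
    "bent": (1.0, 0.1),
    "curved": (0.9, 0.3),
    "compact": (0.0, 1.0),
    "small": (0.2, 0.9),
    "tail": (0.7, 0.7),
}


def embed(annotation):
    """Bag-of-words embedding: the mean of the word vectors (word order is ignored)."""
    vectors = [TOY_VECTORS[word] for word in annotation.split()]
    return tuple(sum(component) / len(vectors) for component in zip(*vectors))


def cosine(u, v):
    """Dot product scaled by the inverse of the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def nearest_token(vector):
    """Return the embedded token whose vector is most similar to `vector`."""
    return max(TOY_VECTORS, key=lambda word: cosine(TOY_VECTORS[word], vector))


def derive_tags(annotations, threshold):
    """Average each annotation with all sufficiently similar ones, then map to a token."""
    embedded = [embed(annotation) for annotation in annotations]
    tags = []
    for v in embedded:
        # Assumption: annotations at or above the threshold are aggregated
        # (this always includes the annotation itself).
        similar = [u for u in embedded if cosine(u, v) >= threshold]
        mean = tuple(sum(component) / len(similar) for component in zip(*similar))
        tags.append(nearest_token(mean))
    return tags


annotations = ["bent", "curved", "bent tail", "compact", "small"]
tags = derive_tags(annotations, threshold=0.9)
```

With these toy vectors the first three annotations collapse onto one shared tag and the last two onto another, so five annotations yield two unique tags.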

Implementation
The exact implementation of the method outlined in Section 3.1 will depend on the data being used. The details for our implementation are as follows.

Pre-Processing
The goal of pre-processing the annotations is to format the data in a uniform manner without disrupting the content (i.e. standardising grammar, spelling, and formatting). To do this, a number of common NLP data processes are applied in the order in which they are presented below.
All annotations are set to lower case and all accents are removed from characters. Ampersands and new line commands are removed or replaced as appropriate. Forward slash and full stop characters are replaced with commas as they are often observed to represent separate ideas, which our method assumes are comma separated. Double whitespaces are corrected and hyphens are removed.
Based on manual inspection, additional corrections are made to a number of annotations. These are spelling mistakes such as 'copact' being corrected to 'compact'. We then drop any annotations which mention 'DSS', 'WISE' or 'optical' as these annotations are not expected to be a comment on the radio morphology, which is our target of interest.
At this point, the sets of comma separated annotations are separated into individual annotations. Contractions are expanded. A list of stopwords 4 is extended to include 'like', as well as scientific terms which the pipeline should not be affected by as they are both technical and not related to morphology. These additional stopwords include 'emu', 'galaxy', 'galactic', 'emission' and 'source'. Terms for cardinal directions ('north', 'south', 'east', 'west') are also added to the stopwords since our focus is on the features themselves instead of their position relative to a specific source. This stopwords list is applied to the annotations.
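The cleaning steps described so far can be sketched as a single function. This is a simplified illustration rather than the released code: the stopword list is a small illustrative subset, the spelling table holds only the one correction quoted in the text, hyphens are replaced by a space (an assumption), annotations mentioning 'DSS', 'WISE' or 'optical' are dropped per comma-separated annotation (also an assumption), and contraction expansion and lemmatization are omitted.

```python
import re
import unicodedata

# Illustrative subset only; the full stopword list is not reproduced here.
STOPWORDS = {"a", "and", "the", "of", "to", "like", "emu", "galaxy", "galactic",
             "emission", "source", "north", "south", "east", "west"}
# Manual spelling corrections; 'copact' is the example quoted in the text.
SPELLING = {"copact": "compact"}
# Annotations mentioning these terms are not about the radio morphology.
SKIP_TERMS = {"dss", "wise", "optical"}


def clean_description(text):
    """Turn one free-text description into a list of cleaned annotations."""
    text = text.lower()
    # Remove accents by decomposing characters and dropping the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace ampersands and newlines.
    text = text.replace("&", " and ").replace("\n", ", ")
    # Forward slashes and full stops are treated as idea separators.
    text = re.sub(r"[/.]", ",", text)
    # Collapse repeated whitespace, then remove hyphens (assumed: replaced by a space).
    text = re.sub(r"\s+", " ", text)
    text = text.replace("-", " ")

    annotations = []
    for part in text.split(","):
        words = [SPELLING.get(word, word) for word in part.split()]
        if any(word in SKIP_TERMS for word in words):
            continue
        words = [word for word in words if word not in STOPWORDS]
        if words:
            annotations.append(" ".join(words))
    return annotations


example = clean_description(
    "Copact source, bent tail to the North / faint in WISE. Edge-brightened lobés"
)
```

Here `example` holds three annotations: the misspelling is corrected, the stopwords and cardinal direction are removed, and the part mentioning WISE is discarded.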
We consider both lemmatized 5 and unlemmatized approaches to the data moving forward. Each annotation has now been cleaned, and the data are largely consistently formatted.

Embeddings
When embedding the cleaned annotations into vectors, we use SpaCy's large English language model 6 which is the largest available model within the SpaCy package (v3.3.0) that has tokens (largely words) embedded. It contains 685k embeddings. For a given vector, the model can return the closest embedded word(s). This feature is essential to our process, and is a key factor in the decision to use this model. Other, more advanced models, such as transformer-based models, can take word order into account, but do not have these token embeddings. In the implemented SpaCy version, the vector embeddings are learned through the GloVe algorithm introduced by Pennington et al. (2014). GloVe (Global Vectors for Word Representation) is an unsupervised representation learning algorithm which aims to embed words in a space which presents various desirable features. These features include semantic and linguistic similarity, which is advantageous for our use case.
To decide what similarity threshold to use to aggregate over, we consider the histogram of cosine similarities of all annotations. This is presented in Figure 3. This histogram is presented using annotation embeddings which were lemmatized and does not include self-similarities. The peak at 1.0 is due to tags which are identical (maximally similar). The rigorous cleaning process reducing short annotations to a few words before the annotations are embedded is likely a factor. Additionally, self-consistent vocabulary across multiple annotations may be embedded to an (essentially) identical vector.
To capture the excess tail of more similar vectors, we consider similarity thresholds above 0.5 for our models. Lower thresholds would contain the bulk of all annotation pairs, which would be counterproductive in trying to capture individual concepts. However, we would like to explore thresholds down to 0.5 to reduce the number of unique terms and maximise the number of entries per unique tag. The tags derived from the aggregated vector encodings are returned by SpaCy as single tokens (words).

Model Definition
We predict science classes on a source by source basis. We use 22 science classes to train the models to classify science classes from our derived tags. Three science classes presented to our expert classifiers are functionally removed as they are either not usable without spectra (Compact Symmetric Object; CSO), extremely contested and largely disused (Fanaroff-Riley Class 0; FR0, Hardcastle & Croston 2020; Rudnick 2021), or largely uninformative when considering extended radio morphology (Single). Furthermore, the following three science classes have insufficient positive cases for training: Double-Double Radio Galaxies (DDRG; Schoenmakers et al. 2000), Hybrid Morphology Radio Sources (HyMoRS; Banfield et al. 2015; Kapińska et al. 2017), and Odd Radio Circles (ORC; Norris et al. 2021a). The experts often did not agree in their usage of the terms. In Section 4.1 we explore the degree of agreement (expert threshold) beyond which we consider a source as being positively classified with a given science class.
To train models that predict science classes from our derived tags, we have 299 sources. This is a small data set. We chose a relatively simple model in response. We train random forests in a one-vs-rest scheme, i.e. one random forest model to classify one science class.
We treat a set of these models as a single model which predicts the multi-label target of an input. The random forests use the Gini impurity criterion, with the aim of improving the explainability of the selected features (Menze et al. 2009). The random forests are configured with 500 estimators, no maximum depth, and a seeded random state to allow for reproducible results. Unspecified features of the models are inherited from default values as implemented by Scikit-Learn v1.1.0.

Evaluation
Another challenge of the relatively small data set is the evaluation of the trained models. We use cross validation to maximise our use of the data. The model is evaluated through 10-fold cross validation, where we train 10 models on 10 different sets of nine tenths of the available data and evaluate each on the respectively withheld final tenth. With predictions for each tenth from one of the ten trained models, we recover predictions for each data point. These predictions result in approximate generalised performance metrics for the models.
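The way out-of-fold predictions are assembled can be sketched in a few lines. This is only an illustration of the bookkeeping: the function names are hypothetical, the shuffling seed is arbitrary, and the stand-in "model" simply predicts the majority label of its training fold, whereas the work itself trains random forests on tag vectors.

```python
import random


def kfold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle the sample indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[start::n_folds] for start in range(n_folds)]


def fit_majority(train_labels):
    """Stand-in model: remembers the most common binary label of the training set."""
    return int(2 * sum(train_labels) >= len(train_labels))


def out_of_fold_predictions(labels, n_folds=10, seed=0):
    """Predict every sample with a model that never saw it during training."""
    predictions = [None] * len(labels)
    for held_out in kfold_indices(len(labels), n_folds, seed):
        held_out_set = set(held_out)
        train_labels = [y for i, y in enumerate(labels) if i not in held_out_set]
        model = fit_majority(train_labels)
        for i in held_out:
            predictions[i] = model
    return predictions


# 299 sources with an invented class balance, for illustration only.
labels = [1] * 200 + [0] * 99
folds = kfold_indices(len(labels))
predictions = out_of_fold_predictions(labels)
```

Each source appears in exactly one held-out fold, so the combined predictions cover the whole data set and can be scored as a single set.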
We choose to track the performance of our models with macro and weighted F1 scores. An F1 score is the harmonic mean of precision and recall, and can be written as
$$F_1 = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}} ,$$
where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. The macro and weighted F1 scores are extensions to enable evaluations of multilabel problems. The macro F1 score is calculated by averaging F1 scores calculated in a one-vs-rest scheme for each target class. The weighted F1 score is calculated identically to the macro F1 score, except that we take the weighted mean where each score is first scaled by the number of samples of a given class, which may be more telling in an imbalanced data classification problem.
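As a worked illustration of the two averages, the sketch below computes per-class F1 scores from binary labels and combines them. The class names and labels are invented, the function names are hypothetical, and returning 0 when a class has no true or predicted positives is an assumed convention.

```python
def f1_score(y_true, y_pred):
    """F1 from binary labels: 2TP / (2TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    # Assumed convention: the score is 0 when there are no positives at all.
    return 2 * tp / denominator if denominator else 0.0


def macro_and_weighted_f1(true_by_class, pred_by_class):
    """Average one-vs-rest F1 scores uniformly (macro) and by class support (weighted)."""
    scores = {c: f1_score(true_by_class[c], pred_by_class[c]) for c in true_by_class}
    support = {c: sum(true_by_class[c]) for c in true_by_class}
    macro = sum(scores.values()) / len(scores)
    weighted = sum(scores[c] * support[c] for c in scores) / sum(support.values())
    return macro, weighted


# Invented labels for two classes over four sources.
true_by_class = {"A": [1, 1, 1, 0], "B": [0, 0, 0, 1]}
pred_by_class = {"A": [1, 1, 0, 0], "B": [0, 1, 0, 1]}
macro, weighted = macro_and_weighted_f1(true_by_class, pred_by_class)
```

Class A scores 0.8 and class B scores 2/3, so the macro average is about 0.733, while the weighted average is about 0.767 because class A has three positive entries and class B only one.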

Importance
To estimate the importance of each of the tags, we use Shapley values. Shapley values are a common explainability tool used in machine learning applications (Lundberg & Lee 2017). These values convey how much a feature has contributed to the prediction of the respective model in comparison to the average predictions of the model. Exact Shapley values for each input tag and science classification are calculated using the trained random forest model and the SHAP package for trees introduced in Lundberg et al. (2020). These values are the importance values used to estimate which tags capture the semantics of radio morphology.
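To make the quantity concrete, the sketch below computes exact Shapley values by brute-force enumeration of feature subsets for a toy model. It is not the tree algorithm of the SHAP package used in this work: the enumeration scales exponentially with the number of tags, the toy model and background data are invented, and features outside a subset are filled in from background rows (an assumed choice of value function).

```python
from itertools import combinations, product
from math import factorial


def shapley_values(model, x, background):
    """Exact Shapley values of model(x) relative to the mean prediction on background."""
    n = len(x)

    def value(subset):
        # Mean prediction when features in `subset` take x's values and the
        # remaining features are taken from each background row in turn.
        total = 0.0
        for row in background:
            mixed = [x[i] if i in subset else row[i] for i in range(n)]
            total += model(mixed)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            # Shapley weight for subsets of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                contribution += weight * (value(members | {i}) - value(members))
        phi.append(contribution)
    return phi


def toy_model(tags):
    """Invented non-linear model: positive only when the first two tags co-occur."""
    return 1.0 if tags[0] and tags[1] else 0.0


background = list(product([0, 1], repeat=3))
phi = shapley_values(toy_model, (1, 1, 0), background)
```

The two tags that jointly trigger the prediction share the credit equally, the unused third tag receives zero, and the values sum to the difference between the prediction for this input and the mean prediction over the background.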

Data Configuration
To evaluate which data configuration (i.e. data processing parameter selection) is best, we consider the F1 scores of models trained on various configurations of the data. We grid search data across configurations including four expert thresholds, eleven similarity thresholds, and with or without lemmatization. This results in 88 configurations, which we construct and evaluate.
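The size of the search grid can be checked with a short enumeration. The eleven similarity thresholds are assumed here to be evenly spaced from 0.50 to 1.00, since the text gives only their number and lower bound; the dictionary keys are hypothetical names.

```python
from itertools import product

expert_thresholds = [0.2, 0.4, 0.6, 0.8]
# Assumption: eleven evenly spaced thresholds between 0.50 and 1.00.
similarity_thresholds = [round(0.5 + 0.05 * step, 2) for step in range(11)]
lemmatize_options = [True, False]

configurations = [
    {"expert_threshold": e, "similarity_threshold": s, "lemmatize": lem}
    for e, s, lem in product(expert_thresholds, similarity_thresholds, lemmatize_options)
]

# Configurations sharing one expert threshold, as summarised per threshold in the text.
per_expert_threshold = [c for c in configurations if c["expert_threshold"] == 0.6]
```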
For the expert classification we demand that at least 20%, 40%, 60%, or 80% of the votes made on a given source agree. We call these confidence thresholds. Because the thresholds are expressed as percentages, sources which have not been classified by all experts can still have their classifications reflected in the confidence thresholds used in this search. We do not consider 100% agreement amongst experts as so few classifications would survive that we could not train a model (highlighting a serious issue of the current classification scheme). The F1 scores are presented in Figure 4, for which the statistics of the random forest models are taken over all 22 configurations (2 lemmatization and 11 similarity threshold combinations) for each expert threshold.
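Turning expert votes into binary labels at a given threshold might look like the sketch below. The function name, argument layout, and vote tallies are hypothetical; the comparison uses integer arithmetic so that a source with exactly the required share of votes is counted as positive.

```python
def positive_classes(votes, n_voters, threshold_percent):
    """Return the classes selected by at least threshold_percent of the voters.

    votes: mapping of class name to the number of experts who selected it
    n_voters: number of experts who classified this source
    """
    if n_voters == 0:
        return set()
    return {
        name for name, count in votes.items()
        if 100 * count >= threshold_percent * n_voters
    }


# Invented tallies for one source classified by five experts.
votes = {"FRII": 3, "Bent": 2, "ORC": 1}
labels_at_60 = positive_classes(votes, n_voters=5, threshold_percent=60)
labels_at_20 = positive_classes(votes, n_voters=5, threshold_percent=20)
```

Raising the threshold from 20% to 60% reduces the positive labels for this source from three classes to one, which illustrates how higher thresholds remove contested classifications.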
The models largely improve as the expert threshold is increased (see Figure 4). However, with increasing expert thresholds, the task which the model has been asked to complete becomes easier as the noise in the classifications is functionally reduced. The subset of data where experts have a high consensus is more likely to have clearly identifiable morphologies (reduced aleatoric uncertainty) or to present with a morphological classification which is more widely agreed upon amongst the experts (reduced epistemic uncertainty). In an attempt to capture a broader perspective on what radio morphologies are, while maintaining accuracy of the classifications, we select an expert threshold of 60% for the remainder of this work. Figure 5 shows the performance of the 22 models for each similarity threshold and lemmatization configuration. We select the configuration with a similarity threshold of 0.80 and lemmatized inputs. This configuration results in 213 unique tags. The model achieved a weighted F1 score of 0.254 and a macro F1 score of 0.350. This is the model with the highest weighted F1 score. The highest macro F1 score is 0.352, held by both configurations with a similarity threshold of 0.6.

Tag Ranking
The Shapley values are calculated for each tag provided to the model with respect to the model's outputs and the full data set. This provides us with a Shapley value for each science class and tag combination. We take the support weighted average of the Shapley values across the science cases to provide us with a descriptive Shapley value for a given tag. We normalise these values so that they sum to one across all tags. We call these values the comparative weighted Shapley values. These are presented as percentages which reflect how much sway a given tag has over the science classification of the model.
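The support weighted average and normalisation can be sketched as follows. The per-class importance values are assumed to be non-negative summaries (for example mean absolute Shapley values, which is an assumption, as the text does not state how signed values are handled), and the numbers and function name are invented for illustration.

```python
def comparative_weighted_importance(importance, support):
    """Support weighted average of per-class tag importances, normalised to sum to one.

    importance: mapping of class name to a mapping of tag to importance for that class
    support: mapping of class name to its number of positive entries
    """
    total_support = sum(support.values())
    tags = sorted({tag for per_class in importance.values() for tag in per_class})
    weighted = {
        tag: sum(support[c] * importance[c].get(tag, 0.0) for c in importance) / total_support
        for tag in tags
    }
    normalisation = sum(weighted.values())
    return {tag: value / normalisation for tag, value in weighted.items()}


# Invented importances for two classes and two tags.
importance = {
    "FRI": {"bent": 0.2, "jet": 0.1},
    "FRII": {"bent": 0.0, "jet": 0.3},
}
support = {"FRI": 10, "FRII": 30}
ranking = comparative_weighted_importance(importance, support)
```

Because the second class has three times the support of the first, its preference for 'jet' dominates: 'jet' receives five sixths of the normalised importance and 'bent' one sixth.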
We calculate the comparative weighted Shapley values and present the 70 most important terms in Figure 6. To select a usable volume of tags, we define the taxonomy to be the top tags required for 68% of the descriptive power to be maintained (approximately 1% of the comparative weighted Shapley value). In this data configuration, this results in a taxonomy of 33 tags.
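One way to read this selection rule is as the smallest set of top-ranked tags whose cumulative normalised importance reaches the chosen fraction. The sketch below implements that reading with invented importance values; the function name is hypothetical and the interpretation itself is an assumption.

```python
def select_taxonomy(normalised_importance, cumulative_fraction=0.68):
    """Keep the highest-ranked tags until their summed importance reaches the fraction."""
    ranked = sorted(normalised_importance.items(), key=lambda item: item[1], reverse=True)
    taxonomy = []
    running_total = 0.0
    for tag, value in ranked:
        if running_total >= cumulative_fraction:
            break
        taxonomy.append(tag)
        running_total += value
    return taxonomy


# Invented normalised importances that sum to one.
toy_importance = {"bent": 0.4, "jet": 0.3, "tail": 0.2, "margin": 0.1}
taxonomy = select_taxonomy(toy_importance)
```

In this toy case the two highest-ranked tags already account for 70% of the total importance, so the remaining tags are excluded.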
Highlighting the benefits of Shapley ranking over a simpler approach such as correlations, we present an interactive graph visualisation of moderately strong correlations between all combinations of both tags and science cases in Figure 7. Importantly, the graph does not contain all 33 tags. This is because most of the top 33 tags are not strongly correlated with other terms. They are still the most impactful to the model's decision, as non-linear combinations of tags can be used to classify the science case. Their value to the non-linear classifications is captured by Shapley values.

Taxonomy Adjustments
Limited to single words, the derived tags may not be an optimal selection. Suboptimal tag representations may occur when the conjugation of a given term is relevant to the use case but lemmatization has removed it, even if the conjugated form would have been more easily understood (e.g. 'extended' in comparison to 'extend'). Furthermore, the concept a tag represents may be better represented by multiple terms. We therefore investigate each tag in turn by considering all annotations which contribute to it. We adjust tags in an attempt to optimise the taxonomy for grammatical and conceptual clarity. The adjusted tags are listed below, including descriptions of the original annotations / concepts which a tag represents.
• trace: derives directly from numerous annotations stating 'traces host galaxy'. This is a clearer expression of what this tag represents. We therefore alter 'trace' to 'traces host galaxy'.
• disk: derives from annotations such as 'emission from galaxy disk'. We therefore merge this with 'traces host galaxy' (originally 'trace'). This refers to the radio emission tracing the host galaxy rather than the morphology of the host.
• bright: refers to bright features of a presented cutout. This includes cores as well as neighbouring sources. This information is more clearly contained in the catalogues of radio component fluxes. Therefore, this tag is dropped.
• spiral: derives from spiral galaxies being the hosts of the radio emission. This tag is changed to 'traces host galaxy' as it then contains the relevant radio morphology information.
• asymmetric: original annotations refer to asymmetric structure. To highlight the difference between asymmetric structure and brightness, we rename this tag to 'asymmetric structure'.
• component: refers to the number of components which the source is composed of. This tag is dropped in favour of catalogues which list how many separated components are assigned to a source.
• counterpart: refers to matching emission in either optical or infra-red. Host identification and 'traces host galaxy' will capture this information. Therefore, this tag is dropped.
• middle: Largely referring to presence and features of the central core of a radio galaxy. We therefore rename this tag to 'core'.
• brighten: Refers to 'edge brightened' sources. We therefore clarify this by altering this tag to 'edge brightened'.
• linear: Refers to non bent radio morphologies, which is captured in the absence of the 'bent' tag. Therefore this tag is dropped.
• elongate: Refers to elongated structures in the radio emission. This is captured by the absence of the 'bent' tag when sources are also 'extended' and is therefore dropped.
• radio: Annotations were written commenting on the radio emission in general ways (such as the presence of a jet or how many components are visible). This information is all mapped by other tags / processes. For this reason, we drop this tag from the taxonomy.
• overlap: Often refers to emission which overlaps with radio contours or vice versa. It is therefore changed to 'traces host galaxy'.
• brightness: Original annotations almost exclusively refer to 'asymmetric brightness' across components or within the structure being discussed. We therefore clarify this tag by changing it to 'asymmetric brightness'.
• straight: Refers to the non-bent structure of the radio galaxy. We therefore drop it in favour of the absence of the tag 'bent'.
• lobes: We make a grammatical change to 'lobe' with the intent to make this tag less ambiguous for future users.
• edge: Highlighting clear edges of sources, as opposed to diffuse edges. This is largely equivalent to and is merged into 'edge brightened'.
• margin: The annotations from which this tag derives refer to the source extending beyond the margins of the cutout. This is being accounted for with updated cutouts, and is not morphologically relevant beyond the angular extent of a source, which is better presented in a catalogue format. Therefore, this tag is dropped.

Semantic Taxonomy
After the adjustments made to the tags in Section 4.3, we have 22 unique semantic tags. In alphabetical order, the semantic tags we propose to use for radio galaxy morphology are: amorphous, asymmetric brightness, asymmetric structure, bent, bridge, compact, core, diffuse, double, edge brightened, extended, faint, host, hourglass, jet, lobe, merger, peak, plume, small, tail, and traces host galaxy.

Effectively Assigning Tags
We have succeeded in deriving semantic tags for radio morphologies (see Section 4.4). However, for RGZ EMU and other citizen science approaches it is not effective to ask citizen scientists to use 22 tags. Terms would likely be ignored in a long list, and users would easily bottleneck on a small number of tags, neglecting the remainder of the taxonomy. This would be detrimental to both the science case and the user experience.
To improve the scientific results as well as the user experience, and to make the most effective use of the citizen scientists' time and energy, we consider which tags within the taxonomy can be most easily computed algorithmically at other stages of processing, e.g. 'small' is easily calculated through the angular extent of the assembly mask for a given source. We aim for 10 tags which can be presented on a single screen to the citizen scientists. We consider each term in turn, and outline how each term might be assigned in Table 1. The tags which we believe are least easily computed will benefit the most from citizen scientist input. These are the tags which are presented as 'proposed for tagging' in Table 1. These ten tags are those which the RGZ EMU project will present to its citizen scientist volunteers.
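As a rough illustration of how a tag such as 'small' might be computed rather than crowdsourced, the sketch below measures the largest separation between pixels in a boolean mask and compares it with a threshold. This is a minimal sketch under stated assumptions: the function names, the centre-to-centre definition of angular extent, the pixel scale, and the threshold are all hypothetical and are not taken from the RGZ EMU pipeline.

```python
import math


def angular_extent_arcsec(mask, pixel_scale_arcsec):
    """Largest centre-to-centre separation between masked pixels, in arcsec."""
    pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pixels:
        return 0.0
    # Brute-force search over all pixel pairs; adequate for small cutouts.
    max_sep = max(math.dist(p, q) for p in pixels for q in pixels)
    return max_sep * pixel_scale_arcsec


def is_small(mask, pixel_scale_arcsec, threshold_arcsec):
    """Assign 'small' when the extent does not exceed a chosen threshold."""
    return angular_extent_arcsec(mask, pixel_scale_arcsec) <= threshold_arcsec


# Toy mask: three masked pixels in a row (values are hypothetical).
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]
print(angular_extent_arcsec(mask, pixel_scale_arcsec=2.0))  # 4.0
print(is_small(mask, pixel_scale_arcsec=2.0, threshold_arcsec=10.0))  # True
```

Any real implementation would need a threshold tied to the beam size of the survey, which is not specified here.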

RGZ EMU Early Feedback
While working towards our final release of RGZ EMU, we asked a small group of 16 testers who had never worked on radio galaxy studies before (8 from China, 7 from Pakistan and 1 from Germany) for feedback on an early version of the tag terms provided by a beta version of the pipeline presented in this work. The tags presented to the testers were: bent, bridge, complex, diffuse, distorted, elongated, hourglass, jet, plume, tail.
In general, our testers found most of the provided tags self-explanatory. The main concern which the testers raised was around the definition of three words they were not very familiar with: 'plume', 'tail' and 'elongated'. We believe there are two main contributions to this phenomenon: (i) our testers were all non-native English speakers, which is likely to explain their struggle with the meaning of 'plume'; (ii) the testers showed differences in their thinking around terms. For example, this included describing 'elongated' as 'extended', 'tail of a comet', 'oval shape', or a 'jet-like structure'.
To address these concerns, the final workflow will contain examples and conceptual definitions (see Appendix A) which users can reference for guidance. Furthermore, the RGZ EMU team is considering the translation of the tags into multiple languages, where issues such as this may be less relevant.

RADIO ASTRONOMY CHALLENGES AND SEMANTIC MORPHOLOGIES
The semantic taxonomy derived in this work (Section 4.4) is expected to be most useful as a tool by which astronomers can select samples of radio galaxies from source catalogues in a flexible manner. Assuming each of the tags is present or not, we can estimate how many populations can be selected. For the full taxonomy of 22 tags, $2^{22} = 4\,194\,304$ populations can be selected ($2^{10} = 1\,024$ for the ten tags that RGZ EMU citizen scientists will use; see Section 4.5).
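The binary estimate above is simply the number of present/absent combinations of n tags. A minimal Python check of the two quoted figures (the function name `n_populations` is ours, for illustration only):

```python
def n_populations(n_tags):
    """Number of distinct present/absent combinations of n_tags binary tags."""
    return 2 ** n_tags


print(n_populations(22))  # 4194304, the full taxonomy
print(n_populations(10))  # 1024, the ten tags shown to citizen scientists
```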
In practice, the number of populations that the tags map may be quite different. For example, the binary estimate presented here does not consider the use of other catalogued features, such as flux, spectral index, or redshift. It also does not take into account that certain tags may be fundamentally correlated. Additionally, one might expect that given enough data, catalogues containing vote fractions for each tag could enable uncertainty and strength estimates, i.e. how 'bent' a source might be could be approximated by the fraction of citizen scientists which return the tag for that source.
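To make the vote fraction idea concrete, the following sketch computes, for a single source, the fraction of volunteers who returned each tag. It is an illustration under assumptions: the function name `vote_fractions` and the example votes are hypothetical, and no such catalogue exists in this work.

```python
from collections import Counter


def vote_fractions(votes):
    """Fraction of volunteers returning each tag for one source.

    votes: list of tag collections, one per volunteer.
    """
    if not votes:
        return {}
    # set(tags) ensures a volunteer counts at most once per tag.
    counts = Counter(tag for tags in votes for tag in set(tags))
    return {tag: n / len(votes) for tag, n in counts.items()}


# Hypothetical votes from four volunteers for a single source.
votes = [{"bent", "tail"}, {"bent"}, {"bent", "jet"}, {"tail"}]
fractions = vote_fractions(votes)
print(fractions["bent"])  # 0.75
print(fractions["tail"])  # 0.5
```

Under this reading, a source with a 'bent' fraction of 0.75 would be treated as more confidently or more strongly bent than one with a fraction of 0.25.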
To demonstrate the utility of such semantically selected samples, we here synthesise a catalogue and perform some example selections.
To synthesise our small data set into a catalogue, we consider a source to have been assigned a given semantic tag if at least one of its annotations maps onto one of the tags in our final taxonomy. In this way, we treat our source annotations as a tagged catalogue. Future catalogues will likely improve upon this synthesised catalogue through multiple individuals making direct use of available semantic tags.
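The rule described above (a source receives a tag if at least one of its annotations maps onto that tag) can be sketched as a set union restricted to the taxonomy. This is a minimal illustration, not the code used in this work; `FINAL_TAGS` lists the 22 tags from Section 4.4, while the function name and example annotations are hypothetical.

```python
# The 22 tags of the final taxonomy (Section 4.4).
FINAL_TAGS = {
    "amorphous", "asymmetric brightness", "asymmetric structure", "bent",
    "bridge", "compact", "core", "diffuse", "double", "edge brightened",
    "extended", "faint", "host", "hourglass", "jet", "lobe", "merger",
    "peak", "plume", "small", "tail", "traces host galaxy",
}


def synthesise_tags(annotation_tags):
    """Tags for one source: the union over its annotations, kept only if
    the tag belongs to the final taxonomy. Returned in alphabetical order."""
    tags = set()
    for mapped in annotation_tags:
        tags |= set(mapped) & FINAL_TAGS
    return sorted(tags)


# Hypothetical example: three annotations of one source, already mapped to
# tokens; 'wobbly' is not in the taxonomy and is therefore ignored.
annotations = [{"bent", "tail"}, {"tail", "wobbly"}, {"faint"}]
print(synthesise_tags(annotations))  # ['bent', 'faint', 'tail']
```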
We use this pseudo catalogue to demonstrate the impact that catalogues using a semantic taxonomy can have by considering two practical use cases. Firstly, we demonstrate the recovery of an existing population of radio galaxies in Section 5.1. We then highlight our ability to find morphological outliers in Section 5.2.

Detecting Traditional Populations
We demonstrate how traditional populations can be recovered by recovering star-forming galaxies. We do this by considering sources tagged with traces host galaxy. In practice we query our data for sources which were originally tagged with 'trace'. This is the closest proxy to the 'traces host galaxy' tag that we can produce with our current data (see Sections 4.2 and 4.3).
By simply considering sources with the 'trace' tag, we identify 38 objects. These are listed alongside their respective expert star-forming galaxy (SFG) classification and estimated tags in Table 2. This simple approach recovers 33 of the 45 sources labelled as SFGs by our experts (with at least 60% expert agreement).
Five sources with the 'trace' tag were not classified as SFGs in our expert classification scheme, i.e. sources 34-38 in Table 2. Images of these sources are presented in Figure 8, where their radio contours are shown overlaid on optical data from the Dark Energy Survey (DES; Abbott et al. 2018). Combining the deeper and higher resolution DES optical data into an RGB image aids visual interpretation compared to using DSS greyscale images. Now the primary choice for the EMU Zoo project, DES data were not initially used due to concerns about accessibility and coverage. Each of these sources is discussed individually below with respect to the optical morphology catalogues presented in Walmsley et al. (in prep.) made with the Zoobot package (as described in Walmsley et al. 2022), where the percentage of people who would have answered with a given feature is stated in parentheses after each catalogued feature.
Table 2. Sources selected for the traces host galaxy tag ('trace' in practice; see Section 5.1). Using an expert threshold of 60% (see Section 4.1) we confirm whether or not the sampled sources are SFGs. For each source, the tags are listed in alphabetical order.
Source 34 is a smooth (78%) cigar shaped (70%) galaxy (could be an edge-on galaxy). Source 35 is a featured (82%) face-on (98%; not edge-on) spiral (73%) galaxy without a bar (70%). Due to selection cuts, Source 36 was not included in the Walmsley et al. (in prep.) catalogues; however, the radio emission is expected to stem from at least two small galaxies bounded by the contours in the image. Source 37 is a smooth (67%) round (59%) galaxy. Source 38 is a featured (91%) face-on (98%; not edge-on) spiral (98%) galaxy. It does not have a bar (71%) but has a small bulge (80%) and tightly wound spiral arms (83%). Consequently, of the five sources initially not classified as SFGs, when reconsidered with the deeper DES (Abbott et al. 2018) images and optical morphology catalogues, it is clear that at least two sources (35 and 38) are star-forming spiral galaxies. It is likely that the experts simply did not feel confident in classifying these sources as SFG with the limited resolution and sensitivity of the DSS data.
Twelve additional sources that were classified as SFGs with 60% agreement amongst our experts, but that did not have the 'trace' tag assigned to them, are presented in Table 3. If the selection had been made on 'traces host galaxy', rather than 'trace', then four of these sources (sources 1-4) would have been included in Table 2. This is because 'traces host galaxy' is derived through multiple tokens such as 'counterpart' (see Section 4.3).
With a consistent use of the 'traces host galaxy' tag, the remaining eight SFG sources from Table 3 (sources 5-12) are likely to be tagged as such, as their radio emission does largely trace the respective optical host (see Appendix B).
Of the 45 sources our experts classified as SFGs, the selection of traces host galaxy would have recovered at least 82% (33 and 4 from Tables 2 and 3 respectively) of these sources, making it a strong candidate to select SFG populations from future catalogues and highlighting how such a catalogue can be used.
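The figures quoted in this section can be checked with a few lines of arithmetic. The sketch below uses only the counts stated above; the labels 'completeness' and 'purity' are ours and are used here for illustration rather than taken from the text.

```python
n_expert_sfg = 45         # sources labelled SFG by experts (at least 60% agreement)
n_trace_selected = 38     # sources carrying the 'trace' tag
n_trace_sfg = 33          # 'trace' sources that are also expert SFGs
n_extra_from_table3 = 4   # further SFGs recovered by 'traces host galaxy'


def fraction(part, whole):
    """Simple ratio, kept as a function for readability."""
    return part / whole


completeness_trace = fraction(n_trace_sfg, n_expert_sfg)
completeness_full = fraction(n_trace_sfg + n_extra_from_table3, n_expert_sfg)
purity_trace = fraction(n_trace_sfg, n_trace_selected)

print(f"{completeness_trace:.0%}")  # 73%, expert SFGs recovered by 'trace'
print(f"{completeness_full:.0%}")   # 82%, with 'traces host galaxy'
print(f"{purity_trace:.0%}")        # 87%, 'trace' sources that are expert SFGs
```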

Rare Source Detection
Figure 8. Sources from Table 2 tagged with trace (selected as a proxy for 'traces host galaxy') which are not classified as SFGs by experts. EMU radio brightness contours matching those in Figure 1 with DES cutouts, combining g, r, and i band data into an RGB image following Lupton et al. (2004). Sources are annotated with their source numbers and (J2000) coordinates as presented in Table 2. Each source is shown to the same angular scale, as highlighted by the radio beam size in the bottom left of each panel.
Table 3. Sources which had expert SFG classifications (above 60% agreement) but were not selected through 'trace' as described in Section 5.1 and used for Table 2.
No. | Coordinates (J2000) | Tags
 | 21h 02m 57s −54° 29′ 35″ | amorphous, asymmetric structure, bent, core, diffuse, faint, host, peak, tail
We here highlight the flexibility and practicality of our taxonomy by considering a combination of tags that a radio astronomer might consider to be abnormal. We query to find a source which a) appears to be a merger, b) presents bridged features and c) is not faint. The result is a single entry: source 17 from Table 2, shown in Figure 9.
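A selection of this kind (tagged 'merger' and 'bridge', but not 'faint') can be expressed as a simple set query over a tagged catalogue. The sketch below is illustrative only: the function `select` and the catalogue entries are hypothetical and do not correspond to rows of Table 2.

```python
def select(catalogue, required=(), excluded=()):
    """Return source IDs whose tags include all `required` tags
    and none of the `excluded` tags."""
    required, excluded = set(required), set(excluded)
    return [
        source_id
        for source_id, tags in catalogue.items()
        if required <= tags and not (excluded & tags)
    ]


# Hypothetical tagged catalogue: source ID -> set of semantic tags.
catalogue = {
    "A": {"merger", "bridge", "extended"},
    "B": {"merger", "bridge", "faint"},
    "C": {"bridge", "double"},
    "D": {"merger", "compact"},
}

print(select(catalogue, required={"merger", "bridge"}, excluded={"faint"}))
# ['A']
```

Because each tag is treated as an independent set membership, arbitrary combinations of required and excluded tags can be queried without defining a new class for each combination.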
This source is, as expected, an unusual object requiring expert follow-up. It is a composite of emission from a flocculent face-on spiral 2MASS galaxy (Skrutskie et al. 2006), plus apparently associated emission to its SE with no obvious separate optical counterpart. The burned out (blue) object at the southern edge of the contours is listed as two stars in the Gaia catalogues (Gaia Collaboration et al. 2016), and has no obvious connection to the radio structure. A very careful evaluation of the chances for serendipity, and the possible physical nature of this source, are beyond the scope of this paper.
However, the control provided by these semantic tags has allowed for the selection of an unusual object worthy of further study.

DISCUSSION AND IMPACT

Taxonomy
The proposed semantic tags will find immediate use in the RGZ EMU citizen science project (Tang & Vardoulaki et al. in prep.). For future implementations of science tags being assigned to sources, we suggest that the community uses a hash symbol to denote the use of a tag (e.g. '#compact'), as suggested in Rudnick (2021), to distinguish tags from traditional classification frameworks. This should prove useful to the legibility and analysis of future catalogues and works. This is the first step of the tagging framework in radio astronomy morphology. The taxonomy is intentionally designed to be extensible, such that when the community decides a feature of interest is not being captured by the current version of the taxonomy it can be updated to include the appropriate tag. The presented set of semantic tags is a first step towards mapping common features of radio morphologies using plain English annotations.
Figure 9. Rare source selected from the synthesised initial semantic tag catalogue by querying "hourglass \ (amorphous ∪ traces host galaxy ∪ bent)" (set theory notation). EMU radio brightness contours matching those in Figure 1 on top of a DES cutout prepared as in Figure 8.
The specificity of the tags in comparison to the current classification scheme may be a concern to some astronomers. The inconsistency with which current radio morphological classifications are defined means that the language currently in use does not have the desired specificity either, regardless of how specific a term is in an individual astronomer's mind. Furthermore, we highlight that terms are expected to be used more consistently and clearly when selected directly rather than being derived through annotations.
A science class mapping using the semantic taxonomy is one of the goals of the RGZ EMU team. This mapping will be constructed towards the end of the RGZ EMU project. This mapping should be able to provide the traditional classification of objects by predicting them based on the tags that citizen scientists have assigned to objects. It will include the most common radio morphology science classes which the community is more accustomed to. While it is hoped that the tags will span the full space of possible scientific classifications, there may be cases where the provided mapping does not cover a science case perfectly in its current form.
Regardless, the ability to combine the tags to select semantic populations, as demonstrated in Section 5, will enable feature specific population studies and sources to be omitted if they present features that are not relevant to a given science case.

Experiment and Implementation
As described, the experimental setup of this work has a number of limitations, including the small size of the data set. Given the degree of variation present across radio galaxy morphologies, it is not possible to capture all abstract science classes of radio galaxies completely. For example, as stated in Section 3.2.3, ORCs are so rare that we could not train on them, and thus they are not taken into consideration in the weighted Shapley values. Such biases are therefore currently passed onto the derived and proposed semantic taxonomy.
Additionally, abstract science classes based primarily on morphology do not directly encode other physically relevant information. Given sufficient information, it may be more informative to derive semantic tags directly from physical parameters, e.g. active galactic nuclei accretion rates. This would encourage the derived semantic tags to carry information regarding the physics itself, rather than a proxy, e.g. abstract science classes. In practice, collecting a large enough sample with which to do this is not feasible at this time, but should be considered in future approaches and other domains.
We use a pre-trained NLP model in our approach to this task. This model is a limitation even though it is also a key factor in the success of our implementation and experiments. For instance, the model we used only returns individual 'tokens', and not fully grammatically correct terms or phrases. We amend this through manual inspection and adjustments; however, we recognise that this is not a scalable solution and will not be possible in all situations. We hope that future approaches will find solutions to this problem. The NLP literature is currently developing at a significant pace. We therefore urge future iterations and applications of this approach to actively reconsider which pre-trained model is used. We expect that by using the most up to date pre-trained model, future implementations will have more robust and versatile encodings of annotations and tags.
Finally, we note that the semantic radio morphology taxonomy derived in this work is inherently bound to the instrument with which the radio galaxies were imaged (ASKAP). Data from a more sensitive (e.g. SKA; Dewdney et al. 2009) or higher resolution (e.g. LOFAR long baseline; Morabito et al. 2022) instrument might require additional or different semantic tags. This would be simple to implement under the proposed tagging paradigm, as it would be sufficient to supplement the existing taxonomy with the appropriate semantic tags.

Semantic Taxonomies
The proposed method, and respective task of deriving semantically meaningful tags (first outlined in Bowles et al. 2022), have the potential to impact other fields. We therefore discuss their potential impact and limitations in a domain agnostic tone.
The move away from technical classes to semantic language capturing features may have a broad impact across a number of technical fields, especially where complex classes have been defined and the field has since moved beyond those initially valuable classification schemes, as is the case in radio astronomy. This could include any feature rich data product, especially where features are often repeated across classes.
The collaborative nature of science may be improved by the use of simplified and semantic language. Complex ideas are often shrouded in equally complex terminology, which can be highly effective when experts communicate with one another, but quickly becomes a hindrance to communicating in any other situation. Capturing features of an object using dictionary level definitions will lower the barrier to entry for established researchers who are not domain experts to study the features captured by the semantic language. This can be a significant benefit, where domain specific terminology could be an active barrier to communication in inter-disciplinary research. Additionally, the use of plain English should enable scientific collaborations within a given field, i.e. between radio astronomy domain experts and astronomers who are not experts in radio morphology.
Outreach efforts are also likely to benefit from the change to language. We hope that the simple language will reduce the barrier to entry for those who would like to become experts in the respective field. This will have direct impact on the accessibility of technical fields as a whole including communities who have not had much practice in the use of scientific language. This is in perfect alignment with the educational aspects of citizen science, which are often used to engage underprivileged communities in science with the aim to inspire and empower. The hope of citizen science outreach is that students who have seen, interacted with, and subsequently added to the international body of science feel empowered to pursue STEM subjects. Clearer language will improve engagement to support this goal.
The science in citizen science projects is also expected to benefit from the new language. Easily understood concepts presented by the simplified language should lead to improved usage of tags for a given source (Wald et al. 2016). Additionally, we hope that the reduction in training time and effort for citizen scientists will lower the labelling cost for projects as a whole by reducing the labelling cost for individual citizen scientists.
Deep learning and machine learning models currently learn to predict scientific classes from images (or similarly high dimensional data). Learning to encode these classes can be quite challenging as the concepts and definitions represented by these classes can be both abstract and contentious. This may be partially addressed by training models to encode a semantic taxonomy instead, as models would learn the features instead of abstracted classes. Derived semantic taxonomies present more clearly defined concepts and may encourage models to learn a more robust feature space. This could improve the effectiveness of the encoded features of a model for various other tasks. Even in a simple use case, the more robust feature space may allow models to be more generalisable and less brittle, i.e. transferable or fine-tunable to differing data sets and tasks, which would be of immediate benefit to a number of applications.
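One way a model could be trained on a semantic taxonomy rather than on a single class is to treat the tags as a multi-label target. The sketch below encodes the tags of a source as a multi-hot vector over the 22 tags of Section 4.4. It is a minimal illustration under assumptions: the function name `multi_hot` and the choice of encoding are ours, and no such model is trained in this work.

```python
# The 22 tags of the final taxonomy (Section 4.4), in alphabetical order.
TAXONOMY = [
    "amorphous", "asymmetric brightness", "asymmetric structure", "bent",
    "bridge", "compact", "core", "diffuse", "double", "edge brightened",
    "extended", "faint", "host", "hourglass", "jet", "lobe", "merger",
    "peak", "plume", "small", "tail", "traces host galaxy",
]


def multi_hot(tags, taxonomy=TAXONOMY):
    """Encode a collection of tags as a 0/1 vector aligned with `taxonomy`."""
    tags = set(tags)
    unknown = tags - set(taxonomy)
    if unknown:
        raise ValueError(f"tags not in taxonomy: {sorted(unknown)}")
    return [1 if tag in tags else 0 for tag in taxonomy]


vector = multi_hot({"bent", "tail"})
print(len(vector))  # 22
print([i for i, v in enumerate(vector) if v])  # [3, 20]
```

Unlike a one-hot class label, several entries can be set at once, so a source that is both 'bent' and 'tail' does not need a dedicated class.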
The anglocentric nature of this work was alluded to previously. Although the language improvements may benefit many populations, they still marginalise those who do not speak English natively. The RGZ EMU team is considering a number of strategies to mitigate the effect of anglocentric labelling, including translation into multiple languages. However, we recognise that there are broader complex issues around use of language in science and recommend this as a topic of discussion in future work.
Finally, the ethics of deriving terms from an unstructured data set include careful consideration of the potential presence and impact of malicious agents. Caution is therefore advised when applying this process to other fields. In this work, the data set used was small enough that each annotation was inspected individually.

CONCLUSIONS
In this work, we derive a flexible English taxonomy for radio astronomy and the respective morphological tagging. The proposed taxonomy of 22 semantic tags is
• the product of experiments collecting expert classifications and plain English annotations on radio source morphologies using selected cutouts from the EMU pilot survey,
• reduced to a set of 10 terms to maximise its effectiveness within citizen science projects, starting with RGZ EMU,
• derived analytically through a novel method with minimal clarifying intervention.
We demonstrate the first effective use cases of the newly derived semantic morphology taxonomy. We show that using the tags we can recover
• known scientific morphologies, and
• rare sources with abnormal morphologies.
The method which was developed, detailed, and applied in this work is domain agnostic. The method
• provides a framework through which plain English annotations of complex ideas can return a ranked taxonomy on a given subject,
• can be applied to any scenario where language is a barrier to future research,
• can increase the accessibility of complex scientific concepts by distilling concepts into simpler English for the public, collaborators, and citizen scientists.
The potential scientific impacts, applications, and communication benefits of this method and taxonomy are discussed at length in Section 6.
Supercomputing Applications at the University of Illinois at Urbana-Champaign, the Kavli Institute of Cosmological Physics at the University of Chicago, the Center for Cosmology and Astro-Particle Physics at the Ohio State University, the Mitchell Institute for Fundamental Physics and Astronomy at Texas A&M University, Financiadora de Estudos e Projetos, Fundação Carlos Chagas Filho de
Figure B1. Composite images of SFG sources not captured by the 'trace' tag, as discussed in Section 5.1. EMU contours following Figure 1 with optical DES RGB backgrounds as in Figure 8. Cutout centre coordinates are presented on the image alongside their respective source numbers associated with Table 3 and Table B1. The radio beam size for all panels is shown in the lower left of the figure; all cutouts are 3 × 3 .