Morphological Classification of Radio Galaxies with wGAN-supported Augmentation

Machine learning techniques that perform morphological classification of astronomical sources often suffer from a scarcity of labelled training data. Here, we focus on the case of supervised deep learning models for the morphological classification of radio galaxies, which is particularly topical for the forthcoming large radio surveys. We demonstrate the use of generative models, specifically Wasserstein GANs (wGANs), to generate data for different classes of radio galaxies. Further, we study the impact of augmenting the training data with images from our wGAN on three different classification architectures. We find that this technique makes it possible to improve models for the morphological classification of radio galaxies. A simple Fully Connected Neural Network (FCN) benefits most from including generated images into the training set, with a considerable improvement of its classification accuracy. In addition, we find it is more difficult to improve complex classifiers. The classification performance of a Convolutional Neural Network (CNN) can be improved slightly. However, this is not the case for a Vision Transformer (ViT).


INTRODUCTION
Radio galaxies are galaxies that emit a large fraction of their electromagnetic output in the radio band. The structures visible in radio wavelengths are typically larger than the structures visible in optical wavelengths. Radio galaxies are a class of active galactic nuclei (AGN) and are powered by supermassive black holes at the centres of galaxies. The extended emission is produced by synchrotron radiation of highly relativistic particles accelerated by the AGN. Studying radio galaxies helps to understand the effects of massive black holes on their environment (see e.g. McNamara & Nulsen (2007)). The jets of highly energetic particles emitted by giant radio galaxies potentially play a major role in the creation of cosmic magnetic fields (Vazza et al. 2022).
Many new radio sources will be discovered with the new generation of radio telescopes (e.g. LOFAR, MeerKAT, and in the future the SKA; van Haarlem et al. 2013; Jonas & MeerKAT Team 2016; Carilli et al. 2004). Processing the incoming data is one of the biggest challenges in radio astronomy, due not only to the enormous amount of data, but also to the higher source density resulting from the improved sensitivity of the instruments. Novel techniques are required for this purpose. For instance, the SKA data challenges have demonstrated the difficulties of source finding in SKA data (Bonaldi et al. 2021). Deep learning has been used to automate processes in radio astronomical data reduction, for example the automatic flagging of data (see e.g. Mosiane et al. (2017)). Another example is the work by Mesarcik et al. (2020), who used a Variational Autoencoder (VAE) in combination with other methods to automatically inspect data and diagnose system health for modern radio telescopes. However, supervised algorithms commonly require large amounts of labelled training data, which are not always available.
However, the existing number of radio sources with morphological labels is limited (e.g. the MiraBest data set contains 1254 FRI, FRII and hybrid FR sources; Porter 2020). These class labels are typically extracted from catalogues created and curated manually by experts. Small data sets used in the training of deep learning models for galaxy classification can be enlarged by data augmentation (Maslej-Krešňáková et al. 2021), e.g. by applying random rotations and reflections to the images (classical augmentation). A different approach based on equivariance implements the symmetry constraints of the problem directly in the construction of the model. This may help classifiers to capture symmetries without relying exclusively on augmentation and may be particularly useful for problems with sparse data.
In this work, we investigate a novel application of generative models to enhance the available training sets. For this augmentation technique, multiple neural networks are combined to learn the underlying distribution of a data set. We focus on the task of classifying different morphological types of radio galaxies. The morphological classification scheme by Fanaroff-Riley is fundamental for such applications (Fanaroff & Riley 1974). For the class FRI, the unique maximum of the radio emission resides in the centre of the source and the surface brightness decreases along the jets. For FRII sources, the two maxima of the radio emissions are located at the edges of the jets and the surface brightness in the centre is lower. As radio sources have a large variety of structures, we consider two more classes. Unresolved and point sources are contained in the Compact class. The Bent class consists of sources for which the angle between the jets differs significantly from 180 degrees. The two sub-types Narrow-Angle Tail (NAT) and Wide-Angle Tail (WAT) are further discriminated by the angle, but are fully subsumed in the Bent class for this study. As in Alhassan et al. (2018); Samudre et al. (2021), we study a four-class classification problem, including bent-tail and compact sources in addition to the classes FRI and FRII of Fanaroff & Riley (1974). Figure 1 illustrates the considered classes (FRI, FRII, Compact, and Bent).
Other studies probe the use of generative models to create images of radio galaxies (Ma et al. 2018; Bastien et al. 2021). These studies are based on VAEs. Generative adversarial networks (GANs) have been applied to astrophysical images in Schawinski et al. (2017). For a semi-supervised GAN application to radio pulsars see Balakrishnan et al. (2021). In Hackstein et al. (2023), various evaluation metrics are used to compare different generative models trained on optical galaxy images.
In this study, we investigate whether different radio galaxy classifiers can be improved when training is supported by providing additional data generated with a Wasserstein Generative Adversarial Network (wGAN). For similar approaches from different fields see, for example, Frid-Adar et al. (2018); Zhu et al. (2017); Gowal et al. (2021). We extend our framework presented in Kummer et al. (2022) to handle larger ratios between real and generated images. Additional images are only generated when they are needed during training. As before, we start with a simple model, namely a Fully Connected Neural Network (FCN). In addition, we apply wGAN-supported augmentation to a CNN and a Vision Transformer (ViT), see Dosovitskiy et al. (2020).

Table 1. Number of sources per class and per split.

                       FRI   FRII  Compact  Bent  Total
5-fold cross train     316    659      232   198   1405
5-fold cross valid      79    165       59    50    353
test                   100    100      100   100    400
total                  495    924      391   348   2158
relative frequency    0.23   0.43     0.18  0.16      1
The long-term goal is to use classification models to process incoming data from new radio telescopes. For this purpose, classification models need to generalise particularly well. A common problem in astronomy is the scarcity of labelled data in the face of large amounts of new data to process. This is a very different situation from, for instance, particle physics, where simulations are highly fine-tuned and experiments are constantly repeated. In particular, for forthcoming radio surveys, and even for the majority of sources in FIRST, no morphological labels are available. As a result, unsupervised, semi-supervised and self-supervised methods have gained attention, without reaching the performance of supervised methods (Mostert et al. 2021; Slijepcevic et al. 2022a,b). The current classification scheme of radio galaxies and our physical interpretation will be challenged by new radio surveys. For instance, Mingo et al. (2019) detected a large population of low-luminosity FRII sources in the LOFAR Two-Metre Sky Survey (LoTSS; Shimwell et al. 2019; Shimwell et al. 2022) that is not expected from the conventional FR distinction based on radio luminosity. Discoveries of rare morphologies can help to extend our understanding of radio sources, but are potentially hindered by supervised learning techniques. Unsupervised methods such as self-organising maps can be used efficiently to discover such rare morphologies (Mostert et al. 2021).
This paper is organised as follows: In Section 2, we introduce the data set used for training, validation, and testing. The generative model and its implementation are described in Section 3. The training procedure and the assessment of image quality are discussed in Section 4. The results of the comparison between only classical and classical plus wGAN-supported augmentation for different classifiers are presented in Section 5 before we conclude in Section 6.

DATA
We combine different catalogues (Gendre & Wall 2008; Gendre et al. 2010; Miraghaei & Best 2017; Capetti et al. 2017a,b; Baldi et al. 2017; Proctor 2011) that characterise radio sources from the FIRST survey to create a data set of 2158 radio galaxy images with morphological labels. The labelling in the catalogues is typically performed by experts by considering radio images and the corresponding optical counterparts. We group radio sources into four classes, namely FRI, FRII, Compact and Bent. The source coordinates are compared between catalogues to remove duplicates. Sources that appear with different labels are regarded as ambiguous and are removed entirely. More details on the acquisition of the data set can be found in Griese et al. (2023). The data set is published on Zenodo (Griese et al. 2022) and on GitHub (https://github.com/floriangriese/RadioGalaxyDataset). The radio galaxy images of the FIRST survey are collected from the virtual observatory SkyView (https://skyview.gsfc.nasa.gov). We start from the original images with a size of (300 × 300) pixels. Then we adopt the preprocessing procedure from Aniyan & Thorat (2017). In particular, we set all pixel values below three times the local RMS noise to the value of this threshold. We apply classical augmentation to all images during training, consisting of random rotations and reflections of the base image. This augmentation is done every time an image comes up in the training loop, so that the augmentation factor simply depends on the number of iterations of the training procedure. Consequently, classical augmentation retains the class imbalance present in the base image set. The augmented images are then cropped to the input size of our generative network, i.e. to (128 × 128) pixels. Subsequently, the pixel values are rescaled to the range [-1, 1] to represent floating-point greyscale images.
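The preprocessing chain above can be sketched as follows. This is a minimal illustration, not the published pipeline: it assumes a precomputed local RMS estimate is passed in, and it restricts the random rotations to multiples of 90 degrees for simplicity.

```python
import numpy as np

def preprocess(img, rms, size=128, rng=np.random.default_rng()):
    """Sketch of the preprocessing: clip faint pixels, augment, crop,
    and rescale to [-1, 1].

    img : 2D array, e.g. a (300, 300) FIRST cutout
    rms : local RMS noise estimate (assumed to be given here)
    """
    # Set all pixel values below 3x the local RMS noise to that threshold.
    img = np.maximum(img, 3.0 * rms)
    # Classical augmentation: random rotation (multiples of 90 degrees
    # here for simplicity) and random reflection.
    img = np.rot90(img, k=int(rng.integers(4)))
    if rng.random() < 0.5:
        img = np.fliplr(img)
    # Central crop to the generator input size.
    cy, cx = img.shape[0] // 2, img.shape[1] // 2
    h = size // 2
    img = img[cy - h:cy + h, cx - h:cx + h]
    # Rescale pixel values to [-1, 1] for floating-point greyscale images.
    lo, hi = img.min(), img.max()
    return 2.0 * (img - lo) / (hi - lo + 1e-12) - 1.0
```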
We separate 100 sources per class from the data set for the final evaluation of our models. For validation purposes during training (e.g. choosing the best model), we use 5-fold cross-validation. Therefore, we do not need a separate validation set and lose less training data. In particular, we split the training set into five blocks and perform five separate training runs. For each of these runs, one of the five blocks is used as the validation set and the remaining four blocks form the corresponding training set. The quantities per class and per split are shown in Table 1.
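The splitting procedure can be sketched as follows; `five_fold_splits` is a hypothetical helper for illustration, not part of the published code.

```python
import numpy as np

def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, valid_idx) index pairs: each fold serves once
    as the validation set while the remaining folds form the training set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, n_folds)
    for k in range(n_folds):
        valid = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train, valid
```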

WASSERSTEIN GAN
The ability to learn representations of the underlying statistical distributions of data sets makes generative models a powerful tool for the creation of additional data points. In particular, sampling from those representations makes it possible to speed up conventional simulation techniques significantly and may be useful for further subsequent treatments (Buhmann et al. 2021, 2022). Three different categories of generative models are well-established: GANs, VAEs, and flow-based models. Diffusion models represent a relatively new development in this area. In this work, we focus on GANs. They consist of two neural networks: a generator that generates fake images from a noise vector and a discriminator that discriminates between real and fake images. This architecture was first introduced in Goodfellow et al. (2014); Salimans et al. (2016). In a two-player minimax game, the generator learns to create fake images, which become less and less distinguishable from the real ones in the course of the training. The loss function for this setup reads (Goodfellow et al. 2014; Salimans et al. 2016)
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right],$$
where $x$ represents real samples and $G(z) = \tilde{x}$ generated samples. For this project, we employ a variant of the standard GAN setup called wGAN that uses the Wasserstein-1 metric, also referred to as the Earth Mover's distance, as the main term in the loss function. This loss function is calculated as
$$L = \mathbb{E}_{\tilde{x} \sim P_g}\left[D(\tilde{x})\right] - \mathbb{E}_{x \sim P_r}\left[D(x)\right],$$
where $D$ denotes a 1-Lipschitz function that is learned during the training procedure. The discriminator of a standard GAN is thus transformed into a critic, used to estimate the Wasserstein distance between real and generated images. Hence, the absolute value of the loss function is correlated with the image quality, motivating the name change. Additionally, the training of wGANs is often more stable and more likely to converge than standard GAN setups. To approximate the Wasserstein-1 metric by means of a critic network, it has to be ensured that the 1-Lipschitz constraint is fulfilled.
This is achieved by adding a gradient penalty term to the loss function, as in Gulrajani et al. (2017),
$$\lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right],$$
for random samples $\hat{x} \sim P_{\hat{x}}$, drawn along straight lines between pairs of real and generated samples.
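A minimal PyTorch sketch of the resulting critic objective (Wasserstein estimate plus gradient penalty) might look as follows; `critic_loss_wgan_gp` is an illustrative helper, and the critic signature `critic(images, labels)` is an assumption, not the exact interface of the published code.

```python
import torch

def critic_loss_wgan_gp(critic, real, fake, labels, lam=10.0):
    """Wasserstein critic loss with gradient penalty (Gulrajani et al. 2017).

    critic(images, labels) -> unbounded real-valued score per image.
    The penalty enforces the 1-Lipschitz constraint at random
    interpolations x_hat between real and generated samples.
    """
    # Wasserstein estimate: E[D(fake)] - E[D(real)]; the critic minimises this.
    loss = critic(fake, labels).mean() - critic(real, labels).mean()
    # Random interpolation points x_hat ~ P_x_hat.
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    scores = critic(x_hat, labels)
    grads = torch.autograd.grad(
        outputs=scores.sum(), inputs=x_hat, create_graph=True)[0]
    grad_norm = grads.flatten(1).norm(2, dim=1)
    penalty = ((grad_norm - 1.0) ** 2).mean()
    return loss + lam * penalty
```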
Since we work with image data, it has proven most promising to construct the wGAN setup based on convolutional layers (Radford et al. 2015). The generator receives a noise tensor of size 100 × 1 and a class label and, through multiple layers of 2D transposed convolution operators, enlarges this to a 128 × 128 tensor, consistent with the dimensions of the real images. The critic is given either real or generated images, as well as the class label. The output of the critic is a single real value, which represents the belief of the critic that the image is real. Generator and critic are trained alternately, with five training cycles of the critic per training cycle of the generator. When training the generative model, it is necessary to apply classical augmentation such that the symmetries of the training set are also present in the generated data sets, and to avoid introducing a bias due to the limited number of training images. Morphologies of radio galaxies are diverse and result in very different images. Consequently, it is reasonable to condition the networks on the class label, such that a combination of image and class label is provided to the networks. In particular, this allows applying supervised learning techniques to the output of the generator. For our setup, this is achieved for the generator by applying a 2D transposed convolution operator to a matrix of image dimensions filled with the class label. The resulting layer is then concatenated with the first transposed-convolution layer of the noise tensor. 2D batch normalisation and ReLU (Rectified Linear Unit) activation functions are used. The concatenated tensor is then passed through five additional 2D transposed convolutions, where no normalisation or activation is applied after the last layer. Instead, the individual pixel values are clipped to [-1, 1] for conversion to greyscale.
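The label conditioning of the generator input can be sketched as follows; the layer sizes are illustrative and do not reproduce the exact architecture of Table B2.

```python
import torch
import torch.nn as nn

class ConditionedStem(nn.Module):
    """Sketch of the conditioning described above: a label-filled map is
    transpose-convolved and concatenated with the noise path."""
    def __init__(self, n_classes=4, noise_ch=100):
        super().__init__()
        # Noise path: (B, 100, 1, 1) -> (B, 256, 4, 4)
        self.noise_path = nn.Sequential(
            nn.ConvTranspose2d(noise_ch, 256, 4, 1, 0),
            nn.BatchNorm2d(256), nn.ReLU())
        # Label path: (B, 1, 1, 1) label-filled map -> (B, 64, 4, 4)
        self.label_path = nn.Sequential(
            nn.ConvTranspose2d(1, 64, 4, 1, 0),
            nn.BatchNorm2d(64), nn.ReLU())

    def forward(self, z, label):
        # z: (B, 100, 1, 1); label: (B,) integer class indices
        lbl = label.float().view(-1, 1, 1, 1)  # map filled with the label
        return torch.cat([self.noise_path(z), self.label_path(lbl)], dim=1)
```

The concatenated tensor would then feed the remaining transposed-convolution layers that upsample to 128 × 128.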
The critic is built analogously, but uses 2D convolutional layers, resulting in a single output node representing the critic score for image quality. Here, layer normalisation and Leaky ReLU activation functions are used, except for the last layer. The Leaky ReLU activation function is an attempt to avoid the "dead neuron" phenomenon of the pure ReLU function, where all gradient information is lost if the input is negative. This makes the critic more stable against sub-optimal starting points. Layer normalisation computes the normalisation over the features instead of over the batch. A schematic of the wGAN setup can be found in Figure 2. For more details on the architectures see Table B1 and Table B2.

Training
For each choice of training and validation data in the cross-validation procedure, a wGAN training run is launched on the corresponding training set. The training is performed on a single NVIDIA A100 GPU provided by the Maxwell cluster at DESY for 40k generator iterations, i.e. weight updates. A batch size of 400 is chosen and one training run takes roughly seven hours to complete. The choice of the batch size did not have a strong impact on the performance of the model, so we chose a size that still comfortably fits into the GPU's memory, while being large enough to fully profit from the computing speed-up of larger batches. The generator and critic weights are saved every 250 iterations, allowing us to scan for the best training state later on, as described in the following section. Choosing such an iteration for every model and training run is necessary, as wGAN training runs generally do not converge fully but rather fluctuate around an optimal value. This means that it is not instructive to simply use the final state of the model after training; instead, other metrics need to be studied to choose an optimal working point. When comparing different model setups, we are only interested in the performance of these optimal working points. All models are implemented and trained in PyTorch (Paszke et al. 2019). For an overview of training details we refer to Table B5. The choice of hyperparameters is inspired by values obtained by Buhmann et al. (2021). With the exception of the learning rate, the hyperparameters have not been further optimised.

Evaluation of image quality
In this section, we present images created using the generator of the wGAN and examine the quality of the generated images in several ways.

Distribution-based comparison
We define a set of distributions to compare generated images with the training data set, in order to determine the quality of the generated images and thus to find the best-performing training iteration. This includes normalised histograms of the pixel intensities, of the number of pixels with an intensity greater than zero, and of the sum of intensities. These histograms are compared for each class individually and the relative mean absolute error (RMAE) between a generated set of 10k images and the training set is computed. The RMAEs for the different distributions are summed to yield a single figure-of-merit (FOM); the wGAN training iteration with the lowest FOM value is used in the following as the best model. This procedure is followed for each of the four classes separately, i.e. we allow a different iteration of the generator training to yield the best model for each class. The chosen distributions are commonly used for images (e.g. in photography), but it is important to note that they do not specifically contain information on the shape of the radio galaxies within these images. The choice of RMAE is based on its very fast computation and robustness against empty bins, while we acknowledge that other test metrics could be used.
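One plausible implementation of this figure-of-merit is sketched below, assuming RMAE is defined as the mean absolute difference between the normalised histograms, relative to the mean absolute value of the reference histogram; the exact definition used in the paper may differ.

```python
import numpy as np

def rmae(hist_real, hist_gen):
    """Relative mean absolute error between two histograms, after
    normalising each to unit sum; the small offset keeps it robust."""
    h_r = hist_real / hist_real.sum()
    h_g = hist_gen / hist_gen.sum()
    return np.mean(np.abs(h_r - h_g)) / (np.mean(np.abs(h_r)) + 1e-12)

def figure_of_merit(real_imgs, gen_imgs, bins=50):
    """Sum the RMAEs of the three distributions described in the text:
    pixel intensities, counts of pixels > 0, and intensity sums."""
    pairs = [
        (real_imgs.ravel(), gen_imgs.ravel()),                   # intensities
        ((real_imgs > 0).sum(axis=(1, 2)),
         (gen_imgs > 0).sum(axis=(1, 2))),                       # pixels > 0
        (real_imgs.sum(axis=(1, 2)), gen_imgs.sum(axis=(1, 2))), # sums
    ]
    fom = 0.0
    for r, g in pairs:
        lo, hi = min(r.min(), g.min()), max(r.max(), g.max())
        h_r, _ = np.histogram(r, bins=bins, range=(lo, hi))
        h_g, _ = np.histogram(g, bins=bins, range=(lo, hi))
        fom += rmae(h_r.astype(float), h_g.astype(float))
    return fom
```

The wGAN checkpoint with the lowest `figure_of_merit` per class would then be selected as the best model.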
Arbitrarily chosen examples of these distributions are shown in Figure 3, where the distribution of the real images is shown in orange and the distribution of the generated images in blue. The uncertainty for each bin is given by the square-root of entries in that bin before normalisation. The bottom panels in this figure show the per-bin divergence between the distributions, where absolute deviations larger than 1 are indicated by the corresponding value written in boxes. Here, only examples from the first cross-validation fold (of five) are shown.
Overall, the distributions of the generated images tend to follow the distribution of the real images. Nevertheless, the generated images have difficulties in recreating very low, but non-zero, intensities. This can be seen for pixel values between 1 and 20 in Figure 3a, which directly translates into under-representing the number of pixels with an intensity > 0 in Figure 3c.

Visual comparison
In order to get a visual idea of image quality, we generated a set of 5k images per class and compared them to the full training data set over all cross-validation folds. The images are rotated so that their principal components are aligned. Subsequently, we compute the pixel-by-pixel difference for all possible pairs of real and generated images. All classes also include a few difficult-to-define sources with rather small spatial extension that are easy to emulate but do not demonstrate the generator's capability of reproducing the more interesting extended sources. Thus, we only consider images with an intensity sum of at least 15k (5k) for the extended (compact) radio galaxies. We show the resulting closest pairs for each class in Figure 4. By eye, the generated images appear very similar to their real counterparts, indicating a good performance of the generator setup in terms of fidelity. In addition, the diversity of the generated data is crucial for the study in Section 5. To also get an impression of this diversity, we show a random set of generated images in Appendix A.
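The matching procedure can be sketched as follows. The intensity-weighted principal-axis alignment is an assumption about the implementation details, and `closest_pair` is a hypothetical helper.

```python
import numpy as np
from scipy.ndimage import rotate

def align_principal_axis(img):
    """Rotate an image so its intensity-weighted principal axis is
    horizontal (a sketch of the alignment step described above)."""
    y, x = np.indices(img.shape)
    w = np.clip(img, 0, None) + 1e-12
    mx, my = np.average(x, weights=w), np.average(y, weights=w)
    cov = np.cov(np.vstack([x.ravel() - mx, y.ravel() - my]),
                 aweights=w.ravel())
    evals, evecs = np.linalg.eigh(cov)
    # Eigenvector of the largest eigenvalue gives the principal axis.
    angle = np.degrees(np.arctan2(evecs[1, -1], evecs[0, -1]))
    return rotate(img, angle, reshape=False, order=1)

def closest_pair(real_imgs, gen_imgs):
    """Indices of the (real, generated) pair with the smallest
    pixel-by-pixel squared difference after alignment."""
    r = np.stack([align_principal_axis(i) for i in real_imgs])
    g = np.stack([align_principal_axis(i) for i in gen_imgs])
    d = ((r[:, None] - g[None, :]) ** 2).sum(axis=(2, 3))
    return np.unravel_index(d.argmin(), d.shape)
```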

Classifier-based comparison
Next, we use a CNN trained solely on the data set of real images to assess the image quality further. We compare the performance of the same classifier evaluated on the real test set and on a set of generated images. The architecture of the CNN used for this experiment is summarised in Table B3 and the hyperparameters in Table B5. A comparison of the confusion matrices on both sets tests for any bias introduced by the image generation. In particular, we evaluate the conditioning on the class labels. In the top panel of Figure 5, we show the confusion matrix of the classifier on the real test set. Comparing this to the confusion matrix obtained by the same classifier on a set of generated images in the bottom panel of Figure 5, we find that the class conditioning of the generated images works quite well overall. However, confusion of images of the class FRI with the predicted class FRII is enhanced on the generated test set. The classification performance for the Compact class is decreased on the generated test set, where particularly the misidentification of true Compact class images as FRII images is increased. This might be due to the fact that some FRII-like sources resemble a combination of two compact sources. Confusion of true Bent class images predicted to be of the FRI class is slightly reduced. The confusion between FRI and bent-tail sources is expected to be large, as these classes contain sources that have faint, smeared-out radio structures. In contrast, FRII and compact sources typically share sharp margins.

RESULTS OF CLASSIFIER TRAINING USING WGAN-SUPPORTED AUGMENTATION
We assess the new approach of supplementing the training set with generated images by comparing the performance of different classifiers, each trained on setups with an increasing amount of generated data. Our benchmark is the performance of the classifier trained on the original training set. We test the performance of the classifier trained on the original training set plus images simulated by the generator of the wGAN against this benchmark. We start with an FCN (see Table B4). Subsequently, we increase the complexity of the classifier by training a CNN (see Table B3). Finally, we apply our framework to a state-of-the-art classifier, namely the ViT (Dosovitskiy et al. 2020). Inspired by the performance of transformers in natural language processing, like BERT (Devlin et al. 2018) and GPT (Radford et al. 2018, 2019; Brown et al. 2020), vision transformers are frequently used in computer vision tasks, e.g. classification, object detection and segmentation (Khan et al. 2022; Shamshad et al. 2023; Ulhaq et al. 2022). The self-attention mechanism enables learning long-range relationships between items within a sequence. Further, the architecture scales to high-complexity models (Khan et al. 2022). As the transformer assumes less prior knowledge than a CNN-based model, it requires more training data. Thus, transformer models are typically pre-trained on large-scale data sets to learn general representations, and afterwards the learned representations are fine-tuned to the task with limited data (Khan et al. 2022). In our case, we use the default ViT-B_16 vision transformer configuration with pre-trained weights from the ImageNet-21k data set and a reset head layer. The wGAN-generated images of 128 × 128 pixels are zero-padded to 224 × 224 pixels to fit the pre-trained model input size. As an attention-based model, the ViT splits the image into fixed-size patches processed by the transformer encoder. We generate images on the fly, i.e. each time a generated image is loaded it is newly generated. The images are generated such that the resulting data set is balanced. As a loss function, the cross-entropy loss is used, weighted for class imbalance in the real-data-only runs. The training setups are not optimised to reach maximal classification accuracies. The goal of this study is only to compare classical augmentation with wGAN-supported augmentation for each of the classifiers. We do not compare performance between the three classifiers in detail either. For further training details see Table B5.

Figure 4. Closest matching pairs, in terms of pixel-wise difference between generated and real images, for each class (Real vs. Generated for FRI, FRII, Compact and Bent). The set of real images is the full training data set over all cross-validation folds; the generated data set consists of 5k images per class. Images are aligned according to the first principal component.
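The zero-padding of the generated images to the ViT input size can be sketched as follows; `pad_for_vit` is an illustrative helper.

```python
import torch
import torch.nn.functional as F

def pad_for_vit(batch, target=224):
    """Zero-pad a batch of (B, C, 128, 128) wGAN images to the
    224 x 224 input size expected by the pre-trained ViT-B/16."""
    h, w = batch.shape[-2:]
    top = (target - h) // 2
    left = (target - w) // 2
    # F.pad takes (left, right, top, bottom) for the last two dims.
    return F.pad(batch, (left, target - w - left, top, target - h - top))
```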

Evaluation metrics
To compare the overall performance among different training setups and to determine the best training iteration of a classifier training run (see Figure 6), we use the multi-class Brier score (Brier 1950). The Brier score is essentially the mean squared error of the predicted probabilities of a classifier for all classes. This has the advantage that the certainty of the classifier's decision is also considered, which winner-takes-all FOMs such as accuracy do not take into account. For each setup, i.e. for a given ratio between the number of generated and real images, denoted $r = N_{\mathrm{gen}}/N_{\mathrm{real}}$, we have five models due to the 5-fold cross-validation.
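A minimal implementation of the multi-class Brier score as described (mean over samples of the squared probability error, summed over classes; note that normalisation conventions vary in the literature):

```python
import numpy as np

def brier_score(probs, labels, n_classes=4):
    """Multi-class Brier score: mean squared error between predicted
    class probabilities and one-hot encoded true labels (Brier 1950).

    probs  : (N, n_classes) predicted probabilities
    labels : (N,) integer class indices
    """
    onehot = np.eye(n_classes)[labels]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))
```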
The final evaluation is performed on an independent test set that contains real data only. We use the most commonly applied metric in radio astronomy publications: multi-class accuracy. In order to estimate statistical fluctuations, we average the performance metrics over the five best models of each cross-validation fold.
Accuracy. The multi-class accuracy, i.e. the number of correct classifications over the number of all classifications, on the test data set is shown in Figure 7 for the three different classifiers investigated here. The results are shown for different training scenarios, where the number of generated images used to augment the training data set (represented by $r$) is varied. The blue markers (uncertainty bars) represent the mean (standard deviation) of the obtained results over all cross-validation folds for the augmented training data sets, and the horizontal orange line (band) shows the corresponding result for the real-data-only case. Figure 7a presents the results for the FCN, which yields an improvement in accuracy of (17.5 ± 4.7) % over the baseline setup at $r = 2$. All augmented training setups outperform the real-data-only case, which reaches an accuracy of (58.7 ± 1.8) %.
The highest average accuracy for the CNN classifier is reached for $r = 3$, as can be seen in Figure 7b, and is (3.0 ± 1.8) % higher than the real-data-only baseline of (78.9 ± 1.1) %. The highest average accuracy using wGAN-augmented training data for the ViT classifier is reached at $r = 2$, see Figure 7c, and is (0.7 ± 2.0) % lower than the real-data-only baseline of (80.6 ± 0.9) %. For additional per-class performance analyses we refer to Appendix C.

DISCUSSION AND CONCLUSION
The approach used for the study presented here, utilising a wGAN, is novel in the field of radio astronomy. We are able to generate highly realistic images of radio sources of the four different radio galaxy classes. For this assessment, we rely on the good agreement of the image metric distributions, such as the pixel intensity histogram, between real and generated images, as well as the good agreement between the confusion matrices of a CNN classifier trained only on real data, obtained on a real-data-only test set and a generated-data-only test set. The latter, in particular, provides confidence in the class conditioning of the generator. Following a visual inspection, we note that the generated images tend to have sharper edges, i.e. low-intensity pixels directly next to high-intensity pixels. This is not the case for real images, which are smeared due to detector resolution effects. Resolving these issues would yield even more realistic generated images.
However, we do not observe issues known from other state-of-the-art generative networks in radio astronomy. VAE-based models suffer from different noise levels between generated and training data, or from pseudo-textures and pseudo-structures (see e.g. Bastien et al. (2021)). The results of this study therefore constitute a major improvement in generated image quality.
This high quality of the images allows us to use them to improve the training of an external classifier, called wGAN-supported augmentation here. This represents an extension to realistic data of the studies done in Butter et al. (2021), which showed that statistical information contained in a simplistic toy training data set can be augmented using generative models. Another extension of this study to more realistic data in the field of particle physics is given in Bieringer et al. (2022).
We find in agreement with these studies that generated images individually contain less information than real data. An additional test presented in Appendix D shows that the performance of a classifier worsens if the amount of real training data is reduced and replaced by generated data. However, the statistical power of the training set can be increased by the inclusion of generated data.
Here, we are able to show that adding generated images to the training data set clearly improves the classifier performance on a real-data-only test set for the FCN classifier, where the largest improvement of (17.5 ± 4.7) % over the baseline setup is reached for $r = 2$, meaning a training data set consisting of all real images plus twice as many generated images. Additionally, similar improvements are seen for all other values of $r$ that have been tested.
For the considerably more complex CNN classifier, the improvement is less consistent, and already the baseline performance is far better than even the enhanced performance of the FCN classifier. However, we do obtain a maximal improvement of (3.0 ± 1.8) % for $r = 3$, which also represents the overall highest accuracy for any of the setups investigated here.
Finally, for the most complex classifier architecture, the ViT, we are not able to show a conclusive improvement of the classifier performance. We might therefore expect that the ability of generated images to add useful information to the training data set depends on the baseline performance (often connected to the complexity) of the classifier in question. A naïve interpretation could be that the better-performing architectures are simply more sensitive to even small differences between the real and generated images. Additionally, the robustness of the ViT might be an issue, because it was pre-trained with natural images and only fine-tuned with radio galaxy images due to the limited data sample size.
Further, we considered a three-class classification problem with extended sources only. We found that the overall accuracy is reduced as compact sources are easier to classify. More importantly, the significance of the improvement by including generated images in the training is not enhanced as the variations in the cross-validation tend to increase as well.
The best overall accuracy is obtained by using the CNN and wGAN augmented training data, but only by a small margin. Yet, we have shown that wGAN augmentation works in principle (similar to the goal in Butter et al. (2021), as noted above) and can significantly improve a somewhat simpler algorithm. This can be useful for applications of classification algorithms in resource-constrained environments, i.e. disk-space and inference time restrictions.
Our generative model is able to generate large sets of radio galaxy images of different morphologies very quickly. A batch of 100 images can be generated on an NVIDIA V100 GPU in ∼0.1 seconds and in ∼4.5 seconds on a CPU. Therefore, our wGAN can play an important role in the simulation and analysis of large radio surveys. Future work involving much larger training sets from the LOFAR telescope will explore this further. Moreover, wGAN-generated images can be used to validate new interferometric machine-learning algorithms, see e.g. Schmidt et al. (2022). To this end, we provide the model and weights with documentation at https://github.com/floriangriese/wGAN-supported-augmentation.

APPENDIX A: GENERATED AND REAL IMAGES
Here we show a random sample of 24 real images in Figure A1 and 24 generated images in Figure A2 in order to give a visual impression of the diversity of the data.

APPENDIX B: CLASSIFIER ARCHITECTURES
Detailed information about the architecture of the implemented models is given in this appendix. The structure and the corresponding number of parameters for the critic of the wGAN are given in Table B1, and for the generator in Table B2. Detailed information about the CNN is given in Table B3 and about the FCN in Table B4. In Table B5 we summarise the hyperparameters of all model trainings we conducted for this study.

Figure C1. Precision for each class on the test data set for the three different classifier architectures for different training scenarios. The markers (uncertainty bars) represent the mean (standard deviation) of the obtained results over all cross-validation folds for the classically + wGAN augmented training data sets.

APPENDIX C: CLASS-WISE PERFORMANCE
In this section we present the per-class performance of the classifiers studied in Section 5. In particular, we show the class-wise precision in Figure C1, recall in Figure C2 and F1 score in Figure C3.

APPENDIX D: ADDITIONAL TEST
Here we present an additional test to compare the information content of real and generated images during classifier training. The classifier architecture for this test is the CNN introduced in Table B3. We train the CNN on different compositions of the original training set and a batch of generated images of the same size. We observe that the classifier performance worsens gradually as we remove real images and add generated images to keep the size of the training set fixed (see Figure D1). From this experiment we can confidently conclude that the generated images are less informative than real images. Note that we had to exclude some runs with a low amount of real data due to the inability to classify the compact sources correctly along with the extended sources.

Figure D1. Accuracy on the test set achieved by the best CNN model trained on combined, i.e. generated + real, data sets with varying fractions of real images. The overall number of images in the training set corresponds to the full real-only training set and the class imbalance is kept. Classical augmentation is used on both types of images during training.

This paper has been typeset from a TeX/LaTeX file prepared by the author.