Pulsar Candidate Identification Using Semi-Supervised Generative Adversarial Networks

Machine learning methods are increasingly helping astronomers identify new radio pulsars. However, they require large amounts of labelled data, which are time-consuming to produce and prone to bias. Here we describe a Semi-Supervised Generative Adversarial Network (SGAN) that achieves better classification performance than standard supervised algorithms by exploiting majority-unlabelled datasets. Trained on only 100 labelled candidates and 5000 unlabelled candidates, it achieved an accuracy and mean F-score of 94.9%, compared to 81.1% and 82.7% respectively for our standard supervised baseline. Our final model, trained on a much larger labelled dataset, achieved an accuracy and mean F-score of 99.2% and a recall rate of 99.7%. This technique allows for high-quality classification during the early stages of pulsar surveys on new instruments, when limited labelled data are available. We open-source our work along with a new pulsar-candidate dataset produced from the High Time Resolution Universe - South Low Latitude Survey. This dataset has the largest number of pulsar detections of any public dataset, and we hope it will be a valuable tool for benchmarking future machine learning models.


INTRODUCTION
Discovering a new pulsar can often lead to new and exciting science. The discovery of PSR B1257+12 (Wolszczan & Frail 1992) led to the identification of the first extrasolar planets. The first binary pulsar, PSR B1913+16 (Hulse & Taylor 1975), and the subsequent measurement of its orbital period decay provided the first indirect evidence of gravitational waves. The discovery of the first pulsar triple system (a pulsar orbiting two white dwarfs) led to one of the most stringent tests of the Strong Equivalence Principle (SEP), a prediction of general relativity (Voisin et al. 2020). More recently, PSR J1141-6545 was used to infer Lense-Thirring precession (relativistic frame-dragging), another prediction of general relativity (Venkatraman Krishnan et al. 2020). These examples are only some of the highlights that display the value of pulsar discoveries. Therefore, in order to keep pushing the boundaries of fundamental physics, it is important that we continue to investigate new techniques to enhance the discovery process.
★ E-mail: vishnu@mpifr-bonn.mpg.de

Identifying radio pulsars involves finding usually broadband periodic signals in noise-dominated data. As pulsar signals pass through the interstellar medium (ISM) before arriving at radio telescopes, their radio emission is "dispersed" by the free electron content in the ISM. The amount of dispersion is quantified by the dispersion measure (DM), the integrated column density of free electrons between the pulsar and the observer. This creates a frequency-dependent delay such that lower-frequency signals arrive later than higher-frequency signals. Since the true DM of a pulsar is a priori unknown, we typically de-disperse the data at multiple trial values. Once the data are dedispersed, periodic signals are identified in the timeseries by calculating a Fast Fourier Transform (FFT) (Cooley & Tukey 1965) or by using the Fast Folding Algorithm (FFA) (Staelin 1969). The top signals are then folded at their respective spin period and dispersion measure to form pulsar candidates. Pulsar candidates are four-dimensional data-cubes consisting of time, frequency, rotational phase and power of a signal. These are the end products produced by most FFT-based pulsar-search pipelines. The final step is usually performed manually: pulsar candidates are visualized as a series of diagnostic plots (see Section 2.6 for more details), which pulsar astronomers use to identify whether a signal is from a genuine pulsar or not.
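The frequency-dependent delay described above follows the standard cold-plasma dispersion law; a minimal sketch (the dispersion constant and the example band below are standard values and illustrative numbers, not taken from this paper):

```python
# Cold-plasma dispersion delay between two frequencies.
K_DM = 4.148808e3  # dispersion constant, MHz^2 pc^-1 cm^3 s

def dispersion_delay(dm, f_lo_mhz, f_hi_mhz):
    """Extra arrival delay (s) at f_lo relative to f_hi for a given DM (pc cm^-3)."""
    return K_DM * dm * (f_lo_mhz**-2 - f_hi_mhz**-2)

# A DM of 100 pc cm^-3 across an illustrative 1182-1582 MHz band (400 MHz bandwidth)
delay = dispersion_delay(100.0, 1182.0, 1582.0)
print(f"{delay * 1e3:.1f} ms")
```

The delay grows linearly with DM and quadratically with decreasing frequency, which is why a wrong trial DM smears the pulse across phase.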
Modern pulsar surveys, like the High Time Resolution Universe South Low Latitude survey (HTRU-S Lowlat; Keith et al. 2010), typically produce around 40 million pulsar candidates in one processing run. The increasing number of pulsar candidates can be attributed to multiple factors. Modern surveys tend to have higher time and frequency resolution. In addition, HTRU-S Lowlat also has a relatively long 72-minute integration time, which leads to larger FFTs and the requirement of additional acceleration/template-bank trials in order to be sensitive to binary pulsars. We refer the readers to Lyon et al. (2016) for a more in-depth review of this topic. Out of these 40 million candidates, only a few hundred thousand are expected to be real pulsar detections (multiple detections of known pulsars plus new discoveries). Even after refining candidate lists to eliminate multiple occurrences of the same pulsar across many DM, acceleration/template-bank and harmonic trials, many bright pulsars appear in the sidelobes of many pointings; in a survey like HTRU-S Lowlat this results in a few hundred thousand detections of the known pulsars in the survey. Assuming an extremely optimistic average inspection time of one second per candidate, working 12 hours a day, it would take a human 2.5 years to go through the entire dataset. Future pulsar surveys using the Square Kilometre Array (SKA) telescope (https://www.skatelescope.org/) are expected to increase this number further. Therefore, automated selection techniques that are optimised for both speed and accuracy are of high importance for current and future pulsar surveys. Several papers have addressed this topic. Eatough et al. (2010) used twelve hand-crafted numerical features/scores to describe each pulsar candidate; these features were then fed to a multi-layer perceptron to identify pulsars in the Parkes multi-beam pulsar survey (PMPS; Manchester et al. 2001). Lee et al.
(2013) introduced a candidate ranking scheme based on six quality factors that were selected based on domain knowledge. Zhu et al. (2014) developed the Pulsar Image Classification System (PICS), an ensemble machine learning model based on Convolutional Neural Networks (CNN), Support Vector Machines (SVM) and Artificial Neural Networks (ANN). This technique was trained on candidates from the Pulsar Arecibo L-band Feed Array (PALFA) survey (Cordes et al. 2006), and was successfully applied to identifying pulsars in the Green Bank North Celestial Cap (GBNCC) survey (Stovall et al. 2014). More recently, Guo et al. (2019) used a combination of a deep convolutional generative adversarial network and support vector machines (DCGAN + L2SVM) to achieve excellent results for candidates in the HTRU Medlat and PMPS surveys. However, all these techniques require a large number of labelled pulsar candidates in order to perform well. In practice, since the number of pulsar detections is only a small fraction of the total candidates (< 1 per cent), previous works either under-sample the number of non-pulsars in their training data or over-sample the pulsar detections. In this paper, we present results from training a machine learning algorithm to address the practical scenario where we typically have a small amount of labelled data along with a large amount of unlabelled data. This is called semi-supervised learning. Past applications of a similar approach in astronomy include semi-supervised learning on data from the Very Long Baseline Array (VLBA) Fast Radio Transients Experiment (V-FASTR) for radio pulsar candidate classification (Jones et al. 2012; Bue et al. 2014), a semi-supervised distributed algorithm called Co-Training, Distributed, Random Incremental Forest (CoDRIFt) for single-pulse pulsar candidate classification (Devine 2020), and a study using a semi-supervised deep convolutional neural network to classify radio galaxy images (Ma et al. 2019). We also compare our results to the purely supervised approach used by previous works.

Machine Learning
Machine learning is a branch of computer science that deals with solving problems by learning through experience. In the classical setup, a human defines all the steps necessary for a computer to solve the problem. However, for complex tasks when it is not trivial to come up with a model to map the input data to our desired output, it is often desirable to learn from the data itself. This process of learning through experience is usually called "training" an algorithm.
There are broadly three classes of machine learning that are relevant for the work in this paper.
(i) Supervised Learning: In supervised learning, we have data and its corresponding label, which in our case is a binary label between pulsar and non-pulsar signals. To the best of our knowledge, all the currently published papers in pulsar candidate classification fall under this category.
(ii) Semi-Supervised Learning: Semi-supervised learning is a branch of machine learning that combines a small amount of labelled data along with a large number of unlabelled data in order to obtain better learning performance. This is the problem we are trying to tackle in this paper.
(iii) Unsupervised Learning: In unsupervised learning, no labels are provided during training. It is up to the algorithm to find useful structure in the input data.

Artificial Neural Network (ANN)
ANNs are a class of supervised machine-learning algorithms that are commonly used for classification tasks. Variants of this network have been used previously in solving the pulsar candidate classification problem in Eatough et al. (2010); Bates et al. (2012); Zhu et al. (2014); Bethapudi & Desai (2018). This algorithm has also been used in this paper for comparing our proposed architecture to the standard supervised learning case. We briefly summarise the different components of an ANN and its operation. For a more thorough explanation, refer to chapter five of Bishop (2006).
The simplest unit of an ANN is a neuron. These neurons are loosely inspired by biological neurons in the sense that there are input(s) to the neuron, an activation function and an output. Neurons are usually grouped together in layers. The first layer (often called the input layer) of the ANN is usually attached to the image or data we want to predict on, and the last layer (often called the output layer) is attached to the label we want to predict. Figure 1 is an example of a single-layer neural network, with one input layer, one hidden layer and one output layer. Each neuron of a layer is connected to all the neurons of the next layer. A neural network with several hidden layers is usually referred to as a deep neural network or multi-layer perceptron (MLP). Assume we have an input vector $\vec{x}$ of $n$ elements $\{x_1, x_2, ..., x_n\}$ which is passed on to a neuron in the next layer. This neuron calculates a weighted sum of all the values of $\vec{x}$ and applies a non-linear activation function to it. Mathematically, the output $\hat{y}$ can be written as:

$\hat{y} = \phi\left(\sum_{i=1}^{n} w_i x_i + w_0\right)$

where $w_i$ is the weight of the i-th input, which decides its relative importance, $w_0$ is the bias term, a trainable constant value for each layer, and $\phi$ is the activation function used. The purpose of an activation function is to decide if the neuron should be activated or not. It helps to normalise the weighted-sum values and additionally introduces non-linearities into the network. The activation functions used in this paper include the sigmoid function $\sigma(x) = 1/(1+e^{-x})$, $\tanh(x) = (e^{2x}-1)/(e^{2x}+1)$ and the Rectified Linear Unit 'ReLU', $\phi(x) = \max(0, x)$.
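The weighted-sum-plus-activation operation of a single neuron can be sketched as follows (the input values and weights are arbitrary illustrations):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def neuron(x, w, w0, activation=sigmoid):
    """y_hat = activation(sum_i w_i * x_i + w0)."""
    return activation(np.dot(w, x) + w0)

x = np.array([0.5, -1.0, 2.0])   # inputs from the previous layer
w = np.array([0.1, 0.4, 0.2])    # learned weights
print(neuron(x, w, w0=-0.3))                        # sigmoid keeps the output in (0, 1)
print(neuron(x, w, w0=-0.3, activation=np.tanh))    # tanh maps to (-1, 1)
print(neuron(x, w, w0=-0.3, activation=relu))       # ReLU clips negatives to 0
```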
The process by which a neural network "learns" is by minimising a loss function. A loss function measures the difference between the predicted output of a neural network and the ground-truth labels. Since we are dealing with a binary classification problem, the output of our neural network is a probability value between 0 and 1 for a candidate to be a pulsar; a value closer to either extremum indicates high confidence in the prediction. Our goal is to minimise the cross-entropy loss between our predicted labels and the true labels. We use the standard softmax function for converting output values into probabilities:

$\mathrm{softmax}(\vec{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$

where $\vec{z}$ is the input vector passed to the softmax function and $K$ is the number of classes in the classifier. In practice, these loss functions are minimised in an iterative fashion by calculating their negative gradients and propagating them backwards through the network using a process called back-propagation (Rumelhart et al. 1986).

Figure 2. An example CNN architecture. The input image is a grey-scale image of size 48x48 and the output contains two nodes for the binary class labels. 8 and 16 filters of size 4x4 were used for the convolutional layers; these act as feature extractors that identify important patterns in the input image. This diagram was created using an open-source tool called NN-SVG (LeNail 2019).
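The softmax and cross-entropy computations described above can be sketched as follows (the logit values are arbitrary):

```python
import numpy as np

def softmax(z):
    """softmax(z)_j = exp(z_j) / sum_k exp(z_k), with K = len(z) classes."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(p, y):
    """-log p[y]: the loss minimised when y is the index of the true class."""
    return -np.log(p[y])

logits = np.array([2.0, -1.0])  # raw network scores for {pulsar, non-pulsar}
p = softmax(logits)
print(p, cross_entropy(p, 0))
```

A confident, correct prediction gives a small loss; predicting the wrong class gives a much larger one.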

Convolutional Neural Network (CNN)
CNNs are a class of machine learning algorithms that are commonly applied in the field of image classification. One of the earliest applications of CNNs was the LeNet-5 network, which was successfully used to recognise hand-written digits (LeCun et al. 1998). CNNs have also been applied to the pulsar candidate classification problem previously in Zhu et al. (2014); Guo et al. (2019). An example of a CNN is given in Figure 2. The major difference from an MLP is that the fully connected layers have been replaced with convolutional layers, whose filters act as feature extractors that identify important parts of the input image. A convolutional layer is typically followed by a max-pooling operation, in which the maximum value of each cluster of preceding neurons is stored in a single neuron of the current layer. For example, if we have a 48x48x1 tensor, after a 2D max-pooling operation of size 2x2 the tensor's size changes to 24x24x1. This constrains the dimensionality of the network while propagating only the most important information to the next layer. This is usually followed by an activation function, typically ReLU. Many such convolution, max-pooling and activation layers can be concatenated to form a deep CNN. This is usually followed by a fully connected layer, also known as a dense layer, which is then finally connected to the output layer. The output layer for a classification problem has as many nodes as there are class labels in the data. Additionally, for deep CNNs a dropout layer is typically added after the max-pooling operation, which randomly drops a certain percentage of the preceding nodes. This regularization technique helps the network generalize across the entire dataset and avoid overfitting. CNNs are also trained using back-propagation.
However, since in practice it is computationally difficult to calculate the gradient of the loss function over all images in the training data, we typically divide the data into mini-batches and use the stochastic gradient descent algorithm.
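The 2x2 max-pooling operation described above (48x48 down to 24x24) can be sketched in a few lines (assuming an input whose side lengths are divisible by two):

```python
import numpy as np

def max_pool_2x2(img):
    """2x2 max pooling: keep the maximum of each non-overlapping 2x2 block."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

img = np.arange(48 * 48, dtype=float).reshape(48, 48)
pooled = max_pool_2x2(img)
print(img.shape, "->", pooled.shape)  # (48, 48) -> (24, 24)
```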

Generative adversarial network (GAN)
GANs are a class of machine learning algorithms (Goodfellow et al. 2014) in which two neural networks are trained simultaneously with opposing goals, acting against each other as adversaries in a min-max two-player game. A generative model G is tasked with generating new data that captures the distribution of the input data. A discriminator model D is tasked with classifying samples as either REAL (belonging to the original data distribution) or FAKE (generated by G). The value function $V(D, G)$ for this game is

$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$

The first term is the expectation of the logarithm of D's predictions when the input is drawn from the real data; D's goal is to maximise this term. The second term is the expectation of the logarithm of one minus D's prediction when data generated by G is passed to D. The goal of D is to maximise this second term, while the goal of G is to minimise it. This sets up the adversarial framework. In the ideal case, the generator perfectly samples the input data distribution and the discriminator output equals 1/2 everywhere.
In practice, however, we use minibatch stochastic gradient descent and train the generator and the discriminator alternately. The algorithm and the proof of its convergence can be found in Goodfellow et al. (2014); for the benefit of the reader we briefly summarise the training of this network in Algorithm 1.

Algorithm 1: Minibatch GAN training
for each training iteration do
    Sample a mini-batch of m noise samples {z^(1), ..., z^(m)} from the noise prior p_z(z).
    Sample a mini-batch of m examples {x^(1), ..., x^(m)} from the data distribution p_data(x).
    Update the discriminator D by ascending its stochastic gradient.
    Fix the weights of D.
    Sample a mini-batch of m noise samples {z^(1), ..., z^(m)} from p_z(z).
    Update the generator G by descending its stochastic gradient.
end
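As an illustration of this alternating training scheme, the following toy sketch trains a generator against a discriminator on 1-D Gaussian data. All model choices here (an affine generator, a logistic-regression discriminator, the learning rate and step count) are illustrative assumptions for the sketch, not the architecture used in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Real data: samples from N(3, 1). Generator: g(z) = a*z + b with z ~ N(0, 1).
# Discriminator: logistic regression D(x) = sigmoid(w*x + c).
a, b = 1.0, 0.0      # generator parameters
w, c = 0.1, 0.0      # discriminator parameters
lr, m = 0.05, 64     # learning rate and mini-batch size

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

for _ in range(2000):
    # Discriminator step: ascend E[log D(x)] + E[log(1 - D(g(z)))]
    x = rng.normal(3.0, 1.0, m)
    gz = a * rng.normal(0.0, 1.0, m) + b
    dx, dg = sigmoid(w * x + c), sigmoid(w * gz + c)
    w += lr * (np.mean((1.0 - dx) * x) - np.mean(dg * gz))
    c += lr * (np.mean(1.0 - dx) - np.mean(dg))
    # Generator step: descend E[log(1 - D(g(z)))] with D's weights held fixed
    z = rng.normal(0.0, 1.0, m)
    dg = sigmoid(w * (a * z + b) + c)
    a -= lr * np.mean(-dg * w * z)
    b -= lr * np.mean(-dg * w)

print(f"generated mean ~ {b:.2f} (real mean is 3)")
```

The generator's offset drifts towards the real mean until the discriminator can no longer separate the two distributions, at which point the gradients vanish.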
GANs have been used successfully in a wide range of computer-vision tasks: generating photo-realistic images (Karras et al. 2018), converting text to images (Reed et al. 2016) and acting as feature extractors for unsupervised learning (Radford et al. 2015). In astronomy, GANs have been used to recover features from galaxy images beyond the deconvolution limit (Schawinski et al. 2017), create high-fidelity weak-lensing convergence maps (Mustafa et al. 2019), model exoplanet atmospheres (Zingales & Waldmann 2018) and, more recently, identify pulsar candidates (Guo et al. 2019). The standard GAN framework described here usually falls under the category of unsupervised learning, as no class labels are provided while training the network.

Semi-Supervised Generative Adversarial Network (SGAN)
SGANs are a variant of GAN in which we can leverage both the generator's ability to create realistic samples and the readily available unlabelled pulsar candidates to solve the semi-supervised classification problem. In a standard GAN, the output of D is the probability that the input image belongs to the training set. We modify this architecture slightly by adding the samples from G to our training set, labelling them as a new 'generated' class, K+1, where K is the total number of classes in our original classification problem. We then change D's output from a binary classification to a multi-class classification over {Pulsar, Non-Pulsar, Fake Data}. The main advantage of this technique is that we can now learn from our pulsar survey's unlabelled data.
There are three major components of this network, a supervised discriminator, an unsupervised discriminator and an unsupervised generator. The setup for the unsupervised discriminator and generator are similar to the standard GAN architecture discussed in Section 2.4. The supervised discriminator is provided with class labels (Pulsar or Non-Pulsar) that are available from our training set. The remaining unlabelled pulsar candidates were provided to the unsupervised discriminator with a positive label ('1') and generated fake candidates from G were provided with a negative label ('0'). For every training epoch, 50 per cent of the samples were taken from the generator and 50 per cent of the samples were taken from a combination of both labelled and unlabelled candidates. A schematic of this architecture can be found in Figure 3.
Mathematically, the loss function for SGANs can be written as

$L = L_{\mathrm{supervised}} + L_{\mathrm{unsupervised}}$

$L_{\mathrm{supervised}} = -\mathbb{E}_{x,y \sim p_{\mathrm{data}}(x,y)}\, \log p_{\mathrm{model}}(y \mid x, y < K+1)$

$L_{\mathrm{unsupervised}} = -\left\{\mathbb{E}_{x \sim p_{\mathrm{data}}(x)} \log\left[1 - p_{\mathrm{model}}(y = K+1 \mid x)\right] + \mathbb{E}_{x \sim G} \log p_{\mathrm{model}}(y = K+1 \mid x)\right\}$

The first term is the so-called supervised loss, $L_{\mathrm{supervised}}$. It is similar to the standard loss function of any supervised classification model ($p_{\mathrm{model}}$), where x is the data and y is its corresponding label. The unsupervised loss term ($L_{\mathrm{unsupervised}}$) consists of two parts: the first corresponds to the expectation that the model will not assign the new 'FAKE' class (K+1) given that the data is real, and the second is the expectation that the model will correctly identify the 'FAKE' class given that the data comes from the generator. If we substitute $D(x) = 1 - p_{\mathrm{model}}(y = K+1 \mid x)$, the unsupervised loss reduces to the standard GAN discriminator loss. The consequence is that the generator needs to be trained to approximate the input data distribution, which in turn minimises the first term of the unsupervised loss function. The formalism for SGANs described here, and several practical implementation tricks we used, were largely inspired by the work of Salimans et al. (2016).

Figure 3. Schematic of the SGAN architecture used in this paper. The generator is initialised with a noise (a.k.a. latent) vector, which it transforms into a fake generated image. The discriminator is fed images from three sources: i) labelled pulsar candidates, ii) unlabelled pulsar candidates provided with a positive label, and iii) fake generated images from the generator provided with a negative label. The discriminator is tasked with minimising both the supervised and unsupervised loss functions.
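As a numerical illustration of these loss terms, consider a hypothetical three-class discriminator output {pulsar, non-pulsar, FAKE}; the logit values below are arbitrary:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical discriminator outputs over {pulsar, non-pulsar, FAKE = K+1}
p_real = softmax(np.array([3.0, 0.5, -2.0]))  # a labelled real candidate, true class: pulsar
p_fake = softmax(np.array([0.2, 0.1, 2.5]))   # a sample drawn from the generator

# Supervised loss conditions on the sample being real (renormalise over the K real classes)
L_supervised = -np.log(p_real[0] / (p_real[0] + p_real[1]))
L_unsup_real = -np.log(1.0 - p_real[2])   # real data should not be assigned the FAKE class
L_unsup_fake = -np.log(p_fake[2])         # generated data should be assigned the FAKE class
L_total = L_supervised + L_unsup_real + L_unsup_fake
print(L_total)
```

All three terms shrink as the discriminator grows more confident in the correct assignments.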

Data Preprocessing and Features Used
A pulsar candidate is a four-dimensional data cube of frequency, time, power and rotational phase of the signal. Since it is inconvenient to visualize four-dimensional data, the convention is to plot various two-dimensional and one-dimensional projections of this data-cube to decide if a signal is really from a pulsar or not. The four feature plots pulsar astronomers most often use are:

(i) Pulse Profile: This one-dimensional intensity curve is created by integrating over both the time and frequency axes while preserving phase. Most real pulsars tend to have one or more narrow peaks, although there are some known exceptions: some pulsars, especially millisecond pulsars (MSPs), tend to have broader or close-to-sinusoidal profiles.
(ii) Frequency-phase Plot: This two dimensional plot is created by integrating over the time axis only. Real pulsars tend to be broadband, therefore, we expect a persistent bright signal (vertical line) across all sub-bands. However, pulsar scintillation caused by the interstellar medium can sometimes increase or decrease the signal in some frequency channels (e.g. PSR B0355+54 Xu et al. (2018)).
(iii) Time-Phase Plot: This two dimensional plot is created by integrating across the frequency axis only. We expect most pulsars to be persistent across observing time. There are some notable exceptions, for example a nulling pulsar like PSR J1727-2739 (Wen et al. 2016), relativistic binary pulsars which can have quadratic or cubic residuals in the time-phase plot or mildly accelerated pulsars where the acceleration falls between trial values.
(iv) DM-Curve: This is a one-dimensional plot used to find the best-fitting dispersion measure value. To produce it, the candidate data are dedispersed at a few trial DM values around the DM used to fold the candidate. For each trial, the chi-squared of the dedispersed pulse profile against a horizontal-line fit is calculated; a large chi-squared value indicates that the signal deviates from white noise. Since pulsars are non-terrestrial signals, we expect the signal to peak at a non-zero DM value. The sharpness of the DM curve depends on the duty cycle of the pulsar.
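The chi-squared test of a dedispersed profile against a horizontal-line fit can be sketched as follows (a Pearson-style chi-squared and the synthetic profiles are illustrative assumptions, not the exact statistic used by the folding software):

```python
import numpy as np

def dm_curve_value(profile):
    """Chi-squared of a dedispersed profile against a flat (white-noise) fit."""
    expected = profile.mean()  # the horizontal-line fit
    return np.sum((profile - expected) ** 2 / np.abs(expected))

rng = np.random.default_rng(42)
noise = rng.normal(100.0, 1.0, 64)   # pure noise: stays close to the flat fit
pulse = noise.copy()
pulse[30:34] += 50.0                  # a narrow pulse, as seen at the correct trial DM
print(dm_curve_value(noise), "<", dm_curve_value(pulse))
```

A trial DM far from the true value smears the pulse into the noise, so the chi-squared peaks near the correct DM.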
We used the four features mentioned above to train the semi-supervised network. Before the data are passed to the network, it is important to standardise them, so that the algorithm is agnostic to the spin period, dispersion measure, observing frequency and integration time of an observation. We use the publicly available data pre-processing code of Zhu et al. (2014) for our work. An example of the four features for the different types of signal in the training set is shown in Figure 4. In order to have the same number of bins for all candidates, the code down-samples and interpolates the data, using linear interpolation for the 1-D plots and spline interpolation for the 2-D plots. The data are also normalized to have zero median and unit variance. We use 60 bins for the DM-curve and 64 bins for the pulse profile. The time-phase and frequency-phase plots were resampled to a size of 48x48 bins. The bin sizes for the different features were chosen to maintain consistency with Zhu et al. (2014).
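The 1-D part of this preprocessing (linear resampling to a fixed number of bins, then normalisation to zero median and unit variance) can be sketched as:

```python
import numpy as np

def preprocess_1d(profile, n_bins=64):
    """Linearly resample a 1-D feature to n_bins, then normalise it to
    zero median and unit variance, mirroring the preprocessing described."""
    x_old = np.linspace(0.0, 1.0, len(profile))
    x_new = np.linspace(0.0, 1.0, n_bins)
    resampled = np.interp(x_new, x_old, profile)
    return (resampled - np.median(resampled)) / resampled.std()

raw = np.sin(np.linspace(0, 2 * np.pi, 100)) + 5.0  # a toy 100-bin profile
out = preprocess_1d(raw)
print(out.shape, float(np.median(out)), float(out.std()))
```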

Metrics Used
We use a combination of seven metrics to evaluate our different machine learning models. Since we have formulated our network as a binary classification problem, there are four possible scenarios when we compare the predicted labels to the true labels: true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). An example is shown in Table 1; this is usually referred to as a confusion matrix in the literature. Based on this we calculate the following metrics:

(i) Accuracy is the simplest metric, the fraction of all predictions that are correct:

Accuracy = (TP + TN) / (TP + FP + TN + FN)

While this can be a useful metric to evaluate a model, care must be taken to ensure that the training data are balanced: on unbalanced datasets, a high accuracy score alone is not an indication of a useful machine learning model.

(ii) Precision is defined as the ratio of true positives to the total sum of true positives and false positives:

Precision = TP / (TP + FP)

(iii) Recall is defined as the ratio of true positives to the total sum of true positives and false negatives. A high recall rate indicates that our model was successful in recovering most of the real pulsars in the data:

Recall = TP / (TP + FN)

(iv) F-Score is the harmonic mean of precision and recall:

F-Score = 2 × (Precision × Recall) / (Precision + Recall)

(v) The False Positive Rate (FPR) is defined as the ratio of false positives to the total sum of false positives and true negatives. Unlike the other metrics used in this paper, a lower FPR is more desirable:

FPR = FP / (FP + TN)

(vi) Specificity is defined as the ratio of true negatives to the total sum of true negatives and false positives. This is analogous to the recall rate defined above, but for the negative class: a high specificity indicates that our model was successful in identifying most of the non-pulsar signals in the data:

Specificity = TN / (TN + FP)

(vii) G-Mean is defined as the geometric mean of recall and specificity:

G-Mean = √(Recall × Specificity)
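All seven metrics can be computed directly from the four confusion-matrix counts; the counts below are made-up examples:

```python
import math

def metrics(tp, fp, tn, fn):
    """Return the seven evaluation metrics from confusion-matrix counts."""
    accuracy    = (tp + tn) / (tp + fp + tn + fn)
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)
    f_score     = 2 * precision * recall / (precision + recall)
    fpr         = fp / (fp + tn)
    specificity = tn / (tn + fp)
    g_mean      = math.sqrt(recall * specificity)
    return accuracy, precision, recall, f_score, fpr, specificity, g_mean

print(metrics(tp=90, fp=10, tn=85, fn=15))
```

Note that FPR and specificity always sum to one, which is why a low FPR and a high specificity describe the same behaviour.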

DATA USED IN THIS STUDY
We used observations from the High Time Resolution Universe South Low-Latitude Survey (HTRU-S Lowlat) to generate pulsar candidates. HTRU-S Lowlat is the part of the full HTRU survey that focuses on the inner Galactic plane, covering Galactic longitude −80° < l < 30° and Galactic latitude |b| < 3.5°. The observations were recorded with an integration time of 72 minutes and a frequency bandwidth of 400 MHz using the 64-m Parkes Radio Telescope. We refer the reader to Keith et al. (2010) for a full description of the observational setup and system configuration. To date, HTRU-S Lowlat has discovered >100 new pulsars; a full list of the initial discoveries and timing solutions for these pulsars can be found in Ng et al. (2015) and Cameron et al. (2020). The pulsar candidates used for training all our models were generated from the re-processing of the HTRU-S Lowlat survey using the stochastic template-bank algorithm (Harry et al. 2009), and folded using the PRESTO software suite (Ransom 2011). The aim of the re-processing pipeline is to find compact relativistic binary pulsars that may have been missed by the first-pass time-domain segmented acceleration search pipeline (Ng et al. 2015). The total number of pulsar candidates produced by the reprocessing pipeline for the entire survey is around 40 million. From these, we selected 84,691 candidates, labelled by eye, with approximately equal numbers of pulsar and non-pulsar candidates. We carefully chose pulsar candidates of different significance levels in order to create a diverse labelled candidate dataset; our lowest detection significance for a true pulsar candidate is 4.3 sigma. The breakdown of candidates is shown in Table 2. To the best of our knowledge, this labelled dataset has the largest number of pulsar detections of all the publicly available pulsar candidate datasets.
Labelled pulsar candidates are particularly valuable for training machine learning algorithms because they are scarce (< 1 per cent) compared to the total number of candidates produced in a pulsar survey.

RESULTS
We start by splitting our entire labelled dataset into train, validation and test datasets (60% train, 15% validation and 25% test; see Table 2). The test dataset was never seen by the network during training; it is only used at the end as a benchmark to evaluate all the different experimental setups described below. See Figure 6 for the detection significance levels of pulsars in our test set. The validation dataset was used to tune the hyperparameters of the different architectures and to select the best model during training. We train all the models separately on each of the four features described in Section 2.6. Our software was built using Keras (Chollet et al. 2015), a high-level open-source neural network library, with a TensorFlow 2.0 backend (Abadi et al. 2015).
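A minimal sketch of such a 60/15/25 split (a simple random, unstratified split; the exact splitting procedure is not specified in the text):

```python
import numpy as np

def split_dataset(n, seed=0):
    """Shuffle indices and split 60% train / 15% validation / 25% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(0.60 * n)
    n_val = int(0.15 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = split_dataset(84_691)  # the size of the labelled dataset
print(len(train), len(val), len(test))
```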

Supervised Learning Baseline
Our first goal was to build a model based on supervised learning that would act as our comparative baseline. For this, we trained a convolutional neural network (CNN) on the "Time-Phase" and "Freq-Phase" features and a multi-layer perceptron (MLP) on the "DM-Curve" and "Pulse-Profile" features. Each network was trained for a total of 1000 epochs for different amounts of labelled data, saving only the model that produced the highest accuracy on our validation dataset. For each labelled-data budget, we split the training data to have an equal number of pulsar and non-pulsar signals; for example, 100 labelled candidates implies that the training data had 50 pulsar and 50 non-pulsar candidates. Since the results depend on the subset of training samples used, we randomly selected five different combinations of labelled candidates and report the average values. The mean F-score performance of each of the four features is shown in Figure 5. We observe that in the regime of extremely limited labelled data (≤ 500 labelled candidates), the "DM-Curve" acts as the best discriminator between pulsar and non-pulsar signals. However, as the amount of labelled data increases, information about the persistence of the signal in the "Freq-Phase" and "Time-Phase" domains becomes equally important. The individual scores from each feature were combined using a logistic regression model with L2 regularization; this is marked as the combined model. Ideally, the combined model should be the best-performing model, and this holds true in our experiments with more than 500 labelled candidates. However, the combined model performs worse in the low-labelled-data regime (100 labelled candidates) because the models trained on "Pulse-Profile", "Time-Phase" and "Freq-Phase" bring down the net average performance of the network.
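Combining the per-feature scores with an L2-regularised logistic regression can be sketched as follows (synthetic scores and a hand-rolled gradient-descent fit stand in for the actual model and library used in the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic per-feature scores for 400 candidates (4 features each):
# pulsars (label 1) score high on average, non-pulsars (label 0) score low.
y = np.repeat([1, 0], 200)
means = np.where(y[:, None] == 1, 0.8, 0.2)
scores = np.clip(rng.normal(means, 0.15, (400, 4)), 0.0, 1.0)

# L2-regularised logistic regression fitted by plain gradient descent.
w, b, lam, lr = np.zeros(4), 0.0, 0.01, 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(scores @ w + b)))
    w -= lr * (scores.T @ (p - y) / len(y) + lam * w)  # gradient + L2 penalty
    b -= lr * np.mean(p - y)

pred = 1.0 / (1.0 + np.exp(-(scores @ w + b))) > 0.5
print("training accuracy:", np.mean(pred == y))
```

The regulariser keeps any single feature's weight from dominating, which matters most when one feature (here, an analogue of the DM-Curve) is far more informative than the others.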

Model Architecture and Implementation Details
There are three major components to an SGAN network: a supervised discriminator, an unsupervised discriminator and an unsupervised generator. The simplest implementation is to have a single discriminator with multiple output layers, where the first output layer solves the unsupervised task (is the data REAL or FAKE?) and the second solves the supervised task (is the signal from a pulsar or not?). The drawback of this approach is that when we pass candidates from the generator, there is no supervised label associated with them, which creates the need for an extra 'FAKE' class label in the supervised classifier. In this paper, we follow the technique described in Salimans et al. (2016), which removes the need for an extra class label. In this case, we built two separate models for the supervised and unsupervised tasks. Both models share the same feature-extraction layers; however, the supervised model is attached to a softmax activation function, whereas the unsupervised model takes the output of the supervised model prior to the activation function and calculates a normalized sum of the exponentials of these outputs, $D(x) = Z(x)/(Z(x) + 1)$ where $Z(x) = \sum_k \exp[l_k(x)]$. This custom activation function for the unsupervised discriminator D(x) forces the model to give a strong prediction for real samples and lower values for the generated fake samples. Our work is built on top of an open-source implementation of SGAN networks for MNIST digits. We extensively modified the discriminator and generator architectures in order to get better results for our data. The discriminator architecture is similar to the CNN model used for the supervised baseline; however, we obtained better results with larger convolutional kernels of size 7x7 compared to the 3x3 kernels that worked well for the supervised baseline models. The discriminator trained on the "DM-Curve" and "Pulse-Profile" was a 1-D convolutional neural network with a convolutional kernel size of 7.
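The custom activation of Salimans et al. (2016), D(x) = Z(x)/(Z(x)+1) with Z(x) the sum of exponentiated supervised logits, can be sketched as:

```python
import numpy as np

def d_unsup(logits):
    """Salimans et al. (2016) trick: D(x) = Z(x) / (Z(x) + 1),
    where Z(x) = sum_k exp(l_k(x)) over the supervised logits."""
    z = np.sum(np.exp(logits), axis=-1)
    return z / (z + 1.0)

confident_real = np.array([6.0, -2.0])   # one large supervised logit
uncertain      = np.array([-3.0, -3.0])  # weak logits everywhere
print(d_unsup(confident_real), d_unsup(uncertain))
```

A sample that excites any supervised class strongly makes Z(x) large and D(x) approach 1, while a sample the supervised head is unsure about drives D(x) towards 0, which is exactly the behaviour the unsupervised REAL/FAKE task needs.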
Additionally, we used max-pooling to down-sample our images instead of strided convolutions. For the generator, we use transposed convolutions to create the images that are fed into the discriminator, and we apply batch normalization to speed up the generator's training. The generator for the 1-D data was a multi-layer fully connected dense neural network. We use the tanh function as the activation for the output layer of the generator. GANs can easily suffer from overconfidence. Therefore, as a regularization technique, we used soft and noisy labels while training. This means that if a candidate is real, instead of giving the label a value equal to 1, we give a value in the range 0.7-1.2 for the 2-D features and a value between 0.9-1 for the 1-D features. Around 5 per cent of the time, we intentionally flip the labels; we found that this helps to improve the overall performance. Our best performing model uses the Adam optimizer (Kingma & Ba 2014) with a learning rate of α = 0.0002 and β₁ = 0.5.

The boost in performance of the "Pulse-Profile" and "DM-Curve" features for labelled candidates ≥ 10,000 was critical to improving the overall performance of the combined model. Similar improvements were also seen for all metrics defined in Section 2.7. These results can be found in Table A1.
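The soft, noisy labelling scheme above can be sketched in a few lines; the helper name and defaults below are our own, with the 0.7-1.2 range and ~5 per cent flip rate being those quoted for the 2-D features:

```python
import numpy as np

def soft_noisy_labels(n, low=0.7, high=1.2, flip_prob=0.05, rng=None):
    """Draw 'real' labels uniformly in [low, high) instead of a hard 1,
    and flip roughly flip_prob of them to 0 to discourage an
    overconfident discriminator."""
    rng = np.random.default_rng() if rng is None else rng
    labels = rng.uniform(low, high, size=n)
    labels[rng.random(n) < flip_prob] = 0.0   # occasional label flips
    return labels

labels = soft_noisy_labels(1000, rng=np.random.default_rng(42))
```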

Effect of unlabelled data
In order to test whether the SGAN network can learn from unlabelled data, we split the training dataset into smaller groups ranging from 100 to 30,000 candidates, similar to our supervised learning baseline experiments. Similarly, the number of unlabelled candidates used while training was varied from 0 to 20,000. We trained the SGAN network for 400 epochs in each configuration, saving only the model that produced the best results on the validation dataset. Since the results depend on the subset of training samples used, we randomly selected five different combinations of labelled and unlabelled datasets and report the average values. The mean F-score values of the SGAN trained on all four features are shown in Figure 7. We observe that increasing the number of labelled candidates in the training set drastically improves the final performance of the network. In addition, we clearly see that unlabelled candidates also improve the overall performance of the network. This effect is particularly significant in the low labelled data regime. For example, with 100 labelled candidates in the training set, the unlabelled data improved the F-score of the network by at least 6% for all features, including an improvement of 12% for the network trained on the "Freq-Phase" feature. As the number of labelled candidates increases, the semi-supervised classifier still provides better results; however, the performance boost provided by unlabelled candidates is significantly lower. We believe the reason for this is two-fold. First, with larger amounts of labelled data, there is little room for improvement, as the network has already learnt a good solution to the pulsar candidate identification problem. Second, in order to fully utilise the strengths of the semi-supervised algorithm, we need a significantly large fraction of unlabelled candidates compared to labelled candidates.
Our final model, which was trained on a much larger unlabelled candidate database, is described in Section 4.4.

Comparing SGAN with supervised models
In this section, we compare the performance of our two-layer ensemble SGAN network to the ensemble standard supervised baseline algorithm described earlier, as well as to a re-trained version of the Pulsar Image Classification System (PICS) (Zhu et al. 2014). In all cases, the individual scores from each of the four features were combined using a logistic regression model with L2 regularization. These results indicate that the SGAN architecture provides better classification results than both supervised machine learning algorithms for all combinations of labelled data. For labelled candidates below 1000, the re-trained version of PICS has a lower false-positive rate than SGAN; however, this comes at the cost of a significantly lower F-score, making it a less desirable model. The difference in performance between the supervised baseline model and the re-trained version of PICS can be partially attributed to the fact that PICS was not re-trained on the same validation dataset. For labelled candidates below 1000, each of the supervised models ended up optimising for different metrics: our supervised model achieved a better recall rate and F-score at the cost of a worse false-positive rate and specificity score. We refer the readers to Table A1 in the appendix for the scores across all metrics. The SGAN model provides the maximum overall gain when fewer labelled candidates (< 1000) are available.
We trained all three networks with the same labelled candidates for each experiment. In addition, unlabelled candidates were also used to train the SGAN model. The same validation dataset was used to tune hyperparameters for the supervised model and the SGAN model. We did not use a validation set for re-training PICS, because there was no provision to supply a validation dataset in the re-training script provided by Zhu et al. (2014); we presume that PICS was trained by minimising the overall training loss. We find that the ensemble SGAN outperforms the standard supervised baseline algorithm as well as the re-trained version of PICS for all combinations of labelled datasets and on all the metrics discussed in Section 2.7, including higher accuracy, precision and recall rates and a lower false-positive rate. For brevity, we only show the mean F-score and False-Positive Rate (FPR) values in Figure 8. The full table comparing results across all metrics can be found in Table A1 in the appendix.

Best Performing Model
In this section, we describe our best performing model, which was trained using the entire training set plus 265,172 unlabelled candidates. Results from five different training runs of the best performing semi-supervised and supervised models are summarised in Table 3. The confusion matrix of this model's predictions on the test set is shown in Table 4. Our best model achieved an overall F-score of 99.2%, a recall rate of 99.7% and a false positive rate of 1.63%. Our best performing model has been merged into the HTRU-S Lowlat survey post-processing pipeline and has already discovered eighteen new pulsars. These new pulsars had detection significances ranging from 5.8-19 sigma. The SGAN network played a crucial role in discovering the lower-significance pulsars, as they are usually buried among several non-pulsar candidates. A full list of these pulsars with their respective spin periods, DMs and timing solutions will be the subject of a future publication.

Figure 9. Recall rate (left) and False-Positive Rate (FPR) (right) from the best-performing models of the three networks across different candidate detection significance levels (SNR). For the x-axis, we divided the candidates in our test dataset into ten quantile regions based on their detection significance, with a similar number of candidates in each bin. As expected, all three models perform better when the significance of the candidate is higher. The re-trained version of PICS suffers a large performance loss (recall: 0.45) at lower candidate significance levels (0-7.4 sigma). The SGAN network also suffers a small performance loss, but still does better than the other models. The FPR curves for the three networks are more interesting. As expected, we see a large FPR at low detection significance, followed by a drop; however, the FPR rises again at high significance levels. This is mostly caused by bright broadband RFI signals which look like pulsars and are detected with very high significance. The FPR is highest for significance levels between 7.4-12.1 sigma; these are mostly weak pulsar-like signals caused by white noise lining up to look like pulsars. See Table A2 for the performance of the network across other metrics.

Figure 10. The x-axis of these plots was made by dividing the duty cycle and spin period of all pulsars in our test dataset into ten quantile regions such that each bin has a similar number of pulsars. It is easier to spot narrow-duty-cycle pulsars by eye, and we see a similar effect with our trained neural networks; the magnitude of the difference varies across models and is significantly lower for the SGAN model. Our models also tend to do slightly better for slow pulsars. However, both effects could be correlated, as pulsars with slower spin periods also tend to have narrower duty cycles.

Performance across Detection Significance, Duty cycle and Spin Period
In this section, we briefly analyse the performance of the best model from each of the three networks as a function of pulsar parameters such as detection significance, duty cycle and spin period. We start by splitting all the candidates in our test dataset into ten quantile regions based on their detection significance, such that each bin has a similar number of candidates. We then calculate the performance of the best performing model from each of the three networks in each bin. This is shown in Figure 9. As expected, we see an improvement in the recall rate of the networks at higher detection significance. We also explored the performance of the neural networks across the duty cycle of the pulsars in our test dataset. Similar to the previous experiment, we divided the candidates in the test set into ten quantile regions based on their duty cycle, with a similar number of candidates in each bin. For a human, it is easier to spot pulsars with a narrow duty cycle, and we see a similar trend in the performance of the neural networks. However, the performance drop is not as drastic as for the detection significance level. We also repeated this experiment across different spin-period ranges. Both of these are shown in Figure 10. We see that pulsars with a slower spin period appear to be slightly easier for the neural networks to find. However, these two effects are highly correlated, as slow spin-period pulsars also tend to have narrow duty cycles.
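The quantile-binning analysis described above can be reproduced with a few lines of numpy. The sketch below bins synthetic candidates by detection significance and computes per-bin recall; the function name and toy data are our own illustration, not the paper's analysis code:

```python
import numpy as np

def recall_by_quantile_bins(snr, y_true, y_pred, n_bins=10):
    """Split candidates into n_bins quantile regions of detection
    significance (similar counts per bin) and return per-bin recall."""
    edges = np.quantile(snr, np.linspace(0, 1, n_bins + 1))
    bins = np.digitize(snr, edges[1:-1])      # bin index 0 .. n_bins-1
    recalls = []
    for b in range(n_bins):
        mask = (bins == b) & (y_true == 1)    # true pulsars in this bin
        recalls.append(np.mean(y_pred[mask]) if mask.any() else np.nan)
    return edges, np.array(recalls)

# Toy data: detection probability grows with SNR, as in Figure 9.
rng = np.random.default_rng(1)
snr = rng.uniform(3, 30, 2000)
y_true = np.ones(2000, dtype=int)
y_pred = (rng.random(2000) < snr / 30).astype(int)
edges, rec = recall_by_quantile_bins(snr, y_true, y_pred)
```

The same binning applies unchanged when `snr` is replaced by duty cycle or spin period, as in Figure 10.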

Inference Speed
Since our software has been built using Keras and TensorFlow, the final trained models can be used to evaluate pulsar candidates on either a CPU or any Nvidia GPU. The inference rate of our model, benchmarked on a single Nvidia Tesla P100 GPU, is 5.22 ± 0.01 ms per candidate using a batch size of 20,000. This makes our model particularly suitable for deployment in a blind survey like HTRU-S Lowlat, as the entire batch of 40 million candidates can be scored in ≈ 58 hours on a single GPU. Additionally, our architecture can be easily re-trained and re-deployed as more labelled data become available. Our software can be found on GitHub (https://github.com/vishnubk/sgan). We also provide a Dockerfile which can be used to create a Docker image in order to ensure easy reproducibility of our results.
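As a sanity check on the quoted figures, the ≈ 58 hour estimate follows directly from a per-candidate latency of 5.22 ms:

```python
# Back-of-the-envelope check of the survey scoring time quoted in the text.
per_candidate_s = 5.22e-3        # measured inference latency per candidate
n_candidates = 40_000_000        # approximate size of the HTRU-S Lowlat batch
total_hours = per_candidate_s * n_candidates / 3600
# total_hours comes out at 58.0, matching the ~58 hours quoted above
```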

Training Time and suitability for future pulsar surveys
The training time for the SGAN is considerably longer than for a standard supervised deep learning architecture like a CNN or an ANN. For example, our supervised baseline pipeline on average took less than an hour to finish training for 1000 epochs on an Nvidia Tesla P100 GPU, whereas our final SGAN model took about four hours to train for 400 epochs. The main reason is that we are now training two neural networks alternately and using a considerably larger sample of data (unlabelled candidates) while training. Keras and TensorFlow currently support training on multiple GPUs, which can help reduce the net training time. While the training time is still acceptable for our needs, this architecture may not be the best approach for online data processing where the model needs to be re-trained in quasi real time; we refer the readers to the work of Lyon et al. (2016), which focuses more on speed than on final classifier performance. The advantage of our proposed architecture is higher performance, because we can learn from unlabelled candidates. This is especially useful for RFI rejection, as such signals can have different signatures depending on the source (aircraft navigation, mobile phones, WiFi, satellites). Additionally, the RFI environment near a telescope is expected to change with the advancement of fifth-generation wireless technology, and therefore a system that can adapt on relatively short timescales with high performance has huge value. This technique also helps minimise the number of labelled candidates, which saves the human hours required to achieve satisfactory performance. The amount of unlabelled data that needs to be used while training depends on the number of labelled candidates available and is usually a trade-off between performance and training time.
In our experiments, we achieved improved performance by having at least 5000 unlabelled candidates when the number of labelled candidates was below 1000, whereas with 50,814 labels we needed more than 200,000 unlabelled candidates to notice an improvement. It is also important to experiment with different amounts of unlabelled data, as sometimes having more can make the model perform worse. We believe that using a static trained supervised model for classifying candidates from future pulsar surveys may not be the most optimal approach. Since labelling millions of candidates is not a scalable solution, we hope more attention goes into solving the pulsar candidate classification problem using a combination of labelled and unlabelled candidates.

Future Work
In this section, we briefly discuss some techniques that can be used to improve on our current models.

Improving the Supervised Baseline Model
The performance of our supervised baseline model in the regime of many labelled candidates can be improved by using a much deeper convolutional neural network. Large networks pre-trained on ImageNet, such as VGG16 (Simonyan & Zisserman 2014), InceptionV3 (Szegedy et al. 2015) and ResNet50 (He et al. 2015), can be used, with their final few layers re-trained on a pulsar candidate dataset. This technique is called transfer learning, and it has been successfully employed in various computer vision tasks, including classifying Fast Radio Bursts (FRBs) and RFI (Agarwal et al. 2019). In order to have a fair comparison between such networks and the SGAN, we propose using a similarly deep architecture for the discriminator of the SGAN and comparing their performance.

Improving the SGAN Model
We believe that the performance of our SGAN model can be improved further by using a technique called feature matching. Here, the loss function of the generator is changed so that its goal is no longer to beat the discriminator, but to minimize the difference between the statistics of real and generated images, as measured by the activations of an intermediate layer of the discriminator. We refer the readers to Salimans et al. (2016) for a more detailed explanation of this technique. Another technique to improve the final semi-supervised classification accuracy is to use a Bad GAN. Instead of training towards a perfect generator, which produces images indistinguishable from real ones, the goal in this architecture is to generate samples that complement the real data distribution. The drawback of this approach is that the quality of the generated images is in general worse, but this architecture has been shown to provide better classification results on the MNIST, SVHN and CIFAR-10 datasets (Dai et al. 2017).
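The core of the feature-matching idea is a one-liner: the generator loss becomes the distance between the batch-mean discriminator features of real and generated samples. A minimal numpy sketch (the function name is ours):

```python
import numpy as np

def feature_matching_loss(real_feats, fake_feats):
    """Feature-matching generator loss (Salimans et al. 2016): squared L2
    distance between the mean intermediate-layer discriminator features
    of a real batch and a generated batch."""
    return float(np.sum((real_feats.mean(axis=0)
                         - fake_feats.mean(axis=0)) ** 2))

# Identical feature statistics give zero loss; a shifted fake batch does not.
real = np.zeros((8, 4))
fake = np.ones((8, 4))
```

In training, `real_feats` and `fake_feats` would be the discriminator's intermediate activations for a real and a generated minibatch, and the generator would be updated to minimise this loss.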

CONCLUSION
In this paper we use an ensemble Semi-Supervised Generative Adversarial Network (SGAN) framework to classify pulsar candidates in the HTRU-S Lowlat Survey. We demonstrate that this algorithm achieves an overall F-score of 99.2% on our dataset and outperforms both the standard supervised baseline algorithm and the re-trained version of PICS. The performance difference between these techniques is most significant in the low labelled-candidate regime: the SGAN achieved a recall rate of 96.0% with 100 labelled candidates, compared to 85.6% for our supervised baseline model and 60.3% for the re-trained version of PICS. The main advantage of our proposed network is its ability to leverage readily available unlabelled candidates to achieve better results. We believe this technique will be even more useful for future pulsar surveys, as the number of pulsar candidates scales up and maintaining a large labelled dataset becomes increasingly challenging. Our architectures are frequency- and telescope-agnostic, and can therefore in principle be applied to other ongoing pulsar surveys. We additionally share our code and a Dockerfile to enable reproducibility of our work.

DATA AVAILABILITY
The data underlying this article will be shared on reasonable request to the corresponding author.