Machine-z: Rapid Machine Learned Redshift Indicator for Swift Gamma-ray Bursts

Studies of high-redshift gamma-ray bursts (GRBs) provide important information about the early Universe such as the rates of stellar collapsars and mergers, the metallicity content, constraints on the re-ionization period, and probes of the Hubble expansion. Rapid selection of high-z candidates from GRB samples reported in real time by dedicated space missions such as Swift is the key to identifying the most distant bursts before the optical afterglow becomes too dim to warrant a good spectrum. Here we introduce"machine-z", a redshift prediction algorithm and a"high-z"classifier for Swift GRBs based on machine learning. Our method relies exclusively on canonical data commonly available within the first few hours after the GRB trigger. Using a sample of 284 bursts with measured redshifts, we trained a randomized ensemble of decision trees (random forest) to perform both regression and classification. Cross-validated performance studies show that the correlation coefficient between machine-z predictions and the true redshift is nearly 0.6. At the same time our high-z classifier can achieve 80% recall of true high-redshift bursts, while incurring a false positive rate of 20%. With 40% false positive rate the classifier can achieve ~100% recall. The most reliable selection of high-redshift GRBs is obtained by combining predictions from both the high-z classifier and the machine-z regressor.


INTRODUCTION
Gamma-ray bursts (GRBs) are often characterized as the most energetic electromagnetic explosions since the beginning of the Universe. Their optical afterglows are in principle detectable out to redshift z > 10 ( Lamb and Reichart 2000;Mesler et al. 2014). Therefore, studies of distant GRBs can probe the physics of the early Universe including the reionization, the evolution of star formation, and the process of metal enrichment (Lamb and Reichart 2000;. High-redshift GRBs can be used to pinpoint and characterize the faint galaxies that supplied most of the re-ionization photons and to constrain the reionization redshift (Ioka 2003;Wang et al. 2012). Multi-wavelength studies of distant GRB afterglows can provide new information about the metal and dust content of these objects (Frail et al. 2006;Cusumano et al. 2006;Mesler et al. 2014). This in turn offers a unique method to understand the metal enrichment history of sources during the re-ionization epoch. While most studies of the re-E-mail: tilan@lanl.gov; tilan.ukwatta@gmail.com ionization epoch are based on observations of quasars, high-z GRBs have a number of significant advantages over quasars due to their unique characteristics. GRB afterglows are exceptionally bright and provide plenty of photons for sensitive spectroscopy. They have simple, easy to model power-law spectra dominated by the continuum emission that are well suited for detecting absorption signatures of the intergalactic medium. Additionally, the neighborhoods of GRB progenitors are relatively "clean" compared to quasars, which are often contaminated by continuous ejection of material from the central engine.
The Swif t Gamma-Ray Burst Mission (Gehrels et al. 2004) has proven to be effective in detecting very highredshift GRBs. The most distant spectroscopically confirmed GRB on record is GRB 090423 with z = 8.2 (Tanvir et al. 2009;Salvaterra et al. 2009) 1 . The highest photometrically measured burst is GRB 090429B with a redshift of 9.4 (Cucchiara et al. 2011). There is now a handful of spectroscopically confirmed GRBs with z > 5. The main chal-lenge in this work is to reliably identify high-redshift bursts suitable for detailed spectroscopic follow-up before the optical emission fades away. Decisions to use precious observing time on large telescopes must be made within the first hours or even minutes after the burst based on limited information. Previous attempts to screen high-z GRBs using promptly available high-energy data Salvaterra et al. 2007;Koen 2009Koen , 2010Ukwatta et al. 2008Ukwatta et al. , 2009Morgan et al. 2012), while showing some promise, lacked the accuracy necessary to facilitate a reliable follow-up program. As a result, they were never widely adopted by observers.
The main difficulty lies in extracting numerous weak correlations from readily available high-dimensional data and efficiently combining the information they contain. A modern approach based on machine learning is ideal for this purpose. Starting from a catalog of GRBs with known redshifts we can use supervised learning to "train" algorithms that effectively encode the relationship between input data and output labels. Classification algorithms deal with predicting discrete labels (in this case high-z versus low-z), while regression algorithms predict continuous labels (here the redshift value). Both types of models are supported by the random forest algorithm that in recent years has emerged as one of the best performing machine learning tools in observational astrophysics.
In this paper we present a rapid machine learned redshift estimator called machine-z and a high-z classifier for GRBs detected by Swif t. Both machine-z and high-z are developed independently and each tool may be used to reinforce conclusions from the other. In Section 2 we describe the input data and the method. Our high-z classifier is developed in Section 3 and the machine-z indicator is developed in Section 4. In Section 5 we compare our algorithms and results with previous work and evaluate the performance of our new tools using a sample of recently detected bursts that are not included in the training catalog. In Section 6 we summarize the results.

GRB Sample
Our sample consists of 284 Swif t GRBs with spectroscopic redshift measurements 2 . The Swif t mission payload consists of three major instruments: the Burst Alert Telescope (BAT), the X-ray Telescope (XRT) and the UV Optical Telescope (UVOT) (Gehrels et al. 2004). BAT is a soft gamma-ray wide field instrument sensitive to photons in the energy range 15 keV to 350 keV and it is the GRB discovery instrument. Once BAT discovered a GRB and determined its sky position, the Swif t satellite slews to the location of the burst so that the narrow field instruments XRT and UVOT can quickly start observing the afterglow. In order to provide a rapid machine-z redshift and high-z classification, we limited this study to readily available measurements from all three Swif t instruments. These measurements were adopted as numerical features for classification and regression, and are listed in Table 1  In total we considered 25 features. The features generally capture the timing and spectral properties of the prompt and afterglow emission from the burst. Some measurements are encoded as two separate features to ensure that all available information is included. For example, the prompt emission recorded by BAT is fitted using either a power law (PL) model or a cutoff power law (CPL) model, depending on which one provides a better fit. The power-law index is an important feature, but the existence of a high-energy cutoff provides additional information and is included as a separate feature FitType that encodes the best fit model (0 for PL and 1 for CPL). Similarly, UVOT magnitudes are reported as either detections or upper limits. A numerical flag similar to FitType is introduced to include this information for each photometric band of UVOT.
The sample includes GRBs discovered between 2005 and the end of 2014. Fig. 1 shows the redshfit distribution of all bursts in the sample. The lowest redshift value is 0.033 and the highest is 8.26. We adopt z = 4 as the threshold between low-redshift and high-redshift bursts. Out of 284 GRBs in the sample 25 or ∼ 9% are high-redshift according to this definition.

Random Forests
The random forest (RF) algorithm (Breiman 2001) has been shown to provide superior performance on many classification problems (Caruana and Niculescu-Mizil 2006;du Buisson et al. 2015;D'Isanto et al. 2016). Over the past few years the method found several interesting applications in observational astrophysics including selection of explosive transients in imaging data (Wright et al. 2015;Goldstein et al. 2015), classification of X-ray sources (Farrell et al. 2015), and redshift prediction (Carliles et al. 2010;Morgan et al. 2012). RF has the ability to select useful features, relatively immune to data over-fitting, can handle nonlinear relationships, and provide probabilistic outputs (see Morgan et al. (2012) and references therein). Given input training data the algorithm creates a large number of decorrelated binary decision trees. Each tree in the forest is grown by splitting the portion of the training data associated with a particular parent node between two child nodes according to the value of one or more features. Splits are chosen to maximize node purity (classification) or minimize variance (regression). The process starts from the root node that holds the entire data set and continues until the size of the leaf node falls below a specified threshold. Randomness enters in two distinct ways. A new bootstrap sub-sample is drawn from the original training sample to train the next tree. Then a random subset of all available features is selected to optimize each split. The size of the random subset is specified by the user. Prediction for a single tree is accomplished by propagating a previously unseen feature vector starting from the root and until a leaf node is reached. The predicted label is typically a majority class of the leaf node (classification) or a numerical average (regression). The results from all trees in the forest are then combined to compute the posterior probability of possible outcomes and the final prediction. Typically the majority vote is adopted for classification and the mean for regression. All results described in this paper were obtained using the python implementation of RF distributed with the scikit-learn package 3 (Pedregosa et al. 2012).

Missing Features
Missing features are common in real world data and our GRB sample is no exception. A popular approach to handle missing input values is by imputation i.e. assigning values estimated from the distribution of all remaining instances.
The method works well unless missing values carry a special meaning in a given particular problem domain. In our input GRB catalog some features could not be extracted due to low signal to noise ratio of the original data or a non-detection in a particular photometric band. Both occurrences are actually expected to be correlated with redshift (distance) and are therefore examples of informative missing features. Blue filter "dropouts" are especially interesting because they are often indicative of high redshift. In order to preserve all available information about the redshift we assigned all missing features to −1000. The effect on performance is negligible as long as the plug-in value is well outside the normal range for all features. Decision trees are generally very good in utilizing such special values if the missing features are in fact informative or marginalizing them out if they are not.

Data Imbalance
Another common problem when searching for rare objects of interest is highly uneven distribution of training data between classes. The redshift distribution in Fig. 1 represents competition between survey volume growing rapidly with distance and decreasing efficiency of detecting more distant bursts. The input catalog for our study is quite unbalanced with fewer than 10% of GRBs at z > 4. A low fraction of high-redshift bursts in the training sample will typically result in a tendency to classify all bursts as low-redshift as the algorithm attempts to minimize the overall error rate. This imbalance will result in poor performance of the classifier on new data. One possible solution is to assign higher weights to high-z bursts during training. However, the price is often additional complexity and a tendency for overfitting. It is  2, 5, 8, 10, 12, 15, 18, 20 m 2, 5, 8, 10, 12, 15, 18, 20, 22, 25 not clear at this point what is the optimal way to introduce weights in random forest and almost certainly the answer depends on the problem at hand. A preliminary investigation of the redshift bias discussed in Section 4.3 indicates that the underlying cause is imbalanced training data combined with noisy features in a significant fraction of the sample. A detailed treatment of these intricate effects warrants a separate investigation and will be presented elsewhere.

Receiver Operating Characteristic (ROC) Curve
The receiver operating characteristic (ROC) curve is a convenient way to track performance and compare classifiers and/or feature sets. We compute this curve using randomized 10-fold cross-validation (train on 90% of the sample and test on 10%). A single point on the ROC curve corresponds to an average of 100 independent cross-validation runs for a fixed threshold applied to the probability of the burst having a high redshift. The curve is traced by varying the threshold between 0 and 1 (see Fig. 3 for an example).
A perfect classifier has zero false positive rate and 100% efficiency (upper left corner of the diagram). Fast rising ROC curves are generally better than a slow rising ones. The area under the curve can be used as a rough measure of classification performance. An ideal classifier has the area of 1, while completely random selection on average yields half of the total area of the diagram.

Tuning the Classifier
RF classifiers take several parameters that can be tuned to improve performance. Among those the most important are the number of trees in the forest (ntrees), the minimum size of the leaf node (nodesize), and the number of randomly selected features that will be used to optimize node splits (m). In order to approximately optimize the high-z classifier we performed a simple parameter search over a grid given in Table 2

Classification Feature Importance
While RF is relatively immune to correlated and uninformative features, it is still beneficial to investigate the relative importance of input features on performance. We start with a pool of available features that initially contains all features in Table 1. The first most informative feature is selected to maximize classification performance (area under ROC curve) using only one feature at a time from the pool of N available features. The next best feature is selected after looping over N − 1 features remaining in the pool and maximizing classification performance using two features. The process continues until the pool of available features is empty. We use parameter values from Section 3.2, i.e. ntrees = 300 and nodesize = 12, except m which cannot be larger than the number of features selected for a given iteration. The relative importance of all 25 classification features is shown in Figure 2. As more features are included in training, the area under the ROC curve increases rapidly with a maximum value of 0.89 around 8-th feature followed by a gradual decrease. It is interesting to note that the 8 best features selected in this way include information from all three Swif t instruments. The absence of the total burst duration (BAT T90) on this list is somewhat surprising and may be attributed to a large intrinsic spread of burst durations that dominates the influence of time dilation on this time scale.
The ROC curve of the final high-z classifier trained using the 8 best features is shown on Fig. 3. The curve shows a steep rise and reaches 100% recall at about 40% false posive rate. We can reduce the false positive rate by half by changing the probability threshold and accepting 80% recall.

Machine Learned Scoring
The ROC curve is not the only way to measure the performance of selecting high-redshift GRBs. Morgan   consideration. The new event is assigned a rank n, with n−1 previously seen events having a higher probability of being high-z compared to the new event. The learned follow-up rank of the new event is Q = n/(N + 1). This formulation supports "go or no go" decisions for newly detected bursts, given the fraction F of all GRBs that can be followed up with the currently available resources. If Q < F the event should be observed. Otherwise it makes more sense to wait for a better candidate. The algorithm automatically adapts to changes in the redshift distribution of the input GRB sample and the amount of telescope time that can be devoted to follow-up. We tested the above approach by simulating follow-up decisions for bursts on our training sample using a randomized cross-validation procedure described in Section 3. The Q scores were calculated using approximately optimized input parameters and features found in Sections 3.2 and 3.3. The effectiveness of this hypothetical observing campaign as a function of the requested follow-up fraction F is shown in Fig. 4. The top panel (a) shows that the fraction of bursts recommended for follow-up by the algorithm (Q < F ) closely tracks the requested value F . The middle panel (b) shows the purity of the follow-up sample, i.e. the number of actual high-z GRBs divided by the total number of selected high-z candidates. Ideally, the purity would be close to 100% when the follow-up resources are limited (low F ), as shown by the green line. The bottom panel (c) shows the efficiency of selecting high-z GRBs (the fraction of all high-z bursts that were actually observed). Again, perfect classification performance is shown by the green line.
From Fig. 4 it is clear that the classifier can identify high-redshift bursts with a high probability, especially when follow-up resources are very limited (low F ). In other words, the purity is highest when very few bursts can be followed up and therefore reliable predictions matter most. At 1% followup fraction the purity exceeds 80%. On the other hand, an observer with enough telescope time to follow up 40% of all GRBs will be able to find all true high-z bursts (100% efficiency).

MACHINE-Z REDSHIFT ESTIMATOR
The high-z classifier developed in section 3 helps to select the highest priority follow-up targets, but it does not provide an actual value for the predicted redshift. In this section, we adapt the methods from section 3 to solve a regression problem and develop an RF based redshift estimator that we call machine-z.

Tuning the Regressor
Since the performance of a regressor is evaluated differently from a classifier, we performed an independent parameter  search to approximately optimize input parameters of the RF regressor. For this purpose we used the "leave-one-out" cross-validation method that for N bursts consists of N runs with N − 1 instances used for training and one for testing. We leverage the stochastic nature of RF training to increase the signal-to-noise ratio of the final cross-validated performance estimate by repeating the process 10 times with different seeds. The quality of prediction is measured using the Pearson correlation coefficient between machine-z output and true redshift. Table 2 defines the search grid for approximate parameter optimization. In this case we found that a forrest of 100 fully developed trees (with as little as one burst per leaf node) and m = 5 random features per split provides the best results. This is different from parameters adopted in section 3.2. The resulting correlation coefficient is 0.52.

Regression Feature Importance
Feature importance for machine-z is determined using a method similar to Section 3.3 except for the objective function. The area under the ROC curve is now replaced by the redshift correlation coefficient. Until the number of selected input features reaches m = 5 all features are randomized during node splitting. The relative importance of various features is shown in Fig. 5. The correlation coefficient starts from a sub-optimal value for the first feature, then increases, eventually flattens after the 11-th feature, and then slowly decreases beyond 16-th feature. We selected the first 11 features in this plot as input features for the machine-z estimator. Adding features beyond 11 does not improve predictions and increases the risk of overfitting or diluting the signal with noisy features.

Correction for Noise and Imbalance
A comparison between machine-z predictions and true redshift for GRBs in the training set is presented in Fig. 6. While there is a good correlation between the predicted and the actual redshift, the range of the output values is squeezed relative to the input. This appears to be a consequence of the interaction between noisy features and the fact that high-redshift bursts are strongly underrepresented in training data (see Fig. 1). When high-redshift bursts are given higher weights (e.g. by including multiple copies of the same burst in training data), the bias observed in Fig. 6 changes. A thorough investigation of this behavior is beyond the scope of this paper and will be presented elsewhere. For this initial release of the algorithm we introduce a simple linear correction that shifts and stretches the range of the output redshift values while preserving the correlation coefficient. The final corrected redshift predictions are computed using a straight line fit to data in Fig. 6 and taking into account the cross-validation uncertainty in machine-z output: z corrected = (z uncorrected − 1.07 ± 0.05)/(0.35 ± 0.03). The final corrected machine-z predictions as a function of the true redshift are shown in Fig. 7. The range of the output is now similar to that of the input and the correlation coefficient is the same as in Figure 6.
There are several interesting trends to note in Fig. 7. First, the lower right area of the plot is not populated. This means that in most cases machine-z does not fail to recognize a high-redshift burst. Second, the density of bursts peaks roughly along the dashed line, so for a significant fraction of bursts the redshift estimate is close to the true value. This can be seen more clearly in Fig. 8 showing the distribution of the relative differences between machine-z estimates and actual redshifts. Third, the algorithm does occasionally predict a high redshift for a low-z burst as shown by the upper left portion of the plot. Even though following up these false positives will tend to waste some telescope time, machine-z will rarely miss the all important high-z bursts. An inconvenient side effect of our simple correction for the redshift bias is that for a few GRBs the predicted redshift is negative. This is not a problem as long as the tool is used to select high-redshift GRBs, as the negative predictions only occurr at low redshift. classification to screen high redshift GRBs using promptly available Swif t data. We compared our GRB sample and feature set with the Morgan et al. (2012) data available in machine readable form. Small differences in the RF implementation between those two studies have no bearing on this comparison. The ROC curves for the two data sets are given in Fig. 9. The ROC curve corresponding to our data set (red curve) has a slightly larger area than that for the Morgan et al. (2012) data (blue curve). Note that the red curve rises to 100% recall more rapidly than the blue curve. Fig. 10 presents another performance comparison of the two data sets. As shown by the top panel (a), there is no significant difference in the fraction of bursts recommended for follow-up versus the requested fraction. However, there is a significant difference in the purity of the burst sample selected for follow-up shown in the middle panel (b). Our high-z classifier returns samples of very high purity when the fraction of bursts that can be observed is low. In con- trast with that, the Morgan et al. (2012) data set starts near zero purity at low followup fractions and peaks at ≈ 60% around F = 10%. Furthermore, the purity curve delivered by high-z is similar in shape to the ideal purity curve and merges with the ideal curve around the follow-up fraction F ≈ 35% . By comparison, the Morgan et al. (2012) data set shows a qualitatively different shape and an undesirable very low purity at small follow-up fractions. Finally, the bottom panel (c) in Fig. 10 compares efficiency curves of the two data sets. Both curves show the expected gradual rise from zero. The high-z data set presented here reaches the curve corresponding to a perfect classifier around follow-up fraction F ≈ 40%. The Morgan et al. (2012) data set does not reach the ideal curve until ≈ 80%. This difference in behavior may be explained by a different approach to integrating XRT and UVOT measurements with non-detections. While the Morgan et al. (2012) classifier uses 12 features, our high-z algorithm uses 8 features. Some features such as the XRT column density are common to both data sets. However, Morgan et al. (2012) use only one measurement from XRT and only one feature from UVOT. The latter is limited to a yes or no flag indicating the existence of a UVOT detection in any photometric band.

Validation
The training data for our high-z classifier and machine-z estimator is limited to GRBs discovered by Swif t prior to 2015 for which a redshift measurement is available. In 2015 Swif t found 22 additional bursts that have a spectroscopic redshift. We used the 2015 sample as a validation set to investigate the effectiveness of our redshift prediction algorithms. The results for individual bursts in the test sample are shown in Table 3. Two out of 22 bursts (GRB 151112A and GRB 151027B) qualify as high-redshift according to our classification in section 2.1 (z > 4). Both algorithms flag them as having high redshift. The Q scores for these two bursts also indicate that follow-up is recommended if the requested follow-up fraction is at least 20% of all GRBs. In addition to these two clear cut cases, our classifier identified 8 other bursts in the validation sample as high-z. However, out of those 8 bursts only one (GRB 151111A) has a machinez estimate z > 4. There are also two low-z bursts (GRB 151029A and GRB 150301B) with predicted redshifts above 4. These outcomes are consistent with our performance estimates and confirm the usefulness of our tools in prioritizing follow-up observations of candidate high-redshift GRBs. We can expect that the most robust results will be obtained if both high-z classifier and machine-z estimator predict high redshift. In this case we would have selected three candidate high-z bursts shown as gray rows in Table 3, two real ones and a single false positive. Note that the false positive (GRB 151111A) is a burst with an intermediate redshift z = 3.5.

Extensions to Other Data Sources
The present paper addresses redshift prediction for Swif t GRBs. Transfering a trained classifier from one data set to another is very important, but typically challenging. Unfortunately, in most cases the performance is strongly degraded even if differences between data sources are purely incidental (e.g. slightly different energy ranges of flux measurements or different estimators of model parameters). If the new features are qualitatively similar to the old ones, one can shift and rescale the numbers to approximately match the distribution of each new and old feature. This requires only a modest amount of new data and may be a productive approach for future missions similar to Swif t including Space-based multi-band astronomical Variable Objects Monitor (SVOM; Cordier et al. (2015)). In other cases we are forced to build new training sets and that can be time consuming.
Another possible approach would be to consider lower level data such as time-resolved spectra of prompt GRB emission. Generic intermediate level features can be obtained for example from wavelet analysis that captures the intrinsic structure of the data and then apply a high level classfier such as RF (Ukwatta and Wozniak 2016). Those "abstract" features may prove more transferable from one data set to another and may eventually facilitate early redshift prediction for GRBs detected by future missions such as SVOM.

SUMMARY
We presented a method for selecting candidate high redshift gamma-ray bursts that can be used to prioritize follow-up observations. The algorithm utilizes numerical and categorical features from all three instruments onboard the Swif t satellite that are readily available within the first few hours after a GRB discovery. We independently developed high-z classification and machine-z regression tools based on algorithms and features tailored to each task. A subset of features that provides most information to support redshift prediction was identified in both cases. The results were validated using a small sample of recently discovered GRBs that were not included in training data. The most robust selection of high redshift bursts is achieved by combining the classification and regression output to make the final prediction.