Galaxy Zoo: Clump Scout -- Design and first application of a two-dimensional aggregation tool for citizen science

Galaxy Zoo: Clump Scout is a web-based citizen science project designed to identify and spatially locate giant star-forming clumps in galaxies imaged by the Sloan Digital Sky Survey Legacy Survey. We present a statistically driven software framework designed to aggregate two-dimensional annotations of clump locations provided by multiple independent Galaxy Zoo: Clump Scout volunteers and to generate a consensus label that identifies the locations of probable clumps within each galaxy. The statistical model underlying our framework allows us to assign a false-positive probability to each clump we identify, to estimate the skill level of each volunteer who contributes to Galaxy Zoo: Clump Scout, and to quantitatively assess the reliability of the consensus label derived for each subject. We apply our framework to a dataset containing 3,561,454 two-dimensional points, which constitute 1,739,259 annotations of 85,286 distinct subjects provided by 20,999 volunteers. Using this dataset, we identify 128,100 potential clumps distributed among 44,126 galaxies. This dataset can be used to study the prevalence and demographics of giant star-forming clumps in low-redshift galaxies. The code for our aggregation software framework is publicly available at: https://github.com/ou-astrophysics/BoxAggregator


INTRODUCTION
One of the main goals for modern observational cosmology is to discover and understand how galaxies and their constituent substructures have assembled and evolved throughout cosmic history.
During the last two decades, a large body of observational data has been assembled that shows strong evidence for substantial evolution in the dominant mode of star formation in galaxies between z ∼ 3 and z ∼ 0.2 (e.g. Madau & Dickinson 2014; Murata et al. 2014; Guo et al. 2015; Shibuya et al. 2016; Guo et al. 2018).
Early observations using the Hubble Space Telescope (HST) revealed that typical massive galaxies (M⋆ ≳ 10^10 M☉), populating the z ∼ 2 star-forming main sequence (Noeske et al. 2007), exhibit thick, gas-rich, clumpy disks with star formation rates Ṁ⋆ ∼ 100 M☉ yr⁻¹ (e.g. Genzel et al. 2011; Elmegreen et al. 2004b,a). Many of these z ∼ 2 galaxies were found to exhibit discrete, sub-galactic regions of enhanced star formation (hereafter referred to as "clumps") with apparent radii ∼ 1 kpc and stellar masses M⋆ ≳ 10^7 M☉ (Elmegreen 2007). More recent evidence suggests that these clumps may in fact be aggregations of smaller substructures that could not be resolved by HST (e.g. Wuyts et al. 2014; Fisher et al. 2017; Dessauges-Zavadsky & Adamo 2018), but this remains to be confirmed. The prevalence of giant star-forming clumps at high redshift and the overall characteristics of their host galaxies are in stark contrast with the thin, uniform and generally quiescent (Ṁ⋆ ∼ 1 M☉ yr⁻¹) disk morphologies that prevail among star-forming galaxies in the local Universe (e.g. Simard et al. 2011; Willett et al. 2013a).
The mechanisms that drove this evolution of star formation activity, their onset epochs and the timescales over which they operated remain to be fully established. If they can be accurately determined, the abundances of clumps within galaxies at different redshifts, together with their spatial distributions and intrinsic properties, provide obvious diagnostics for the transition from clumpy to more diffuse star formation. Historically, the most extensive surveys of clumpy star formation have relied on HST imaging and focused on intermediate and high redshift galaxies (e.g. Murata et al. 2014; Guo et al. 2015; Guo et al. 2018). A common conclusion of these studies is that the overall fraction of massive (M⋆ ≳ 10^9.5 M☉), clumpy star-forming galaxies decreases rapidly for z ≲ 2 and falls below ∼ 5% by z ∼ 0.2.
The scarcity of clumpy galaxies in the local Universe makes the task of identifying them in large numbers much more challenging, and related studies at low redshift have entailed focused investigations of small samples containing ∼ 50 galaxies or fewer (see, however, Mehta et al. 2021). Identifying enough low-redshift clumpy galaxies to enable accurate inference of their overall population demographics and characteristics requires wide-field imaging surveys that encompass a large fraction of the sky, together with a reliable method for discovering candidate systems. In recent years, extensive ground-based surveys like the Sloan Digital Sky Survey Legacy Survey (SDSS; York et al. 2000) and the Dark Energy Camera Legacy Survey (DECaLS; Dey et al. 2019) have delivered publicly available wide-field imaging data that make systematic searches for large numbers of low-redshift clumpy galaxies possible. Galaxy Zoo: Clump Scout (Adams et al. 2022) is a citizen science project that used SDSS imaging data and was designed to let volunteers from the general public identify clumpy galaxies and the clumps they contain. Multiple volunteers inspect images of galaxies and provide two-dimensional annotations marking the locations of any clumps the galaxies contain.
One of the most challenging aspects of collecting data using a citizen science approach is calibrating the reliability of the responses that volunteers provide. Translating astrophysical analyses into a citizen science context can be difficult because the subject matter and related concepts are often not familiar to non-experts. This unfamiliarity can result in annotations that are noisy, with large variations between the responses of different volunteers. The traditional approach for mitigating such noise is to collect a large number of independent annotations and derive an average result representing the overall consensus between volunteers. This has two obvious disadvantages: firstly, volunteer effort may be wasted if more responses are accumulated than are actually required to mitigate the variation between responses; secondly, even after a large number of responses have been collected, there is no formal guarantee that the consensus is accurate or sufficiently precise.
To address these issues, more quantitative approaches have been developed that attempt to infer statistical estimates for the reliability of consensus derived from citizen science annotations and classifications. For example, Marshall et al. (2016) developed the Space Warps Analysis Pipeline (SWAP), which used a binomial model for a simple true-or-false response to derive a Bayesian estimate for the probability that astrophysical images included signatures of strong gravitational lensing. The SWAP algorithm was also used by Wright et al. (2017) to accelerate consensus for citizen-science classification of potential supernova flashes and assign false-alarm probabilities to candidate events. Later, Beck et al. (2018) showed that applying SWAP to galaxy morphology labels collected via the Galaxy Zoo platform (Lintott et al. 2008; Willett et al. 2013b) increased the rate of classification by 500% and reduced the volunteer effort that was required by a factor of ∼ 6.5, relative to the Galaxy Zoo standard requirement for 40 volunteers to inspect each galaxy.
In this paper we build on the principle of SWAP and develop an aggregation approach to derive quantitative estimates for the reliability of two-dimensional labels of clump locations within galaxies, based on annotations provided by Galaxy Zoo: Clump Scout volunteers. Like SWAP, we rely on a statistical model to derive probabilistic estimates for several quantities that determine the reliability of a label that represents the consensus of multiple independent annotations. Two-dimensional annotations are more complex than the simple binary classification tasks that SWAP was designed to process, and our statistical model is necessarily also more complicated. We base our approach on a method that was initially presented by Branson et al. (2017) (hereafter BVP17), who tested their algorithm on small and relatively noise-free annotation datasets that contained a few thousand annotations and were collected from paid workers on the Amazon Mechanical Turk platform. We have developed a new implementation of this algorithm that is computationally efficient enough to process millions of independent annotations provided for tens of thousands of images by the Galaxy Zoo: Clump Scout volunteers. Our goal is to find out whether this algorithm can be used successfully to derive complicated two-dimensional labels with quantitative reliability estimates in a mass-participation citizen-science context, using noisy annotations provided by a cohort of non-expert volunteers. We also aim to determine whether the reliability estimates we derive can be used to accelerate the labeling process and reduce the amount of volunteer effort that is required to accurately label the clumps in each galaxy.
The remainder of this paper is organised as follows. In section 2 we describe how the imaging data presented to volunteers in Galaxy Zoo: Clump Scout were selected and prepared. In section 3 we outline the annotation workflow that volunteers used to annotate the images and the training they received. In section 4 we provide details of the statistical model that underpins our aggregation algorithm. In section 5, we explain how our algorithm actually computes the labels it derives. In section 6, we present the results of applying our algorithm to the Galaxy Zoo: Clump Scout data and analyse the quantitative reliability metrics that are generated. In section 7 we discuss the implications of these results in the context of the goals outlined above and the suitability of citizen science as a method for complex astrophysical image analysis. Finally, in section 8, we summarise our findings and conclude.

DATA
In this section we briefly describe the galaxy selection criteria and the image preparation pipeline used for Galaxy Zoo: Clump Scout. A much more detailed description is provided by Adams et al. (2022).

Galaxy Image Selection
The galaxy images used in Galaxy Zoo: Clump Scout comprise three subsets of the sample that was visually inspected and morphologically classified by volunteers contributing to the Galaxy Zoo 2 (GZ2) citizen science project (Willett et al. 2013a). The criteria that were used to select these subsets are described in detail in Adams et al. (2022). For convenience, this section summarises the most relevant properties of the galaxies that were inspected by the Galaxy Zoo: Clump Scout volunteers.
A primary sample of 53,613 galaxies with 0.02 ≤ z ≤ 0.25 was selected based on the morphological labels provided by GZ2 volunteers. We anticipated that the presence of obvious star-forming clumps in images of smooth elliptical galaxies was very unlikely, so for this primary sample we limited our selection to galaxies for which more than 50% of volunteers responded negatively to the question "Is the galaxy simply smooth and rounded, with no sign of a disk?".
To estimate the number of clumpy galaxies that were observed by SDSS but excluded from our primary sample, we also include a smaller, secondary sample. This sample contains 4,937 galaxies for which fewer than 50% of GZ2 volunteers identified features or a disk, and was selected within a more restricted redshift range, 0.02 ≤ z ≤ 0.075.
Finally, Galaxy Zoo: Clump Scout volunteers also annotated a sample of 26,736 galaxies matching the selection criteria used for the primary sample, but which had simulated emission from clumps with known photometric and physical properties superimposed (see Adams et al. 2022, for details of the simulation procedure). Annotations of these simulated clumps were used by Adams et al. (2022) to derive an estimate of the Galaxy Zoo: Clump Scout sample completeness for clumps with specified photometric properties.
Stellar mass estimates for galaxies in all three samples were taken from the SDSS DR7 MPA-JHU value-added catalog (Kauffmann et al. 2003; Brinchmann et al. 2004). All three samples include galaxies with stellar masses 10^8.5 M☉ ≲ M⋆ ≲ 10^12 M☉.

Galaxy Image Preparation
For each of our selected galaxies we extract square cutouts from SDSS g, r and i band FITS (Pence et al. 2010) images that are normally 6 times larger than the galaxy's measured 90% r-band Petrosian radius, but have a minimum side length of 40 pixels. Experience from previous iterations of the Galaxy Zoo projects, including GZ2, has shown that sizing cutouts relative to the host galaxy radius in this way provides sufficient angular resolution for volunteers to discern morphological features, while including enough of the surrounding context to help distinguish those features from instrumental noise and background objects. We then resample these single band images onto a common pixel grid with SDSS native resolution (0.396″/pixel) before combining them (without PSF-matching) into a three-channel colour composite. We assign the g, r and i bands to the blue, green and red channels respectively and scale each band independently using the formula presented in Lupton et al. (2004). For an input pixel intensity Ix in band x, the scaled intensity I′x is computed using

I′x = asinh(αQ(βx Ix − m)) / Q.

We specify that {Q, α, m} = {7, 0.2, 0} for all bands and that {βg, βr, βi} = {0.7, 1.17, 1.818}. Finally, we re-scale each colour image so that its height and width are both 400 pixels. Note that this means the angular size of the cutout image pixels varies between 0.1″ pix⁻¹ and ∼ 18″ pix⁻¹ for different subjects, depending on the angular size of the central galaxy.
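The per-band asinh scaling above can be sketched in Python. The exact placement of the per-band factor βx relative to the softening parameters is our reading of the Lupton et al. (2004) prescription and should be treated as illustrative rather than a reproduction of the production pipeline:

```python
import numpy as np

def lupton_scale(I, beta, Q=7.0, alpha=0.2, m=0.0):
    """Apply an asinh stretch of the kind described by Lupton et al. (2004)
    to a single-band image array. The placement of beta and m here is an
    assumption of this sketch."""
    x = beta * I - m
    # asinh is ~linear for faint pixels and compresses bright pixels
    return np.arcsinh(alpha * Q * x) / Q

# Scale the g, r and i bands independently with their per-band weights
betas = {"g": 0.7, "r": 1.17, "i": 1.818}
bands = {name: np.random.default_rng(0).random((400, 400)) for name in betas}
scaled = {name: lupton_scale(img, betas[name]) for name, img in bands.items()}
```

The asinh stretch preserves colour information in bright galaxy cores better than a simple logarithmic scaling, which is why it is the standard choice for SDSS composites.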
The number of SDSS native image pixels spanned by the SDSS imaging PSF FWHM varies between 1.5 and 7.5, with a median value of ∼ 2.8, for 99% of the unscaled cutout images. The remaining 1% of subjects populate a tail out to ∼ 18 pixels. In the final scaled cutouts, the number of pixels spanned by the PSF FWHM varies between ∼ 1 and ∼ 70, with a median value of ∼ 11. Examples of the images generated using this procedure are shown in figures 6, 18, D2 and D3.

COLLECTING ANNOTATIONS
To identify the locations of clumps within their host galaxies, we designed a web-based citizen science project using the Zooniverse project builder interface.

Volunteer Training
For non-expert volunteers, identifying genuine clumps among the potentially complex features of their host galaxies can be daunting. To improve volunteers' confidence and help them to provide accurate annotations, we provided several pedagogical and training resources. Following the approach of other Zooniverse projects, we designed a detailed practical tutorial explaining each step of the annotation workflow. This tutorial was automatically presented to volunteers when they joined the project and remained available for reference thereafter. Additional reference images and explanatory text were provided using the Field Guide feature of the Zooniverse interface. A separate About section of the project provided pedagogical material explaining the scientific motivation of the project. Finally, to guide the progress of first-time volunteers, we provided expert labels for a small subset of our galaxy images. Ten such images were interspersed with decreasing frequency among the first ∼ 20 subjects that each volunteer inspected. We implemented a system to provide real-time feedback for volunteer annotations of expert-labeled galaxy images and inform volunteers if they missed genuine clumps or mistakenly annotated an object that experts had disregarded. This feedback system was designed to refine volunteers' expectations regarding the visual appearance of genuine clumps during the early stages of their engagement with the project.

The annotation workflow
Volunteers following the Galaxy Zoo: Clump Scout workflow inspect a sequence of single galaxy images (hereafter "subjects") that are randomly drawn from a global subject set. The subject selection ensures that no volunteer inspects the same image more than once and that each subject is inspected by a group of approximately 20 volunteers. Each volunteer first annotates the two-dimensional location of the central bulge of the central galaxy in the image, if it is visible, before proceeding to annotate the locations of any clumps they can discern. To mitigate against the possibility that volunteers would disregard genuine clumps with appearances that confound their expectations, we provided an opportunity to mark clumps as "unusual". We investigate the impact of including or discarding this unusual clump subset in section 6.
The full Galaxy Zoo: Clump Scout dataset contains 3,561,454 click locations, which constitute 1,739,259 annotations of 85,286 distinct subjects provided by 20,999 volunteers.

Initial annotation processing
We expect that even the largest individual clumps will be at best marginally resolved for the lowest redshift galaxies in our data sample. This implies that almost all clumps will appear as point sources with a light profile equal to the instrumental point-spread function (PSF). Our data preparation procedure (subsection 2.2) results in subject images that have different pixel sampling of the PSF, depending on the angular size of the central host galaxy. To account for this, we transform the two-dimensional point estimates for clump locations that volunteers provide into square boxes with side length equal to twice the full width at half maximum (FWHM) of the pertinent subject's PSF. Assigning a finite, instrumentally motivated clump extension allows us to identify groups of volunteer clicks with separations that are smaller than the PSF. A prior assumption of our data aggregation approach is that it is impossible for a single volunteer to mark separate clumps within the same subject that are closer than twice the PSF FWHM. It is likely that any such multiplets that volunteers do provide represent noise peaks in contrast-enhanced subject images or are simply accidents. In section 4, we describe how our aggregation algorithm effectively deduplicates multiple nearby annotations by individual volunteers.
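The click-to-box transformation is simple enough to sketch directly; the function and variable names here are ours, not those of the released BoxAggregator code:

```python
def click_to_box(x, y, psf_fwhm):
    """Convert a volunteer's 2-D click into a square box whose side length
    is twice the PSF FWHM of the subject image (same pixel units as the
    click coordinates). Returns (xmin, ymin, xmax, ymax)."""
    half_side = psf_fwhm  # side = 2 * FWHM, so the half-side is one FWHM
    return (x - half_side, y - half_side, x + half_side, y + half_side)

# A click at (200, 150) on a subject whose PSF FWHM spans 11 scaled pixels
box = click_to_box(200.0, 150.0, psf_fwhm=11.0)
```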

A scale-free distance metric
Using square boxes to define the marked clump locations allows us to inexpensively compute the ratio of the area of the intersection between pairs of boxes and the area of their union (see Figure 1). We use the complement of this ratio, which is commonly referred to as the Jaccard distance (Jaccard 1912), as a scale-free distance metric between any volunteer-marked locations.
Figure 1. Geometric illustration of the ratio between the area of the intersection of two boxes (dotted region) and the area of their union (dashed region). We use the complement of this ratio as a scale-free distance metric bounded between zero and unity.
The Jaccard distance is maximally unity if the boxes are disjoint and minimally zero if they coincide perfectly.
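A minimal Python sketch of the Jaccard distance between two axis-aligned boxes, matching the definition above:

```python
def jaccard_distance(a, b):
    """Jaccard distance between two axis-aligned boxes given as
    (xmin, ymin, xmax, ymax): 1 - intersection_area / union_area.
    Returns 0.0 for identical boxes and 1.0 for disjoint boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # intersection width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # intersection height
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return 1.0 - inter / union
```

Because the metric is a ratio of areas, it is invariant under a common rescaling of both boxes, which is what makes it scale-free across subjects with different pixel sampling.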

DATA AGGREGATION MODEL
The core of our data aggregation approach is based on a custom implementation of the probabilistic model and algorithm proposed by BVP17. In this section, we present a detailed description of the model and explain how it is used to optimise the efficiency of clump detection using the volunteers' annotations. We recognise that this paper contains a lot of somewhat complicated notation, so to aid the reader we have included a reference table of the most commonly recurring symbols in Appendix B.

Overview
We construct a global model that simultaneously considers the NS individual elements of the full subject set S ≡ {si}, i = 1, . . . , NS, and the individual members of the entire volunteer cohort V. Each subject si ∈ S is inspected by a randomly selected group of volunteers Vi ⊆ V, who each provide a set of independent two-dimensional annotations of visible clump locations Zi ≡ {zij}. Throughout this paper we will use the notation |X| to denote the number of elements in the set X, so here |Vi| denotes the number of volunteers who annotate the subject si. For convenience, we define Sj ⊆ S to denote the subset of subjects that are inspected by the jth volunteer. For every subject si, we define a true label yi to encode the unknown locations of all real clumps in the image. Using only the information provided by the global set of volunteer annotations Z ≡ ∪i Zi, we wish to derive a separate estimated label ŷi for each subject that closely approximates yi. Our goal is to minimise the mismatch between ŷi and yi while keeping the number of volunteers who annotate the subject si as small as possible, and thereby to optimise our use of volunteers' effort. We facilitate this aim by computing a "risk" metric Ri for each subject that represents a weighted combination of quantitative magnitude estimates for several sources of approximation error in the estimated label (see subsection 5.7 for more details). We expect that the risk for a particular subject will decrease as the number of volunteer annotations for that subject increases. Accordingly, by choosing an appropriate global risk threshold τ and retiring subjects once Ri < τ, we aim to confidently remove individual subjects from the classification pool as soon as the expected error is acceptably small. This approach differs from many traditional crowd-sourcing techniques, which require a fixed number of volunteers to inspect each subject. Such approaches are generally less efficient because stable consensus between volunteers is often achieved before the prescribed number of annotations have been gathered. An additional benefit of our approach is that particularly difficult subjects can be segregated for expert inspection if their risk remains high after many volunteers have inspected the subject.
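The risk-based retirement logic can be condensed into a small decision function; the threshold and annotation-limit values below are illustrative placeholders, not the values used by the Galaxy Zoo: Clump Scout pipeline:

```python
def retirement_decision(risk, n_annotations, tau=0.5, max_annotations=50):
    """Decide whether a subject can leave the classification pool.
    tau is the global risk threshold; max_annotations caps volunteer
    effort on subjects whose risk never falls below tau. Both defaults
    are illustrative assumptions."""
    if risk < tau:
        return "retire"            # consensus label is trusted
    if n_annotations >= max_annotations:
        return "expert_review"     # persistently high risk: flag for experts
    return "keep_collecting"       # gather more volunteer annotations
```

The key design choice is that retirement is driven by an error estimate rather than a fixed classification count, so easy subjects consume less volunteer effort.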

Associating subject annotations with subject labels
Each of the volunteer annotations zij ∈ Zi forms a set of rectangular boxes, zij ≡ {b k ij }, k = 1, . . . , |Bij |, that encodes the locations of any clumps that the volunteer perceived in the subject si. Analogously, we model the true clump locations for si as an abstract set of |Bi| ≥ 0 rectangular boxes, such that yi ≡ {b l i }, l = 1, . . . , |Bi|. The concrete sizes and shapes of these boxes are ultimately determined by our aggregation algorithm, but for subject si they are guaranteed to be at least as large as the boxes comprising the volunteer annotations for that subject. Our goal is to associate each of the click locations corresponding to volunteer annotations for a particular subject with a single true clump location. Formally, we aim to associate each of the concrete elements of Zi with a single abstract element of yi. This task is complicated for several reasons. Different volunteers may annotate different subsets of clumps, and the order in which they do so is neither defined nor constrained. Volunteers may miss some real clumps, so there may be elements of yi that have no counterpart annotations in a particular zij. Conversely, the set of annotations provided by a particular volunteer for a particular subject may contain false positives, so some elements of a particular zij may not correspond with any elements of yi.
Figure 2 provides a schematic illustration of the process by which we associate volunteer annotations with probable clump locations, and subsection 5.3 explains the notation and the computational details. Formally, our aggregation algorithm computes an optimal set of mapping indices {a k ij }, where a k ij = l associates the volunteer box b k ij ∈ zij with the true clump box b l i ∈ yi. The possibility of false positive boxes in zij is accounted for by defining a singleton "∅" element to which they can be associated.
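The real aggregator computes an optimal assignment; the following simplified sketch instead uses greedy nearest-neighbour matching by Jaccard distance, with boxes that have no sufficiently close counterpart mapped to the false-positive element "∅". All names here are ours:

```python
def associate(volunteer_boxes, true_boxes, max_dist=0.99):
    """Map each volunteer box (xmin, ymin, xmax, ymax) to the index of the
    nearest true-label box by Jaccard distance, or to "∅" if no true box
    overlaps it. A greedy stand-in for the optimal matching of the model."""
    def jaccard(a, b):
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return 1.0 - inter / union
    mapping = []
    for vb in volunteer_boxes:
        dists = [(jaccard(vb, tb), l) for l, tb in enumerate(true_boxes)]
        best = min(dists, default=(1.0, None))
        mapping.append(best[1] if best[0] < max_dist else "∅")
    return mapping
```

A greedy match can differ from the jointly optimal one when volunteer boxes compete for the same true box, which is one reason the full algorithm solves the assignment globally.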

Modelling volunteer skill
For a given subject, the visibility of clumps to a particular volunteer, and the positional accuracy with which they are able to annotate the clumps they do perceive, is likely to be influenced by several factors. These may include: domain expertise, experience gained from time spent contributing to Galaxy Zoo: Clump Scout, confusion regarding the detailed task instructions, and even the screen size and resolution of the device they typically use to provide annotations.
To model the impact of these factors we consider three scenarios, which relate a particular volunteer's annotations to the locations of real clumps in the subject image.Consider the annotations provided by the jth volunteer in our cohort.
In the first scenario, volunteer j provides a true positive by marking a location that lies close to a real clump. It is unlikely that any volunteer's mark precisely annotates the true clump location and indeed, different volunteers may have different perceptions of where the clump actually is. We model any positional offset between volunteer j's annotation and the true clump location as a random Jaccard distance dj, drawn from a Gaussian distribution with zero mean and a volunteer-specific variance σ 2 j :

dj ∼ Gaussian(0, σ 2 j ) (3)

In the second scenario, the volunteer provides a false positive by marking a location which does not correspond to the location of a real clump. We model the rate of false positive annotations for volunteer j by considering each mark they provide as a Bernoulli trial with "success" probability p fp j . Finally, volunteer j may provide an implicit false negative by failing to mark the location of a real clump. We model the false negative rate for volunteer j by considering each opportunity to mark a real clump location as a Bernoulli trial with "success" probability p fn j . Hereafter, we refer collectively to the three model parameters Sj ≡ {σj , p fp j , p fn j } as volunteer j's "skill" parameters.
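The three skill components combine into a per-annotation log-likelihood, which can be sketched as follows. This is a simplification: binomial coefficients are omitted, and the non-negative Jaccard distances are treated as Gaussian-distributed exactly as in the model above; the function signature is ours:

```python
import math

def annotation_log_likelihood(distances, n_fp, n_fn, n_opportunities,
                              sigma_j, p_fp, p_fn):
    """Log-likelihood of one volunteer's annotation of one subject:
    - `distances`: Jaccard offsets of the true-positive marks,
    - `n_fp` / `n_fn`: counts of false positives and false negatives,
    - `n_opportunities`: number of real clumps the volunteer could mark,
    - (sigma_j, p_fp, p_fn): the volunteer's skill parameters."""
    ll = 0.0
    for d in distances:  # Gaussian model for true-positive offsets
        ll += (-0.5 * math.log(2 * math.pi * sigma_j**2)
               - d**2 / (2 * sigma_j**2))
    n_tp = len(distances)
    # Bernoulli models for false positives and false negatives
    ll += n_fp * math.log(p_fp) + n_tp * math.log(1 - p_fp)
    ll += n_fn * math.log(p_fn) + (n_opportunities - n_fn) * math.log(1 - p_fn)
    return ll
```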

Modelling subject difficulty
Notwithstanding the skill of individual volunteers, there are numerous image characteristics that may result in varying degrees of clump visibility at different locations in different subjects in S. An obvious example is clump contrast: bright clumps that appear superimposed on a smooth, faint background galaxy will be easier to discern than faint clumps on a bright, noisy background. For simplicity, we assume that the impact of all such confounding factors manifests as a positional offset between the true location of a clump and any volunteer annotations that identify it. For a particular true clump location b l i ∈ yi, we model the size of this offset as a random Jaccard distance d i,l drawn from a Gaussian distribution with zero mean and variance (σ l i )².
Hereafter, we refer to the set Di ≡ {σ l i } as the subject "difficulty".

Modelling volunteer annotations
We combine our volunteer skill and image difficulty models to define a compound model for the annotation zij that each volunteer provides for each subject si. The first and second terms represent binomial models, which compute the probability that zij contains n fn false negatives, n fp false positives and ntp = |Bij | − n fp true positives, given p fp j and p fn j .
Figure 2. Schematic illustration of how elements of volunteers' annotations are associated with elements of the subject label yi. We illustrate a case in which three volunteers have provided three independent annotations of the same subject. Volunteers 1 and 2 both annotate subsets of the real clumps in the image. Volunteer 3 mistakenly marks two foreground stars as clumps. The central column lists the value of {a k ij } computed for each of the boxes forming the volunteers' annotations. For volunteers 1 and 2, these values define the index of the corresponding box in yi. Both annotations provided by volunteer 3 probably mark foreground stars and neither is marked by another volunteer. In this toy example, the algorithm maps both to the "∅" element, thereby defining them as false positives.
The third term considers the Jaccard distances d kl between any true positive (i.e. a k ij ≠ ∅) box b k ij ∈ zij and its counterpart b l i ∈ yi, as well as the subject's difficulty Di and the volunteer's skill Sj.
We combine the Gaussian components of the volunteer skill and image difficulty models by computing a combined variance parameter, where η weights the relative impact of volunteer skill and image difficulty according to the p-values computed by their respective probability models. Formally, we model η as the expected value of a binary indicator variable e:

e = 1 if volunteer skill dominates d kl ; e = 0 if image difficulty dominates d kl (7)

We assume that both sources of variance are equally likely to dominate for any particular volunteer annotation (i.e. P (e = 1) = P (e = 0)), which implies η = E[e] = 1/2 (e.g. Ivezić et al. 2019). Our goal is then to compute the maximum a posteriori estimates of the set of all true labels, the set of all subject difficulties, and the set of all volunteer skills S ≡ {Sj }, given the union of all volunteer annotations, which we denote Z.
The additional terms in Equation 9 represent prior distributions for the parameters of our model.
(i) π(Di) models the prior probabilities of observing the difficulty parameters associated with the ith subject.
(ii) π(Sj) models the prior probability of observing the volunteer skill parameters associated with the jth volunteer.
(iii) π(yi) models the prior probability that the unknown true label for si is yi. For simplicity, we assume that all possible labels are equally likely.
For practical reasons, we choose prior distributions for each parameter that are the conjugate priors6 of that parameter for the corresponding likelihood model distribution. This choice facilitates straightforward computation of model parameter updates when new annotations are collected. Specifically, we use Beta distribution priors for the binomially distributed parameters {p fp j , p fn j }. Intuitively, this prior simulates the information gained by performing n β Bernoulli trials with success probability p k 0 (k ∈ {fp, fn}). For the parameters that are modeled as variances of Gaussian likelihood models, {σ 2 j , (σ l i )²}, we specify scaled inverse chi-squared priors, which simulate the information gained from a sample of nχ previous observations drawn from a Gaussian distribution with zero mean and variance σ 2 0 . The initial values for the parameters of our prior models, {p fp 0 , p fn 0 , n fp β , n fn β , σ 2 0,S , nχ,S, σ 2 0,V , nχ,V }, are hyper-parameters of our algorithm which must be chosen a priori. Table 1 lists the values that we assign to each of these hyper-parameters when processing the Galaxy Zoo: Clump Scout dataset.
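Under these conjugate priors, parameter updates take a simple pseudo-observation form. The sketch below shows posterior point estimates only (the Beta posterior mean, and a pseudo-count blend for the variance rather than the full scaled inverse chi-squared posterior); function names are ours:

```python
def update_bernoulli_rate(p0, n_beta, successes, trials):
    """Posterior mean of a Bernoulli rate under a
    Beta(n_beta * p0, n_beta * (1 - p0)) prior: equivalent to having
    already observed n_beta pseudo-trials at rate p0."""
    return (n_beta * p0 + successes) / (n_beta + trials)

def update_variance(sigma0_sq, n_chi, squared_distances):
    """Point estimate of a Gaussian variance under a scaled inverse
    chi-squared prior: n_chi pseudo-observations with variance sigma0_sq,
    blended with the observed squared offsets."""
    n = len(squared_distances)
    return (n_chi * sigma0_sq + sum(squared_distances)) / (n_chi + n)
```

With no data the estimates reduce to the prior settings, and as annotations accumulate the data terms dominate, which is the behaviour described in Appendix A.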
In Appendix A we provide detailed rationale for our choice of prior distribution models and show how they yield estimates for our likelihood model parameters that become increasingly data-dominated as more annotations are collected.

COMPUTING AGGREGATED LABELS
Figure 3 provides a schematic overview of how our implementation computes aggregated labels for subjects.In subsequent subsections we describe the illustrated operations in detail.
6 Specifying a conjugate prior π(θ) for parameter θ in Bayes's rule yields a posterior distribution p(θ|z) ∝ π(θ) • p(z|θ) that has the same functional form as the prior itself.Note that in general the conjugate prior depends on both the likelihood model and the parameter of interest.For example, the variance and mean of a Gaussian likelihood function have different conjugate priors.

The Working Batch
To minimise the dependence of aggregated clump locations on our choice of model prior hyper-parameters, we designed our aggregation framework to process elements from a dynamically maintained working batch containing data and metadata for 25,000 classifications. Each element in the working batch represents a single click location marking a clump as part of the annotation provided by a single volunteer.
To populate the working batch, we select subjects that have been inspected by at least three volunteers and have at least one annotated clump. For each selected subject, we assemble all its available annotation data and append them to the working batch in a single block of elements. This ensures that any subject retirement decision is made on the basis of all available information. We specify a minimum target batch size, and new blocks are added until the size of the working batch exceeds this target. If five or more volunteers inspect a subject and none annotate a clump, we assume that no clumps are present and preemptively retire the subject instead of adding its data to the working batch. Whenever a volunteer inspects a subject that has at least one clump annotation, but does not annotate any clumps themselves, we append a single empty classification element to the working batch. We require records of these empty classifications in order to compute the probability that a particular volunteer fails to annotate a real clump, i.e. p fn j . After processing a single batch of classification data, the most likely outcome is that only a subset of the corresponding subjects will have Ri < τ (see subsection 4.1 and subsection 5.7) and be deemed sufficiently low-risk for retirement. We update the working batch by removing the classification data for retired subjects and replenishing it with new blocks of classification data for active subjects. Once a subject is retired, the aggregated estimated label ŷi is considered final, and any subsequently submitted classifications for that subject will not be included in subsequent batches.
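The batch-population rules above can be condensed into a small decision function; the names and return values are ours and the counts mirror the thresholds stated in the text:

```python
def batch_action(n_inspections, n_clump_annotations):
    """Decide how a subject is handled when populating the working batch:
    subjects need >= 3 inspections and >= 1 clump annotation to enter;
    subjects with >= 5 clump-free inspections are retired as empty."""
    if n_clump_annotations == 0:
        if n_inspections >= 5:
            return "retire_empty"   # assume no clumps are present
        return "wait"               # too few inspections to decide
    if n_inspections >= 3:
        return "append_block"       # add all annotation data as one block
    return "wait"
```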
We impose a maximum lifetime for any data element by specifying the maximum number of batch replenishment cycles that it can persist within the working batch. Subjects whose data remain after this lifetime has expired are retired and flagged for inspection by experts. This forced retirement strategy prevents the working batch from becoming stale and dominated by inherently difficult or high-risk subjects that would never retire normally.
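The batch maintenance described above can be sketched in Python. This is an illustrative toy, not the project's implementation: the target size, the per-element `age` bookkeeping and the block structure are all assumptions made for the sketch.

```python
MIN_BATCH_SIZE = 25_000   # assumed working-batch size target
MAX_LIFETIME = 5          # assumed maximum number of replenishment cycles

def replenish(working_batch, candidate_blocks, min_size=MIN_BATCH_SIZE):
    """Append whole per-subject blocks until the batch reaches its target size.

    Each block is a list of click elements for one subject; blocks are added
    atomically so retirement decisions always see all available data.
    """
    for block in candidate_blocks:
        if len(working_batch) >= min_size:
            break
        for element in block:
            element["age"] = 0  # replenishment cycles survived so far
            working_batch.append(element)
    return working_batch

def expire_stale(working_batch, max_lifetime=MAX_LIFETIME):
    """Split the batch into surviving elements and stale ones flagged for experts."""
    stale, fresh = [], []
    for element in working_batch:
        element["age"] += 1
        (stale if element["age"] > max_lifetime else fresh).append(element)
    return fresh, stale
```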

Initialisation
Processing of each working batch begins with an initialisation phase. Adding new blocks of data to the working batch implies introducing new subjects to our likelihood model. We initialise the subject difficulty parameters of all new subjects to the same value, which we specify as a hyper-parameter of our aggregation framework. The newly added data blocks may include annotations that were provided by previously unknown volunteers. If so, we initialise the skill parameters for all new volunteers identically using three of the hyper-parameters that were introduced in subsection 4.6.
A subset of elements in the working batch correspond with subject blocks from earlier batches that did not retire. We reinitialise the parameters for these subjects, and re-compute the skill parameters of returning volunteers to reflect only their annotations for subjects that have retired. This parameter propagation strategy allows us to use information that we have learned about volunteers' skills, while ensuring that the subjects that persist between batches are processed identically to new subjects that happen to have received annotations from returning volunteers. After initialising or propagating the model parameters for all elements of the working batch, we cache their values.
To complete the initialisation phase for each new working batch, we use the algorithm described in subsection 5.3 to perform preliminary clustering of overlapping volunteer annotations for each subject. The subsequent subsections explain how we apply iterative expectation maximisation to refine the initial clusters, while simultaneously computing the maximum likelihood solution of Equation 9.

Computing box associations
For each subject s_i ∈ S, we follow the approach of BVP17 and implement a Facility Location algorithm (Mahdian et al. 2001) to approximately derive the maximum likelihood box-to-clump mapping {a^k_ij} (see subsection 4.2 and Figure 2). Facility location algorithms form clusters with a specific topology comprising one or more cities, each uniquely connected to a single, central facility. This topology is illustrated in Figure 4.
Our implementation identifies disjoint, spatially concentrated subsets of the boxes in Z_i, which we then identify with true clump locations b^l_i ∈ y_i. We label each of these aggregated clusters with the index l and denote them as Z^l_i. Establishing a new cluster entails labelling a particular box b^k_ij ∈ Z_i as a facility and connecting at least one other box b^k′_ij′ that was provided by a different volunteer. Note that by associating box b^k_ij with cluster Z^l_i as either a city or a facility, we establish the mapping a^k_ij = l. Each box in the set of volunteer annotations is associated with at most one true clump and each subset may contain at most one box per volunteer. These constraints reflect our assumption that separate marks provided by the same volunteer are intended to indicate separate clumps.
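A toy version of this clustering step, with boxes as `(x0, y0, x1, y1)` tuples, might look like the sketch below. This is a greedy stand-in for the full facility-location optimisation, not the paper's algorithm; it only illustrates the constraints just described (one central facility per cluster, at most one box per volunteer per cluster).

```python
def jaccard_distance(a, b):
    """1 minus intersection-over-union for axis-aligned boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return 1.0 - inter / union if union > 0 else 1.0

def greedy_cluster(boxes, d_max):
    """Greedy stand-in for the facility-location clustering.

    boxes: list of (volunteer_id, box) pairs. The first box of a group
    becomes the facility; later boxes join as cities when they lie within
    Jaccard distance d_max of the facility and their volunteer is not yet
    represented in that cluster.
    """
    clusters = []  # each: {"facility": box, "members": {volunteer: box}}
    for volunteer, box in boxes:
        for c in clusters:
            if volunteer not in c["members"] and \
                    jaccard_distance(box, c["facility"]) <= d_max:
                c["members"][volunteer] = box
                break
        else:
            clusters.append({"facility": box, "members": {volunteer: box}})
    return clusters
```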
We specify that assigning facility status to a particular box incurs a real-valued cost (Equation 16), while connecting a city to a facility incurs a cost (Equation 17) that depends on the Jaccard distance d between the city box b^k_ij and the facility box. Combining these cost definitions yields the assembly cost for an individual cluster (Equation 18). Some boxes may represent false positive annotations. To handle these cases we follow the approach of BVP17 and establish a dummy facility at zero cost. Connections to the dummy facility identify boxes as false positives and incur box-specific costs (Equation 19). Let Z_i be the set of all established clusters for subject s_i.
These definitions imply an expression for the total cost C_i of all established clusters and all connections to the dummy facility that closely approximates the negative natural logarithm of the volunteer-wise product Π_j π(S_j) p(z_ij | y_i, D_i, S_j) defined in Equation 5.
The facility location algorithm is designed to compute the box-to-cluster mapping that minimises C_i, which simultaneously yields the approximate maximum likelihood solution of Equation 5 for given volunteer skill and image difficulty parameters.
To derive the aggregated estimate for the subject label ŷ_i, we merge the individual boxes comprising each cluster by computing the mean coordinates of their corresponding vertices. This yields a single rectangular representation for each true clump location whose size is intermediate between the smallest and largest boxes in the cluster.
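The vertex-averaging merge can be sketched directly (boxes again as `(x0, y0, x1, y1)` tuples):

```python
def merge_cluster(member_boxes):
    """Aggregate a cluster's boxes into one consensus box by averaging each
    vertex coordinate, mirroring the label-merging step described above."""
    n = len(member_boxes)
    return tuple(sum(b[i] for b in member_boxes) / n for i in range(4))
```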
During the initialisation phase, we use a simplified set of facility location costs that do not depend on the volunteer skill parameters or the image difficulties. We specify that establishing a new facility during initialisation incurs the same cost for any volunteer annotation; this cost is controlled by a hyper-parameter f_V ∈ [0, 1] that represents the fraction of volunteers who inspected a subject that must contribute a box to an assembled cluster (we remind readers that |V_i| denotes the number of volunteers who inspected the ith subject). The initialisation-phase cost of connecting a box to a facility depends on their Jaccard separation, with a hyper-parameter d_max ∈ [0, 1] representing the maximum Jaccard distance between any city in a cluster and its central facility. Finally, connecting any box to the dummy facility during initialisation incurs unit cost. Table 1 lists the values we adopt for f_V and d_max.
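One plausible reading of these simplified costs is sketched below. The exact functional forms are given by the paper's equations, which the extracted text does not reproduce; here we assume the facility cost scales as f_V·|V_i|, the city-connection cost is the Jaccard distance rescaled by d_max, and the dummy connection costs one unit.

```python
def init_costs(n_volunteers, f_v, d_max):
    """Hedged sketch of the simplified initialisation-phase costs.

    n_volunteers: |V_i|, the number of volunteers who inspected the subject.
    f_v, d_max: the hyper-parameters described in the text (Table 1).
    """
    facility_cost = f_v * n_volunteers          # same for every annotation

    def connection_cost(d):                     # d: Jaccard distance to facility
        return d / d_max if d <= d_max else float("inf")

    dummy_cost = 1.0                            # unit cost for false positives
    return facility_cost, connection_cost, dummy_cost
```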

Computing Image Difficulty
For each rectangular box b̂^l_i ∈ ŷ_i comprising the estimated label for the ith subject, we use the global hyper-parameter σ²_S,0 to define a subject-specific minimum difficulty. Intuitively, if a subject's label includes more identified clump locations then we assume that clumps are easier to precisely locate and the minimum difficulty is reduced. We then update this minimum value to reflect the scatter among the subset of volunteer boxes b^k_ij ∈ Z^l_i that were associated with the corresponding ground truth cluster. For each of these true positive boxes we compute the Jaccard distance d^kl_ij between it and its corresponding rectangular box in the estimated subject label, b̂^l_i. Using these distances in conjunction with Equation A11 we estimate the subject difficulty, where n_χ,S is another hyper-parameter of our algorithm (see subsection 4.6 and Appendix A).

Computing Volunteer Skill
We compute each volunteer's skill parameters p^fp_j, p^fn_j and σ²_j (see subsection 4.3) by comparing their individual clump annotations z_ij ∈ Z for each subject s_i ∈ S_j with the corresponding label estimate ŷ_i. For each volunteer, we compute the number of false positives by counting the subset of their annotation boxes that were associated with the dummy cluster.
We compute the number of false negatives for a volunteer by summing the number of established clusters for each image they inspected that do not contain one of their boxes.
Note that ∅ in Equation 27 represents the empty set rather than our notation for the dummy facility. Analogously, we compute the number of true positives by counting the total number of clusters to which the volunteer contributed.
We use the expressions in Equation 26 and Equation 27 in conjunction with Equation A6 to compute estimates for p^fp_j and p^fn_j.
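The three counts feeding these estimates can be sketched as follows, assuming (for illustration only) that each cluster is represented by the set of volunteer ids that contributed a box to it, restricted to subjects the volunteer actually inspected, and that dummy-facility connections are listed per volunteer:

```python
def volunteer_counts(clusters, dummy_members, volunteer):
    """Count true positives, false positives and false negatives for one
    volunteer from a clustering solution.

    clusters: list of sets of volunteer ids contributing to each established
        cluster (restricted to subjects this volunteer inspected).
    dummy_members: volunteer ids whose boxes connected to the dummy facility.
    """
    n_tp = sum(1 for c in clusters if volunteer in c)        # clusters they hit
    n_fp = sum(1 for v in dummy_members if v == volunteer)   # spurious boxes
    n_fn = sum(1 for c in clusters if volunteer not in c)    # clusters they missed
    return n_tp, n_fp, n_fn
```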
To compute σ²_j for each volunteer we follow a similar approach to that used when computing image difficulties. We compute the Jaccard distances {d^kl_ij} between all true positive boxes and the merged rectangular box b̂^l_i that was derived from the cluster with which they are associated. We then use these distances in conjunction with Equation 28 and Equation A11 to estimate σ²_j. As a consequence of our prior specifications, the formulations of Equation 29, Equation 30 and Equation 31 can all be factored into terms that depend only on the current working batch and terms that depend only on prior information. This allows us to straightforwardly update the skill parameters of returning volunteers without having to reconsider the annotations they contributed to previous working batches.
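A regularised variance update with this batch/prior factorisation might look like the sketch below. The exact form of Equation A11 is not reproduced in the extracted text, so we assume a conjugate (scaled-inverse-chi-squared-style) blend of a prior variance with pseudo-count n_χ and the observed squared Jaccard distances.

```python
def update_variance(distances, sigma2_prior, n_chi):
    """Hedged sketch of the regularised variance update (cf. Equation A11).

    The numerator splits into a prior-only term (n_chi * sigma2_prior) and a
    batch-only term (the sum of squared distances), mirroring the factoring
    that lets skills be updated without revisiting earlier batches.
    """
    n = len(distances)
    return (n_chi * sigma2_prior + sum(d * d for d in distances)) / (n_chi + n)
```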

Computing Maximum Likelihood Labels
Once the associated clusters have been defined and the subject difficulties and volunteer skills have been computed we are able to compute the likelihood of each subject's estimated label using Equation 8, Equation 6 and Equation 5. Practically, we compute the log-likelihood for each subject and sum these to derive a global likelihood for all annotation data that comprise the current working batch.
Recall (subsection 5.3) that we use a simplified set of facility location costs to derive an initial clustering solution for each new working batch. These costs are used for initialisation because they can be computed without having estimated volunteer skills or subject difficulties, but they will generally not yield a set of clusters that correspond with the maximum likelihood solution of Equation 5 for any subject. Similarly, the likelihood model parameters that we compute based on the initial clustering solution are unlikely to be good estimates of the subject difficulties or volunteer skills. As illustrated by the red boxes in Figure 3, we use an iterative approach to derive the maximum likelihood solution for Equation 5 and the corresponding best estimates of the likelihood model parameters.
After the initial set of volunteer skills has been computed, we recompute the box associations for all subjects using the nominal facility location costs specified in Equation 16, Equation 17 and Equation 19. Using these clusters, we recompute the likelihood model parameters and the corresponding subject label likelihoods. We repeat this procedure until the sum of log-likelihoods for all subjects converges to its maximum value.
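The iterate-until-convergence loop can be sketched generically; the four callables stand in for the paper's facility-location re-clustering, parameter-estimation and likelihood-evaluation steps and are assumptions of this sketch:

```python
def run_em(init_cluster, recluster, update_params, log_likelihood,
           tol=1e-6, max_iter=100):
    """Generic EM-style loop matching the iteration described above.

    Start from the initialisation clustering, then alternate re-clustering
    (using the nominal costs, which depend on the current parameters) with
    parameter updates, until the summed log-likelihood converges.
    """
    clusters = init_cluster()
    params = update_params(clusters)
    ll_old = log_likelihood(clusters, params)
    for _ in range(max_iter):
        clusters = recluster(params)      # E-like step
        params = update_params(clusters)  # M-like step
        ll = log_likelihood(clusters, params)
        if abs(ll - ll_old) < tol:
            break
        ll_old = ll
    return clusters, params
```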

Computing Subject Risks
In subsection 4.1 we introduced the concept of a "risk" metric Ri that can be computed for any subject si and used to quantitatively determine whether the estimated label ŷi is sufficiently representative of the unknown true label yi to be scientifically useful.Specifying a risk that decreases monotonically as the reliability of ŷi increases enables a principled decision to retire the subject si when its risk falls below a predefined threshold value which we denote τ .
To compute the risk for the ith subject, we follow the approach of BVP17 and define Ri for each subject as the weighted sum of three separate terms.
The first term, N^fp_i, represents an estimate of the number of detected clumps that are spurious, while N^fn_i estimates the number of genuine clumps that have not been detected. Finally, N^σ_i(δ) estimates the number of detected clump locations that are genuine but insufficiently accurate, in the sense that their Jaccard distance from the true clump location is likely to exceed a threshold value δ, which we specify as a hyperparameter.
The weight terms α^fp, α^fn and α^σ are hyper-parameters that allow the properties of the clump sample for retired subjects to be tuned for particular scientific investigations. For a specific value of τ, increasing the value of α^fp relative to the other weights will result in a purer clump sample, while a relative increase in α^fn increases the sample completeness. Specifying a larger value for α^σ will result in more accurate clump locations, which may be useful for studies considering the radial distribution of clumps within their host galaxies.
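The weighted risk and the threshold test can be written down directly; the unit default weights here are illustrative (the paper's adopted values appear in Table 2):

```python
def subject_risk(n_fp, n_fn, n_sigma,
                 alpha_fp=1.0, alpha_fn=1.0, alpha_sigma=1.0):
    """Weighted subject risk, following the structure of Equation 32."""
    return alpha_fp * n_fp + alpha_fn * n_fn + alpha_sigma * n_sigma

def below_threshold(n_fp, n_fn, n_sigma, tau):
    """True when the overall risk clears the retirement threshold tau."""
    return subject_risk(n_fp, n_fn, n_sigma) < tau
```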
To estimate the expected number of genuine clumps in the estimated label for the ith subject, we consider each established cluster Z^l_i ∈ Z_i and identify two subsets V^mark,l_i, V^miss,l_i ⊆ V_i of the volunteers who inspected the ith subject s_i. The volunteers in V^mark,l_i are those who inspected s_i and contributed a box to the lth cluster Z^l_i. Conversely, V^miss,l_i contains the volunteers who inspected s_i but missed the clump associated with Z^l_i. To estimate the overall probability that Z^l_i represents a false positive detection, we combine the probability that all volunteers in V^miss,l_i correctly omitted the detected clump from their annotation with the probability that all volunteers in V^mark,l_i provided a spurious annotation.
Similarly, to estimate the overall probability that the cluster Z^l_i represents a true positive clump detection, we combine the probability that all volunteers in V^miss,l_i missed a genuine clump with the probability that the associated boxes provided by all volunteers in V^mark,l_i were correct.
Finally, we use Equation 33 and Equation 34 to estimate the number of clusters in Z_i that are false positives by summing, over all clusters Z^l_i ∈ Z_i, the expected value of an indicator variable that equals 1 when the cluster is a false positive and 0 otherwise (recall that we used an analogous approach to compute the parameter η in subsection 4.5).
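A hedged sketch of the per-cluster false-positive probability follows. The extracted text does not reproduce Equations 33-35, so we assume "correctly omitted" maps to (1 - p^fp_j), that a marker's correct detection maps to (1 - p^fn_j), and that the two weights are normalised against each other; the paper's exact expressions may differ.

```python
from math import prod

def cluster_fp_probability(markers, missers):
    """Estimate the probability that a cluster is a false positive.

    markers / missers: lists of (p_fp, p_fn) skill pairs for the volunteers
    who did / did not contribute a box to the cluster.
    """
    # spurious-cluster weight: markers made false-positive marks,
    # missers correctly marked nothing there
    w_fp = prod(fp for fp, _ in markers) * prod(1 - fp for fp, _ in missers)
    # genuine-cluster weight: markers correctly detected a real clump,
    # missers missed a real clump
    w_tp = prod(1 - fn for _, fn in markers) * prod(fn for _, fn in missers)
    return w_fp / (w_fp + w_tp)
```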
To estimate the expected number of clumps in the estimated label for the ith subject that are genuine, but have insufficiently accurate locations, we consider the sets of true positive boxes, supplied by the volunteers in V^mark,l_i, that were associated with each cluster Z^l_i ∈ Z_i. We model the Jaccard distance d_l between the estimated clump location b̂^l_i ∈ ŷ_i and the true clump location b^l_i ∈ y_i as a random sample from a Gaussian distribution with zero mean and variance derived by summing the constituent box variances defined in Equation 6.
Using this Gaussian model, we estimate the expected number of estimated clump locations that are inaccurate by more than δ by summing the probabilities {p^σ_l} that the errors in the individual clump locations exceed this threshold.
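Under the zero-mean Gaussian error model, the exceedance probability has a closed form. Since a Jaccard distance is non-negative, we assume the folded (half-normal) tail here; whether the paper uses the folded or two-sided form is not stated in the extracted text.

```python
from math import erf, sqrt

def location_inaccuracy_prob(sigma2, delta=0.5):
    """P(|d| > delta) for d ~ N(0, sigma2), i.e. the half-normal tail:
    2 * (1 - Phi(delta / sigma)) = 1 - erf(delta / (sigma * sqrt(2)))."""
    return 1.0 - erf(delta / sqrt(2.0 * sigma2))
```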
Our approach for estimating the expected number of genuine clumps that are not represented in the estimated label for the ith subject (i.e. the number of false negatives) emulates the one used by BVP17. We begin by using the facility location algorithm to re-cluster the annotations for each subject, subject to three additional constraints that are based on the original maximum likelihood solution.
(i) Volunteer boxes that were originally associated with true positive clusters are not considered as potential cities. This means that the only way that true positive annotations can contribute to clusters is by becoming facilities.
(ii) Only annotations that were not defined as facilities originally are considered as potential facilities. This prevents rediscovery of the clumps that were indicated by the maximum likelihood solution for the subject.
(iii) There is no dummy facility available, so all annotations must either become a facility or connect to an existing facility, regardless of how high the connection or establishment costs are.
We assume that each of the assembled clusters comprising the constrained facility location solution Z′_i represents a potentially missed clump detection. For each new cluster we compute its assembly cost C_l using Equation 18 and compare this with the cost of connecting all the cities it contains to the dummy facility.
To compute an initial estimate for N^fn_i we sum, over all clusters in Z′_i, the expected value of an indicator variable that equals 1 when the cluster is a false negative and 0 otherwise,
where p^fn_l estimates the probability that cluster Z′^l_i identifies a real clump b^l_i that was originally missed by the maximum likelihood solution, defined by analogy with Equation 20. Furthermore, p^tn_l is the probability that the boxes in Z′^l_i all correspond with false positive clicks.
This initial estimate cannot be computed for a subject if no volunteer boxes were originally connected to the dummy facility. However, the absence of nominally false positive boxes does not imply that no clumps have been missed. To estimate how many clumps might have been missed when no false positives are present, we consider intersections between the global set of all annotations provided by all volunteers for all subjects in the working batch. We use this global set to estimate a subject-agnostic probability that two boxes coincide at any particular location within a subject. The higher this probability is for a particular location, the more likely it is that a clump will be located there. We define a coincidence to occur when two boxes are separated by a Jaccard distance less than d_max. We begin by randomly shuffling the elements of the working batch. We then process the randomised elements sequentially to find any mutually coinciding subsets of volunteer boxes. For each box, we check for coincidence with any of the previously processed elements. If the boxes coincide we increment a coincidence count, which we denote n^∩k_ij, for the previously processed element and remove the current element from the shuffled batch. If no coincidences are found we retain the current element, which allows coincidences between it and subsequent elements to be identified and counted. After the shuffled working batch has been processed, we estimate the probability of a coincidence with each of the remaining elements by computing the ratio between its accumulated coincidence count and the total number of annotations n_z comprising the working batch.
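The shuffle-and-fold procedure just described can be sketched as follows; the `coincide` predicate stands in for the Jaccard-distance test against d_max, and the dictionary-free representation is an assumption of this sketch:

```python
import random

def coincidence_probabilities(elements, coincide, n_annotations, seed=0):
    """Estimate subject-agnostic coincidence probabilities.

    Shuffle the working batch, fold each element into the first earlier-kept
    element it coincides with (incrementing that element's count), and
    convert counts to probabilities by dividing by the total number of
    annotations in the batch.
    """
    rng = random.Random(seed)
    elements = elements[:]
    rng.shuffle(elements)
    kept, counts = [], []
    for e in elements:
        for i, k in enumerate(kept):
            if coincide(e, k):
                counts[i] += 1   # coincidence count for the kept element
                break
        else:                    # no coincidence: retain for later elements
            kept.append(e)
            counts.append(0)
    return kept, [c / n_annotations for c in counts]
```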
Figure 5 illustrates the different stages in our computation of the p^∩k_ij using all boxes in the working batch. Let B^∩ represent the remaining elements of the shuffled working batch, for which a value of p^∩k_ij has been computed. For each subject in the original working batch, we find the subset B′ of elements in B^∩ that do not coincide with any of the boxes b̂^l_i ∈ ŷ_i that comprise the estimated subject label. For each box in B′, we increment N^fn,init_i by the product of the probability that a coincidence occurs at that location and the probability that all volunteers who inspected the image would have missed the clump.
Note that for subjects that have no clumps identified in their estimated labels, B′ → B^∩. In practice, we find that N^fn,init_i always dominates the estimate of N^fn_i and that the second term in Equation 43 is always ≪ 1.

Subject retirement and batch finalisation
Computing the expected false positive, false negative and inaccurate true positive counts (i.e. N^fp_i, N^fn_i and N^σ_i(δ)) independently for each subject allows us to define a compound retirement criterion that specifies maximum permissible values, N^fp_i,max, N^fn_i,max and N^σ_i,max, for each of these quantities, as well as a threshold τ on the overall subject risk. Table 2 lists the thresholds we use in practice as well as the values we adopt for the coefficients specified in Equation 32.
Once the subject risks have been computed, we retire those subjects for which the overall risk R_i < τ and N^fp_i, N^fn_i and N^σ_i are all less than their specified maximum permissible values, before removing their elements from the working batch. We also identify and remove any stale subject data that have persisted for the maximum allowed number of batch replenishment cycles without retiring. Such subjects are likely very difficult or complicated, so we mark them for expert inspection, assessment and labelling. For the remaining subjects that were not retired, we re-initialise their difficulty parameters and discard any associated clusters that were established when the working batch was processed. (Recall from subsection 4.2 that we define an annotation to be the set of box markings provided by a particular volunteer when they inspect a particular subject, so the number of annotations is generally less than the size of the working batch.) Annotation data that were provided by a single volunteer for different subjects can appear in separate working batches, especially if volunteers return to the project regularly over an extended period of time. It is also possible that only a subset of the subjects annotated by a volunteer in a single working batch are retired when batch processing completes. If a volunteer's annotation data persist between batches, those persistent data should not be used to update volunteer skills multiple times during multiple batch processing cycles. This could lead to pathological subjects unfairly inflating or reducing the skill parameter values (p^fp_j, p^fn_j, σ²_j) for a particular volunteer. To avoid this scenario, we restore the volunteer skills that were cached at the start of the latest cycle and update them using only annotations for subjects that did retire.
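The cache-and-restore bookkeeping at the end of a cycle can be sketched as below. The count-increment representation of a skill update is an assumption made for illustration; the paper's skill estimates additionally involve the regularised expressions of Appendix A.

```python
def finalise_skills(cached_skills, per_subject_updates, retired_subjects):
    """Restore each volunteer's skill counts to the values cached when the
    batch started, then re-apply only the updates from subjects that
    actually retired, so persistent annotations never count twice.

    per_subject_updates: subject -> {volunteer: (d_tp, d_fp, d_fn)} count
    increments accumulated while processing the batch.
    """
    skills = {v: dict(s) for v, s in cached_skills.items()}  # restore cache
    for subject in retired_subjects:
        for volunteer, (d_tp, d_fp, d_fn) in \
                per_subject_updates.get(subject, {}).items():
            skills[volunteer]["n_tp"] += d_tp
            skills[volunteer]["n_fp"] += d_fp
            skills[volunteer]["n_fn"] += d_fn
    return skills
```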
The batch processing cycle then restarts by acquiring new annotation data and repopulating the working batch.

RESULTS
Recall that the full Galaxy Zoo: Clump Scout dataset (Z) contains 3,561,454 click locations, which constitute 1,739,259 annotations of 85,286 distinct subjects provided by 20,999 volunteers, and that approximately 20 volunteers inspected each subject. Using this dataset, we identify 128,100 potential clumps distributed among 44,126 galaxies. Figure 6 shows five examples of galaxies in which clumps were detected.

Testing the effect of volunteer multiplicity
We expect that the performance of our aggregation framework will vary depending upon the number of volunteers who inspect each subject. To investigate this dependence we assemble 17 subsamples of annotations {Z̃_n}^20_{n=3} ⊂ Z that contain between 3 and 20 annotations per galaxy. Each Z̃_n is constructed by randomly sampling n annotations for each subject s_i ∈ S. For example, Z̃_5 includes 5 randomly sampled annotations for each galaxy in the Galaxy Zoo: Clump Scout subject set. We then use our aggregation framework to derive the set of corresponding estimated subject labels Ŷ(Z̃_n) ≡ {ŷ_i,n}^{|S|}_{i=1}, where ŷ_i,n = ŷ_i(Z = Z̃_n) is the label for s_i based only on the n annotations for that subject within Z̃_n. In subsequent sections, we will examine the differences between results derived using these different restricted datasets. Note that the dataset containing 20 annotations per subject, denoted Z̃_20, is not quite the full Galaxy Zoo: Clump Scout dataset Z because the Zooniverse interface occasionally collects more than 20 annotations per subject.
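The subsampling construction of each Z̃_n amounts to a per-subject random draw; a minimal sketch (the dictionary layout and seed are assumptions of this illustration):

```python
import random

def subsample_annotations(annotations_by_subject, n, seed=0):
    """Build the restricted dataset Z_n by randomly sampling n annotations
    per subject (subjects with fewer than n annotations keep all of theirs)."""
    rng = random.Random(seed)
    return {s: rng.sample(a, min(n, len(a)))
            for s, a in annotations_by_subject.items()}
```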

Aggregated clump properties
Our aggregation algorithm assigns a separate false positive probability p^fp_l to each clump it identifies (see subsection 5.7). The left-hand panel of Figure 7 shows the distribution of this false positive probability for clumps detected using 20 annotations per subject, which is strongly bimodal with ≈ 90% of clumps having p^fp_l < 0.2 or p^fp_l > 0.8. The right-hand panel shows how the distribution of the false positive probabilities for all identified clumps evolves as more volunteers annotate each subject. For fewer than 5 annotations per subject (i.e. n ≤ 5) the estimates for the clumps' false positive probabilities remain somewhat prior-dominated and the distributions are unimodal with medians close to the hyper-parameter value p^fp_0 = 0.1. For more than 5 annotations per subject (i.e. n > 5), the distributions become progressively more bimodal, which increases their interquartile ranges. The distribution medians decrease monotonically as the number of annotations per subject n → 20, which indicates that providing more volunteer annotations per subject allows our framework to more confidently predict the presence of clumps. For every bounding box in each subject's maximum likelihood label, we also compute the probability p^σ_l that the Jaccard distance between it and the unknown true location of the clump exceeds δ = 0.5. The left-hand panel of Figure 8 shows the distribution of p^σ_l for clumps detected using 20 annotations per subject, while the right-hand panel shows how the distribution of p^σ_l evolves as more volunteers annotate each subject. Again, our model priors appear to dominate for fewer than 5 annotations per subject and the distribution medians decrease monotonically as the number of annotations per subject n → 20. This pattern indicates that providing more volunteer annotations per subject allows our framework to more precisely determine the locations of clumps.
Figure 9 illustrates the spatial distribution of the detected clump locations, in bins of estimated clump false positive probability p^fp_l. We observe that 99.9% of clumps with p^fp_l ≤ 0.5 (i.e. likely true positives) are located within a central circular region occupying 20% of the area of their corresponding images. In contrast, clumps with p^fp_l > 0.5 (i.e. likely false positives) are 10 times more likely to fall outside this region. This central concentration of confidently identified clumps is reassuring because it reflects the typical footprints of the target galaxies in each subject image, which is where we would reasonably expect to find genuine clumps. For all clumps, regardless of their estimated false positive probability, we observe a clear under-density at the centre of the distribution, which likely reflects the fact that most volunteers correctly distinguish the target galaxies' central bulges from clumps.

Comparison with expert annotations
To quantify the degree of correspondence between the clumps identified by volunteers and those identified by professional astronomers, we used the Galaxy Zoo: Clump Scout interface to collect annotations from three expert astronomers for 1000 randomly selected subjects and compared the recovered clump locations with those derived from volunteer clicks by our aggregation framework.
For each subject in this expert-annotated image set, we consider the 17 different estimated labels ŷ_i,n that were computed using 3 ≤ n ≤ 20 volunteer annotations per subject (see subsection 6.1). We then filter each of these 17 labels by selecting the subsample of its bounding boxes whose associated false positive probabilities p^fp_l are less than a selectable threshold value, which we denote p^fp_⋆. By setting p^fp_⋆ close to zero, we expect to select only the bounding boxes that mark real clumps. Conversely, we expect that setting p^fp_⋆ close to one results in a subsample that is likely to contain more false positive bounding boxes. We use the symbol Ŷ_n(p^fp_⋆) to denote the set of estimated labels for all expert-annotated subjects that were computed using n volunteer annotations per subject and filtered to include only those bounding boxes with false positive probabilities less than p^fp_⋆.
For a particular false positive filtering threshold p^fp_⋆ and number of annotations per subject n, we consider the filtered labels for all 1000 expert-annotated subjects and define N^FP_n to be the total number of empirically false positive aggregated clump bounding boxes in Ŷ_n(p^fp_⋆), i.e. those that contain zero expert click locations. Conversely, N^FN_n denotes the total number of expert clicks located outside of any aggregated box, which we designate as false negatives. We identify the remaining N^TP_n aggregated boxes that coincide with an expert click location as true positives.
Using the set of aggregated clump designations, we compute the aggregated p^fp_⋆-threshold-dependent clump sample completeness C_n(p^fp_⋆) and purity P_n(p^fp_⋆). Figure 10 illustrates how the completeness and purity of our aggregated clump sample depend on n. In the left-hand panel we plot C_n and P_n values derived using the whole expert-identified clump sample as a ground-truth set. The values plotted in the right-hand panel are derived by comparing a restricted set of nominally normal ground-truth clumps, which experts did not identify as "unusual" (see subsection 3.2), with aggregated clumps for which the majority of volunteers who identified the clump classified it as being normal in appearance. In both panels, the crosses show the "optimal" completeness and purity values that maximise the hypotenuse √(C(p^fp_⋆)² + P(p^fp_⋆)²) over all possible p^fp_⋆ thresholds. For comparison, the square and triangular points in Figure 10 respectively illustrate the maximum values of completeness and purity that can be achieved independently.
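With the three counts defined above, completeness and purity take the standard recall/precision forms:

```python
def completeness_purity(n_tp, n_fp, n_fn):
    """Completeness (recall) and purity (precision) of the aggregated clump
    sample, computed against the expert ground-truth designations."""
    completeness = n_tp / (n_tp + n_fn)
    purity = n_tp / (n_tp + n_fp)
    return completeness, purity
```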
For both the full and the restricted ground truth sets, we observe a general trend that increasing the number of volunteers who inspect each subject increases the optimal sample completeness at the expense of reducing purity. Using the expert classifications as a benchmark, it is clear that our most complete aggregated clump samples suffer substantial contamination. In the most extreme case, using the "normal" clump comparison sets for n = 20 and letting p^fp_⋆ = 1 yields ∼ 97% completeness, but only ∼ 35% purity. The high level of contamination indicates that volunteers are much more optimistic than experts when annotating clumps, i.e. volunteers will mark features that experts will ignore. Moreover, while completeness values generally improve when comparing the restricted "normal" clumps, the corresponding purity values are substantially worse than those derived from the full clump samples. This degradation in purity for the "normal" clump subset likely indicates that volunteers and experts disagree about the definition of a "normal" clump, with volunteers being less likely to label a clump as unusual.
The top row of Figure 11 shows the g, r and i band flux distributions for aggregated clumps that are empirically determined to be false positive and true positive when comparing them with expert clump annotations; the empirically false positive clumps are typically fainter than the empirically true positive clumps. The bottom row of Figure 11 shows all non-redundant flux ratios for the g, r and i bands. In general, the empirically false positive clumps are brighter in the g band and would appear bluer in the subject images. Overall, the distributions in Figure 11 suggest that volunteers are more likely to mark faint features than experts, particularly when those features appear blue. Figure 18 shows typical examples of the faint blue features that volunteers annotate but experts ignore.
Figure 12 illustrates the degree of correspondence between the value of p^fp_l assigned to each clump by our aggregation framework and their empirical categorisation as true or false positives. The figure compares the distributions of p^fp_l for empirically true positive and false positive clumps identified using all available annotations for the expert-annotated subject set. The distributions represent the restricted subset of clumps in Ŷ_20 that the majority of volunteers labeled as "normal". However, we recognise that volunteers and experts may disagree about what criteria define a "normal" clump. Therefore, to avoid conflating this categorical disagreement with genuine cases when experts and volunteers mark different features, we consider any expert-identified clump (regardless of the annotation tool used) when assigning true-positive or false-positive labels. The majority of aggregated clumps in both categories have very low estimated false positive probabilities (p^fp_l ≪ 1), indicating a high degree of consensus between volunteers, albeit that this consensus disagrees with the expert annotations. Although clumps in both empirical categories have estimated p^fp_l values spanning the full range [0, 1], we note that 95% of empirically true-positive clumps have p^fp_l < 0.3 compared with only 68% of empirical false positives. This reinforces the evidence implicit in Figure 10 that the aggregated clump sample can be made purer with respect to the expert sample by applying a threshold on p^fp_l.

Volunteer Skill Parameters
Our aggregation framework allows us to monitor the evolution of volunteers' skill parameters as they spend time in the project. The top panel of Figure 13 shows the distribution of the Galaxy Zoo: Clump Scout volunteers' subject classification counts. The distribution is bottom-heavy, with a median of 3 subjects per volunteer, 19,859 volunteers (∼ 95%) annotating fewer than 10 images, and only 176 volunteers (∼ 0.8%) annotating more than 200. The remaining panels of Figure 13 illustrate how our estimates of the volunteers' skill parameters evolve as volunteers inspect and annotate increasing numbers of subjects. For all three skill parameters, the mean and median of the maximum likelihood estimates increase monotonically from their prior values as volunteers annotate more subjects. The relatively slow evolution of p^fp_j reflects the strong regularisation that results from setting the hyper-parameter n^fp_β = 500 (see Table 1).

Subject risk and its components
The distributions shown in Figure 14 reveal how the expected numbers of false positive bounding boxes N fp i, missed clumps (or false negatives) N fn i and inaccurate clump locations N σ i (see subsection 5.7) evolve for the subjects in the Galaxy Zoo: Clump Scout subject set as more volunteers annotate them. For the majority of subjects, our framework estimates values less than one for all risk components, regardless of how many volunteers annotated them. The distributions of N fp i, N fn i and N σ i become broader, and their median values decrease monotonically as n → 20. This pattern indicates that, for the majority of subjects, increasing the number of volunteers who annotate each subject improves the reliability of their consensus labels.
A minority of subjects have estimated values for one or more of N fp i, N fn i or N σ i that are greater than one. For this subset of subjects, their associated risk component distributions appear to stabilise after five or more volunteers have annotated each subject. We suggest that estimates for subjects that are annotated by fewer than 5 volunteers (i.e. for n ≲ 5) are noise-dominated or prior-dominated and somewhat unreliable. The structure that is visible in the distributions of N fp i in the upper-left panel is produced by a strong bimodality in the distribution of false positive probabilities (p fp l) for the clumps in the corresponding sets of estimated labels (i.e. the clumps in the corresponding Ŷn; see Figure 7). For each clump in the estimated label for a particular subject, its false positive probability is very likely to be close to zero or one. The expected number of false positive clumps in a subject's estimated label is derived by summing a term that includes these probabilities in its denominator, so the distributions of N fp i will naturally be concentrated into peaks around integer values. Similar structures that are visible in the distributions of N fn i are produced by a strong bimodality in the summand in Equation 39. The fraction of subjects for which N fp i > 1 peaks at ∼ 10% for n = 15 and decreases to ∼ 8% for n ∼ 20. In contrast, the fraction of subjects for which N fn i > 1 does not peak, but increases quasi-monotonically to reach ∼ 2% as n → 20. The fraction of subjects for which N σ i > 1 is negligible, remaining < 0.05% for all n.
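The integer-peaked structure described above can be reproduced numerically. The sketch below uses a simplified stand-in for N fp i (a plain sum of per-clump probabilities, rather than the paper's exact expression) and an arbitrary Beta(0.1, 0.1) distribution of our own choosing to mimic the strong bimodality of p fp l:

```python
import numpy as np

rng = np.random.default_rng(42)
n_subjects, clumps_per_subject = 10_000, 5

# Strongly bimodal per-clump false-positive probabilities (most values
# close to 0 or 1); Beta(0.1, 0.1) is an illustrative choice of ours.
p_bimodal = rng.beta(0.1, 0.1, size=(n_subjects, clumps_per_subject))
# Control: probabilities drawn uniformly, with no bimodality.
p_uniform = rng.uniform(size=(n_subjects, clumps_per_subject))

def frac_near_integer(p):
    """Fraction of subjects whose summed probabilities land within
    0.1 of an integer (a simplified stand-in for N fp i)."""
    totals = p.sum(axis=1)
    return np.mean(np.abs(totals - np.round(totals)) < 0.1)

# Bimodal summands concentrate the totals near integer values,
# reproducing the peaked structure seen in the N fp i distributions;
# uniform summands show no such concentration.
print(frac_near_integer(p_bimodal), frac_near_integer(p_uniform))
```

The bimodal case piles noticeably more subjects near integer totals than the uniform control, which is the qualitative effect visible in the upper-left panel of Figure 14.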
The overall median values for the estimated numbers of missed clumps and inaccurate clump locations per subject both decrease monotonically as the number of volunteers who inspect each subject increases. However, the overall median for the expected number of false positive clumps per subject increases slowly until n = 13 before beginning to decrease. We assess the feasibility of reducing N fp i by discarding aggregated clumps with high individual false positive probabilities. The upper right panel of Figure 14 shows the effect of filtering clumps with p fp l > 0.85 on the distribution of N fp i. Applying this filter substantially reduces the estimated number of false positive clumps after 5 or more volunteers annotate each subject and, moreover, the fraction of subjects for which the expected number of false positive clumps per subject exceeds one now peaks at ∼ 0.1% for n = 5 and decreases rapidly thereafter. We note that filtering clumps based solely on their estimated false-positive probabilities may inadvertently discard real clumps if p fp l does not correlate appropriately with observable quantities like brightness and colour that can indicate whether a particular feature is a genuine clump or spurious. Indeed, for this reason Adams et al. (2022) apply a very permissive p fp l threshold before filtering further based on observable clump parameters.
Figure 15 illustrates the overall effect of discarding clumps with individual false positive probabilities larger than 0.85 on the number of clumps per galaxy that our framework identifies using different numbers of volunteer annotations per subject. The impact is strongest for n ≳ 7, but the overall effect is small, with ∼ 0.5 fewer clumps identified per galaxy. The left hand panel of Figure 16 plots fluxes in the g, r and i bands versus the estimated individual false positive probability (p fp l) for all clumps that our framework identifies using 20 annotations per subject. In all three bands, the mean flux of clumps with p fp l < 0.2 is ∼ 1.5 times larger than the mean flux for clumps with p fp l > 0.2. The right hand panel of Figure 16 plots the non-redundant flux ratios i/g, r/g and i/r versus p fp l. On average, clumps with low estimated false positive probability appear brighter in bluer bands. Overall, we observe a pattern whereby clumps that appear brighter and bluer in the subject images tend to have lower p fp l. We verified that this pattern does not change significantly when clumps are filtered according to the fraction of volunteers that labeled them as "unusual". This is reassuring because real clumps are expected to be bright and blue in colour, and it suggests that filtering clumps based on p fp l is well motivated physically. The correlations with flux and colour also resemble the empirical patterns described in subsection 6.3, where we observed that the sample of clumps that coincided with expert clump annotations was brighter and bluer than the sample of clumps that did not.
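A minimal sketch of the purity cut discussed above, applied to a synthetic clump catalogue. The field names and simulated values are illustrative assumptions, not the real catalogue's schema:

```python
import numpy as np

# Hedged sketch of the purity cut: discard clump records whose estimated
# false-positive probability exceeds 0.85. The field names and synthetic
# values below are illustrative assumptions, not the catalogue's schema.
rng = np.random.default_rng(0)
n_clumps = 1000
catalogue = {
    "p_fp": rng.uniform(size=n_clumps),           # per-clump false-positive probability
    "g_flux": rng.lognormal(0.0, 0.5, n_clumps),  # g-band flux (arbitrary units)
}

keep = catalogue["p_fp"] <= 0.85          # the permissive threshold used above
filtered_flux = catalogue["g_flux"][keep]

print(f"kept {keep.sum()} of {n_clumps} clumps")
```

A permissive threshold like 0.85 removes only the clumps that the model is most confident are spurious, which is why further physically motivated cuts (as in Adams et al. 2022) remain useful downstream.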
Figure 17 shows how the fractions of subjects that are retired for different reasons vary as more volunteers annotate each subject. More than 90% of subjects meet the subject retirement criterion specified in subsection 5.8, regardless of how many volunteers annotate each subject. Of the remaining subjects, ∼ 7 − 9% become stale after persisting in the working batch for more than 10 replenishment cycles and are removed. The fraction of stale subjects peaks at n = 6 annotations per subject and decreases monotonically thereafter as more annotations per subject are used. Fewer than 1% of subjects failed to retire for any n: the fraction of unretired subjects peaks at 0.9% for n = 3 and falls to < 0.1% for n = 20. We caution that for n < 6 the computation of R and its components is likely to be dominated by our model priors, so the apparent decrease in the number of stale subjects should probably not be interpreted as improved performance within this domain.

DISCUSSION
Using the annotations provided by the Galaxy Zoo: Clump Scout volunteers, our framework has identified a large catalogue of potential clumps. In addition, our aggregation framework provides quantitative metrics for the reliability of the estimated subject labels it computes. These diagnostics allow us to better understand how volunteers interpreted the definition of a clump that they were provided with and how they executed the annotation task.
The observable properties of the clumps we detect appear plausible, both in terms of their spatial distribution within the subject images and their fluxes in the SDSS g, r and i bands. The central concentration of confidently identified clumps in Figure 9 is reassuring because it reflects the typical footprints of the target galaxies in each subject image, which is where we would reasonably expect to find genuine clumps. When clumps with any estimated false positive probability p fp l are considered, we observe a clear under-density at the centre of the distribution, which likely reflects the fact that most volunteers correctly distinguish the target galaxies' central bulges from clumps.
The clump flux and colour distributions in Figure 16 reveal that brighter, bluer clumps tend to have lower false positive probabilities (p fp l). This trend is also reassuring because real clumps are expected to be bright and blue in colour, and it suggests that filtering clumps based on p fp l is well motivated physically. The correlations with flux and colour also resemble the empirical patterns described in subsection 6.3, where we observed that the sample of clumps that coincided with expert clump annotations was brighter and bluer than the sample of clumps that did not. By comparing expert labels for 1000 subjects with those estimated by our framework using volunteer annotations, we showed that volunteers are much more optimistic than experts when annotating clumps. Overall, the distributions in Figure 11 suggest that volunteers are more likely than experts to mark faint features, particularly when those features appear blue. This results in aggregated clump samples for the 1000 test subjects that appear quite heavily contaminated with respect to the expert labels. Moreover, this apparent contamination worsens if clumps that experts or the majority of volunteers labeled as "unusual" are discarded. This degradation in purity for the "normal" clump subset likely indicates that volunteers and experts disagree about the definition of a "normal" clump, with volunteers being less likely to label a clump as unusual.
Using Figure 12 we illustrated that our framework tends to estimate lower false positive probabilities for clumps that were marked by both volunteers and experts. The formulation of our likelihood model means that smaller estimated false positive probabilities correlate broadly with a greater degree of consensus between skilled volunteers that a clump exists at a particular location. Therefore, it seems that while many volunteers mark features that experts would not identify as clumps, features that experts do mark tend to have also been marked by a majority of the more skilled volunteers who inspected the corresponding subject. The correlation between clumps' false positive probabilities and their expert classifications also reinforces the evidence implicit in Figure 10 that the aggregated clump sample can be made purer with respect to the expert sample by applying a threshold on p fp l. We note that while using a visual labelling approach to identify clumps provides more flexibility than relying on a fixed set of brightness or colour thresholds, it is also unavoidably subjective. To illustrate how this subjectivity may be impacting the empirically determined purity and completeness of our clump sample, Figure 18 shows typical examples of the faint blue features that volunteers annotate but experts ignore. Many of these features do appear clump-like, and it is not always obvious why experts have not marked them. Based on these observations, we suggest that the sample of clumps identified by our framework using volunteer annotations may not be as severely contaminated as Figure 10 implies. We also note that the clump samples our framework derives are generally very complete and include the majority of expert-labeled clumps. This means that subsets of clumps for particular scientific analyses can be selected from a nominally impure sample using physically motivated criteria based on directly observable or derived characteristics of the individual clumps. For example, Adams et al. (2022) derive samples of bright clumps by using criteria based on photometry extracted from clumps and their host galaxies.
In addition to providing quantitative estimates for the reliability of individual clump labels, our framework allows us to investigate the performance of individual volunteers and of the entire volunteer cohort. The positive gradients of the skill parameter evolution curves in Figure 13 decrease with increasing number of subjects inspected (their second derivatives are negative, except in the final bin, which contains relatively few volunteers). This suggests that the volunteer skill parameters may converge to stable asymptotic values for very large numbers of inspected subjects. The fact that this convergence was not achieved for the Galaxy Zoo: Clump Scout dataset likely indicates that the global maximum likelihood solution is dominated by the large number of volunteers who inspect very few images and may provide noisy annotations due to their relative inexperience.
The noisiness of volunteer annotations probably indicates that identifying clumps within star forming galaxies, which can have complex underlying morphologies, is relatively difficult for inexperienced non-experts. In subsection 6.4 we noted that most volunteers only annotated a small number of galaxies and may not have had time to learn the visible characteristics of genuine clumps. While it may be the case that the task of clump identification is too difficult for typical Zooniverse volunteers, this seems unlikely, and there are several plausible strategies for making complex and subtle image analysis tasks more feasible for citizen scientists. The most obvious is to improve the amount and quality of the initial training that is provided to volunteers. However, Zooniverse volunteers are accustomed to participating in projects with minimal tutorial material, so imposing a more rigorous training requirement may discourage widespread participation. As discussed in subsection 3.1, the volunteers who contributed to Galaxy Zoo: Clump Scout received real-time feedback for a small number of expert-labeled subjects that they annotated during the early stages of their participation. Providing more detailed feedback for a larger sample of subjects may help volunteers to better understand the task they are being asked to perform. Some Zooniverse projects also provide a dedicated tutorial workflow with an accompanying video tutorial in which experts annotate the same subjects that volunteers see and explain their reasoning. When using feedback as a training tool, it is important that the feedback subjects contain galaxies and clumps that are properly representative of the global populations within the full subject set, but it is difficult to ensure that this is the case unless the experts themselves inspect a large number of subjects. Moreover, the feedback messages that volunteers receive must be carefully chosen to avoid discouraging volunteers if their annotations disagree with those of experts.
An alternative to explicit training and feedback, pioneered by the Gravity Spy project, involves incrementally increasing the difficulty of the subjects that volunteers inspect and annotate as they spend longer engaged with the project and their skill improves (Zevin et al. 2017). Using this "leveling up" approach requires an a priori metric for the relative difficulty of subjects, as well as ongoing assessment of volunteers' skills. While our framework naturally fulfills the latter requirement, it does not facilitate prior segregation of subjects to populate the different difficulty levels.
It might be possible to formulate a heuristic approach to estimating subject difficulty based on observable properties of the clumps' host galaxies, but that is beyond the scope of this paper. As we discuss in subsection 4.1, the consensus reliability metrics that our framework computes may enable quantitatively motivated early retirement of subjects if it can be established that a stable consensus solution has been reached. In subsection 5.7 we described how our framework formulates a subject retirement criterion based on estimated metrics that are proxies for the completeness (N fn i), purity (N fp i) and accuracy (N σ i) of that subject's label. Figure 17 seems to show that more than 90% of subjects fulfil this criterion, even when only n = 3 volunteers inspect each subject. However, the distributions shown in Figure 14 appear to be noise or prior dominated for n ≲ 5, and we suggest that estimates of the subject risk R and its components {N fp i, N fn i, N σ i} for that domain should be treated with some caution.
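The retirement criterion sketched above can be expressed as a simple predicate over the three risk components. The thresholds of 1.0 and the minimum-annotation guard of 5 below are illustrative choices motivated by this discussion, not the paper's exact settings:

```python
# Hedged sketch of the subject-retirement decision: a subject retires when
# the expected numbers of false positives, false negatives and inaccurate
# locations in its label all fall below chosen thresholds, provided enough
# volunteers have annotated it. The threshold values and minimum-annotation
# guard are illustrative assumptions, not the paper's exact settings.

def should_retire(n_fp, n_fn, n_sigma, n_annotations,
                  thresholds=(1.0, 1.0, 1.0), min_annotations=5):
    # For n < min_annotations the risk estimates are prior-dominated,
    # so we refuse to retire regardless of their values.
    if n_annotations < min_annotations:
        return False
    t_fp, t_fn, t_sigma = thresholds
    return n_fp < t_fp and n_fn < t_fn and n_sigma < t_sigma

print(should_retire(0.2, 0.4, 0.01, n_annotations=7))   # low risk, enough annotations
print(should_retire(0.2, 0.4, 0.01, n_annotations=3))   # too few annotations
print(should_retire(1.3, 0.4, 0.01, n_annotations=12))  # too many expected false positives
```

The minimum-annotation guard encodes the caution above: below n ≈ 5 the risk components track the priors rather than the data, so a low apparent risk should not trigger retirement.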
In Figure 14, we showed that discarding clumps with estimated individual false positive probabilities p fp l > 0.85 substantially reduces the number of subject labels that are expected to include one or more false positive clumps and that this number reduces rapidly once more than 7 volunteers have annotated each subject.
We interpret the fact that the estimated number of missed clumps per subject (N fn i) increases as more volunteers annotate each subject as an effect of some of those volunteers marking very faint features. Potential false-negative clumps identified by the second, constrained run of the facility location algorithm (see subsection 5.7) are typically on the threshold of identification by our framework, which normally means that several volunteers have marked them. If the fraction of highly optimistic volunteers within the overall cohort is small, then a relatively large number of volunteers must inspect each subject for faint features to reach the threshold where they are considered potential false negatives. The increase in N fn i as the number of annotations per subject n → 20 is then an indication that more faint features are reaching, but not surpassing, our framework's detection threshold. Figure 15 provides an empirical estimate for the number of clumps per galaxy that are missed when fewer volunteers inspect each subject. Although the mean number of identified clumps per galaxy does increase in the interval 7 < n < 20, the rate of increase is very slow: increasing n from 7 to 20 results in just ∼ 0.5 more clumps with individual false positive probabilities p fp l < 0.85 per galaxy on average. In line with our previous observations regarding volunteer optimism, we suggest that many of these additional clumps may in fact be faint, blue features within the target galaxies. As Figure 10 illustrates, our comparison with expert labels also suggests that n ∼ 7 provides the best compromise between the completeness and purity of our aggregated clump sample.
Empirically, it seems that at least 5 volunteers must inspect each subject to obtain a stable solution for the subject labels, and that the majority of genuine clumps could be identified by our framework for most subjects using the annotations provided by ∼ 7 volunteers. Increasing the number of volunteers beyond this threshold seems to introduce more noise into the annotation data and also results in progressively fainter features being identified. Retiring the majority of subjects after inspection by 7 volunteers, if it could have been well motivated, would have reduced the volunteer effort required for the Galaxy Zoo: Clump Scout project by a factor > 2. Unfortunately, we must acknowledge that the reliability metrics computed by our framework do not seem to converge in a way that is useful to facilitate an early retirement decision. For most subjects, our framework predicts expected numbers of false positives, false negatives and inaccurate true positives that are less than one for any number of annotations (i.e. N fp i, N fn i, N σ i < 1 ∀ n), and so these subjects would have been retired when n < 7 based on the thresholds in Table 2. As we show in Figure 10, retiring subjects this early would yield a lower sample completeness, even for the brighter clumps that experts also identified. Moreover, while the predicted numbers of subject labels containing false positive or inaccurate clump locations both decrease for n ≳ 7 as n → 20, the predicted number of subject labels that are missing real clumps increases. Under any retirement criterion predicated on N fn i < 1, considering the annotations from more volunteers would therefore result in more subjects becoming stale in the working batch and requiring inspection by experts. Fortunately, in the case of Galaxy Zoo: Clump Scout, the fraction of subjects for which the estimated number of false negative clumps N fn i > 1 for any n is < 3% of the overall dataset (∼ 2500 subjects), so visual inspection by experts would be feasible.

SUMMARY AND CONCLUSION
In this paper we have presented a software framework that uses a probabilistic model to aggregate multiple annotations that mark two-dimensional locations in images of distant galaxies and to derive a consensus label based on those annotations. The annotations themselves were provided via the Galaxy Zoo: Clump Scout citizen science project by non-expert volunteers who were asked to mark the locations of giant star forming clumps within the target galaxies. Among a sample of 85,286 galaxy images that were inspected by volunteers, our software framework identified 44,126 that contained at least one visible clump and detected 128,100 potential clumps overall.
To empirically evaluate the validity of the clumps we identify, we compared our aggregated labels with annotations provided by expert astronomers for a subset of 1000 galaxy images. We found that Galaxy Zoo: Clump Scout volunteers are much more optimistic than experts, and are willing to mark much fainter features as potential clumps, particularly if those features appear blue in colour. However, volunteers also mark the vast majority of bright clumps that experts identify, so although the sample of clumps we identify is ∼ 50% contaminated with respect to the expert identifications, it is ∼ 90% complete.
In addition to our empirical evaluation, we have used the statistical model that underpins our framework to compute quantitative metrics for the reliability of the overall aggregated labels that we derive for each image. These metrics suggest that a stable consensus label for most images is achieved after ∼ 7 volunteers have annotated them, which is < 50% of the 20 annotations that were collected for each image via Galaxy Zoo: Clump Scout and would represent a significant saving in volunteer effort. However, the annotation data are quite noisy, with large variation between the numbers of locations that are marked by different volunteers, and this noise makes it difficult to define a robust "early retirement" criterion that could be used to safely curtail collection of annotations before 20 have been acquired.
We suggest that the noisy annotation data reflect the fact that inexperienced non-experts find the task of identifying clumps difficult, or that the task was not properly explained. In section 7, we discussed how different approaches to volunteer training could be used to help volunteers better distinguish the visible characteristics of genuine clumps from those of the faint, blue features that many ultimately marked. On the other hand, one of the benefits of using citizen science to identify clumps is that it avoids being overly prescriptive regarding the definition of a clump. Galaxy Zoo: Clump Scout represents the first extensive wide-field search for clumpy galaxies in the local Universe, and it may be that low-redshift clumps have different properties to their more distant counterparts. Using strict thresholds on brightness or colour might result in an unexpected population of fainter clumps being missed. Moreover, the sample of clumps identified by volunteers appears to be very complete and so, if a subset of bright clumps is required for science analysis, such a sample can be straightforwardly constructed using photometric measurements for each clump (e.g. Adams et al. 2022).
Although our framework was developed to aggregate annotations for a specific citizen science project, its applicability is more general. A large number of projects running on the Zooniverse platform collect two-dimensional image annotations. Many of those projects consider subjects that are more familiar to non-experts and may be less prone to noise. In such cases, our framework may be able to substantially reduce the amount of effort and time taken to reach consensus for each subject.

DATA AVAILABILITY
The data underlying this article were used in Adams et al. (2022) and can be obtained as a machine-readable table by downloading the associated article data from https://doi.org/10.3847/1538-4357/ac6512.
Annotation A set of click locations provided by a single volunteer as they inspect a single subject. The click locations are later expanded into a set of square boxes, as explained in subsection 4.2.
Label A set of zero or more rectangular bounding boxes, derived by our aggregation framework for a single subject image, that estimates the locations of any clumps it contains.
Skill A compound metric, describing a particular volunteer, that estimates the probability that they will mark a spurious clump, the probability that they will miss a real clump, and the accuracy of the locations they provide for any real clumps they mark.
Difficulty A quantitative metric for the degree to which the properties of a single subject image affect the ability of volunteers to perceive and accurately label any clumps it contains.
Risk A metric that is designed to quantify the reliability and scientific utility of a single subject's consensus label.
Retire Stop collecting annotations for a subject.

APPENDIX C: TABLE OF SYMBOLS
In this section we provide a reference table for symbols that recur in multiple sections of this paper.

APPENDIX D: COMPARISON WITH SCIKIT LEARN MEANSHIFT CLUSTERING ALGORITHM
We emphasise that the aim of this paper is not to present a novel and very complicated clustering algorithm. Indeed, our focus is the likelihood model that we use to estimate the Galaxy Zoo: Clump Scout volunteers' skills, the difficulty of the subjects that they inspect, and the reliability of the consensus labels that we derive. Nonetheless, we recognise that there are many well established clustering algorithms in the literature and that some of them may outperform our framework's ability to actually detect clumps, even if they cannot provide the same auxiliary information about the final subject labels. Presenting an exhaustive comparison between our framework and every alternative algorithm is beyond the scope of this paper. However, we have tested several of the methods available from the Scikit Learn Python package (Pedregosa et al. 2011). In this section we present a representative comparison between our framework and the Scikit Learn MeanShift clustering algorithm. We set the MeanShift algorithm's bandwidth parameter equal to the size of the SDSS imaging PSF for each subject image; all other parameters were left at their default values.
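For readers unfamiliar with the algorithm, a bare-bones flat-kernel mean shift can be sketched in plain NumPy. This is only an illustration of the technique under simplified assumptions; the comparison in this appendix used the actual sklearn.cluster.MeanShift implementation with the bandwidth set to the SDSS PSF size:

```python
import numpy as np

def mean_shift(points, bandwidth, n_iter=50, tol=1e-5):
    """Minimal flat-kernel mean shift: each point climbs to the mean of
    its neighbours within `bandwidth`, then nearby converged modes are
    merged. A bare-bones illustration only, not sklearn's implementation."""
    modes = points.astype(float).copy()
    for _ in range(n_iter):
        shifted = np.empty_like(modes)
        for i, m in enumerate(modes):
            near = points[np.linalg.norm(points - m, axis=1) < bandwidth]
            shifted[i] = near.mean(axis=0)
        converged = np.abs(shifted - modes).max() < tol
        modes = shifted
        if converged:
            break
    # Merge modes that converged to (nearly) the same location.
    centres = []
    for m in modes:
        if not any(np.linalg.norm(m - c) < bandwidth / 2 for c in centres):
            centres.append(m)
    return np.array(centres)

# Two well-separated groups of simulated volunteer clicks.
rng = np.random.default_rng(1)
clicks = np.vstack([rng.normal(0, 0.2, (20, 2)), rng.normal(5, 0.2, (20, 2))])
centres = mean_shift(clicks, bandwidth=1.0)
print(len(centres))
```

Setting the bandwidth to the imaging PSF size, as in the comparison above, effectively declares that clicks closer together than the instrument can resolve belong to the same clump.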
Figure D1 shows the distribution of the difference between the number of clumps detected by our framework and the number detected using the MeanShift algorithm for each subject in the Galaxy Zoo: Clump Scout subject set. For the majority of subjects, our framework detects more clumps than the MeanShift algorithm. In Figure D2 we show some representative subjects for which our framework detects more clumps than the MeanShift algorithm and in Figure D3 we show subjects for which the reverse is true. It is not obvious from these figures that either algorithm is particularly biased towards detecting clumps with specific properties. There is some evidence that our algorithm detects fainter potential clumps than the MeanShift algorithm, and it seems less vulnerable to misidentifying objects like stars and background galaxies as clumps. Even when such objects are detected by our framework, they tend to be assigned false positive probabilities greater than 0.8. In some cases, our framework fails to detect clumps that many volunteers identify. We speculate that this is a result of a small number of volunteers with very high p fp j identifying the clump, which causes our framework to deem other volunteers' clicks as false positives as well.
Table C1. Table of the most commonly recurring symbols used in this paper. We divide the symbols into categories and provide a brief description of how each should be interpreted. Complete descriptions are provided in the main text at the point each symbol is first introduced.

Category Symbol Description

Object indices
i   Index over subjects.
j   Index over volunteers.
k   Index over a volunteer j's clump identifications for a single subject.
l   Index over aggregated clump locations.

Subjects
S   The global set of subject images.
S j   The set of subject images inspected by volunteer j.
s i   A single subject image in S.

Subject risks
R i   The risk for subject s i.
N fp i   The expected number of spurious clump locations (false positives) in the label for subject i.
N fn i   The expected number of missed clumps (false negatives) in the label for subject i.
N σ i   The expected number of nominally true positive clump locations in the label for subject i that differ from the (unknown) true clump location by a Jaccard distance greater than 0.5.

Subject difficulties
σ l i 2   The variance of a Gaussian model for the Jaccard distance offset between the estimated location of the lth detected clump for subject i and its corresponding (unknown) true location.
D i   The difficulty of subject i, defined as the set of σ l i 2 values for all detected clumps in the image.

Volunteers
V   The global set of volunteers.
V i   The subset of volunteers who inspected subject i.

Volunteer skills
p fp j   The probability that volunteer j will click on a spurious clump.
p fn j   The probability that volunteer j will miss a real clump.
σ 2 j   The variance of a Gaussian model for the Jaccard distance offset between volunteer j's true positive click locations and the corresponding (unknown) true clump locations, independent of subject.
S j   The skill of volunteer j, defined as the set {p fp j, p fn j, σ 2 j}.

Annotations
Z   The global set of volunteer annotations.
Z i   The set of annotations for a single subject image provided by all the volunteers who inspected it.
Z n   A randomly selected subset of Z containing exactly n annotations per subject.
z ij   A single annotation provided by volunteer j after inspecting subject i.
B ij   The set of boxes, corresponding to click locations, provided by volunteer j for subject i.
B i   The set of all boxes, corresponding to click locations, provided for subject i by all volunteers who inspected it.
b k ij   A single box, corresponding to the location of a single click provided by volunteer j for subject i.
σ k ij 2   The variance of a Gaussian model for the Jaccard distance offset between volunteer j's kth true positive click location for subject i and its corresponding (unknown) true clump location.
a k ij   An integer value that maps the kth click in volunteer j's annotation of subject i to a specific clump in that subject's estimated label (or to the dummy facility if it is deemed to be a false positive).

Labels
Y   The global set of subject labels.
y i   The unknown true label for subject i.
b l i   A single box comprising part of the unknown true label for subject i.
ŷi   The estimated label for subject i that is computed by our framework.
b̂ l i   A single box comprising part of the estimated label ŷi for subject i.

Figure 3. Schematic overview of the aggregation algorithm.

Figure 4. Top: The topology of the clusters that are assembled by the Facility Location algorithm. In this case the set of boxes has been partitioned into three clusters. Within each cluster, the central facility (F 1-3) is connected to one or more cities (C 1-5). Each city is connected to exactly one facility. Bottom: Possible arrangement of aggregated box clusters corresponding to the illustrated topology for an image after inspection by three volunteers. Blue boxes b l i correspond to facilities (F 1-3) and red boxes b k ij correspond to the cities (C 1-5). Note that each volunteer may contribute at most one box to each cluster, and in this case the same volunteer contributed the boxes that were assigned facility status.
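The cluster topology in Figure 4 corresponds to an uncapacitated facility location problem: every box (city) connects to exactly one opened box (facility), trading opening costs against connection costs. The brute-force toy below illustrates the objective only; the coordinates and cost values are arbitrary, and the real framework uses a likelihood-derived cost model rather than exhaustive search:

```python
import itertools

# Hedged sketch of the facility-location objective behind Figure 4.
# Five boxes form three spatial groups; opening a facility costs OPEN_COST
# and each remaining box pays its distance to the nearest open facility.
# All values are illustrative, not the framework's likelihood-based costs.
boxes = [(0.0, 0.0), (0.1, 0.1), (0.9, 1.0), (1.0, 0.9), (2.0, 2.0)]
OPEN_COST = 0.5  # cost of opening a facility (arbitrary choice)

def dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def total_cost(facilities):
    """Opening costs plus each city's distance to its nearest facility."""
    return OPEN_COST * len(facilities) + sum(
        min(dist(b, f) for f in facilities) for b in boxes)

# Brute force over all non-empty facility subsets (fine for tiny inputs).
best = min(
    (subset for r in range(1, len(boxes) + 1)
     for subset in itertools.combinations(boxes, r)),
    key=total_cost)
print(len(best), round(total_cost(best), 3))
```

With these costs the optimum opens one facility per spatial group, reproducing the three-cluster topology sketched in the figure: opening a fourth facility costs more than the small connection distance it would save.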

Figure 5. Computing the random coincidence probability using all boxes in the working batch. Left panel: Shaded boxes represent all elements in the first working batch. Solid boundaries indicate groups of boxes that coincided under the dmax = 0.9 criterion. Note that a large box may validly encompass all or most of a smaller one without coinciding: for nested boxes the Jaccard distance equals one minus the ratio of the box areas in normalised coordinates, so the boxes fail to coincide whenever that ratio is less than 1 − dmax. Boxes that did not coincide with any others are shown using dashed lines. Middle panel: The elements of B_∩, coloured according to the number of boxes they were found to coincide with. Right panel: Two-dimensional map showing the mean probability that one or more boxes in the working batch will accidentally coincide at a given two-dimensional location.
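The random coincidence probability illustrated in Figure 5 can be approximated with a simple Monte Carlo sketch. The box size, batch sizes and trial count below are arbitrary illustrative choices, and "coincidence" is reduced to a plain Jaccard-distance test.

```python
import numpy as np

rng = np.random.default_rng(0)

def coincide(a, b, d_max=0.9):
    """Boxes (x, y, w, h) coincide if their Jaccard distance is below d_max."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return (1.0 - inter / union) < d_max

def random_coincidence_rate(n_boxes, size=0.05, trials=2000):
    """Monte Carlo estimate of the probability that a randomly placed box
    coincides with at least one of n_boxes scattered uniformly in [0, 1]^2."""
    hits = 0
    for _ in range(trials):
        batch = [(x, y, size, size) for x, y in rng.random((n_boxes, 2))]
        probe = (*rng.random(2), size, size)
        hits += any(coincide(probe, b) for b in batch)
    return hits / trials

r_small = random_coincidence_rate(2)
r_large = random_coincidence_rate(20)  # denser batches coincide more often
```

As expected, the accidental coincidence rate grows with the number of boxes in the batch, which is why the map in the right panel of Figure 5 depends on the local annotation density.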

Figure 6. Examples of clump-hosting galaxies, illustrating the ability of our framework to exclude false-positive annotations. The left-hand column shows the galaxy images as they were seen by volunteers. The second column overlays all volunteer annotations on a grey-scale image of the same galaxy. In the third column, volunteer annotations that were assigned to a facility and identified as clumps are shown in colour. Annotations that were assigned to the dummy facility are shown in black. The fourth column shows the clump locations that we ultimately identify.

Figure 7.
Figure 10 illustrates how the completeness and purity of our aggregated clump sample depend on n. In the left-hand panel we plot C_n and P_n values derived using the whole expert-identified clump sample as a ground-truth set. The values plotted in the right-hand panel are derived by comparing a restricted set of nominally normal ground-truth clumps, which experts did not identify as "unusual" (see subsection 3.2), with aggregated clumps for which the majority of the volunteers who identified the clump classified it as being normal in appearance. In both panels, the crosses show the "optimal" completeness and purity values that maximise the hypotenuse √(C(p_fp)² + P(p_fp)²) over all possible p_fp thresholds. For comparison, the square and triangular points in Figure 10 respectively illustrate the maximum values of completeness and purity that can be achieved independently. For both the full and the restricted ground-truth sets, we observe a general trend that increasing the number of volunteers who inspect each subject increases the optimal sample completeness at the expense of reducing purity. Using the expert classifications as a benchmark, it is clear that our most complete aggregated clump samples suffer substantial contamination. In the most extreme case, using the "normal" clump comparison sets for n = 20 and letting p_fp = 1 yields ∼97% completeness, but only ∼35% purity. The high level of contamination indicates that volunteers are much more optimistic than experts when annotating clumps, i.e. volunteers will mark features that experts will ignore. Moreover, while completeness values generally improve when comparing the restricted "normal" clumps, the corresponding purity values are substantially worse than those derived from the full clump samples. This degradation in purity for the "normal" clump subset likely indicates that volunteers and experts disagree about the definition of a "normal" clump, with volunteers being less likely to label a clump as unusual. The top row of Figure 11 shows the g, r and i band flux distributions for aggregated clumps that are empirically determined to be false positive and true positive when comparing them with expert clump annotations. To better represent the appearance of the clumps that volunteers and experts actually see, the band-limited fluxes shown in Figure 11 are independently scaled in the same way as the corresponding bands of the Galaxy Zoo: Clump Scout subject images (see subsection 2.2). The distributions reveal that empirically false-positive clumps are ∼5-10 times fainter on average.
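The threshold optimisation described above can be sketched as a simple sweep: accept clumps whose false-positive probability falls below a threshold, compute completeness and purity against a ground-truth flag, and keep the threshold maximising √(C² + P²). The tiny mock catalogue here is invented purely for illustration.

```python
import numpy as np

def completeness_purity(p_fp, is_true, threshold):
    """Accept detections with false-positive probability below `threshold`.
    Completeness = accepted true / all true; purity = accepted true / accepted."""
    accepted = p_fp < threshold
    n_true_accepted = np.sum(accepted & is_true)
    completeness = n_true_accepted / np.sum(is_true)
    purity = n_true_accepted / max(np.sum(accepted), 1)
    return completeness, purity

# Mock catalogue: per-clump false-positive probabilities and ground truth.
p_fp = np.array([0.05, 0.10, 0.30, 0.60, 0.90, 0.95])
is_true = np.array([True, True, True, False, True, False])

# Keep the threshold that maximises the hypotenuse sqrt(C^2 + P^2).
best_score, best_threshold = max(
    (np.hypot(*completeness_purity(p_fp, is_true, t)), t)
    for t in np.linspace(0.01, 1.0, 100)
)
```

This mirrors the trade-off shown by the crosses in Figure 10: loosening the threshold raises completeness until the accepted sample starts absorbing false positives and purity drops.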

Figure 8. Left panel: Distribution of the estimated probability that an individual clump location is inaccurate (p^l_σ) for clumps identified using 20 annotations per subject (i.e. using Z_20). The distribution is concentrated close to zero, with all clumps having p^l_σ ≲ 0.3. The inset shows the distribution for p^l_σ < 0.01. Right panel: Distributions of p^l_σ corresponding to between n = 3 and n = 20 volunteer annotations per subject. The distribution medians decrease monotonically from ≈ 0.05 for n = 3 to ≈ 4 × 10⁻⁴ for n = 20, while the distribution interquartile ranges become wider as more volunteers annotate each subject. Note that the colour scale shows the number density of clumps to account for the fact that the two-dimensional histogram bins cover different areas.

Figure 10. Purity versus completeness for different numbers of volunteers per subject. The left-hand panel shows values derived using the full volunteer label sets and all expert-identified clumps as a benchmark sample, while the values shown in the right-hand panel are derived by comparing the sets of clumps which experts and volunteers identified as "normal". Squares indicate the values of the maximum possible completeness and purity for each number of volunteers, which can generally not be realised simultaneously. Crosses indicate the optimal completeness and purity values that can be simultaneously realised for each volunteer count.

Figure 11. Top row: Flux distributions in the g, r and i bands for clumps that are empirically determined to be false positive or true positive by comparing with expert clump annotations. Dashed vertical lines indicate the distribution means. Bottom row: Flux ratio distributions for clumps that are empirically determined to be false positive or true positive by comparing with expert clump annotations. Dashed vertical lines indicate the distribution medians. In both rows, the fluxes in each band are scaled in the same way as the corresponding bands of the subject images (see subsection 2.2) to better reflect the data that volunteers actually see.

Figure 12.

Figure 14. Evolution of the distributions for components of subject risk as the number of volunteer annotations per subject increases. Distributions for the expected numbers of false-positive bounding boxes N^fp_i, missed clumps (or false negatives) N^fn_i and inaccurate clump locations N^σ_i are shown in the upper-left, lower-left and lower-right panels respectively. The upper-right panel shows the distributions of N^fp_i after discarding individual clumps with false-positive probabilities p^fp_l > 0.85. Note that the y axis changes from logarithmic to linear scaling at the values indicated by the black horizontal dashed lines to better illustrate the evolution of structures in each distribution.

Figure 15. Evolution of the distribution of the number of clumps per galaxy as more volunteers inspect and annotate each subject. The red markers and lines plot the distribution means for the different numbers of volunteers per subject. Top panel: Number of clumps per galaxy with any value of the estimated false-positive probability p^fp_l. Bottom panel: Number of clumps per galaxy with p^fp_l < 0.85.

Figure 16. Left panel: Clump flux in the g, r and i bands versus estimated clump false-positive probability p^fp_l. Right panel: Clump flux ratios in the g, r and i bands versus estimated clump p^fp_l. In both panels, the fluxes in each band are scaled in the same way as the corresponding bands of the subject images (see subsection 2.2) to better reflect the data that volunteers actually see. On average, clumps with low p^fp_l

Figure 17. Fraction of subjects retired for different reasons versus the number of volunteers per subject.

Figure 18. Six curated but representative examples of subject images that show agreements and disagreements between volunteers and experts. Features labeled as clumps by volunteers but ignored by experts are highlighted by white boxes. Red boxes highlight features that were annotated by both experts and volunteers. Red circles highlight features that were annotated by experts but not by volunteers. Volunteers tend to mark fainter features than experts, particularly if those features appear blue in colour. None of the features highlighted in this figure were labeled as "odd" by a majority of the volunteers or the experts who marked them.

Figure D1. Distribution of the difference between the number of clumps detected by our framework and the number detected using the MeanShift algorithm for each subject in the Galaxy Zoo: Clump Scout subject set.
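The MeanShift baseline used for this comparison is available in Scikit-Learn. The bandwidth and the synthetic click coordinates below are illustrative assumptions; they are not the settings used to produce Figure D1.

```python
import numpy as np
from sklearn.cluster import MeanShift

rng = np.random.default_rng(42)

# Synthetic volunteer clicks: two clump locations with several clicks each,
# plus one isolated click that a density-based method treats as its own mode.
clicks = np.vstack([
    rng.normal([40.0, 60.0], 2.0, size=(6, 2)),
    rng.normal([120.0, 80.0], 2.0, size=(5, 2)),
    [[200.0, 20.0]],
])

ms = MeanShift(bandwidth=10.0).fit(clicks)
n_detected = len(ms.cluster_centers_)  # one centre per detected clump
```

Unlike the framework's label model, MeanShift has no notion of per-volunteer skill or per-clump false-positive probability, which is one source of the count differences summarised in Figure D1.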

A single box comprising part of the estimated label for subject i.
p^fp_l: The probability that the lth clump in the estimated label for a subject is a false positive.
p^l_σ: The probability that the Jaccard distance between the lth clump in the estimated label and the corresponding (unknown) true clump location exceeds 0.5.

Figure D2. Examples of clump-hosting galaxies for which our framework detects more clumps than the Scikit-Learn MeanShift algorithm. The first column shows the galaxy images as they were seen by volunteers. The second column overlays all volunteer annotations on a grey-scale image of the same galaxy. The coloured boxes in the third column show the clump locations that our framework identifies. Dashed boxes indicate clumps that were assigned false-positive probabilities p^fp_l > 0.8 by our framework. Finally, the red circles in the fourth column show the clumps detected by the MeanShift algorithm.

Figure D3. Examples of clump-hosting galaxies for which our framework detects fewer clumps than the Scikit-Learn MeanShift algorithm. The images, boxes and circles shown in the various columns have the same meaning as in Figure D2.

Table 1. Framework hyper-parameter values used to process the Galaxy Zoo: Clump Scout dataset.

Table 2. Parameters used to determine subject retirement and compute overall subject risk.

Evolution of volunteer skill parameter statistics versus the number of subjects inspected. The top panel shows the distribution of the number of volunteers who have inspected at least as many subjects as indicated by the upper boundary of each bin.