Motivation: Neutrophil extracellular traps (NETs) are believed to be essential in controlling several bacterial pathogens. Quantification of NETs in vitro is an important tool in studies aiming to clarify the biological and chemical factors contributing to NET production, stabilization and degradation. This estimation can be performed on the basis of fluorescent microscopy images using appropriate labelings. In this context, it is desirable to automate the analysis to eliminate both the tedious process of manual annotation and possible operator-specific biases.
Results: We propose a framework for the automated determination of NET content, based on visually annotated images which are used to train a supervised machine-learning method. We derive several methods in this framework. The best results are obtained by combining these into a single prediction. The overall Q2 of the combined method is 93%. By having two experts label part of the image set, we were able to compare the performance of the algorithms to the human interoperator variability. We find that the two operators exhibited a very high correlation on their overall assessment of the NET coverage area in the images (R2 is 97%), although there were consistent differences in labeling at pixel level (Q2, which unlike R2 does not correct for additive and multiplicative biases, was only 89%).
Availability and implementation: Open source software (under the MIT license) is available at https://github.com/luispedro/Coelho2015_NetsDetermination for both reproducibility and application to new data.
Supplementary information:Supplementary data are available at Bioinformatics online.
Neutrophils are important effectors of the innate immune system in mammals, constituting the first line of defense against many microbial pathogens. Upon tissue injury and infection, neutrophils circulating in the blood are recruited to the site of inflammation or infection, contributing to pathogen clearance through phagocytosis, production of highly toxic reactive oxygen species and release of granule proteins with antimicrobial activity (Mayadas et al., 2014). In recent years, it has been shown that some inflammatory mediators, as well as a wide range of microbes including bacteria, fungi and protozoa, can stimulate neutrophils to undergo a distinctive form of cell death designated NETosis (Yipp and Kubes, 2013). This process leads to the extracellular release of fibers of DNA associated with histones, granule proteins and peptides—neutrophil extracellular traps (NETs). These are capable of entrapping microbial pathogens and mediate their extracellular killing. It is also becoming increasingly clear that NET formation is linked to multiple neutrophil-mediated pathologies and that it is implicated in vascular injury and thrombosis (Mayadas et al., 2014). Since the first identification of NETs by Brinkmann et al. (2004), this has become an area of active research, with many studies relying on the in vitro production and quantification of NETs as means to further understand the mechanisms involved in their formation, as well as their importance in pathogen clearance and in the development of multiple diseases (Buchanan et al., 2006; Kessenbrock et al., 2009; Marin-Esteban et al., 2012; Neumann et al., 2014; Wartha et al., 2007; Yost et al., 2009).
The spectrofluorometric quantification of NET content based on DNA release using DNA intercalating dyes has been frequently used. However, this can be confounded by the necrotic release of DNA, and by the influence of peptides present in NETs, such as LL-37, which can affect the binding of DNA to such dyes (Neumann et al., 2014). The quantification of the area covered by NETs on fluorescent microscopy images is a suitable alternative, but visual quantification, besides being tedious and labor-intensive, has reproducibility problems and can present interoperator differences. In order to avoid these pitfalls, an automatic quantification method is needed. A method for the semiautomatic quantification of in vitro NET formation in microscopy images has been recently proposed, based on the calculation of the percentage of neutrophils that are under NETosis, corresponding mostly to a cell death program that leads to chromatin decondensation and NET formation (Brinkmann et al., 2013). This method is useful for evaluating NET induction by different stimuli, but is not suitable for NET degradation experiments. In these experiments, NET production is chemically induced, so that virtually all neutrophils are activated, and the ability of degrading NETs is then evaluated based on the quantification of the DNA fibers.
Preliminary analysis showed that threshold-based methods, which Brinkmann et al. (2013) successfully used to detect NETs, were not able to quantify the area of the slide that was covered in NETs (which requires a more precise quantification).
We thus developed a more sophisticated method based on a supervised machine learning approach that has shown good results in other areas of bioimage informatics (Boland and Murphy, 2001; Boland et al., 1998; Conrad and Gerlich, 2010; Glory and Murphy, 2007; Loo et al., 2007; Nanni et al., 2010) including (most similar to our work) in the segmentation of fluorescent images (Nattkemper et al., 2002). These methods have even been reported to outperform human evaluation on some tasks (MacLeod et al., 2010; Nattkemper et al., 2003). In particular, we used a supervised learning approach whereby we learn a regression model to determine NET fraction in local regions based on numeric features computed from the pixels in that region. Finally, all the regions in an image are aggregated for a combined prediction. We present three different implementations of this generic framework based on particular choices for splitting the image into regions and different feature sets.
We also show that we can combine the outputs of the different implementations to obtain a single combined result, which outperforms all the individual predictions, as evaluated by cross validation.
Part of our data was annotated by two different experts enabling us to quantify interoperator variation. We observe that the two experts have large disagreements on a pixel-by-pixel level, with smaller disagreements on the global appreciation of the image. However, we also document a systematic bias between the two experts, highlighting the potential pitfalls of relying on human visual classification.
2.1 Data acquisition
Human blood-derived neutrophils were stimulated with 25 nm of phorbol 12-myristate 13-acetate (PMA) (Sigma) for 4 h to induce nearly 100% NET formation, as previously described (von Köckritz-Blickwede et al., 2010). The cells were then incubated for 1 h at 37°C and 5% CO2 with purified bacterial nucleases or with the supernatant of bacterial cultures for NET degradation. NET samples were immunostained using a mouse monoclonal anti-H2A-H2B-DNA complex (Losman et al., 1992) followed by a secondary goat-anti-mouse Alexa-Fluor-488 antibody and embedded in ProlongGold + DAPI (Invitrogen) to counterstain DNA (von Köckritz-Blickwede et al., 2010). Images were acquired using a Leica TCS SP5 confocal microscope with a HCX PL APO 40 × 0.75–1.25 oil immersion objective or a Zeis LSM 710 confocal microscope with an EC Plan-Neofluar 40 × 1.30 oil immersion objective. For each preparation, randomly selected images were acquired. For most preparations, at least three images were collected.
2.2 Data labeling
NETs were labeled visually using ImageJ (Schneider et al., 2012). A total of 88 fields were labeled from 30 different slides. A subset of the data (37 fields) was labeled independently by two experts, that were blinded to each other’s classification. Figure 1 shows an example image, including manual segmentation by both operators.
2.3 Method evaluation
The result of our algorithm is an estimate of the fraction of the image area that is covered by NETs, which must be compared with the human-annotated value f. For evaluation, we used the Q2 measure, defined by Wold (1982) as:
We used a cross-validation scheme whereby all the fields from the same slide were held-out for testing, while the fields from the other slides are used for training.
Software was implemented in Python, using the mahotas library for image processing and feature computation (Coelho, 2013) and scikit learn for regression (Pedregosa et al., 2011). The whole dataset, code, and the learned models are available for download under the MIT license at https://github.com/luispedro/Coelho2015_NetsDetermination. A tutorial on how to apply the methods to other data is available at that location as well. In the case of application to images with very similar characteristics (same tagging strategy and comparable acquisition apparatus and settings) to our own, the model may be applicable directly. For other data, it may be more appropriate to fit a new model with our software.
3.1 A framework for automated NET quantification with three proposed implementations
We present a generic framework for NET quantification, based on human-labeled data. As input, this framework takes in a set of images, consisting of DNA and histone channels, and a human labeling of these images where regions of interest, namely NET areas, have been annotated. The framework will then produce a model which can be applied to new images to produce an estimate of the NET area in those images. We describe several methods which fall into this framework and finally we demonstrate that combining the methods obtains a better result than any single method.
The framework is illustrated in Figure 2 and consists of the following steps:
partition the image into regions and compute numeric descriptors (features) in each region,
based on the human-labeled data, learn a regression from features to estimate the fraction of area covered by NETs in each region. Estimates for regions of the same image are aggregated to form a raw image estimate.
Finally, apply a linear correction to this raw estimate to correct for biases in the previous steps.
The first step is motivated by the fact that NETs are present in discrete regions in the images and not throughout the image. In order to characterize these images by machine-learning methods, we compute numeric features from each region. Previous work had used similar techniques in cell segmentation problems (Chen et al., 2011; Sommer et al., 2011). We used regression to estimate the fraction of regional NET coverage, instead of classifying the regions into NET containing versus NET empty, because preliminary results showed that developing a segmentation method to obtain good agreement with manually drawn boundaries would be nontrivial and an unnecessarily hard step. Partitioning of the image can be achieved by oversegmentation or be defined geometrically (e.g. using a regular grid).
This function is estimated from the training data. In our implementation, we used random forest regression (Breiman, 2001) for this step (using 100 trees). This nonlinear method has the advantage of providing out-of-band estimates for the training data, which are comparable to what would have been obtained by cross-validation. These estimates are used below.
The parameters β and α were learned by least-squares minimization on the training data using the out-of-band estimates for . This step corrects biases in the first regression (Wolpert, 1992) (in the results section, we present the empirical evidence for the necessity of this step). This differs from the generic stacking estimator which uses estimates from cross-validation, but when using random forests, out-of-band estimates are similar to cross-validated estimates and can be obtained at a very low computational cost (Breiman, 2001).
Finally, the fraction estimates are clipped to values between zero and one.
In what follows, we describe three different implementations of this framework, which differ in the way in which the image is broken up and the features computed.
3.1.1 Oversegmented regions
We oversegmented the Alexa channel with seeded watershed initialized with regularly spaced seeds. For each region, for both the Alexa and the DAPI channel, we computed the following features (a total of 77 features were used):
the size of the region in pixels,
Haralick texture features for each of the channels (Haralick et al., 1973),
Haralick texture features of the Sobel filtered version of each of the channels,
Pearson correlation between channel intensities,
The inclusion of Sobel filtering was motivated by the observation that NETs are fibrous. Therefore, we reasoned that an edge-enhancing filter would be appropriate. The other features were included as they have been previously been shown to produce good results in similar bioimage problems (Newberg et al., 2009).
Two overlap features were used: the fraction of pixels above threshold in either channel and the fraction above threshold in both channels. The mean intensity value was used as the threshold.
Regions that were smaller than 16 pixels were removed from consideration as the statistics computed on them can be unreliable due to the small number of pixels. Additionally, this step reduces the size of the input of the regression module, speeding up the process. The removed regions represent, on average, less than 4% of the total area of the image, therefore the impact of excluding these regions is small.
3.1.2 Regular sampling of filtered images (pixel regression)
Similar to what is done in Ilastik, a machine-learning based segmentation framework (Sommer et al., 2011), we compute several filtered versions of the image. Then, we sample from these filtered images on a regular grid (each sample point is 8 pixels away from the next and points close to the border are discarded. This results in 3844 samples from each 512 × 512 image). In particular, from each channel, we computed the following filtered versions of the image (in total, 18 features are gathered at each sampling point):
Gaussian filtered images with different standard deviations (σ = 4, 8, 12 and 16)
difference of Gaussians (between σ = 16 and σ = 8)
thresholded image using both the mean pixel value and mean pixel value plus two standard deviations as thresholds
a version of the image where pixel values were converted to z-scores (i.e. mean subtracted and normalized by the estimated standard deviation).
From these images, pixel values were sampled in a regular grid, forming a feature vector . To obtain an estimate of the local NET fraction at each pixel, we considered the human-labeled NET image as a binary image and blurred it with a Gaussian filter (σ = 8 pixels). Thus, at each pixel, we have a weighted average of the NET fraction in the local region. These values are sampled at the same locations as the features to form the regression labels.
3.1.3 Dense local features
Local features have recently been shown to perform well in other bioimage informatics machine learning problems (Coelho et al., 2013; Liscovitch et al., 2013), including segmentation (Song et al., 2013). In our work, we used dense sampling (Nowak et al., 2006), whereby feature descriptors are computed in a regular grid (as above, we used an 8 pixel separation between sampling points, while discarding areas close to the image border). We compute SURF features (Bay et al., 2008) computed at four different scales (namely setting 1, 2, 4 and 8 pixels as the initial scale).
As in the case of pixel regression, we associated a local NET fraction with each feature vector by smoothing the binary map with a Gaussian filter.
3.2 A combination of methods outperforms any single method
Figure 3 shows the errors committed by the individual models (errors estimated by cross-validation). We can see that the local-feature based methods show correlated errors, but this is not the case for the other methods. These data suggest that a better prediction could be obtained by combining the methods. We thus combine all methods by averaging the raw predictions of the methods before applying Equation (4).
Figure 4 summarizes the results of this study. Interestingly, the performance of all the methods is similar, with the exception of SURF features using the largest scale (8 pixels).
The results also clearly demonstrate the value of the linear correction. SURF features with a scale of 1 was the worst method before the correction, but the best (tied with pixel regression) after correction. This is explained by the fact that this model had a clear multiplicative bias (see Supplementary Fig. S1). The overall Q2 obtained by combining the methods was 93% (compared with 90% for the best performing single prediction).
In Figure 5, we plot the estimated results versus the underlying gold standard of human classification (the estimates were obtained by cross-validation). We can see that the estimates are predictive of the underlying value throughout the whole range of possible values.
3.3 Per pixel agreement between operators is low, but image-level agreement is high
Part of the data was independently labeled by two human operators (The operators were CP and AN.). Thus, we are able to quantify the amount of variance and bias in human annotation.
We found that, on a pixel-by-pixel level, the two labelings presented substantial differences. In fact, the Pearson correlation between the two binary pixel assignments was 0.68 (corresponding to 47% explained variance). Reporting explained variance of binary variables may be misleading (Abelson, 1985), but the Jaccard Index is only, on average, 0.67 (this measure ranges from 0 to 1, with 1 denoting perfect agreement). See Figure 1 for a comparison of the two operators on a single image.
However, the fraction of the area assigned to NETs by each of the operators showed a much higher correlation (R2 is 97%). At the same time, we observed a systematic difference in labeling since one of the experts consistently reported a lower value. Reflecting this, the Q2 measure (which penalizes for this additive error) is only 89% (see Fig. 6), slightly below the value obtained by the automated prediction.
3.4 Cross-validation per slide is a stricter measure of performance than cross-validation per field
As described above, several fields were acquired from each microscope slide. When presenting the results above, we used a cross-validation scheme whereby we held out a slide at a time for testing and trained on the rest of the data. To support our choice, we also considered an alternative schedule whereby a single field is left out at a time. Thus, each field is evaluated by a model which is obtained from training data which includes other fields from the same slide.
The results show that such approach would lead to over-inflated performance measures. For instance, when the combined method is evaluated in this manner, we obtain a Q2 value of 95% (higher than the 93% we obtained in stricter testing approach). Similarly small but positive differences are observed when evaluating all the individual methods (see Supplementary Table S1).
We present a framework for automated quantification of NET area in fluorescent microscopy images. Based on this generic framework, we implemented three specific methods. Follow-up work may make different choices of implementation and arrive at different methods. The use of different feature sets, different oversegmentation methods or a different regression methodology are all potential avenues to explore whilst staying within the generic framework we propose in this work.
The final output of our method is a single fraction measurement which does not directly relate back to pixel assignments. While the ideal method would compute a perfect segmentation, from which the NET fraction would be trivially estimated, it is not necessary to obtain this segmented image in order to solve a biologically relevant question. A single fraction estimate can provide enough information for the user. This is in accordance with work in computer vision which had previously shown that to get accurate cell counts, it is not necessary to completely segment individual cells (Lempitsky and Zisserman, 2010).
The fact that we only aimed to obtain single overall estimate from each image, instead of pixel assignments, allowed us to build a stacked estimator from the different methods to obtain a final prediction, even though these do not share the same partitions of the image (Wolpert, 1992). On our data, the gains from combining different predictions is at least partially explained by the fact that the errors of the different methods are only weakly correlated.
Previous work by Coelho et al. (2013) had demonstrated the need for careful delineation of data for cross-validation in the context of subcellular location determination. In this work, we reinforce the point in a different context, namely automated determination of NET content, by demonstrating that when different images from the same sample are used for training and testing, the measured performance is increased relative to validation from data from different samples. This is likely due to overfitting to match very particular aspects of each slide as opposed to generalizable features of NETs.
We also empirically investigated and measured the agreement between different human evaluators. The results require a nuanced interpretation. The pixel-level agreement is very low, but the two experts have very high correlation in their evaluations of the images. This is another instance where pixel-level measures are misleading, a phenomenon we and others had already investigated in the case of nuclear segmentation (Coelho et al., 2009; Yang-Mao et al., 2008). The general conclusion is that for evaluation of automated techniques, the evaluation measure should be as close as possible to the underlying biological goal (in our case, the fraction of the sample covered by NETs) and not an intermediate result (the exact locations of such areas).
The high correlation between human operators was hiding a systematic bias in quantitative results. One of the operators systematically reported a lower value. The automated method was more consistent with the data that it was trained on, resulting in a lower Q2 value.
Although the samples were prepared using the same protocol, the image acquisition equipment and parameters were not the same (see Section 2). Despite these differences, the proposed algorithm was able to classify equally well images obtained with either condition, suggesting that it is insensitive to any potential specificities introduced.
The automated method has a lower cost, perfect reproducibility, and results in a faster processing of the images than human annotation. Automated analysis opens up the possibility of multistudy comparisons by consortia of independent laboratories, which can be trained with their data and assessed by the method proposed. This can result in more robust training sets that can accommodate more significant differences in sample preparation protocol and data acquisition conditions. In its current form, the robustness of the proposed automated process, when combined with automated image acquisition, can facilitate the analysis in a single laboratory of numbers of fields and samples that would be prohibitive using manual annotation. Such large-scale analysis will certainly improve confidence in the determination of NET area, providing more accurate and reproducible analyses than is currently available from human annotation. Our data is available for others in the community.
L.P.C. was supported by Fundação para a Ciência e a Tecnologia (grant PTDC/SAU-GMG/115652/2008). M.v.K.-B. was supported in part by DFG grant KO 3552/4–1. A.N. was supported by a fellowship of the Akademie für Tiergesundheit (AfT) and the PhD programme ‘Animal and Zoonotic Infections’ of the University of Veterinary Medicine Hannover, Germany. A.F. received a Short-Term Fellowship 2011 by the Federation of European Biochemical Societies (FEBS) and an ESCMID Travel Grant for Training in a Foreign Institution 2011 by the European Society of Clinical Microbiology and Infectious Diseases (ESCMID).
Conflict of Interest: none declared.