An image dataset related to automated macrophage detection in immunostained lymphoma tissue samples

Abstract

Background
We present an image dataset related to automated segmentation and counting of macrophages in diffuse large B-cell lymphoma (DLBCL) tissue sections. For the classification of DLBCL subtypes, as well as for providing a prognosis of the clinical outcome, the analysis of the tumor microenvironment and, particularly, of the different types and functions of tumor-associated macrophages is indispensable. Until now, however, most information about macrophages has been obtained either in a completely indirect way by gene expression profiling or by manual counts in immunohistochemically (IHC) fluorescence-stained tissue samples, while automated recognition of single IHC stained macrophages remains a difficult task. In an accompanying publication, a reliable approach to this problem has been established, and a large set of related images has been generated and analyzed.

Results
Provided image data comprise (i) fluorescence microscopy images of 44 multiple immunohistostained DLBCL tumor subregions, captured at 4 channels corresponding to CD14, CD163, Pax5, and DAPI; (ii) "cartoon-like" total variation-filtered versions of these images, generated by Rudin-Osher-Fatemi denoising; (iii) an automatically generated mask of the evaluation subregion, based on information from the DAPI channel; and (iv) automatically generated segmentation masks for macrophages (using information from the CD14 and CD163 channels), B-cells (using information from the Pax5 channel), and all cell nuclei (using information from the DAPI channel).

Conclusions
A large set of IHC stained DLBCL specimens is provided together with segmentation masks for different cell populations generated by a reference method for automated image analysis, thus featuring considerable reuse potential.


Data Description

Context
We present an image dataset generated as a part of an accompanying publication, which is concerned with method development and comparison for automated segmentation and counting of macrophages in diffuse large B-cell lymphoma (DLBCL) tissue sections [1]. DLBCL is an aggressive cancer disease which is characterized by a large heterogeneity of pathological, clinical and biological features [2]. Therefore, a crucial step for the classification of DLBCL subtypes as well as for providing a prognosis of the clinical outcome is the analysis of the tumor microenvironment in terms of counts, local distributions and functions of the different cell populations and, particularly, of the tumor-associated macrophages occurring there [3].
Until now, most information about macrophages has been obtained either by gene expression profiling [4] or by manual counts in immunohistochemically (IHC) stained tissue microarrays or high-power fields, thus either gathering information in a completely indirect way or accepting extreme subsampling rates [5]. A reliable approach for fully automated segmentation, identification and counting of IHC stained macrophages within whole tissue slides is presented in [1].
Our dataset contains monochrome fluorescence microscopy images of 44 DLBCL tissue samples wherein different macrophage populations (using antibodies against CD14 and CD163) and B-cells (using an antibody against Pax5) as well as all cell nuclei (using DAPI) have been stained and imaged at different wavelengths. Further, we supply processed images, comprising "cartoon-like" TV-filtered images (generated by Rudin-Osher-Fatemi filtering) as well as results of the automated macrophage segmentation. For this publication, we completed these data by automated segmentation of B-cells and the cell nuclei.

a) Preparation and staining of DLBCL tissue.
From the files of the Lymph Node Registry Kiel, 44 DLBCL biopsy specimens have been selected. For every specimen, a slice of 2 µm thickness has been obtained from formalin-fixed paraffin-embedded tissue. In order to detect specific macrophages and their relation to B-cells, a triple IHC staining has been done, using primary antibodies against CD14 (Cell Marque, Cat# 114R-14, RRID: AB_2827391; 1:10), CD163 (Novus, Cat# NB110-59935, RRID: AB_892323; 1:100) and Pax5 (Santa Cruz Biotechnology, Cat# sc-1974, RRID: AB_2159678; 1:100), labelled with donkey anti-rabbit Alexa 488, donkey anti-mouse Alexa 555 and donkey anti-goat Alexa 647 (all from Invitrogen, Thermo Fisher Scientific, Waltham, MA, USA; 1:100) as secondary antibodies. Subsequently, the slices have been incubated with DAPI (Invitrogen, Thermo Fisher Scientific, Waltham, MA, USA; 1:5000) and cover-slipped with mounting medium. Use of tissue was in accordance with the guidelines of the internal review board of the Medical Faculty of the Christian-Albrechts-University Kiel, Germany (No. 447/10).

b) Selection of tumor subregions and image acquisition.
Within every tissue sample, the tumor area was defined and marked by a pathologist based on inspection of conventional Haematoxylin-Eosin (HE) staining in a neighboring reference slice. Subsequently, within the IHC stained slice, a rectangular subregion of the tumor area has been selected, taking care to ensure acceptable tissue and staining quality. The maximum size of a tumor subregion is 10 mm².
Images of tumor subregions within the IHC stained slides have been captured by a Hamamatsu Nanozoomer 2.0 RS slide scanner (Hamamatsu Photonics, Ammersee, Germany) with 20× magnification at four wavelengths, resulting in single images for the CD14, CD163, Pax5 and DAPI channels, respectively, which were saved in .ndpi output format with default settings as used in clinical trial routine. Note that, at this point, moderate built-in compression by the imaging device was accepted. Single-channel raw images have been converted into

c) Image processing.
For every tile, the segmentation method from [1] has been applied to the CD14, CD163, Pax5 and DAPI channel images, resulting in ROF-filtered images (saved as type "cartoon"), a mask for the evaluation subregion within the tile, indicating whether tissue is present at all, as inferred from DAPI channel information (saved as type "evalmask"), and segmentations of macrophages within the CD14 and CD163 channels (saved as type "segment"). Due to the large inhomogeneity of IHC staining, even across a single target macrophage, we provide two further masks containing the convex hulls of the segmented features instead of the features themselves (saved as type "convhull"). The segmentation masks for double-stained macrophages are saved as type "multiple". For a general description of the ROF filter based segmentation method, we refer to [1].
Here, we describe in more detail the generation of segmentations for the Pax5 and DAPI channels, which are new in this paper. Let us recall the notation from [1], where the indices $i$ and $j$ count the current intensity threshold and the features to be inspected at this stage, and $s(F_j)$, $c(F_j)$ and $r(F_j)$ denote the size of a feature $F_j$ itself, the size of its convex hull and the ratio of the principal axes' lengths of the smallest ellipse covering the feature, respectively. $s_{\min}$, $s_{\max}$, $c_{\max}$ and $r_{\max}$ denote the minimal and maximal feature size (in px), the maximal area excess of the convex hull (in percent) and the maximal ratio of axes, respectively.
In order to obtain a segmentation of the DAPI channel, the ROF-filtered image has been further subjected to a local Narendra-Fitch contrast enhancement [7]

$$p(k,l)_{\text{enhanced}} = m(k,l) + \frac{c}{\sigma(k,l)} \, \bigl( p(k,l)_{\text{original}} - m(k,l) \bigr)$$

where $c > 0$ is a weight parameter and $m(k,l)$, $\sigma(k,l)$ denote the mean and standard deviation of the intensities within a subregion centered at the pixel $p(k,l)_{\text{original}}$, respectively. We used $c = 0.75$ and a square subregion of 11 × 11 px size. Then, in a first run, Steps 3-10 of the ROF filter based segmentation have been applied, using the bounds $s_{\min} = 60$ and $s_{\max} = 119$ for the feature size but modifying geometrical rule No. 3) for feature classification from [1] as follows: If $s_{\min} \le s(F_j) \le s_{\max}$, then test whether the feature satisfies both of the criteria 3b) $r(F_j) \le r_{\max}$ (the feature is not too elongated) and 3d) $c(F_j)/s(F_j) \le 1 + c_{\max}/100$ (the deviation from circular shape is bounded from above). If yes, save the feature $F_j$ into the output mask, interpreting it as a cell nucleus, and mask it in $I^{(3)}(i)$. If not, then neglect the feature and mask it in $I^{(3)}(i)$ as well. Here, we used the parameter values $r_{\max} = 2.5$ and $c_{\max} = 150$. In a second run, Steps 3-10 of the ROF filter based segmentation have been repeated with the parameter settings $s_{\min} = 120$ and $s_{\max} = 180$, again using the described modification of rule No. 3) but saving into the output mask only those features which are completely disjoint from the output of the first run. Finally, the results of both runs have been combined into a single mask (saved as type "segment"). Within a further result mask of type "convhull", the convex hulls of the detected features have been stored.
For the segmentation of the B-cells, the ROF-filtered image of the Pax5 channel has been subjected to a moderate Narendra-Fitch contrast enhancement as well, using the parameter $c = 0.1$ and a square subregion of 15 × 15 px size.
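The modified rule No. 3) and the two-run scheme for the DAPI channel can be illustrated by a minimal Python sketch. It is not the implementation used in [1]: the `Feature` container and the function names are hypothetical, the candidate features of each run are assumed to be given (thresholding and masking in Steps 3-10 are not reproduced), and the descriptors $c(F_j)$ and $r(F_j)$ are assumed to be precomputed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical container for one candidate feature F_j."""
    pixels: frozenset   # (row, col) coordinates belonging to the feature
    hull_size: float    # c(F_j): size of the convex hull, in px
    axis_ratio: float   # r(F_j): ratio of the principal axes of the covering ellipse

    @property
    def size(self):
        """s(F_j): size of the feature itself, in px."""
        return len(self.pixels)


def accept_as_nucleus(f, s_min, s_max, r_max=2.5, c_max=150.0):
    """Modified rule No. 3): size window plus criteria 3b) and 3d)."""
    if not (s_min <= f.size <= s_max):
        return False
    not_too_elongated = f.axis_ratio <= r_max                      # criterion 3b)
    hull_excess_ok = f.hull_size / f.size <= 1.0 + c_max / 100.0   # criterion 3d)
    return not_too_elongated and hull_excess_ok


def two_run_selection(candidates_run1, candidates_run2):
    """Combine both runs; run 2 keeps only features disjoint from run 1 output."""
    first = [f for f in candidates_run1 if accept_as_nucleus(f, 60, 119)]
    covered = set().union(*(f.pixels for f in first))
    second = [
        f for f in candidates_run2
        if accept_as_nucleus(f, 120, 180) and covered.isdisjoint(f.pixels)
    ]
    return first + second
```

Representing a feature as a set of pixel coordinates keeps the disjointness test of the second run a plain set operation; a label image would serve equally well.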
To the contrast-enhanced Pax5 image, Steps 3-10 of the ROF filter based segmentation have been applied, using the bounds $s_{\min} = 80$ and $s_{\max} = 159$ as well as the described modification of rule No. 3) with parameters $r_{\max} = 2.5$ and $c_{\max} = 150$, but saving into the output mask (of type "segment") only those features whose intersection with the convex hull of some cell nucleus, as obtained in the segmentation of the DAPI channel, is nonempty. Thus, numerous artifacts appearing in the Pax5 staining are excluded. Again, the convex hulls of the detected features have been stored within a further mask of type "convhull".

Table 2. It is based on BCL2 staining of tissue slides obtained from the same biopsy specimens as before but not necessarily adjacent to the slides used for the generation of the image data presented here. Staining was microscopically examined and semi-quantitatively scored by an experienced pathologist. Each stained slide was evaluated for the percentage of stained tumor cells by visual estimation in a representative tumor area. The estimated value was graded into the following scores: 0: all cells negative; 1: up to 25% positive cells; 2: 25% to 50% positive cells; 3: 50% to 75% positive cells; 4: over 75% positive cells.
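For readers who wish to reproduce the grading from their own percentage estimates, the score scale can be written as a small function. This is only a sketch: the function name is hypothetical, and since the scale does not say to which class the boundary values 25%, 50% and 75% belong, we assume here that each boundary counts towards the lower score.

```python
def bcl2_score(percent_positive):
    """Map an estimated percentage of BCL2-positive tumor cells to a score 0..4.

    Assumption (not stated in the text): the boundary values 25, 50 and 75
    are assigned to the lower of the two adjacent scores.
    """
    if not 0.0 <= percent_positive <= 100.0:
        raise ValueError("percentage must lie in [0, 100]")
    if percent_positive == 0:
        return 0   # all cells negative
    if percent_positive <= 25:
        return 1   # up to 25% positive cells
    if percent_positive <= 50:
        return 2   # 25% to 50% positive cells
    if percent_positive <= 75:
        return 3   # 50% to 75% positive cells
    return 4       # over 75% positive cells
```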

Dataset structure
Image data are organized by tissue specimens (top-level folders) and tiles (second-level folders), the latter ordered by position.
Top-level folders are named specimen_xx_tile_yy_zz__logfile.txt is provided, containing detailed information about procedures, parameters and results of automated segmentation.
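When browsing the folders programmatically, the logfile name pattern quoted above can be parsed as sketched below. The sketch assumes that `xx`, `yy` and `zz` are numeric fields (specimen number and two tile position indices); this interpretation and all names in the code are assumptions, and the naming of the image files themselves is not covered.

```python
import re
from typing import NamedTuple, Optional

# Pattern for names of the form specimen_xx_tile_yy_zz__logfile.txt,
# assuming that xx, yy and zz consist of digits only.
LOGFILE_RE = re.compile(r"^specimen_(\d+)_tile_(\d+)_(\d+)__logfile\.txt$")


class TileId(NamedTuple):
    specimen: int  # the xx field
    yy: int        # first tile index
    zz: int        # second tile index


def parse_logfile_name(name: str) -> Optional[TileId]:
    """Return the numeric fields of a logfile name, or None if it does not match."""
    match = LOGFILE_RE.match(name)
    if match is None:
        return None
    return TileId(*(int(group) for group in match.groups()))
```

For instance, `parse_logfile_name("specimen_04_tile_02_03__logfile.txt")` yields `TileId(specimen=4, yy=2, zz=3)`.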

Reuse potential
Although there is a vast number of publications concerned with the composition of the tumor microenvironment in various types of lymphoma disease, image datasets of IHC stained cancer tissue are rarely publicly accessible, if at all, cf. the discussion in [9]. Most data generated for the purpose of such analyses are not findable or not even accessible. For example, the Genomic Data Commons Data Portal of the National Cancer Institute [10, 11] currently lists only 48 cases of mature B-cell lymphoma with an image of a HE-stained slide available, while IHC stainings are missing entirely. In this situation, the image dataset presented in this note constitutes a document of interest in itself.
We will outline the most important options for further use of the data. First, it allows for a detailed morphometrical investigation of the imaged macrophages and B-cells with respect to the distribution of geometrical parameters such as size, diameter, perimeter, etc., as well as to overall shape patterns. Second, the data may be used for validation, calibration and comparison of cell segmentation methods (manual, automated) and related software packages, making available a large reference dataset together with the output of a reference method as described in [1]. Note that, for these purposes, it is particularly adequate to use data of routine quality level. Third, the original images as well as the segmentations presented here could be used for the generation of a sufficiently large training set for automated macrophage detection by machine learning methods. Fourth, the data may be used for the study of co-localization and clustering of macrophages and B-cells within lymphoma tissue and cancer microenvironment, employing appropriate methods of point-pattern statistics [12,13]. Finally, the dataset enables a closer study of the double-stained macrophage subpopulation. In order to facilitate possible further processing of the obtained features (e.g. extraction of barycenters, replacement of the features by equally sized circles or squares), not only the masks for the segmented features themselves but also those for their convex hulls are provided.
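As an illustration of such further processing, the following sketch extracts connected features from a binary mask and computes, for each of them, the barycenter and the radius of a circle of equal area. It is a hedged example rather than part of the published pipeline: the mask is assumed to be already loaded as a nested list of 0/1 values, 8-connectivity is an assumption, and all function names are hypothetical.

```python
from collections import deque
import math


def features_from_mask(mask):
    """Return the 8-connected foreground components as lists of (row, col)."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    features = []
    for r0 in range(rows):
        for c0 in range(cols):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            seen[r0][c0] = True
            queue, component = deque([(r0, c0)]), []
            while queue:  # breadth-first flood fill of one component
                r, c = queue.popleft()
                component.append((r, c))
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if (0 <= rr < rows and 0 <= cc < cols
                                and mask[rr][cc] and not seen[rr][cc]):
                            seen[rr][cc] = True
                            queue.append((rr, cc))
            features.append(component)
    return features


def barycenter(pixels):
    """Mean (row, col) position of the pixels of one feature."""
    n = len(pixels)
    if n == 0:
        raise ValueError("feature must contain at least one pixel")
    return (sum(r for r, _ in pixels) / n, sum(c for _, c in pixels) / n)


def equal_area_radius(pixels):
    """Radius (in px) of the circle whose area equals the feature size."""
    return math.sqrt(len(pixels) / math.pi)
```

The list of barycenters obtained in this way is the kind of point pattern on which the co-localization and clustering statistics mentioned above operate.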
To illustrate the described reuse potential, we include a set of composite figures, each combining information from several separate images. Figure 2.A shows an original image at the CD14 channel (greyscale, original contrast-enhanced by factor 3.5 and inverted) with superimposition of the mask of the evaluation subregion, as obtained from the DAPI channel (light blue), and the segmentation of the CD14-stained macrophages (olive green). Figure 2.B shows the same tile as imaged at the Pax5 channel (greyscale, original inverted) with superimposition of the cell nuclei segmentation from the DAPI channel (light blue, convex hulls) and the segmentation of the CD163-stained macrophages (dark yellow). In Figure 2.C, for the same tile, both macrophage segmentations (olive green or dark yellow, convex hulls) are combined in order to reveal double-stained parts (light yellow). In Figure 2.D, we superimposed on Figure 2.C the segmentation of B-cells from the Pax5 channel (magenta and grey, convex hulls). Observe that in Figs. 2.B and 2.D, some B-cells are positioned inside macrophages, indicating that they are engulfed by the macrophages for phagocytosis (examples marked by arrows). Clearly, co-localization and clustering patterns as empirically noticeable here must be investigated on a sound basis of statistical methodology.
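The combination of two segmentation masks used for such composites amounts to a pixel-wise logical AND. The sketch below shows this for two equally sized binary masks given as nested lists; it only illustrates the overlay idea and does not reproduce how the masks of type "multiple" were generated in [1]. The function names are hypothetical.

```python
def overlap_mask(mask_a, mask_b):
    """Pixel-wise AND of two binary masks of identical shape."""
    same_shape = len(mask_a) == len(mask_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(mask_a, mask_b)
    )
    if not same_shape:
        raise ValueError("masks must have identical shape")
    return [
        [bool(a) and bool(b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(mask_a, mask_b)
    ]


def count_pixels(mask):
    """Number of foreground pixels in a binary mask."""
    return sum(1 for row in mask for value in row if value)
```

Applied to the convex hull masks of the CD14 and CD163 channels of one tile, `overlap_mask` marks the pixels covered by both stainings, corresponding to the parts highlighted in light yellow in Figure 2.C.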
To improve reusability, BCL2 scores for the biopsy specimens are provided.

Availability of supporting data
All image data are made publicly accessible under the CC0 1.0 license at the Leipzig Health Atlas (LHA) repository [14] and can be reached from the address [15]. Each top-level folder can be downloaded as a .zip file and bears a separate identifier within the repository, e.g. https://health-atlas.de/lha/7YXMMFNPDG-0, see Table 3. Two folders with a total size larger than 1 GB (Nos. 04 and 44) have been split into a pair of files. Snapshots of the datasets are available in the GigaScience GigaDB repository as well [16].

Examples of B-cells positioned inside of macrophages indicated by arrows (the same cells as in Figure 2.B).

Answers to Reviewer's comments
The authors would like to thank both reviewers for their valuable comments, which helped us to improve the quality of the paper.