ESQmodel: biologically informed evaluation of 2-D cell segmentation quality in multiplexed tissue images

Abstract
Motivation: Single-cell segmentation is critical in the processing of spatial omics data to accurately perform cell type identification and analyze spatial expression patterns. Segmentation methods often rely on semi-supervised annotation or labeled training data, both of which are highly dependent on user expertise. To ensure the quality of segmentation, current evaluation strategies quantify accuracy by assessing cellular masks or through iterative inspection by pathologists. While these strategies each address either the statistical or the biological aspects of segmentation, a unified approach to evaluating segmentation accuracy is lacking.
Results: In this article, we present ESQmodel, a Bayesian probabilistic method to evaluate single-cell segmentation using expression data. Taking as input the cellular data extracted from segmentation and a prior belief of cellular composition, ESQmodel computes per-cell entropy to assess segmentation quality by how consistently cellular expression profiles match cell type expectations.
Availability and implementation: Source code is available on GitHub at: https://github.com/Roth-Lab/ESQmodel.
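As a rough illustration of the per-cell entropy idea, the sketch below computes the Shannon entropy of a vector of cell-type probabilities for a single cell. It assumes that such a probability vector is already available for each cell; how ESQmodel derives these probabilities is not described in this section, and the function name `cell_entropy` and the example vectors are hypothetical rather than taken from the ESQmodel source code.

```python
import math


def cell_entropy(posterior, base=2.0):
    """Shannon entropy of one cell's cell-type probability vector.

    `posterior` is a sequence of non-negative weights over cell types
    (an assumed input, not ESQmodel's actual data structure).
    """
    total = sum(posterior)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Normalise so the weights form a proper probability distribution.
    probs = [p / total for p in posterior]
    # Zero-probability terms contribute nothing to the entropy.
    return -sum(p * math.log(p, base) for p in probs if p > 0)


# Hypothetical cells: one matching a single cell type, one ambiguous.
confident = [0.97, 0.01, 0.01, 0.01]
ambiguous = [0.25, 0.25, 0.25, 0.25]

print(cell_entropy(confident), cell_entropy(ambiguous))
```

Under this reading, a cell whose expression profile fits one expected cell type yields low entropy, whereas a cell whose profile is spread evenly across types, as might happen when two cells are merged, yields entropy close to the maximum of log2(number of cell types).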

The pipeline uses a combination of CellProfiler [7] image processing scripts and a user-created Ilastik image segmentation project. The CellProfiler scripts can be run as given, with small modifications to parameters such as the number of markers. The procedure for setting the parameters is detailed in the README file of the repository of the modified pipeline. The Ilastik segmentation project needs to be created by the user. The procedure for segmenting the cells depends on the markers available in the panel.
We detail the markers used for each dataset below.
• METABRIC breast cancer dataset (BC): We used Ir191 and Ir193 DNA intercalators to identify the nucleus. Cytokeratins (CK5, CK7 and panCK) were used to identify the cytoplasmic regions.
• Classical Hodgkin's lymphoma dataset (CHL): We used Ir191 and Ir193 DNA intercalators to identify the nucleus. For membrane segmentation, we used the membrane markers from the IMC cell segmentation kit (TIS-00001) from Fluidigm. CD30 was used to identify Hodgkin and Reed-Sternberg cells.
• Reactive lymph node dataset (RLN) and human tonsil dataset: We used Ir191 and Ir193 DNA intercalators to identify the nucleus. For membrane segmentation, we used the membrane markers from the IMC cell segmentation kit (TIS-00001) from Fluidigm.
To introduce the different segmentation errors, we used different annotation strategies to train Ilastik to perform erroneous segmentation. For merge segmentation, we merged two nuclei together and filled in the cytoplasm that surrounds the two nuclei. For partial segmentation, we filled in only parts of the nucleus and drew the cytoplasm as a thin layer outside of the nucleus. For split segmentation, we split the nucleus in half.

Watershed Segmentation Settings
We ran the watershed algorithm for cell detection in QuPath [8]. As a pre-processing step, the IMC converter [9] was used to convert MCD files to OME-TIFFs. Cell detection relies on a nuclear stain for each dataset. For all datasets, we used the Ir193 DNA intercalator. We set the minimum area to 5 µm², the cell expansion to 2 µm, and the intensity threshold to 5. These parameters were kept consistent across all images.
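For reference, the three reported settings can be collected into a small parameter record, sketched below. The key names mirror those that, to our understanding, QuPath's built-in watershed cell detection accepts in its JSON argument string; they and the channel name "Ir193" are assumptions that should be checked against the QuPath version in use. Parameters not reported in the text are deliberately omitted rather than guessed.

```python
import json

# Settings stated in the text. Key names are assumed to match QuPath's
# watershed cell detection parameters; verify them before use.
watershed_params = {
    "detectionImage": "Ir193",    # assumed channel name for the nuclear stain
    "minAreaMicrons": 5.0,        # minimum area, in µm²
    "threshold": 5.0,             # intensity threshold
    "cellExpansionMicrons": 2.0,  # cell expansion, in µm
}

# Serialise to a JSON string, the form a QuPath detection script expects
# for its parameter argument.
params_json = json.dumps(watershed_params)
print(params_json)
```

Keeping the settings in a single record makes it straightforward to apply identical parameters to every image, as was done here.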

StarDist Segmentation Settings
We also employed StarDist [10] for cell detection in QuPath. As a pre-processing step, the IMC converter was used to convert MCD files to OME-TIFFs. Like watershed, StarDist depends only on a nuclear stain, and we again used the Ir193 DNA intercalator for each dataset. For our experiments, we used the single-channel pre-trained models provided by the StarDist developers: dsb2018_heavy_augment.pb and dsb2018_paper.pb [11]. We refer to the model that performed better on our datasets as regular StarDist, and to StarDist run with the alternative, poorer-performing model as StarDist-Poor.

Cellpose Segmentation Settings
We also used Cellpose [12] for cell detection in QuPath. OME-TIFFs were used directly in QuPath; the procedure is similar to that for StarDist but requires the QuPath Cellpose/Omnipose extension (https://github.com/BIOP/qupath-extension-cellpose). For our experiments, we used the default settings in the detection script.

DeepCell Segmentation Settings
We also used DeepCell Mesmer [13] for cell detection in QuPath. OME-TIFFs were used directly in QuPath; the procedure is similar to that for StarDist but requires an ImageJ plugin for interacting with the DeepCell Kiosk (https://github.com/vanvalenlab/kiosk-imageJ-plugin). For our experiments, we preprocessed all TIFFs to generate new TIFFs containing only the nuclear stain and a cytoplasmic marker, to satisfy the input requirements of Mesmer. These TIFFs were used for cell boundary detection in ImageJ, and the resulting overlay was transferred back to the original TIFFs.

Fig. 1 Performance on simulated data. Dumbbell plots of the change in average entropy between well-segmented and three cases of erroneously segmented expression data are shown. Two sets of data were created with (a) 25% and (b) 50% of the total number of cells being erroneously segmented. Colors represent different segmentation datasets.

Fig. 2 (a) Correlation plot between the published annotation and the in-house annotation of the METABRIC IMC dataset. (b) Correlation plot between the number of edges per cell and cellular entropy.

Fig. 4 Comparison of different segmentation methods on tonsil imaged across different spatial imaging platforms. Line plots of the average entropy for each ROI under four different segmentation conditions.

Fig. 5 Distribution of average entropy per cluster count of the METABRIC IMC dataset. Each dot represents a segmented and processed IMC image. The cluster counts are assigned to each image as annotated in the original publication data. An increase in color gradient indicates greater average entropy. An increase in the size of dots indicates greater cell count in the images.

Table 2. Pairwise comparison of segmentation performance of three representative methods as well as controls on two methods. All p-values are computed using the Student's t test. A p-value < 0.05 is considered significant.
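The test statistic behind such a pairwise comparison can be sketched as follows. This is a minimal illustration that assumes an unpaired, two-sample Student's t test with pooled variance; the caption does not state whether the comparisons were paired or unpaired. The function name and the entropy values are hypothetical, invented purely for demonstration. Converting the statistic into a p-value additionally requires the cumulative distribution of the t distribution, which a statistics library would provide.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Two-sample Student's t statistic with pooled variance.

    Returns (t, degrees_of_freedom). Assumes independent samples with
    equal variances, which is an assumption of this sketch.
    """
    n_a, n_b = len(a), len(b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    df = n_a + n_b - 2
    # Pool the two sample variances, weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    std_err = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean(a) - mean(b)) / std_err, df


# Made-up average entropies per image for two segmentation methods.
method_a = [0.42, 0.38, 0.45, 0.40]
method_b = [0.61, 0.58, 0.66, 0.59]

t_stat, dof = pooled_t_statistic(method_a, method_b)
print(t_stat, dof)
```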

Table 3. Comparison of various existing segmentation metrics with entropy from ESQmodel. Spearman's rank correlation coefficient was calculated for each pair of metrics across tissue types.
Supplementary Table 4. Scores of entropy from ESQmodel and various existing