PS-Net: human perception-guided segmentation network for EM cell membrane

Abstract

Motivation: Cell membrane segmentation in electron microscopy (EM) images is a crucial step in EM image processing. However, while popular approaches have achieved performance comparable to that of humans on low-resolution EM datasets, they have shown limited success when applied to high-resolution EM datasets. The human visual system, on the other hand, displays consistently excellent performance on both low and high resolutions. To better understand this limitation, we conducted eye movement and perceptual consistency experiments. Our data showed that human observers are more sensitive to the structure of the membrane while tolerating misalignment, contrary to commonly used evaluation criteria. Additionally, our results indicated that the human visual system processes images in both global–local and coarse-to-fine manners.

Results: Based on these observations, we propose a computational framework for membrane segmentation that incorporates these characteristics of human perception. This framework includes a novel evaluation metric, the perceptual Hausdorff distance (PHD), and an end-to-end network called the PHD-guided segmentation network (PS-Net) that is trained using adaptively tuned PHD loss functions and a multiscale architecture. Our subjective experiments showed that the PHD metric is more consistent with human perception than other criteria, and our proposed PS-Net outperformed state-of-the-art methods on both low- and high-resolution EM image datasets as well as other natural image datasets.

Availability and implementation: The code and dataset can be found at https://github.com/EmmaSRH/PS-Net.


1 Details of the subjective experiment design.
In this section, we provide more details of the subjective experiment (Sec. 3 of the manuscript).
Before the experiment, the 20 subjects were introduced to the value of cell membrane segmentation for connectivity and to the importance of membrane structure. Then, 10 simple examples were used to familiarize them with the experimental procedure.
The 200 groups of images for the subjective experiment were randomly selected from the segmentation results produced by six segmentation methods (U-Net, SENet, LinkNet, GLNet, CASENet, and U-Net++), and the selection and presentation order of the data during the experiment were randomized.
During the experiment, the 200 groups of images were divided equally into four batches to prevent subjects from experiencing fatigue during evaluation. The subjects were instructed to select the one of the two images (left or right) that they believed most closely resembled the ground truth. If a subject was uncertain, they could select "Difficult to choose". Each batch of images required continuous judgment without interruption.
After the experiment, three concepts were defined to calculate the consistency of a metric with human judgment:
• "valid group": for each group of images, if more than half (10) of the subjects chose the same segmentation (left or right), the group was called a "valid group".
• "human choice": for each "valid group", the image chosen by most people was considered the "human choice".
• "metric choice": for each "valid group", the segmentation, whose score calculated by the metric shows that it is more similar to the ground truth, was considered the "metric choice".
A total of 113 valid groups were identified, and the consistency was defined as N(human choice = metric choice)/N(valid groups).
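The consistency computation described above can be sketched as follows. This is an illustrative reimplementation, not the authors' code; the vote encoding ("L"/"R"/"U" for left, right, and "Difficult to choose") and the function names are assumptions.

```python
def consistency(votes_per_group, metric_pref, majority=10):
    """votes_per_group: one list of votes ("L"/"R"/"U") per group of images.
    metric_pref: per group, the side ("L"/"R") the metric scores closer to
    the ground truth (the "metric choice")."""
    valid = agree = 0
    for votes, pref in zip(votes_per_group, metric_pref):
        left, right = votes.count("L"), votes.count("R")
        # A "valid group" needs more than half (10 of 20) subjects on one side.
        if max(left, right) <= majority:
            continue
        valid += 1
        human = "L" if left > right else "R"   # the "human choice"
        if human == pref:
            agree += 1
    # consistency = N(human choice = metric choice) / N(valid groups)
    return agree / valid if valid else 0.0
```

With 20 subjects per group, a group where votes split 10/10 (or where "Difficult to choose" prevents a majority above 10) is simply discarded before the ratio is taken.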
To maintain continuity, participants were required to complete the judgment of each group within a specified time (less than 10 minutes) and to complete each batch continuously without interruption.

2 Eye movement experiments.
We used an EyeLink 1000 Plus to record saccades; its built-in saccade detection algorithms can detect saccades automatically. First, we calibrated the eye tracker according to the manufacturer's instructions to ensure accurate tracking of eye movements. Then, we set up the stimulus display to be presented to the participant. Next, we showed two images, as in Supplementary Fig. 3, on the stimulus display. The EyeLink 1000 Plus recorded eye movements while the subject viewed the display. Using EyeLink's software development kit (SDK), we wrote code to detect saccades from the eye movement data recorded during the experiment. Finally, we analyzed the eye movement data in terms of fixations and saccades.
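The detection itself is performed by the EyeLink hardware and SDK; as a toy illustration of the underlying idea, velocity-threshold saccade detection on a 1-D gaze trace can be sketched as below. The threshold value and the simplified 1-D geometry are illustrative assumptions, not EyeLink's actual parameters.

```python
def detect_saccades(positions, dt, vel_thresh=30.0):
    """positions: 1-D gaze samples (degrees); dt: sample interval (seconds).
    Returns (start, end) sample-index pairs of episodes whose velocity
    exceeds vel_thresh (deg/s); everything else is treated as fixation."""
    saccades, start = [], None
    for i in range(1, len(positions)):
        v = abs(positions[i] - positions[i - 1]) / dt   # instantaneous velocity
        if v > vel_thresh and start is None:
            start = i - 1                               # saccade onset
        elif v <= vel_thresh and start is not None:
            saccades.append((start, i - 1))             # saccade offset
            start = None
    if start is not None:
        saccades.append((start, len(positions) - 1))
    return saccades
```

Real detectors (including EyeLink's) combine velocity with acceleration and motion thresholds and operate on 2-D gaze coordinates.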

Fixation maps of eye movement experiments
Supplementary Fig. 4 shows examples of the average fixation maps from the 20 subjects in the eye movement experiments.

Figure 2: Examples of subjective experiment images. Each row shows two groups of images with their scores. In each group (a)-(j), the four images from left to right are: the segmentation of one method, the ground truth, the segmentation of the other method, and the PHD scores of the two segmentations at different tolerance thresholds. In each group, the image with a green box is the choice of most subjects. The two scores (F1 and IoU) below each segmentation result evaluate it against the ground truth and were not shown to the subjects during the subjective experiments. Group (j) in the last row has no green box because most subjects chose "Difficult to choose". The horizontal axis of the PHD plots represents the tolerance threshold.

Saccades of eye movement experiments
3 Formulas of evaluation metrics.
Supplementary Tab. 1 shows the formulas of evaluation metrics used in the manuscript.
• TP (true positives), TN (true negatives), FP (false positives), and FN (false negatives) compare the segmentation results with ground truth. The terms positive and negative refer to the classifier's prediction, and the terms true and false refer to whether that prediction corresponds to the ground truth.
• X and Y are two point sets; x and y are points in X and Y, respectively.
• For V-Rand, suppose that $S$ is the predicted segmentation and $W$ is the ground truth segmentation. Define $p_{ij}$ as the probability that a randomly chosen pixel belongs to segment $i$ in $S$ and segment $j$ in $W$. This joint probability distribution satisfies the normalization condition $\sum_{ij} p_{ij} = 1$. The marginal distribution $s_i = \sum_j p_{ij}$ is the probability that a randomly chosen pixel belongs to segment $i$ in $S$, and the marginal distribution $w_j = \sum_i p_{ij}$ is defined similarly. For V-Info, the mutual information $I(S; W) = \sum_{ij} p_{ij} \log p_{ij} - \sum_i s_i \log s_i - \sum_j w_j \log w_j$ is a measure of similarity between $S$ and $W$. $H(S) = -\sum_i s_i \log s_i$ is the entropy function. $\alpha$ is a manually set parameter, usually 0.5. For the implementation code for V-Rand and V-Info, see https://imagej.net/tutorials/segmentation-evaluation-after-border-thinning.
• Betti error directly compares the topology (the number of handles) between the segmentation and the ground truth. We randomly sample patches over the segmentation and report the average absolute difference between their Betti numbers and those of the corresponding ground truth patches.
• In PHD, τ represents the tolerance threshold.
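To make the tolerance idea concrete, here is a minimal sketch of a PHD-style distance between two point sets (an illustration, not the official implementation: the symmetric averaging over both directions and the plain nearest-point Euclidean distance are assumptions; the actual metric operates on skeleton points).

```python
import math

def phd(X, Y, tau):
    """X, Y: lists of (x, y) points; tau: tolerance threshold.
    Misalignments below tau are forgiven (f_minus(d) = 0); larger
    ones contribute linearly (f_plus(d) = d)."""
    def directed(A, B):
        total = 0.0
        for a in A:
            d = min(math.dist(a, b) for b in B)   # nearest-point distance
            total += d if d > tau else 0.0        # tolerate small misalignment
        return total / len(A)
    return (directed(X, Y) + directed(Y, X)) / 2
```

Unlike the classical Hausdorff distance, which is dominated by the single worst misalignment, this tolerance construction ignores small shifts entirely, matching the observation that humans tolerate misalignment but penalize structural breaks.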
4 More details of segmentation experiments.

Datasets
ISBI 2012 dataset (1) contains serial-section electron microscopy data of the Drosophila ventral nerve cord, captured from a first instar larva. For training, the provided set included a 2×2×1.5 μm³ volume imaged over 30 sections, together with publicly available manual segmentations. For testing, the provided set included only image data, with segmentations kept private for the assessment of segmentation accuracy. The size of each image is 512 × 512.

U-RISC dataset (2) was annotated upon RC1, a large-scale retinal serial-section transmission electron microscopy dataset. RC1 came from the retina of a light-adapted female Dutch Belted rabbit after in vivo excitation mapping. The imaged volume represents retinal tissue 0.25 mm in diameter, spanning the inner nuclear, inner plexiform, and ganglion cell layers. Serial EM sections were cut at 70-90 nm with a Leica UC6 ultramicrotome and captured at a resolution of 2.18 nm/pixel along both axes using SerialEM. The public U-RISC challenge has two tracks, and we chose the track with the higher-resolution images in our experiments. For training and testing, the provided set included 50 and 20 image sections, respectively, with publicly available manual segmentations. The size of each image is 9989 × 9959.

Road dataset (3) is drawn from the Massachusetts Roads Dataset, one of the largest publicly available collections of aerial road images, covering both urban and rural neighborhoods with many kinds of roads, from small paths to highways. The resolution of each image is 1500×1500. In our experiments, we use 1108 training and 49 test images.
CrackTree dataset (4) contains 206 images of cracks in roads. The resolution of each image is 600×800. The multiple shadows and cluttered backgrounds make crack detection a challenging task.
Potential applications include quality inspection and material characterization.

Baseline methods.
In this paper, we use eight baseline methods for the segmentation of U-RISC datasets as follows: U-Net (5) is a fully convolutional neural network designed for image segmentation tasks. The U-Net architecture consists of an encoder and a decoder, which are connected by a series of skip connections. The encoder consists of a series of convolutional and pooling layers, which extract high-level features from the input image. The decoder then uses up-sampling and concatenation operations to recover the spatial resolution of the image and produce the final segmentation mask.
CASENet (6) is designed for semantic segmentation. The architecture consists of three modules: feature extraction, context modeling, and semantic segmentation. The feature extraction module extracts features from the input image, while the context modeling module captures contextual information at multiple scales using dilated convolutional layers. The semantic segmentation module then produces the final segmentation mask using convolutional layers.
LinkNet (7) is based on a modified U-Net architecture with a series of skip connections. The network consists of an encoder and decoder connected by a series of residual connections. The encoder consists of a series of convolutional layers followed by max pooling, while the decoder consists of a series of up-sampling and convolutional layers. The skip connections are added between the corresponding layers in the encoder and decoder, allowing the network to effectively use both low-level and high-level features for segmentation.
GLNet (8) is designed for semantic segmentation of high-resolution images. The network consists of three modules: global feature extraction, local feature extraction, and feature fusion. The global feature extraction module captures the global context information of the input image using a series of convolutional layers. The local feature extraction module extracts detailed local features using dilated convolutional layers. The feature fusion module combines the global and local features using a fusion block, which consists of convolutional and pooling layers.
SENet (9), or Squeeze-and-Excitation Network, is a deep neural network architecture designed to improve the performance of convolutional neural networks (CNNs) by explicitly modeling the interdependencies between channels. The architecture consists of a series of convolutional layers followed by a squeeze-and-excitation (SE) block. The SE block consists of a squeeze operation, which reduces the spatial dimensions of each input feature map to a single value per channel, followed by an excitation operation, which applies a non-linear transformation to the squeezed output.
U-Net++ (10) is a deep neural network architecture designed for semantic segmentation of images. The architecture was proposed in 2018 and is based on the popular U-Net architecture, but with a series of modifications to improve performance. The network consists of an encoder and decoder, similar to the U-Net architecture, but with a series of nested dense skip connections added between the corresponding layers in the encoder and decoder.
Mosin. (11) builds on the popular U-Net architecture, but with modifications to incorporate topological constraints. The network consists of an encoder and decoder, similar to the U-Net architecture, but with the addition of a topological loss function. The topological loss function penalizes deviations from the expected topology of the segmented structures, such as disconnected regions or spurious holes.
DMT (12) represents the topological structure of the image using a set of critical points and their connecting paths. The Morse complex is then used to guide the segmentation process, by identifying regions of interest and constraining the segmentation to follow the topological structure of the image.
All eight methods use identical training and testing datasets for performance comparison, with the parameters and loss functions proposed in their respective references.

Experiment settings
The proposed deep network is implemented using the PyTorch open-source deep learning library (13). Training and deployment of the network are conducted on an NVIDIA A100 GPU. For data augmentation, rotation and shift transformations are applied to the input images during training. The number of patches N is set to 4. The tolerance threshold τ in the loss functions is set to 2. The weights λ1 and λ2 of the final loss function are both set to 0 for the first k epochs. From epoch k + 1, each weight increases by 0.1 every 10 epochs until it reaches 1; k is set to 5 here. The increase ratio in each epoch is the variance of L_phd and L_sim over the last two epochs, normalized by the number of skeleton points. For all datasets, we use three-fold cross-validation.
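Under a literal reading of the schedule above (zero for the first k epochs, then +0.1 every 10 epochs, capped at 1; the adaptive variance-based adjustment is omitted), the warm-up of λ1 and λ2 can be sketched as:

```python
def loss_weight(epoch, k=5, step=10, increment=0.1):
    """Warm-up weight for the PHD loss terms: 0 for the first k epochs,
    then growing by `increment` every `step` epochs, capped at 1."""
    if epoch <= k:
        return 0.0
    # number of completed `step`-epoch intervals since warm-up ended
    steps = (epoch - k - 1) // step + 1
    return min(1.0, increment * steps)
```

Keeping the PHD terms switched off early lets the network first learn a coarse segmentation from the standard loss before the topology-sensitive terms start to dominate.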
Because the methods under comparison do not provide their predictions or results for the other evaluation metrics, the results in Tab. 1 and Tab. 2 of the manuscript were obtained by training and testing with the official code. The results in Tab. 3 are those reported in the original articles. Due to the typesetting of the article, we report them separately.

For the functions f+ and f−, there are various optional formulas. Here we choose sigmoid-like, tanh-like, and ReLU-like forms for comparison. The difference between the sigmoid-like form and the sigmoid is the inflection point, which is 0 for the sigmoid and the tolerance threshold τ for the sigmoid-like form; otherwise they are the same. In our experiments, the three formulas showed the same consistency with human judgment, while the computational cost of the ReLU-like form is minimal. We therefore choose f+(d) = d and f−(d) = 0. When selecting the tolerance for the PHD loss function during the training phase, multiple factors should be considered, including the size of the dataset and the desired level of accuracy. Generally, lower tolerance values can increase precision but also raise the risk of overfitting, while higher tolerance values can result in a more generalized model but potentially compromise accuracy.
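The three candidate penalty shapes compared above can be sketched as follows. The exact formulas are assumptions (only the chosen ReLU-like pair f+(d) = d, f−(d) = 0 is stated in the text); the sigmoid-like and tanh-like variants are written with their inflection point or kink shifted to the tolerance τ, as described.

```python
import math

def relu_like(d, tau):
    # the chosen variant: f_plus(d) = d above the tolerance, f_minus(d) = 0 below
    return d if d > tau else 0.0

def sigmoid_like(d, tau):
    # sigmoid shifted so its inflection point sits at tau instead of 0
    return 1.0 / (1.0 + math.exp(-(d - tau)))

def tanh_like(d, tau):
    # tanh-shaped penalty, zero below the tolerance (illustrative assumption)
    return math.tanh(d - tau) if d > tau else 0.0
```

All three are monotone in d and flat (or nearly flat) below τ, which is why they rank segmentations the same way in practice; the ReLU-like form simply avoids the exponentials.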
To investigate the optimal tolerance for the PHD loss function, ablation experiments were conducted by varying the tolerance and observing its effect on the model's performance. This provides insights into the trade-offs between precision and generalization and helps identify the ideal tolerance for the specific task and dataset at hand. The results of the ablation experiments in Supplementary Tab. 5 demonstrate that the proposed PS-Net attains superior performance when τ equals 2, as evidenced by the scores of 15.29/13.52/6.979 (PHD-0/5/10). We also observed that increasing or decreasing the tolerance led to a decrease in performance. Therefore, we chose τ = 2 for subsequent experiments.

Feature visualization for PHD loss function on U-RISC dataset
Supplementary Fig. 6 shows the feature visualization for PHD loss function on U-RISC dataset.

More results of cell membrane segmentation on ISBI 2012 dataset.
More examples for the segmentation results of ISBI 2012 dataset (in Sec. 5.3 of the submission) are shown in Supplementary Fig. 7.

More results of cell membrane segmentation on U-RISC dataset.
More examples of the segmentation results on the U-RISC dataset (in Sec. 5.4 of the submission) are shown in Supplementary Fig. 8.

4.10 Ablation studies for the hyperparameters on U-RISC dataset.
In our experiments, λ1 and λ2 are first set to 0 for the first k epochs; as training proceeds, each weight increases by 0.1 every 20 epochs until it reaches 1. In this section, we test fixed (λ1, λ2) combinations including (0.5, 0.5), (1, 0), (0, 1), and (1, 1), and we found that the adaptive schedule was the best combination, producing the highest accuracy. The results are shown in Supplementary Tab. 5. We also tested different values of k, including 0, 3, 5, 10, and 15. Increasing k beyond 5 did not significantly improve the results, while a small k can also degrade performance at a higher computational cost. Therefore, we used k = 5 in our experiments, which strikes a balance between accuracy and efficiency.
We acknowledge that other combinations may also produce good results. Therefore, we suggest that future research explore different combinations of λ1, λ2, and k to further understand the impact of these parameters.

Ablation studies for the topology loss.
Further, we compared the PHD loss with another topology loss, the clDice loss (30), using the same network architecture as PS-Net on the U-RISC dataset. In PS-Net, the PHD loss serves two purposes: first, it measures the dissimilarity between the ground truth and the predictions of the global and local branches, respectively; second, it measures the dissimilarity between the predictions of the two branches. In the new experiments, we replaced both with the clDice loss and kept all other parameters the same as those used in Table 2 of the manuscript. The clDice loss is $L = (1-\alpha)\,L_{softDice} + \alpha\,L_{clDice}$, where $\alpha$ is the weighting parameter. In accordance with the experimental setup of the clDice loss (30), we trained the network while varying α from 0.1 to 0.5. Our results in Supplementary Tab. 6 indicate that the PHD loss leads to higher scores than the clDice loss across all α values, highlighting the superiority of the PHD loss in effectively segmenting cell membranes in EM images. Furthermore, considering the clDice loss alone, the model consistently achieved its highest performance on the PHD, V-Rand, and V-Info metrics at α = 0.4 (with the best IoU at α = 0.5), while for the F1 metric the best performance was observed at α = 0.1.

5 Running efficiency analysis.

To analyze the running efficiency of our PS-Net, we compare it with six other networks: U-Net, SENet, LinkNet, GLNet, CASENet, and U-Net++. Two measurements are reported: the number of parameters and the floating-point operations (FLOPs). The experiments are performed on a single NVIDIA A100 GPU, and for quick testing the resolution of the input images is set to 512 × 512 for all methods. As shown in Supplementary Tab. 7, we observe that the number of parameters and FLOPs of our method are similar to those of U-Net but smaller than those of GLNet.
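As a rough illustration of the two efficiency measurements (a sketch, not the profiling code used for Supplementary Tab. 7, which tools such as fvcore or torchinfo automate for whole networks), parameters and FLOPs of a single 2-D convolution can be counted as:

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """Parameter count of a k x k conv: one k*k*c_in kernel (plus bias) per output channel."""
    return c_out * (c_in * k * k + (1 if bias else 0))

def conv2d_flops(c_in, c_out, k, h_out, w_out, bias=True):
    """FLOPs of the same conv, counting a multiply-add as 2 FLOPs per output element."""
    per_pixel = 2 * c_in * k * k + (1 if bias else 0)
    return c_out * h_out * w_out * per_pixel
```

The FLOPs term scales with h_out * w_out, which is why feeding the local branch a down-scaled input reduces its cost so sharply while leaving the parameter count unchanged.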
The main reason is that we use U-Net as the default segmentation module, and the global-local strategy does not add parameters to the network. Moreover, the local branch of PS-Net takes an input 1/4 the size of the original image, resulting in substantially fewer FLOPs than processing the image at its original size. With computational complexity comparable to that of U-Net, our method achieves the best performance (F1 score and PHD).

6 Attribution analysis for global-local strategy.
To gain a deeper understanding of the impact of the global-local strategy on the network, we performed an attribution analysis on the models trained with and without the global-local strategy, using a gradient-based attribution method, Integrated Gradients (IG) (31). In brief, IG explains the relationship between the predictions and the input image or features based on gradients. Specifically, we choose one pixel in the segmentation result and explore which areas of the input image contribute most to its decision. The output is a heatmap of the same size as the input image, where the value of each pixel reflects its contribution to the final prediction of the chosen pixel. We choose a pixel p in the prediction image that is predicted correctly by the network with the global-local strategy but incorrectly by the network without it, and we visualize the contribution of each input pixel to the prediction of p.
From our results in Supplementary Fig. 9, the chosen points are predicted correctly by the network with the global-local strategy (red in (3)) but incorrectly by the network without it (green in (4)). With the global-local strategy, the number of pixels contributing strongly to the prediction is much larger than without it. This suggests that the global-local strategy enables the network to use features from larger regions for its decision and helps improve prediction performance.
[Panels: (1) Original Image; (2) Ground Truth; (3) Prediction w GL; (4) Prediction w/o GL; attribution maps of the red point in (3) and of the red point in (4).]

Figure 9: The images in the first row are the original image, the ground truth, and the two predictions with and without the global-local strategy, respectively. The pixels indicated by red/green arrows are the prediction points chosen for the attribution method. The images in the second row are the attribution maps of the selected points in the two predictions. Areas of the same color have the same size. In the attribution maps, blue means the network is likely to predict the pixel as cell membrane, and red the opposite.
7 The derivation process of the PHD loss function.
First, for the thinning process, we implemented a fully differentiable version of the Zhang-Suen method in PyTorch. The original Zhang-Suen skeletonization method is non-differentiable due to its thresholding operations. To make it differentiable, we define a differentiable approximation of the thresholding operation used in the Zhang-Suen algorithm: we apply a soft thresholding function, ReLU, to the difference between the local intensity values and the threshold value, and use the resulting value to update the continuous function. The gradient of the ReLU function is well defined and can be used to propagate gradients through the skeletonization algorithm. After applying the modified algorithm to the continuous function, we obtain a continuous skeleton that is differentiable with respect to the input image.
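The soft thresholding step described above can be sketched in scalar form as follows (the actual implementation operates element-wise on PyTorch tensors; this stdlib version only illustrates the ReLU-on-a-difference construction):

```python
def soft_threshold(x, t):
    """relu(x - t): a differentiable-almost-everywhere replacement for the
    hard comparison `x > t` used in the original Zhang-Suen algorithm."""
    return max(0.0, x - t)

def soft_threshold_grad(x, t):
    """Subgradient of relu(x - t) with respect to x; taken as 0 at the kink x == t,
    mirroring the convention used for the PHD loss below."""
    return 1.0 if x > t else 0.0
```

Because the subgradient is 1 wherever the intensity exceeds the threshold and 0 elsewhere, gradients flow through exactly the pixels that survive each thinning iteration.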
Then, for the PHD loss, we modified the loss function. Specifically: 1) when d(x, y) = τ, the function is not differentiable, so we set the gradient to 0 in the implementation to simplify the mathematical calculation; 2) when d(x, y) < τ, f−(d) = 0, so the gradient is also 0; 3) when d(x, y) > τ, taking x as an example, since f+(d) = d for the Euclidean distance d(x, y) = ‖x − y‖₂, the gradient is

$\dfrac{\partial d(x, y)}{\partial x} = \dfrac{x - y}{\|x - y\|_2}.$    (2)
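As a sanity check on the d(x, y) > τ case (a sketch, not the paper's implementation), the closed-form gradient of the Euclidean distance with respect to x can be compared against a central finite difference:

```python
import math

def dist(x, y):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.dist(x, y)

def grad_x(x, y):
    """Gradient of d(x, y) with respect to x: (x - y) / ||x - y||_2."""
    d = dist(x, y)
    return tuple((xi - yi) / d for xi, yi in zip(x, y))

# finite-difference check on the first coordinate of x = (3, 4), y = (0, 0)
x, y, eps = (3.0, 4.0), (0.0, 0.0), 1e-6
numeric = (dist((x[0] + eps, x[1]), y) - dist((x[0] - eps, x[1]), y)) / (2 * eps)
```

For this example d(x, y) = 5, so the analytic gradient is (0.6, 0.8), and the finite difference recovers the first component to within the step-size error.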