NEATmap: a high-efficiency deep learning approach for whole mouse brain neuronal activity trace mapping

ABSTRACT Quantitative analysis of the neurons activated in mouse brains by a specific stimulation is usually a primary step in locating the responsive neurons throughout the brain. However, it is challenging to comprehensively and consistently analyze the neuronal activity traces in the whole brains of a large cohort of mice from many terabytes of volumetric imaging data. Here, we introduce NEATmap, a deep learning-based, high-efficiency, high-precision and user-friendly software for whole-brain neuronal activity trace mapping by automated segmentation and quantitative analysis of immunofluorescence-labeled c-Fos+ neurons. We applied NEATmap to study brain-wide differential neuronal activation in response to physical and psychological stressors in cohorts of mice.


Methods

Animals
8–12-week-old male C57BL/6J mice were used for this study. All animals were purchased from Shanghai SLAC Co., Ltd. All animals used in this study were group-housed under a 12/12-hour light/dark cycle (lights on at 7 a.m.). Food and water were provided ad libitum. Animals were group-housed for a week before the experiments. All animal experiments were carried out following the protocol (No. ABSL-2-2022030402) approved by the Animal Care and Use Committee of the institute.

Behavior experiments
For the forced swimming test (FST), the animal was put into a 5 L glass beaker half-filled with water at 25 °C for 5 minutes. Animal behavior was recorded with a video camera (Supplementary Video 6). After swimming, the animal was dried with cotton tissue and a nearby heater before returning to the home cage for rest. The animal was sacrificed by transcardial perfusion for brain collection 85 minutes after swimming. For the acute social defeat emotional stress test, C57BL/6J mice were co-housed in pairs for a week before the experiment. Retired male CD1 mice were screened for the subsequent social defeat experiment.
The social defeat stress (SDS) mouse was subjected to a novel CD1 mouse for 10 minutes, while the co-housed emotional stress (ES) mouse received indirect stimulation by observing the SDS mouse in the same chamber. ES mice were isolated by a porous transparent divider during the test to avoid physical contact. After the stress test, mice were put back and rested in the home cage for 80 minutes before being sacrificed for whole-brain neural activity mapping.

Sample preparation
Sample collection. The sample preparation pipeline is described in Supplementary Fig. 11. After the behavior test, all animals were anesthetized by intraperitoneal injection of 1% pentobarbital sodium (80 mg/kg). All brains were collected by transcardial perfusion of 40 mL 1x phosphate-buffered saline (PBS) and 20 mL 4% paraformaldehyde (PFA), respectively, and then immersed in 4% PFA at 4 °C overnight for further fixation.
Embedding. The brain was immersed in hot 10% gelatin solution in a small plastic cube, and the cube was then moved into an ice-cold water bath for the gelatin to set. The embedded sample was then put back into 4% PFA for 24 hours of further fixation.
Slicing. All brains were sliced into 300 µm thick sections by a homemade vibrating slicer. All sections of each brain were collected sequentially.

Slice mounting and refractive index matching. Slices of each brain were mounted on a customized glass slide (100 mm × 100 mm) for convenient imaging. The sample slide was transferred into refractive index matching solution (RIMS) overnight to homogenize the optical properties of the slices and the surrounding solution.

Whole brain light sheet microscopic imaging
The refractive index-matched brain slices were transferred to a VISoR imaging system described previously [1]. Briefly, the system was equipped with four lasers (Coherent, OBIS series) and an sCMOS camera (Hamamatsu, Flash 4.0 v3). All images were collected through a 10x 0.3 NA water-immersion objective lens (Olympus) and a 0.63x adaptor (TV0.63, Olympus). The final voxel size was about 1 × 1 × 3.5 µm³. The c-Fos signal was collected in the 561 nm channel (Supplementary Video 7); the autofluorescence in the 488 nm channel was also collected for non-neuronal structure detection.

Whole brain image reconstruction
The reconstruction process [2] is divided into five steps. (1) According to the real coordinates of the captured images, the images are stacked into columns; whole slices are then reconstructed from the columns using the correction coefficients and the true coordinates of the overlapping regions of each column. (2) The upper and lower surfaces of each contrast-enhanced brain slice are fitted by linear regression and interpolation. (3) Rigid transformation and the B-spline algorithm [3] are used to detect the texture and edges of the opposing surfaces of adjacent brain slices. (4) Constraints on the adjacent correspondence displacements and displacement vectors are applied to prevent the accumulation and propagation of alignment errors across multiple slices. (5) Distortions of slice size and shape are minimized using moving least squares [4].

Training data preparation
We generated the training dataset through the following five steps.
1. Auto-contrast normalization. Due to the large variability of signal intensities within and between brain slices, a Gaussian distribution was first fitted to the whole-stack intensity profile to obtain the mean and standard deviation of the intensities.
Intensity filter. The intensity stretch interval was obtained from the mean and standard deviation of the fitted Gaussian distribution:

$$I_{\min} = \max\left(\mathrm{mean} - \alpha \cdot \mathrm{std},\ I'_{\min}\right), \qquad I_{\max} = \min\left(\mathrm{mean} + \beta \cdot \mathrm{std},\ I'_{\max}\right) \quad (1)$$

where $I_{\min}$ and $I_{\max}$ are the minimum and maximum of the filtered image, $I'_{\min}$ and $I'_{\max}$ are the minimum and maximum of the raw image, mean and std are the mean and standard deviation of the Gaussian distribution, and $\alpha$ and $\beta$ are scale factors; the smaller the values of $\alpha$ and $\beta$, the higher the contrast. Each pixel intensity $x$ is then stretched onto this interval:

$$f(x) = \mathrm{clip}\!\left(\frac{x - I_{\min}}{I_{\max} - I_{\min}},\ 0,\ 1\right) \quad (2)$$
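As a concrete illustration, the following Python sketch applies this auto-contrast step to a 3D stack, assuming the reconstructed forms of Eqs. (1) and (2); the function name and the default scale factors are illustrative, not the paper's settings.

```python
import numpy as np

def auto_contrast(stack, alpha=2.0, beta=4.0):
    """Stretch intensities of a 3D stack using a Gaussian fit of its profile.

    alpha/beta are the scale factors of Eq. (1); smaller values give
    higher contrast. Default values here are illustrative.
    """
    mean, std = stack.mean(), stack.std()         # Gaussian fit of intensities
    i_min = max(mean - alpha * std, stack.min())  # Eq. (1): stretch interval
    i_max = min(mean + beta * std, stack.max())
    out = (stack.astype(np.float32) - i_min) / (i_max - i_min)  # Eq. (2)
    return np.clip(out, 0.0, 1.0)
```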
2. 2D spot filter. For strongly supervised training of the transformer architecture, we need to perform spot segmentation of c-Fos+ punctate cells. The 2D and 3D spot filters calculate the Laplacian of Gaussian on the image in 2D and 3D, respectively; both can detect the major structures in a 3D image stack.
However, subtle differences have been observed when they are applied to 3D cell imaging datasets [5]. c-Fos-expressing cells in each 2D frame present more general spot-like morphological structures rather than individual circular structures; the 3D spot filter may fail to detect the complete structure of cells, while the 2D spot filter shows better detection completeness (Supplementary Fig. 12). Therefore, we used the 2D spot filter for preliminary detection of the c-Fos+ signal in the training data preparation process (Fig. 2d).
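A minimal sketch of the per-frame 2D spot filter using SciPy's Laplacian-of-Gaussian; `sigma` and the response threshold are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def spot_filter_2d(stack, sigma=2.0, threshold=0.05):
    """Frame-by-frame 2D Laplacian-of-Gaussian detection of c-Fos+ spots.

    Applying the LoG per 2D frame (rather than in 3D) follows the
    observation above that c-Fos+ cells appear as spot-like structures
    in each frame; sigma and threshold are illustrative.
    """
    response = np.stack([-gaussian_laplace(frame.astype(np.float32), sigma)
                         for frame in stack])    # bright blobs -> positive
    return response > threshold                  # preliminary binary detection
```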
3. Binarization. This step converts the c-Fos+ detection results of the 2D spot filter into 3D mask annotations.

4. Brain edge detection. Some cells at the border of the brain slice show strong signal intensities that do not originate from the c-Fos immunofluorescence signal. We detected the brain boundaries using Canny edge detection [6] to remove them.
5. Manual correction. The results obtained above were checked and recomputed with optimized parameters. The high-precision masks generated through the aforementioned process, combined with the image data, constitute the training dataset (Supplementary Fig. 2).

3D-HSFormer architecture
We designed a hybrid artificial neural network combining a dilated convolutional neural network and a transformer (Fig. 1c) for segmenting dual-channel (c-Fos and autofluorescence channels) whole-brain imaging datasets. The hybrid architecture outperformed the pure transformer architecture [7]. Dilated convolution [8] can capture information at different scales and increase the receptive field without reducing image resolution. The dilation rates (d) are set to {1, 2, 5} in the Dilated convblock (Fig. 1c). The Swin Transformer architecture [9] has shown great performance in 2D image tasks such as classification, object detection, and medical image segmentation [10]. We further extended it to three dimensions for segmenting small and dense targets (e.g., c-Fos+ cells in the whole mouse brain) in 3D microscopic images.
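A minimal PyTorch sketch of such a dilated convolution block with dilation rates {1, 2, 5}, mapping a 1-channel input to the 16-channel feature map described below; the parallel-branch layout and the summation fusion are assumptions, since the paper does not specify how the branches are combined.

```python
import torch
import torch.nn as nn

class DilatedConvBlock(nn.Module):
    """Sketch of a dilated conv block with rates d = {1, 2, 5}.

    Parallel 3D convolutions with increasing dilation enlarge the
    receptive field without reducing resolution; setting padding equal
    to the dilation keeps all branch outputs the same size.
    """
    def __init__(self, in_ch=1, out_ch=16):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=d, dilation=d)
            for d in (1, 2, 5)])
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (B, in_ch, D, H, W)
        return self.act(sum(branch(x) for branch in self.branches))
```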
The proposed transformer architecture is mainly divided into an encoder and a decoder. The encoder learns features from the multi-scale output of the dilated convolution block, and the decoder up-samples the learned features to obtain the segmentation of each c-Fos+ cell. The modules of the encoder and the decoder are described in detail below.
Encoder. Assume that the input image size is $D \times H \times W \times 1$; the feature map size obtained after the Dilated convblock is $D \times H \times W \times 16$. The Patch partition divides the feature map into non-overlapping patches of feature size 64 with a patch size of $4 \times 4 \times 4$. The Linear embedding maps the input features linearly to an arbitrary dimension (denoted as $C$). The Swin Transformer block mainly consists of a LayerNorm (LN) layer, a multi-head self-attention module, residual connections, and a 2-layer multilayer perceptron (MLP) [9]. Two successive Swin Transformer blocks use the window-based multi-head self-attention (W-MSA) module and the shifted-window-based multi-head self-attention (SW-MSA) module, respectively (Supplementary Fig. 1b). The two successive Swin Transformer blocks can be formulated as:

$$\hat{z}^{l} = \text{W-MSA}\left(\mathrm{LN}\left(z^{l-1}\right)\right) + z^{l-1} \quad (4)$$
$$z^{l} = \mathrm{MLP}\left(\mathrm{LN}\left(\hat{z}^{l}\right)\right) + \hat{z}^{l} \quad (5)$$
$$\hat{z}^{l+1} = \text{SW-MSA}\left(\mathrm{LN}\left(z^{l}\right)\right) + z^{l} \quad (6)$$
$$z^{l+1} = \mathrm{MLP}\left(\mathrm{LN}\left(\hat{z}^{l+1}\right)\right) + \hat{z}^{l+1} \quad (7)$$

where $\hat{z}^{l}$ and $z^{l}$ refer to the output features of the (S)W-MSA module and the MLP module of the $l$th block, respectively. As reported in previous works [7,9], self-attention is calculated as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{QK^{T}}{\sqrt{d}} + B\right)V \quad (8)$$

where $Q, K, V \in \mathbb{R}^{M^{3} \times d}$ indicate the query, key and value matrices; $M^{3}$ is the number of patches in a window, and $d$ is the query or key dimension. The values in $B$ are taken from the bias matrix $\hat{B} \in \mathbb{R}^{(2M-1) \times (2M-1) \times (2M-1)}$.

Two successive Swin Transformer blocks form Stage 1. The Swin Transformer block cannot change the dimension of the token sequence. Stage 2 consists of a Patch merging layer with two successive Swin Transformer blocks, and Stage 3 of a Patch merging layer with six successive Swin Transformer blocks. Patch merging down-samples the tokens by 2×, resulting in a feature dimension that is 2× the original size. Stage 4 comprises two Patch merging layers followed by two successive Swin Transformer blocks, which output deeper features. A Bottleneck, formed by two successive Swin Transformer blocks, is employed to learn deep features.

Decoder. Patch expanding up-samples the tokens by 2×. For instance, in the Bottleneck, features are passed to the first Patch expanding layer, which first increases the feature dimension to 2× the original size using a linear layer; a rearrange operation is then employed to double the input feature resolution and reduce the feature dimension to a quarter of the original. Stage 5 and Stage 6 consist of a Patch expanding layer with two successive Swin Transformer blocks and with six successive Swin Transformer blocks, respectively. Stage 7 comprises two Patch expanding layers followed by two successive Swin Transformer blocks; notably, the final Patch expanding in Stage 7 involves a 4× up-sampling. The Skip connection fuses the multi-scale features of the encoder with the up-sampled features of the decoder; fusing the shallow features of the encoder with the deep features of the decoder can prevent losing the down-sampled feature information [11]. The Linear projection is used to linearly map the up-sampled features.
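For illustration, a compact PyTorch sketch of 3D window-based self-attention with the relative position bias of Eq. (8); window size, head count, and initialization are illustrative, and the module covers a single window pass rather than the full (S)W-MSA block.

```python
import torch
import torch.nn as nn

class WindowAttention3D(nn.Module):
    """Minimal 3D window attention: SoftMax(QK^T / sqrt(d) + B) V, Eq. (8)."""
    def __init__(self, dim, window_size=(7, 7, 7), num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        wd, wh, ww = window_size
        # Bias table B-hat of shape (2M-1)^3 per head (Eq. (8)).
        self.bias_table = nn.Parameter(
            torch.zeros((2*wd - 1) * (2*wh - 1) * (2*ww - 1), num_heads))
        # Precompute the relative position index for every token pair.
        coords = torch.stack(torch.meshgrid(
            torch.arange(wd), torch.arange(wh), torch.arange(ww),
            indexing="ij")).flatten(1)                 # 3 x M^3
        rel = (coords[:, :, None] - coords[:, None, :]).permute(1, 2, 0)
        rel = rel.contiguous()
        rel[:, :, 0] += wd - 1; rel[:, :, 1] += wh - 1; rel[:, :, 2] += ww - 1
        rel[:, :, 0] *= (2*wh - 1) * (2*ww - 1); rel[:, :, 1] *= 2*ww - 1
        self.register_buffer("rel_index", rel.sum(-1))  # M^3 x M^3

    def forward(self, x):  # x: (num_windows * B, M^3, dim)
        B_, N, C = x.shape
        qkv = self.qkv(x).reshape(B_, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each: (B_, heads, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        bias = self.bias_table[self.rel_index.view(-1)].view(N, N, -1)
        attn = (attn + bias.permute(2, 0, 1)).softmax(dim=-1)  # add B per head
        return self.proj((attn @ v).transpose(1, 2).reshape(B_, N, C))
```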

Training schedule
The network was trained and tested on a workstation equipped with an NVIDIA Tesla V100S-PCIe GPU with 32 GB RAM. The network was developed with Python 3.6 and PyTorch 1.6.0 [12]. The images and labels in the training data were prepared from the 3D images of the brain slices and the training data preparation results by clipping into subvolumes of 64 × 256 × 256 pixels (Supplementary Fig. 2). Network training was driven by a Stochastic Gradient Descent (SGD) optimizer with a learning rate of 1 × 10⁻², a weight decay of 1 × 10⁻⁴ and a momentum of 0.99. In the training phase, we used the data of one whole brain in the FST group, set the batch size to 2, and trained the network for 10 epochs.
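The schedule above condenses into a short training loop; `model`, `train_loader`, and `criterion` are placeholders for 3D-HSFormer, a loader of 64 × 256 × 256 subvolumes, and the combined loss of Eq. (11).

```python
import torch

def train(model, train_loader, criterion, epochs=10):
    """Training loop matching the schedule above: SGD with lr 1e-2,
    momentum 0.99, weight decay 1e-4, 10 epochs (batch size is set
    in the loader; 2 in the paper)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                                momentum=0.99, weight_decay=1e-4)
    for _ in range(epochs):
        for volumes, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(volumes), labels)
            loss.backward()
            optimizer.step()
    return model
```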
The total training loss function is divided into two parts: cross-entropy loss and Dice loss [13]. This loss combination prevents any single class from dominating due to the clipping into subvolumes. The cross-entropy loss and the Dice loss are calculated as:

$$\mathcal{L}_{ce} = -\frac{1}{|\Omega|} \sum_{x \in \Omega} \log \tilde{p}_{\ell(x)}(x) \quad (9)$$

$$\mathcal{L}_{dice} = 1 - \frac{2 \sum_{x \in \Omega} p_{k}(x)\, g_{k}(x)}{\sum_{x \in \Omega} p_{k}(x) + \sum_{x \in \Omega} g_{k}(x)} \quad (10)$$

where $x$ is a pixel in the image domain $\Omega$, $\tilde{p}: \Omega \to \mathbb{R}^{K}$ is the predicted score for each class $k \in \{0, \ldots, K\}$, $K$ is the number of classes, and $\ell: \Omega \to \{0, \ldots, K\}$ is the ground-truth segmentation map. Therefore, $\tilde{p}_{\ell(x)}(x)$ is the predicted score for the ground-truth class $\ell(x)$ at pixel $x$ (Eq. (9)). In Eq. (10), $p_{k}$ is the $k$th output of the 1 × 1 convolution layer passed through a softmax function and $g_{k}$ is the ground-truth class.
The total loss is:

$$\mathcal{L}_{total} = \lambda\, \mathcal{L}_{ce} + (1 - \lambda)\, \mathcal{L}_{dice} \quad (11)$$

where $\lambda \in [0, 1]$ controls the proportion of the cross-entropy loss and the Dice loss in the total loss.
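A sketch of this combined objective in PyTorch, assuming the reconstructed Eqs. (9)-(11); the smoothing constant is a common numerical safeguard, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CEDiceLoss(nn.Module):
    """Combined cross-entropy + Dice loss following Eqs. (9)-(11)."""
    def __init__(self, lam=0.5, smooth=1e-5):
        super().__init__()
        self.lam, self.smooth = lam, smooth

    def forward(self, logits, target):
        # logits: (B, K, D, H, W); target: (B, D, H, W) integer labels
        ce = F.cross_entropy(logits, target)                       # Eq. (9)
        probs = logits.softmax(dim=1)
        one_hot = F.one_hot(target, logits.shape[1]).movedim(-1, 1).float()
        inter = (probs * one_hot).sum(dim=(0, 2, 3, 4))            # per class
        denom = probs.sum(dim=(0, 2, 3, 4)) + one_hot.sum(dim=(0, 2, 3, 4))
        dice = 1 - ((2 * inter + self.smooth)
                    / (denom + self.smooth)).mean()                # Eq. (10)
        return self.lam * ce + (1 - self.lam) * dice               # Eq. (11)
```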

Transfer learning
To demonstrate the wider applicability of NEATmap to volumetric datasets labeled with other cell markers beyond the c-Fos immunofluorescence dataset, we employed transfer learning to fine-tune our pre-trained 3D-HSFormer on an iDISCO-processed c-Fos dataset and on datasets of fluorescence in situ hybridization (FISH)-labeled cell-type marker genes in thick brain slices.
An open-source iDISCO whole mouse brain c-Fos dataset, annotated by our training data preparation process (Supplementary Fig. 2), was used for fine-tuning our pre-trained 3D-HSFormer. The iDISCO c-Fos mouse brain dataset was cut into 41 slices of 64 × 1892 × 1802 pixels.
Each slice was divided into 49 subvolumes of 64 × 256 × 256 pixels, resulting in a total of 2009 subvolumes of 3D data for training (the training process was the same as described in the previous section). We randomly chose 10% of these subvolumes for validation. The test data were sourced from another iDISCO-processed mouse brain c-Fos dataset, annotated by an expert and validated by two other experts; five slices of 64 × 1778 × 1802 pixels were each divided into 42 subvolumes of 64 × 256 × 256 pixels. For the FISH-labeled cell-type marker datasets, experts annotated thick brain slice datasets of the Sst, Vglut1, and Vgat genes, respectively. The training dataset for each gene consists of one brain slice of 64 × 3500 × 2500 pixels, divided into 70 subvolumes of 64 × 256 × 256 pixels. The test data were prepared following the same protocol on another brain slice, and the training processes followed the same schedule as above.

Automated whole brain dual-channel segmentation
The trained network enables whole-brain inference on the c-Fos and autofluorescence channels of datasets from different animals. Each brain was sectioned into approximately 50 slices, and every brain slice was clipped into 70 subvolumes. We used the c-Fos imaging data of one mouse brain from the FST group to train the 3D-HSFormer.
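A simple sketch of the subvolume clipping used for whole-brain inference; zero-padding of edge tiles is an assumption about how partial tiles are handled.

```python
import numpy as np

def clip_subvolumes(slice_3d, depth=64, tile=256):
    """Clip one reconstructed brain-slice volume into 64x256x256 subvolumes.

    Edge tiles are zero-padded to full size so every subvolume has the
    shape expected by the network.
    """
    d, h, w = slice_3d.shape
    subvols = []
    for z in range(0, d, depth):
        for y in range(0, h, tile):
            for x in range(0, w, tile):
                block = slice_3d[z:z + depth, y:y + tile, x:x + tile]
                pad = [(0, depth - block.shape[0]),
                       (0, tile - block.shape[1]),
                       (0, tile - block.shape[2])]
                subvols.append(np.pad(block, pad))
    return subvols
```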

Evaluation metrics
We used three voxel-wise segmentation metrics, Dice, sensitivity (Sst) and Jaccard (Jc), to evaluate the segmentation accuracy of 3D-HSFormer. The Dice coefficient measures the similarity between the predicted segmentation map and the ground-truth segmentation map:

$$\mathrm{Dice} = \frac{2\,|P \cap \tilde{P}|}{|P| + |\tilde{P}|}$$

where $P$ and $\tilde{P}$ are the predicted segmentation set and the ground-truth segmentation set, respectively. Dice ∈ [0, 1], with 0 meaning no overlap and 1 meaning that the network predicts a perfect segmentation. A Dice value of ~0.7 indicates a good segmentation result, and a Dice value of ~0.9 is close to ground-truth accuracy.
The Jaccard coefficient compares the similarities and differences between the predicted set and the ground-truth set, and the sensitivity describes the ratio of identified true positives to all true positives. A true positive (TP) indicates a valid match, and a false negative (FN) indicates that there is no valid match for the ground-truth segmentation. From these values:

$$\mathrm{Jc} = \frac{|P \cap \tilde{P}|}{|P \cup \tilde{P}|}, \qquad \mathrm{Sst} = \frac{TP}{TP + FN}$$

The Jc and Sst indicators have the same value range as the Dice indicator, and larger values indicate better segmentation predictions.
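The three metrics reduce to a few set operations on binary masks, as in this NumPy sketch.

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Voxel-wise Dice, Jaccard (Jc) and sensitivity (Sst) for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()      # valid matches
    fn = np.logical_and(~pred, gt).sum()     # missed ground truth
    dice = 2 * tp / (pred.sum() + gt.sum())
    jc = tp / np.logical_or(pred, gt).sum()
    sst = tp / (tp + fn)
    return dice, jc, sst
```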
We used the Wilcoxon test to compare the performance of different algorithms; each group of data is the predicted segmentation result of one algorithm applied to the same brain slice clipped into 70 subvolumes.

Post-processing
The predicted segmentation subvolumes of the dual-channel images obtained in the Test phase were spliced according to their positions in the raw image coordinates (Fig. 1f). First, the Autofluorescence filter was utilized to clear from the c-Fos channel the segmented results that the network predicted as autofluorescence signal:

$$S = C \setminus (C \cap A)$$

where set $C$ and set $A$ are the segmentation results of the c-Fos channel and the autofluorescence channel by 3D-HSFormer, respectively, and set $S$ is the 3D-HSFormer segmentation of the c-Fos+ signal after the Autofluorescence filter. Secondly, the No-soma filter was used to remove unreasonably small or large objects from the predicted segmentation results. Finally, the Intensity-based prediction filter was designed to profile the average signal intensities of the detected c-Fos+ cells across the whole brain at the voxel level; the segmentation results can then be selectively output based on user-specified intensity ranges.
The Intensity-based prediction filter can be formulated as:

$$V = I\left(x_{\mathrm{seg}},\ y_{\mathrm{seg}},\ z_{\mathrm{seg}}\right) \cdot M$$

where $I(\cdot)$ is the image in the c-Fos channel, $(x_{\mathrm{seg}}, y_{\mathrm{seg}}, z_{\mathrm{seg}})$ are the coordinates of the filtered segmentation mask, $M$ is the c-Fos+ mask at the corresponding coordinates, and $V$ is the c-Fos+ signal intensity at the corresponding coordinates.
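The three filters can be sketched as follows in NumPy/SciPy; the size bounds, intensity range, and the use of SciPy's default connectivity are illustrative user-side choices.

```python
import numpy as np
from scipy.ndimage import label

def postprocess(cfos_mask, auto_mask, image, min_size=8, max_size=2000,
                i_low=0.0, i_high=np.inf):
    # Autofluorescence filter: drop c-Fos detections also predicted in
    # the autofluorescence channel, i.e. S = C \ (C n A).
    mask = np.logical_and(cfos_mask.astype(bool), ~auto_mask.astype(bool))

    # No-soma filter: remove connected components whose voxel count is
    # implausibly small or large for a soma (bounds are illustrative).
    labels, _ = label(mask)
    sizes = np.bincount(labels.ravel())
    keep = (sizes >= min_size) & (sizes <= max_size)
    keep[0] = False                      # background stays off
    mask = keep[labels]

    # Intensity-based prediction filter: keep cells whose mean c-Fos
    # intensity lies in the user-specified range [i_low, i_high].
    labels, nlab = label(mask)
    for i in range(1, nlab + 1):
        sel = labels == i
        if not (i_low <= image[sel].mean() <= i_high):
            mask[sel] = False
    return mask
```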

Brain registration and cell counting
For brain registration, we used the average template and annotation file of the Allen mouse brain atlas (CCFv3). The reconstructed brain was registered to a reference autofluorescence brain template by: first, down-sampling the reconstructed brain to match the 25 × 25 × 25 µm³ voxel size of the template; secondly, employing Adaptive Stochastic Gradient Descent (ASGD) as the optimizer for the non-rigid B-spline transformation; thirdly, utilizing Advanced Mattes Mutual Information as the similarity metric; and finally, employing three image pyramids with variable resolutions for interpolation and resampling. The transformation parameters of the brain alignment were used to obtain the coordinates of cells in the whole brain. The original 3D volume of the predicted segmentation result was down-sampled to 25 × 25 × 25 µm³ voxel size, then flattened and stitched to reconstruct the entire brain. The segmented images output by the Post-processing phase were labeled and analyzed in the connectivity domain (connectivity = 4) to obtain the centroid coordinates and volume of each c-Fos+ cell. For each c-Fos+ cell, the physical coordinate was calculated as the centroid coordinates (x, y, z) multiplied by the voxel size (4 µm). Cell counting in each brain area of the whole brain was completed with the previously reported tool Freesia (https://github.com/BilabOptics/freesia-mapping).
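A minimal sketch of the centroid-and-volume extraction used for counting; SciPy's default 3D structuring element stands in for the connectivity-4 labeling described above, and the isotropic 4 µm voxel size follows the text.

```python
import numpy as np
from scipy.ndimage import label, center_of_mass

def cell_centroids(seg, voxel_um=4.0):
    """Centroids (in physical um) and voxel volumes of each c-Fos+ cell."""
    labels, n = label(seg)  # connected-component labeling of the binary mask
    centroids = np.asarray(center_of_mass(seg, labels, range(1, n + 1)))
    volumes = np.bincount(labels.ravel())[1:]   # voxels per labeled cell
    return centroids * voxel_um, volumes        # (z, y, x) in micrometers
```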

Quantification and statistics
We performed whole-brain inference on the data of the animals in the two behavioral tests (a total of 30 mice) with the already trained neural network. The cell density was quantified as the number of c-Fos+ neurons per unit volume (mm³), and the volume ratio was computed as the fraction of the total volume of c-Fos+ neurons in a region over the total region volume. The physical size of c-Fos+ neurons was set to 24 µm according to the average measured size (Supplementary Fig. 3e, f).
The Wilcoxon signed-rank test was utilized to assess significant differences in the performance metrics among the deep learning models. For determining significant differences in the number of c-Fos+ neurons between groups, we employed Student's t-test. To control the false detection rate for multiple comparisons, the false discovery rate (FDR) q-value was calculated, with the sample sizes (n) shown in the main text and supplementary figure legends. The statistical tests mentioned above are implemented using the SciPy library (https://scipy.org/) in Python. The Benjamini-Hochberg multiple comparison test [15] was used to control the FDR. A cutoff of fold change > 1.4 or < -1.4 and adjusted p-values < 0.01 was employed to select regions with significantly differential numbers of c-Fos+ neurons between the SDS group and the ES group. The relative z-score indicates how much the specific value of an individual animal, either in the experimental group or in the Ctrl group, differs from the mean of the Ctrl group relative to the standard deviation of the Ctrl group. More precisely, for each animal, the relative z-score was calculated as:

$$z_{i} = \frac{X_{i} - \bar{X}_{c}}{\sigma_{c}}$$

where $X_{i}$ is the number of c-Fos+ neurons in a particular brain area of the $i$th animal, either from the experimental group or the Ctrl group, $\bar{X}_{c}$ is the mean number of c-Fos+ neurons of all Ctrl animals in that brain area, and $\sigma_{c}$ is the standard deviation of the number of c-Fos+ neurons of all Ctrl animals in that brain area (Fig. 6a).
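A sketch of the per-region statistics, assuming one array of counts per region for each group; the Benjamini-Hochberg adjustment is implemented manually for transparency.

```python
import numpy as np
from scipy import stats

def region_stats(exp_counts, ctrl_counts, alpha=0.01):
    """Per-region t-tests with Benjamini-Hochberg FDR control and the
    relative z-scores of experimental animals versus Ctrl."""
    pvals = np.array([stats.ttest_ind(e, c).pvalue
                      for e, c in zip(exp_counts, ctrl_counts)])
    # Benjamini-Hochberg adjusted p-values (q-values).
    order = np.argsort(pvals)
    ranked = pvals[order] * len(pvals) / (np.arange(len(pvals)) + 1)
    qvals = np.empty_like(pvals)
    qvals[order] = np.minimum.accumulate(ranked[::-1])[::-1]
    # Relative z-score: (X_i - mean(Ctrl)) / std(Ctrl), per region.
    z = [(np.asarray(e) - np.mean(c)) / np.std(c, ddof=1)
         for e, c in zip(exp_counts, ctrl_counts)]
    return qvals, qvals < alpha, z
```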

The coefficient of variation (CV) is the ratio of the standard deviation to the mean and shows the extent of variability relative to the mean of the group. In Fig. 6b, the coefficient of variation in the SDS group and the ES group was calculated as:

$$\mathrm{CV}_{g} = \frac{\sigma_{g}}{\mu_{g}}$$

where $\mathrm{CV}_{g}$ represents the CV for either the SDS group or the ES group, and $\sigma_{g}$ and $\mu_{g}$ correspond to the standard deviation and mean of the number of c-Fos+ neurons in the respective group.
We utilized Spearman correlation to assess the correlation between brain areas in Fig. 6e, f:

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_{i}^{2}}{n\left(n^{2} - 1\right)}$$

where $\rho \in [-1, 1]$ is the Spearman correlation coefficient and $n$ represents the total number of animals in the group. In Fig. 6e, $d_{i}$ represents the difference between the two ranks (from low to high) of c-Fos+ neuron numbers in different brain areas for the $i$th animal. Similarly, $d_{i}$ represents the rank difference in c-Fos+ neuron numbers in the same brain area for the $i$th animal between the SDS group and the ES group (Fig. 6f).
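For example, with SciPy (the counts below are hypothetical per-animal c-Fos+ numbers for two brain areas):

```python
import numpy as np
from scipy import stats

counts_a = np.array([120, 340, 95, 410, 280])  # area A, one value per animal
counts_b = np.array([100, 300, 130, 390, 310])  # area B, same animals
rho, p = stats.spearmanr(counts_a, counts_b)    # rank-based correlation
print(f"Spearman rho = {rho:.3f}, p = {p:.3f}")
```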
