Evaluation of methods for generative modeling of cell and nuclear shape

Abstract

Motivation: Cell shape provides both geometry for, and a reflection of, cell function. Numerous methods for describing and modeling cell shape have been described, but previous evaluation of these methods in terms of the accuracy of generative models has been limited.

Results: Here we compare traditional methods and deep autoencoders for building generative models of cell shape, in terms of the accuracy with which shapes can be reconstructed from the models. We evaluated the methods on different collections of 2D and 3D cell images, and found that none of the methods gave accurate reconstructions using low-dimensional encodings. As expected, much higher accuracies were observed using high-dimensional encodings, with outline-based methods significantly outperforming image-based autoencoders. The latter tended to encode all cells as having smooth shapes, even for high dimensions. For complex 3D cell shapes, we developed a substantial improvement of a method based on the spherical harmonic transform that performs significantly better than other methods. We obtained similar results for the joint modeling of cell and nuclear shape. Finally, we evaluated the modeling of shape dynamics by interpolation in the shape space. We found that our modified method provided lower deformation energies along linear interpolation paths than other methods, allowing practical shape evolution in high-dimensional shape spaces. We conclude that our improved spherical harmonic based methods are preferable for cell and nuclear shape modeling, providing better representations and higher computational efficiency, and requiring fewer training images than deep learning methods.

Availability and implementation: All software and data are available at http://murphylab.cbd.cmu.edu/software.

Supplementary information: Supplementary data are available at Bioinformatics online.

1 Simulation process for simulated cells

Detailed simulation process for SNL cells
The simulation process was: • Sampling of the cell body. First, we sample the cell body as the combination of two half-ellipses, using radial coordinates. The two half-ellipses share the same semiminor axis b ∼ U(10, 20), and the two semimajor axes a1 and a2 are each drawn from U(10, 20). Gaussian noise with standard deviation 0.1 is then added to the coordinates. Next, we morph and smooth the cell body using spline smoothing. We randomly choose a proportion of points, drawn from U(0.01, 0.20), as control points. Each control point is then moved along its normal direction, with a step size following U(1, 1.5) · N(0, 0.1), the product of a uniform and a normal random variable, and the moved control points are used for spline smoothing. Finally, we add Gaussian noise with standard deviation 0.05 to the shape.
• Sampling of neurites. The number of neurites N can be either fixed or randomly sampled from a truncated Poisson distribution with λ = 1 over a given range of allowed neurite numbers (0-2 for the SNL 3D center slice). First, we sample the total length of the neurites from U(20N, 25N). To assign length proportions to two neurites, we use a truncated Dirichlet distribution Dirichlet(2, 1), forcing the smaller proportion to be larger than 0.2.
• Sampling of branching. A relatively long neurite (longer than 5) may generate branches, with probability 0.4. We assign length weights using a truncated Dirichlet distribution Dirichlet(1, 1, 1), with each weight constrained to be larger than 0.05, and assign the largest weight to the main part connecting to the cell body. We then sample the angles of the two branches relative to the major axis of the main part: θ1 = 0 with probability 0.5, and θ1 ∼ U(−0.45π, 0.45π) with probability 0.5. If θ1 ≥ 0, θ2 ∼ U(−0.45π, −0.225π); if θ1 < 0, θ2 ∼ U(0.225π, 0.45π). This ensures that the two branches lie on opposite sides of the major axis.
• Sampling of neurite thickness. For each neurite without branching, the thickness T0 at the end connecting to the cell body is sampled from U(2, 3), and the thickness T1 at the other end is the maximum of U(0.4T0, T0) and T0 − 0.02L + U(0, 0.05), where L is the length of the neurite. If there is branching, T0 ∼ U(2, 3), and the thickness at the branching point is T2 ∼ max(U(0.4T0, T0), T0 − 0.02L1 + U(0, 0.05)), where L1 is the length of the main part. The thicknesses at the end points of the two branches follow T3 ∼ max(U(0.4T2, T2), T2 − 0.02L2 + U(0, 0.05)) and T4 ∼ U(0.4T2, T2), where L2 is the length of the first branch.
• Connecting neurites and the cell body. We first place the two neurites at angles 0 and π relative to the cell body, then allow small perturbations of the angles drawn from a truncated normal distribution N[−0.22,0.22](0, 0.125²) · 2π; the truncation means that only samples between −0.22 and 0.22 are accepted. After this step, we connect the neurites and the cell body by removing some points of the cell body outline while joining in the end points of the neurites, so that the result is a valid ordered outline array.
• Post-processing. Lastly, we introduce random rotation and size variation. The rotation angle follows U(0, 2π) and the scale factor follows N(0.5, 0.5²). We translate the coordinates by [256.5, 256.5] and convert them to a binary image of size 512 × 512, so that the shapes are centered.
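The cell-body sampling step above can be sketched as follows; this is a minimal numpy illustration in which the convention for joining the two half-ellipses and the function name are our assumptions, and the spline morphing/smoothing step is omitted:

```python
import numpy as np

def sample_cell_body(n_points=360, rng=None):
    """Sample a cell body outline as two half-ellipses in radial coordinates.

    Parameters follow the text: shared semiminor axis b ~ U(10, 20),
    semimajor axes a1, a2 ~ U(10, 20), plus Gaussian coordinate noise
    with standard deviation 0.1. Spline smoothing is omitted for brevity.
    """
    rng = np.random.default_rng() if rng is None else rng
    b = rng.uniform(10, 20)
    a1, a2 = rng.uniform(10, 20, size=2)
    theta = np.linspace(0, 2 * np.pi, n_points, endpoint=False)
    # The halves share the semiminor axis b; each half (split along the
    # minor axis) gets its own semimajor axis a1 or a2 (our convention).
    a = np.where(np.cos(theta) >= 0, a1, a2)
    x = a * np.cos(theta)
    y = b * np.sin(theta)
    # Gaussian noise with standard deviation 0.1 on the coordinates.
    x += rng.normal(0, 0.1, n_points)
    y += rng.normal(0, 0.1, n_points)
    return np.stack([x, y], axis=1)

outline = sample_cell_body(rng=np.random.default_rng(0))  # (360, 2) outline
```

The subsequent steps (control-point morphing, neurite attachment, rasterization) would operate on the returned ordered outline array.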

Detailed simulation process for the SNL and SNL NR2 3D datasets
The simulation process was: • Sampling of the central slice. A central slice is sampled from the SNL 2D simulator, as described above. For the SNL 3D dataset, the neurite number ranges randomly from 0 to 2; for the SNL NR2 3D dataset, the neurite number is fixed at 2.
• Sampling of upper slices. Starting from the central slice, image erosion is applied to the current slice with a disk kernel of random size. The kernel size is sampled from 1-3 with probabilities 0.45, 0.35, 0.2 for the first five slices, and from 2-5 with probabilities 0.3125, 0.2625, 0.2625, 0.1625 after the first five slices. Erosion stops once the area of the current slice is equal to or smaller than 500 pixels.
• Sampling of bottom slices. As for the upper slices, starting from the central slice, image erosion or dilation is applied to the current slice with a disk kernel of random size; erosion and dilation are each chosen with probability 0.5. The kernel size is sampled from 0-3 with probabilities 0.25, 0.45, 0.2, 0.1 for the first ten slices, and from 0-4 with probabilities 0.25, 0.25, 0.2, 0.2, 0.2 after the first ten slices. The process stops once the ratio of the area of the current slice to that of the central slice leaves the range 0.85-1.15. The 3D image is then assembled by stacking the bottom slices, the central slice and the upper slices.
• Adding random noise. Seed positions are randomly sampled with probability 0.005 per pixel. Image dilation with a spherical kernel of size 1 is then applied to the seeds to obtain a random noise image; the same process is repeated to obtain a second noise image. The first noise image is added to the cell image and the second is subtracted. The cell image with noise is then binarized by thresholding at positive values, only the largest connected component is kept, and holes are filled.
• Image cropping and resizing. The cell image is cropped to 24 slices by keeping the middle 24 slices. Finally the image is resized to 256 × 256 × 24.

Definition of residual blocks

In the main paper we briefly described the network architecture without introducing the residual network blocks. Here we show the detailed structure of the residual network blocks for both the encoder and decoder networks in the following three tables.

Residual block type 1 in the encoder network (EB1); output B7 has tensor size m × m × n. Here B1-B8 are layer names used to distinguish the layers (not the same as in the implementation); layer types use names similar to those in TensorLayer. Input and Output are not actual layers, but mark the start and end of the flow. Parent Layer is the layer the current layer connects to, and Tensor size is the output size.
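The flow of an EB1-style block can be sketched at the shape level in numpy; this is our illustration, with average pooling standing in for the strided TensorLayer convolution and 1×1 channel projections standing in for learned filters:

```python
import numpy as np

def eb1_like_block(x, w_main, w_skip):
    """Shape-level sketch of an EB1-style encoder residual block.

    x: (2m, 2m, n0) input; w_main, w_skip: (n0, n) channel-mixing weights.
    The real block uses strided convolutions; here the 2x2 downsampling is
    average pooling and channel mixing is a 1x1 projection, which preserves
    the residual structure (main path plus projected shortcut, summed).
    """
    m = x.shape[0] // 2
    # 2x2 average pooling: (2m, 2m, n0) -> (m, m, n0)
    pooled = x.reshape(m, 2, m, 2, -1).mean(axis=(1, 3))
    main = np.maximum(pooled @ w_main, 0.0)   # main path with ReLU
    skip = pooled @ w_skip                    # projected shortcut
    return np.maximum(main + skip, 0.0)       # output: (m, m, n)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 4))                # 2m = 8, n0 = 4
y = eb1_like_block(x, rng.normal(size=(4, 16)), rng.normal(size=(4, 16)))
```

As the caption states, the block maps a 2m × 2m × n0 input to an m × m × n output.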
Here 2m × 2m is the abstract input size, n0 is the filter number of the previous block and n is the filter number of the current block. The block downsamples the image by 2 × 2 and changes the filter number from n0 to n. The Miscellaneous column gives detailed explanations.

Residual block type 2 in the encoder network (EB2); output B7 has tensor size m × m × n. The basic structure is similar to EB1, except that there is no downsampling and there is no padding in layer B2, so the image size changes. Here (m + 2h) × (m + 2h) is the abstract input size, where m is the output image size of the block and 2h is the size removed by the convolution, with h = ⌈(k − 1)/2⌉ for filter size k. n0 is the filter number of the previous block and n is the filter number of the current block. The Miscellaneous column gives detailed explanations.
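The input/output size relation for an EB2-style block follows directly from the definition of h; a minimal helper (the function name is ours) makes the bookkeeping concrete:

```python
import math

def eb2_io_sizes(m, k):
    """Input and output spatial sizes for an EB2-style block.

    With no padding and no downsampling, a 'valid' convolution with filter
    size k removes h = ceil((k - 1) / 2) pixels from each border, so an
    output of size m x m requires an input of size (m + 2h) x (m + 2h).
    """
    h = math.ceil((k - 1) / 2)
    return m + 2 * h, m

in_size, out_size = eb2_io_sizes(m=60, k=3)  # h = 1, so input is 62 x 62
```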

Residual block in the decoder network (DB1); output tensor size 2m × 2m × n. The table has almost the same interpretation as the previous table for the encoder network, with some differences: in layer B2 there is no downsampling in the convolutional layer; instead there is an additional layer after layer B7 that upsamples the image from m × m to 2m × 2m.

Network structure for vanilla autoencoders for the CYTO, HPA CL and SNL datasets
These three datasets share the same network structure, which is defined as follows: Encoder network for the CYTO, HPA CL and SNL datasets for vanilla autoencoders. l is the desired latent dimension. The input image size is 512 × 512. For the filter sizes, the first two numbers are the width and height of the filter and the third is the number of filters. The output size has the same format: width, height and number of channels. The layers/blocks are ordered from left to right in the network (the same below).
Using the output of the encoder network as input, the decoder network is defined as: Decoder network for the CYTO, HPA CL and SNL datasets for vanilla autoencoders. l is the desired latent dimension.

Definition of residual blocks for 3D networks
In the main paper we briefly described the network architecture without introducing the residual network blocks. Here we show the detailed structure of the residual network blocks for both the 3D encoder and decoder networks in the following tables.

Residual block in the 3D encoder network (EB3D); output B7 has tensor size m × m × t × n. Here B1-B8 are layer names used to distinguish the layers (not the same as in the implementation); layer types use names similar to those in TensorLayer. Input and Output are not actual layers, but mark the start and end of the flow. Parent Layer is the layer the current layer connects to, and Tensor size is the output size.
Here am × am × bt is the abstract input size, n0 is the filter number of the previous block and n is the filter number of the current block. The block downsamples the image by a × a × b and changes the filter number from n0 to n. The Miscellaneous column gives detailed explanations.

Residual block in the 3D decoder network (DB3D); output tensor size am × am × bt × n. The table has almost the same interpretation as the previous table for the encoder network, with some differences: in layer B2 there is no downsampling in the convolutional layer; instead there is an additional layer after layer B7 that upsamples the image from m × m × t to am × am × bt.
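The size change performed by the extra upsampling layer after B7 can be illustrated with nearest-neighbour repetition; this is a sketch standing in for the actual upsampling layer in the implementation:

```python
import numpy as np

def upsample3d(x, a, b):
    """Nearest-neighbour upsampling of an (m, m, t, n) tensor to
    (a*m, a*m, b*t, n), illustrating the size change after layer B7
    in a DB3D-style decoder block. Simple repetition is used here as
    the minimal equivalent of an upsampling layer."""
    x = np.repeat(x, a, axis=0)   # width:  m  -> a*m
    x = np.repeat(x, a, axis=1)   # height: m  -> a*m
    x = np.repeat(x, b, axis=2)   # depth:  t  -> b*t
    return x

y = upsample3d(np.ones((4, 4, 3, 8)), a=2, b=2)  # -> shape (8, 8, 6, 8)
```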

Network structure for vanilla autoencoders for the HeLa, SNL 3D and SNL NR2 3D datasets
These three datasets share the same network structure, which is defined as follows: Encoder network for the 3D datasets for vanilla autoencoders. l is the desired latent dimension. The input image size is 512 × 512. For the filter sizes, the first two numbers are the width and height of the filter and the third is the number of filters. The output size has the same format: width, height and number of channels.
Using the output of the encoder network as input, the decoder network is defined as: Decoder network for the 3D datasets for vanilla autoencoders. l is the desired latent dimension.

Network structure for variational autoencoders
The basic network structures of the encoders and decoders for variational autoencoders are mostly the same as those of the corresponding vanilla autoencoders, except that two fully-connected layers produce the mean and the log of the standard deviation, following the "reparameterization trick" of variational autoencoders [3]. The input to the decoder is the latent sample drawn with the reparameterization trick.
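The reparameterization trick can be sketched in a few lines of numpy; here mu and log_std stand for the outputs of the two fully-connected layers, with illustrative batch size and latent dimension:

```python
import numpy as np

def reparameterize(mu, log_std, rng):
    """Reparameterization trick: sample z = mu + sigma * eps with
    eps ~ N(0, I), so that gradients can flow through mu and log_std
    during training."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(log_std) * eps

rng = np.random.default_rng(0)
mu = np.zeros((100, 14))            # batch of 100, latent dimension 14
log_std = np.full((100, 14), -1.0)  # sigma = exp(-1) for every unit
z = reparameterize(mu, log_std, rng)  # this sample is the decoder input
```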

Network structure for outline autoencoder
The basic structure is a stack of blocks, each consisting of a fully-connected layer, a batch normalization layer and a ReLU activation layer. The encoder and the decoder each have four such blocks. The numbers of units in the blocks are 2,000, 1,000, 500 and 300 for the encoder, and the reverse order for the decoder. In the encoder, the input outlines (SPHARM descriptors) are reshaped into vectors as input to the blocks, and the output of the last block feeds a fully-connected layer that generates the encoding with the given latent dimension. In the decoder, the encoding is the input to the blocks, and the output of the last block feeds a fully-connected layer whose outputs are reshaped to the same shape as the input outlines.
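This encoder/decoder flow can be sketched in numpy as follows; batch normalization is omitted, the weights are random stand-ins for trained parameters, and the descriptor length d_outline is chosen purely for illustration:

```python
import numpy as np

def mlp_forward(x, sizes, rng):
    """Forward pass through fully-connected + ReLU blocks (batch
    normalization omitted for brevity). `sizes` lists the unit counts
    of consecutive layers; weights are random placeholders here."""
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        w = rng.normal(0, 1 / np.sqrt(n_in), size=(n_in, n_out))
        x = np.maximum(x @ w, 0.0)
    return x

rng = np.random.default_rng(0)
d_outline = 578                      # flattened descriptor length (illustrative)
latent = 14
x = rng.normal(size=(5, d_outline))  # batch of 5 flattened outlines

# Encoder blocks 2000 -> 1000 -> 500 -> 300, then a linear layer to the code.
code = mlp_forward(x, [d_outline, 2000, 1000, 500, 300], rng) @ rng.normal(
    0, 1 / np.sqrt(300), size=(300, latent))

# Decoder mirrors the blocks in reversed order, then a linear output layer.
recon = mlp_forward(code, [latent, 300, 500, 1000, 2000], rng) @ rng.normal(
    0, 1 / np.sqrt(2000), size=(2000, d_outline))
```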

Parameter settings for training
Deep autoencoders and variational autoencoders share almost the same hyperparameter settings. The optimizer for the deep autoencoders, variational autoencoders and the outline autoencoder is the Adam optimizer [2], and the Xavier initializer is applied [1]. The batch size is 100 for the 2D autoencoders. The learning rate is around 1e-3 to 5e-4 at the beginning, is halved around 100-150 epochs before stopping, and is finally reduced to around 1e-5 about 50 epochs before stopping. The details are given in the corresponding scripts in the package.

Table S3: Reconstruction errors of cell and nuclear shapes in joint modeling. The table shows two latent dimensions, 14 and 200, for the four 2D datasets and four representative methods. 'sep', 'joint' and 'united' denote separate, joint and united models for the corresponding methods, as described in Methods.