Philipp M Pelz, Alexander Rakowski, Luis Rangel DaCosta, Benjamin H Savitzky, Mary C Scott, Colin Ophus, A Fast Algorithm for Scanning Transmission Electron Microscopy Imaging and 4D-STEM Diffraction Simulations, Microscopy and Microanalysis, Volume 27, Issue 4, 1 August 2021, Pages 835–848, https://doi.org/10.1017/S1431927621012083
Abstract
Scanning transmission electron microscopy (STEM) is an extremely versatile method for studying materials on the atomic scale. Many STEM experiments are supported or validated with electron scattering simulations. However, using the conventional multislice algorithm to perform these simulations can require extremely large calculation times, particularly for experiments with millions of probe positions, as each probe position must be simulated independently. Recently, the plane-wave reciprocal-space interpolated scattering matrix (PRISM) algorithm was developed to reduce calculation times for large STEM simulations. Here, we introduce a new method for STEM simulation: partitioning of the STEM probe into "beamlets," given by a natural neighbor interpolation of the parent beams. This idea is compatible with PRISM simulations and can lead to even larger improvements in simulation time, as well as requiring significantly less computer random access memory (RAM). We have performed various simulations to demonstrate the advantages and disadvantages of partitioned PRISM STEM simulations. We find that this new algorithm is particularly useful for 4D-STEM simulations of large fields of view. We also provide a reference implementation of the multislice, PRISM, and partitioned PRISM algorithms.
Introduction
Transmission electron microscopy is a powerful tool for studying atomic-scale phenomena due to its unmatched spatial resolution and ability to perform imaging, diffraction, and multiple types of spectroscopic measurements (Egerton et al., 2005; Carter & Williams, 2016; Zuo & Spence, 2017). Scanning TEM (STEM) is a particularly versatile TEM technique, as the STEM probe size can be tuned to any desired experimental length scale, from sub-Ångstrom to tens of nanometers, to best match the length scale of the structures being probed (Pennycook & Nellist, 2011). The size of the probe is also completely decoupled from the step size between adjacent probe positions, allowing experimental fields of view up to almost 1 mm² (Kuipers et al., 2016). Advances in detector technology have led to high-speed electron cameras capable of recording full 2D images of the diffracted STEM probe with microsecond-scale dwell times, which has led to many experiments that record the full four-dimensional (4D) dataset, in a family of methods called 4D-STEM (Ophus, 2019). In parallel, the rise of powerful computational methods has enabled measurements of many different material properties with high statistical significance (Spurgeon et al., 2021).
The combination of computational methods and advanced STEM experimentation has led to atomic-resolution 3D tomographic reconstructions (Yang et al., 2017), measurements of highly beam-sensitive samples over functional length scales (Panova et al., 2019), images of samples with resolution better than the diffraction limit (Chen et al., 2021), and many other advances in STEM imaging techniques. Many of the technique developments and validation of these experiments make heavy use of electron scattering simulations. The application of data-intensive machine learning methods to STEM experiments can also be aided by simulations (Kalinin et al., 2021).
It is possible to simulate the propagation and scattering of STEM probes through a material by directly computing the Bloch wave eigenstates of the electron scattering matrix (S-matrix) (Bethe, 1928). The Bloch wave method can be employed in diffraction simulations (Zuo et al., 2021), but it is only practical for small, periodic unit cells. The majority of STEM simulations performed today implement the multislice method (Cowley & Moodie, 1957). The multislice method is typically applied in STEM simulations by performing a separate quantum-mechanical electron scattering simulation for each probe position (Croitoru et al., 2006; Okunishi et al., 2012; Ophus et al., 2016). The multislice algorithm can therefore require extremely large computation times when simulating STEM experiments, which can contain 1,000² probe positions or even more. To alleviate this issue, various authors have implemented parallelized simulation codes that make use of multiple central processing unit (CPU) or graphics processing unit (GPU) resources (Grillo & Rotunno, 2013; Allen et al., 2015; Hosokawa et al., 2015; Van den Broek et al., 2015; Kirkland, 2016; Lobato et al., 2016; Oelerich et al., 2017; Barthel, 2018; Radek et al., 2018).
It is possible to perform large STEM simulations more efficiently by computing them as a superposition of plane waves (Chen et al., 1995). This idea was developed into an efficient simulation algorithm by Ophus (2017), who named it the plane-wave reciprocal-space interpolated scattering matrix (PRISM) algorithm. In the PRISM algorithm, the S-matrix elements are directly computed by multislice simulations. The equivalence of the Bloch wave S-matrix and multislice simulation outputs has been investigated in detail by Allen et al. (2003) and Findlay et al. (2003). The PRISM algorithm has been implemented in multiple simulation codes (Pryor et al., 2017; Brown et al., 2020a; Madsen & Susi, 2020). It has also been extended to a double-S-matrix formalism, which can provide an even higher speed boost relative to multislice for inelastic scattering, such as in STEM electron energy loss spectroscopy (STEM-EELS) simulations (Brown et al., 2019).
The PRISM algorithm achieves large decreases in calculation times by reducing the sampling of the probe wavefunction in reciprocal space and is highly accurate when the detector configuration is given by large monolithic regions. However, PRISM simulations are less accurate where fine details in the STEM probe and diffracted Bragg disks are necessary, for example, in Juchtmans et al. (2015), Hubert et al. (2019), and Zeltmann et al. (2020). A different form of interpolation has been proposed by Pelz et al. (2020a), where the STEM probe is partitioned into different beams by the interpolation of basis functions constructed from the initial STEM probe. This partitioning of the probe has been shown to be a highly efficient and accurate representation of dynamical scattering of the STEM probe in experimental data and is fully compatible with the PRISM algorithm (Pelz et al., 2020b).
In this manuscript, we introduce the partitioned PRISM algorithm for use in STEM simulations. We describe the theory of multislice, PRISM, and partitioned PRISM simulations and provide a reference implementation of these algorithms. We show that beam partitioning simulations provide an excellent trade-off between calculation time and accuracy by measuring the error of diffracted STEM probes with respect to multislice simulations as a function of the number of included beams. We also use this method to simulate the full field of view for a common experimental geometry, a metal nanoparticle resting on an amorphous substrate. These simulations demonstrate that the partitioned PRISM method can produce accuracy for coherent diffraction comparable to PRISM simulations, but with much lower calculation times and lower random access memory (RAM) usage. This is important since many PRISM simulations are constrained by the available RAM of a GPU to hold the S-matrix. Finally, we demonstrate the utility of this method in 4D-STEM simulations by simulating the full 4D dataset of an extremely large (512² probes, 4.6 million atoms) sample cell and measuring the sample strain, where the partitioned PRISM algorithm provides superior performance to a PRISM simulation using roughly the same total calculation time.
Theory
We briefly outline the required steps of previously published TEM simulation methods here and refer readers to Kirkland (2020) for more information. We also describe only the scattering of the electron beam while passing through a sample; the probe-forming optics and the microscope transfer function mathematics are described in many other works (Williams & Carter, 2009; Spence, 2013; Kirkland, 2020).
Elastic Scattering of Fast Electrons


The Bloch Wave Algorithm



The Multislice Algorithm









The multislice simulation algorithm. (a) Calculate the projected potential slices from the atomic coordinates and lookup tables. (b) Initialize the probe wavefunction, and then alternate between propagation and transmission operators. (c) Final probe at the sample exit plane.
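To make the propagation and transmission operators concrete, the following is a minimal numpy sketch of a single multislice pass through a stack of projected potential slices. It is illustrative only: the variable names, the placeholder interaction parameter, and the band-limit choice are our assumptions, and the reference implementation accompanying this paper is written in MATLAB.

```python
import numpy as np

def multislice(psi0, pot_slices, wavelength, dz, pixel_size):
    """Propagate a wavefunction psi0 through projected potential slices by
    alternating transmission (real space) and Fresnel propagation (Fourier space)."""
    n = psi0.shape[0]
    q = np.fft.fftfreq(n, d=pixel_size)                  # spatial frequencies (1/Angstrom)
    q2 = q[:, None] ** 2 + q[None, :] ** 2
    prop = np.exp(-1j * np.pi * wavelength * q2 * dz)    # Fresnel propagator for one slice
    prop = prop * (q2 <= (0.5 * np.abs(q).max()) ** 2)   # antialiasing band limit

    sigma = 1.0   # interaction parameter (placeholder; depends on beam energy)
    psi = psi0.astype(complex)
    for v in pot_slices:                                 # v: one projected potential slice
        psi = psi * np.exp(1j * sigma * v)               # transmission
        psi = np.fft.ifft2(np.fft.fft2(psi) * prop)      # propagation to the next slice
    return psi
```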






The PRISM Algorithm for STEM Simulations



















The PRISM simulation algorithm. (a) Calculate the projected potential slices from the atomic coordinates and lookup tables. (b) Select an interpolation factor f and a maximum scattering angle |q_max|, and initialize all tilted plane waves needed for these beams. (c) Perform a multislice simulation for each beam over the full field of view and store the results in the S-matrix. (d) Compute outputs by shifting the initial STEM probes, cropping 1/f of the total field of view, and multiplying and summing all S-matrix beams.
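In code, the two PRISM stages map onto a loop over tilted plane waves followed by a crop, multiply, and sum per probe. The sketch below reuses the `multislice` routine from the previous sketch; the beam selection, normalization, and cropping index arithmetic are simplified assumptions rather than the authors' implementation.

```python
import numpy as np

def build_smatrix(beams, pot_slices, wavelength, dz, pixel_size, shape):
    """One multislice run per selected beam (a tilted plane wave); the exit
    waves stacked along the first axis form the S-matrix."""
    ny, nx = shape
    y, x = np.meshgrid(np.arange(ny), np.arange(nx), indexing="ij")
    rows = []
    for qy, qx in beams:                          # integer beam indices on the FFT grid
        plane_wave = np.exp(2j * np.pi * (qy * y / ny + qx * x / nx))
        rows.append(multislice(plane_wave, pot_slices, wavelength, dz, pixel_size))
    return np.array(rows)                         # shape: (n_beams, ny, nx)

def prism_probe(smatrix, coeffs, crop):
    """Form one STEM probe: weight each stored beam by the probe's complex
    coefficient for that beam (aperture value times the plane-wave phase for
    the probe position), sum, and keep only the cropped 1/f region."""
    sl_y, sl_x = crop                             # e.g. (slice(y0, y1), slice(x0, x1))
    probe = np.tensordot(coeffs, smatrix[:, sl_y, sl_x], axes=(0, 0))
    return np.abs(np.fft.fft2(probe)) ** 2        # diffraction-space intensity
```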
The Partitioned PRISM Algorithm
Natural Neighbor Interpolation of the Scattering Matrix
Theoretical and experimental investigations of S-matrix reconstructions have shown that, once the plane-wave tilts have been removed from all beams, the resulting matrix elements are remarkably smooth (Brown et al., 2020b; Pelz et al., 2020b; Findlay et al., 2021). We have also observed that in many PRISM simulations, the information contained in neighboring beams is very similar. These observations have inspired us to propose a new method for simulating STEM experiments: rather than computing all beams of the S-matrix with the multislice algorithm, we could instead interpolate them from a reduced set of parent beams, which are computed with the multislice algorithm in the manner described above.








The remaining tasks are then to choose an interpolation strategy to determine the weights w and to choose the set of parent beams. To maximize the flexibility in choosing the parent beams, which form the interpolation basis of the S-matrix, the chosen interpolation scheme must be able to interpolate an unstructured grid of parent beams. Here, we have chosen to employ natural neighbor interpolation (Sibson, 1981; Amidror, 2002).
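The weight construction itself is simple to express: the weight map of parent beam j is the interpolation, over every beam inside the probe-forming aperture, of an indicator function that is one on parent beam j and zero on all other parent beams. The sketch below uses SciPy's linear (Delaunay barycentric) interpolation as a stand-in, since SciPy does not provide Sibson natural neighbor interpolation out of the box; function and argument names are illustrative.

```python
import numpy as np
from scipy.interpolate import griddata

def beamlet_weights(parent_beams, beam_grid):
    """parent_beams: (n_parent, 2) parent-beam coordinates (e.g. in mrad)
    beam_grid:    (n_beams, 2) coordinates of every beam inside the aperture
    Returns an (n_parent, n_beams) array of weights that sums to one over the
    parent-beam axis."""
    n_parent = len(parent_beams)
    weights = np.zeros((n_parent, len(beam_grid)))
    for j in range(n_parent):
        indicator = np.zeros(n_parent)
        indicator[j] = 1.0                 # one on parent beam j, zero elsewhere
        w = griddata(parent_beams, indicator, beam_grid, method="linear")
        outside = np.isnan(w)              # beams outside the parent-beam hull
        if outside.any():
            w[outside] = griddata(parent_beams, indicator,
                                  beam_grid[outside], method="nearest")
        weights[j] = w
    return weights
```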
We note two additional methods which can save further computational time. First, part of the computational overhead when performing matrix multiplication of the S-matrix is the cropping operator. When using interpolation factors of f > 1 for either traditional or partitioned PRISM, this overhead can represent a significant amount of computation time due to the need for a complex indexing system to reshape a subset of the S-matrix. Thus, in many cases, simulations with f = 2 may require longer computational times than f = 1. We therefore recommend that the scaling behavior be tested in each case.







Flow chart of the beam partitioning algorithm for STEM simulation. (a) Calculate the projected potential slices from the atomic coordinates and lookup tables. (b) Partition the probe into the desired number of beams and calculate the basis functions for all beamlets. (c) Perform a multislice simulation of all beams defined by the partitioning and store the results in the compact S-matrix. (d) Construct STEM probes at all positions by multiplying the shifted initial probes by the S-matrix and then summing over all beams.
Algorithmic Steps of Partitioned PRISM Simulations
Figure 3 shows the steps of our new simulation algorithm as a flow chart. As in all of the above electron scattering simulation methods, the first step is to compute the projected potentials from the atomic coordinates. Figure 3a shows the sum of all 40 projected potential slices, each having a thickness of 2 Å.
The second step, shown in Figure 3b, is to choose a set of parent beams and then calculate the weight function of all beamlets using the desired partitioning scheme. Here, we have used two rings of triangularly tiled beams, where each ring has a constant radius and the beams are separated by 10 mrads across the 20 mrad STEM probe. The beamlet weights w are calculated using natural neighbor interpolation and are shown in Figure 3a in the top right panel. The parent beams are indicated by small red circles in the condenser aperture, and the beamlet weight distributions for each parent beam are shown in grayscale. By taking the inverse Fourier transform of each weight function, we can generate the real-space beamlet basis functions.
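For reference, the hexagonal-ring layout described above (one central beam plus concentric rings, the k-th ring carrying 6k beams at radius k times the spacing) can be generated in a few lines; for a 20 mrad probe and 10 mrad spacing it yields the 19 parent beams used here, and the same construction gives the 7, 61, 217, and 817 beam partitionings used later. This is a simplified sketch with undistorted rings, whereas the partitioning shown in the figures is distorted slightly to the circular probe geometry.

```python
import numpy as np

def ring_parent_beams(alpha_max=20.0, spacing=10.0):
    """Parent-beam positions (in mrad): a central beam plus concentric rings at
    multiples of `spacing`, the k-th ring carrying 6*k beams. With alpha_max = 20
    and spacing = 10 this returns 1 + 6 + 12 = 19 beams."""
    beams = [(0.0, 0.0)]
    for k in range(1, int(round(alpha_max / spacing)) + 1):
        radius = k * spacing
        for m in range(6 * k):
            theta = 2.0 * np.pi * m / (6 * k)
            beams.append((radius * np.cos(theta), radius * np.sin(theta)))
    return np.array(beams)
```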
Figure 3c shows the third step of the partitioning simulation algorithm, where we perform a plane-wave multislice simulation for each of the parent beams defined above. After the plane waves have been propagated and transmitted through all 40 slices, the tilt of each beam is removed. These outputs are then stored in the compact S-matrix.
Finally, we compute the intensity at each desired STEM probe position, as shown in Figure 3d. First, if we are using a PRISM interpolation factor other than f = 1, we crop out a subset of the S-matrix. Next, each beam of the S-matrix is multiplied by its beamlet basis function, and all beamlet exit waves are summed to form the complex STEM probe in real space. Finally, we take the Fourier transform of these probes and compute the intensity from the magnitude squared of the wavefunction. If we require sub-pixel shifts of the STEM probes within the cropped region of the S-matrix, we must first multiply the probe basis functions by the appropriate complex plane wave in Fourier space to achieve the desired shift. This adds some computational overhead to each probe, and so, if possible, we suggest using a potential sampling pixel size that produces a simulation image size which is an integer multiple of the spacing between adjacent STEM probes.
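Step (d) condenses to a few array operations per probe. The sketch below (illustrative names, not the MATLAB reference code) multiplies each cropped S-matrix beam by its real-space beamlet basis, sums over the parent beams, and takes the squared modulus of the Fourier transform; the optional Fourier-space phase ramp implements the subpixel shift described above.

```python
import numpy as np

def partitioned_probe(smatrix_crop, beamlet_bases, pixel_size, shift=None):
    """smatrix_crop:  (n_parent, ny, nx) cropped compact S-matrix (tilts removed)
    beamlet_bases: (n_parent, ny, nx) real-space beamlet basis functions,
                   already centered on the desired probe position
    shift:         optional (dy, dx) subpixel shift in the same units as pixel_size
    Returns the diffraction-space intensity of one STEM probe."""
    bases = beamlet_bases
    if shift is not None:
        ny, nx = bases.shape[-2:]
        qy = np.fft.fftfreq(ny, d=pixel_size)[:, None]
        qx = np.fft.fftfreq(nx, d=pixel_size)[None, :]
        ramp = np.exp(-2j * np.pi * (qy * shift[0] + qx * shift[1]))
        bases = np.fft.ifft2(np.fft.fft2(bases) * ramp)   # shift all bases at once
    probe = np.sum(smatrix_crop * bases, axis=0)          # sum over parent beams
    return np.abs(np.fft.fft2(probe)) ** 2
```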
Computational and Memory Complexity
We now approximate the computational and memory complexity of the multislice, PRISM, and partitioned PRISM algorithms. We neglect the calculation time for the sample-projected potential slices and for thermal diffuse scattering, as the added computational and memory complexity is equal for all methods. For simplicity, we assume a quadratic simulation cell with $N = N_x = N_y$ pixels per side. Each slice of the multislice algorithm requires the transmission and propagation operations in equation (8), which is $O(N^2 \log N)$, and $O(N^2)$ operations to multiply by the potential and the Fresnel propagator. For a STEM simulation with $N_p$ STEM probe positions and $H$ slices for the sample, the total multislice complexity is then $O(N_p H N^2 \log N)$ (Ophus, 2017). The complexity of the PRISM algorithm is given by $O(B H N^2 \log N + N_p B N^2 / f^2)$ (Ophus, 2017), which consists of $B$ multislice simulations, one for each of the sampled beams, and $O(N_p B N^2 / f^2)$ operations for the summation of the beams over all probe positions. For the partitioned PRISM algorithm with $P$ partitions, the complexity of the multislice calculations is $O(P H N^2 \log N)$. For the real-space summation with subpixel precision, a maximum of $O(N_p P (N^2/f^2) \log(N/f))$ operations is necessary, while for integer probe positions on the S-matrix grid, only $O(N_p P N^2 / f^2)$ operations are necessary.
The memory complexity of the multislice algorithm is the lowest, since only the current wavefunction of size $O(N^2)$ needs to be held in memory in an unparallelized implementation. All algorithms additionally need $O(N_p N^2/f^2)$ memory to store the results of the calculation if full 4D datasets are computed. The PRISM algorithm requires $O(B N^2)$ memory to store the compact S-matrix. For simulations which require a finely sampled diffraction disk, $B$ can quickly grow to $10^4$ or larger, since the number of beams scales with the square of the bright-field disk radius, such that large-scale simulations with fine diffraction disks can outgrow the available memory on many devices. The memory requirements of the partitioned PRISM algorithm scale with $O(P N^2)$. Since the number of parent beams $P$ can be chosen freely, the memory requirements of the partitioned PRISM algorithm can be adjusted to the available hardware (Table 1).
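As a quick sanity check on the $O(BN^2)$ scaling, the 15.5 GB quoted in the Results section for the f = 1 PRISM simulation of the nanoparticle cell is consistent with single-precision complex storage of its 7,377 beams on a half-resolution (antialiased) grid; the 1,024-pixel transform size assumed below is our estimate rather than a value stated in the text.

```python
# Back-of-the-envelope PRISM S-matrix memory estimate.
n_beams = 7377             # f = 1 beam count for the nanoparticle simulation
n_side = 1024 // 2         # stored side length after the antialiasing crop (assumed)
bytes_per_pixel = 8        # single-precision complex (complex64)
ram_gb = n_beams * n_side**2 * bytes_per_pixel / 1e9
print(f"{ram_gb:.1f} GB")  # ~15.5 GB
```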
Table 1. Computational and Memory Complexity of Alternatives for Computing STEM Data.

| Algorithm | Time Complexity | Memory Complexity |
|---|---|---|
| Multislice | $O(N_p H N^2 \log N)$ | $O(N^2)$ |
| PRISM | $O(B H N^2 \log N + N_p B N^2 / f^2)$ | $O(B N^2)$ |
| Partitioned PRISM, subpixel precision | $O(P H N^2 \log N + N_p P (N^2/f^2) \log(N/f))$ | $O(P N^2)$ |
| Partitioned PRISM, integer pixel precision | $O(P H N^2 \log N + N_p P N^2 / f^2)$ | $O(P N^2)$ |

$H$: number of slices, $B$: number of beams, $P$: number of beam partitions, $N_p$: number of probes, $N$: side length of field of view in pixels, $f$: interpolation factor.
Methods
All of the simulations shown in this paper were performed using a set of custom MATLAB codes. In addition to implementing the partitioned PRISM algorithm, we have also implemented both the conventional multislice and PRISM algorithms for STEM simulation, in order to make a fair comparison between the different methods. We have used a single frozen phonon configuration in all cases, in order to increase the number of features visible in diffraction space. No effort was made at performance optimization or parallelization beyond MATLAB's inline compiler optimizations.
The microscope parameters used in Figures 4, 5, and 6 were an accelerating voltage of 80 kV, a probe convergence semiangle of 20 mrad, and a pixel size of 0.1 Å. The probe was set to zero defocus at the entrance surface of the simulation cell. The projected potentials were calculated using a 3D lookup table method (Rangel DaCosta et al., 2021) using the parameterized atomic potentials given in Kirkland (2020). Slice thicknesses of 2 Å were used for all simulations, and an anti-aliasing aperture was used to zero the pixel intensities at spatial frequencies above 0.5 · qmax during the propagation step.
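For concreteness, the initial probe used here can be built as a hard aperture in Fourier space at zero defocus. The sketch below (illustrative, not the MATLAB reference code) uses the standard relativistic electron wavelength together with the 80 kV, 20 mrad, 0.1 Å parameters listed above.

```python
import numpy as np

def electron_wavelength_A(kv):
    """Relativistic electron wavelength in Angstrom for an accelerating voltage in kV."""
    m0c2 = 510.99895          # electron rest energy (keV)
    hc = 12.3984              # h*c (keV*Angstrom)
    return hc / np.sqrt(kv * (2.0 * m0c2 + kv))

def initial_probe(n, pixel_size=0.1, kv=80.0, semiangle_mrad=20.0):
    """Aberration-free probe at zero defocus: a hard aperture of the given
    convergence semiangle in Fourier space, inverse transformed to real space
    and normalized to unit total intensity."""
    lam = electron_wavelength_A(kv)                       # ~0.0418 Angstrom at 80 kV
    q = np.fft.fftfreq(n, d=pixel_size)
    q2 = q[:, None] ** 2 + q[None, :] ** 2
    aperture = (q2 <= (semiangle_mrad * 1e-3 / lam) ** 2).astype(complex)
    psi = np.fft.ifft2(aperture)
    return psi / np.sqrt(np.sum(np.abs(psi) ** 2))
```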

Individual STEM probes computed with beam partitioning. (a) Projected potential and probe position. (b) Partitioning diagram showing beamlet weights. (c) Calculated CBED intensity on a logarithmic scale. (d) Error versus multislice probe simulation. (e) Estimated calculation time of the different probe partitions as a function of the number of probes calculated.

Individual STEM probes computed with beam partitioning and PRISM f = 5 interpolation. (a) Projected potential and probe position. (b) Partitioning diagram showing beamlet weights. (c) Calculated CBED intensity on a logarithmic scale. (d) Error versus multislice probe simulation. (e) Estimated calculation time of the different probe partitions as a function of the number of probes calculated.

Simulation of STEM images using beam partitioning and PRISM interpolation. (a) Calculation time, RMS error relative to multislice, and total RAM required for the S-matrix of all simulations. The four detector configurations considered are a BF detector from 0 to 8 mrads, an ABF detector from 9 to 20 mrads, a LAADF detector from 25 to 60 mrads, and a HAADF detector from 61 to 100 mrads. Gray circles indicate the simulations where images are shown. (b) STEM images for the four detector configurations and the four simulation cases labeled below. The calculation time speed-up relative to multislice is inset into the top row, while the number of included beams is inset into the bottom row. (c) Pixel-wise errors of each image in (b) with respect to multislice simulations.
The atomic coordinates utilized for our single probe position and imaging simulations are identical to those used previously (Ophus, 2017). The structure consists of a Pt nanoparticle with a multiply-twinned decahedral structure, with screw and edge dislocations present in two of the grains. The nanoparticle measures approximately 7 nm in diameter and was tilted such that two of the platinum grains are aligned to a low index zone axis. It was embedded into an amorphous carbon support to a depth of approximately 1 nm, with all overlapping carbon atoms removed. The cell size is 10 nm × 10 nm × 8 nm and contains 57,443 total atoms. The nanoparticle coordinates were taken from Chen et al. (2013), and the amorphous carbon structure was adapted from Ricolleau et al. (2013).
The atomic coordinates of our 4D-STEM simulations were a multilayer stack of semiconductor materials inspired by the experiments of Ozdol et al. (2015). The simulation cell consists of a GaAs substrate where the Ga and As sites are randomly replaced with 10% Al and P, respectively. The multilayers are an alternating stack of GaAs doped with 10% P and pure GaAs, each 9 unit cells thick along a [001] direction. The lattice parameters of the GaAs and GaAsP were fixed at +1.5 and −1.5% relative to the substrate lattice parameter, which was set to 5.569 Å. The field of view was approximately 500 × 500 Å, and the potential pixel size and slice thicknesses were set to 0.1 and 2 Å, respectively. The cell thickness was approximately 40 Å, giving 4.6 million atoms inside the simulated volume. The STEM probe convergence semiangle was set to 2.2 mrads, the accelerating voltage was set to 300 kV, and the probe was scanned over the field of view with 2 Å step sizes, giving an output of 250 × 250 probes.
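The random substitutions described above amount to replacing a fixed fraction of one sublattice with a dopant species. A minimal sketch of that step (hypothetical function, not the authors' cell-building script):

```python
import numpy as np

def dope_sublattice(symbols, host, dopant, fraction, seed=0):
    """Randomly replace `fraction` of the atoms of species `host` with `dopant`,
    e.g. dope_sublattice(symbols, "As", "P", 0.10) for the P-doped GaAs layers."""
    rng = np.random.default_rng(seed)
    symbols = np.asarray(symbols, dtype=object).copy()
    sites = np.flatnonzero(symbols == host)
    n_swap = int(round(fraction * sites.size))
    symbols[rng.choice(sites, size=n_swap, replace=False)] = dopant
    return symbols
```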
The simulations shown in Figures 4, 5, and 7 were computed on a laptop with an Intel Core i7-10875H CPU (eight cores, 2.30 GHz) and 64 GB of DDR4 RAM operating at 2,933 MHz. The simulations shown in Figure 6 were performed on Intel Xeon E5-2698 v3 processors, using eight physical cores (16 threads) and 25 GB of RAM per simulation. The multislice 512 × 512 results were obtained by splitting the 512 × 512 array into 32 jobs of 16 × 512 positions each. The PRISM f = 2 results with 512 × 512 probe positions were obtained by splitting the array into eight jobs of 64 × 512 positions each, all using an identical precalculated S-matrix. All calculations were performed using MATLAB's single-precision complex numbers; simulation run times were estimated using built-in MATLAB functions, and memory usages were based on theoretical calculations.

Simulated 4D-STEM datasets and strain maps of a multilayer semiconductor stack. (a) PRISM simulation with f = 25 interpolation and 21 total beams. (b) Partitioned PRISM simulation with f = 5 interpolation and 19 total beams. Each simulation shows a virtual bright field image, the mean CBED image, and strain maps in the two cardinal directions. Line traces show average strain perpendicular to the layer direction.
Results and Discussion
Calculation of Individual STEM Probes
To demonstrate the accuracy of our proposed algorithm, we have performed STEM simulations of a common sample geometry: a multiply-twinned Pt nanoparticle resting on an amorphous surface. The total projected potential of this sample is plotted in Figure 4a, along with the location of a STEM probe positioned just off-center. We have tested a series of beam partitioning schemes, shown graphically in Figure 4b. The first case tested was a single beam, which is equivalent to convolving a plane-wave HRTEM simulation with the STEM probe. We have also used natural neighbor partitioning to calculate the beamlet weights for a series of concentric hexagonal rings of beams, distorted slightly to match the circular probe geometry. These simulations include partitioning the 20 mrad probe by 20, 10, 5, 2.5, and 1.25 mrads, resulting in totals of 7, 19, 61, 217, and 817 parent beams, respectively.
The calculated diffraction space intensities of the probes corresponding to the above cases are shown in Figure 4c, along with the corresponding conventional multislice simulation. We see that using a single parent beam is extremely inaccurate, reproducing only the coarsest features of the multislice simulation. However, the partitioning scheme rapidly converges to the multislice result, shown by the error images plotted in Figure 4d. The 19 beam case has errors falling roughly within 5%, while the 61 beam case drops to <2%. The calculated probe for the 217 beam case has errors on the order of <0.5%, which would likely be indistinguishable from an identical experiment due to measurement noise. Finally, the 817 beam case is essentially error-free.
We can make additional observations about the character of the errors present in the partitioning algorithm. Inside the initial probe disk and in the directly adjacent regions, the errors are roughly equally distributed in the positive and negative directions. However, at higher angles, the errors are biased in the negative direction. This indicates that the partitioning approximation is highly accurate at low scattering angles, where coherent diffraction dominates the signal, and is less accurate at high scattering angles, where thermal diffuse scattering dominates. We attribute this effect to the complex phase distribution of the pixels: at low scattering angles, adjacent beams have very similar phase distributions, which in turn makes the interpolation a good approximation. However, at high scattering angles, the phases of each pixel are substantially more random, due to thermal motion of the atoms. This means that if too few beams are used to approximate the signal, the coherent summations will tend towards zero due to the random phase factors. Thus, when using a small number of beams in partitioned STEM simulations, high-angle scattering intensities can be underestimated. In the example diffraction pattern shown in Figure 4d, we also observe that the error magnitude decays from low to high angles when very few beamlets are used, but becomes more uniform as the approximation approaches the fully sampled S-matrix. We expect the error distribution to vary with the complexity of the simulated specimen, and we recommend testing the approximation at a few selected positions with high values of the projected potential.
The estimated calculation times for these simulations are shown in Figure 4e. When calculating a single STEM probe, multislice is always fastest, because the only overhead to the calculation is computation of the projected potentials. The partitioned simulations, by contrast, require evaluation of the S-matrix, which requires the same calculation time as one STEM probe multislice propagation for each beam. However, once the S-matrix has been computed, the calculation of STEM probes via matrix multiplication becomes substantially faster than multislice. The overall simulation time therefore becomes lower than multislice if many STEM probe positions must be calculated. For the 61, 217, and 817 beam cases, these crossovers in calculation time occur at 32, 135, and 1,000 probe positions, respectively. Therefore, even for 1D line scans of STEM probe positions, the partitioning scheme is faster, and for simulations with a 2D grid of probe positions this scheme is significantly faster than multislice.
However, using the beam partitioning algorithm over the entire field of view does not exploit its full speed-up potential. The beam partitioning approximation is also compatible with the PRISM approximation: partitioning reduces the number of entries of the S-matrix in diffraction space, whereas PRISM reduces the number of entries by cropping in real space. Figure 5a shows STEM simulations that combine partitioning with a PRISM interpolation factor of f = 5. The 25-fold reduction in sampling of the STEM probes is evident in Figure 5b, where the underlying beam pixels are clearly visible in the STEM probe. The partitioning scheme used is identical to that of Figure 4b, except for the 1.25 and 2.5 mrad partitioning cases. For the 1.25 mrad partitioning, the number of parent beams exceeds the number of available beams; after removing duplicate beams, this simulation becomes equivalent to a PRISM f = 5 simulation. The 2.5 mrad partitions were changed to a diagonal grid, where every other beam is included, in order to produce a more uniform sampling of the S-matrix.
The calculated probe intensities are shown in Figure 5c, along with the corresponding multislice simulation (which was sampled on the same 25-fold reduced grid). The errors of the partitioned PRISM simulations relative to the multislice simulation are shown in Figure 5d. The resulting convergence towards zero error is essentially identical to the non-PRISM case (where f = 1). These simulations are also slightly biased towards negative errors at high scattering angles.
The estimated calculation times are plotted in Figure 5e as a function of the number of probe positions. These simulations are substantially faster than multislice. The 61, 161, and 325 beam cases cross over with multislice in calculation time at 32, 83, and 155 probe positions, respectively. If the error of the 161 beam case is within an acceptable tolerance, a 1,000 × 1,000 probe position simulation of this sample can be performed in roughly 50 min, without additional parallelization or the use of GPU or HPC resources.
Calculation of Full STEM Images
We have also simulated full STEM images with a variety of standard detector configurations, in order to demonstrate the potential of the partitioned PRISM algorithm. These simulations are shown in Figure 6 and include four radially symmetric detector configurations: a bright-field (BF) image from 0 to 8 mrads, an annular bright-field (ABF) image from 9 to 20 mrads, a low-angle annular dark field (LAADF) image from 25 to 60 mrads, and a high-angle annular dark field (HAADF) image from 60 to 100 mrads. We have performed these simulations with 512 × 512 probe positions using the multislice, PRISM, and partitioned PRISM algorithms. The PRISM simulations used interpolation factors of f = 1, 2, 4, and 8, giving total beam counts of 7,377, 1,885, 489, and 137, respectively. The partitioning schemes were those described above, where the 20 mrad STEM probe was subdivided by 10, 5, and 2.5 mrads into parent beams, in addition to the case where no partitioning was performed (i.e., the original PRISM algorithm). The number of beams for the 10, 5, and 2.5 mrad partitioning was equal to 19, 61, and 217, respectively, except for the 2.5 mrad partitioning at f = 8 interpolation, where the simulation is equivalent to PRISM (137 beams).
Figure 6a shows a summary of the results, where the root-mean-square (RMS) errors in units of probe intensity and the calculation times relative to multislice simulations are plotted. Additionally, the RAM requirements for storing the S-matrix are shown by the marker sizes. Overall, the results follow the same trend as in the previous section: using fewer beams, either in the partitioning or through a higher PRISM interpolation factor, results in a less accurate simulation in all cases. The only exception is the PRISM f = 1 simulations, which are mathematically identical to multislice simulations (Ophus, 2017). Interestingly, the PRISM f = 1 simulations are faster than f = 2, because they do not require any matrix indexing operations to crop out a portion of the S-matrix. However, f = 1 PRISM simulations also have the largest RAM requirements by a large margin, requiring 15.5 GB. This is potentially an issue for large simulations if we wish to utilize GPU resources, since RAM capacities of current GPUs are often in the range of 4–16 GB, and there is additional overhead for other arrays that must be calculated. This problem can be alleviated by streaming only part of the S-matrix into the GPU RAM (Pryor et al., 2017), but then the large speed-up afforded by performing only a single matrix multiplication per STEM probe is lost.
The BF and ABF simulation errors are shown in Figure 6a, where the partitioned simulations show a very favorable balance between calculation time and accuracy. For PRISM interpolation factors of f = 2 and f = 4, the partitioned simulations have essentially identical accuracy to the PRISM simulations, while requiring far lower calculation times and less RAM to store the S-matrix. The 5 mrad partitioning case (61 beams), for example, is 46 (f = 2) and 171 (f = 4) times faster than an equivalent multislice simulation, while having RMS errors on the order of 0.2 and 0.1%, respectively, for the 0–8 mrad BF image and on the order of 0.5 and 0.2%, respectively, for the 9–20 mrad ABF image.
For the LAADF and HAADF images shown in Figure 6a, the partitioned simulations show somewhat less favorable error scaling than the PRISM algorithm. While the calculation times are reduced by partitioning for a given PRISM interpolation factor, the errors increase roughly in inverse proportion to the number of included beams. These errors are still relatively low, however, and stay roughly constant with the interpolation factor f.
Figure 6b shows the STEM images for the f = 4 cases, including conventional PRISM and the three partitioning schemes. It is immediately evident that all images contain the same qualitative information, showing, for example, that the ABF image is far more interpretable than the BF image. Visually, the BF and ABF images appear indistinguishable from each other, with all atomic-scale features preserved across the different partitioning schemes. The LAADF and HAADF images similarly all contain the same qualitative information, and all highlight the differences between these two dark-field imaging conditions. Here, however, we can see an overall reduction of image intensity inside the nanoparticle for the partitioned simulations with fewer beams. In the LAADF case, the 19 beam image is noticeably dimmer than the other cases, and for the HAADF case both the 19 and 61 beam partitionings show reduced intensities.
To show the errors more quantitatively as a function of probe position, we have plotted the difference images with respect to a multislice simulation in Figure 6c. For the BF images, a slight offset in the overall intensities is visible, likely due to the slightly different probe and detector sampling when using f = 4 interpolation. The spatially resolved differences are very low, however, for both PRISM and the 2.5 and 5 mrad partitioning simulations. In the regions of highest scattering in the nanoparticle, some errors along the atomic planes are visible in the 10 mrad partitioned simulation. The 2.5 and 5 mrad partitioned PRISM simulations are an excellent replacement for the PRISM simulations, as they offer large calculation time speed-ups for a negligible change in the error.
In the LAADF and HAADF error images plotted in Figure 6c, the errors increase in proportion to the inverse of the number of included beams, as we observed in Figure 6a. The HAADF images show higher overall errors than the LAADF images due to the increasing randomness of the pixel phases at high scattering angles, where thermal diffuse scattering dominates the signal. The relative error is also higher at these high scattering angles, as the number of electrons that scatter to such high angles is significantly lower than the number reaching the other detector configurations. Both the LAADF and HAADF errors scale nearly linearly with the nanoparticle projected potential, which indicates that they may be tolerable for relative measurements such as comparing different thicknesses or the signal measured for different atomic species. For quantitative intensity simulations at high angles, we recommend using as many beams as possible in the partitioned PRISM algorithm.
Calculation of 4D-STEM Datasets
Many 4D-STEM experimental methods require fine enough sampling of reciprocal space to resolve the edges of the scattered Bragg disks, or fine details inside the unscattered and scattered Bragg disks (Ophus, 2019). In particular, for machine learning methods which are trained on simulated data, we want the sampling and image sizes to be as close to the experimental parameters as possible (Xu & LeBeau, 2018; Yuan et al., 2021). Here, we compare the PRISM and partitioned PRISM algorithms for 4D-STEM simulations and assess their accuracy by performing a common 4D-STEM workflow: strain mapping by measuring the Bragg disk spacing (Pekin et al., 2017).
We have simulated a 4D-STEM experiment for a multilayer stack of semiconductor materials similar to the experiments performed by Ozdol et al. (2015), as shown in Figure 7 and described above. Two simulations were performed: the first used only the PRISM algorithm with an interpolation factor of f = 25, giving 21 total beams, as shown in Figure 7a. The second combined a PRISM interpolation of f = 5 with partitioning into 19 beams, as shown in Figure 7b. These parameters were chosen to require approximately the same total calculation time (157 and 186 min for pure PRISM and partitioned PRISM, respectively). Both simulations used the same atomic potentials, which required 113 min to compute. The S-matrix calculation steps required 42 and 30 min for the pure PRISM and partitioned PRISM simulations, respectively. Finally, computing the 62,500 probe positions required 2 and 43 min for the pure PRISM and partitioned PRISM simulations, respectively.
We have used the py4DSTEM package published by Savitzky et al. (2020) to measure strain in both of the simulations, as shown in Figure 7, by fitting the positions of the Bragg disks. These strains are compared to the ideal strain, estimated by convolving the underlying lattice spacing with a Gaussian kernel with a standard deviation given by the 5 Å estimated size of the STEM probe. In the pure PRISM simulation shown in Figure 7a, there are artifacts visible in both strain maps. The strain perpendicular to the layer direction shows rapid oscillations of ±0.1%, while the strain parallel to the layer direction shows discrete steps. Both of these are due to the very small cropping box used when f = 25, which cuts off the tails of the STEM probe in this simulation. Additionally, the limited sampling of the diffraction disk edges strongly limits the achievable precision in the disk position measurements.
By contrast, the partitioned simulation shown in Figure 7b samples diffraction space five times more finely in both the x and y directions. The resulting strain maps are much flatter, and the measured strain positions agree better with the ideal measurements. This simulation demonstrates that beam partitioning combined with PRISM interpolation can make much more efficient use of the calculation time required to generate the S-matrix beams than a pure PRISM simulation. This partitioning case uses approximately the same number of beams and requires roughly the same calculation time, but is substantially more accurate at the low scattering angles used in a coherent diffraction 4D-STEM simulation. We also estimate that a multislice simulation of this same experiment would require approximately 60 days using the same simulation parameters. Even if we were to increase the beam sampling by a factor of 8, the partitioned PRISM simulation would still complete in less than a day.
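The strain measurement itself reduces to comparing fitted Bragg vectors against a reference lattice at each probe position. The sketch below is a simplified stand-in for the py4DSTEM strain workflow used above (it is not that package's API): given two measured and two reference reciprocal-lattice vectors, it returns the infinitesimal strain components.

```python
import numpy as np

def strain_from_bragg_vectors(g_meas, g_ref):
    """g_meas, g_ref: (2, 2) arrays whose rows are the measured and reference
    reciprocal-lattice vectors at one probe position.
    Returns (eps_xx, eps_yy, eps_xy)."""
    # With vectors stored as rows, the real-space deformation gradient that maps
    # the reference lattice onto the measured one is F = g_meas^-1 g_ref.
    F = np.linalg.inv(g_meas) @ g_ref
    eps = 0.5 * (F + F.T) - np.eye(2)       # infinitesimal (symmetric) strain
    return eps[0, 0], eps[1, 1], eps[0, 1]

# Example: a lattice stretched by +1.5% along x relative to the reference,
# matching the nominal strain of the multilayer stack described above.
g_ref = np.array([[1.0, 0.0], [0.0, 1.0]])
g_meas = np.array([[1.0 / 1.015, 0.0], [0.0, 1.0]])
print(strain_from_bragg_vectors(g_meas, g_ref))   # ~(0.015, 0.0, 0.0)
```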
Conclusion
We have introduced the beam partitioning algorithm for STEM simulation. This algorithm splits the STEM probe into a series of basis functions generated by natural neighbor interpolation between a set of parent beams. We construct the diffracted STEM probe by matrix multiplication of these basis functions with plane-wave multislice simulations of each parent beam, which are stored in an S-matrix that can be re-used for each new STEM probe position. We have demonstrated that the resulting algorithm converges rapidly to low error with respect to the conventional multislice algorithm, and that it is fully compatible with the PRISM algorithm for STEM simulation.
We have compared our new algorithm with multislice and PRISM simulations of a nanoparticle on an amorphous substrate. With these simulations, we have shown that, in general, partitioned beam simulations can provide the same accuracy as PRISM at low to intermediate scattering angles (where coherent diffraction dominates the signal), but with much lower calculation times and lower RAM usage. We have also shown that at high scattering angles, beam partitioning simulation accuracy is somewhat worse than that of the PRISM algorithm, though still with lower calculation times. These low calculation times may allow the partitioned PRISM algorithm to be used "in the loop" with 3D tomographic reconstruction algorithms, in order to properly model the nonlinear dependence of STEM image contrast on the underlying atomic potentials.
Finally, we have also demonstrated the utility of partitioned PRISM for simulations of large 4D-STEM datasets. We used a common sample geometry composed of a multilayer stack of semiconductor materials with varying compositions on a substrate, and performed strain mapping from the diffracted probe signals by measuring the positions of the Bragg disks and fitting a lattice. These simulations show that the partitioned PRISM algorithm is particularly well suited for performing fast simulations of large fields of view where high sampling of diffraction space is needed. We believe that our algorithm will find widespread application in simulations of very large simulated cells, such as those calculated with molecular dynamics. Our simulations also show that the beam partitioning S-matrix can efficiently represent complex three-dimensional scattering, which may make it useful for inverting experimental data efficiently.
Availability of data and materials
Reference implementations of the algorithms presented in this paper (multislice, PRISM, and partitioned beam STEM simulations) are available at https://github.com/cophus/superPRISM.
Acknowledgments
We thank Hamish Brown and Scott Findlay for useful discussions. Philipp M. Pelz and Mary C. Scott are supported by the Strobe STC Research Center (Grant No. DMR 1548924). Luis Rangel Dacosta and Benjamin H. Savitzky are supported by the Toyota Research Institute; Luis Rangel Dacosta is also supported by the Department of Energy Computational Science Graduate Fellowship. Colin Ophus acknowledges support from the Department of Energy Early Career Research Award program. Work at the Molecular Foundry was supported by the Office of Science, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility located at Lawrence Berkeley National Laboratory, operated under Contract No. DE-AC02-05CH11231. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Department of Energy Computational Science Graduate Fellowship under Award Number DE-SC0021110.