Comparisons between fast algorithms for the continuous wavelet transform and applications in cosmology: the one-dimensional case

The continuous wavelet transform (CWT) is very useful for processing signals with intricate and irregular structures in astro-physics and cosmology. It is crucial to propose precise and fast algorithms for the CWT. In this work, we review and compare four different fast CWT algorithms for the 1D signals, including the FFTCWT, the V97CWT, the M02CWT, and the A19CWT. The FFTCWT algorithm implements the CWT using the Fast Fourier Transform (FFT) with a computational complexity of O ( N log 2 N ) per scale. The rest algorithms achieve the complexity of O ( N ) per scale by simplifying the CWT into some smaller convolutions. We illustrate explicitly how to set the parameters as well as the boundary conditions for them. To examine the actual performance of these algorithms, we use them to perform the CWT of signals with different wavelets. From the aspect of accuracy, we ﬁnd that the FFTCWT is the most accurate algorithm, though its accuracy degrades a lot when processing the non-periodic signal with zero boundaries. The accuracy of O ( N ) algorithms is robust to signals with different boundaries, and the M02CWT is more accurate than the V97CWT and A19CWT. From the aspect of speed, the O ( N ) algorithms do not show an overall speed superiority over the FFTCWT at sampling numbers of N (cid:46) 10 6 , which is due to their large leading constants. Only the speed of the V97CWT with real wavelets is comparable to that of the FFTCWT. However, both the FFTCWT and V97CWT are substantially less efﬁcient in processing the non-periodic signal because of zero padding. Finally, we conduct wavelet analysis of the 1D density ﬁelds, which demonstrate the convenience and power of techniques based on the CWT. We publicly release our CWT codes as resources for the community.


INTRODUCTION
Wavelets are wave-like functions that are localized in both the real and Fourier domains.Hence, by convolving a signal under investigation with the dilated or contracted wavelets, the local features at various scales will be extracted, and this process is called wavelet transform (WT; e.g., Daubechies 1992;Kaiser & Hudgins 1994;Addison 2017).There are two basic types of WT: discrete WT (DWT) and continuous WT (CWT).The DWT, using orthogonal wavelets, operates over coarse dyadic scales and positions.In contrast to the DWT, the CWT offers a highly redundant representation of the signal, which ensures that intricate structures or textures can be resolved quite well (Addison 2018).However, the redundancy also makes the direct computation of the CWT terribly inefficient, which requires a time complexity of O(N 2 ) per scale, where N is the number of data points.One more efficient way to implement CWT is to use the Fast Fourier Transform (FFT) with a complexity O(N log 2 N ), since the convolution in the real domain is equivalent to multiplication in the frequency domain, i.e. convolution theorem (Torrence & Compo 1998;Pérez-Rendón & Robles 2004;Press et al. 2007;Arts et al. 2022).
Since the COBE detection of the CMB anisotropy in 1992, cosmology has emerged as a precision, data-driven science (Turner 2022).The observational experiments such as the Euclid space mission (Euclid, Laureijs et al. 2011), the Dark Energy Spectroscopic Instrument (DESI, Levi et al. 2013), and the Square Kilometre Array (SKA, Bacon et al. 2020), and state-of-the art cosmological simulations such as the IllustrisTNG (Pillepich et al. 2018), the SIMBA (Davé et al. 2019), andthe MillenniumTNG (Hernández-Aguayo et al. 2022), are producing increasingly growing amounts of data, which need to be analyzed by high-performance algorithms and methods.Therefore, the fast CWT algorithms with O(N ) complexity are obviously more attractive than the FFT-based implementation of the CWT (FFTCWT) with O(N log 2 N ) complexity.Fortunately, a great effort has been made to develop fast CWT algorithms without using the FFT (e.g., Unser et al. 1994;Berkner & Wells 1997;Vrhel et al. 1997;Muñoz et al. 2002;Omachi & Omachi 2007;Arizumi & Aksenova 2019), which achieve the time complexity of O(N ) per scale.However, some O(N ) algorithms are only applicable to particular cases.For example, the algorithm of Unser et al. (1994) is restricted to integer scales, the algorithm of Berkner & Wells (1997) is only available for wavelets which are derivatives of the Gaussian function, and the algorithm of Omachi & Omachi (2007) is only applicable for polynomial wavelets.
What we need are fast CWT algorithms with no restrictions on the wavelet, and with arbitrarily fine scale resolution.Therefore, in this study, we will consider the O(N ) algorithms proposed by Vrhel et al. (1997), by Muñoz et al. (2002) and by Arizumi & Aksenova (2019).For convenience, we denote these three algorithms as the V97CWT, the M02CWT, and the A19CWT, respectively.The V97CWT is a fast recursive algorithm based on the finite impulse response (FIR) and infinite impulse response (IIR) filtering techniques with filter coefficients determined by two compactly supported auxiliary functions.The M02CWT reaches the linear complexity by decomposing both the wavelet and the signal into B-splines, and the A19CWT approximates the wavelet as piecewise polynomials and reduces the number of operations using integration by parts.
Motivated by the facts that (1) all these powerful algorithms are 1D, and (2) there is no any publicly available source code for the V97CWT, M02CWT, and A19CWT algorithms, we must conduct a systematic comparison study of them to benchmark their actual performance, which is the basis for developing high-dimensional fast CWT algorithms to analyze high-dimensional data, e.g. the 2D weak-lensing maps and 3D spatial distribution of matter.For some simple 1D functions, such as sine, cosine, and Gaussian functions, their CWTs can be evaluated by analytical calculations.So the accuracy of their numerical CWTs can be verified by the corresponding analytical results.Finally, it should be noted that the CWT for 1D signals is not trivial in astrophysics and cosmology, as it is also applicable to a wide range of scenarios, such as analyzing the light curves of astronomical sources (e.g., Tarnopolski et al. 2020;Ren et al. 2022), subtracting the foreground emission from the 21 cm signal (e.g., Gu et al. 2013;Li et al. 2019), measuring the smallscale structure in the Lyman-α forest (e.g., Lidz et al. 2010;Garzilli et al. 2012;Wolfson et al. 2021), investigating the time-frequency properties of the gravitational waves (e.g., Tary et al. 2018), characterizing the 1D density fields (e.g., da Cunha et al. 2018;Wang & He 2021;Wang et al. 2022), and so on.We publicly release the Fortran 95 implementations 1 and their Python wrappers 2 of the fast CWT algorithms described in this manuscript, in the hope that the community will use them to perform wavelet analysis of 1D signals.
The paper is organized as follows.We briefly introduce the mathematical formalism of the CWT in Section 2. We review the fast CWT algorithms in Section 3, and compare the performance between them in Section 4. We present simple applications of the 1D CWT in cosmology in Section 5. Finally, in Section 6, we summarize our main findings and present the conclusions.For convenience of the readers, in Table 1, we list the acronyms frequently used in our paper, with their meanings explained.

THE FORMALISM OF THE CONTINUOUS WAVELET TRANSFORM
The CWT W f (w, x) of a 1D real signal f (x) is defined as the convolution of f (x) with a scaled wavelet, i.e.
where w is the scale parameter with dimension of [x] −1 , and is the scaled version of the mother wavelet There are many different choices for the mother wavelet.In this study, we consider four kinds of wavelets: the cubic B-spline wavelet (CBSW, Muñoz et al. 2002), the Gaussian-derived wavelet (GDW, Wang & He 2021), the cosine-weighted Gaussian-derived wavelet (CW-GDW, Wang & He 2022), and the Morlet wavelet (MW, Addison 2017).Table 2 shows their formulas and properties, and Fig. 1 gives a graphical representation.As well known, the classical inverse CWT (ICWT) formula is a double integral over scale and space (see e.g.Addison 2017).In fact, there are simpler inverse ways.If the complex wavelet satisfies ψ(k) = 0 for k < 0 and 0 k dk, then the original signal can be reconstructed by the known Morlet formula (see e.g.Shensa 1993;Daubechies et al. 2011) as follows where Re{. ..} denotes the real part, and f = lim ´L/2 −L/2 f (x)dx is the average value of f (x) over all space.If the real wavelet satisfies ψ(x) = ψ(−x) and 0 < |K ψ | < ∞, then a single integral ICWT formula also exists, which is Note that Equation ( 5) is the generalization of the inverse formula in Wang & He (2021) and Wang et al. (2022), which holds for wavelets derived from the smoothing window function.We refer to Appendix A for the derivation of Equation ( 5).
Table 2. Four mother wavelet functions and their properties.
k dk is a constant that ensures the existence of the single integral ICWT formula, χ is the half width of the wavelet's support [−χ, χ], and cw = w/kpseu is the ratio between the wavelet scale and the corresponding pseudo Fourier frequency (see Wang et al. (2022) for the definition of cw).Note the GDW, CW-GDW and MW are not compactly supported, but decay exponentially.For these three wavelets, we set the values of χ to confirm

FAST ALGORITHMS FOR THE 1D CWT
For a discrete signal f (n) ≡ f (n∆x) with sampling interval ∆x, the CWT will be discretized in the following form: where w = w∆x is the dimensionless scale parameter.It is clear that the computation of Equation ( 6) requires N 2 multiplications and additions per scale, where N is the number of sampling points.Therefore, the high computational complexity makes this algorithm impractical for use.Next we will review four CWT algorithms with better performance, namely the FFTCWT, the V97CWT (Vrhel et al. 1997), the M02CWT (Muñoz et al. 2002), and the A19CWT (Arizumi & Aksenova 2019).

FFTCWT
If the discrete signal f (n) is periodic with period L = N ∆x, then it can be decomposed into a Fourier series as follows where the Fourier transform (FT) f (m) is defined as Choose a mother wavelet ψ(x) with the FT ψ(k) Dimensionless scales: w = wmin2 i+j/N subs , where s = iNsubs + j, Nscales = NsubsNlevs and s = 0, 1, . . ., Nscales − 1, i = 0, 1, . . ., Nlevs − 1, j = 0, 1, . . ., Nsubs − 1 . Schematic representation of the FFTCWT algorithm.We define the loop variable s = iN subs + j to merge the two nested loops (for i and j) into one single loop.
By substituting Equation ( 7) into Equation ( 6), we get where Ŵf ( w, m) = L N w f (m) ψ( 2πm N w ), and ψ(k) is the FT of the wavelet ψ(x).Clearly, as the inverse FT of the product Ŵf ( w, m), the CWT W f ( w, n) can be computed efficiently by a standard FFT routine, like the FFTW 3 we used (Frigo & Johnson 2005).
Note that it is necessary to choose a set of discrete scales to use in Equation ( 9).For the CWT, the choice of scales is arbitrary.It is convenient to discretize the scales evenly on a logarithmic scale: where wmin = cwπ/N is the largest scale.Here, the scales are first divided into N levs levels, numbered by i; then each level is divided into N subs sub-levels, numbered by j.Thus, there is a total of N scales = N levs N subs scales.The number of scale levels is determined by N levs = Nint(log 2 cw k Nyq wmin /∆x ), where kNyq is the Nyquist frequency and Nint(. ..) denotes the nearest integer, while the number of sub-levels is determined by the user to allow adjustment of the scale resolution.
For clarity, the sequence of the FFTCWT algorithm is shown as a flowchart in Fig. 2.
Determine the FIR filters qj(n) and h(n), and the IIR filter q12(n) the definitions and properties of the B-splines).By using β 3 (x), the wavelet can be approximated as where x = x/∆x is the dimensionless coordinate, and wmax is the smallest scale.By convolving the above equation with β 0 (x), we get where the FIR filter qj(n) is and the IIR filter where δ K (n) is the Kronecker delta function.
Substituting Equations ( 11) and ( 12) into Equation ( 6), we have the following equations Exploiting the two-scale relation of the B-splines, we can obtain the CWT at scales of wmax/2 i+j/N subs as follows where "[. ..] ↑2 i " denotes the insertion of 2 i − 1 zeros between each point, and h(n) is given by Equation (B5).Equations ( 15), ( 17), ( 18) and ( 20) perform the FIR filtering.Equations ( 16) and ( 19) perform the IIR filtering, please refer to Appendix C for its details.The maximum scale level N levs − 1 is determined by N levs = Nint(log 2 wmax cw π/N ).The sequence of the V97CWT algorithm is shown as a flowchart in Fig. 3.

M02CWT
The M02CWT algorithm represents both the wavelet and the input signal as B-splines.The wavelet function is expressed as where the parameter h is used to regulate the accuracy of the Bspline approximation.If the support of the wavelet ψ(x) is [−χ, χ], then the relation between h and N d is h = χ/(N d + 2).Likewise, the continuous signal fc(x) is represented by its cubic B-spline interpolant: For spline wavelets, e.g. the CBSW, the coefficient sequence d(n) can be easily obtained from its analytic form.For general wavelet functions, the sequence d(n) is calculated in the same method as c(n), which are calculated by (see Appendix C) where f (n) is the discrete input signal.Substituting Equations ( 21) and (B6) into Equation (1), we get ), and v(x) = D −4 fc(x + 2h/ w).Then by using Equations ( 22) and (B11), and considering that we are typically interested in the integer values of x, the above equation becomes Input a signal: Compute g(n) using the whole c(n) Dimensionless scales: w = w02 i+j/N subs where i = Imin, . . ., −1, 0, 1, . . ., Imax, j = 0, 1, . . ., Nsubs − 1.
Compute g(l) using c(l) at mN Exit loop

Exit loop
Exit loop where l0 is the ceiling integer of n − (n ′ − 2)h/ w − 6, and Notice that the computation of g(l) needs to calculate cumulative sum of the sequence c(l) four times, which indicates that in the case of a large amount of data, g(l) becomes increasingly inaccurate as l increases due to the limited precision of floating points.The way to alleviate this issue is to divide the sequence c(l) into many small Input a signal: Compute g(n) using the whole c(n) Dimensionless scales: w = w02 i+j/N subs where i = Imin, . . ., −1, 0, 1, . . ., Imax, j = 0, 1, . . ., Nsubs − 1.
Compute g(l) using c(l) at mN Exit loop

Exit loop
Exit loop segments and then compute g(l) locally on each segment.To do this, we set scales as follows where w0 = 2χ/N , and the scale level i takes the range of Imin = Nint(log 2 cw π/N w0 ) to Imax = Nint(log 2 cw π/2 w0 ).In the case of i < 1, we do not split c(l); whereas in the case of i ⩾ 1, we split c(l) into 2 i parts, as shown by the flowchart in Fig. 4.

A19CWT
Since the mother wavelet is zero outside the support interval [−χ, χ], the CWT at integer positions can be written as By partitioning the support duration [−χ, χ] evenly into 2Nχ intervals, i.e.
we approximate ψ(x) with cubic piecewise polynomials as shown below where . Substituting Equation (28) into Equation ( 27), we have Then we apply integration by parts to Equation ( 29) and arrive at where F (4) c (x) is the 4th antiderivative of fc(x), and In fact, Equation (30) assumes that the 3rd derivative of the wavelet, i.e. ψ ′′′ is constant on the interval [χ n ′ , χ n ′ +1 ), which can be approximated as where ψ ′′ is the 2nd derivative of the wavelet, which can be obtained analytically.Hence the coefficient α n ′ ,3 is given by By using Equations ( 22), (B7) and (B11), F c (x) can be calculated as Therefore Equation ( 30) can be expressed as where l1 is the ceiling integer of n − χ n ′ / w − 6, and the coefficient sequence g(l) is computed by Equation ( 25).By comparing Equations ( 24) and ( 35), we find that the M02CWT and A19CWT are very similar, but the theoretical derivation of the A19CWT is much simpler.To solve the accuracy issue of g(l), we adopt the same scheme as the M02CWT algorithm.The sequence of the A19CWT algorithm is shown as a flowchart in Fig. 5.

Boundary conditions
In the definition of the CWT (see Equation ( 1)), the signal is assumed to be extended to infinity.Nevertheless, in reality, the length of the analyzed signal is finite.Hence, we must make assumptions about the data outside its finite extent.Periodic boundary conditions are the most common choice.On the one hand, many cosmic fields are considered to be periodic.On the other hand, periodic boundary conditions are easy to implement.The FFTCWT inherits the attribute that the signal is assumed to be periodic in the FFT.The V97CWT imposes periodic boundary conditions on the signal by the IIR filtering (see Appendix C).For the M02CWT and A19CWT, the fact that the signal is periodic only imply that the coefficient c(l) is periodic.To ensure that g(l) is periodic, we take the following operations: 1: if i < 1 then ▷ Scale levels less than 1 2: g(0 : N − 1) ← c(0 : N − 1) 3: end for 7: else if i ⩾ 1 then ▷ Scale levels greater than or equal to 1 end for 13: end if In addition, the M02CWT and A19CWT can also easily handle signals with zero boundaries without padding zeros at scale levels of i < 1.After calculating c(l) by Equations ( C9), (C10), and (C13)-(C16), the coefficient g(l) of the signal with zero boundaries can be obtained by the following operations: 1: if i < 1 then ▷ Scale levels less than 1 2: g(−6 : N + 1) ← c(−6 : N + 1) 3: Ci ← g(N + 1) 6: end for 7: for l = l0 to l0 + 7 do ▷ Replace l0 with l1 in the A19CWT 8: if l < −6 then 9: g(l) ← 0 10: end if 14: end for 15: else if i ⩾ 1 then ▷ Scale levels greater than or equal to 1 16: Padding N 4 zeros at start and N 4 +1 zeros at end of c 17: end for 21: end if

Parameter settings
There are some unspecified parameters in the above algorithms, which are wmax, Nq, h, N d , and Nχ.In this subsection, we will discuss how to tune these parameters to make the algorithms sufficiently precise and efficient.
For the V97CWT algorithm, we define the approximation error as below the result of which is shown in Fig. 6.We find that the error AEV( wmax) increases with increasing wmax.To ensure a high precision as well as a sufficiently large scale range, we set wmax = 1.34cw, which satisfy AEV( wmax/2) = 0.1.According to Equation (13), qj(Nq + 1) = 0 yields the relationship Nq(j) = 2 j/N subs χ/ wmax − 1/2.For simplicity, we use the same value of Nq(j) at each j level, i.e.
which is the upper limit of Nq(j).
For the M02CWT algorithm, we define the approximation error as below the result of which is shown in Fig. 7. Since the cubic spline decomposition is perfectly exact to represent the CBSW with h = 1 (see Table 2), we only consider the approximation error for the GDW, CW-GDW and MW.We see that the smaller the h, the smaller the error.However, considering the efficiency of the algorithm, the value of h cannot be chosen too small.Hence, we set the value of h such that the error AEM(h) equals 5 × 10 −4 , and then N d can be determined by h = χ/(N d + 2).
For the A19CWT algorithm, we define the approximation error as below where ψpp(xn) is the piecewise polynomial function given by Equation (28).Because the CBSW is actually a cubic piecewise polynomial function with compact support width 2χ = 6 and segment width ∆χ = 1, the approximation error AEA(Nχ) should be very small at the integer multiples of 3, which is illustrated in Fig. 8. Hence for the CBSW, Nχ = 3 is the best choice.For the GDW, CW-GDW and MW, we set the value of Nχ such that the error AEA(Nχ) roughly equals 5 × 10 −4 , which is in accordance with the parameter settings of the M02CWT.
For clarity and convenience, we list the parameters and their values in Table 3.

PERFORMANCE COMPARISON BETWEEN ALGORITHMS
In the 1D case, it is easy to find functions whose CWTs can be calculated analytically by using Equation (1).Therefore, we can use their analytical results to examine the accuracy of the corresponding numerical outcomes.For instance, we here use the periodic function f1(x) with period 2π and the Gaussian function f2(x) as test signals, which are given below The two signals and their analytical CWTs are shown in Fig. 9.
For the following numerical tests, we write the double precision codes in Fortran 95 language, and compile them by gfortran 6.3.1 with the -O3 flag under the Intel Xeon CPU E5-2678 v3 @ 2.50 GHz processor with 250 GB RAM running Linux (Fedora release 24).

Accuracy comparison
To check the accuracy of these algorithms, we define the error spectrum as follows where W a f (w, x) is the analytical CWT, and W n f (w, x) is the numerical CWT.
The computation of the numerical CWT W n f (w, x) requires the sampling of the signal.For the signals f1(x) and f2(x), we take Figure 9. Left column: the periodic signal f 1 and its CWTs which are calculated analytically for different wavelets.Right column: the same as the left column but for the non-periodic signal f 2 .For comparison between different wavelets, we replace the wavelet scale w by the pseudo wavenumber kpseu (see Table 2), and keep this convention throughout the subsequent plots.N = 512 evenly spaced sample points on the intervals [0, 2π) and [−6, 6), respectively, which is sufficient to avoid the aliasing effect.The periodic boundary condition is used for f1(x), and the zero boundary condition for f2(x).Since the FFTCWT and V97CWT always assume the signals are periodic, we should pad zeros at both ends of the signal before execute the CWT of f2(x).Let Nzeros denote the number of padded zeros at each end, then it can be determined by χ and wmin: However, padding zeros takes up more computational resources and reduce the efficiency of the algorithm, which we will see in the next subsection.
In Fig. 10, we show the error spectra of the periodic signal f1(x).We see that the FFTCWT algorithm yields the highest accuracy, the error of which is less than 10 −10 %.The error of the V97CWT algorithm is between 0.01% and 1%.The errors of these two algorithms do not show any significant dependence on the kinds of wavelets.We observe that for the CBSW, the errors of both M02CWT and A19CWT are approximately between 3 × 10 −9 % and 0.003%.But for other wavelets, the errors are clearly higher than that for the CBSW, which are ranging from 0.003% to 1% in the case of the M02CWT, and from 0.1% to 10% in the case of the A19CWT.The larger error of the A19CWT may be due to the too coarse approximation in Equation ( 32).If Nχ is larger, this approximation will be more accurate, but the A19CWT will be less efficient.
In Fig. 11, we show the error spectra of the non-periodic signal f2(x).We observe that the error magnitudes of the FFTCWT (10 −8 %−0.1%) in handling the non-periodic signal f2(x) are much higher than that (10 −14 %−10 −10 %) in handling the periodic signal f1(x), whereas the error magnitudes of the other algorithms do not change much.Even so, the FFTCWT still provides the best accuracy among all algorithms for all wavelets.Only for the CBSW, the accuracy of the M02CWT and A19CWT can rival the accuracy of the FFTCWT.
In summary, the V97CWT, M02CWT, and A19CWT algorithms, which perform CWT calculations in real space, are not as precise as the FFTCWT.The reason for this is mainly that the former threes make approximations to the wavelet function to trade off the efficiency, but yet the latter does not.For the general wavelets, A19CWT yields the largest error, which is due to twice approximations, namely Equations ( 28) and (32), as stated above.However, for the special wavelet CBSW, the M02CWT and A19CWT algorithms provide a quite high accuracy owing to the fact that Equations ( 21), ( 28) and ( 32) describe the CBSW exactly.

Speed comparison
To check the actual efficiency of the algorithms, we measured the variation of their CPU time with the number of sampling points per scale.Fig. 12 shows the measurements of the periodic signal f1(x).We see that the V97CWT, M02CWT and A19CWT algorithms without using the FFT are indeed very fast and they all have the complexity of O(N ) but with different leading constants.However, they do not show a huge speed advantage over the FFTCWT at the sampling number of N ≲ 10 6 , which can be due to two reasons.On the one hand, the FFT library we used, FFTW, is very well optimized, and is the fastest free library available for computing the FFT.On the other hand, the leading constants of the V97CWT, M02CWT and A19CWT are too large.The V97CWT performances better than the M02CWT and A19CWT, due to its recursive nature.Its CPU time are comparable to that of the FFTCWT for the real wavelets.
Fig. 13 shows the measurements of the non-periodic signal f2(x).It is clearly seen that the FFTCWT and V97CWT consume much more computational time than they do in processing the periodic signal, since we pad many zeros to the signal f2(x) before executing the CWT.Only for the complex wavelet, MW, the FFTCWT still maintains the speed advantage over the other algorithms.For other wavelets, the FFTCWT has no distinct speed advantage.The CPU time consumed by the M02CWT and A19CWT do not differ by whether the signal is periodic or non-periodic.

APPLICATIONS IN COSMOLOGY
There are many 1D signals in the astrophysics and cosmology, such as the light curves of astronomical sources, the Lyman-α forest, the 21 cm signal, the gravitational waves, and the cosmic fields obtained by solving 1D perturbative equations.CBSW GDW CW-GDW MW

A19CWT
Figure 13.Same as Fig. 12, but for the measurements of the non-periodic signal f 2 (x).
RASTI 000, 000-000 (0000) by Muñoz et al. (2002), and the A19CWT proposed by Arizumi & Aksenova (2019).By the convolution theorem, the FFTCWT converts convolution calculations in the real domain into multiplications in the Fourier domain, and then returns the final result by the inverse FT (see Section 3.1 and Fig. 2).The V97CWT use the daughter wavelets with scales of wmax/2 j/N subs and approximate them with two scaling functions, i.e. the zero order and the cubic B-splines.Using the twoscale relation of the B-splines, then the CWT can be calculated recursively (see Section 3.2 and Fig. 3).The M02CWT approximate the daughter wavelet with the rescaled cubic B-spline, and interpolating the discrete input signal with the cubic B-spline.Therefore, the large wavelet convolution kernel is translated into smaller Bspline kernel (see Section 3.3 and Fig. 4).The A19CWT achieves the same purpose as the M02CWT by approximating the mother wavelet as cubic piecewise polynomials and applying integration by parts (see Section 3.4 and Fig. 5).In fact, the precision of algorithms originally mentioned in Muñoz et al. (2002) and Arizumi & Aksenova ( 2019) is terrible on small scales, and we remedy this issue in our M02CWT and A19CWT algorithms (see Appendix D).
We compare the accuracy and speed between these fast CWT algorithms in Figs.10-13 by using two specific signals.Our main findings are summarized as follows: (i) Even though for the non-periodic signal with zero boundaries, the accuracy of the FFTCWT is much lower compared to that for the periodic signal, it is still more accurate than other algorithms.(ii) When the O(N ) algorithms process the non-periodic signal with zero boundaries, the overall magnitudes of their errors do not grow larger compared to when they process the periodic signal.Hence the accuracy of them is robust to different types of signals.The M02CWT achieves the best accuracy among them.(iii) For the GDW, CW-GDW and MW, A19CWT is the least accurate algorithm.But for the CBSW, the A19CWT is just as accurate as the M02CWT, which is because both the cubic Bspline and piecewise polynomials represent the CBSW exactly.(iv) At the sampling number we consider, i.e.N ≲ 10 6 , the algorithms with the complexity of O(N ) per scale do not exhibit an overall speed advantage over the FFTCWT.Only the V97CWT with real wavelets shows a speed comparable to it.(v) For the non-periodic signal with zero boundaries, the FFTCWT and V97CWT are less efficient due to padding zeros to the signal.However, the efficiency of the M02CWT and A19CWT is not affected by the type of signals.
Therefore, the FFTCWT and V97CWT are suitable for the periodic signals.In particular, the V97CWT using real wavelets will perform better than using complex wavelets, e.g. the MW.The M02CWT is suitable for the non-periodic signals with zero boundary condition.We do not refer to the A19CWT algorithm because it is not accurate enough.
As a demonstration of the usage of the CWT, we then apply the FFTCWT to perform wavelet analysis of the 1D density fields.Acting like a "mathematical microscope", the CWT allows us to zoom in on complex structures of the density fields at various scales and locations (see Fig. 14).We also introduce the wavelet-based statistic, env-WPS, which is a bivariate function of the local density environment and the scale.As shown in Fig. 15, the env-WPS tells us that for the initial field, there is no any environment dependence of matter clustering on all scales.However, for the late time field, the matter clustering is dominated by the matter in overdense environments.
Clearly, the env-WPS contains more information about the matter clustering than the usual two-point statistics.The env-WPS can also be generalized to analyze other signals by replacing the local density environment by other attribute.
To analyze the multi-dimensional data, such as the 2D gravitational lensing maps and the 3D cosmic fields, the next natural step is to extend the 1D CWT algorithms to multi-dimensions.It is ease to develop 2D and 3D FFT-based CWT algorithms, since there are publicly available FFT libraries to use.However, multi-dimensional extensions of the rest 1D algorithms are not straightforward.In the future, we will plan to develop the fast multi-dimensional CWT algorithms without the use of FFT.

Figure 1 .
Figure 1.Top row: the CBSW, GDW, CW-GDW and MW in the real domain, at scale w = cw, i.e. kpseu = 1.Bottom row: the respective Fourier transforms of the wavelets in the top row.

Figure 8 .
Figure8.The Approximation error defined by Equation (39) for different wavelets as labeled.For the CBSW, at Nχ equal to the integer multiples of 3, the error is very tiny and nearly 10 −16 .So Nχ = 3 is the best choice for it.The gray horizontal line shows the error level of 5 × 10 −4 , and the gray vertical lines denote Nχ's values where the error reaches 5 × 10 −4 for the GDW, CW-GDW and MW from left to right.

Table 1 .
The acronyms frequently used in the paper, with their meanings explained.
The CPU time per scale of the different algorithms with different wavelets to compute the numerical CWT of the periodic signal f 1 (x).