Efficient nonparametric estimation of Toeplitz covariance matrices

A new nonparametric estimator for Toeplitz covariance matrices is proposed. This estimator is based on a data transformation that translates the problem of Toeplitz covariance matrix estimation to the problem of mean estimation in an approximate Gaussian regression. The resulting Toeplitz covariance matrix estimator is positive definite by construction, fully data-driven and computationally very fast. Moreover, this estimator is shown to be minimax optimal under the spectral norm for a large class of Toeplitz matrices. These results are readily extended to estimation of inverses of Toeplitz covariance matrices. Also, an alternative version of the Whittle likelihood for the spectral density based on the Discrete Cosine Transform (DCT) is proposed. The method is implemented in the R package vstdct that accompanies the paper.


Introduction
Estimation of covariance and precision matrices is a fundamental problem in statistical data analysis with countless applications in the natural and social sciences. Covariance matrices with a Toeplitz structure arise in the study of stationary stochastic processes, which are an important modeling tool in many applications, such as radar target detection, speech recognition, modeling of internet economic activity, electrical brain activity and the motion of crystal structures (Du et al., 2020; Roberts and Ephraim, 2000; Quah, 2000; Franaszczuk et al., 1985; Grenander and Szegö, 1958, p. 232).
The data for estimation are given as n independent and identically distributed realizations of a p-dimensional vector having a zero mean and a covariance matrix Σ with a Toeplitz structure. Here, p is assumed to grow, while n may be equal to 1 (a single realization of a stationary time series) or may tend to infinity as well. For p → ∞ and p/n → c ∈ (0, ∞], the sample (auto-)covariance matrix is known to be an inconsistent estimator of Σ in the spectral norm (see e.g., Wu and Pourahmadi, 2009; Pourahmadi, 2013, chap. 2). Therefore, tapering, banding and thresholding of the sample covariance matrix have been proposed to regularize this estimator (see Wu and Xiao, 2012; Cai et al., 2013). The optimal rate of convergence for Toeplitz covariance matrix estimators was established in Cai et al. (2013), who in particular showed that tapering and banding estimators attain the minimax optimal convergence rate over certain spaces of Toeplitz covariance matrices. Optimality of the thresholded estimator was shown only for the case n = 1 (see Wu and Xiao, 2012).
However, all of these estimators have several practical drawbacks. First, they are not guaranteed to be positive definite. To enforce positive definiteness, additional manipulations of the estimators must be performed, see e.g., Section 5 in Cai et al. (2013). Second, the data-driven choice of the tapering, banding or thresholding parameter is not trivial in practice. For n > 1, Bickel and Levina (2008) proposed a cross-validation criterion that approximates the risk of the estimator. Fang et al. (2016) compared this method with a bootstrap-based approximation of the risk in an intensive simulation study and recommended cross-validation over the bootstrap. For n = 1, to the best of our knowledge, there is no fully data-driven approach for selecting the banding, tapering or thresholding parameter. Wu and Pourahmadi (2009) suggested first splitting the time series into non-overlapping subseries and then applying the cross-validation criterion of Bickel and Levina (2008). However, the right choice of the subseries length turns out to be crucial for this approach, and no data-based method for it is available.
In this work, an alternative way to estimate a Toeplitz covariance matrix and its inverse is pursued. Our approach exploits the one-to-one correspondence between Toeplitz covariance matrices and their spectral densities. First, the given data are transformed into approximate Gaussian random variables whose mean equals the logarithm of the spectral density. Then, the log-spectral density is estimated by a periodic smoothing spline with a data-driven smoothing parameter. Finally, the resulting spectral density estimator is transformed into an estimator for Σ or its inverse. It is shown that this procedure leads to an estimator that is fully data-driven, automatically positive definite and achieves the minimax optimal convergence rate under the spectral norm over a large class of Toeplitz covariance matrices. In particular, this class includes Toeplitz covariance matrices that correspond to long-memory processes with bounded spectral densities. Moreover, the computation is very efficient, does not require iterative or resampling schemes and allows the application of any inference and adaptive estimation procedures developed in the context of nonparametric Gaussian regression.
Estimation of the spectral density from a stationary time series is a research topic with a long history. Earlier nonparametric methods are based on smoothing of the (log-)periodogram, which itself is not a consistent estimator (Bartlett, 1950; Welch, 1967; Thomson, 1982; Wahba, 1980). Another line of nonparametric methods for estimating the spectral density is based on the Whittle likelihood, which is an approximation to the exact likelihood of the time series in the frequency domain. For example, Pawitan and O'Sullivan (1994) estimated the spectral density from a penalized Whittle likelihood, while Kooperberg et al. (1995) used polynomial splines to estimate the log-spectral density function maximizing the Whittle likelihood. Recently, Bayesian methods for spectral density estimation have been proposed (see Choudhuri et al., 2004; Edwards et al., 2019; Maturana-Russel and Meyer, 2021), but these may become very computationally intensive in large samples due to posterior sampling.
The minimax optimal convergence rate for nonparametric estimators of Hölder continuous spectral densities from Gaussian stationary time series was obtained by Bentkus (1985) under the L_p norm, 1 ≤ p ≤ ∞. Only a few works on spectral density estimation show the optimality of the corresponding estimators. In particular, Kooperberg et al. (1995) and Pawitan and O'Sullivan (1994) derived convergence rates of their estimators for the log-spectral density under the L_2 norm, while neglecting the Whittle likelihood approximation error.
In general, most works on spectral density estimation do not further exploit the close connection to the corresponding Toeplitz covariance matrix estimation problem. In particular, an upper bound for the L_∞ risk of a spectral density estimator automatically provides an upper bound for the risk of the corresponding Toeplitz covariance matrix estimator under the spectral norm. This fact is used to establish the minimax optimality of our nonparametric estimator for Toeplitz covariance matrices. The main contribution of this work is to show that our proposed spectral density estimator is not only numerically very efficient, but also achieves the minimax optimal rate in the L_∞ norm, which in turn ensures the minimax optimality of the corresponding Toeplitz covariance matrix estimator.
The paper is structured as follows. In Section 2, the model is introduced and the approximate diagonalization of Toeplitz covariance matrices with the discrete cosine transform is discussed. Moreover, an alternative version of the Whittle likelihood is proposed. In Section 3, new estimators for the Toeplitz covariance matrix and the precision matrix are derived, while in Section 4 their theoretical properties are presented. Section 5 contains simulation results, Section 6 presents a real data example, and Section 7 closes the paper with a discussion. The proofs are given in the appendix to the paper.

Set up and diagonalization of Toeplitz matrices
The data are given as Y_1, …, Y_n i.i.d. ∼ N_p(0_p, Σ), where Σ = (σ_{|i−j|})_{i,j=1,…,p} is a positive definite Toeplitz covariance matrix. The sample size n may tend to infinity or may be a constant. The case n = 1 corresponds to a single observation of a stationary time series, and in this case the data are simply denoted by Y ∼ N_p(0_p, Σ). The dimension p is assumed to grow.
The spectral density function f, corresponding to a Toeplitz covariance matrix Σ = (σ_{|i−j|})_{i,j=1,…,p}, is given by
$$ f(x) = \sum_{k=-\infty}^{\infty} \sigma_{|k|}\, e^{-ikx}, \qquad x \in [-\pi, \pi], $$
so that for f ∈ L_2(−π, π) the inverse Fourier transform implies
$$ \sigma_k = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(x)\, e^{ikx}\, dx = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(x) \cos(kx)\, dx, \qquad k \in \mathbb{N}_0. $$
Hence, Σ is completely characterized by f, and the non-negativity of the spectral density function implies the positive definiteness of the covariance matrix. Moreover, the decay of the autocovariance σ_k is directly connected to the smoothness of f.
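To make this correspondence concrete, the following base-R sketch builds a Toeplitz covariance matrix from a given spectral density by numerically computing its Fourier coefficients. The AR(1)-type spectral density used here is purely illustrative and not one of the examples of the paper.

```r
# Illustrative spectral density (AR(1)-type); any bounded, even, 2*pi-periodic f works.
phi <- 0.5
f <- function(x) 1 / (1 - 2 * phi * cos(x) + phi^2)

# Autocovariances sigma_k = (1/(2*pi)) * int_{-pi}^{pi} f(x) cos(k*x) dx.
autocov <- function(k) {
  integrate(function(x) f(x) * cos(k * x), -pi, pi,
            subdivisions = 1000L)$value / (2 * pi)
}

p <- 30
sigma <- sapply(0:(p - 1), autocov)     # equals phi^k / (1 - phi^2) for this f
Sigma <- toeplitz(sigma)                # p x p Toeplitz covariance matrix
min(eigen(Sigma, symmetric = TRUE, only.values = TRUE)$values)  # > 0, since f > 0
```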
As in Cai et al. (2013), we introduce a class of positive definite Toeplitz covariance matrices with Hölder continuous spectral densities. For β = γ + α > 0, where γ is a non-negative integer and α ∈ (0, 1], the class P_β(M_0, M_1) collects positive definite Toeplitz covariance matrices whose spectral densities f are bounded by M_0 and whose γ-th derivatives f^{(γ)} are α-Hölder continuous with constant M_1. The optimal convergence rate for estimating Toeplitz covariance matrices over P_β(M_0, M_1) depends crucially on β. It is well known that the k-th Fourier coefficient of a function whose γ-th derivative is α-Hölder continuous decays at least at the order O(k^{−β}) (see Zygmund, 2002). Hence, β determines the decay rate of the autocovariances σ_k, which are the Fourier coefficients of the spectral density f, as k → ∞. In particular, this implies that for β ∈ (0, 1], the class P_β(M_0, M_1) includes Toeplitz covariance matrices corresponding to long-memory processes with bounded spectral densities, since the sequence of corresponding autocovariances is not summable.
A connection between Toeplitz covariance matrices and their spectral densities is further exploited in the following lemma.
The proof can be found in Appendix A.1. This result shows that the DCT-I matrix approximately diagonalizes Toeplitz covariance matrices and that the diagonalization error depends to some extent on the smoothness of the corresponding spectral density.
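As a numerical illustration of Lemma 1, the sketch below applies a DCT-I transform to a Toeplitz covariance matrix and inspects the off-diagonal entries. The orthonormal DCT-I convention used here is an assumption; the exact scaling of the matrix D is fixed in Lemma 1 itself.

```r
# Orthonormal DCT-I matrix of size p (one common convention; the scaling in
# Lemma 1 may differ by constants).
dct1 <- function(p) {
  j <- 0:(p - 1)
  D <- sqrt(2 / (p - 1)) * cos(pi * outer(j, j) / (p - 1))
  D[, c(1, p)] <- D[, c(1, p)] / sqrt(2)
  D[c(1, p), ] <- D[c(1, p), ] / sqrt(2)
  D
}

p <- 200
Sigma <- toeplitz(0.5^(0:(p - 1)))      # illustrative AR(1) autocovariances
D <- dct1(p)
max(abs(crossprod(D) - diag(p)))        # D is (numerically) orthogonal
A <- D %*% Sigma %*% t(D)
max(abs(A - diag(diag(A))))             # off-diagonal entries are uniformly small
range(diag(A))                          # diagonal is close to f evaluated on a grid
```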
In the spectral density literature, the discrete Fourier transform (DFT) matrix F = p^{−1/2} (e^{2πi jk/p})_{j,k=0,…,p−1}, where i is the imaginary unit, is typically employed to approximately diagonalize Toeplitz covariance matrices. Using this fact, Whittle (1957) introduced an approximation for the likelihood of a single Gaussian stationary time series (case n = 1), the so-called Whittle likelihood
$$ \ell_W(f) = -\sum_{j=1}^{\lfloor p/2 \rfloor} \Big\{ \log f(2\pi j/p) + \frac{I_j}{f(2\pi j/p)} \Big\}. \qquad (1) $$
The quantity I_j = |F_j^t Y|^2, where F_j denotes the j-th column of F, is known as the periodogram at the j-th Fourier frequency. Note that due to the symmetry of the periodogram, only ⌊p/2⌋ data points I_1, …, I_{⌊p/2⌋} are available for estimating the mean f(2πj/p), j = 1, …, ⌊p/2⌋, where ⌊x⌋ denotes the largest integer strictly smaller than x. The Whittle likelihood has become a popular tool for parameter estimation of stationary time series, e.g., for nonparametric and parametric spectral density estimation or for estimation of the Hurst exponent, see e.g., Walker (1964); Taqqu and Teverovsky (1997).
Lemma 1 yields the following alternative version of the Whittle likelihood,
$$ \ell_{D}(f) = -\frac{1}{2}\sum_{j=1}^{p} \Big\{ \log f(\pi x_j) + \frac{W_j}{f(\pi x_j)} \Big\}, \qquad (2) $$
where W_j = (D_j^t Y)^2. Note that this likelihood approximation is based on twice as many data points W_j as the standard Whittle likelihood. Thus, it allows for a more efficient use of the data Y to estimate the parameter of interest, such as the spectral density or the Hurst parameter.
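Both likelihood approximations can be computed directly, as in the following sketch: the periodogram ordinates I_j are obtained via the FFT and the statistics W_j via the DCT-I matrix from the previous sketch, and the log-likelihood terms of (1) and (2) are evaluated at the true spectral density of a simulated AR(1) series. The evaluation points πx_j on [0, π] are an assumption, since the exact design points are fixed in Lemma 1.

```r
set.seed(1)
p <- 512
phi <- 0.5
y <- as.numeric(arima.sim(list(ar = phi), n = p))   # one stationary series (n = 1)
f <- function(x) 1 / (1 - 2 * phi * cos(x) + phi^2) # its spectral density

# Whittle likelihood (1): periodogram I_j at the Fourier frequencies 2*pi*j/p.
I <- Mod(fft(y))^2 / p
j <- 1:floor((p - 1) / 2)
ll_whittle <- -sum(log(f(2 * pi * j / p)) + I[j + 1] / f(2 * pi * j / p))

# DCT-based version (2): W_j = (D_j^t y)^2 gives p rather than ~p/2 data points.
dct1 <- function(p) {                                # orthonormal DCT-I, one convention
  k <- 0:(p - 1)
  D <- sqrt(2 / (p - 1)) * cos(pi * outer(k, k) / (p - 1))
  D[, c(1, p)] <- D[, c(1, p)] / sqrt(2)
  D[c(1, p), ] <- D[c(1, p), ] / sqrt(2)
  D
}
W <- as.numeric(dct1(p) %*% y)^2
x <- (0:(p - 1)) / (p - 1)                           # assumed design points in [0, 1]
ll_dct <- -0.5 * sum(log(f(pi * x)) + W / f(pi * x))
```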
Equations (1) and (2) invite estimation of f by maximizing the (penalized) likelihood over certain linear spaces (e.g., spline spaces), as suggested e.g., in Kooperberg et al. (1995) or Pawitan and O'Sullivan (1994). However, such an approach requires well-designed numerical methods to solve the corresponding optimization problem, since the spectral density enters the second term of (1) and (2) in the denominator, which precludes a closed-form expression for the estimator and often leads to numerical instabilities. Also, the choice of the smoothing parameter becomes challenging.
Therefore, we suggest an alternative approach that allows the spectral density to be estimated as a mean in an approximate Gaussian regression. Such estimators have a closed-form expression, do not require an iterative optimization algorithm, and the smoothing parameter can be obtained with any conventional criterion. First, note that the square of a centered Gaussian random variable with variance σ^2 follows a Γ(1/2, 2σ^2) distribution. Hence, for W_j = (D_j^t Y)^2, j = 1, …, p, it follows with Lemma 1 that, approximately,
$$ W_j \sim \Gamma\{1/2,\, 2 f(\pi x_j)\}, \qquad j = 1, \dots, p, \qquad (3) $$
where Γ(a, b) denotes a gamma distribution with a shape parameter a and a scale parameter b. Note that the random variables W_1, …, W_p are only asymptotically independent. Obviously, E(W_j) = f(πx_j), j = 1, …, p, up to the diagonalization error given in Lemma 1. To estimate f from W_1, …, W_p, one could use a generalized nonparametric regression framework with a gamma distributed response, see e.g., the classical monograph by Hastie and Tibshirani (1990). However, this approach requires an iterative estimation procedure, e.g., a Newton-Raphson algorithm, with a suitable choice of the smoothing parameter at each iteration step. Deriving the L_∞ rate for the resulting estimator is also not a trivial task. Instead, we suggest employing the variance stabilizing transform of Cai et al. (2010), which converts the gamma regression into an approximate Gaussian regression. In the next section we present the methodology in more detail for a general setting with n ≥ 1.

Methodology
For Y_i ∼ N_p(0_p, Σ), i = 1, …, n, it was shown in the previous section that with Lemma 1 the data can be transformed into gamma distributed random variables W_{i,j} = (D_j^t Y_i)^2, i = 1, …, n, j = 1, …, p, where for each fixed i the random variable W_{i,j} has the same distribution as W_j given in (3). First, the transformed data points W_{i,j} are binned, that is, fewer new variables Q_k, k = 1, …, T, with T < p, are built via
$$ Q_k = \sum_{j=(k-1)p/T+1}^{kp/T} \sum_{i=1}^{n} W_{i,j}, \qquad k = 1, \dots, T. $$
Note that the number of observations in a bin is m = np/T. In Theorem 1 in Section 4, we show that setting T = p^υ for any υ ∈ ((4 − 2 min{β, 1})/3, 1) leads to the minimax optimal rate for the spectral density estimator. To simplify the notation, m is handled as an integer (otherwise, one can discard several observations in the last bin). Next, applying the variance stabilizing transform (VST) Y_k^* = log(Q_k/m)/√2, k = 1, …, T, yields observations that are approximately Gaussian, as shown in Cai et al. (2010). Since the spectral density is a function that is symmetric around zero and periodic on [−π, π], one can mirror the resulting observations to use Y_{T−1}^*, …, Y_2^*, Y_1^*, …, Y_T^* for estimation. Renumbering the observations Y_k^* and scaling the design points into the interval [0, 1) for convenience leads to an approximate Gaussian regression problem with mean function H(f), where H(y) = {φ(m/2) + log(2y/m)}/√2 and φ is the digamma function (see Cai et al., 2010). Now, the scaled and shifted log-spectral density H(f) can be estimated with a periodic smoothing spline, that is, as the minimizer over a periodic spline space of
$$ \sum_{k} \{Y_k^* - g(x_k)\}^2 + h \int_0^1 \{g^{(q)}(x)\}^2\, dx, \qquad (4) $$
where h > 0 denotes a smoothing parameter and q ∈ N is the penalty order. Denoting the resulting periodic smoothing spline estimator by \widehat{H(f)}, the spectral density estimator is obtained by inverting the transform H, i.e., f̂ = (m/2) exp{√2 \widehat{H(f)} − φ(m/2)}, and the Toeplitz covariance matrix Σ is estimated by plugging the Fourier coefficients of f̂ into the Toeplitz structure,
$$ \widehat{\Sigma} = (\hat\sigma_{|i-j|})_{i,j=1,\dots,p}, \qquad \hat\sigma_k = \frac{1}{2\pi}\int_{-\pi}^{\pi} \hat f(x)\, e^{ikx}\, dx. \qquad (5) $$
The precision matrix Ω is estimated by the inverse Fourier transform of the reciprocal of the spectral density estimator, i.e.,
$$ \widehat{\Omega} = (\hat\omega_{|i-j|})_{i,j=1,\dots,p}, \qquad \hat\omega_k = \frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{e^{ikx}}{\hat f(x)}\, dx. \qquad (6) $$
The estimation procedure for Σ̂ and Ω̂ can be summarised as follows.

Data Transformation: compute W_{i,j} = (D_j^t Y_i)^2, where D is the (p × p)-dimensional DCT-I matrix as given in Lemma 1 and D_j is its j-th column; then bin the W_{i,j}, apply the VST, estimate H(f) by the periodic smoothing spline (4), and back-transform to obtain f̂, Σ̂ in (5) and Ω̂ in (6).
All inferential procedures developed in the Gaussian regression context can also be adopted accordingly.
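For illustration, a compact base-R sketch of this pipeline is given below. It is not the vstdct implementation: the periodic smoothing spline (4) is replaced by a Fourier-domain smoother with the same roughness penalty, the smoothing parameter is chosen by generalized cross validation, and the DCT-I convention, the design points and the default choices of T and q are assumptions made for the sketch only.

```r
vst_dct_sketch <- function(Y, T_bins = floor(ncol(Y)^0.8), q = 2,
                           h_grid = 10^seq(-8, 4, length.out = 60)) {
  n <- nrow(Y); p <- ncol(Y)

  # DCT-I matrix (orthonormal convention, an assumption; Lemma 1 fixes the exact D).
  jj <- 0:(p - 1)
  D <- sqrt(2 / (p - 1)) * cos(pi * outer(jj, jj) / (p - 1))
  D[, c(1, p)] <- D[, c(1, p)] / sqrt(2); D[c(1, p), ] <- D[c(1, p), ] / sqrt(2)

  # Step 1: data transformation W_{i,j} = (D_j^t Y_i)^2.
  W <- (Y %*% D)^2

  # Step 2: binning; m = n * floor(p / T_bins) observations per bin (remainder dropped).
  b <- floor(p / T_bins); m <- n * b
  Q <- colSums(matrix(colSums(W)[1:(T_bins * b)], nrow = b, ncol = T_bins))

  # Step 3: variance stabilizing transform.
  Ystar <- log(Q / m) / sqrt(2)

  # Step 4: periodic smoother on the mirrored (even) extension. Fourier shrinkage
  # with penalty (2*pi*k)^(2q) stands in for the periodic smoothing spline (4);
  # h is chosen by generalized cross validation on a grid.
  z <- c(Ystar, rev(Ystar[2:(T_bins - 1)]))
  N <- length(z)
  freq <- pmin(0:(N - 1), N - 0:(N - 1))
  zf <- fft(z)
  gcv <- sapply(h_grid, function(h) {
    s <- 1 / (1 + h * (2 * pi * freq)^(2 * q))
    g <- Re(fft(zf * s, inverse = TRUE)) / N
    N * sum((z - g)^2) / (N - sum(s))^2
  })
  h <- h_grid[which.min(gcv)]
  s <- 1 / (1 + h * (2 * pi * freq)^(2 * q))
  ghat <- (Re(fft(zf * s, inverse = TRUE)) / N)[1:T_bins]

  # Step 5: back-transform to the spectral density scale.
  fhat <- (m / 2) * exp(sqrt(2) * ghat - digamma(m / 2))

  # Step 6: Fourier coefficients of fhat (and of 1/fhat) give Sigma-hat and Omega-hat.
  xg <- pi * (seq_len(T_bins) - 0.5) / T_bins        # assumed design points in (0, pi)
  xx <- seq(0, pi, length.out = 4 * p)
  ff <- approx(xg, fhat, xout = xx, rule = 2)$y
  sig <- sapply(0:(p - 1), function(k) mean(ff * cos(k * xx)))
  omg <- sapply(0:(p - 1), function(k) mean(cos(k * xx) / ff))
  list(f = fhat, h = h, Sigma = toeplitz(sig), Omega = toeplitz(omg))
}

# Illustration on simulated Gaussian data (settings are illustrative only).
set.seed(1)
p <- 400; n <- 5
Sigma0 <- toeplitz(0.6^(0:(p - 1)))
Y <- matrix(rnorm(n * p), n, p) %*% chol(Sigma0)
fit <- vst_dct_sketch(Y)
norm(fit$Sigma - Sigma0, type = "2")                 # spectral norm error
```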

Theoretical Properties
In this section, we study the asymptotic properties of the estimators f̂, Σ̂ and Ω̂. The results are established under the asymptotic scenario where p → ∞ and p/n → c ∈ (0, ∞], that is, the dimension p grows, while the sample size n either remains fixed or also grows, but not faster than p. This corresponds to the asymptotic scenario in which the sample covariance matrix is inconsistent. Let f̂ be the spectral density estimator defined in Section 3, i.e., f̂ = (m/2) exp{√2 \widehat{H(f)} − φ(m/2)}, where \widehat{H(f)} is the periodic smoothing spline estimator (4), m = np/T and φ is the digamma function. Furthermore, let Σ̂ be the Toeplitz covariance matrix estimator and Ω̂ the corresponding precision matrix estimator defined in equations (5) and (6), respectively. The following theorem shows that both Σ̂ and Ω̂ attain the minimax optimal rate of convergence over the class F_β.

Theorem 1. Let Y_1, …, Y_n be i.i.d. ∼ N_p(0, Σ) with Σ = Σ(f) ∈ F_β and β = γ + α > 1/2. If h > 0 is such that h → 0 and hT → ∞, then with T = p^υ for any υ ∈ ((4 − 2 min{β, 1})/3, 1) and q = max{1, γ}, the spectral density estimator f̂, the corresponding covariance matrix estimator Σ̂ and the precision matrix estimator Ω̂ attain the minimax optimal rates of convergence.

The proof of Theorem 1 can be found in Appendix A.3 and is the main result of our work. The most important part of this proof is the derivation of the convergence rate for the spectral density estimator f̂ under the L_∞ norm. In the original work, Cai et al. (2010) established an L_2 rate for a wavelet nonparametric mean estimator in a gamma regression where the data are assumed to be independent. In our work, the spectral density estimator f̂ is based on the gamma distributed data W_{i,1}, …, W_{i,p}, which are only asymptotically independent. Moreover, the mean of these data is not exactly f(πx_1), …, f(πx_p), but is corrupted by the diagonalization error given in Lemma 1. This error adds to the error that arises via binning and the VST and that describes the deviation from a Gaussian distribution, as derived in Cai et al. (2010). Finally, we need to obtain an L_∞ rather than an L_2 rate for our spectral density estimator. Overall, the proof requires different tools than those used in Cai et al. (2010).
To obtain the L_∞ rate for f̂, we first derive the corresponding rate for the periodic smoothing spline estimator \widehat{H(f)} of the transformed log-spectral density. To do so, we use a closed-form expression of its effective kernel obtained in Schwarz and Krivobokova (2016), thereby carefully treating various (dependent) errors that describe deviations from a Gaussian nonparametric regression with independent errors and mean f(πx_i). Note also that although the periodic smoothing spline estimator is obtained on T binned points, the rate is given in terms of the vector dimension p. Then, using the Cauchy-Schwarz inequality and a mean value argument, this rate is translated into the L_∞ rate for the spectral density estimator f̂. To obtain the rate for the Toeplitz covariance matrix estimator, it is enough to note that the spectral norm of a Toeplitz matrix is bounded by the supremum norm of its spectral density, so that E‖Σ̂ − Σ‖² ≤ E‖f̂ − f‖²_∞.
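For completeness, the bound used in this last step can be spelled out under the Fourier convention of Section 2: for any unit vector v ∈ R^p,
$$ v^t(\widehat{\Sigma} - \Sigma)\, v \;=\; \sum_{j,k=1}^{p} v_j v_k\, (\hat\sigma_{|j-k|} - \sigma_{|j-k|}) \;=\; \frac{1}{2\pi}\int_{-\pi}^{\pi} \{\hat f(x) - f(x)\}\, \Big|\sum_{k=1}^{p} v_k e^{ikx}\Big|^2\, dx, $$
and since (2π)^{-1} ∫_{−π}^{π} |Σ_k v_k e^{ikx}|^2 dx = Σ_k v_k^2 = 1, it follows that |v^t(Σ̂ − Σ)v| ≤ ‖f̂ − f‖_∞. Hence ‖Σ̂ − Σ‖ ≤ ‖f̂ − f‖_∞ and E‖Σ̂ − Σ‖² ≤ E‖f̂ − f‖²_∞.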

Simulation Study
In this section, we compare the performance of the proposed Toeplitz covariance estimator, denoted as VST-DCT, with the tapering estimator of Cai et al. (2013) and with the sample covariance matrix. We consider Gaussian vectors with (A) p = 5000, n = 1, (B) p = 1000, n = 50 and (C) p = 5000, n = 10, and with three examples of autocovariance structures: (1) a polynomially decaying autocovariance function, (2) an ARMA(2, 2) process and (3) a process with a Lipschitz continuous spectral density. To select the regularisation parameter for our estimator, we implemented the restricted maximum likelihood (REML) method, generalized cross validation (GCV) and the corresponding oracle versions, i.e., as if Σ were known. The tapering parameter can be selected using cross-validation (see Bickel and Levina, 2008) if n > 1, that is, under scenarios (B) and (C). For this, the n observations are divided by 30 random splits into a training set of size n_1 = 2n/3 and a test set of size n_2 = n/3. Let Σ̂_1^ν and Σ̂_2^ν be the sample covariance matrices from the ν-th split; the tapering parameter k is then estimated by minimizing over k the average distance, across the 30 splits, between the tapered sample covariance matrix Tap_k(Σ̂_2^ν) and the sample covariance matrix Σ̂_1^ν from the other part of the split, where Tap_k(·) is the tapering estimator of Cai et al. (2013) with parameter k.
If n = 1, that is, under scenario (A), Wu and Pourahmadi (2009) suggest splitting the time series Y into l non-overlapping subseries of length p/l and then proceeding as before to select the tuning parameter k. To the best of our knowledge, there is no data-driven method for selecting the parameter l. Using the true covariance matrix Σ, we selected l = 30 subseries for example 1 and l = 15 subseries for examples 2 and 3. The parameter k can then be chosen by cross-validation as above. We employ this approach under scenario (A) instead of an unavailable fully data-driven criterion and call it semi-oracle. Finally, for all three scenarios (A), (B) and (C), the oracle tapering parameter is computed by a grid search for each Monte Carlo sample as k̂_or = arg min_{k=2,3,…,p/2} ‖Tap_k(Σ̂) − Σ‖, where Σ̂ is the sample covariance matrix. To speed up the computation, one can replace the spectral norm by the ℓ_1 norm, as suggested by Bickel and Levina (2008).
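The following base-R sketch illustrates this cross-validation scheme for n > 1. The trapezoidal taper weights are of the kind used by the tapering estimator of Cai et al. (2013), while the split proportions, the number of splits and the use of an entrywise ℓ1 distance as a cheap surrogate for the matrix norm are assumptions of the sketch. For n = 1, the same function can be applied to the sample covariance matrices computed from the l non-overlapping subseries described above.

```r
# Tapering of a sample covariance matrix: weights equal 1 up to lag k/2 and
# decay linearly to 0 at lag k (trapezoidal taper).
taper <- function(S, k) {
  p <- nrow(S)
  lag <- abs(outer(1:p, 1:p, "-"))
  S * pmin(pmax(2 - 2 * lag / k, 0), 1)
}

# Random-split cross-validation for the tapering parameter k (data Y: n x p,
# rows are independent zero-mean observations).
cv_taper <- function(Y, k_grid, n_splits = 30) {
  n <- nrow(Y)
  risk <- sapply(k_grid, function(k) {
    mean(replicate(n_splits, {
      idx <- sample(n, floor(2 * n / 3))
      S1 <- crossprod(Y[idx, , drop = FALSE]) / length(idx)        # training part
      S2 <- crossprod(Y[-idx, , drop = FALSE]) / (n - length(idx)) # validation part
      sum(abs(taper(S1, k) - S2))   # entrywise l1 surrogate for a matrix norm
    }))
  })
  k_grid[which.min(risk)]
}

# Example call with an illustrative grid: k_hat <- cv_taper(Y, k_grid = seq(2, 200, by = 2))
```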
In Tables 1, 2 and 3 the results for the scenarios (A), (B) and (C) are reported. They show that the tapering and VST-DCT estimators perform overall similarly in terms of the spectral norm risk. This is not surprising, as both estimators are proved to be rate-optimal. Moreover, both the tapering and VST-DCT estimators are clearly superior to the inconsistent sample Toeplitz covariance matrix. A closer look at the numbers shows that the VST-DCT method has better constants, i.e., VST-DCT estimators have somewhat smaller errors in the spectral norm than the tapering estimators across all examples, but especially under scenario (C). The oracle estimators show similar behaviour, but are slightly less variable compared to the data-driven estimators. In general, both the tapering and VST-DCT estimators perform best for example 1, second best for example 3 and worst for example 2, which traces back to the complexity of the underlying spectral densities. In terms of computational time, both methods are similarly fast for scenarios (A) and (B). For scenario (C), the tapering method is much slower due to the multiple high-dimensional matrix multiplications in the cross-validation method. It is expected that for larger p the tapering estimator becomes considerably more computationally intensive than the corresponding VST-DCT estimator.
Table 1: (A) p = 5000, n = 1: Errors of the Toeplitz covariance matrix and the spectral density estimators with respect to the spectral and L_2 norm, respectively, as well as the average computation time of the covariance estimators in seconds for one Monte Carlo sample (last column).
Table 2: (B) p = 1000, n = 50: Errors of the Toeplitz covariance matrix and the spectral density estimators with respect to the spectral and L_2 norm, respectively, as well as the average computation time of the covariance estimators in seconds for one Monte Carlo sample (last column).
Table 3: (C) p = 5000, n = 10: Errors of the Toeplitz covariance matrix and the spectral density estimators with respect to the spectral and L_2 norm, respectively, as well as the average computation time of the covariance estimators in seconds for one Monte Carlo sample (last column).
To test how robust our approach is to deviations from the Gaussian assumption, we simulated the data from gamma and uniform distributions and conducted a simulation study for the same scenarios and examples. The results are very similar to those of the Gaussian distribution, see supplementary materials for the details.

Application to Protein Dynamics
We revisit the data analysis of protein dynamics performed in Krivobokova et al. (2012) and Singer et al. (2016). We consider data generated by a molecular dynamics simulation of a protein channel over a nanosecond time frame, split into 20 000 equidistant observations. Additionally, the diameter of the channel y_t at time t is given, measured by the distance between two centers of mass of certain residues of the protein. The aim of the analysis is to identify the collective motions of the atoms responsible for the channel opening. In order to model the response variable y_t, which is a distance, based on the motions of the protein atoms, we chose to represent the protein structure by distances between atoms and certain fixed base points instead of Euclidean coordinates. That is, at each time point t we form the vector X_t of distances d(A_{t,i}, B_j), where A_{t,i} ∈ R^3, i = 1, …, 783, denotes the i-th atom of the protein at time t, B_j ∈ R^3, j = 1, 2, 3, 4, is the j-th base point and d(·, ·) is the Euclidean distance. It can therefore be concluded that a linear model Y = Xβ + ε holds, where Y = (y_1, …, y_{20 000})^t, X = (X_1^t, …, X_{20 000}^t)^t, β ∈ R^{4·783}, ε ∈ R^{20 000}. This linear model has two specific features which are intrinsic to the problem: first, the observations are not independent over time and, second, X_t is high-dimensional at each t and only few columns of X are relevant for Y. Krivobokova et al. (2012) have shown that the partial least squares (PLS) algorithm performs exceptionally well on this type of data, leading to a low-dimensional and robust representation of proteins, which is able to identify the atomic dynamics relevant for Y. Singer et al. (2016) studied the convergence rates of the PLS algorithm for dependent observations and showed that decorrelating the data before running the PLS algorithm improves its performance. Since Y is a linear combination of columns of X, it can be assumed that Y and all columns of X have the same correlation structure.
Our goal now is to estimate Σ and compare the performance of the PLS algorithm on the original and the decorrelated data. For this purpose, we divided the data set into a training and a test set (each with p = 10 000 observations). First, we tested whether the data are stationary. The augmented Dickey-Fuller test confirmed stationarity for Y with a p-value < 0.01. The Hurst exponent of Y is 0.85, indicating moderate long-range dependence, supported by a rather slow decay of the sample autocovariances (see the grey line in the left plot of Figure 3). Therefore, we set q = 1 for the VST-DCT estimator to match the low smoothness of the corresponding spectral density. Moreover, the smoothing parameter is selected with the restricted maximum likelihood method and T = 550 bins are used. Figure 3 (black line in the left plot) confirms that the covariance matrix estimated with our VST-DCT method almost completely decorrelates the channel diameter Y on the training data set. Next, we estimated the regression coefficients β with the usual PLS algorithm, ignoring the dependence in the data. Finally, we estimated β with PLS that takes the dependence into account using our covariance estimator Σ̂. Based on these regression coefficient estimators, the prediction on the test set was calculated. The right plot of Figure 3 shows the Pearson correlation between the true channel diameter on the test set and the prediction on the same test set based on the raw (grey) and the decorrelated data (black). Obviously, the performance of the PLS algorithm on the decorrelated data is significantly better for a small number of components. In particular, with just one PLS component, the correlation between the true opening diameter on the test set and its prediction that takes the dependence in the data into account is already 0.54, while it is close to zero for PLS that ignores the dependence. Krivobokova et al. (2012) showed that the estimator of β based on one PLS component is exactly the ensemble-weighted maximally correlated mode (ewMCM), which is defined as the collective mode of atoms that has the highest probability to achieve a specific alteration of the response Y. Therefore, an accurate estimator of this quantity is crucial for the interpretation of the results and can only be achieved if the dependence in the data is taken into account.

Figure 3: On the left, sample autocovariances of the channel diameter Y (grey) and of the decorrelated Y (black), where Σ̂ is estimated with the VST-DCT method; on the right, correlation between the true values on the test data set and the prediction based on partial least squares (grey) and corrected partial least squares (black).
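A minimal sketch of the decorrelation step used here is given below: the estimated Toeplitz covariance matrix is used to whiten the response and the design before a one-component PLS fit. This is an illustration only and not the exact pipeline of Krivobokova et al. (2012) and Singer et al. (2016); the data are assumed to be centered.

```r
# Whiten a response Y (length p) and design X (p x d) with an estimated Toeplitz
# covariance Sigma_hat of the temporal dependence, then fit one PLS component.
decorrelate_pls1 <- function(Y, X, Sigma_hat) {
  e <- eigen(Sigma_hat, symmetric = TRUE)
  S_inv_half <- e$vectors %*% diag(1 / sqrt(pmax(e$values, 1e-10))) %*% t(e$vectors)
  Yd <- as.numeric(S_inv_half %*% Y)     # decorrelated response
  Xd <- S_inv_half %*% X                 # columns share the same temporal structure
  w  <- as.numeric(crossprod(Xd, Yd))    # first PLS direction, proportional to cov(X, Y)
  w  <- w / sqrt(sum(w^2))
  t1 <- as.numeric(Xd %*% w)             # first PLS score
  beta1 <- w * sum(t1 * Yd) / sum(t1^2)  # one-component PLS coefficient vector
  list(direction = w, coefficients = beta1)
}
```

Predictions on the test stretch can then be computed as X_test %*% fit$coefficients, with and without the whitening step, which is the comparison reported above.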
Estimating Σ with a tapered covariance estimator has two practical problems. First, since we only have a single realization of a time series Y (n = 1), there is no data-driven method for selecting the tapering parameter. Second, the tapering estimator turned out not to be positive definite for the data at hand. To solve the second problem, we truncated the corresponding spectral density estimator f̂_tap at a small positive value, i.e., f̂_tap^+ = max{f̂_tap, 1/log(p)} (see Cai et al., 2013; McMurry and Politis, 2010). Altogether, our proposed estimator is fully data-driven, fast even for large sample sizes, automatically positive definite and can handle certain long-memory processes.
In contrast, the tapering estimator is not data-driven and must be manipulated to become positive definite. Our method is implemented in the R package vstdct.

Discussion
In this paper, we proposed a simple, fast, fully data-driven, automatically positive definite and minimax optimal estimator of Toeplitz covariance matrices from a large class that also includes covariance matrices of certain long-memory processes.
Our estimator is derived under the assumption that the data are Gaussian. However, simulations show that the suggested approach yields robust estimators even when the data are not normally distributed. In the context of spectral density estimation, asymptotic theory for the periodogram is available for a large class of non-linear processes (see Shao and Wu, 2007). Since DFT and DCT matrices are closely related, we expect that equation (3) also holds asymptotically for these non-Gaussian time series, but consider a rigorous analysis to be beyond the scope of this paper.
In fact, our numerical experiments have even shown that if the spectral density is estimated from the regression W_j = f(πx_j) + ε_j, that is, as if the W_j were Gaussian instead of gamma distributed, then the resulting spectral density estimator has almost the same L_∞ risk (and hence the corresponding covariance matrix estimator has almost the same spectral norm risk). Of course, such an estimator would lead to wrong inference about f(πx_j), since the growing variance of W_j would be ignored.
Since our approach translates Toeplitz covariance matrix estimation into a mean estimation in an approximate Gaussian nonparametric regression, all approaches developed in the context of Gaussian nonparametric regression, such as (locally) adaptive estimation, as well as the corresponding (simultaneous) inference, can be directly applied. Bayesian tools for adaptive estimation and inference in Gaussian nonparametric regression as proposed in Serra et al. (2017) can also be employed.

A Appendix
Throughout the appendix, we denote by c, c_1, C, C_1, … generic constants that are independent of n and p. To simplify the notation, the constants are sometimes omitted and we write ≲ for "less than or equal to up to a constant".
Together, we have Some calculations show that for r = 1, . . . p Using the Taylor expansion of cot(x) for 0 < |x| < π one obtains for r = 1, . . . p where the O term does not depend on j and the hidden constant does not depend on r, p.
where the O terms do not depend on j. Since the complex exponential function is Lipschitz continuous with constant L = 1, it holds λ r = λ j + L r,j |r − j|p −1 where −1 ≤ L r,j ≤ 1 is a constant depending on r, j. Then, L r,j |r − j|c(r, j) 2 .
Since $\sum_{r=1}^{p-1} \lambda_r c(1, r)^2 = -\sum_{r=1}^{p-1} \lambda_r c(p, r)^2$, it is sufficient to consider j = 1, …, p − 1. We begin with the first sum. For shorter notation, we use k := r − 1 and l := j − 1 in the following. Then, summing the squares of the first term in (A.4) for l = 0, …, p − 2, else see Chapter 23.2 of Abramowitz and Stegun (1964) on sums of reciprocal powers. If p is even, then the residual terms are given by where φ and φ^{(1)} denote the digamma function and its derivative. If p is odd, similar remainder terms can be derived. To see that R_i(l, p) = O(p^{−1}) for i = 1, 2, 3, uniformly in l, we use that asymptotically φ(x) ∼ log(x) − 1/(2x) and φ^{(1)}(x) ∼ 1/x. The mixed term $\frac{4}{3}\sum_{k=0}^{p-2} \frac{k^2\{1-(-1)^{k-l}\}^2}{(k-l)(k+l)(2p-2)^2}$ and the squared O term $\frac{16\pi^2}{9}\sum_{k=0}^{p-2} \frac{k^2}{(2p-2)^4}$ are both of the order p^{−1}. Furthermore, since the harmonic sum diverges at a rate of log(p). Finally, λ_j = f(x_j) + O{log(p) p^{−β}} by the uniform approximation properties of the discrete Fourier series for Hölder continuous functions (see Zygmund, 2002). Altogether, we have shown the claimed expansion of (DΣD)_{j,j}, where the O terms are uniform over j = 1, …, p.
Case i ≠ j and |i − j| even. In this case, (DΣD)_{i,j} = a_i a_j \sum_{r=1}^{2p-2} λ_r {b(i, r) b(j, r) + c(i, r) c(j, r)}, since |r − j| and |r − i| are both either odd or even. Noting that b(i, r) b(j, r) = O(p^{−2}) for r ≠ i, j, we proceed similarly as before. Setting k = r − 1, l = j − 1, m = i − 1 and using that l ≠ m and |l − m| is even, one obtains where for even p the residual terms are given by If p is odd, analogous residual terms can be derived. Using similar techniques as before, one can show that the two residual terms and the remaining mixed and squared terms vanish at a rate of the order O(p^{−1}), uniformly in i, j.
Case i ≠ j and |i − j| odd. In this case, |r − i| and |r − j| are either odd and even, or even and odd. Without loss of generality, assume that |r − i| is even. Then, (DΣD)_{i,j} = a_i a_j \sum_{r=1}^{2p-2} λ_r b(i, r) c(j, r). Since b(i, ·) is an even function, c(j, ·) is an odd function and λ_r = λ_{2p−r}, it follows that (DΣD)_{i,j} = 0.

A.2 Periodic Smoothing Splines
In this section we recapitulate the results obtained in Schwarz and Krivobokova (2016) for the equivalent kernels of periodic smoothing spline estimators. Given observations from a nonparametric regression on the equidistant design x_N whose residuals ε_i satisfy standard assumptions, the periodic smoothing spline estimator f̂ is the solution to the corresponding penalized least squares problem, where h > 0 is the smoothing parameter and S_per(2q − 1, x_N) the periodic spline space. The weight function W(x, t) derived in Schwarz and Krivobokova (2016) is known as the effective kernel of a periodic smoothing spline estimator, and its explicit expression is given in terms of exponential splines Φ(·, ·) (Schoenberg, 1973) and the polynomials Q_{2q−2}(x), formally defined via Q_{2q−2} = \sum_{l=-\infty}^{\infty} \mathrm{sinc}\{\pi(x + l)\}^{2q}; for the explicit expressions see Section 4.1.1 in Schwarz (2012). A scaled version K(x, t) of W(x, t) in terms of the bandwidth parameter h is defined as K(x, t) = hW(hx, ht). We denote by \bar K the asymptotic equivalent kernel of smoothing spline estimators on R.

A.3 Proof of Theorem 1
The structure of the proof is as follows. First, we derive the L_∞ rate of the periodic smoothing spline estimator \widehat{H(f)}. Then, using the Cauchy-Schwarz inequality and a mean value argument, the convergence rate of the spectral density estimator f̂ is established. With the relationship E‖Σ̂ − Σ‖² ≤ E‖f̂ − f‖²_∞, the first claim of the theorem follows. Finally, we prove the second statement on the precision matrices.
For the sake of clarity, some technical lemmas used in the proof are listed separately in A.4.

Upper bound on the variance
Using the above representation, one can write First we reduce the supremum to a maximum over a finite number of points.
Using the inequality log(x) ≤ x a /a one can find constants x υ , C υ > 0 depending on υ but not on n, p such that log(2T ) log(p) 2 p −2β T 2β ≤ C υ p −xυ . This implies Next, we derive a bound for the second term The exponential decay property of the kernel K stated in Lemma 2 yields The first term in (A.9) can be bounded again with Lemma 1.6 of Tsybakov (2009).
We use the fact that for not necessarily independent subGaussian random variables X_1, …, X_N, the expectation of their maximum can be bounded in terms of their subGaussian norms, where N and R > 0 are constants. This is a consequence of Lemma 1 of Azuma (1967), by which \sum_i a_i X_i has a subGaussian distribution and the subGaussian norm is bounded by 2R(\sum_{i=1}^{N} a_i^2)^{1/2}. See Vershynin (2018) for further details on the subGaussian distribution.

In total, choosing an integer
gives

Upper bound on the bias
Using the representation in Lemma 4 once more gives for each x ∈ [0, 1] The bounds on k in Lemma 4 imply Consider the case that β ≥ 1. In particular, q = γ and f (q) is α-Hölder continuous.
Since f is a periodic function with f(x) ∈ [δ, M_0] and H(y) ∝ φ(m/2) + log(2y/m), it follows that {H(f)}^{(q)} is also α-Hölder continuous. Extending g := H(f) to the entire real line, we get Expanding g(t) in a Taylor series around x and using that h^{−1}K_h is a kernel of order 2q, see Lemma 2(iii), it follows that for any x ∈ [0, 1] where ξ_{x,t} is a point between x and t. Using the fact that the kernel K_h decays exponentially and that g^{(q)} is α-Hölder continuous on [δ, M_0] with some constant L, one obtains If 1/2 < β ≤ 1, then q = 1 and g is β-Hölder continuous. Since f(x) ∈ [δ, M_0] and the logarithm is Lipschitz continuous on a compact interval, it follows that g = H(f) is β-Hölder continuous. Extending g to the entire line and using Lemma 2(iii), in a similar way as before one obtains Note that T^{−β} = o(h^β), as β > 1/2, Th → ∞ and h → 0 by assumption. Since the derived bounds are uniform for x ∈ [0, 1], it holds Putting the bounds (A.13) and (A.14) together gives the claim.

A.3.2 An upper bound on E‖f̂ − f‖²_∞

Proposition 2. Let Σ ∈ F_β such that β > 1/2. If h > 0 such that h → 0 and hT → ∞, then with T = p^υ for any υ ∈ ((4 − 2 min{1, β})/3, 1), the estimator f̂ described in Section 3 with q = max{1, γ} satisfies the corresponding L_∞ risk bound.

Proof: By the mean value theorem, it holds for some function g between H(f) and H(f̂) Application of the Cauchy-Schwarz inequality with a constant C > 0 yields To show that the second term on the right hand side of (A.15) is negligible, we use the moment generating function of ‖\widehat{H(f)}‖_∞. In the next paragraph, we derive the asymptotic order of E[exp{λ‖\widehat{H(f)}‖_∞}] for n, p → ∞, where λ > 0 may or may not depend on n, p.

Moment generating function of ‖\widehat{H(f)}‖_∞
By the exponential decay property of the kernel K stated in Lemma 2, it holds First, ‖\widehat{H(f)}‖_∞ is bounded by the maximum over a finite number of points. Calculating the derivative of s, since ∂s(x)/∂x ≠ 0 almost surely for x ≠ x_k, the extrema occur at the x_k, k = 1, …, T. Thus, for λ > 0 the moment generating function of ‖\widehat{H(f)}‖_∞ is bounded by a sum involving the constants M_j, which by Lemma 2 are bounded uniformly in j by some global constant M > 0. By the convexity of the exponential function we obtain Recall that Y_k^* = log(Q_k/m)/√2 and by assumption 0 ≤ δ ≤ f ≤ M_0. Using Lemma 3, Q_k can be written as a sum of m = np/T independent gamma random variables with scale parameters a_k ∈ (c_1δ, c_2M_0) for some constants c_1, c_2. Thus, we can find random variables Q_k^− ∼ Γ(m/2, cδ) and Q_k^+ ∼ Γ(m/2, cM_0) such that Q_k^− ≤ Q_k ≤ Q_k^+ almost surely, where c > 0 is a constant. Then, The moment generating function of |log(X)| when X follows a Γ(a, b) distribution is given by where Γ(a) is the gamma function and γ(a, b) is the lower incomplete gamma function. In particular, In our setting, X = Q_k^−/m ∼ Γ(m/2, cδ/m) resp. X = Q_k^+/m ∼ Γ(m/2, cM_0/m). To derive the asymptotic order of E[exp{λ‖\widehat{H(f)}‖_∞}] for n, p → ∞, we first establish the asymptotic order of the ratio Γ(a + t)/Γ(a) for a → ∞. We distinguish the two cases where t is independent of a and where t depends linearly on a. Based on Stirling's approximation, one derives for a > 1, t > 0 Thus, for 0 < t < a and t independent of a, equation (A.17) implies for a → ∞ that Γ(a + t)/Γ(a) = O(a^t). Similarly, it can be seen that Γ(a − t)/Γ(a) = O(a^{−t}).
If 0 < t < a and t depends linearly on a, i.e. t = ca for some constant c ∈ (0, 1), then we get Γ(a ± t)/Γ(a) = O(a^{±t} exp{a}) for a → ∞. Hence, for a fixed λ not depending on n, p and such that 0 < λ < m/(√2 M_j), we get for sufficiently large n, p If λ = cm such that 0 < λ < m/(√2 M_j), then for sufficiently large n, p for some constant L > 1. Set K = min_{j=1,…,T} 1/(√2 M_j), which is a constant independent of n, p. Altogether, we showed that for 0 < λ < Km and n, p → ∞ where c_1 := H(C − M_0). Applying the Markov inequality for t = cm with c ∈ (0, K) and C = 2L^{4/c} + M_0, where c, K, L are the constants in (A.19), gives Together with Proposition 1, the claim follows.
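For reference, a possible explicit form of the moment generating function of |log X| used above, assuming the shape-scale parametrization X ∼ Γ(a, b) with scale b and 0 < t < a (this display is our own addition, not taken from the original derivation), is
$$ E\big\{e^{t|\log X|}\big\} = E\big(X^t \mathbf{1}\{X \ge 1\}\big) + E\big(X^{-t}\mathbf{1}\{X < 1\}\big) = \frac{b^t\,\Gamma(a+t)}{\Gamma(a)}\Big\{1 - \frac{\gamma(a+t,\, 1/b)}{\Gamma(a+t)}\Big\} + \frac{b^{-t}\,\Gamma(a-t)}{\Gamma(a)}\,\frac{\gamma(a-t,\, 1/b)}{\Gamma(a-t)}, $$
where γ(·, ·) denotes the lower incomplete gamma function; this form makes the role of the ratios Γ(a ± t)/Γ(a) analysed above explicit.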

A.3.3 An upper bound on E‖Σ̂ − Σ‖²
Using the fact that the spectral norm of a Toeplitz matrix is upper bounded by the sup norm of its spectral density, we get According to the mean value theorem, for a function g between H(f) and H(f̂), it holds that Application of the Cauchy-Schwarz inequality with a constant C > 0 yields for some constant c_1 > 0 not depending on n, p. Choosing the same constant C as in Section A.3.2, it follows Noting that ‖1/f‖_∞ ≤ 1/δ and (2/m) exp{φ(m/2)} ∈ [0.25, 1] for m ≥ 1, (A.18) implies for some constants c_2, c_3 > 0 and n, p → ∞ Since the derived bounds hold for each Σ(f) ∈ F_β, altogether we get

A.4 Auxiliary Lemmata for Theorem 1
This section states some technical lemmata needed for the proof of Theorem 1. The proofs can be found in the supplementary material. The first lemma lists some properties of the kernel K_h and its extension \bar K_h on the real line. The proof is based on Schwarz and Krivobokova (2016).
Lemma 2. Let h > 0 be the bandwidth parameter depending on N.
(i) There are constants 0 < C < ∞ and 0 < γ < 1 such that for all x, t ∈ [0, 1]

Lemma 3 states that the sum of the correlated gamma random variables in each bin can be rewritten as a sum of independent gamma random variables.
Finally, Lemma 4 gives explicit bounds for the stochastic and deterministic errors of the variance stabilizing transform. Thus, it quantifies the difference to an exact Gaussian regression setting. This result is a generalization of Theorem 1 of Cai et al. (2010) adapted to our setting with n ≥ 1 observations and correlated observations.

B.2 Proof of Lemma 3
It is sufficient to show the statement for n = 1 by the independence of the Y_i. Then, the number of points per bin is m = p/T. For simplicity, the index i is skipped in the following. First, we write Q_k as a matrix-vector product and refactor it so that it corresponds to a sum of independent scaled χ² random variables. In the second step, we calculate the scaling factors. Let E^{(km)} be a diagonal matrix with ones on the (k − 1)m + 1, …, km-th diagonal entries and zeros otherwise. Then, Q_k can be expressed as such a quadratic form. Define X = Σ^{−1/2} Y ∼ N_p(0_p, I), let U^t Γ U be the eigendecomposition of the resulting symmetric matrix, and set X̃ := UX. Since U is an orthogonal matrix, it follows that X̃ ∼ N_p(0_p, I).
In particular, Q_k is a sum of independent scaled χ² random variables W̃_j := λ_j X̃_j², i.e., of independent Γ(1/2, 2λ_j) random variables.
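The construction in this proof can be checked numerically. The sketch below verifies, for one bin, that Q_k equals a quadratic form in Y and that its law is that of a weighted sum of independent χ²_1 (equivalently, Γ(1/2, 2λ_j)) variables, with weights λ_j given by the eigenvalues of Σ^{1/2} D E^{(km)} D^t Σ^{1/2}. The DCT-I convention and the choice of Σ are illustrative assumptions.

```r
set.seed(1)
p <- 64; Tbins <- 8; m <- p / Tbins; k <- 3            # n = 1, third bin
Sigma <- toeplitz(0.5^(0:(p - 1)))

jj <- 0:(p - 1)                                        # orthonormal DCT-I (one convention)
D <- sqrt(2 / (p - 1)) * cos(pi * outer(jj, jj) / (p - 1))
D[, c(1, p)] <- D[, c(1, p)] / sqrt(2); D[c(1, p), ] <- D[c(1, p), ] / sqrt(2)

bin  <- ((k - 1) * m + 1):(k * m)
E_km <- diag(as.numeric(seq_len(p) %in% bin))          # selects the k-th bin

Y <- as.numeric(t(chol(Sigma)) %*% rnorm(p))           # one draw from N_p(0, Sigma)

W <- as.numeric(D %*% Y)^2                             # W_j = (D_j^t Y)^2
Q_direct <- sum(W[bin])                                # bin sum Q_k
Q_quad   <- drop(t(Y) %*% D %*% E_km %*% t(D) %*% Y)   # the same quantity as a quadratic form
c(Q_direct, Q_quad)                                    # identical up to rounding

es <- eigen(Sigma, symmetric = TRUE)
S_half <- es$vectors %*% diag(sqrt(es$values)) %*% t(es$vectors)
lam <- eigen(S_half %*% D %*% E_km %*% t(D) %*% S_half, symmetric = TRUE)$values
sum(lam > 1e-10)                                       # exactly m non-zero weights
c(sum(lam), 2 * sum(lam^2))                            # equal E(Q_k) and Var(Q_k)
```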

B.3 Proof of Lemma 4
In the regression setting of Cai et al. (2010) and Brown et al. (2010) the gamma random variables in each bin are independent. Using Lemma 3, we can rewrite our setting into one with independent observations per bin. Thus, the first step of the proof is to apply the results of Cai et al. (2010) and Brown et al. (2010) on the approximation of Y * k by a Gaussian random variable m −1/2 Z k and a non-Gaussian remainder ξ k . Since in our setting the observations of different bins are correlated, the correlation Cov(Z k , Z l ) is derived in the second part of the proof.
By Lemma 3, Q k can be written as sum of m = np/T independent gamma random variables with mean functionf (x) : , and ξ k is a mean zero random variable. Z k is defined via quantile coupling. It is the Gaussian approximation to the standardized version of Q k where the W ij in the k-th bin are assumed to be i.i.d. In particular, Z k approximatesQ k = kp/T j=(k−1)p/T +1 iid.
Let θ be the maximum difference of the observations' means in each bin. Then, Application of Lemma 4 and Theorem 1 of Brown et al. (2010) yield that ξ k satisfies E[|ξ k | ] log 2 (m)[m − + T −1 + T −1 log(p)p 1−β ] for any constant integer > 0 and where the hidden constant depends on . Next, by the Lipschitz continuity of the logarithm and the α-Hölder continuity of f . To compute Cov(Z k , Z l ) for k = l we use Hoeffding's covariance identity. In particular, we show that Cov(Z k , Z l ) Cov(Q k ,Q l ). The claim then follows from noting that where j = (k − 1)p/T + 1, ..., kp/T and h = (l − 1)p/T + 1, ..., lp/T . Let Φ and FQ denote the cumulative distribution functions of Z k , Z l ,Q k andQ l . Since the Z k are defined via quantile coupling, it holds Z k = Φ −1 {FQ(Q k )} (see Komlós et al., 1975;Mason and Zhou, 2012). Furthermore, define the uniform random variables The corresponding joint cumulative distribution functions are denoted by F Z,Z , FQ ,Q , F U,U . By Hoeffding's covariance identity Let ρ = Cov(Z k , Z l ). Then, the identity Patel and Read, 1996, 9.3.1.3). Since Cov(Q k ,Q l ) ≥ 0 and the ratio of two densities is non-negative, it follows Cov(Z k , Z l ) ≥ 0. Furthermore, It remains to show that it exists a c > 0 such that Note that for u ∈ (1/2, 1) Since Φ −1 (u)>0 for u ∈ (1/2, 1), the second term is positive. To see that the first term is negative for u ∈ (1/2, 1) we rewrite fQ and F −1 Q with respect to the nonnormalized gamma random variable X = Hence, where σ and µ are the standard deviation and the mean of X. Note that the mode of A ∼ Γ(a, b) is at b(a − 1). Since xσ + µ = 2f (x k )( m 2 − 1) implies x = − 2/m, it follows that fQ(x) is monotone decreasing for x ≥ − 2/m. Furthermore, FQ(− m/2) ≤ 0.5 for all m ∈ N as fQ(x) is right-skewed. In particular, − m/2 ≤ F −1 Q (1/2) for all m ∈ N. Finally, since fQ(− 2/m) → φ(0) for m → ∞ there is a constant c > 0 not depending on m such that

B.4 Simulation Study with non-Gaussian Data
The simulation study in Section 5 is performed in the same way, but with the uniform and the gamma distribution instead of the Gaussian distribution.
For example 2, the parameter innov of the R function arima.sim is used to pass the i.i.d. innovations X_1, …, X_n.
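For concreteness, one way to pass such innovations is sketched below; the shape and scale of the gamma distribution and the ARMA coefficients are illustrative and not the values used in the simulations.

```r
# Example 2 (ARMA(2,2)) with centred gamma innovations passed via 'innov';
# all numerical values below are illustrative only.
set.seed(1)
p <- 5000
innov <- rgamma(p, shape = 2, scale = 1) - 2     # mean-zero gamma innovations
y <- arima.sim(model = list(ar = c(0.5, -0.3), ma = c(0.4, 0.2)),
               n = p, innov = innov)
```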
Table 4: (A) p = 5000, n = 1: Errors of the Toeplitz covariance matrix and the spectral density estimators with respect to the spectral norm and the L_2 norm, respectively. Average computation time of the covariance estimators in seconds for one Monte Carlo sample (last column).
Table 5: (B) p = 1000, n = 50: Errors of the Toeplitz covariance matrix and the spectral density estimators with respect to the spectral norm and the L_2 norm, respectively. Average computation time of the covariance estimators in seconds for one Monte Carlo sample (last column).
Table 6: (C) p = 5000, n = 10: Errors of the Toeplitz covariance matrix and the spectral density estimators with respect to the spectral norm and the L_2 norm, respectively. Average computation time of the covariance estimators in seconds for one Monte Carlo sample (last column).
Table 7: (A) p = 5000, n = 1: Errors of the Toeplitz covariance matrix and the spectral density estimators with respect to the spectral norm and the L_2 norm, respectively. Average computation time of the covariance estimators in seconds for one Monte Carlo sample (last column).
Table 8: (B) p = 1000, n = 50: Errors of the Toeplitz covariance matrix and the spectral density estimators with respect to the spectral norm and the L_2 norm, respectively. Average computation time of the covariance estimators in seconds for one Monte Carlo sample (last column).
Table 9: (C) p = 5000, n = 10: Errors of the Toeplitz covariance matrix and the spectral density estimators with respect to the spectral norm and the L_2 norm, respectively. Average computation time of the covariance estimators in seconds for one Monte Carlo sample (last column).