Towards Understanding Residual and Dilated Dense Neural Networks via Convolutional Sparse Coding

Convolutional neural networks (CNNs) and their variants have led to many state-of-the-art results in various fields. However, a clear theoretical understanding of them is still lacking. Recently, the multi-layer convolutional sparse coding (ML-CSC) model has been proposed and proved to be equivalent to simply stacked networks (plain networks). Here, we argue that three factors in each layer of ML-CSC, namely the initialization, the dictionary design and the number of iterations, greatly affect its performance. Inspired by these considerations, we propose two novel multi-layer models, the residual convolutional sparse coding model (Res-CSC) and the mixed-scale dense convolutional sparse coding model (MSD-CSC), which are closely related to the residual neural network (ResNet) and the mixed-scale (dilated) dense neural network (MSDNet), respectively. Mathematically, we derive the shortcut connection in ResNet as a special case of a new forward propagation rule for ML-CSC. We give a theoretical interpretation of the dilated convolution and dense connection in MSDNet by analyzing MSD-CSC. We implement the iterative soft thresholding algorithm (ISTA) and its fast version to solve Res-CSC and MSD-CSC, both of which can employ unfolding for further improvement. Finally, extensive numerical experiments and comparisons with competing methods on three typical datasets demonstrate their effectiveness.


Zhiyang Zhang and Shihua Zhang*

I. INTRODUCTION
Nowadays, neural networks have become effective techniques in many fields, including computer vision, natural language processing and bioinformatics. Their predecessor, the perceptron, was proposed by Rosenblatt in 1958 [1]. However, the perceptron is too simple to solve even the XOR problem. To tackle more complex problems, the multilayer perceptron (MLP) was proposed. Neural networks can be seen as generalized MLPs with more specialized operations. Activation functions (e.g., Sigmoid, Tanh and ReLU (rectified linear unit) [2], [3], [4]) are used to compute the outputs of hidden layers in neural networks, and are considered to simulate the sparse activation of neurons in the human brain.
Convolution is another important operation, used for processing data that has a known, grid-like topology. For example, time-series data can be viewed as a one-dimensional grid, and an image can be viewed as a two-dimensional grid of pixels. The convolution operation simulates human eyes, capturing features locally and scanning globally. The range from which features are captured is called the receptive field [5], [6]. Comprehensive experiments have shown that convolution is effective at extracting abstract features from data. The first CNN architecture, LeNet, proposed in 1998, was successfully applied to handwritten digit recognition [7].
In recent years, rapid improvements in hardware and the public availability of highly optimized software [8], [9], [10] have made it possible to train neural networks with a large number of parameters. AlexNet [11] is a classical CNN architecture of this kind, which drew attention to deep convolutional neural networks (DCNNs). Deeper networks have a stronger ability to fit complex distributions, so it is easier to achieve better performance than before. "The deeper, the better" has become a belief [12]. But DCNNs are hard to train because of diverse optimization issues. The problem of vanishing or exploding gradients [13], [14] is a notorious one, and has been addressed by two tricks: normalized initialization [14], [15] and batch normalization (BN) [16]. When deep neural networks are trained with these two tricks, a degradation phenomenon is exposed [17]: deeper networks achieve lower accuracy than shallower ones. ResNet [17] is a special architecture with skip connections that tackles this phenomenon. The difficulty has been settled in practice, but the optimization issue behind the degradation phenomenon solved by ResNet is still not clear.
However, we still do not clearly understand the principles of CNNs. All the above successes are mainly based on empirical exploration. A clear and profound theoretical understanding of such neural networks is still lacking. On the one hand, architectures with excellent performance are hard to interpret rigorously. On the other hand, the design of architectures mainly depends on intuition or inspiration. The lack of theory is currently a key problem, which limits the further development of neural networks. This situation leaves us uncertain when applying neural networks to challenging fields such as self-driving, medical diagnosis and identity recognition.
Sparse coding brings a fresh view to CNN. In sparse coding, one assumes that a signal can be represented as a linear combination of a few columns from a matrix called a dictionary, and the coefficients of the linear combination form a sparse vector. The task of retrieving the sparsest representation of a signal over a dictionary is called sparse coding or basis pursuit (BP) [28], [29], [30], [31]. It is also known in the statistical learning community as the least absolute shrinkage and selection operator (Lasso) problem [32]. Neuroscience indicates that sparse coding or representation plays an important role in the human brain [33]. Moreover, sparsity has been shown to be a driving force in a myriad of applications in computer vision [34], [35], [36], statistics [32], [37] and machine learning [38], [39], [40]. For a given dictionary, orthogonal matching pursuit (OMP) [41], [42], the iterative soft thresholding algorithm (ISTA) and its fast version (FISTA) [43] have been proposed to tackle the pursuit problem. Besides, double sparsity has been proposed to accelerate the training process [44]; it assumes that the dictionary in sparse coding can be factorized into a product of two matrices.
Inspired by this progress, a multi-layer convolutional sparse coding (ML-CSC) model has been proposed [25], which is proved to be equivalent to plain networks when propagated with the layered thresholding algorithm [25]. This reveals that CNNs actually try to find the sparse coding or representation of input signals over a very special dictionary, which corresponds to the convolution operation. ML-CSC computes the sparse vectors layer by layer rather than recovering all of them at once, which would be computationally and conceptually challenging. The value of ML-CSC lies not only in giving us an understanding of CNNs, but also in providing a rigorous mathematical form that makes it possible to bring in more mathematical tools for rigorous theoretical analysis. Previous studies have explained theoretically why ReLU behaves well in numerical experiments [3], [4], what the feature maps computed in each layer represent, and what the meaning of the bias term (which is always added after convolution) is [25].
To solve the CSC model, layered basis pursuit (Layered BP) and a multi-layer version have been proposed [25], [27]. Layered BP considers the sparsity of only one layer at a time, while the multi-layer version considers the sparsity of all layers together. The stability of Layered BP in noiseless and noisy regimes has been clarified in [25]. A bound has been proposed to measure the distance between the solution and the true underlying sparse coding [25]. The convergence of the multi-layer version has been proved in [27]. The uniqueness of the sparsest representation and the conditions guaranteeing recovery of the true underlying one have been discussed in [26]. Skipping a fixed number of spatial locations after convolution (striding) is a common step among practitioners of CNNs [11], [12]. This reduces the dimensions of the kernel maps throughout the layers, leading to computational benefits. Further theoretical benefits of this technique have been analyzed in [25].
At present, existing studies have only established a preliminary connection between CSC and plain networks. The relationship between CSC and currently popular architectures (e.g., ResNet, MSDNet) has not been established. The roles that many key tricks (e.g., BN, Dropout) play in CSC are still not clear. We notice that neural networks with skip connections usually have better performance [17], [18], [19]. Do the skip connections have any theoretical interpretation? Moreover, dilated convolution in MSDNet is also a powerful trick for extracting multiscale features [19]. To better understand ResNet and MSDNet, we introduce a residual convolutional sparse coding model (Res-CSC) and a mixed-scale dense convolutional sparse coding model (MSD-CSC), which are closely related to ResNet and MSDNet, respectively.
This paper is structured as follows. First, we introduce notation and concepts in CNN and sparse coding in Section II. Second, we introduce the relationship between CNN and CSC in Section III. Third, we propose the layer-initializing question (LIQ) and the Res-CSC model in Section IV, and demonstrate that ResNet is a special case of Res-CSC. Fourth, we propose the MSD-CSC model and establish the bridge between it and MSDNet in Section V. Through MSD-CSC, we give a theoretical understanding of the superior performance of MSDNet in terms of dilated convolution and dense connection. Fifth, we implement two algorithms, ISTA and FISTA, to solve MSD-CSC in Section VI. Finally, we apply Res-CSC and MSD-CSC to three typical datasets (CIFAR10, CIFAR100, SVHN) and compare them with several methods.

A. Convolution and Matrix Multiplication
Convolution is a basic operation in CNN. It is used for extracting features from structured data, and has been widely used in computer vision. We denote an image with $m$ rows, $n$ columns and $c$ channels as $X \in \mathbb{R}^{m \times n \times c}$. A dilated convolution kernel $F^{s}$ with dilation scale $s \in \mathbb{Z}^{+}$ [45] convolves $X$ to produce a new feature map $Z$. This operation is denoted as $Z = F^{s} \otimes X$. Equivalently, this process can be written as a matrix multiplication (Fig. 1A). Simple examples with one channel and dilation scales $s = 1$ and $s = 2$ are shown in Fig. 1B and Fig. 1C, respectively. Here, we use the same symbol $F^{s}$ to represent the corresponding matrix, which has a special structure: a union of bars and circulant matrices. $F^{s}$ is also called a convolutional matrix. Although convolution computes locally and scans globally, it is a linear transform of $X$. The structure of a convolutional matrix corresponds to the sparse connections and parameter sharing in convolution. These two characteristics decrease the number of parameters, which can mitigate overfitting to some extent.
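The equivalence between (dilated) convolution and matrix multiplication can be checked numerically. The following is a minimal sketch in Python with NumPy, assuming a one-dimensional, single-channel signal and circular boundary handling; the helper names `conv_matrix` and `dilated_conv` are hypothetical and are not taken from the paper.

```python
import numpy as np

def conv_matrix(kernel, n, dilation=1):
    """Build the n x n circulant matrix F such that F @ x applies `kernel`
    to x with taps spaced `dilation` apart (circular boundary, assumed)."""
    F = np.zeros((n, n))
    for row in range(n):
        for tap, weight in enumerate(kernel):
            F[row, (row + tap * dilation) % n] += weight
    return F

def dilated_conv(kernel, x, dilation=1):
    """Direct sliding-window evaluation of the same operation."""
    n = len(x)
    return np.array([
        sum(w * x[(i + t * dilation) % n] for t, w in enumerate(kernel))
        for i in range(n)
    ])

kernel = np.array([1.0, -2.0, 0.5])
x = np.arange(8, dtype=float)
for s in (1, 2):
    F = conv_matrix(kernel, len(x), dilation=s)
    # The matrix product and the sliding-window result agree.
    assert np.allclose(F @ x, dilated_conv(kernel, x, dilation=s))
```

Each row of `F` contains the same kernel weights at shifted positions, which is the parameter sharing and sparse connectivity described above; the dilation only changes the spacing between the non-zero entries in a row.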

B. ResNet and MSDNet
Compared with the traditional CNN (Fig. 2A), ResNet [17] adds one operation, the shortcut connection (Fig. 2B), and MSDNet [19] adds two operations, dilated convolution and dense connection (Fig. 2C). The shortcut connection directly adds feature maps after a transformation (Fig. 2B). This process can be written as below: where $F$ is a transformation and $Z_i$ is the $i$-th layer output.
Dilated convolutions with different dilation scales can acquire a larger receptive field with fewer parameters, enabling feature extraction in a multi-scale manner. The dense connection gathers all feature maps before the current layer and computes new feature maps from them (Fig. 2C). This process can be written as below: In this paper, an MSDNet with $k$ layers is represented as:
$$MSD_k = \{(F_1^{s_1}, b_1), \cdots, (F_k^{s_k}, b_k)\},$$
where $(F_i^{s_i}, b_i)$ denotes the convolution kernels and the bias $b_i$ in the $i$-th layer. Note that $b_i$ is a vector recording the biases corresponding to the convolution kernels in $F_i^{s_i}$. Here, we use $w$ to denote the number of convolution kernels in each layer and $d$ to denote the number of layers.

C. Sparse Coding
Let $X$ denote a signal vector. In sparse coding, one assumes that it can be represented as a linear combination of a few columns of a dictionary matrix $D$:
$$X = D\Gamma,$$
where $\Gamma$ is the coefficient vector of the linear combination, and it is called a coding under the dictionary $D$.
In general, $\Gamma$ is not unique because the number of columns in $D$ is usually greater than the number of rows in $D$. Among all solutions, one expects to find a special coding with a small number of non-zeros. In addition, since the signal $X$ always contains noise, we do not need to reconstruct $X$ exactly. Finally, sparse coding can be formulated as the following optimization problem [28], [46], [44]:
$$(P_0):\quad \min_{\Gamma}\ \frac{1}{2}\|X - D\Gamma\|_2^2 + \beta\|\Gamma\|_0,$$
where $\beta$ is a regularization parameter that balances the reconstruction error of the signal $X$ and the sparsity of $\Gamma$, and $\|\Gamma\|_0$ denotes the number of non-zero entries in $\Gamma$.
The problem (P0) is NP-hard because of the second term $\|\Gamma\|_0$ [47]. Fortunately, it has been proved that the problem (P0) can be relaxed into the following form [48]:
$$(P_1):\quad \min_{\Gamma}\ \frac{1}{2}\|X - D\Gamma\|_2^2 + \beta\|\Gamma\|_1.$$
The problem (P1) is called the Lasso [32] or the BP problem [28], [29], [30], [31] in different fields. Moreover, the problem (P1) can be solved using the popular iterative soft thresholding algorithm (ISTA). Its update formula can be formulated as follows:
$$\Gamma^{k+1} = S_{\beta/L}\left(\Gamma^{k} - \frac{1}{L}D^{T}\left(D\Gamma^{k} - X\right)\right), \tag{1}$$
where $\Gamma^{k}$ denotes the coding in the $k$-th iteration, $L$ is the Lipschitz constant of the gradient of the quadratic term (its smallest valid value is the largest eigenvalue of $D^{T}D$), and $S_{b}(\cdot)$ is the soft thresholding operator, defined elementwise as below:
$$S_{b}(x) = \mathrm{sign}(x)\max(|x| - b,\ 0).$$
Is the sparsest representation for the problem (P1) unique? Lemma 1 gives the answer.
Lemma 1 [28], [46], [49]: The sparsest representation is unique if the number of non-zeros in the underlying sparsest representation for the problem (P1) is not too high, and in particular less than $\frac{1}{2}\left(1 + \frac{1}{\mu(D)}\right)$. Here, $\mu(D)$ is defined as the maximal absolute inner product of two distinct columns of $D$. This can be formally written as below:
$$\mu(D) = \max_{i \neq j}\ |d_i^{T} d_j|,$$
where each column $d_i$ is assumed to be normalized to unit length.
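As an illustration of the ISTA iteration and of the quantity $\mu(D)$ in Lemma 1, the following is a minimal NumPy sketch. It assumes the standard ISTA form with step size $1/L$, where $L$ is taken as the largest eigenvalue of $D^{T}D$; the function names and the random test problem are hypothetical and only serve as an example.

```python
import numpy as np

def soft_threshold(v, b):
    """Elementwise soft thresholding S_b(v) = sign(v) * max(|v| - b, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - b, 0.0)

def lasso_objective(D, x, gamma, beta):
    """Objective of problem (P1)."""
    return 0.5 * np.sum((x - D @ gamma) ** 2) + beta * np.sum(np.abs(gamma))

def ista(D, x, beta, n_iter=200):
    """Plain ISTA for (P1), started from the zero vector."""
    L = np.linalg.norm(D, 2) ** 2          # largest eigenvalue of D^T D
    gamma = np.zeros(D.shape[1])
    for _ in range(n_iter):
        gradient = D.T @ (D @ gamma - x)   # gradient of the quadratic term
        gamma = soft_threshold(gamma - gradient / L, beta / L)
    return gamma

def mutual_coherence(D):
    """Largest absolute inner product between distinct normalized columns."""
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)
    G = np.abs(Dn.T @ Dn)
    np.fill_diagonal(G, 0.0)
    return G.max()

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0, keepdims=True)
true_gamma = np.zeros(50)
true_gamma[[3, 17, 41]] = [1.5, -2.0, 1.0]
x = D @ true_gamma

estimate = ista(D, x, beta=0.05)
mu = mutual_coherence(D)
uniqueness_bound = 0.5 * (1.0 + 1.0 / mu)  # sparsity level from Lemma 1
```

For a random dictionary such as this one, `uniqueness_bound` is typically small, which shows how conservative the coherence-based guarantee can be in practice.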

A. Nonnegative Sparse Coding
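Consider a signal $X = D\Gamma$. Naturally, $\Gamma$ can be split into its positive part $\Gamma_P$ and its negative part $\Gamma_N$, so that $\Gamma = \Gamma_P + \Gamma_N$. Then $X$ can be equivalently written as:
$$X = D\Gamma_P + D\Gamma_N = [D, -D]\,[\Gamma_P^{T}, -\Gamma_N^{T}]^{T}.$$
Notice that both $\Gamma_P$ and $-\Gamma_N$ are nonnegative. Therefore, every sparse coding can always be converted into a nonnegative sparse coding.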
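The conversion to a nonnegative coding can be sketched in a few lines of NumPy. The helper name `to_nonnegative` is hypothetical; the sketch assumes the split of $\Gamma$ into its positive and negative parts and the augmented dictionary $[D, -D]$ described above.

```python
import numpy as np

def to_nonnegative(D, gamma):
    """Return the augmented dictionary [D, -D] and the nonnegative coding
    (Gamma_P, -Gamma_N) that reproduce the same signal D @ gamma."""
    gamma_p = np.maximum(gamma, 0.0)   # positive part
    gamma_n = np.minimum(gamma, 0.0)   # negative part (entries <= 0)
    return np.hstack([D, -D]), np.concatenate([gamma_p, -gamma_n])

rng = np.random.default_rng(0)
D = rng.standard_normal((5, 8))
gamma = np.array([0.0, 1.2, 0.0, -0.7, 0.0, 0.0, 2.0, 0.0])
D_aug, gamma_aug = to_nonnegative(D, gamma)
assert np.allclose(D_aug @ gamma_aug, D @ gamma)
```

The number of non-zero entries is unchanged by the conversion, so sparsity is preserved while the dictionary doubles in width.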

B. Soft Non-negative Thresholding Operator
For a nonnegative sparse coding problem, one only needs to consider nonnegative codings. So, one can define the soft non-negative thresholding operator $S_b^{+}(\cdot)$ based on the soft thresholding operator:
$$S_b^{+}(x) = \begin{cases} 0, & x \le b, \\ x - b, & x > b. \end{cases}$$
It is obvious that the soft non-negative thresholding operator is equivalent to the ReLU:
$$S_b^{+}(x) = \max(x - b,\ 0) = \mathrm{ReLU}(x - b), \tag{2}$$
where $b$ is a bias term. According to Eq. (1), $b$ depends on $\beta$ and the Lipschitz constant $L$ in the problem (P1). In other words, $\beta$ is a hyper-parameter in sparse coding, but it becomes a parameter in neural networks via Eq. (2).
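The relation between the two thresholding operators and the ReLU can be verified directly. This is a minimal NumPy sketch with hypothetical function names, assuming a nonnegative threshold $b$.

```python
import numpy as np

def soft_threshold(v, b):
    """Two-sided soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - b, 0.0)

def soft_nonneg_threshold(v, b):
    """One-sided (non-negative) soft thresholding."""
    return np.maximum(v - b, 0.0)

def relu(v):
    return np.maximum(v, 0.0)

v = np.linspace(-3.0, 3.0, 13)
b = 0.75
# Keeping only the nonnegative branch of soft thresholding gives S_b^+ ...
assert np.allclose(soft_nonneg_threshold(v, b), relu(soft_threshold(v, b)))
# ... which is a ReLU applied after subtracting the bias b.
assert np.allclose(soft_nonneg_threshold(v, b), relu(v - b))
```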

C. ML-CSC
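The ML-CSC model is formulated as follows [25]: where $X$ is the input signal (e.g., an image), and $\{D_i\}$ is a set of special dictionaries. Each $D_i$ is the transpose of a convolutional matrix. ML-CSC encodes signals layer by layer:
$$X = D_1\Gamma_1,\quad \Gamma_1 = D_2\Gamma_2,\quad \cdots,\quad \Gamma_{k-1} = D_k\Gamma_k.$$
The $i$-th layer in ML-CSC can be described as a Lasso problem (with $\Gamma_0 = X$ as the input of the first layer):
$$\min_{\Gamma_i}\ \frac{1}{2}\|\Gamma_{i-1} - D_i\Gamma_i\|_2^2 + \beta_i\|\Gamma_i\|_1.$$
One can use Eq. (1) to compute the sparse coding in every layer. When $\{D_i\}$ is known, one sets the initial iterate $\Gamma^{0} = 0$, and Eq. (1), with the soft non-negative thresholding operator, becomes:
$$\Gamma = S^{+}_{\beta/L}\left(\frac{1}{L}D^{T}X\right). \tag{3}$$
One can update $\Gamma$ only once with Eq. (3) and then obtain the layered thresholding algorithm with $k$ layers (Algorithm 1).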

Algorithm 1 The layered thresholding algorithm
According to the relationship between convolution and matrix multiplication, using the layered thresholding algorithm to solve ML-CSC is equivalent to the forward pass of plain networks (Fig. 2A) [25]. So, the final sparse coding $\Gamma_k$ in ML-CSC corresponds to the final feature map in a CNN. Here one sets $\Gamma^{0} = 0$ and only updates $\Gamma$ once in each layer. This strategy not only achieves the equivalence between ML-CSC and plain networks, but also improves the computational efficiency, since the two terms $\Gamma^{k}$ and $D^{T}D\Gamma^{k}$ in Eq. (1) can be ignored.
A natural question is how close the estimate $\hat{\Gamma}_i$ computed by Algorithm 1 is to the true solution $\Gamma_i$. Lemma 2 answers this question. Lemma 2 [25]: where $\|\Gamma_i\|^{s}_{0,\infty}$ is defined to be the maximal number of non-zeros in a stripe of length $(2n - 1)m$ extracted from $\Gamma_i$ ($m$ denotes the number of convolution kernels in the $i$-th layer and $n$ denotes the length of the bars in the $i$-th layer) (Fig. 1).
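The claim that one ISTA step from a zero initialization is a plain network layer can be sketched as follows. The code assumes dense matrices in place of convolutional ones and absorbs the factor $1/L$ into the dictionary; the function names are hypothetical and the sketch is not the paper's Algorithm 1 verbatim.

```python
import numpy as np

def ista_step_nonneg(D, x, gamma, beta, L):
    """One nonnegative ISTA step for the Lasso with dictionary D."""
    return np.maximum(gamma - D.T @ (D @ gamma - x) / L - beta / L, 0.0)

def layered_thresholding(x, dictionaries, biases):
    """Forward pass: each layer computes ReLU(D^T input - b)."""
    gamma = x
    for D, b in zip(dictionaries, biases):
        gamma = np.maximum(D.T @ gamma - b, 0.0)
    return gamma

rng = np.random.default_rng(0)
D1 = rng.standard_normal((6, 9))
D2 = rng.standard_normal((9, 12))
x = rng.standard_normal(6)
beta1, beta2 = 0.3, 0.2
L1 = np.linalg.norm(D1, 2) ** 2
L2 = np.linalg.norm(D2, 2) ** 2

# Layer-by-layer ISTA, one step per layer, each started from zero.
g1 = ista_step_nonneg(D1, x, np.zeros(9), beta1, L1)
g2 = ista_step_nonneg(D2, g1, np.zeros(12), beta2, L2)

# The same result as a two-layer "plain network" with rescaled weights.
out = layered_thresholding(x, [D1 / L1, D2 / L2], [beta1 / L1, beta2 / L2])
assert np.allclose(out, g2)
```

With a zero initialization the terms involving the current iterate vanish, which is why a single matrix product and a ReLU suffice per layer.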
So far, ML-CSC has been connected to the plain network. We argue that three factors in each layer of ML-CSC, namely the initialization, the dictionary design and the number of iterations, greatly affect its performance. Inspired by these considerations, in Sections IV, V and VI we propose Res-CSC, MSD-CSC, and a forward propagation algorithm with unfolding (iterating more than once), respectively.

IV. LIQ AND RES-CSC
According to the above analysis, the forward pass of plain networks can be explained as solving Eq. (1) with the initialization $\Gamma^{0} = 0$ in each layer. This dramatically improves the computational efficiency. However, this naive setting causes serious training limitations in much deeper networks, because the number of iterations in each layer is set to one, which causes errors to accumulate. Thus, we propose a fundamental question about this initialization: Layer-Initializing Question (LIQ): In ML-CSC, Eq. (1) iterates once in each layer. Under this condition, can we design a proper initialization $\Gamma_i^{0}$ for the $i$-th layer to approach the optimal sparse coding $\Gamma_i$?
To give a solution to LIQ, we modify the ML-CSC model. In principle, ML-CSC computes the sparse coding layer by layer, and the representation becomes sparse gradually. Thus, an intuitive idea is to set $\Gamma^{0}$ equal to the input of a former layer. In the following, we use the input of the layer closest to the current one as the initialization (denoted as $\Gamma^{0} = X_{-1}$). We change the forward propagation rule in part of the layers of ML-CSC to reduce the accumulated error, and keep the original rule in the remaining layers to preserve the computational efficiency. Specifically, the new initialization is used once in every two layers. The first layer adopts Eq. (3) to obtain its output, and the second one employs the following update rule based on Eq. (1):
$$\Gamma = S^{+}_{\beta/L}\left(X_{-1} - \frac{1}{L}D^{T}DX_{-1} + \frac{1}{L}D^{T}X\right). \tag{4}$$
Let $\tilde{D} = \frac{1}{L}D$, $c = -L$ and $b = \beta/L$. Eq. (4) becomes:
$$\Gamma = S^{+}_{b}\left(X_{-1} + c \cdot \tilde{D}^{T}\tilde{D}X_{-1} + \tilde{D}^{T}X\right). \tag{5}$$
Now, we obtain a new optimization rule following the same mode as ML-CSC. In contrast to ML-CSC, its forward propagation implements Eq. (3) and Eq. (5) alternately (Fig. 3A). Notice that Eq. (5) becomes exactly the forward propagation of ResNet when we ignore the term $c \cdot \tilde{D}^{T}\tilde{D}X_{-1}$. For convenience, we name the ML-CSC model updated with this new rule Res-CSC. The classical ResNet can be seen as a special case of Res-CSC (Fig. 3B), which gives an approximate solution to the LIQ. To our knowledge, it has not been formally proposed before that LIQ is a key to the optimization problem behind the degradation phenomenon [17]. Similarly, we can obtain another approximate update rule by ignoring the term $X_{-1}$ (Fig. 3C).
The relationship between Res-CSC and ResNet draws our attention to error tolerance, which is an important characteristic of CNN [50], [51]. As an approximation of Res-CSC, ResNet has exhibited this characteristic in many applications. As a more general propagation rule, Res-CSC can overcome the training difficulty when networks grow to hundreds of layers. In the experiment section, numerical tests on Res-CSC, ResNet and Res-CSC-Simplified using the three typical datasets demonstrate that the terms $X_{-1}$ and $c \cdot \tilde{D}^{T}\tilde{D}X_{-1}$ in Eq. (5) indeed play a key role. This suggests that LIQ is the underlying mathematical key behind the degradation phenomenon in plain networks. Coincidentally, ResNet addresses it by introducing such non-zero initializations.
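The Res-CSC update can be sketched numerically. The code below assumes that Eq. (5) has the form $\Gamma = S_b^{+}(X_{-1} + c \cdot \tilde{D}^{T}\tilde{D}X_{-1} + \tilde{D}^{T}X)$ with $\tilde{D} = D/L$, $c = -L$ and $b = \beta/L$, which is what one ISTA step from the initialization $\Gamma^{0} = X_{-1}$ gives; the function names and the dense matrices are hypothetical stand-ins for convolutions.

```python
import numpy as np

def ista_step_nonneg(D, x, gamma0, beta, L):
    """One nonnegative ISTA step started from gamma0."""
    return np.maximum(gamma0 - D.T @ (D @ gamma0 - x) / L - beta / L, 0.0)

def res_csc_step(x, x_prev, D_tilde, b, c):
    """Assumed Res-CSC rule: ReLU(x_prev + c * D~^T D~ x_prev + D~^T x - b)."""
    correction = c * (D_tilde.T @ (D_tilde @ x_prev))
    return np.maximum(x_prev + correction + D_tilde.T @ x - b, 0.0)

def resnet_step(x, x_prev, D_tilde, b):
    """ResNet-style block: shortcut x_prev plus a transformed input."""
    return np.maximum(x_prev + D_tilde.T @ x - b, 0.0)

rng = np.random.default_rng(0)
D = rng.standard_normal((7, 7))
x = rng.standard_normal(7)        # input of the current layer
x_prev = rng.standard_normal(7)   # input of the previous layer (Gamma^0)
beta = 0.2
L = np.linalg.norm(D, 2) ** 2

full = res_csc_step(x, x_prev, D / L, beta / L, c=-L)
assert np.allclose(full, ista_step_nonneg(D, x, x_prev, beta, L))
# Dropping the correction term (c = 0) leaves the shortcut form.
assert np.allclose(res_csc_step(x, x_prev, D / L, beta / L, c=0.0),
                   resnet_step(x, x_prev, D / L, beta / L))
```

In this sketch the shortcut appears purely as the non-zero initialization of the iteration, which is the interpretation the section argues for.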

A. MSD-CSC
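Inspired by the layered thresholding algorithm designed for ML-CSC, we attempt to describe the dilated convolution and dense connection of MSDNet from a sparse coding view by modifying the structure of the dictionaries. In MSDNet, each layer uses all the previous feature maps to compute its layer output. This leads to the following CSC model:
$$X = D_1^{s_1}\Gamma_1,\quad \Gamma_1 = D_2^{s_2}\Gamma_2,\quad \cdots,\quad \Gamma_{k-1} = D_k^{s_k}\Gamma_k,$$
where $D_i^{s_i} = [I, (F_i^{s_i})^{T}]$, and $I$ is an identity matrix. The Lasso problem in the $i$-th layer is formulated as:
$$\min_{\Gamma_i}\ \frac{1}{2}\|\Gamma_{i-1} - D_i^{s_i}\Gamma_i\|_2^2 + \beta_i\|\Gamma_i\|_1.$$
We call this the MSD-CSC model, and denote it as:
$$MSDCSC_k = \{(D_1^{s_1}, \beta_1), \cdots, (D_k^{s_k}, \beta_k)\},$$
where $D_i^{s_i}$ is the dictionary and $\beta_i$ is the regularization parameter in the $i$-th layer. Is MSD-CSC equivalent to MSDNet?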

B. Theoretical Analysis
Proposition 1: For a given MSDNet, there exists an MSD-CSC model which is equivalent to the MSDNet when propagated with the layered thresholding algorithm. Proof: For a given MSDNet:
$$MSD_k = \{(F_1^{s_1}, b_1), \cdots, (F_k^{s_k}, b_k)\}.$$
Let's define a corresponding MSD-CSC model: the $i$-th layer. According to the layered thresholding algorithm: where $i = 1, \cdots, k$ and $\mathbf{b}_i = \frac{\beta_i}{L_i} = (0, \cdots, 0, b_i)$. We can observe that the features before the $i$-th layer are kept in the $i$-th feature as the input of the next layer. This indicates that the propagation rule of $MSDCSC_k$ is equivalent to that of $MSD_k$.
From Proposition 1, we can see that MSDNet is just a special case of a corresponding MSD-CSC model. Next, we will prove that the coding performance of MSD-CSC is better than that of ML-CSC. This is attributed to two operations: dilated convolution and dense connection. In the context of CSC, the dilated convolution affects the structure of the convolutional matrices. From Fig. 1B and Fig. 1C, we can see that $\mu(D)$ is smaller for the convolutional matrix corresponding to the dilated convolution kernel than for the one without dilation ($\mu = 0.47$ in Fig. 1B and $\mu = 0$ in Fig. 1C). According to Lemma 1, a dictionary with smaller $\mu$ tends to ensure that the sparsest representation is unique (notice that the identity matrices in MSD-CSC don't affect $\mu$). According to Lemma 2, a dictionary with smaller $\mu$ will make the estimate $\hat{\Gamma}_i$ closer to the true solution $\Gamma_i$ when solved with Algorithm 1.
Next we will explore how MSD-CSC benefits from the dense connection. In sparse coding, a larger $\beta$ leads to a sparser representation, but a sparser representation may lead to a loss of information according to the Lasso form. Sparsity and preservation of information are in conflict. Sometimes, an unsuitable $\beta$ can lead to a very unreasonable solution. For example, let $\xi = (\xi_1, \xi_2, \cdots, \xi_n) = D_i^{s_i}\Gamma_i$, which is used to reconstruct the signal $\Gamma_{i-1}$. If the $j$-th dimension satisfies: The reconstruction is considered unsuccessful in the $j$-th dimension, and successful otherwise.
Let's first conduct a simulated experiment on ML-CSC. The simulated data are of length 100 and are generated by adding Gaussian noise to 100 different data centers, which represent 100 different classes. We generate 10000 training data points (each class has 100 data points) and 2000 testing data points. We use an ML-CSC with two hidden layers. After each iteration, we compute the regularization parameter $\beta$, reconstruct the input signal $X$ and count the number of dimensions that fail to be reconstructed (i.e., are unsuccessful) (Fig. 4). We can see that the number of unsuccessfully reconstructed dimensions decreases at the beginning of the training process and becomes stable after some iterations. In the end, there still exist some dimensions which cannot be reconstructed successfully. The theorem below shows that MSD-CSC is better than ML-CSC in this situation.
Theorem 1: For the Lasso problem in each layer, the performance of MSD-CSC is better than that of ML-CSC. Proof: In ML-CSC, the coding in the $i$-th layer is $\Gamma_{i-1} = D_i\Gamma_i$, and the corresponding Lasso problem is: In MSD-CSC, the coding in the $i$-th layer is $\Gamma_{i-1} = D_i^{s_i}\Gamma_i = [I, (F_i^{s_i})^{T}]\,\Gamma_i$, and the corresponding Lasso problem is In ML-CSC, if the reconstructions in all the dimensions are considered successful, let $\eta = (0, \cdots, 0\,|\,\Gamma_i)^{T}$. Then we can obtain: In ML-CSC, if the reconstruction in the $j$-th dimension is considered unsuccessful, let where $\Gamma_i$ is the solution of ML-CSC (Fig. 5). Then, we can obtain: This indicates that MSD-CSC has a better solution for the Lasso problem in each layer than ML-CSC.
Let's further conduct a simulated experiment on MSD-CSC. Note that the bias term of MSD-CSC has been updated to ensure that it has the same regularization parameter as ML-CSC. Based on the simulated experiment, we can see that the number of unsuccessfully reconstructed dimensions decreases rapidly compared with that of ML-CSC and finally stabilizes at zero (Fig. 4). This means that the unsuccessfully reconstructed dimensions in ML-CSC can be recovered by MSD-CSC thanks to the identity matrix (corresponding to the dense connection in MSDNet).
In Theorem 1, we only consider the situation in which one specific dimension is considered unsuccessful. Actually, the proof can be extended to all the dimensions that are considered unsuccessful. In this case, $f_{MSD\text{-}CSC}$ could be much smaller than $f_{ML\text{-}CSC}$. Besides, in this proof, we assume that the balance coefficients $\beta$ in ML-CSC and in the corresponding MSD-CSC are equal. This assumption is reasonable from the view of CSC, since MSD-CSC and the corresponding ML-CSC can be seen as two arrays of Lasso problems with the same coefficient $\beta$. But from the view of neural networks, the concept of a regularization parameter $\beta$ doesn't exist. Thus, the corresponding ML-CSC should be considered as having the same bias as MSD-CSC. How should we understand the better performance under this new assumption?
Lemma 3: For a matrix $A \neq 0$, let's assume the matrix $AA^{T}$ has eigenvalues $\lambda_1, \cdots, \lambda_n$. As a result, the matrix
$$B = \begin{pmatrix} I & A^{T} \\ A & AA^{T} \end{pmatrix}$$
has eigenvalues $0, \cdots, 0, \lambda_1 + 1, \cdots, \lambda_n + 1$. Here the number of zeros is equal to the number of columns of $A$. Proof: Assume $X$ is an eigenvector of $AA^{T}$ corresponding to the eigenvalue $\lambda$. Let's consider the vector $\tilde{X} = \left(\frac{1}{\lambda}A^{T}X,\ X\right)^{T} \neq 0$ and the following equation
$$B\tilde{X} = (\lambda + 1)\tilde{X}.$$
Obviously, $B$ has an eigenvalue $\lambda_i + 1$ and $\tilde{X}$ is a corresponding eigenvector. So, $B$ has eigenvalues $\lambda_1 + 1, \cdots, \lambda_n + 1$. Let's further consider the trace of $B$, $\mathrm{tr}(B) = \mathrm{tr}(I) + \mathrm{tr}(AA^{T}) = \lambda_1 + \lambda_2 + \cdots + \lambda_n + n$.
Clearly, the $\beta$ in MSD-CSC becomes larger than that in the corresponding ML-CSC. This means MSD-CSC tends to obtain sparser solutions based on the formulation of the Lasso. In addition, similar to the proof of Theorem 1, we can recover the dimensions of the signal which lose too much information via the identity matrix in the dictionary of MSD-CSC. All the above analysis indicates that MSD-CSC allows a larger $\beta$ and a lower loss of information compared to ML-CSC. MSD-CSC alleviates the conflict between them.
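Lemma 3 can be checked numerically. The sketch below assumes that the matrix in the lemma is $B = D^{T}D$ for the dictionary $D = [I, A^{T}]$, i.e. $B$ has blocks $I$, $A^{T}$, $A$ and $AA^{T}$; this form is an assumption reconstructed from the eigenvector used in the proof. A square $A$ is used so that the number of zeros stated in the lemma and the number of shifted eigenvalues both equal $n$; the helper name `augmented_gram` is hypothetical.

```python
import numpy as np

def augmented_gram(A):
    """Return B = D^T D for the identity-augmented dictionary D = [I, A^T]."""
    m = A.shape[1]
    D = np.hstack([np.eye(m), A.T])
    return D.T @ D

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))          # square case (assumed for the check)
B = augmented_gram(A)

lam = np.linalg.eigvalsh(A @ A.T)        # eigenvalues of A A^T
expected = np.sort(np.concatenate([np.zeros(4), lam + 1.0]))
assert np.allclose(np.linalg.eigvalsh(B), expected, atol=1e-8)

# The largest eigenvalue, which plays the role of the Lipschitz constant L,
# grows from lambda_max to lambda_max + 1 when the identity block is added.
assert np.isclose(np.linalg.eigvalsh(B)[-1], lam[-1] + 1.0)
```

Since the bias is $b = \beta/L$, keeping the same bias while $L$ grows corresponds to a larger $\beta$, which is the step used in the paragraph above.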
We can see that the identity matrix in the dictionary is an important difference between ML-CSC and MSD-CSC. Because of it, the problem (P2) can lead to a better solution for MSD-CSC than for ML-CSC. On the other hand, the identity matrix corresponds to the dense connection in MSDNet according to Proposition 1. Thus, Proposition 1 and Theorem 1 together provide a theoretical insight into the better coding performance of the dense connection. We summarize the relationship between the generalized CNNs (including ResNet and MSDNet) and the new CSC models in Table I.

VI. FORWARD PROPAGATION ALGORITHM
For MSD-CSC, we just need to replace the dictionary in Eq. (1) with the corresponding one. The update formula becomes: where $I$ is an identity matrix. We can adopt ISTA and FISTA to tackle the problem (P1) [43]. Besides, it is unnecessary to limit the number of iterations to one. Iterating more than once corresponds to unfolding. The case of unfolding = 0 corresponds to MSDNet. Models with different unfolding numbers have the same number of parameters. According to the relationship between convolution and matrix multiplication, we can obtain the forward propagation algorithm in one layer (Fig. 6A) as below (Algorithm 2 and Algorithm 3). The main difference between the two algorithms is that ISTA is based only on the last iterate, whereas FISTA is based on a linear combination of the last two iterates. The main computational effort in both algorithms remains the same. We illustrate a simple architecture using MSD-CSC for a classification task (Fig. 6B). In this architecture, each block represents a layer in MSD-CSC. We can implement a propagation algorithm and set an unfolding number (e.g., 0, 1, 2) in each block. Note that FISTA is the same as ISTA when unfolding < 2. The flow of feature maps in each block is illustrated in Fig. 6A. Max-pooling layers are added to downsample the feature maps because of memory constraints. The number of blocks corresponds to $d$ and the number of convolution kernels corresponds to $w$ in MSDNet.
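One layer of MSD-CSC with unfolding can be sketched as follows. This is a simplified illustration, not the paper's Algorithm 2 or Algorithm 3: it assumes a dense matrix `F` in place of the dilated convolution, the dictionary $D = [I, F^{T}]$, a per-entry regularization vector `beta` that is zero on the identity block, and a step size $1/L$ instead of the learned parameters $c_i$. All names are hypothetical.

```python
import numpy as np

def msd_dictionary(F):
    """D = [I, F^T]: the identity block models the dense connection."""
    return np.hstack([np.eye(F.shape[1]), F.T])

def msd_csc_layer(x, F, beta, unfolding=0, fast=False):
    """Run 1 + unfolding nonnegative (F)ISTA iterations from a zero start."""
    D = msd_dictionary(F)
    L = np.linalg.norm(D, 2) ** 2
    gamma_prev = np.zeros(D.shape[1])
    y, t = gamma_prev.copy(), 1.0
    for _ in range(1 + unfolding):
        gamma = np.maximum(y - D.T @ (D @ y - x) / L - beta / L, 0.0)
        if fast:
            # FISTA: extrapolate using the last two iterates.
            t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
            y = gamma + (t - 1.0) / t_next * (gamma - gamma_prev)
            t = t_next
        else:
            # ISTA: continue from the last iterate only.
            y = gamma
        gamma_prev = gamma
    return gamma

rng = np.random.default_rng(0)
F = rng.standard_normal((5, 4))             # 5 new features from 4 inputs
x = np.abs(rng.standard_normal(4))          # nonnegative input feature map
beta = np.concatenate([np.zeros(4), 0.1 * np.ones(5)])

no_unfold = msd_csc_layer(x, F, beta, unfolding=0)
ista_out = msd_csc_layer(x, F, beta, unfolding=2, fast=False)
fista_out = msd_csc_layer(x, F, beta, unfolding=2, fast=True)
```

With `unfolding=0` the first four output entries are a scaled copy of the input (the dense connection) and the remaining five are thresholded new features; larger unfolding values reuse the same `F`, so no parameters are added.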
Algorithm 2 MSD-CSC ISTA in the $i$-th layer
Input: Signal $X$, convolution $F_i^{s_i}$, thresholding $b_i$, parameters $c_i$, width $w$, unfolding
Output:

Algorithm 3 MSD-CSC FISTA in the $i$-th layer
Input: Signal $X$, convolution $F_i^{s_i}$, thresholding $b_i$, parameters $c_i$, width $w$, unfolding, $t_1 = 1$
Output: Encoding signals $\Gamma_i$
1. $f_1 = \mathrm{conv}(X, F_i^{s_i})$ // conv denotes convolution
2.
4. for $k = 1$ : unfolding do:
return $\Gamma_i^{1 + \text{unfolding}}$

VII. EXPERIMENTS
In this section, we evaluate Res-CSC, MSD-CSC and related methods using three typical datasets: CIFAR10, CIFAR100 and SVHN [52], [53]. CIFAR10, CIFAR100 and SVHN consist of colored natural images with 32 × 32 pixels. In SVHN, the training and testing sets contain 73,257 images and 26,032 images, respectively. In CIFAR10 and CIFAR100, the training and testing sets contain 50,000 and 10,000 images, which are drawn from 10 and 100 classes, respectively.

A. Experiments on Res-CSC
We implement standard data augmentation (translation and horizontal flip) on CIFAR10 and CIFAR100. All four models, namely a plain network, Res-CSC-Simplified, ResNet and Res-CSC, are trained with stochastic gradient descent (SGD) on a single GPU. The mini-batch size is 128 and the Nesterov momentum is set to 0.9. Each model is trained for 200 epochs. The initial learning rate is set to 0.1 and is then divided by 10 after 100 and 150 epochs.
First, we train all four types of networks with 20 and 56 layers, respectively. The plain network with 56 layers achieves lower accuracy than that with 20 layers (Fig. 7A and Table II), suggesting that the degradation phenomenon occurs. Interestingly, the degradation phenomenon is alleviated to some extent with Res-CSC-Simplified, though the improvement is limited. Moreover, when the number of layers reaches the hundreds, Res-CSC-Simplified cannot be trained normally, like the plain network. Both Res-CSC and ResNet clearly overcome the degradation phenomenon (Fig. 7B and 7C). These observations are consistent with our theoretical derivation. According to the results, the term $X_{-1}$ plays a more important role than the term $c \cdot \tilde{D}^{T}\tilde{D}X_{-1}$ does.
Next, we train six Res-CSC and ResNet models with 20, 32, 44, 56, 110 and 218 layers, respectively. Each Res-CSC has the same number of parameters as the corresponding ResNet. Res-CSC achieves very competitive or even slightly better accuracy than ResNet (Fig. 7 and Table II). The difference is due to the effect of the last term in Eq. (9). Besides, Res-CSC takes a little more time because of the extra convolution and transposed convolution in each layer. In short, Res-CSC is a more general white-box model for overcoming the degradation phenomenon. Its special case with $c = 0$ leads to an equivalent form of ResNet. Thus, it can be an alternative to the black-box ResNet. More importantly, it leads to a deeper theoretical understanding of ResNet in terms of the update rule.
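The learning-rate schedule described above can be written as a small helper. This is a minimal sketch with a hypothetical function name, assuming zero-based epoch indices so that the rate drops at the start of epochs 100 and 150.

```python
def step_lr(epoch, base_lr=0.1, milestones=(100, 150), factor=10.0):
    """Return the learning rate for a zero-based epoch index."""
    lr = base_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr /= factor   # divide by 10 once each milestone is passed
    return lr

# Learning rate for each of the 200 training epochs.
schedule = [step_lr(epoch) for epoch in range(200)]
```

The same helper covers the MSD-CSC experiments by passing `base_lr=0.05` and `milestones=(75, 115)`.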

B. Experiments on MSD-CSC
We implement MSDNet architectures with w = 32, d = 6, 9, 12 and s = 1, 2, 3, and use max-pooling layers after one-third and two-thirds of the layers, respectively (Fig. 6B). Average pooling is applied before the softmax layer. We compare the results of ISTA and FISTA with unfolding = 0, 1, 2. In addition, we choose the traditional feed-forward network (baseline) and ML-CSC for comparison (6 layers; the kernel sizes are 4 × 4, 4 × 4, 4 × 4, 3 × 3, 3 × 3 and 3 × 3, respectively, with a stride of 2 in the first three layers and a stride of 1 in the last three layers; the numbers of kernels in the layers are 32, 64, 128, 256, 512 and 512). All these models are trained with SGD on a single GPU with a momentum of 0.9. The total number of training epochs is 150 and the mini-batch size is 128. The initial learning rate is 0.05 and is divided by 10 after 75 and 115 epochs.
First, further unfolding clearly improves accuracy over MSDNet, which implicitly uses no unfolding (Fig. 8A and Table III). MSD-CSC with d = 12 and unfolding = 2 improves accuracy by 1.28% and 3.08% on CIFAR10 and CIFAR100, respectively, compared with the corresponding MSDNet. It should be emphasized that this improvement is achieved without adding any extra parameters. Second, MSD-CSC uses fewer parameters but achieves more accurate results than the other models (Table III and Fig. 8B). That MSD-CSC needs fewer parameters indicates that it has better coding ability, which is consistent with our analysis in Section V.
Note that MSD-CSC takes more time to train even though it has fewer parameters. The reason is that existing deep learning frameworks do not support the dilated convolution and dense connection operations well: they assume that all channels of a given feature map are computed in the same way, and GPU convolution routines such as those in the cuDNN library assume that feature data is stored in contiguous memory. Therefore, the concatenation operation can be expensive in current frameworks [54]. MSD-CSC uses frequent concatenation and split operations (Fig. 5), which limits its application with more unfolding and more layers. We believe that this implementation issue could be addressed in the near future.
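The effect of unfolding without extra parameters can be illustrated with plain ISTA on a toy problem. This is a minimal sketch under several assumptions: a small dense dictionary `D` replaces the mixed-scale convolutional dictionary, the standard two-sided soft threshold is used, the step size is taken from the Frobenius-norm bound on the Lipschitz constant, and `num_unfold` counts the iterations performed after the first thresholded pass from a zero initialization. All names and values are illustrative and are not the implementation evaluated in Table III.

```python
def matvec(A, x):
    """Dense matrix-vector product for lists of lists."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def soft_threshold(x, t):
    """Elementwise soft thresholding: sign(v) * max(|v| - t, 0)."""
    return [max(abs(v) - t, 0.0) * (1.0 if v > 0 else -1.0) for v in x]


def objective(D, y, x, lam):
    """0.5 * ||y - D x||^2 + lam * ||x||_1."""
    residual = [yi - di for yi, di in zip(y, matvec(D, x))]
    return 0.5 * sum(r * r for r in residual) + lam * sum(abs(v) for v in x)


def ista(D, y, lam, num_unfold):
    """One thresholded pass from x = 0, followed by num_unfold ISTA steps.

    Every step reuses the same D, so unfolding adds no parameters.
    """
    # Safe step size: ||D||_F^2 upper-bounds the largest eigenvalue of D^T D.
    step = 1.0 / sum(a * a for row in D for a in row)
    Dt = transpose(D)
    x = [0.0] * len(D[0])
    for _ in range(1 + num_unfold):
        residual = [yi - di for yi, di in zip(y, matvec(D, x))]
        moved = [xi + step * g for xi, g in zip(x, matvec(Dt, residual))]
        x = soft_threshold(moved, lam * step)
    return x


# Toy dictionary and signal (arbitrary values).
D = [[1.0, 0.0, 0.5, -0.3],
     [0.0, 1.0, 0.5, 0.8],
     [0.2, -0.4, 1.0, 0.1]]
y = [1.0, -0.5, 0.7]
lam = 0.05

# Objective value reached with unfolding = 0, 1, 2.
losses = [objective(D, y, ista(D, y, lam, k), lam) for k in (0, 1, 2)]
```

With a step size no larger than the reciprocal of the Lipschitz constant, each ISTA iteration does not increase the objective, so the values in `losses` are non-increasing as the unfolding grows while the dictionary stays fixed.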
VIII. CONCLUSION
Inspired by the relationship between neural networks and ML-CSC, we develop the Res-CSC model to explore the layer-initializing question (LIQ). Intriguingly, ResNet can be seen as a special case of Res-CSC. Hence, we think LIQ is the key issue behind the degradation phenomenon, which has not been formally proposed before. Through evaluation on three common datasets, we find that Res-CSC reaches very competitive or even slightly better performance compared with ResNet. Next, we introduce the MSD-CSC model to decipher the emerging MSDNet architecture by adapting the dictionaries in ML-CSC. Through the analysis of this model, we give a theoretical understanding of the dilated convolution (mixed-scale) and dense connection in MSDNet. Sparse coding has a more complete theory than neural networks, so the bridge between the two makes advanced neural networks easier to interpret. In addition, sparse coding models can be implemented and solved with elegant mathematical optimization algorithms such as ISTA and FISTA. Numerical experiments show that MSD-CSC performs better than ML-CSC because of the advantage of MSDNet. Moreover, it also performs better than MSDNet because of the power of ISTA and FISTA with the unfolding trick, which achieves distinct improvements without extra parameters.
We conclude with some further thoughts and potential research directions. First, as shown in this paper, Res-CSC gives an answer to LIQ. Is this answer the best? We think it is challenging to find a universal rule for the best initialization, since different layers contain different features. Meta learning may offer a potential solution by drawing lessons from MAML [55], CAML [56], Reptile [57] and Meta-SGD [58], which aim to learn initializations.
Next, we can see that dilated convolution corresponds to the dilated convolutional dictionary, and dense connection corresponds to an identity matrix in the dictionary. Can we find a better dictionary structure inspired by such an observation? And what operation does this new dictionary structure correspond to? This would help us find a new basic operation for extracting features from data. Moreover, some architectures or operations in neural networks still lack theoretical understanding (e.g., batch normalization and dropout). Can we explain them in a sparse coding framework? On the one hand, we expect to find a mathematical understanding and improve the original models. On the other hand, we hope to find the key roles that these models play in the context of sparse coding.