SRPS–deep-learning-based photometric stereo using superresolution images

This paper introduces a novel deep-learning-based photometric stereo method that uses superresolution (SR) images: SR photometric stereo. Recent deep-learning-based SR algorithms have yielded great results in enlarging images without mosaic effects. Supposing that SR algorithms successfully enhance the feature and colour information of the original images, using SR images in the photometric stereo method gives access to considerably more information about the object than existing photometric stereo methods use. We built a novel deep-learning-based network for the photometric stereo technique to optimize the input-output relationship between SR image inputs and normal map outputs. We tested our network on the most widely used benchmark dataset and obtained better results than existing photometric stereo methods.


Introduction
The photometric stereo (PS) method uses three or more images of an object captured by a single camera under varied light-source directions and derives the surface normal data of the object. Compared to other surface reconstruction methods, such as stereo matching and structured light, the PS method can derive detailed data on the surface texture and structure. However, the main challenge of the PS method is that the reflection behaviour of non-Lambertian materials can disrupt the surface normal estimation. When measuring the surface normal data of an object with an ideal Lambertian surface, only three images with different light sources and simple matrix equations are sufficient for accurate results. However, when the surface material of an object is non-Lambertian, such as glossy plastic or gold, the intensity of the reflected light is not proportional to the cosine of the incident angle, and traditional PS methods can generate large measurement errors. To overcome these obstacles, optimization-based PS methods have been widely studied; these require a large number of images and high computational costs, and the quality of the resulting normal maps is not stable. The method proposed by Alldrin et al. (2008) exhibits great performance for semi-Lambertian objects; however, it is not consistent across various semi-Lambertian objects and performs especially poorly on metallic surfaces.
Recently, deep-learning methods have been applied in various computer vision domains, such as stereo matching and PS. Santo et al. (2017) developed the deep PS network (DPSN), the first PS method to adopt a deep neural network; it employs dense layers and several dropout layers to reduce errors caused by shadows. The PS-convolutional neural network (CNN) method (Ikehata, 2018) derives an observation map for every pixel in an image and feeds it into a CNN-based network to calculate the surface normal vector of that pixel, which yielded good results for non-Lambertian objects. Zheng et al. (2019) extended the idea of Ikehata (2018) and introduced SPLINE-Net, which applies a two-step CNN-based network.
Further, they devised a symmetric/asymmetric loss function to enhance the training results and reduce the number of images required. However, Zheng et al. (2019) reported that single-pixel-processing PS methods can produce inaccurate normal maps even for Lambertian objects. PS-FCN (Chen et al., 2018) utilizes a large amount of information from neighbouring pixels and is known as the most stable of the existing deep-learning-based PS methods; however, the surface normal maps derived by the PS-FCN lack high-frequency information.
Meanwhile, deep-learning methods are also utilized in image restoration for purposes such as noise reduction and SR. The single-image superresolution (SISR) problem is considered very difficult to solve; however, deep-learning-based methods have demonstrated that it is possible to derive high-quality SR images. We propose that if an SR method can enlarge the original images for the PS method with little information corruption, the PS method can use more data from the input images. We implemented the residual channel attention network (RCAN) algorithm (Zhang et al., 2018) to enlarge the images by a factor of 2 and applied our novel CNN-based network to derive an original-sized normal map. Our method yielded better results than other existing PS methods, especially in eliminating low-frequency errors while preserving the high-frequency information of the object.

Previous Works
The first PS was proposed by Woodham (1980) and used three light sources with calibrated direction vectors and a single orthographic camera. Given the three direction vectors, one image is taken with each light source on, and the surface orientation vector of a surface point is calculated from the three intensity values of the image pixels corresponding to that point. The calibration method introduced by Hayakawa (1994) measures the reflectance data of a known shape. Because the surface gradient of a sphere is straightforward to determine, they chose a sphere as the reference shape, which is still adopted in current light calibration. However, these methods cannot derive accurate surface normal vectors for a non-Lambertian surface because the equations presented by Woodham (1980) and Hayakawa (1994) do not fit the reflection characteristics of non-Lambertian surfaces. To solve this problem, many papers (Iwahori et al., 1995; Wu et al., 2010; Shi et al., 2013; Ikehata & Aizawa, 2014) have proposed optimization-based solutions. These studies derived improved normal maps compared to the conventional rule-based PS method; however, they still required more than 30 images for accurate results, and the assumptions they made did not perfectly fit real-world materials, making the results unstable. Thus, further improvement was required. Chen (2019) devised a non-deep-learning method that estimates the reflectance, smoothness, and normal of the surface in a microfacet-based reflectance model, which can effectively replace existing surface reflectance estimation models and obtains results close to those of deep-learning methods.
Recently, studies on deep-learning-based PS have been conducted with improved results. The first deep-learning-based PS was established by Santo et al. (2017) with 96 images per object. Because the input of the network is a group of pixels with the same coordinates across images, it is possible to train the network with a relatively small dataset. Ikehata (2018) introduced the PS-CNN, which utilizes an observation map as the input of its network. Zheng et al. (2019) improved the observation map, applied a symmetric/asymmetric cost during training, and succeeded in obtaining good results with a relatively small number of images. The networks introduced in the PS-CNN and SPLINE-Net use one pixel per image to derive one pixel of the normal map; hence, they require a small computational cost. Li et al. (2019) fixed the range of light-source angles in the training dataset, effectively reducing the number of images required, and found the optimal group of light sources.
However, Zheng et al. (2019) reported that these networks fail to calculate precise normal maps when an object has a Lambertian surface. While deriving the surface normal vector of a pixel, the DPSN, PS-CNN, and SPLINE-Net do not refer to any information from neighbouring pixels; therefore, the normal vector is not properly derived when the light information of the pixel is insufficient (e.g. a small number of images of a textureless non-Lambertian object). Chen et al. (2018) proposed a network named PS-FCN that max-pools the feature maps from each image and deconvolves the fused feature map to derive the final normal map. They then improved their method by normalizing the input images, which eliminated the defects caused by albedo differences. These methods use N-shared encoder networks and one decoder network for N inputs and, therefore, require a large amount of resources. PS-FCN and PS-FCN+N use a large number of neighbouring pixels per surface normal vector; consequently, they output normal maps with fewer low-frequency errors than any other PS method but lose high-frequency information of the surface normal. An overall summary of deep-learning-based PS methods has been presented by Zheng et al. (2020). According to the authors, PS-FCN+N is considered the state-of-the-art method with respect to average accuracy and stability when using a small number of images.
Meanwhile, image SR is a sub-area of image restoration that restores the information lost through image size reduction. Before the growth of deep learning, image SR methods analysed the image using the continuous Fourier transform (CFT), the discrete Fourier transform (DFT), and multiple continuous images (Liyakathunisa & Ananthashayana, 2008; Zheng et al., 2010). The point of SR methods that use CFT/DFT analysis is that image registration can help interpolate low-resolution images onto a high-resolution image grid, and CFT/DFT analysis can efficiently extract suitable information for image registration. The CFT/DFT can be replaced by a wavelet transform (Iwahori et al., 1995; Shi et al., 2016). Non-deep-learning-based SISR assumes that the image can be effectively restored if the edge information is accurately determined and a proper interpolation method is applied. Various non-deep-learning SISR methods have been introduced previously (Patel et al., 2011; Makwana & Mehta, 2013).
The deep-learning-based SR method was first developed by Dong et al. (2014) and exhibited extraordinary performance compared to non-deep-learning methods. This first deep-learning-based SISR used four convolutional layers to upscale images by factors of 2, 3, and 4. Among the various deep-learning-based SISR methods (Dong et al., 2014; Kim et al., 2016; Lim et al., 2017; Zhang et al., 2018), the RCAN introduced by Zhang et al. (2018) applied a residual-in-residual network to obtain high-quality SR images and was considered the most precise SR method until 2019, according to a review by Wang et al. (2020). After the RCAN, several methods (Jiang et al., 2020; Jin et al., 2020; Mei et al., 2020) were developed by adjusting hierarchical/multiple-branch structures and exhibited better performance than the RCAN, especially when images were magnified by more than four times. For 2× SR, however, the PSNR values of these methods (Jiang et al., 2020; Jin et al., 2020; Mei et al., 2020) did not differ significantly from those of the RCAN. Because we apply 2× SR to the input images, the choice among these methods makes little difference; we therefore implemented the RCAN on our datasets and designed a network architecture suitable for SR images.

Architecture of the SRPS Network
Our method is based on the PS-FCN+N, which extracts N feature maps from N input images and merges the feature maps to derive one surface normal map. To enhance the performance of deep-learning-based networks, we improved the following aspects: 1) Using SR images without further pre-processing does not allow the network to calculate an accurate surface normal map; we therefore established a pre-processing pipeline before feeding SR images into our network. 2) We redesigned the feature-map network and cost function to use SR images as inputs. We found that using SR images with an unchanged feature extraction network does not enhance the details; therefore, we designed a proper network structure. Further, we designed the cost function to enhance training by considering both the normal vector and the gradient of the normal vector.
We will describe the designed network structure and image pre-processing, and demonstrate how these factors enhanced the surface normal map calculation results. The overall flow of the network is presented in Fig. 1.

Image normalization
The image pre-processing involves three steps: image normalization, image SR, and image-light vector concatenation. In the image normalization step, we applied equation (1) to each image that participates in the PS:

I_k^N = I_k / √( Σ_{j=1}^{N} I_j² ).   (1)

Here, I_k is the intensity value of a pixel in the kth image, N is the number of images that participate in the PS, k ∈ {1, 2, ..., N}, and I_k^N is the corresponding intensity of the kth normalized image. The normalization process was performed independently for each RGB colour channel.
If the albedo of an object is very small or very large, the intensity difference between the participating images decreases regardless of the light direction, which makes it difficult to calculate the surface normal vector accurately. Normalizing the images solves this problem by eliminating the effect of albedo. Although oversaturation of light is not perfectly removed by normalization, it helps the overall PS process. An example of the normalization results is presented in Fig. 2. Focusing on the black stripe texture of the object, it can be observed that the black texture is removed from the normalized image.
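As a minimal sketch, the per-channel normalization of equation (1) can be written as follows; the array shapes and the small epsilon guard are our own assumptions, not part of the original implementation:

```python
import numpy as np

def normalize_images(images, eps=1e-8):
    """Normalize a stack of PS images per pixel and per colour channel.

    images: float array of shape (N, H, W, 3), one slice per light direction.
    Each pixel/channel is divided by the L2 norm of its intensities across
    the N images, which cancels the light-independent albedo factor.
    """
    norm = np.sqrt(np.sum(images ** 2, axis=0, keepdims=True))  # (1, H, W, 3)
    return images / (norm + eps)

# Toy example: two images whose pixels differ only in intensity scale.
imgs = np.stack([np.full((2, 2, 3), 0.3), np.full((2, 2, 3), 0.4)])
out = normalize_images(imgs)
```

After normalization, each pixel's intensity vector across the N images has unit length, so a constant albedo factor applied to all images cancels out.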

Image SR
After the normalization process, the input images were enlarged using the SR method. Existing SR methods multiply the resolution of the input image by scaling factors of 2, 3, and 4. According to Zhang et al. (2018), the PSNR of SR images is best for 2× enlargement, which means that SR images enlarged by a scaling factor larger than 2 contain more errors. We empirically concluded that it is not necessary to enlarge images more than two times and used 2× SR images for our network. We designed our network to take (2h × 2w)-sized images as inputs and generate an (h × w)-sized surface normal map as the output; thus, only the input images are super-resolved. Here, h and w are the height and width of the output normal map, respectively. The result of the SR step is presented in Fig. 3. Comparing the hair and ear parts of the images, it can be observed that the SR image contains more detailed information about the object.
To verify that we correctly implemented the SR method introduced by Zhang et al. (2018), we tested our implementation on the FlyingThings3D dataset (Jin et al., 2020), which is the most widely used dataset for stereo matching. We randomly selected 1000 images from the dataset, resized each original image by a scale factor of 0.5, and applied SR to the resized image. Following Zhang et al. (2018), the PSNR value is used as the performance criterion. We calculated the PSNR value for each pair of images and confirmed that the values varied in the range of 30.5-33.1 dB, which indicates that the algorithm was well implemented. Figure 4 presents an example of the SR algorithm; the PSNR value of the pair in Fig. 4 is 31.65 dB. We also calculated the PSNR values for pairs of normalized images in the DiLiGenT dataset (Shi et al., 2016), which is used as the test dataset of our experiment. We normalized the original images of the dataset (image A), resized the normalized images by a 0.5× rate, and applied 2.0× SR (image B). We then derived the PSNR value between image A and image B; the values varied in the range of 28.6-30.7 dB. Although the SR network was not trained on normalized images, it also worked properly for them.
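For reference, the PSNR criterion used above can be computed as below; this is the standard definition, with images assumed to be scaled to [0, max_val]:

```python
import numpy as np

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio (dB) between two same-sized images."""
    diff = np.asarray(a, dtype=np.float64) - np.asarray(b, dtype=np.float64)
    mse = np.mean(diff ** 2)  # mean squared error over all pixels
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10((max_val ** 2) / mse)
```

Higher values indicate that the super-resolved image is closer to the reference; values around 30 dB, as reported above, correspond to a small mean squared error.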
During the experiments, we found that the error caused by SR significantly affects the result and that the sequence of pre-processing is important to compensate for this error. SR effectively preserves the edges and detailed information of the original image but slightly degrades light consistency. We found that applying SR to normalized images preserves more lighting consistency than applying it to non-normalized images.
To demonstrate the intensity distortion caused by SR, we randomly selected 50 images from the DiLiGenT dataset (Shi et al., 2016) and derived the average and standard deviation (std. dev.) of the intensity values of each image. Then, we applied SR to each image and compared the differences in average and std. dev. between the original and SR images. Further, we normalized the images and repeated the process. Table 1 lists the average values of the 50 trials for each case. Note that applying SR to a non-normalized image significantly lowers the std. dev., whereas the std. dev. of a normalized image is not lowered. These results explain why networks trained using SR images without image normalization fail to calculate surface normal maps accurately, whereas those using image normalization succeed.

Light vector-image concatenation
The final step in image pre-processing is light vector tile generation and concatenation. The feature extraction step takes N × (2h × 2w × 6) arrays as inputs, where N is the number of images used in the PS. Suppose the input images are SR images of size (2h × 2w × 3). Because we want our PS to utilize the light vector information as well as the image, we edited the input images by referring to Chen et al. (2018). To utilize the light direction vector efficiently, we duplicated it into a tile matrix of size (2h × 2w × 3), matching an SR input image. By concatenating the input image and the tile matrix, the size of the input becomes (2h × 2w × 6). This pre-processing simplifies adding inputs to our network and keeps the structure of the feature extraction network (FEN) simple. The final step of image pre-processing is depicted in Fig. 5.
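The tiling and concatenation step above can be sketched in NumPy as follows, with the shapes as described:

```python
import numpy as np

def concat_light(image, light_dir):
    """Concatenate an SR image with a tiled light-direction vector.

    image: (2h, 2w, 3) SR input image.
    light_dir: (3,) light direction vector.
    Returns a (2h, 2w, 6) array: the image channels followed by the
    light vector repeated at every pixel.
    """
    H, W, _ = image.shape
    tile = np.broadcast_to(np.asarray(light_dir, dtype=image.dtype), (H, W, 3))
    return np.concatenate([image, tile], axis=-1)

x = concat_light(np.random.rand(8, 6, 3), np.array([0.0, 0.0, 1.0]))
```

Each of the N images is paired with its own light vector this way, so the FEN receives both photometric and lighting information in a single array.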

FEN
The pre-processed images are fed into the FEN, which is composed of several CNN layers. The entire structure of the FEN is shown in Fig. 6. We stacked 11 CNN layers to set the output of the FEN to (h/2 × w/2 × 256), referring to the structure of the PS-FCN. We then added two skip connections: between layers 5 and 8 and between layers 3 and 10. We set the size of the output close to that of the final normal map to prevent the information loss that may occur via deconvolution in the final step of the network.
According to ResNet (He et al., 2016) and U-Net (Ronneberger et al., 2015), building skip connections between layers helps detect fine features in images, which makes them widely applicable in other deep-learning-based computer vision areas. We expected the skip connections to extract detailed feature maps from the input images. If the number of images used is N, the FEN is called N times, and N feature maps are extracted and passed to the next step. Figure 7 presents an example of the resulting feature maps. Some channels of the feature map contain meaningful information, such as the edges of the object (channel 24) or the shadowed part caused by the light source (channel 131).
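A Keras sketch of such an FEN is given below. The filter counts, kernel sizes, and the placement of the two stride-2 downsamplings are illustrative assumptions; only the layer count (11), the two skip connections (layers 3→10 and 5→8), and the (h/2 × w/2 × 256) output follow the description above:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_fen(h=32, w=32):
    """Sketch of an 11-layer FEN with two additive skip connections."""
    x_in = layers.Input(shape=(2 * h, 2 * w, 6))  # SR image + tiled light vector
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x_in)  # 1: -> h
    x = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)    # 2: -> h/2
    x3 = layers.Conv2D(256, 3, padding="same", activation="relu")(x)              # 3
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x3)              # 4
    x5 = layers.Conv2D(256, 3, padding="same", activation="relu")(x)              # 5
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x5)              # 6
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)               # 7
    x = layers.Add()([x, x5])                                                     # skip 5 -> 8
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)               # 8
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)               # 9
    x = layers.Add()([x, x3])                                                     # skip 3 -> 10
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)               # 10
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)               # 11
    return tf.keras.Model(x_in, x)
```

The same model (shared weights) would be applied to each of the N pre-processed inputs to produce the N feature maps that are fused in the next step.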

Merge and calculation network (MnC)
After extracting the N feature maps, they are merged and passed to the MnC to calculate the normal map. We adopted a max-pooling layer to fuse the feature maps from the FEN. Because pooling layers have no weights, the order of the inputs does not affect the results. The role of max pooling is illustrated in Fig. 8.
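The element-wise max fusion across the N feature maps can be sketched as:

```python
import numpy as np

def fuse_feature_maps(feature_maps):
    """Fuse N feature maps of shape (N, H, W, C) into one (H, W, C) map
    by element-wise maximum. Because max is commutative, the result is
    invariant to the order of the N inputs (i.e. the image/light order)."""
    return np.max(feature_maps, axis=0)

f = np.random.rand(4, 5, 5, 8)
fused = fuse_feature_maps(f)
```

This order invariance is what allows the network to accept the N images in any sequence, a property inherited from PS-FCN's max-pooling fusion.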
The normal map calculation network is composed of seven 2D convolutional layers, two residual skip connections, and an L2-normalization layer; the skip connections are applied between layers 1 and 3 and between layers 4 and 7. The ablation section shows that adding skip connections to the network is essential because a plain stack of convolutional layers cannot properly utilize the information of the SR image. The structure of the MnC is shown in Fig. 9.

Loss function for learning
Because the output is evaluated using the cosine similarity function, it is efficient to use the same function as the loss function of our network. The cosine similarity loss function L is as follows:

L = (1/(h·w)) Σ_{i,j} (1 − n_{ij} · n'_{ij}).   (2)

Here, n_{ij} and n'_{ij} denote the calculated surface normal vector and the ground-truth vector, respectively. The maximum value of n_{ij} · n'_{ij} is 1, attained when n_{ij} and n'_{ij} are equal; thus, the cosine similarity loss function is adequate for use.
Because the loss function above considers only the mean angular difference, it does not guarantee the detailed information of the surface, especially when the algorithm utilizes a large amount of neighbouring information; this causes a blurring effect, which can be observed on the resulting normal maps in the result comparison section. We therefore designed a new loss function that employs the gradient of the normal vectors to enhance the detail information:

L_G = (1/(h·w)) Σ_{i,j} ( |n_x − n'_x| + |n_y − n'_y| ).   (3)

Here, n_x and n'_x denote the x-direction gradients of the derived and ground-truth normal vectors, respectively, and n_y and n'_y are the corresponding y-direction gradients.
Using L_G alone can cause global errors because the angular error is not considered in L_G. The final loss function, L_f, is established to consider both the angular and the gradient error:

L_f = (1 − α)·L + α·L_G.   (4)

We empirically chose α = 0.4, which showed the best result among experiments with various values between 0.1 and 0.9.
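The loss terms can be sketched as follows. The cosine-similarity term matches the description above; the finite-difference form of the gradient term and the convex-combination form of the final loss are our assumptions about the exact implementation:

```python
import numpy as np

def cosine_loss(n_pred, n_gt):
    """L: mean of (1 - cosine similarity) over all pixels.
    n_pred, n_gt: (H, W, 3) arrays of unit surface normals."""
    return np.mean(1.0 - np.sum(n_pred * n_gt, axis=-1))

def gradient_loss(n_pred, n_gt):
    """L_G (assumed form): mean absolute difference between the x/y
    finite-difference gradients of predicted and ground-truth normals."""
    gx_p, gy_p = np.diff(n_pred, axis=1), np.diff(n_pred, axis=0)
    gx_g, gy_g = np.diff(n_gt, axis=1), np.diff(n_gt, axis=0)
    return np.mean(np.abs(gx_p - gx_g)) + np.mean(np.abs(gy_p - gy_g))

def total_loss(n_pred, n_gt, alpha=0.4):
    """L_f = (1 - alpha) * L + alpha * L_G (combination form assumed)."""
    return (1.0 - alpha) * cosine_loss(n_pred, n_gt) + alpha * gradient_loss(n_pred, n_gt)
```

With alpha = 0.4, the gradient term sharpens fine detail while the angular term keeps the overall orientation of the normals correct.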

Experimental Setting, Dataset, and Result Criteria
In this section, we will describe the dataset, hardware, software set-up, and criteria for performance analysis.

Training dataset
We used the Sculpture PS Dataset implemented by Chen et al. (2018). The Sculpture PS Dataset employs eight 3D models with complicated shapes. Approximately 3700 views per 3D model were applied, and two randomly chosen surface materials were rendered on each model. Each viewpoint has 64 random light conditions; hence, the total number of samples was 29 646 × 2 = 59 292, and the total number of images was 59 292 × 64 = 3 794 688. The original size of the dataset images was 128 × 128, and we applied the RCAN SR algorithm to each image to obtain a size of 256 × 256. We applied the pre-trained RCAN only to the input images of the dataset, not to the normal maps, which were used as the ground truth. We cropped 96 × 96 patches from each image and 48 × 48 patches from the normal maps. Example images of the dataset are presented in Fig. 10.
To train our network, we used the Keras framework (Keras, 2015). We randomly split the dataset into training and validation sets at a ratio of 99:1. The initial learning rate was set to 0.001 and was halved every five epochs. We trained our network with a batch size of 32 for 50 epochs. The total number of weights in our network is 10 275 587; the entire training process took approximately 3 days. We observed that the final loss value oscillates near 0.14 and that the validation loss oscillates significantly; therefore, we did not consider the validation loss the dominant indicator of good training results. The loss oscillation may have been caused by the black parts of the dataset images. We employed the Adam optimizer (Kingma & Ba, 2014) with the default parameters (β1 = 0.9 and β2 = 0.999). We used an Nvidia Tesla V100 with 32 GB of VRAM and 128 GB of RAM for training; we observed that the minimum RAM requirement for training our network was 32 GB.
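The learning-rate schedule described above (0.001, halved every five epochs) can be expressed as a simple function in the style expected by Keras's LearningRateScheduler callback:

```python
def lr_schedule(epoch, initial_lr=1e-3):
    """Learning rate for a given (0-indexed) epoch: halved every 5 epochs."""
    return initial_lr * 0.5 ** (epoch // 5)
```

In Keras this would typically be attached via `tf.keras.callbacks.LearningRateScheduler(lr_schedule)` when calling `model.fit`.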

Test dataset
The most widely used benchmark dataset for the PS method is the DiLiGenT dataset (Shi et al., 2016), which contains 10 objects with varied and composite materials. Each object has 96 images with different light directions. Examples of images and the ground truth are presented in Fig. 11. The input images are likewise super-resolved by a scale factor of 2 before being fed into our network. We chose N = 16 as the number of images to use and compared the resulting surface normal maps with those obtained by other methods.
To show the contribution of our research more clearly, we grouped the test objects by two criteria. The first criterion is the inverse of the L2 (Woodham, 1980) error value of each object (L2⁻¹). The L2 error becomes zero when the object is completely Lambertian and shadowless; therefore, we used L2⁻¹ as a criterion for the Lambertian-ness of the object. The second criterion is the mean of gradient (MoG), which is the mean of the gradient of the ground-truth normals. If the MoG value is large, the object has a complex shape. Figure 12 shows the grouping of the objects using the two criteria.
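As a sketch of the second criterion, the MoG of a ground-truth normal map can be computed with finite differences; the exact gradient operator used in the paper is not specified, so the form below is an assumption:

```python
import numpy as np

def mean_of_gradient(normal_map):
    """MoG: mean magnitude of the spatial gradient of an (H, W, 3)
    normal map. Larger values indicate a more complex shape."""
    gx = np.diff(normal_map, axis=1)  # x-direction differences
    gy = np.diff(normal_map, axis=0)  # y-direction differences
    return np.mean(np.abs(gx)) + np.mean(np.abs(gy))
```

A perfectly flat surface (constant normals, e.g. a plane) yields an MoG of zero, while highly detailed objects such as Harvest score high.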

Error evaluation criteria
To evaluate the performance of our network, we used the following error criteria. The first is the mean angular error (MAE), the mean over all object pixels of the angular difference between the derived and ground-truth normal vectors. The second is the MAE over large-gradient areas (MAE-LG). A large gradient indicates that the geometry of the area is complex; therefore, comparing the MAE of large-gradient areas helps evaluate the performance of each method on detailed surfaces. We filtered out the pixels with a gradient smaller than the median gradient value of the object. Figure 14 shows the filtered normal maps used for deriving the MAE-LG. The Ball object is neglected in this step because comparing MAE-LG on a completely smooth object is meaningless.
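The two criteria can be sketched as below; the gradient magnitude used for the median filtering is again an assumed finite-difference form:

```python
import numpy as np

def mae_degrees(n_pred, n_gt, mask=None):
    """Mean angular error (degrees) between unit normal maps (H, W, 3),
    optionally restricted to a boolean pixel mask."""
    cos = np.clip(np.sum(n_pred * n_gt, axis=-1), -1.0, 1.0)
    ang = np.degrees(np.arccos(cos))
    return ang[mask].mean() if mask is not None else ang.mean()

def large_gradient_mask(n_gt):
    """MAE-LG filter: keep pixels whose gradient magnitude exceeds the
    object's median gradient (gradient form is an assumption)."""
    H, W = n_gt.shape[:2]
    gx, gy = np.zeros((H, W)), np.zeros((H, W))
    gx[:, :-1] = np.linalg.norm(np.diff(n_gt, axis=1), axis=-1)
    gy[:-1, :] = np.linalg.norm(np.diff(n_gt, axis=0), axis=-1)
    return (gx + gy) > np.median(gx + gy)
```

MAE-LG is then simply `mae_degrees(n_pred, n_gt, large_gradient_mask(n_gt))`, i.e. the MAE evaluated only on the complex regions of the object.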

Ablation study
We trained two networks for the ablation study. The difference between our network and each network is as follows: 1. Net 1: Net 1 uses bicubic interpolated images as input to check the effect of the SR image input. 2. Net 2: Net 2 has no skip connection in the network, as in PS-FCN.
The structure of each network is illustrated in Fig. 15. We tested each network on the test dataset and derived the MAE and MAE-LG values for the ablation study.

Synthetic dataset test
We created six synthetic datasets using three complicated objects: alien, house, and octopus. We randomly picked six surface materials; sample images are shown in Fig. 16. Each object has 11 × 11 regularly distributed light positions, from which we randomly selected 16 positions to obtain a surface normal map. The derived surface normal maps are shown in Fig. 17, and the MAE of each normal map is listed in Table 2; we averaged 50 trial results for each object. Our method outperformed the PS-FCN+N for every object in the synthetic dataset. Details of the surface normal maps can be seen in Fig. 18.

SR image quality test
In this section, we evaluate how much SR degrades the result of the PS. Because the original DiLiGenT dataset does not have (2×)-sized images, we created two normal maps for error comparison as follows:

- Normal map from original normalized images (N_o): We derived a normal map N_o using normalized DiLiGenT dataset images. To obtain this result, we trained a network SRPS', which has the same structure as SRPS but is trained using normalized training images without SR, so that it outputs (0.5×)-sized normal maps N_o.
- Normal map from resized-SR normalized images (N_r): We resized the normalized DiLiGenT dataset images by a 0.5× rate and then applied 2.0× SR to obtain resized-SR images, which have the same size as the original images. We used these images as the input of SRPS and derived (0.5×)-sized normal maps N_r. By comparing the mean angular errors of N_o and N_r, we can measure the degradation effect of SR on the PS.

The resulting normal maps and error comparison are shown below. Table 3 shows the mean angular errors of N_o and N_r. Because the difference between the two values is small and neither is clearly superior, we concluded that SR does not degrade the performance of our method. Figure 19 shows samples of N_o and N_r. From this result, we inferred that our network effectively compensates for the image error caused by SR degradation. Table 4 presents the results of the comparison. We selected 16 random images for each trial, and the resulting values of our method are the averages of 30 trials per object. Our method exhibits the best performance among the PS methods for every object in the dataset, except for the Ball and Harvest datasets. Examples of the resulting normal and error maps are shown in Fig. 20. Table 5 presents the MAE-LG errors for our network as well as the PS-FCN and PS-FCN+N; our network exhibited the best performance for every object in the dataset. Table 6 presents the local MAE results of our network, along with those of the PS-FCN and PS-FCN+N; again, our network exhibited the best performance for every object. The MAE-LG results shown in Tables 5 and 6 demonstrate that our method is superior at calculating the surface normal vectors of complex objects. Additionally, the MAE-LG results reveal that our method achieves better results for all objects, especially for MoG group 1.

Overall comparison of benchmark dataset
Tables 7 and 8 present the ablation study results with respect to object MAE and MAE-LG. Remarkably, using bicubic images as input instead of SR images also yielded good performance in the MAE-LG comparison, but relatively low performance in the object MAE. It can be deduced that bicubic interpolation can preserve the detailed information of an object but is inferior at preserving the material characteristics of the surface. Table 9 presents the ablation study results obtained using the local MAE, and Fig. 21 shows the normal map calculation results for the local MAE comparison between the PS-FCN+N and our method. It can be observed that detailed information is well preserved in the results of our method, such as the teeth of Buddha as well as the hairband jewel and cloth ornaments of Harvest.
Tables 10 and 11 present an ablation study comparing the results (MAE/MAE-LG errors) when our network is trained with the different loss functions L and L_f. The results show that using L_f yields superior results for all objects in the test dataset compared to using L alone.

Conclusion and Future Work
In this paper, we proposed a new PS method that utilizes SR images, referred to as SRPS. Our SRPS method can derive more precise and accurate surface normal map data using a relatively small number of images. Additionally, compared to other methods, our network efficiently reduces the high-frequency error caused by several factors. These results indicate that using SR images in computer vision/image analysis is considerably effective because the amount of applicable information drastically increases. We will explore other areas, such as stereo vision or image segmentation, where SR is also applicable.
Because our network uses many neighbouring pixels per surface normal vector, the recovery of detailed information still needs to be improved. Moreover, our network is larger than other deep-learning-based networks, and it grows further when the RCAN network is considered as well (the total number of weights is approximately 13 million). This can be a critical disadvantage for practical use; therefore, we will find ways to optimize the number of weights in future work. Additionally, our network supports only calibrated light environments; consequently, we plan to extend it to uncalibrated light environments.
Further, we tested our network on the Light Stage Data Gallery dataset (Einarsson et al., 2006) but obtained poor results for all objects. This is because the objects were so large that the light could not cover their entire area, causing significant errors when using a small number of images. In Fig. 22, it can be observed that the resulting normal maps for Knight-Standing and Helmet are defective and blurred compared to the results of the PS-FCN, which used approximately 140 images. This indicates that every light source used for our method should cover the entire part of the object to be measured for our network to function effectively.