Gaze Estimation Using Neural Network And Logistic Regression

Currently, a large number of mature methods are available for gaze estimation. However, most regular gaze estimation approaches require additional hardware or platforms with professional equipment for data collection or computing, which typically involve high costs and are relatively tedious, and their implementation is particularly complex. Traditional gaze estimation approaches usually require systematic prior knowledge or expertise for practical operation. Moreover, they are primarily based on the characteristics of the pupil and iris, using pupil shape or infrared light and iris glint to estimate gaze, which requires high-quality images shot in special environments with additional light sources or professional equipment. We herein propose a two-stage gaze estimation method that relies on deep learning and logistic regression and can be applied to various mobile platforms without additional hardware devices or systematic prior knowledge. An automatic and fast data collection mechanism is designed for collecting gaze images through a mobile platform camera. Additionally, we propose a new annotation method that improves the prediction accuracy and outperforms the traditional gridding annotation method. Our method achieves good results and can be adapted to different applications.


INTRODUCTION
Gaze estimation is a technique of detecting and obtaining the direction and position of observed gaze through hardware and software algorithm analyses [1][2][3][4]. For instance, as illustrated in Fig. 1, a gaze estimation-based application installed on a device would return the predicted gaze location on the screen when the user is looking at the screen. Recently, gaze estimation has been widely used in many scientific research fields, especially in human-computer interaction [5]. With gaze estimation technology, a system was developed around the needs of region-of-interest decompression and display in the context of large-image interpretation and analysis in science and medicine [6]. Other application fields cover advertising recommender systems [7], assisted driving [8], psychology [9], the military [10] and other fields [11,12]. As the number of scenarios using gaze estimation technology increases, more convenient, fast and low-cost methods for gaze estimation are required.
Gaze estimation also has many applications in medical research. An example is the vision screening test, a popular test used in hospitals for screening potential vision problems and eye disorders [13]. Typically, during the test, a doctor monitors the eye movement of children when the children are attracted to the test pictures. However, many factors, including the non-cooperation of children, limitations of the viewing angle and the differing medical experience of doctors, can affect the results of the vision screening test. For the vision screening test, gaze estimation can be utilized to obtain more accurate results in a fast and convenient way. For instance, using the gaze estimation method proposed herein, without human intervention, mobile devices can automatically recognize the gaze direction of children or assess whether the gaze of children is located on the testing pictures through the gaze captured by the camera of mobile devices, and can thus aid doctors in obtaining more accurate test results.
In this study, we aim to develop a gaze estimation method specialized for mobile devices such as mobile phones. The main reasons for targeting mobile devices are 3-fold. First, mobile phones are widely used not only for communication but also in daily life, including making online shopping payments, entertainment, telecommuting and identity recognition. Next, mobile phones receive frequent hardware updates, and most smart phones are capable of supporting computationally intensive algorithms. Moreover, the rapid development of digital cameras in phones has tremendously improved image quality, which enables us to collect reliable gaze images conveniently and economically. Mobile phones supply a convenient platform for gaze estimation and have become popular in recent years. Gaze has shown its advantages in human-computer interaction. When one uses a mobile device, gaze serves as a tool for hands-free interaction and has many applications as an input modality for tasks including desk control [14], target selection [15] and notification display [16]. Using gaze for interaction control is faster than the mouse for pointing on the screen; the mouse tends to lag behind gaze by more than 100 ms on average, according to the results of an experiment on the correlation between eye and mouse movements [17]. The gaze estimation technique can also be used to recommend relevant or similar goods to a potential customer if the gaze of a user is detected to dwell on some goods. When a user is playing games on a mobile phone, the captured gaze can help the user control the direction of the target movement. The gaze captured by mobile phone cameras can help users unlock their phones during password entry [18] and open an application without using their hands.
Deep learning is an extremely powerful technique that has developed fast and is widely used in computer vision [19][20][21], including gaze estimation [22,23]. iTracker [24], a convolutional neural network (CNN) for gaze estimation, is a typical method based on deep learning. Hence, we used the deep learning technique in our proposed gaze estimation method. Furthermore, there have been many open-source datasets for gaze estimation, such as GazeCapture [24], TabletGaze [25], a comprehensive head pose and gaze database [26], ETH-XGaze [27] and RT-GENE [28]. However, these datasets were collected from mixed devices or tablets instead of single phone devices. Also, these datasets lacked images in which the participants' gaze was located outside the screen. We wish to develop a gaze estimation method mainly based on mobile phones that exhibits good precision and is applicable to various applications in daily life. Therefore, we collected our own gaze dataset using mobile phones. The appearance and operating system of the device used in this study are similar to the most popular daily-used mobile devices. Besides, the participants were all Chinese. In the collecting process, we added temporal information and collected images when the gaze was located outside the screen, which can be used to improve the model.
Motivated by the existing methods, we propose a gaze estimation method that combines a neural network and the logistic regression method. Our proposed method relies on a two-stage process. In the first stage, a convolutional neural network with a logistic regression layer processes the input gaze pictures and outputs estimated probability vectors of bin annotation labels in both the horizontal and vertical directions. In the second stage, an additional logistic regression is used for refinement of the prediction from the neural network. The proposed two-stage method is different from existing gaze estimation methods, although they share certain similarities. To be specific, for the first stage, a similar method treating the screen horizontally and vertically has been investigated in TabletGaze, where the gaze labels of data also include both horizontal and vertical coordinates on the screen. Different from TabletGaze, we treat the gaze direction in the horizontal and vertical directions separately and split the screen into bins. The labels of the horizontal and vertical directions are bin index vectors consisting of 1s and 0s instead of one coordinate. The gaze location in TabletGaze is obtained directly by regression, while the gaze location in the proposed method is obtained through regression using the probability vector output from the CNN and setting a threshold. In another paper, Liu [29] studied a logistic regression layer following a CNN for gaze estimation, which also shares certain similarities with our method. In Liu's work, the last classification layer of the CNN is replaced with a logistic regression layer used for classification over all raw image pixels, so the method proposed by Liu is more similar to the gridding method. In contrast, the logistic regression layer following the CNN in our proposed method is used to produce probability vectors corresponding to the bins instead of performing classification directly for a binary outcome.
In the second stage of the proposed method, the probability vectors are used to fit curves with a modified sigmoid function, and the target point is obtained by setting a threshold, which refines the prediction result.
In this article, we developed a data-driven model for gaze estimation based on a two-stage process without using hand-engineered features. Specifically, we proposed an annotation method that annotates the gaze location by splitting the screen into bins horizontally and vertically instead of into pixels. Compared with traditional annotation methods, the proposed annotation method converts a complex multi-class problem into a binary-class problem over multiple bins, controls the number of labels well and improves the accuracy. The first stage of the proposed method is similar to existing works; the additional logistic regression in the second stage is our major contribution, which further processes the output probability vectors to refine the gaze prediction.

Data collection
For gaze estimation, we first designed a set of automatic and fast data collection mechanisms for collecting gaze data that included ordinary images captured by the camera of the designed mobile platform. In this study, we collected 200 frames for each of the 550 participants. The participants were all Chinese, with ages ranging from 20 to 35 years. Data collection took about 10 minutes for each participant and was implemented using a Samsung Galaxy S8+. The screen resolution was 2220 × 1080 pixels. The screen size and device type align with the popular devices used in China. The light environment of data collection mimics natural office working scenarios. To collect gaze images, we preset fixed points on the mobile device screen, which enabled us to obtain the ground truth of the gaze points on the screen easily and conveniently. We set up a collecting program, in advance, that provided instructions to the participants. The participants followed the instructions to prepare for the next step or look at the screen. We obtained frames from the camera of the mobile device when the participants looked at the fixed points on the screen. Specifically, once the collecting program started, the fixed points began to appear in order on the screen. There were 25 gaze points in total, and each point was presented for 15 seconds on the screen. The camera captured pictures when the participants were instructed to look at the points. To guarantee the quality of the collected images, every step was performed after a voice prompt. Before displaying a new gaze point, there was a break of about 5 seconds. To avoid fatigue, after every five gaze points the screen automatically turned black to give the participants a short break of about half a minute. In addition, the participants were encouraged by voice prompt to change their head pose and move their head to different positions relative to the camera.
The scenes of participants with different head poses or body postures simulate the different postures in which people use phones in daily life. The scene of gaze data capture is displayed in the left panel of Fig. 2. Due to the different sitting postures of participants, the distance from participant to screen was not fixed and ranged from 25 to 60 cm. An illustration is shown in the right panel of Fig. 2. We also captured images where the gaze was spotted outside the screen to improve the accuracy of gaze estimation. Besides, we captured the data frames in time sequence so that time series information can be added to our model in future research.
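The timing protocol above can be sketched as follows (the function and event names are our own illustration, not the authors' collecting program); with 25 points shown for 15 s each, 5 s breaks and a half-minute rest after every five points, one session totals about 10 minutes, matching the reported duration:

```python
# Sketch of the collection schedule described above; the function and event
# names are our own illustration, not the authors' collecting program.
def generate_schedule(n_points=25, show_s=15, break_s=5, rest_every=5, rest_s=30):
    """Return a list of (event, duration_in_seconds) tuples for one session."""
    events = []
    for i in range(1, n_points + 1):
        events.append((f"show_point_{i}", show_s))
        if i == n_points:
            continue                       # no break after the last point
        if i % rest_every == 0:
            events.append(("black_screen_rest", rest_s))   # half-minute rest
        else:
            events.append(("short_break", break_s))        # 5 s voice-prompted break
    return events

schedule = generate_schedule()
total_s = sum(d for _, d in schedule)      # 595 s, about 10 minutes per participant
```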

Data annotation and pre-processing
Data annotation is necessary for gaze learning tasks. A commonly used method is the gridding annotation method, which divides the screen into grids. Denote the width and height of the screen as W and H, respectively, and set the grid width as g. Here, W and H are measured in pixels. We can select an appropriate g such that W and H are divisible by it. Then, we set the grid value of the gaze location as 1 and the others as 0. Finally, this method produces (W × H)/g² grid labels, which may cause a computationally intensive problem in practice or result in a complex model and hence affect the accuracy of the model output.
Considering the drawbacks of the regular gridding annotation method, we propose a new annotation method to control the number of labels on the gaze image: it divides the screen into bins instead of grids. Hence, the amount of calculation is reduced significantly, and the predicted gaze location is labeled separately in the horizontal and vertical directions. We divide the screen into horizontal and vertical bins, respectively, and set the bin length equally as b, see Fig. 3. Additionally, we select an appropriate bin length such that W and H are divisible by it. This annotation method produces (W + H)/b bins. In this annotation method, all bins to the left of the gaze location in the horizontal direction are labeled as 1. Similarly, all bins above the gaze location in the vertical direction are labeled as 1. The rest of the bins are labeled as 0 in the horizontal and vertical directions. Compared with the proposed annotation method, the gridding annotation method yields an annotation vector of larger dimension, producing a larger number of annotation values. The proposed method yields an annotation vector of smaller dimension but requires more processing steps to obtain the predicted gaze location. With the proposed annotation method, we essentially convert a multi-class problem into a binary-class problem over multiple bins. We herein focus on the new annotation method.

For the training of the neural network, we only use the face and eye patches, so we first pre-process the frames, as illustrated in Fig. 4. For each image, we detected the two eyes and the face on the frames employing Haar Cascade Classifiers [30] in the Open Source Computer Vision Library (OpenCV) [31] and obtained the position of the top-left corner, width and height of the two eye boxes and the face box. After detection, we cropped the images of the left eye, right eye and face from the frames according to the box positions and sizes.
We resized the box images to the same scale for the convenience of input.
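The bin annotation scheme above can be sketched as follows (the function name and the bin length b = 20 px are our assumptions; b = 20 reproduces the 111 horizontal and 54 vertical bins used in our experiments for the 2220 × 1080 screen). The final two lines compare the label counts of the two annotation methods for the same cell size:

```python
import numpy as np

# Sketch of the proposed bin annotation; the function name and bin length b
# are our assumptions (b = 20 px gives 111 horizontal and 54 vertical bins
# for a 2220 x 1080 screen).
def bin_labels(gaze_xy, W=2220, H=1080, b=20):
    x, y = gaze_xy
    M, N = W // b, H // b                  # numbers of horizontal / vertical bins
    h = np.zeros(M, dtype=int)
    v = np.zeros(N, dtype=int)
    h[: x // b] = 1                        # bins left of the gaze point -> 1
    v[: y // b] = 1                        # bins above the gaze point -> 1
    return h, v

h, v = bin_labels((1110, 540))             # a gaze point near the screen center
# Label-count comparison for the same cell size (g = b = 20):
n_bin_labels = (2220 + 1080) // 20         # (W + H)/b bin labels
n_grid_labels = (2220 * 1080) // 20**2     # (W x H)/g^2 grid labels
```

The comparison illustrates why the bin annotation stays compact: 165 bin labels versus 5994 grid labels for the same resolution.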
The next two subsections introduce the proposed two-stage procedure of gaze estimation, consisting of using the neural network to output the estimation in two directions and refining the estimation with logistic regression.

Stage-I: process gaze image using neural network
In this subsection, the neural network model was used to extract the features of the eyes and face and to output coordinate probability vectors. Frames of the face, left eye and right eye were cropped from the original pictures. Subsequently, these three parts were used as input to the corresponding convolutional layers to extract high-level features. The first two convolutional layers of the left eye and right eye shared the same weights. After two convolutional layers and pooling layers, the left eye features and right eye features were independently input into three convolutional layers, respectively. The three convolutional layers were followed by a common fully connected layer. The frame of the face was input to five convolutional layers followed by two fully connected layers. Finally, all the features above were joined and then fed into two fully connected layers, as shown in Fig. 5. The inputs contained left eye, right eye and face patches of size 227 × 227. The sizes of the last two fully connected layers were 128 and M + N. Each convolutional layer additionally contained batch normalization [32], which permits much higher learning rates and is insensitive to initialization, and rectified linear units [33], and hence could better learn features for face verification and preserve relative intensity information through multiple layers of feature detectors. Finally, with a sigmoid function, we obtained the vectors output by the neural network. We further divided them into an M × 1 dimensional vector and an N × 1 dimensional vector, corresponding to the horizontal and vertical directions, respectively, where M and N are the numbers of horizontal and vertical bins. Each vector, being the output of a sigmoid activation function, represents the probability that the value of each bin is predicted to be 1.
Here, we obtain probability vectors instead of predicted coordinates. It is noteworthy that a traditional neural network predicting the coordinates directly outputs a value ranging from 0 to the screen width or height, which is a wide output range compared with that of the proposed model; our method converts the prediction task into a binary classification problem, hence avoiding such problems.
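A minimal PyTorch sketch of the stage-I output head described above (the joined-feature dimension of 512 is an illustrative assumption; 128 and M + N are the reported sizes of the last two fully connected layers):

```python
import torch
import torch.nn as nn

# Minimal sketch of the stage-I output head; the joined-feature dimension
# (512) is an illustrative assumption, while 128 and M + N are the reported
# sizes of the last two fully connected layers.
class GazeHead(nn.Module):
    def __init__(self, feat_dim=512, M=111, N=54):
        super().__init__()
        self.M, self.N = M, N
        self.fc = nn.Sequential(
            nn.Linear(feat_dim, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, M + N),         # one output per bin
        )

    def forward(self, joined_features):
        p = torch.sigmoid(self.fc(joined_features))      # per-bin probabilities
        return p[:, : self.M], p[:, self.M :]            # horizontal, vertical

head = GazeHead()
p_h, p_v = head(torch.randn(2, 512))       # batch of 2 joined feature vectors
```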

Stage-II: fitting network output using modified logistic regression
Based on the neural network output, we used an additional logistic regression to refine the estimation results in the second stage. Considering that the annotation data of an image is a vector composed of a subsequence of 1s followed by a subsequence of 0s, we applied logistic regression with a modified sigmoid function to process the output from stage-I, which corresponds to the activation function of the neural network output layer, i.e. the sigmoid function. The same model was separately applied to the output vector in the horizontal and vertical directions. The purpose was to obtain the boundary position of the two subsequences such that the logistic regression model could be used for classification to obtain the target position according to the classified data. We used the vector data of the neural network output from one direction, such as the horizontal direction, as input to fit the sigmoid function, see Fig. 6. The fitting data were a vector of values ranked from large to small because the labeled data are composed of a subsequence of 1s followed by a subsequence of 0s. According to the characteristics of the fitting data, we modify the sigmoid function as

f(x) = \frac{1}{1 + e^{a(x - \bar{x})}}, \qquad \bar{x} = \frac{\sum_{i=1}^{K} x_i p_i}{\sum_{i=1}^{K} p_i}, \qquad (1)

where \bar{x} is the mathematical expectation of the output variable from the neural network, p_i represents the probability variable of the neural network output, x_i is the index of the i-th bin and K is the number of variables in one direction. Through \bar{x}, we process the variable values by mean normalization. Meanwhile, to match the fitting data with the characteristics of the curve, the sign of the variable x in the function was flipped (e^{a(x-\bar{x})} instead of e^{-a(x-\bar{x})}) so that the curve is mirrored left to right. In equation (1), a is a tuning parameter that affects the steepness of the fitted curve but not the output results; therefore, we can set it freely according to our needs.
The method above was applied to the output vector in the horizontal and vertical directions, respectively. The process of finding the target points in the horizontal direction is shown in Algorithm 1. We set the median value between the maximum and minimum values of the value range as the threshold for classification with logistic regression. We iterated through each component of the probability vector and compared it with the threshold until a point breaking through the threshold was obtained. We define this point as the mutation point, which gives the target point. The process for finding the target points in the vertical direction is similar.

Revivification of target point

In Fig. 7, the relative points are mapped to the points on the screen. We obtain the actual position coordinates by revivifying the relative position point and calculate the predicted gaze position coordinates using the following equations:

T_x = \left(B_x - \frac{1}{2}\right)\frac{W}{M}, \qquad T_y = \left(B_y - \frac{1}{2}\right)\frac{H}{N}, \qquad (2)

where T_x and T_y are the predicted gaze positions in the horizontal and vertical directions, respectively, and B_x and B_y are the indices of the mutation points in the two directions. Recall that M and N represent the numbers of bins in the horizontal and vertical directions, respectively, and W and H represent the width and height of the screen.
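A minimal sketch of this stage-II decoding (the function name and threshold handling are ours, and the curve-fitting refinement with the modified sigmoid is omitted): scan the probability vector for the mutation point and map the center of the corresponding bin back to a screen coordinate.

```python
# Sketch of the stage-II decoding; the function name and threshold handling
# are ours, and the curve-fitting refinement with the modified sigmoid is
# omitted for brevity.
def decode_direction(p, screen_len, threshold=0.5):
    K = len(p)                             # number of bins in this direction
    bin_len = screen_len / K
    mutation = K - 1                       # fallback: gaze in the last bin
    for i, prob in enumerate(p):
        if prob < threshold:               # first component below the threshold
            mutation = i
            break
    return (mutation + 0.5) * bin_len      # center of the mutation bin

# 55 bins predicted as 1 followed by 0s, on a 2220 px wide screen (111 bins):
p_h = [0.9] * 55 + [0.1] * 56
T_x = decode_direction(p_h, screen_len=2220)
```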

EXPERIMENT
In this section, we introduce the datasets on which our method was evaluated and detail the settings of our experiment. Then, we present our results using the two annotation methods. Besides, we evaluate the performance of the proposed method on our own dataset and the GazeCapture dataset, and we analyze the prediction errors of our method.

Setup
For the training data of the gaze model, we divided the gaze dataset into training and test sets. The participants were divided into 11 groups: six groups for the training set, two groups for validation and three groups for the test set, as listed in Table 1. The participants were enrolled into groups in sequence so as to balance the gender proportion and the proportion of participants wearing glasses in each group. Finally, we obtained 50 participants in each group. Also, for comparison, we evaluated our method on the GazeCapture dataset. This dataset was captured with iPhones and iPads and contains 1 490 959 frames from 1471 subjects. The dataset was divided into training, validation and test sets consisting of 1271, 50 and 150 subjects, respectively.
The network input contained three parts: the left eye frames, right eye frames and face frames. The input frame size was 227 × 227. With our gaze dataset, we set the bin numbers of the horizontal and vertical directions as 111 and 54, respectively. With the GazeCapture dataset, we set the bin number as 100 in both the horizontal and vertical directions. The model was implemented using Python, the PyTorch programming framework and the compute unified device architecture (CUDA). In the training procedure, the initial learning rate was 0.001, and a stochastic gradient descent optimizer with a momentum of 0.9 and a weight decay of 0.0001 was used. The model training ran on Ubuntu 18.04 with a 3.2 GHz i7-8700 CPU (12 threads) and 32 GB memory, additionally with two GPUs: an NVIDIA GeForce RTX 2070 and a Tesla K40.
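The reported training hyperparameters can be set up in PyTorch as follows (the two-layer stand-in model and the binary cross-entropy loss are our assumptions; the text does not name the loss function, but BCE is a natural fit for per-bin 0/1 labels):

```python
import torch

# Reported hyperparameters: lr = 0.001, momentum = 0.9, weight decay = 0.0001.
# The two-layer stand-in model and the BCE loss are our assumptions, not the
# authors' exact architecture or objective.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 111 + 54),        # M + N outputs (one logit per bin)
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0001)
loss_fn = torch.nn.BCEWithLogitsLoss()     # applies the sigmoid internally

features = torch.randn(4, 512)             # a dummy batch of joined features
targets = torch.zeros(4, 165)
targets[:, :80] = 1.0                      # bins left of / above the gaze point
loss = loss_fn(model(features), targets)
loss.backward()
optimizer.step()
```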

Results and comparison
We evaluate the accuracy of the proposed method via the error from the ground truth location to the predicted gaze location on the screen. We pre-set the fixed points on the mobile device screen and regard the point locations as the ground truth when participants watched the points. We compute the error using the following formula:

Error = \sqrt{(T_x - x_0)^2 + (T_y - y_0)^2},

where (x_0, y_0) represents the ground truth location coordinates, and (T_x, T_y) is the predicted gaze location. To assess the proposed annotation method, we compared its performance with the gridding annotation method. For the gridding annotation method, the logistic regression in stage-II is not required after the neural network, so we can obtain the target point through the neural network directly. We evaluate the maximum, minimum and mean errors of the models using the two annotation methods. The results were evaluated in centimeters. Note: GazeMP represents our collected dataset. M1 and M2 represent models with the gridding annotation method and the proposed annotation method, respectively. The max. error and min. error were evaluated by groups. The results of M3 are reported in the GazeCapture paper [24].
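A sketch of the error computation (the pixel-to-centimeter conversion factor is a device-dependent assumption used only for illustration):

```python
import math

# Euclidean error between ground-truth and predicted gaze locations. The
# pixel-to-centimeter factor is a device-dependent assumption, not a value
# reported in the text.
def gaze_error_cm(ground_truth_px, predicted_px, cm_per_px=0.006):
    dx = predicted_px[0] - ground_truth_px[0]
    dy = predicted_px[1] - ground_truth_px[1]
    return math.hypot(dx, dy) * cm_per_px  # Euclidean distance, in centimeters

err = gaze_error_cm((1110, 540), (1310, 540))   # a 200 px horizontal miss
```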
Specifically, we denote the models with the gridding and the proposed annotation methods as M1 and M2, respectively. In Table 2, the first row shows the results of M1 and M2 on our collected dataset. We evaluated the maximum error and minimum error by participant group: they show the largest and smallest group mean errors over all groups. The mean error of M2 was smaller than that of M1. M2 also yields better prediction results in terms of the maximum error and minimum error. The neural network in M2 has fewer output quantities than M1, which reduces the number of predicted targets. In addition, M2 used the sigmoid function, under which a high relevance exists between the variables, while M1 did not. Hence, M2 improves the prediction accuracy. We checked the participant groups with large errors in our dataset and found that there were more participants wearing glasses in the groups with large prediction errors. It may be necessary to balance the proportion of participants wearing glasses in the training and test datasets and across all participant groups.
We also evaluated the two annotation methods on the GazeCapture dataset [34] for comparison. The results are shown in the second row of Table 2. The mean error of M1 was 2.42 cm, while the mean error of M2 was 2.23 cm, which is better than M1. The maximum error and minimum error of M2 were also better than those of M1. However, neither M1 nor M2 achieved ideal performance on the GazeCapture dataset compared with the results reported in the GazeCapture paper [24].
For the GazeCapture data, the results using the proposed annotation method also outperform those of the gridding method on both the iPad and iPhone platforms. Specifically, the gaze prediction mean errors of M1 and M2 on the iPhone data were 2.20 and 2.09 cm, respectively, much better than those on the iPad data. One possible explanation is that the gaze location in our proposed method is predicted to be located in a square area with width equal to the bin length, and hence the error is related to the bin length. Since our predicted gaze location is the center point of the square area and the bin number we set was equal on the iPad and iPhone, the bin width on the iPad is larger than that on the iPhone, which may cause the larger error on the iPad.

Prediction error analysis
The results indicate the gaze prediction errors at different locations, see Fig. 8. According to the error distribution, we observed that larger errors primarily occurred when the points were close to the margin of the screen, and the error increased with the distance to the screen center. The result at the center point was not the best of all predictions; the best predictions were obtained at points close to the center point. Observing the distribution of errors, we discovered poor gaze prediction at points close to the margin of the screen, which is related to our collected dataset: because we only collected a small number of frames of people looking at points around the margin of the screen, the gaze prediction at those points might be adversely affected.
Our gaze estimation method did not perform perfectly in terms of prediction error, and the possible explanations are as follows. Our dataset was collected on mobile phones with our designed data collecting mechanism, while the GazeCapture dataset was captured on mixed platforms, including iPhones and iPads with different screen sizes. Besides, we collected high-quality images in our dataset and designed the annotation method according to our data: when annotating the data, we adjusted the bin width according to the screen size of our collected dataset instead of setting fixed bin numbers as on the GazeCapture dataset, which may affect the prediction results. In addition, the predicted gaze point is located at the center point of a square area with width equal to the bin length, which makes our method more robust.
In the future, we will include information of facial coordinate position [35], head pose [36,37] and time sequence information, such as eye optical flow, as the inputs of a neural network to further reduce the prediction error and hence improve the prediction accuracy of the proposed model.

CONCLUSIONS
In this article, we proposed a two-stage gaze estimation method based on a neural network and logistic regression, where the neural network was used to process the input pictures and output predicted probability vectors of gaze labels, and the logistic regression was used for refinement of the prediction from the neural network. We also designed a data collecting mechanism and built our own dataset. The proposed method can be widely used on various mobile devices without additional hardware or systematic prior knowledge. We demonstrated that the two-stage gaze estimation method combined with the new annotation approach significantly improved the gaze estimation accuracy. Furthermore, by changing the annotation bins, we can adjust the bin length to the different accuracy needs of applications.

Data Availability
Currently, the data underlying this article cannot be shared publicly due to privacy concerns and the ongoing study. Once the project ends, the data will be made publicly available with ethical authorization. The original data should only be used for scientific research purposes. The implementation of the proposed method will also be made publicly available on GitHub.

Funding
National Natural Science Foundation of China (11901013, 12075011); Beijing Natural Science Foundation (1204031, 7202093); Fundamental Research Funds for the Central Universities.