Classifying Topological Charge in SU(3) Yang-Mills Theory with Machine Learning

We apply a machine learning technique for identifying the topological charge of quantum gauge configurations in four-dimensional SU(3) Yang-Mills theory. The topological charge density measured on the original and smoothed gauge configurations with and without dimensional reduction is used for inputs of the neural networks (NN) with and without convolutional layers. The gradient flow is used for the smoothing of the gauge field. We find that the topological charge determined at a large flow time can be predicted with high accuracy from the data at small flow times by the trained NN; the accuracy exceeds $99\%$ with the data at $t/a^2\le0.3$. High robustness against the change of simulation parameters is also confirmed. We find that the best performance is obtained when the spatial coordinates of the topological charge density are fully integrated out as a preprocessing, which implies that our convolutional NN does not find characteristic structures in multi-dimensional space relevant for the determination of the topological charge.


Introduction
Quantum chromodynamics (QCD) and other Yang-Mills gauge theories in four spacetime dimensions can have topologically nontrivial gauge configurations classified by the topological charge Q taking integer values. The existence of the non-trivial topology in QCD is responsible for various non-perturbative aspects of this theory, such as the U(1) problem. 1) The susceptibility of Q also provides an essential parameter relevant for the cosmic abundance of the axion dark matter. [2][3][4] The topological property of QCD and Yang-Mills theories has been studied by numerical simulations of lattice gauge theory. [5][6][7][8][9][10][11][12][13][14][15][16][17][18] Because of the discretization of spacetime, gauge configurations on the lattice are, strictly speaking, topologically trivial. However, it is known that well-separated topological sectors emerge when the continuum limit is approached. 19) Various methods for the measurement of Q of the gauge configurations on the lattice have been proposed, which are roughly classified into the fermionic and gluonic ones. In the fermionic definitions the topological charge is defined through the Atiyah-Singer index theorem, 20) while the gluonic definitions make use of the topological charge measured on a smoothed gauge field. 21,22) The values of Q measured by various methods show an approximate agreement, 14) which indicates the existence of separated topological sectors. In the lattice simulations, the measurement of the topological charge is also important for monitoring the problem of the topological freezing. 23,24) In the present study, we apply the machine learning (ML) technique for analyzing Q of gauge configurations on the lattice. The ML has been applied for various problems in computer science quite successfully, such as the image recognition, object detection, and natural language processing. 
[25][26][27][28][29][30][31][32][33][34][35][36][37][38][39][40][41][42] Recently, this technique has also been applied to problems in physics. [43][44][45][46][47][48][49][50][51][52][53][54][55][56][57] In the present study, we generate data by the numerical simulation of SU(3) Yang-Mills theory in four spacetime dimensions, and feed them into neural networks (NN). We use the convolutional NN (CNN) as well as the simple fully-connected NN (FNN), depending on the type of the input data. The NN are trained to predict the value of Q by supervised learning.
The first aim of this study is a development of an efficient algorithm for the analysis of Q with the aid of the ML. The second, and more interesting, purpose is the search for characteristic local structures in the four-dimensional space related to Q by the CNN. It is known that Yang-Mills theories have classical gauge configurations called instantons, which carry a nonzero topological charge and have a localized structure. 1) If the topological charge of the quantum gauge configurations is also carried by instanton-like local objects, the CNN would recognize and make use of them for the prediction of Q. Such an analysis of the four-dimensional quantum fields by the ML will open a new application of this technique.
In this study, we use the topological charge density measured on the original and smoothed gauge configurations as inputs of the NN. The smoothing is performed by the gradient flow. [58][59][60] We also perform the dimensional reduction to various dimensions as a preprocessing of the data before feeding them into the CNN or FNN. For the definition of Q, we use a gluonic one through the gradient flow. 59,60) We find that the NN can estimate the value of Q determined at a large flow time with high accuracy from the data obtained at small flow times. In particular, we show that the high accuracy is obtained by the multi-channel analysis of the data at different flow times. We argue that this method can reduce the numerical cost for the analysis of Q compared with the conventional method.
We also find that the accuracy of the NN does not have a statistically significant dependence on the dimension of the input data after the dimensional reduction. This result implies that the CNN fails to find characteristic features related to the topology in multi-dimensional space, i.e. either the quantum gauge configurations do not have such features, or their signals are too weak to be detected by the CNN.

Table I. Simulation parameters on the lattice: the inverse bare coupling β, the lattice size $N^4$, and the number of configurations $N_{\rm conf}$.

β      $N^4$     $N_{\rm conf}$
6.2    $16^4$    20,000
6.5    $24^4$    20,000

Organization of this paper
In this study, we perform various analyses of the topological charge Q with the use of the CNN or FNN. First, we analyze the topological charge density $q_t(x)$ in the four-dimensional (d = 4) space at a flow time t (the definitions of $q_t(x)$ and t will be given in Sec. 4). Second, we perform the dimensional reduction of the input data as a preprocessing and analyze them by the NN. The dimension is reduced to d = 0-3 by integrating out the spatial coordinates. For the analysis at d = 0 we adopt the FNN, while the data at d ≥ 1 are analyzed by the CNN.
As discussed already, we find that the resulting accuracy is insensitive to the value of d. Because the numerical cost for the supervised learning is suppressed as d becomes smaller, this means that the best performance is obtained at d = 0. Therefore, in this paper we first report this most successful result among various analyses with the ML in Sec. 6. The analyses of the multi-dimensional data will then be discussed later in Secs. 7 and 8.
Before the analyses of Q with the ML technique, we consider simple models to estimate Q without ML in Sec. 5. These models are used as benchmarks for the trained NN in Secs. 6-8, to verify whether the NN recognize nontrivial features of the data. The whole structure of this paper is summarized as follows. In the next section, we show the setup of the lattice numerical simulations. In Sec. 4, we then give a brief review of the analysis of the topology with the gradient flow. The benchmark models for the classification of Q without using the ML are discussed in Sec. 5. The application of the ML is then discussed in Secs. 6-8. We first consider the analysis of the d = 0 data by the FNN in Sec. 6. We then discuss the analysis of the four-dimensional field $q_t(x)$ by the CNN in Sec. 7. In Sec. 8, we extend the analysis to d = 1, 2, 3. The last section is devoted to discussions.

Lattice setup
Throughout this paper, we consider SU(3) Yang-Mills theory in the four-dimensional Euclidean space with periodic boundary conditions in all directions. The standard Wilson gauge action is used for generating the gauge configurations. We perform the numerical analyses at two inverse bare couplings $\beta = 6/g^2 = 6.2$ and 6.5 with the lattice volumes $16^4$ and $24^4$, respectively, as in Table I. These lattice parameters are chosen so that the lattice volumes in physical units $L^4$ are almost the same on these lattices; the lattice spacing determined in Ref. 18 shows that the difference in the lattice size L is less than 2%. The lattice size L is related to the critical temperature of the deconfinement phase transition $T_c$ as $1/L \simeq 0.63\,T_c$. 61) We generate 20,000 gauge configurations for each β, which are separated from each other by 100 Monte Carlo sweeps, where one sweep consists of one pseudo-heat-bath and five over-relaxation updates. For the discretized definition of the topological charge density on the lattice, we use the operator constructed from the clover representation of the field strength. The gradient flow is used for the smoothing of the gauge field.
To estimate the statistical error of an observable on the lattice, we use the jackknife analysis with the binsize 100. We have numerically checked that the auto-correlation length of the topological charge is about 100 and 1900 sweeps for β = 6.2 and 6.5, respectively. The binsize of the jackknife analysis including 100 × 100 sweeps is sufficiently larger than the auto-correlation length.
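The binned jackknife estimate described above can be sketched as follows. This is a minimal illustration in plain Python, not the analysis code used for this study; the function name `jackknife_error` and the choice to drop an incomplete final bin are assumptions made here.

```python
import math

def jackknife_error(values, bin_size):
    """Mean and jackknife error of `values`, using bins of `bin_size` consecutive samples."""
    n_bins = len(values) // bin_size
    if n_bins < 2:
        raise ValueError("need at least two complete bins")
    used = values[: n_bins * bin_size]        # assumption: drop an incomplete tail bin
    total = sum(used)
    bin_sums = [sum(used[i * bin_size:(i + 1) * bin_size]) for i in range(n_bins)]
    # Leave-one-bin-out means.
    loo = [(total - s) / (len(used) - bin_size) for s in bin_sums]
    loo_mean = sum(loo) / n_bins
    variance = (n_bins - 1) / n_bins * sum((x - loo_mean) ** 2 for x in loo)
    return total / len(used), math.sqrt(variance)
```

For an observable such as the variance of Q, one would pass the per-configuration values of Q squared, for example `jackknife_error([q * q for q in q_values], 100)` with a hypothetical list `q_values` ordered in Monte Carlo time.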

Topological charge
In the continuum Yang-Mills theory in the four-dimensional Euclidean space, the topological charge is defined by

$$Q = \int_V d^4x\, q(x), \tag{1}$$

$$q(x) = -\frac{1}{32\pi^2}\,\epsilon_{\mu\nu\rho\sigma}\,\mathrm{tr}\left[F_{\mu\nu}(x)F_{\rho\sigma}(x)\right], \tag{2}$$

where V is the four-volume and $F_{\mu\nu}(x)$ is the field strength. The quantity $q(x)$ in Eq. (2) is called the topological-charge density, with the coordinate x in Euclidean space. In lattice gauge theory, Eq. (1) calculated on a gauge configuration with a discretized definition of Eq. (2) does not take integer values, but is distributed continuously. To obtain discrete values, one may apply a smoothing of the gauge field before the measurement of $q(x)$.
In the present study, we use the gradient flow 59,60) for the smoothing. The gradient flow is a continuous transformation of the gauge field characterized by a parameter t called the flow time, which has the dimension of inverse mass squared. The gauge field at a flow time t is a smoothed field with the mean-square smoothing radius $\sqrt{8t}$. 59) In the following, we denote the topological charge density obtained at t as $q_t(x)$, and its four-dimensional integral as

$$Q(t) = \int_V d^4x\, q_t(x). \tag{3}$$

Shown in Fig. 1 is the t dependence of Q(t) calculated on 200 gauge configurations at β = 6.2 and 6.5. The horizontal axis shows the dimensionless flow time $t/a^2$ with the lattice spacing a. One finds that the values of Q(t) approach discrete integer values as t becomes larger. In Fig. 2, we show the distribution of Q(t) for several values of $t/a^2$ by the histogram at β = 6.5. At t = 0, the values of Q(t) are distributed continuously around the origin. As t becomes larger, the distribution converges on discrete integer values. For $t/a^2 > 1.0$, the distribution is almost completely separated around integer values. In this range of t, one can classify the gauge configurations into different topological sectors labeled by the integer topological charge Q defined, for example, by the nearest integer to Q(t). It is known that the value of Q defined in this way approximately agrees with the topological charge obtained through other definitions, and the agreement is better on finer lattices. 14)

From Figs. 1 and 2, one finds that the distribution of Q(t) deviates from integer values toward the origin. This deviation becomes smaller as t becomes larger. From Fig. 1, one also finds that Q(t) on some gauge configurations has a "flipping" between different topological sectors; after Q(t) shows a convergence to an integer value, it sometimes jumps to another integer. 14) As this behavior decreases on the finer lattice, the flipping would be regarded as a lattice artifact arising from the ambiguity of the topological sectors on the discretized spacetime.
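To give a feeling for the scales involved, the short sketch below evaluates the smoothing radius $\sqrt{8t}$ in units of the lattice spacing for a few flow times. The helper name `smoothing_radius` is introduced here for illustration only. At $t/a^2 = 0.3$ the radius is about 1.5 lattice spacings, whereas at $t/a^2 = 4.0$ it is about 5.7.

```python
import math

def smoothing_radius(t_over_a2):
    """Mean-square smoothing radius sqrt(8t), in units of the lattice spacing a."""
    return math.sqrt(8.0 * t_over_a2)

for t in (0.1, 0.2, 0.3, 1.0, 4.0):
    print(f"t/a^2 = {t:>4}: sqrt(8t)/a = {smoothing_radius(t):.2f}")
```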
In the following, we use $t/a^2 = 4.0$ for the definition of the topological charge Q as

$$Q = \mathrm{round}\left[Q(t)\right]_{t/a^2 = 4.0}, \tag{4}$$

where round(x) means rounding off to the nearest integer. As indicated by Fig. 1, the value of Q hardly changes with the variation of $t/a^2$ in the range $4 < t/a^2 < 12$. In Table II, we show the number of gauge configurations classified into each topological sector through this definition. The variance of this distribution, $\langle Q^2 \rangle$, is shown in the far right column. In Fig. 2, the distributions of Q(t) in individual topological sectors are shown by the colored histograms.

Benchmark models
In this study, we analyze $q_t(x)$ or Q(t) at small values of t by the ML technique. Here, t used for the input has to be chosen small enough that a simple estimate of Q like Eq. (4) is not possible. In this section, before the main analysis with the ML technique, we discuss the accuracy obtained only from Q(t) without the ML. These analyses serve as benchmarks for evaluating the genuine benefit of the ML.
Throughout this study, as the performance metric of a model for an estimate of Q we use the accuracy defined by

$$P = \frac{\text{number of correct answers}}{\text{number of total data}}. \tag{5}$$
Because the numbers of gauge configurations in different topological sectors differ significantly, as in Table II, Eq. (5) is not necessarily a good performance metric. In particular, the topological sector with Q = 0 has the largest number, and a model which estimates Q = 0 for all configurations obtains the accuracy $P \simeq 0.37$ (0.41) for β = 6.2 (6.5), although such a model is, of course, meaningless. One has to keep in mind this possible problem of Eq. (5). In Sec. 7, we also use the recall of each topological sector to complement Eq. (5).

[Fig. 3. Flow time t dependence of the accuracies $P_{\rm naive}$ and $P_{\rm imp}$ obtained by the models Eqs. (6) and (7), respectively. The dotted lines show the accuracy of the model that answers Q = 0 for all configurations.]

To make an estimate of Q from Q(t), we consider two simple models. The first model is just rounding off Q(t) as

$$Q_{\rm naive} = \mathrm{round}\left[Q(t)\right]. \tag{6}$$

The accuracy obtained by this model, $P_{\rm naive}$, as a function of t is shown in Fig. 3 by the dashed lines. The figure shows that the accuracy of Eq. (6) approaches 100% as $t/a^2$ becomes larger. At $t/a^2 = 4.0$, Eq. (6) is equivalent to Eq. (4) and the accuracy becomes 100% by definition.

[Table III. Columns: $t/a^2$, β = 6.2, β = 6.5. Only the beginning of the first row survives: $t/a^2 = 0$, 0.273(3).]
The model Eq. (6) can be improved with a simple modification. In Fig. 2, one sees that the distribution in each topological sector is shifted toward the origin from Q. This behavior suggests that Eq. (6) can be improved by multiplying Q(t) by a constant before rounding off as

$$Q_{\rm imp} = \mathrm{round}\left[c\,Q(t)\right], \tag{7}$$

where c is a parameter determined so as to maximize the accuracy in the range c > 1 for each t. In Fig. 4, we show the c dependence of the accuracy of Eq. (7) for several values of $t/a^2$. The figure shows that the accuracy has a maximum at c > 1 for some $t/a^2$. In this case, the model Eq. (7) has a better accuracy than Eq. (6) by tuning the parameter c. We denote the optimal accuracy of Eq. (7) as $P_{\rm imp}$. In Fig. 3, the $t/a^2$ dependence of $P_{\rm imp}$ is shown by the solid lines. The numerical values of $P_{\rm imp}$ are listed in Table III for some $t/a^2$. Figure 3 shows that a clear improvement of the accuracy by the single-parameter tuning is observed at $t/a^2 \gtrsim 0.2$ and 0.3 for β = 6.2 and 6.5, respectively. We note that $P_{\rm naive} = P_{\rm imp} = 1$ at $t/a^2 = 4.0$ by definition. Table III also shows that $P_{\rm imp}$ is almost unity at $t/a^2 = 2.0$ and 10.0, which shows that the value of Q defined by Eq. (4) hardly changes with the variation of $t/a^2$ in the range $t/a^2 \gtrsim 2.0$. As $P_{\rm imp}$ is already close to unity at $t/a^2 = 0.5$, it is difficult to obtain a nontrivial gain of the accuracy from the analysis of $q_t(x)$ by the NN for $t/a^2 \ge 0.5$. In the following, therefore, we feed the data at $t/a^2 < 0.5$ to the NN. In the following sections, we use $P_{\rm imp}$ as a benchmark of the accuracy obtained by the NN models.
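The two benchmark models and the single-parameter tuning can be sketched in a few lines. The sketch below is an illustration only: the data are synthetic (integer Q with Q(t) shrunk toward the origin and smeared by Gaussian noise), the grid of c values is an assumption, and the function names are introduced here. The grid starts at c = 1.0, which reproduces the naive model, so the tuned accuracy on this grid can never fall below the naive one. Note that Python's `round` resolves exact half-integers to the nearest even integer, which is irrelevant for generic floating-point inputs.

```python
import random

def accuracy(pred, true):
    """Fraction of configurations whose predicted Q equals the reference Q, Eq. (5)."""
    return sum(p == q for p, q in zip(pred, true)) / len(true)

def naive_model(q_t):
    """Round Q(t) to the nearest integer, as in Eq. (6)."""
    return [round(x) for x in q_t]

def improved_model(q_t, c):
    """Rescale Q(t) by the constant c before rounding, as in Eq. (7)."""
    return [round(c * x) for x in q_t]

def tune_c(q_t, true, c_grid):
    """Return the grid value of c with the highest accuracy, and that accuracy."""
    best_c = max(c_grid, key=lambda c: accuracy(improved_model(q_t, c), true))
    return best_c, accuracy(improved_model(q_t, best_c), true)

# Synthetic stand-in for lattice data (an assumption, not the data of this study).
rng = random.Random(0)
true_q = [rng.choice([-2, -1, -1, 0, 0, 0, 1, 1, 2]) for _ in range(2000)]
q_t = [0.75 * q + rng.gauss(0.0, 0.15) for q in true_q]

c_grid = [1.0 + 0.01 * k for k in range(101)]   # scan 1.0 <= c <= 2.0
p_naive = accuracy(naive_model(q_t), true_q)
best_c, p_imp = tune_c(q_t, true_q, c_grid)
print(f"P_naive = {p_naive:.3f}, P_imp = {p_imp:.3f} at c = {best_c:.2f}")
```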

Learning Q(t)
From this section we employ the ML technique for the analysis of the lattice data. As discussed in Secs. 1 and 2, among various analyses we found that the most successful result is obtained when a set of the values of Q(t) at several t is analyzed by the FNN. In this section, we discuss this result. The analysis of the multi-dimensional data by the CNN will be reported in later sections.

Table IV. Structure of the FNN.

layer           output size    activation
input           3              -
full connect    5              logistic
full connect    1              -

Setting
In this section we employ a simple FNN model without convolutional layers. The FNN accepts three values of Q(t) at different t as inputs, and is trained to predict Q by the supervised learning. The structure of the FNN is shown in Table IV. The FNN has only one hidden layer with five units that are fully connected with the input and output layers. We use the logistic (sigmoid) function for the activation function of the hidden layer. Although we have also tried the ReLU for the activation function, we found that the logistic function gives a better result. We employ the regression model, i.e. the output of the FNN is given by a single real number. The final prediction of Q is then obtained by rounding off the output to the nearest integer.
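The forward pass of such a network is small enough to write out explicitly. The sketch below uses plain Python rather than the Chainer framework; the parameter layout, the random initialization, and the function names are assumptions made for illustration. It only shows how three values of Q(t) are mapped to a real number and then rounded to an integer.

```python
import math
import random

def logistic(z):
    """Numerically safe logistic (sigmoid) function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def init_params(rng, n_in=3, n_hidden=5):
    """Random initial parameters; the initialization scheme is an assumption."""
    W1 = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
    b2 = 0.0
    return W1, b1, W2, b2

def fnn_forward(inputs, params):
    """3 inputs -> 5 logistic hidden units -> 1 linear output (regression)."""
    W1, b1, W2, b2 = params
    hidden = [logistic(sum(w * x for w, x in zip(row, inputs)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(W2, hidden)) + b2

def predict_q(inputs, params):
    """Final prediction: round the real-valued output to the nearest integer."""
    return round(fnn_forward(inputs, params))

def count_params(params):
    W1, b1, W2, b2 = params
    return sum(len(row) for row in W1) + len(b1) + len(W2) + 1
```

With three inputs and five hidden units, the network has 3 × 5 + 5 + 5 + 1 = 26 trainable parameters, which is what `count_params` returns.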
For the supervised learning, we randomly divide 20,000 gauge configurations into 10,000 and two 5,000 sub-groups. We use 10,000 data for the training, and one of the 5,000 data sets for the validation analysis. The last 5,000 data is used for the evaluation of the accuracy of the trained NN. The supervised learning is repeated 10 times with different divisions of the configurations, and the uncertainty of the accuracy is estimated from the variance.
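The data division and the repetition over different divisions can be sketched as below. The seeds, the function names, and the use of the sample standard deviation as the measure of spread are assumptions; the text only states that the uncertainty is estimated from the variance over 10 repetitions.

```python
import random

def split_indices(n_total, n_train, n_valid, seed):
    """Randomly divide configuration indices into training, validation and test sets."""
    idx = list(range(n_total))
    random.Random(seed).shuffle(idx)
    return (idx[:n_train],
            idx[n_train:n_train + n_valid],
            idx[n_train + n_valid:])

def mean_and_spread(accuracies):
    """Mean and sample standard deviation of the accuracies from repeated trainings."""
    m = sum(accuracies) / len(accuracies)
    var = sum((a - m) ** 2 for a in accuracies) / (len(accuracies) - 1)
    return m, var ** 0.5

# Ten different divisions of 20,000 configurations into 10,000 / 5,000 / 5,000.
splits = [split_indices(20000, 10000, 5000, seed) for seed in range(10)]
```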
We use the mean-squared error for the loss function, and minimize it through the updates of the NN parameters by Adam 62) with the default setting. The update is repeated for 3,000 epochs with the batch size 16. The optimized parameter set of the FNN is then determined as the one giving the lowest value of the loss function on the validation data.
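The training loop described here (minibatch updates with Adam on a mean-squared-error loss, keeping the parameters with the lowest validation loss) can be sketched as follows. To keep the example short, a toy linear model y = w x + b stands in for the FNN. The hyperparameters in `adam_step` are the values proposed in the original Adam paper, assumed here to correspond to the default setting; all function names are introduced for illustration.

```python
import random

def adam_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `state` holds the step count and the two moment estimates."""
    state["t"] += 1
    t = state["t"]
    updated = []
    for i, (p, g) in enumerate(zip(params, grads)):
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m"][i] / (1 - beta1 ** t)      # bias-corrected first moment
        v_hat = state["v"][i] / (1 - beta2 ** t)      # bias-corrected second moment
        updated.append(p - lr * m_hat / (v_hat ** 0.5 + eps))
    return updated

def mse(params, data):
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def mse_grad(params, batch):
    w, b = params
    n = len(batch)
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2 * (w * x + b - y) for x, y in batch) / n
    return [gw, gb]

def train(train_set, valid_set, epochs, batch_size, lr=1e-3, seed=0):
    """Minibatch training; returns the parameters with the lowest validation loss."""
    rng = random.Random(seed)
    params = [0.0, 0.0]
    state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
    best_params, best_loss = list(params), mse(params, valid_set)
    data = list(train_set)
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            params = adam_step(params, mse_grad(params, data[i:i + batch_size]), state, lr=lr)
        loss = mse(params, valid_set)
        if loss < best_loss:              # keep the snapshot with the lowest validation loss
            best_params, best_loss = list(params), loss
    return best_params, best_loss
```

Selecting the snapshot by validation loss rather than taking the final epoch guards against overfitting late in a long run, which matters when the training is repeated for thousands of epochs.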
The FNN is implemented by the Chainer framework. 63) The training of the FNN in this section has been carried out as a single-core job on a XEON processor (Xeon E5-2698-v3). It takes about 40 minutes for a single training on this environment.

Result
Shown in Table V are the accuracies obtained by the trained FNN for various choices of the input data. The left columns show the set of three flow times $t/a^2$ at which Q(t) is evaluated for the input of the FNN. In the upper eight rows we show the results with the input flow times $t/a^2 = (t_{\rm max}, t_{\rm max} - 0.05, t_{\rm max} - 0.1)$. The table shows that the accuracy is improved as $t_{\rm max}$ becomes larger. By comparing this result with Table III one finds that the accuracy obtained by the FNN is significantly higher than $P_{\rm imp}$ at $t/a^2 = t_{\rm max}$. In particular, the accuracy at $t_{\rm max} = 0.3$, i.e. $t/a^2 = (0.3, 0.25, 0.2)$, shown in bold is as high as 99% for β = 6.5, while the benchmark model Eq. (7) gives $P_{\rm imp} \simeq 0.71$. This result shows that the prediction of Q from the numerical data at $t/a^2 \le 0.3$ is remarkably improved with the aid of the ML technique. The accuracy becomes higher as $t_{\rm max}$ becomes larger, but the improvement from $P_{\rm imp}$ is limited for much larger $t_{\rm max}$ because $P_{\rm imp}$ is already close to unity. The same result is obtained for β = 6.2, although the accuracy is slightly lower than for β = 6.5.
In Fig. 5 we show the confusion matrix, which gives the numbers of configurations with the true and predicted values of Q for the input $t/a^2 = (0.3, 0.25, 0.2)$ at β = 6.2 and 6.5. From the figure one finds that the deviation of the output of the FNN from the true value is at most ±1.
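A confusion matrix of this kind is simply a table of counts indexed by the pair (true Q, predicted Q). A minimal sketch, with hypothetical function names and toy inputs, is:

```python
from collections import Counter

def confusion_matrix(true, pred):
    """Counts of configurations for each (true Q, predicted Q) pair."""
    return Counter(zip(true, pred))

def max_abs_deviation(true, pred):
    """Largest |true Q - predicted Q| over all configurations."""
    return max(abs(t - p) for t, p in zip(true, pred))

# Toy example (not data from this study).
true_q = [0, 1, 1, -1, 2, 0]
pred_q = [0, 1, 0, -1, 2, 0]
cm = confusion_matrix(true_q, pred_q)
for (t, p), count in sorted(cm.items()):
    print(f"true Q = {t:+d}, predicted Q = {p:+d}: {count}")
```

The statement that the deviation is at most ±1 corresponds to `max_abs_deviation` returning 1, i.e. all nonzero entries lie on the diagonal or directly next to it.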
The high accuracy obtained in this analysis is not surprising qualitatively. As shown in Fig. 1, on many configurations the behavior of Q(t) is monotonic at $0.2 \le t/a^2 \le 0.3$. Therefore, the value at large t can be estimated easily, for example, by the human eye, for almost all configurations. It is reasonable to interpret that the FNN learns this behavior. We, however, remark that the accuracy of 99% obtained by the trained FNN is still non-trivial. We have tried to find a simple function to predict Q from three values of Q(t) at $t/a^2 = (0.3, 0.25, 0.2)$, and also performed blind tests by human beings. These trials have been able to obtain 95% accuracy easily, but have failed in attaining 99%; see Appendix for more discussion on this point. These results suggest that the ML finds non-trivial features in the behavior of Q(t).
In the lower three rows of Table V, we show the accuracies of the trained FNN with the input flow times $t/a^2 = (t_{\rm max}, t_{\rm max} - 0.1, t_{\rm max} - 0.2)$ for several values of $t_{\rm max}$. The accuracies in these cases are slightly lower than the result with $t/a^2 = (t_{\rm max}, t_{\rm max} - 0.05, t_{\rm max} - 0.1)$ at the same $t_{\rm max}$. We have also tested FNN models analyzing four values of Q(t). However, the accuracy does not exceed that of the case with three input data with the same maximum $t/a^2$. We have also tested FNN models with a more complex structure, for example, with multiple hidden layers. However, a statistically significant improvement of the accuracy was not observed either.
In the conventional analysis of Q with the gradient flow discussed in Sec. 4, one must use the value of Q(t) at a large flow time at which the distribution of Q(t) is well localized. This means that the gradient flow equation has to be solved numerically 59) up to the large flow time to obtain Q. Moreover, concerning the continuum limit a → 0, it is suggested that the flow time has to be fixed in physical units when a is varied. 18) This means that the flow time in lattice units, $t/a^2$, becomes large and the numerical cost for solving the flow equation increases as the continuum limit is approached. On the other hand, our analysis can estimate Q quite successfully only with the data at $t/a^2 \le 0.3$. This means that the numerical cost for the evaluation of Q can be reduced drastically with the aid of the FNN.

Table V shows that better accuracy is obtained on the finer lattice (larger β). From Fig. 1, it is suggested that this tendency comes from the reduction of the "flipping" of Q(t) on the finer lattice, as the non-monotonic flipping makes the prediction of Q from Q(t) at small $t/a^2$ difficult. This effect is also suggested by Fig. 3, as $P_{\rm naive}$ and $P_{\rm imp}$ at β = 6.2 are lower than those at β = 6.5. We note that this lattice spacing dependence hardly changes even if the value of t used to determine Q in Eq. (4) is scaled in physical units, as long as t is sufficiently large. Provided that the flipping of Q(t) comes from the lattice artifact related to the ambiguity of the topological sectors on the discretized spacetime, it is conjectured that the imperfect accuracy of the FNN is to a large extent attributed to this lattice artifact. Then, the imperfect accuracy of the FNN at finite a is inevitable, and the accuracy should become better as the lattice spacing becomes finer. Therefore, it is conjectured that the systematic uncertainty arising from the imperfect accuracy of the FNN is suppressed in the continuum extrapolation.

Susceptibility
Next, we consider the variance of the topological charge $\langle Q^2 \rangle$, which is related to the topological susceptibility as $\chi_Q = \langle Q^2 \rangle / V$. From the output of the FNN with the input flow times $t/a^2 = (0.3, 0.25, 0.2)$, the variance of Q is calculated for each β,

[Equation giving $\langle Q^2 \rangle$ for each β with its two errors.]

where the first and second errors represent the statistical error obtained by the jackknife analysis and the uncertainty of the FNN model estimated from 10 different trainings, respectively. These values agree well with those shown in Table II.

Reduction of training data
So far, we have performed the training of the FNN with the number of training data $N_{\rm train}$ = 10,000. Now we consider the training with much smaller $N_{\rm train}$. Shown in Table VI is the accuracy of the trained FNN with various $N_{\rm train}$ with the input flow times $t/a^2 = (0.3, 0.25, 0.2)$. The structure of the FNN is the same as before. From the table, one finds that the FNN is successfully trained even with $N_{\rm train}$ = 500.
This result shows that the cost for the preparation of the training data can be reduced. The reduction of $N_{\rm train}$ also reduces the numerical cost for the training. With $N_{\rm train}$ = 10,000, the training of the FNN requires about 40 minutes on a single core of a Xeon processor, while only 5.5 minutes is needed with $N_{\rm train}$ = 500 in the same environment.

Robustness
Next, we consider the analysis of the data with different β from the one used for the training. In Table VII, we show the accuracy obtained with various combinations of the β values used for the training and the analysis with the input flow times $t/a^2 = (0.3, 0.25, 0.2)$. The table shows that the accuracy becomes worse when a different data set is analyzed, but the reduction is small and almost within the statistical error. We have also performed the training of the FNN with the combined data set of β = 6.2 and 6.5. The result of this analysis is shown in the bottom row of Table VII. One finds that this FNN can predict Q for each β with the same accuracy, within the statistical error, as those trained for individual β.
These results suggest that it is possible to develop an NN model that deals with various β simultaneously. Once developed, such a model would be quite useful in the analysis of Q. We note, however, that the two lattices in the present study have almost the same spatial volume in physical units. The analysis of the robustness against the variation of the spatial volume is left for future work.

Learning topological charge density q t (x)
In this section we employ the CNN and train it to analyze the four-dimensional field $q_t(x)$. A motivation of this analysis is the search for characteristic features responsible for the topology in the four-dimensional space by the ML. In particular, if the quantum gauge configurations have local structures like instantons, 1) such structures would be recognized by the CNN and used for an efficient prediction of Q.

[Table VIII. Design of the CNN for the analysis of the multi-dimensional data. The dimension d of the input data is 4 in Sec. 7. In Sec. 8, we analyze the data with d = 1, 2, 3 obtained by the dimensional reduction.]

Input data
Let us first discuss the choice of the input data for the CNN. Because the gauge configurations on the lattice are described by the link variables $U_\mu(x)$, which are elements of the group SU(3), the most fundamental choice for the input data is the link variables. However, as $U_\mu(x)$ is described by 72 real variables per lattice site, the reduction of the data size is desirable for an efficient training. Moreover, because physical observables are given only by gauge-invariant combinations of $U_\mu(x)$, the CNN must learn the concept of the gauge invariance, and accordingly the SU(3) matrix algebra, so that it can make a successful prediction of Q from $U_\mu(x)$. These concepts, however, would be too complicated for simple CNN models.
In the present study, for these reasons we use the topological charge density $q_t(x)$ as the input of the CNN. $q_t(x)$ is gauge invariant, and it has one degree of freedom per lattice site. To reduce the size of the input data further, we reduce the lattice volume to $8^4$ from $16^4$ and $24^4$ by average pooling as a preprocessing. In addition to the analysis of $q_t(x)$ at a given t, we prepare a combined data set of $q_t(x)$ with several values of t and analyze it as multi-channel data by the CNN.
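The average-pooling step can be sketched as follows. The field is stored as a dictionary keyed by the four lattice coordinates, which is an assumption chosen for readability rather than speed, and the function name is hypothetical. A block size of 2 maps a $16^4$ lattice to $8^4$, and a block size of 3 maps $24^4$ to $8^4$. Because each output site is the mean over a hypercubic block, the sum over the pooled lattice multiplied by the number of sites per block reproduces the sum over the original lattice, so the information on Q(t) is preserved.

```python
from itertools import product

def average_pool_4d(field, n, block):
    """Average a field on an n^4 periodic lattice over non-overlapping block^4 hypercubes."""
    if n % block != 0:
        raise ValueError("lattice size must be divisible by the block size")
    m = n // block
    pooled = {}
    for site in product(range(m), repeat=4):
        total = 0.0
        for offset in product(range(block), repeat=4):
            total += field[tuple(block * s + o for s, o in zip(site, offset))]
        pooled[site] = total / block ** 4
    return pooled

# Small demonstration on a 4^4 lattice (the lattices of this study are 16^4 and 24^4).
n, block = 4, 2
field = {site: float(site[0]) for site in product(range(n), repeat=4)}
pooled = average_pool_4d(field, n, block)
```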

Designing CNN
In this section, we use the CNN with convolutional layers that deal with four-dimensional data. In Table VIII, we show the structure of the CNN model, where d denotes the dimension of the spacetime and is set to d = 4 throughout this section. The model has three convolutional layers with the filter size $3^4$ and five output channels. In these convolutional layers, we use periodic padding in all directions to respect the periodic boundary conditions of the gauge configuration. $N_{\rm ch}$ denotes the number of channels of the input data per lattice site; $N_{\rm ch} = 1$ when $q_t(x)$ at a single t is fed into the CNN. We also perform the multi-channel analysis by feeding $q_t(x)$ at $N_{\rm ch}$ flow times.
The lattice gauge theory has translational symmetry and the shift of the spatial coordinates of q t (x) toward any directions does not change the value of Q. To ensure that the CNN automatically respects this property, we insert a global average pooling (GAP) layer 28) after the convolutional layers. The GAP layer takes the average with respect to the spatial coordinates for each channel. The output of the GAP layer is then processed by two fully-connected layers before the final output. The logistic activation function is used for the convolutional and fully-connected layers.
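Why periodic padding followed by global average pooling yields exact translation invariance can be checked in a one-dimensional toy version. The sketch below uses a single channel and a single convolutional layer with a logistic activation; the CNN of this study is four-dimensional with several layers and channels, so this is an illustration of the mechanism only, with hypothetical function names.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def periodic_conv1d(x, kernel, bias=0.0):
    """1D convolution with periodic (wrap-around) padding; output has the same length as x."""
    n, k = len(x), len(kernel)
    half = k // 2
    return [sum(kernel[j] * x[(i + j - half) % n] for j in range(k)) + bias
            for i in range(n)]

def conv_gap_feature(x, kernel, bias=0.0):
    """Convolution, logistic activation, then global average pooling over all sites."""
    h = [logistic(v) for v in periodic_conv1d(x, kernel, bias)]
    return sum(h) / len(h)

def shift(x, s):
    """Cyclic shift of the lattice by s sites."""
    s %= len(x)
    return x[-s:] + x[:-s] if s else list(x)

x = [0.3, -1.2, 0.7, 2.0, -0.5, 0.1]
kernel = [0.5, -1.0, 0.25]
reference = conv_gap_feature(x, kernel, 0.1)
for s in range(len(x)):
    # The feature is unchanged by any cyclic shift, up to floating-point rounding.
    assert abs(conv_gap_feature(shift(x, s), kernel, 0.1) - reference) < 1e-12
```

A cyclic shift of the input only permutes the outputs of the periodic convolution, and the average over all sites does not depend on their order. With zero padding instead of periodic padding, the outputs near the boundary would change under a shift and the invariance would be lost.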
The training of the CNN in this section has been mainly carried out on Google Colaboratory. 64) We use 12,000 data for the training, 2,000 data for the validation, and 6,000 data for the test. The batch size for the minibatch training is 200. We repeat the parameter tuning for 500 epochs. Other settings of the training are the same as in the previous section. Besides the CNN model in Table VIII, we have tested several variations of the model. For example, we tested the ReLU activation function in place of the logistic one. The use of a fully-connected layer in place of the GAP layer and convolutional layers with the $5^4$ filter size were also tried. The number of output channels of the convolutional layers was varied up to 20. We, however, found that these variations do not improve the accuracy at all, while they typically increase the numerical cost for the training. The CNN in Table VIII is a simple but efficient choice among all these variations.

Results
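In Table IX, we show the accuracy P obtained by the trained CNN together with the recall of each topological sector, defined by

$$R_Q = \frac{N_Q^{\rm correct}}{N_Q},$$

where $N_Q$ is the number of configurations in the topological sector Q and $N_Q^{\rm correct}$ is the number of correct answers among them.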
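The recall per topological sector can be computed as in the sketch below, where the function name and the toy inputs are assumptions. The example also shows why the recall complements the accuracy: a model that answers Q = 0 for every configuration has a nonzero accuracy, but its recall is 1 for Q = 0 and 0 for every other sector.

```python
from collections import Counter

def recall_per_sector(true, pred):
    """Recall for each sector Q: correct answers in the sector divided by its size."""
    n_q = Counter(true)
    n_correct = Counter(t for t, p in zip(true, pred) if t == p)
    return {q: n_correct[q] / n_q[q] for q in sorted(n_q)}

# Toy example (not data from this study): a model that always answers Q = 0.
true_q = [0, 0, 1, -1, 0]
always_zero = [0] * len(true_q)
recalls = recall_per_sector(true_q, always_zero)
```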
The top row of Table IX shows P and R Q obtained by the analysis of the topological charge density of the original gauge configuration without the gradient flow. Although we obtain a nonzero P, the recall of each Q shows that in this case the CNN is trained to answer Q = 0 for almost all configurations. The training fails in obtaining any features responsible for the determination of Q.
Next, the results with $N_{\rm ch} = 1$ but nonzero $t/a^2$ show that P becomes larger with increasing $t/a^2$. From $R_Q$ one also finds that the output of the CNN is spread over different topological sectors. However, by comparing P with that of the benchmark model $P_{\rm imp}$ in Table III at the same $t/a^2$, one finds that P and $P_{\rm imp}$ are almost the same. This result suggests that the CNN is trained to answer $Q_{\rm imp}$ and no further information is obtained from the analysis of the four-dimensional data of $q_t(x)$.
Finally, from the multi-channel analysis with the input flow times $t/a^2 = (0.3, 0.2, 0.1)$, one finds that the accuracy P is significantly enhanced from the case with $N_{\rm ch} = 1$ and exceeds 94% for each β. However, this accuracy is the same within the error as that obtained in Sec. 6 with $t/a^2 = (0.3, 0.2, 0.1)$ shown in Table V. This result implies that the CNN is trained to obtain Q(t) for each t and then predicts the answer from them with a procedure similar to that of the FNN in Sec. 6.
From these results, we conclude that our analyses of the four-dimensional data by the CNN fail to find structures in the four-dimensional space responsible for the determination of Q. The numerical cost for the training of the CNN in this section is a few orders of magnitude larger than that in Sec. 6, although clear improvement of the accuracy is not observed. Therefore, for practical purposes the analysis in the previous section with the FNN is superior.

Dimensional reduction
In the previous two sections we discussed the analysis of the four-dimensional topological charge density q t (x) and its four-dimensional integral Q(t) by the ML. The spatial dimensions of these input data are d = 4 and 0, respectively. In this section, we analyze the data with the dimensions d = 1-3 obtained by the dimensional reduction by the CNN.
We consider the integral of the topological charge density with respect to some coordinates Here,q (d) t is the d-dimensional field analyzed by the CNN. The structure of the CNN is the same as the previous section (see Table VIII) except for the value of d. The procedure of the supervised learning is also the same. We perform the analysis of the multi-channel data with N ch = 3 and t/a 2 = (0.3, 0.2, 0.1).
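On the lattice, integrating out coordinates amounts to summing the field over the corresponding directions. The sketch below keeps the first d coordinates and sums over the remaining 4 - d; which coordinates are integrated out is an assumption made here, as are the function name and the dictionary layout of the field. For d = 0 the result is a single number, the total charge Q(t), and for every d the sum over the reduced field equals the sum over the original one.

```python
import random
from itertools import product

def reduce_dimension(field, n, d):
    """Sum a field on an n^4 lattice over the last 4 - d coordinates (assumed choice)."""
    reduced = {}
    for site in product(range(n), repeat=4):
        key = site[:d]                      # coordinates that are kept
        reduced[key] = reduced.get(key, 0.0) + field[site]
    return reduced

# Demonstration on a small random field (not data from this study).
rng = random.Random(0)
n = 2
field = {site: rng.gauss(0.0, 1.0) for site in product(range(n), repeat=4)}
total_charge = reduce_dimension(field, n, 0)[()]   # d = 0: the integrated charge
```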
In Fig. 6, we show the accuracy obtained by the analysis of the d-dimensional data $\tilde{q}^{(d)}_t$ by the CNN. The data points at d = 0 show the result obtained by the analysis of Q(t) by the FNN in Sec. 6 with $t/a^2 = (0.3, 0.2, 0.1)$ given in Table V. From the figure, one finds that the accuracy does not have a statistically significant d dependence, although the results at d = 1 and 2 appear slightly better than that at d = 0.

Discussion
In the present study, we have investigated the application of the machine learning (ML) technique to the classification of the topological sector of gauge configurations in SU(3) Yang-Mills theory. The topological charge density $q_t(x)$ at zero and nonzero flow times t is used as the input of the neural networks (NN) with and without dimensional reduction.
We found that the prediction of the topological charge Q can be made most efficiently when Q(t) at small flow times is used as the input of the NN. In particular, we found that the value of Q defined from Q(t) at a large flow time can be predicted with high accuracy from Q(t) at $t/a^2 \le 0.3$ alone; at β = 6.5, the accuracy exceeds 99%. With this procedure, the numerical cost of solving the gradient flow up to the large flow time can be avoided in the analysis of the topological charge.
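As a minimal sketch of how such an accuracy can be evaluated, the snippet below compares a predicted sector with a reference sector configuration by configuration. It assumes that the reference value is obtained by rounding Q(t) at the large flow time to the nearest integer, and it uses naive rounding of Q(t) at a small flow time as a stand-in predictor in place of the trained NN. The function name `accuracy` and all numbers are hypothetical toy values, not simulation data.

```python
import numpy as np


def accuracy(predicted, reference):
    """Fraction of configurations whose predicted sector equals the reference."""
    predicted = np.asarray(predicted)
    reference = np.asarray(reference)
    return float(np.mean(predicted == reference))


# Toy values of Q(t) at a large flow time, close to integers (assumption).
q_large_flow = np.array([0.98, -1.03, 0.01, 2.02, -0.02])
reference = np.rint(q_large_flow).astype(int)

# Toy values of Q(t) at a small flow time on the same five configurations.
q_small_flow = np.array([0.61, -0.72, 0.12, 1.38, -0.55])
naive_prediction = np.rint(q_small_flow).astype(int)

print(accuracy(naive_prediction, reference))  # 3 of 5 sectors agree -> 0.6
```

Any predictor that returns an integer per configuration, such as the output of a trained network, can be passed to `accuracy` in place of `naive_prediction`.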
Because the prediction of the NN does not have 100% accuracy, the analysis of Q by the NN gives rise to uncontrollable systematic uncertainties. However, our analyses indicate that the accuracy becomes better as the continuum limit is approached. Moreover, as discussed in Sec. 6, the imperfect accuracy would to a large extent come from the intrinsic uncertainty of the topological sectors on the lattice with finite a. It is thus expected that the analysis of Q becomes more accurate as the lattice spacing becomes finer. As 99% accuracy is already attained at β = 6.5 (a ≃ 0.044 fm), the systematic uncertainty should be well suppressed on lattices finer than this spacing, and our method should be safely applicable to the analysis of the continuum limit.
In this study, we found that the analysis of the multi-dimensional field $q_t(x)$ by the CNN does not improve the accuracy compared with that of Q(t). A plausible interpretation of this result is that our CNN fails to capture useful structures in the four-dimensional space relevant for the determination of Q. Pursuing the search for such structures by the ML is an interesting subject for future work. One possible extension in this direction is the analysis with a CNN having a more complex structure. Another interesting direction is the analysis of the gauge configurations at high temperatures, where the dilute instanton-gas picture is well applicable. As the topological charge would be carried by well-separated local objects at such temperatures, the search in the multi-dimensional space by the CNN would be easier than for the vacuum configurations analyzed in the present study. It is also interesting to analyze $q_t(x)$ at a large flow time after subtracting the average, because after such a preprocessing the NN can no longer make use of the information on Q(t). We leave these analyses for future research.
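The last preprocessing mentioned above can be sketched in a few lines. The sketch assumes that $q_t(x)$ is stored as a four-dimensional NumPy array and that "the average" means the mean over all lattice sites. The function name `subtract_average` and the random input are hypothetical illustrations only.

```python
import numpy as np


def subtract_average(q):
    """Remove the lattice average of q so that its sum over all sites vanishes."""
    return q - q.mean()


# Toy density on a 4^4 lattice with a small nonzero offset (not simulation data).
rng = np.random.default_rng(1)
q = rng.normal(size=(4, 4, 4, 4)) + 0.01

q_sub = subtract_average(q)

print(abs(q_sub.sum()) < 1e-9)            # the integrated charge is removed
print(np.allclose(q - q_sub, q.mean()))   # only a constant has been subtracted
```

Since only a site-independent constant is removed, the local fluctuations of the density are untouched, while the lattice sum that gives Q(t) is no longer available to the network.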
The authors thank A. Tomiya for many useful discussions. They also thank H. Fukaya and K. Hashimoto. The lattice simulations of this study were in part carried out on OCTOPUS at the Cybermedia Center, Osaka University and Reedbush-U at the Information Technology Center, The University of Tokyo. The NNs were constructed with the Chainer framework. The supervised learning of the NN in Sec. 7 was in part carried out on Google Colaboratory. This work was supported by JSPS KAKENHI Grant Numbers 17K05442 and 19H05598.

Appendix: Behavior of Q(t)
In this appendix, we take a closer look at the behavior of Q(t) at small t. In Fig. A·1, we show the t dependence of Q(t) on 100 gauge configurations at β = 6.5 in two different ways. In Sec. 6 it is shown that the trained NN can estimate the value of Q from the behavior of Q(t) at 0.2 ≤ t ≤ 0.3 with 99% accuracy for β = 6.5. This range of t is highlighted by the gray band in Fig. A·1. From the upper panel, one sees that Q(t) approaches zero monotonically on almost all configurations. However, the panel also shows that some lines deviate from this trend. As a result, it seems difficult to predict the value of Q with 99% accuracy (Q has to be predicted correctly on 99 lines among 100 in the panel) by a simple function or by the human eye from the behavior at 0.2 ≤ t ≤ 0.3, although 95% accuracy is not difficult to attain. A similar observation is obtained from the lower panel. This indicates that the 99% accuracy obtained by the NN in Sec. 6 is not a trivial result.