Phase transition encoded in neural network

We discuss an aspect of neural networks for the purpose of phase transition detection. To this end, we first train a neural network on Ising/Potts configurations labeled by temperature, so that it can predict the temperature of an input configuration. We do not explicitly supervise whether the configurations are in the ordered or disordered phase. Nevertheless, we can identify the critical temperature from the parameters (weights and biases) of the trained neural network. We attempt to understand how temperature-supervised neural networks capture the information of the phase transition by paying attention to what quantities they learn. Our detailed analyses reveal that they learn different physical quantities depending on how well they are trained. The main observation of this study is how the weights of the trained neural network come to carry information of the phase transition in addition to temperature.


I. INTRODUCTION
Exploring phases of matter is one of the most important tasks in revealing the infrared structure of underlying microscopic physical systems. Phases are classified based on the symmetries that the theory possesses [1,2]. In a theory with several distinct phases, phase transitions occur at their boundaries, and, among them, the nature of a second-order phase transition is determined solely by the number of dimensions and the global symmetries of the underlying theory, independent of its microscopic details, i.e., it is classified by the universality class. In reality, however, the analytic determination of phases or the detection of phase transitions from the data of microscopic theories is generally hard, because we can rarely solve the theories exactly or identify the corresponding infrared theories. Therefore, a tremendous amount of work has been devoted to unraveling them with numerical approaches.
An obvious and major obstacle is that the larger the number of degrees of freedom, the harder the numerical analysis becomes.
Machine learning has developed rapidly in computer science and made prominent successes in pattern recognition, image processing, and related fields. Recently, machine learning has also been applied in various branches of physics. Detection of phase transitions is one of the intriguing examples where machine learning may make new progress, and several approaches have already been proposed and examined in simple models such as spin systems. The training is performed with supervision [3-16] or without it [17-22].
One approach to detect a phase boundary is supervised binary classification, where a neural network is trained so that it can distinguish the ordered and disordered phases. Indeed, it reasonably detects the phase transition in several models from their raw data [3]. It also succeeded in detecting non-standard phase transitions such as topological and BKT phase transitions [10,12,14,16].
On the other hand, a novel approach was proposed in [4] to detect the phase transition, based on the speculation that the information of the order parameter is encoded in the weights of a neural network as a consequence of training. The authors attempted to identify the critical temperature of the two-dimensional Ising model based on supervised machine learning: a fully-connected as well as a convolutional neural network is trained in such a way that it can correctly infer the target temperature of an input spin configuration. It is surprising that they succeeded in extracting the phase transition temperature from the weights, because they did not feed any direct information about the phase transition during the training. This implies that the network spontaneously captures the phase transition and encodes it in the machine parameters along the process of supervised learning of temperature, although the underlying mechanism remained unresolved.
The purpose of this article is to understand the mechanism of this phase transition detection. Understanding what this approach captures will be useful when we apply it to other, unknown systems, because the method may not detect a phase transition correctly when the transition is driven by a different physical mechanism, such as quasi-long-range order (a BKT transition) or topology. Indeed, it turns out that the current method captures different physical quantities depending on how we train the neural network. In order to illustrate the idea, we study simplified architectures embodying the essence of temperature prediction.

II. CRITICAL TEMPERATURE PREDICTION
We consider the two-dimensional Ising model, described by the Hamiltonian

H = -J \sum_{\langle i,j \rangle} \sigma_i \sigma_j ,

where the coupling constant J is taken to be positive. σ_i ∈ {−1, 1} represents a spin degree of freedom living on a site of a square lattice of size L × L. We impose periodic boundary conditions on the spins, and the sum is taken over nearest-neighbor sites. We redefine the Hamiltonian as

\tilde{H} := -\sum_{\langle i,j \rangle} \sigma_i \sigma_j ,

by absorbing the coupling constant into the inverse temperature K := J/k_B T on the Boltzmann weight, e^{-K \tilde{H}}. We attempt to detect the critical temperature associated with the second-order phase transition. Our first step, however, is to construct a "thermometer" by employing a supervised feed-forward neural network. We will then examine the weights and biases of the trained neural networks and attempt to identify the critical temperatures of the two models. We use two types of neural-network architecture: fully-connected and convolutional neural networks.
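For concreteness, the dimensionless energy entering the Boltzmann weight can be computed as in the following minimal numpy sketch (the function name is our own; each bond is counted once via right and down neighbors under periodic boundaries):

```python
import numpy as np

def ising_energy(spins):
    """Dimensionless energy E = -sum_<ij> s_i s_j of a 2D Ising
    configuration with periodic boundary conditions (J absorbed into K)."""
    # Count each bond once: couple every site to its right and down neighbors.
    right = np.roll(spins, -1, axis=1)
    down = np.roll(spins, -1, axis=0)
    return -float(np.sum(spins * right) + np.sum(spins * down))

# Fully ordered configuration: all 2*L*L bonds are satisfied.
L = 4
ordered = np.ones((L, L), dtype=int)
print(ising_energy(ordered))  # -> -32.0  (= -2 * L * L)
```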
The former consists of fully-connected hidden and output layers, as follows. We denote the input degrees of freedom by {σ_i} (i = 1, . . . , L × L), which are the spins in the case of the Ising model. A hidden unit x_a (a = 1, . . . , N_h) is given by

x_a = f\big( w^{(1)}_{ai} \sigma_i + b^{(1)}_a \big),

where repeated indices are summed and f denotes the activation function. w^{(1)}_{ai} and b^{(1)}_a are the weights and biases of the first layer, respectively. In terms of the weights w^{(2)}_{Ka} and biases b^{(2)}_K of the second layer, a variable y_K (K = 1, . . . , N_o) of the output layer takes the same form as the hidden variables,

y_K = \mathrm{softmax}\big( w^{(2)}_{Ka} x_a + b^{(2)}_K \big).

Based on the output {y_K}, the temperature of the input configuration is determined via

K_{\rm pred} = \mathrm{argmax}_K \, y_K ,

namely, the temperature K with the highest "probability" y_K is the output temperature. The training is implemented by tuning the weights and biases with the Adam optimizer [23], using as the error function the cross entropy between the target and output temperatures,

E = -\sum_i \sum_K t^{(i)}_K \ln y^{(i)}_K ,

where i denotes the label of the input configurations and t^{(i)}_K is the one-hot target temperature.
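A minimal numpy sketch of this forward pass follows, with untrained random parameters; the tanh hidden activation is our assumption for f, which the text does not fix, and the training loop (Adam, cross entropy) is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(sigma, W1, b1, W2, b2):
    """Hidden units x_a = f(W1 sigma + b1); outputs y_K = softmax(W2 x + b2)."""
    x = np.tanh(sigma @ W1 + b1)   # activation f: an assumption of this sketch
    return softmax(x @ W2 + b2)

def predict_temperature(y):
    """The predicted temperature label is argmax_K y_K."""
    return int(np.argmax(y))

L, N_h, N_o = 8, 80, 20            # lattice size, hidden units, target temperatures
W1 = rng.normal(0, 0.1, (L * L, N_h)); b1 = np.zeros(N_h)
W2 = rng.normal(0, 0.1, (N_h, N_o)); b2 = np.zeros(N_o)

sigma = rng.choice([-1.0, 1.0], size=L * L)
y = forward(sigma, W1, b1, W2, b2)
print(y.shape, predict_temperature(y))
```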
The convolutional neural network consists of three convolutional layers followed by a final fully-connected layer (15). We discuss it in more detail in Sec. III B.
Our procedure is summarized as follows.
Step 1: Gather configurations via the ordinary Markov-chain Monte-Carlo method. We do not need machine learning in this step. A data set {σ_i} of spin configurations at each fixed K is stored.
Step 2: Train the neural network as a thermometer. The input is a spin configuration and the output is the predicted temperature.
Step 3: Analyze the trained weights and biases appearing in the neural network. We discuss how the machine parameters contain the information of the phase transition.
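Step 1 can be sketched with a standard single-site Metropolis update; the following minimal Python implementation assumes the Boltzmann weight e^{K Σ σσ} defined above:

```python
import numpy as np

def metropolis_sweep(spins, K, rng):
    """One Metropolis sweep of the 2D Ising model at inverse temperature K
    (Boltzmann weight exp(K * sum_<ij> s_i s_j), periodic boundaries)."""
    L = spins.shape[0]
    for _ in range(L * L):
        i, j = rng.integers(0, L, size=2)
        # Energy cost of flipping spin (i, j): dE = 2 K s_ij * (sum of 4 neighbors)
        nb = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j]
              + spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
        dE = 2.0 * K * spins[i, j] * nb
        if dE <= 0 or rng.random() < np.exp(-dE):
            spins[i, j] *= -1
    return spins

rng = np.random.default_rng(1)
L, K = 8, 0.6                      # K above K_c ~ 0.4407, i.e. the ordered phase
spins = rng.choice([-1, 1], size=(L, L))
for _ in range(200):
    metropolis_sweep(spins, K, rng)
m = abs(spins.mean())
print(m)  # |m| is often close to 1 deep in the ordered phase
```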

A. 2D Ising model
We briefly discuss how the phase transition detection works. We first take a look at the 2D Ising model, which was already studied in [4] with 100 target temperatures; there, the weights behave like an order parameter, i.e., the spontaneous magnetization. Since our primary purpose is to understand the mechanism rather than to estimate the critical temperature quantitatively, we keep the setup simple. The weights of the second layer after the training are shown in Fig. 1. In [4], the critical temperature was predicted by fitting the sum of the weights by a tanh[c(K − K_c)] − b with free parameters a, b, c, and K_c. Indeed, the average of the final weights appears to behave like an order parameter (right panel of Fig. 1). We will discuss the detailed structure of the weights in the next section.
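As a simple numerical stand-in for that fit, one can locate the steepest change of an order-parameter-like weight curve; the data below are synthetic (a tanh step centred on the exact K_c), not trained weights:

```python
import numpy as np

# Synthetic stand-in for the averaged second-layer weights: an
# order-parameter-like step centred on the exact K_c ~ 0.4407.
Ks = np.linspace(0.1, 1.0, 91)
Kc_true = 0.4407
W = 0.8 * np.tanh(12.0 * (Ks - Kc_true)) - 0.1

# Estimate K_c as the location of the steepest change of the curve.
Kc_est = Ks[np.argmax(np.abs(np.gradient(W, Ks)))]
print(Kc_est)
```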

B. 2D 3-state Potts model
Before getting into the details of the learning mechanism for the critical temperature of the 2D Ising model, we take a look at another example, the two-dimensional 3-state Potts model. The Hamiltonian is given by

H = -J \sum_{\langle i,j \rangle} \delta_{\Phi_i, \Phi_j} ,

where Φ_i takes three values, a generalization of the Ising spin σ_i. Hence, configurations {Φ_i} labeled by temperatures K are the inputs of the neural network. The 3-state Potts model is known to exhibit a second-order phase transition at K_c ≈ 1.0050; fluctuations render the transition continuous, unlike the prediction of simple Landau theory [25-27].
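A sketch of the corresponding Potts energy, assuming the standard Kronecker-delta form of the Hamiltonian above; the exact critical coupling K_c = ln(1 + √q) of the two-dimensional q-state Potts model reproduces the quoted value for q = 3:

```python
import numpy as np

def potts_energy(phi):
    """Dimensionless 3-state Potts energy E = -sum_<ij> delta(phi_i, phi_j),
    periodic boundaries; phi takes values in {0, 1, 2}."""
    right = np.roll(phi, -1, axis=1)
    down = np.roll(phi, -1, axis=0)
    return -float(np.sum(phi == right) + np.sum(phi == down))

# Exact critical coupling of the q-state Potts model: K_c = ln(1 + sqrt(q)).
q = 3
print(np.log(1 + np.sqrt(q)))  # -> 1.00505..., matching K_c ~ 1.0050

L = 4
print(potts_energy(np.zeros((L, L), dtype=int)))  # -> -32.0 (all bonds satisfied)
```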
After the same training as used for the 2D Ising model, we obtain the weights and their average shown in Fig. 2. We again find a drastic change in the weight structure around the critical temperature.

Figure 3. Weights w^{(2)}_{Ka} of trained neural networks with three hidden units (N_h = 3). The structure change is still observed around the critical temperature. One of the weights has an almost opposite temperature dependence, which also appears in Fig. 9. See the main text for a detailed discussion.

III. DISCUSSION
So far, we have observed that the change in the values of the second-layer weights signals the phase transition, which implies that the information of the critical temperature is somehow encoded in the trained neural network. In what follows, we carefully examine the trained fully-connected/convolutional neural networks and attempt to understand what (physical) quantity they capture and how it is related to the temperature prediction as well as to the critical temperature detection [15].

A. Magnetization encoded in neural network
Since the order parameter of phase transition in the 2D Ising model is the spontaneous magnetization, it sounds natural that it is encoded in the neural network after the training.
To give a quantitative argument, we construct a simplified model by examining the weights and biases of the trained neural network in the case of the 2D Ising model. First, we reduce the number of hidden units in (2) from 80 to 3 for simplicity. We note that it still captures the critical temperature, as shown in Fig. 3.
Before modeling the second layer, we examine the characteristics of the first layer.
Figure 4a shows a clear correlation between the outputs of the hidden units and the magnetization density of the input Ising spin configuration. From this observation, we model these outputs by three lines linear in the magnetization m({σ}), as shown in Fig. 4b, with a single threshold parameter ε > 0. Furthermore, as an activation function, we use the max function, which assigns 1 to the maximum entry and 0 to the rest, instead of the softmax; this replacement does not change our final result. x̄_a = max(x_a) then yields one-hot vectors depending on the magnetization m of the input: one unit is selected for m < −ε, another for −ε ≤ m < ε, and the third for ε ≤ m. The parameter ε may be interpreted as a threshold magnetization separating the ferromagnetic and paramagnetic phases [3].
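The three-line model and the max activation can be sketched as follows; the slopes, intercepts, and the value of ε are illustrative choices of ours, not fitted values:

```python
import numpy as np

EPS = 0.3  # threshold magnetization epsilon (illustrative value)

def hidden_lines(m, eps=EPS):
    """Three hidden outputs modeled as lines linear in the magnetization m.
    The slopes/intercepts are illustrative, chosen so that the max picks
    unit 0 for m < -eps, unit 1 for |m| < eps, unit 2 for m >= eps."""
    return np.array([-m, eps, m])

def hard_max(x):
    """'max' activation: 1 for the largest entry, 0 elsewhere."""
    out = np.zeros_like(x)
    out[np.argmax(x)] = 1.0
    return out

for m in (-0.8, 0.0, 0.8):
    print(m, hard_max(hidden_lines(m)))
```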
Having understood the magnetization dependence of the three hidden units, we next analyze the second layer, whose weights are given in Fig. 3. We divide the temperatures into three groups, low, critical, and high, respectively represented by vectors in the N_o-dimensional output space. This procedure effectively reduces the output dimension N_o to 3. According to Fig. 3, we parametrize the elements w_{Ka} of the reduced (K × a) weight matrix w^{(2)} in terms of two constants ∆ > δ > 0; a precise parametrization is not necessary for the following discussion. We neglect the biases, as they are much smaller than the weights. Then the three magnetization regimes,

m < −ε ,   −ε ≤ m < ε ,   ε ≤ m ,

are predicted as low, high, and low temperature, respectively. Since the configurations with m < −ε and m ≥ ε are in the ordered phase, they are correctly predicted as low temperature. The configurations with −ε ≤ m < ε are likewise correctly predicted as high temperature. However, the weights are almost temperature independent above and below K_c, respectively, as seen in Fig. 3.
Therefore, the trained network is capable of distinguishing only high or low temperature.
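The low/high/low prediction pattern can be illustrated with a hypothetical reduced weight matrix obeying ∆ > δ > 0; the numbers below are our own construction, not the trained values:

```python
import numpy as np

DELTA, delta = 1.0, 0.3  # Delta > delta > 0

# Hypothetical reduced (3 temperatures x 3 hidden units) weight matrix:
# rows = (low, critical, high) temperature outputs; columns = hidden units
# selected for m < -eps, |m| < eps, m >= eps, respectively.
W2 = np.array([
    [DELTA, -DELTA, DELTA],   # low-temperature row
    [delta,  delta, delta],   # critical row
    [-DELTA, DELTA, -DELTA],  # high-temperature row
])

labels = ["low", "critical", "high"]
for a, regime in enumerate(["m < -eps", "|m| < eps", "m >= eps"]):
    x = np.zeros(3); x[a] = 1.0   # one-hot hidden vector from the max activation
    print(regime, "->", labels[int(np.argmax(W2 @ x))])
```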
One might think that this is due to the fact that we have a single threshold parameter for the three hidden units. We can indeed increase the number of threshold parameters by introducing more hidden units, but this does not lead to a higher resolution of temperatures. In fact, even if we increase the number of hidden units, the weights in the second layer show only two patterns of temperature dependence: most of them behave like the blue and green curves, and the rest behave like the orange curve in Fig. 3. This is exactly what is observed in Fig. 1. Increasing the hidden units simply duplicates the second-layer weights already observed in Fig. 3, and consequently, the predicted temperature is either high or low no matter how many hidden units are introduced.
What we have learned from the above analyses is as follows. The network acquires the information of the magnetization, which manifests itself in the output of the first layer.
However, based on the magnetization alone, it is hard for this network to discriminate between individual temperatures beyond the distinction between the ordered and disordered phases. From the viewpoint of the machine-learning parameters, this is because the weights of the second layer are temperature independent except around the critical temperature.

B. Energy and temperature prediction
We found in the last subsection that the magnetization of the 2D Ising model was built into the temperature-supervised fully-connected neural network, which in turn allows us to read off the critical temperature from its weight structure. While the critical temperature seems to be well detected, we have not discussed the temperature prediction itself. Interestingly, the accuracy of temperature prediction can be theoretically computed to be 40.1% in the case of the 2D Ising model under the above setup, giving an upper bound for the accuracy of temperature prediction by machine learning [28] (see Appendix A for details). However, the test accuracy of temperature prediction by the fully-connected neural network is 16.8%, which is not even close to the theoretically predicted accuracy of 40.1%. We could thus train better from the viewpoint of temperature prediction, although we do not need to for phase transition detection. What happens if we design the neural-network architecture in such a way that the temperature-prediction accuracy improves?
To answer the question, we use the convolutional neural network described below, which generally enables higher accuracy in two-dimensional image recognition.

The network has three convolutional layers with square filters of size (s_i, s_i), strides (s_i, s_i), and C_i channels, each of which is followed by a ReLU activation function. The output is then passed to a fully-connected layer with softmax activation, whose input and output are of our interest and analyzed in detail. The output of the second-last layer, denoted by x_a, is given by

x_a = \mathrm{ReLU}(\tilde{x}_a) = \max(0, \tilde{x}_a),

and takes values in [0, ∞); x̃_a is the output of the previous layer before being passed to the ReLU activation. The output x_a then plays the role of the input of the fully-connected layer, giving a prediction of the temperature via (4) and (5).
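A single-channel sketch of one such convolutional layer, assuming stride equal to filter size (non-overlapping patches); the filter values here are ours, chosen only to illustrate the shapes and the non-negativity after ReLU:

```python
import numpy as np

def conv_nonoverlap(x, w, b):
    """One convolutional layer with an (s, s) filter applied at stride (s, s)
    (non-overlapping patches, as when stride equals filter size), plus ReLU."""
    s = w.shape[0]
    L = x.shape[0]
    out = np.zeros((L // s, L // s))
    for i in range(L // s):
        for j in range(L // s):
            patch = x[i * s:(i + 1) * s, j * s:(j + 1) * s]
            out[i, j] = np.sum(patch * w) + b
    return np.maximum(out, 0.0)  # ReLU: the output lies in [0, inf)

rng = np.random.default_rng(2)
x = rng.choice([-1.0, 1.0], size=(8, 8))   # a spin configuration as input
w = np.full((2, 2), 0.25)                  # hypothetical single-channel filter
y = conv_nonoverlap(x, w, 0.0)
print(y.shape, float(y.min()) >= 0.0)  # -> (4, 4) True
```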
We choose the filter sizes such that the number of hidden units is N_h = 3. With this setup, we achieved 35.9% accuracy on a test data set. This is twice as high as that of the fully-connected network, although still below the theoretical bound. The outputs of the second-last layer are plotted against the magnetization (Fig. 5a) and the internal energy (Fig. 5b) of the input configurations, respectively.
We clearly see the transition in what the neural network has learned. The first-layer outputs are now proportional to the internal energy rather than the magnetization of input configurations. Also, we notice that the weights of the second layer obtain mild dependence on temperature, implying the information of the critical temperature is blurred (Fig. 7).
As we shall see momentarily, the oscillating orange line is irrelevant in the temperature prediction.
We now proceed to a simplified parametrization of the last two layers. Here, (w_a) and (w^{(2)}_{Ka}, b^{(2)}_K) stand for the weights and biases in the second-last and last layers, respectively. As we have already mentioned and checked in Fig. 5b, the output of the first layers is proportional to the energy E({σ_i}) of the input configuration {σ_i}. After passing through the ReLU activation, we model it as

x_a = \mathrm{ReLU}(\tilde{x}_a) = \varepsilon E(\{\sigma_i\}) + \varphi,   (18)

where ε and φ are positive constants. The domain of E is restricted so that x_a does not take a negative value.
Having observed that the input of the fully-connected layer is proportional to E, let us consider what would be an optimal inference of the temperature of an input configuration {σ_i} [28]. The probability P({σ_i}; K) that the configuration {σ_i} appears at temperature K is given by

P(\{\sigma_i\}; K) = \frac{e^{-K E(\{\sigma_i\})}}{Z(K)}.   (19)

Therefore, the likelihood that {σ_i} is generated at temperature K is represented by the following "probability",

y^{\rm theory}_K = \frac{e^{-K E + F(K)}}{\sum_{K'} e^{-K' E + F(K')}},   (20)

where F = −ln Z(K) is the free energy and K' is summed over the target temperatures.
Then, we obtain the estimated temperature,

K_{\rm est} = \mathrm{argmax}_K \, y^{\rm theory}_K .   (21)

We remark here that the free energy is a function of the temperature K and holds the information of the phase transition, although it does not exhibit a genuine singularity in finite systems.
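As a sketch of this maximum-likelihood rule, the following Python toy (our illustration, not the paper's code) evaluates y^theory_K for a two-spin system, whose partition function Z(K) = 2e^K + 2e^{−K} is known in closed form:

```python
import numpy as np

def temperature_likelihood(E, Ks, lnZ):
    """y_K proportional to exp(-K E + F(K)) with F = -ln Z(K), normalized
    over the target temperatures (a numerically stable softmax)."""
    logits = -np.asarray(Ks) * E - lnZ   # -K E + F(K)
    logits -= logits.max()
    y = np.exp(logits)
    return y / y.sum()

# Toy model: two Ising spins with weight e^{K s1 s2}; E = -s1 s2 in {-1, +1}.
Ks = np.array([0.2, 0.6, 1.0])
lnZ = np.log(2 * np.exp(Ks) + 2 * np.exp(-Ks))
y = temperature_likelihood(-1.0, Ks, lnZ)   # an ordered (E = -1) configuration
print(y, Ks[np.argmax(y)])
```

An ordered configuration is most likely to come from the strongest coupling on the list, as expected.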
Based on the above consideration, we guess the fully-connected weights of our neural network so that the resultant output behaves like (20), and then compare them with the actual parameters of the trained network. To this end, we parametrize the weights by linear functions of K together with a common nonlinear function G(K), with constant parameters p, q, r, and s (Eq. (22)). w^{(2)}_{K1} (the orange curve in Fig. 7) is not relevant to our consideration because x_1 = ReLU(x̃_1) = 0. Following an observation from the simulation, we neglect the bias, which is much smaller than the weights.
Combining Eqs. (18) and (22) yields the output y_K, where we drop K-independent terms because they do not affect the outcome of the max function. The expression indeed takes the form of y^theory_K (20) if −2ε(φG(K) + pK) = (p + r)F is satisfied. We plot the theoretically predicted weights in Fig. 6; they reproduce the behavior of the trained weights in Fig. 7. Based on these considerations, we conclude that the information of the phase transition is again encoded in the fully-connected weights, because G(K) in Eq. (22) is directly related to the free energy. In particular, the critical temperature is obtained by detecting the enhancement of the second derivative of the weights with respect to temperature.
It is noted that the model, or weight parametrization, demonstrated above is one of many possibilities yielding the same temperature prediction. For example, the quadratic K-dependence could arise from the biases of the last layer instead of the weights [28]. In that case, the information of the phase transition, or the free energy, would be encoded in the bias.
How the physical information is stored in the machine parameters depends on the architecture of neural network including the number of hidden units or activation functions.
Finally, we mention that the discussion given for 2D Ising model also holds for the 3-state Potts model.

IV. CONCLUSION
We revisited the phase transition detection of the 2D Ising model based on temperature-supervised machine learning to clarify the underlying mechanism.
We first demonstrated that the fully-connected neural network shows a drastic change in the weight structure of the second layer as a result of training on the 2D Ising model as well as the 3-state Potts model. A closer look at the neural network with three hidden units revealed that it actually captures the magnetization in the first layer. The phase transition detection is, however, a consequence of the low prediction accuracy of temperature on input configurations except around the critical temperature.
On the other hand, employing the convolutional neural network, we succeeded in improving the temperature-prediction accuracy. It turned out that the trained convolutional network captures the internal energy of the input configurations instead of the magnetization. Also, the weights in the last fully-connected layer do not show any drastic change in their values, in contrast to the former case; this aspect is understood from the viewpoint of optimal temperature prediction. In this case, the weights are related to the free energy, and hence the physical information, including the critical temperature, is again encoded in them.
Interestingly, the trained neural networks extract different physical information depending on how they are trained. A "bad" neural network in terms of temperature prediction captures the magnetization of the input spin configurations, which happens to be convenient for detecting the critical temperature, as the magnetization is the order parameter. Improving the network architecture allows us to construct a "good" (better) temperature predictor, but the information of the phase transition is then encoded in the network more implicitly.

ACKNOWLEDGMENT
We thank Akinori Tanaka for providing us with the code used in [4]. K.K. is supported by the Grants-in-Aid for Scientific Research from JSPS (No. 18K03618). The work of A.T. was supported by the RIKEN Special Postdoctoral Researcher program.

Appendix A: Temperature prediction and its accuracy
We discuss how temperature prediction of Ising spin configurations works [28]. We start by preparing Ising spin configurations generated by a Markov-chain Monte-Carlo algorithm at the target temperatures. The configurations at a fixed temperature K are distributed over energy with mean ⟨E⟩ and variance ⟨E²⟩ − ⟨E⟩², each of which is obtained from the partition function as

\langle E \rangle = -\partial_K \ln Z(K), \qquad \langle E^2 \rangle - \langle E \rangle^2 = \partial_K^2 \ln Z(K).

Given a spin configuration, we consider the optimal prediction of its temperature. The energy can be calculated for each configuration. It is noted that, since the standard deviation of the energy density E/L² is proportional to L^{−1}, the energy probability distribution has no width in an infinite system. In that case, provided a certain configuration, we should be able to correctly predict the temperature at which it was generated by calculating its energy density. However, the distribution at each temperature has a finite width in finite systems. In that case, temperature prediction does not necessarily give the correct answer, because there are overlaps between the energy distributions (Fig. 8). The best we can do is to guess the temperature of the configuration by a maximum-likelihood estimate, namely, the temperature that is most likely to yield the configuration's energy is the optimally predicted temperature. Consequently, there is an upper bound on the accuracy of the prediction, determined by the overlaps of the energy probability distributions in a finite system.
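The thermal mean and variance of E can be checked directly on a tiny lattice by brute-force enumeration of all configurations; the following Python sketch (our illustration, using a 3×3 lattice where the 2^9 configurations are easily enumerable) computes the Boltzmann averages that the ln Z derivatives represent:

```python
import numpy as np
from itertools import product

def exact_Z_stats(L, K):
    """Exact <E> and Var(E) for an L x L periodic Ising model by brute-force
    enumeration (feasible only for tiny L), using weights exp(-K E)."""
    energies = []
    for config in product([-1, 1], repeat=L * L):
        s = np.array(config).reshape(L, L)
        E = -(np.sum(s * np.roll(s, -1, 0)) + np.sum(s * np.roll(s, -1, 1)))
        energies.append(E)
    E = np.array(energies, dtype=float)
    w = np.exp(-K * E)
    w /= w.sum()                       # Boltzmann probabilities e^{-KE}/Z
    mean = np.sum(w * E)
    var = np.sum(w * E**2) - mean**2
    return mean, var

mean, var = exact_Z_stats(3, 0.4407)
print(mean, var)  # <E> is negative and the variance positive at K > 0
```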
Let us take a look at the two-dimensional Ising model in more detail, using the fact that it is exactly solvable.

Figure 8. Energy probability distributions at two nearby temperatures K and K + ∆K, which overlap in the red and blue shaded areas. The red distribution is generated at temperature K.

The configurations are, however, misclassified as those at temperature K + ∆K by the maximum-likelihood method if they fall in the red shaded area, because P(E)_{K+∆K} > P(E)_K below E_0. The same argument holds for configurations generated in the blue shaded area.
Since the model is exactly solvable, the mean and variance of the energy can be evaluated analytically for a system with N sites; we do not incorporate the finite-size effect. We then obtain energy probability distributions approximated by Gaussian distributions with mean ⟨E⟩ and variance ⟨(E − ⟨E⟩)²⟩ at the temperatures of our interest. We show the energy distributions at 20 different temperatures, K = 0.05, 0.1, 0.15, . . . , 1.0, in Fig. 9. We see that the overlaps between the distributions are significant, while they are reduced around the critical temperature K_c^exact ≈ 0.4407. The accuracy obtained by the maximum-likelihood estimate is as low as 40.1%, implying that a thermometer constructed by machine learning can achieve an accuracy of 40.1% at most. Nevertheless, it turns out that the trained neural network can work as a phase-transition detector.
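The overlap argument can be made concrete numerically. The following Python sketch (with hypothetical means and widths, not the exact Ising values) computes the maximum-likelihood accuracy bound for a set of Gaussian energy distributions:

```python
import numpy as np

def ml_accuracy(means, sigmas, n_grid=20001):
    """Upper bound on temperature-prediction accuracy when each temperature
    produces a Gaussian energy distribution: a configuration is assigned to
    the temperature with the highest likelihood at its energy."""
    lo = min(means) - 6 * max(sigmas)
    hi = max(means) + 6 * max(sigmas)
    E = np.linspace(lo, hi, n_grid)
    dE = E[1] - E[0]
    # p[k] = Gaussian density of temperature k over the energy grid
    p = np.array([np.exp(-(E - m)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
                  for m, s in zip(means, sigmas)])
    winner = np.argmax(p, axis=0)   # maximum-likelihood decision at each energy
    # Accuracy = average over temperatures of the probability mass each
    # distribution places on the region where it wins.
    return float(np.mean([np.sum(p[k, winner == k]) * dE
                          for k in range(len(means))]))

# Toy comparison: well-separated vs strongly overlapping distributions.
print(ml_accuracy([-3.0, 0.0, 3.0], [0.3, 0.3, 0.3]))  # close to 1
print(ml_accuracy([-0.5, 0.0, 0.5], [1.0, 1.0, 1.0]))  # well below 1
```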