Model-free optimization of power/efficiency tradeoffs in quantum thermal machines using reinforcement learning

Abstract A quantum thermal machine is an open quantum system that enables the conversion between heat and work at the micro- or nano-scale. Optimally controlling such out-of-equilibrium systems is a crucial yet challenging task with applications to quantum technologies and devices. We introduce a general model-free framework based on reinforcement learning to identify out-of-equilibrium thermodynamic cycles that are Pareto-optimal trade-offs between power and efficiency for quantum heat engines and refrigerators. The method does not require any knowledge of the quantum thermal machine, nor of the system model, nor of the quantum state. Instead, it only observes the heat fluxes, so it is applicable both to simulations and to experimental devices. We test our method on a model of an experimentally realistic refrigerator based on a superconducting qubit, and on a heat engine based on a quantum harmonic oscillator. In both cases, we identify the Pareto front representing optimal power-efficiency trade-offs, as well as the corresponding cycles. Such solutions outperform previous proposals made in the literature, such as optimized Otto cycles, by reducing quantum friction.

However, the optimal control of such devices, necessary to reveal their maximum performance, is an extremely challenging task, and its solution could find application in the control of quantum technologies and devices beyond QTMs. The difficulties include: (i) having to operate in finite time, the state can be driven far from equilibrium, where the thermal properties of the system are model-specific; (ii) the optimization is a search over the space of all possible time-dependent controls, whose size increases exponentially with the number of time points describing the cycle; (iii) in experimental devices, often subject to undesired effects such as noise and decoherence [18], we may have only limited knowledge of the actual model describing the dynamics of the QTM.
A further difficulty (iv) arises in QTMs, since the maximization of their performance requires a multi-objective optimization. Indeed, the two main quantities that describe the performance of a heat engine (refrigerator) are the extracted power (cooling power) and the efficiency (coefficient of performance). The optimal strategy to maximize the efficiency consists of performing reversible transformations [19], which are, however, infinitely slow and thus deliver vanishing power. Conversely, maximum power is typically reached at the expense of reduced efficiency. Therefore, one must seek optimal trade-offs between the two.
In general, aside from variational approaches, there is no guarantee that these regimes and cycles are optimal. Recently, reinforcement learning (RL) has been used to find cycles that maximize the power of QTMs without making assumptions on the cycle structure [69]. However, this approach requires a model of the system and knowledge of its quantum state, which restricts its practical applicability. This calls for the development of robust and general strategies that overcome all of the above-mentioned difficulties (i-iv).
We propose an RL-based method with the following properties: (i) it finds cycles yielding near Pareto-optimal trade-offs between power and efficiency, i.e. the collection of cycles such that it is not possible to further improve either power or efficiency without decreasing the other one; (ii) it only requires the heat currents as input, and not the quantum state of the system; (iii) it is completely model-free; (iv) it does not make any assumption on the cycle structure, nor on the driving speed. The RL method is based on the Soft Actor-Critic algorithm [70,71], introduced in the context of robotics and video games [72,73], generalized to combined discrete and continuous actions.

FIG. 1. Schematic representation of a quantum thermal machine controlled by a computer agent. A quantum system (gray circle) can be coupled to a hot (cold) bath at inverse temperature β_H (β_C), represented by the red (blue) square, enabling a heat flux J_H(t) (J_C(t)). The quantum system is controlled by the computer agent through a set of experimental control parameters u⃗(t), such as an energy gap or an oscillator frequency, that control the power exchange P(t), and through a discrete control d(t) = {Hot, Cold, None} that determines which bath is coupled to the quantum system.
We prove the validity of our approach by finding the full Pareto front, i.e. the collection of all Pareto-optimal cycles describing optimal power-efficiency trade-offs, in two paradigmatic systems that have been well studied in the literature: a refrigerator based on an experimentally realistic superconducting qubit [6,49], and a heat engine based on a quantum harmonic oscillator [43]. In both cases we find elaborate cycles that outperform previous proposals by mitigating quantum friction [43,49,55,66,93-95], i.e. the detrimental effect of the generation of coherence in the instantaneous eigenbasis during the cycle. Remarkably, we can also match the performance of cycles found with the RL method of Ref. [69] which, as opposed to our model-free approach, requires monitoring the full quantum state and only optimizes the power.

Setting: Black-box Quantum Thermal Machine
We describe a QTM by a quantum system, acting as a "working medium", that can exchange heat with a hot (H) or cold (C) thermal bath characterized by inverse temperatures β_H < β_C (Fig. 1). Our method can be readily generalized to multiple baths, but we focus the description on two baths here.
We can control the evolution of the quantum system and exchange work with it through a set of time-dependent continuous control parameters u⃗(t) that enter the Hamiltonian H[u⃗(t)] of the quantum system [96], and through a discrete control d(t) = {Hot, Cold, None} that determines which bath is coupled to the system. J_H(t) and J_C(t) denote the heat flux flowing out of the hot and cold bath, respectively, at time t.
Our method only relies on the following two assumptions: 1. the RL agent can measure the heat fluxes J_C(t) and J_H(t) (or their averages over a time period ∆t); 2. the coupling of the quantum system to the thermal baths drives the system towards a thermal state within some finite timescale T. In contrast to previous work [69], the RL optimization algorithm does not require any knowledge of the microscopic model of the inner workings of the quantum system, nor of its quantum state; it is only provided with the values of the heat fluxes J_C(t) and J_H(t). These can be either computed from a theoretical simulation of the QTM [69], or measured directly from an experimental device whenever the energy change in the heat bath can be monitored without influencing the energetics of the quantum system (see e.g. the experimental demonstrations [6-9]). In this sense, our quantum system is treated as a "black box", and our RL method is "model-free". Any theoretical model or experimental device satisfying these requirements can be optimized by our method, including classical stochastic thermal machines. The timescale T is finite because of energy dissipation, and it can be rigorously identified, e.g., within the weak system-bath coupling regime, and in the reaction coordinate framework that can describe non-Markovian and strong-coupling effects [97]. In a Markovian setting, T is related to the inverse of the characteristic thermalization rate.
The thermal machines we consider are the heat engine and the refrigerator. Up to an internal energy contribution that vanishes after each repetition of the cycle, the instantaneous power of a heat engine equals the extracted heat:

P(t) = J_H(t) + J_C(t),    (1)

and the cooling power of a refrigerator is:

P_cool(t) = J_C(t).    (2)

The entropy production is given by

Σ(t) = -β_H J_H(t) - β_C J_C(t),    (3)

where we neglect the contribution of the quantum system's entropy since it vanishes after each cycle.

Machine Learning Problem
Our goal is to identify cycles, i.e. periodic functions u⃗(t) and d(t), that maximize a trade-off between power and efficiency in the long run. Since power and efficiency cannot be simultaneously optimized, we use the concept of Pareto optimality [98,99]. Pareto-optimal cycles are those where neither power nor efficiency can be further increased without sacrificing the other one. The Pareto front, defined as the collection of power-efficiency values delivered by all Pareto-optimal cycles, represents all possible optimal trade-offs. To find the Pareto front, we define the reward function r_c(t) as:

r_c(t) = c P(t)/P_0 - (1 - c) Σ(t)/Σ_0,    (4)

where P(t) is the power of a heat engine (Eq. 1) or the cooling power of a refrigerator (Eq. 2), P_0 and Σ_0 are reference values to normalize the power and entropy production, and c ∈ [0, 1] is a weight that determines the trade-off between power and efficiency. As in Ref. [69], we are interested in cycles that maximize the long-term performance of QTMs; we thus maximize the return ⟨r_c⟩(t), where ⟨•⟩(t) indicates the exponential moving average of future values:

⟨r_c⟩(t) = κ ∫_t^∞ e^{-κ(t'-t)} r_c(t') dt'.    (5)

Here κ is the inverse of the averaging timescale which, in practice, will be chosen much longer than the cycle period, such that ⟨r_c⟩(t) is approximately independent of t.
For c = 1, we are maximizing the average power ⟨r_1⟩ = ⟨P⟩/P_0. For c = 0, we are minimizing the average entropy production ⟨r_0⟩ = -⟨Σ⟩/Σ_0, which corresponds to maximizing the efficiency. For intermediate values of c, the maximization of ⟨r_c⟩ describes trade-offs between power and efficiency (see "Optimizing the entropy production" in Materials and Methods for details). Interestingly, it has been shown that, if the Pareto front is convex, it can be fully identified by repeating the optimization of ⟨r_c⟩ for many values of c [98,100].
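The role of the weight c can be illustrated on a toy machine whose average power and entropy production are known in closed form. The sketch below is pure illustration: the quadratic P(θ) and Σ(θ) of a single "cycle speed" parameter θ are invented stand-ins, not the paper's systems. It repeats the single-objective maximization of c⟨P⟩/P_0 - (1-c)⟨Σ⟩/Σ_0 for several values of c, tracing out a power-dissipation trade-off curve:

```python
import numpy as np

# Toy stand-ins for a machine's average power and entropy production as a
# function of a single "cycle speed" theta (invented for illustration: faster
# driving first raises then lowers power, but always raises dissipation).
def avg_power(theta):
    return theta * (1.0 - theta)

def avg_entropy(theta):
    return theta ** 2

thetas = np.linspace(1e-4, 1.0, 10_000)
front = []
for c in np.linspace(0.0, 1.0, 11):
    # maximize <r_c> = c <P>/P_0 - (1 - c) <Sigma>/Sigma_0, with P_0 = Sigma_0 = 1
    returns = c * avg_power(thetas) - (1.0 - c) * avg_entropy(thetas)
    best = thetas[np.argmax(returns)]
    front.append((c, avg_power(best), avg_entropy(best)))

# For this toy model the optimum is theta* = c/2: large c selects fast, powerful,
# dissipative cycles; small c selects slow, nearly reversible ones.
assert all(abs(P - (c / 2) * (1 - c / 2)) < 1e-3 for c, P, _ in front)
```

Each (power, entropy-production) pair in `front` is one point of the trade-off curve; in the paper the inner maximization over a grid is replaced by the RL optimization over cycles.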

Deep reinforcement learning for black-box quantum thermal machines
In RL, a computer agent must learn to master some task by repeated interactions with some environment.
Here we develop an RL approach where the agent maximizes the return (5) and the environment is the QTM with its controls (Fig. 2A). To solve the RL problem computationally, we discretize time as t_i = i∆t. By time-discretizing the return (5), we obtain a discounted return whose discount factor γ = exp(-κ∆t) determines the averaging timescale and expresses how much we are interested in future or immediate rewards (see "Reinforcement Learning Implementation" in Materials and Methods for details). At each time step t_i, the agent employs a policy function π(a|s) to choose an action a_i = {u⃗(t_i), d(t_i)} based on the state s_i of the environment. Here, the policy function π(a|s) represents the probability of choosing action a, given that the environment is in state s, u⃗(t) are the continuous controls over the quantum system, and d(t_i) ∈ {Hot, Cold, None} is a discrete control that selects the bath the system is coupled to. All controls are considered to be constant during each time step of duration ∆t. The aim of RL is to learn an optimal policy function π(a|s) that maximizes the return.
In order to represent a black-box quantum system whose inner mechanics are unknown, we define the control history during a time interval of length T as the observable state:

s_i = {a_{i-N}, ..., a_{i-1}},

where N = T/∆t. Therefore, the state of the quantum system is implicitly defined by the sequence of the agent's N most recent actions.
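Since the observable state is just the history of the agent's own recent actions, it can be maintained with a fixed-length buffer; a minimal sketch (the action fields and the value of N are illustrative, not the paper's hyperparameters):

```python
from collections import deque

N = 8  # number of time steps spanning the interval T = N * dt (illustrative value)

# Each action stores the continuous control u and the discrete bath choice d;
# the observable state is simply the buffer of the N most recent actions.
state = deque([(0.0, "None")] * N, maxlen=N)

def step(u, d):
    """Apply action (u, d): the oldest action drops out of the state automatically."""
    state.append((u, d))
    return tuple(state)  # what the agent observes at the next time step

step(0.3, "Hot")
observed = step(0.7, "Cold")
assert len(observed) == N             # the state always spans exactly N actions
assert observed[-1] == (0.7, "Cold")  # most recent action
assert observed[-2] == (0.3, "Hot")
```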
To find an optimal policy we employ the soft actor-critic algorithm, which additionally relies on learning a value function Q(s, a), generalized to a combination of discrete and continuous actions [70-73]. The policy function π(a|s) plays the role of an "actor" that chooses the actions to perform, while the value function Q(s, a) plays the role of a "critic" that judges the choices made by the actor, thus providing feedback to improve the actor's behavior. We further optimize the method for a multi-objective setting by introducing a separate critic for each objective, i.e. one value function for the power, and one for the entropy production. This allows us to vary the weight c during training, thus enhancing convergence (see "Reinforcement Learning Implementation" in Materials and Methods for details).
We learn the functions π(a|s) and Q(s, a) using a deep NN architecture inspired by WaveNet, an architecture that was developed for processing audio signals [101] (see Figs. 2B-C). We introduce a "convolution block" to efficiently process the time series of actions defining the state s_i. It consists of a 1D convolution with kernel size and stride of 2, such that it halves the length of the input. It is further equipped with a residual connection to improve trainability [102] (see "Reinforcement Learning Implementation" in Materials and Methods for details). The policy π(a_i|s_i) is described by a NN that takes the state s_i as input, and outputs parameters µ and σ describing the probability distribution from which action a_i is sampled (Fig. 2B). The value function Q(s_i, a_i) is computed by feeding (s_i, a_i) into a NN that outputs Q(s_i, a_i) (Fig. 2C). Both π(a_i|s_i) and Q(s_i, a_i) process the state by feeding it through multiple convolution blocks (upper orange boxes in Figs. 2B and 2C), each one halving the length of the time series, such that the number of blocks and of parameters in the NN is logarithmic in N. A series of fully-connected layers then produces the final output.
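The essential property of these blocks, each one halving the time series so that the depth grows only as log2 N, can be sketched without any deep-learning library (the weights and the pairwise-mean residual below are placeholders, not the trained architecture):

```python
import numpy as np

def conv_block(x, w, b):
    """1D convolution with kernel size 2 and stride 2, plus a residual connection.

    Non-overlapping pairs of inputs are combined, so the output is half as long;
    the residual path downsamples the input by pairwise averaging to match shapes.
    """
    pairs = x.reshape(-1, 2)          # shape (len(x)//2, 2)
    out = np.tanh(pairs @ w + b)      # "convolution" output, length len(x)//2
    res = pairs.mean(axis=1)          # downsampled residual
    return out + res

N = 128                               # length of the action history
x = np.random.default_rng(0).normal(size=N)
depth = 0
while len(x) > 1:
    x = conv_block(x, w=np.array([0.6, -0.4]), b=0.1)
    depth += 1

# Each block halves the sequence, so only log2(N) blocks are needed.
assert depth == int(np.log2(N)) == 7
```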
The policy and value functions are determined by minimizing the loss functions in Eqs. (39) and (49) using the ADAM optimization algorithm [103]. The gradient of the loss functions is computed off-policy, over a batch of past experience recorded in a replay buffer, using backpropagation (see "Reinforcement Learning Implementation" in Materials and Methods for details).

Pareto-optimal cycles for a superconducting qubit refrigerator
We first consider a refrigerator based on an experimentally realistic system: a superconducting qubit coupled to two resonant circuits that behave as heat baths [49] (Fig. 3A). Such a system was experimentally studied in the steady state in Ref. [6]. The system Hamiltonian is given by [49,55,63]:

Ĥ[u(t)] = E_0 [∆ σ̂_x + u(t) σ̂_z],

where E_0 is a fixed energy scale, ∆ characterizes the minimum gap of the system, and u(t) is our control parameter. In this setup the coupling to the baths, described by the commonly employed Markovian master equation [104-107], is fixed and cannot be controlled. However, the qubit is resonantly coupled to the baths at different energies. The u-dependent coupling strength to the cold (hot) bath is described by the function γ_C(u) (γ_H(u)) (Fig. 3F). As in Ref. [63], the coupling strength is maximal at u = 0 (u = 1/2), with a resonance width determined by the "quality factor" Q_C (Q_H) (see "Physical model" in Materials and Methods for details). This allows us to choose which bath is coupled to the qubit by tuning u(t).
In Fig. 3 we show an example of our training procedure to optimize the return ⟨r_c⟩ at c = 0.6, using N = 128 steps to determine the RL state, and varying c during training from 1 to 0.6 (Fig. 3C). In the early stages of the training, the return ⟨r_c⟩_i, computed as in Eq. (28) but over past rewards, and the running averages of the cooling power ⟨P_cool⟩_i and of the negative entropy production -⟨Σ⟩_i all start off negative (Fig. 3B), and the corresponding actions are random (left panel of Fig. 3D). Indeed, initially the RL agent has no experience controlling the QTM, so random actions are performed, resulting in heating the cold bath, rather than cooling it, and in a large entropy production. However, with increasing steps, the chosen actions exhibit some structure (Fig. 3D), and the return ⟨r_c⟩_i increases (Fig. 3B). While both the power and the negative entropy production initially increase together, around step 100k we see that -⟨Σ⟩_i begins to decrease. This is a manifestation of the fact that power and entropy production cannot be simultaneously optimized. Indeed, the agent learns that in order to further increase the return, it must "sacrifice" some entropy production to produce a positive and larger cooling power. In fact, the only way to achieve positive values of ⟨r_c⟩_i is to have a positive cooling power, which inevitably requires producing entropy. Eventually all quantities in Fig. 3B reach a maximum value, and the corresponding final deterministic cycle (i.e. the cycle generated by the policy with stochasticity switched off, see "Reinforcement Learning Implementation" in Materials and Methods for details) is shown in Fig. 3E as thick black dots.
For the same system, Ref. [63] proposed a smoothed trapezoidal cycle u(t) oscillating between the resonant peaks at u = 0 and u = 1/2, with an optimized cycle time (Fig. 3E, dashed line). While this choice outperformed a sine and a trapezoidal cycle [49], the cycle found by our RL agent produces a larger return (Fig. 3B). The optimal trapezoidal cycle found for c = 0.6 is shown in Fig. 3E.

Fig. 4 compares optimal cycles for different trade-offs between cooling power and coefficient of performance η_cool, the latter defined as the ratio between the average cooling power and the average input power. This is achieved by repeating the optimization for various values of c. To demonstrate the robustness of our method, the optimization of ⟨r_c⟩ was repeated 5 times for each choice of c (variability shown with error bars in Fig. 4A, and as separate points in Fig. 4B). The RL method substantially outperforms the trapezoidal cycle by producing larger final values of the return ⟨r_c⟩ at all values of c (Fig. 4A), and by producing a better Pareto front (Fig. 4B). The RL cycles simultaneously yield higher power, by more than a factor of 10, and a larger η_cool, for any choice of the power-efficiency trade-off. The model-free RL cycles can also deliver the same power at a substantially higher COP (roughly 10 times larger) as compared to the cycle found with the RL method of Ref. [69], which only optimizes the power. This is remarkable since, as opposed to the current model-free method, the method in Ref.
[69] has access to the full quantum state of the system, and not only to the heat currents (see "Comparing with other methods" in Materials and Methods for details). This also shows that a large efficiency improvement can be achieved by sacrificing very little power. As expected, the period of the RL cycles increases as c decreases and the priority shifts from high power to high η_cool (Figs. 4C-F, black dots). However, the period is much shorter than that of the corresponding optimized trapezoidal cycle (dashed line), and the optimal control sequence is quite unintuitive, even going beyond the resonant point at u = 1/2. As argued in [49,55,63], the generation of coherence in the instantaneous eigenbasis of the quantum system, occurring because [Ĥ(u_1), Ĥ(u_2)] ≠ 0 for u_1 ≠ u_2, causes power losses that increase with the speed of the cycle. We can interpret the power enhancement achieved by our cycle as a mitigation of this detrimental effect: indeed, we find that trapezoidal cycles operated at the same frequency as the RL cycle generate twice as much coherence as the RL cycles (see "Generation of coherence" in Materials and Methods for details). In either case, cycles with higher power tend to generate more coherence.
Given the stochastic nature of RL, we also compared the cycles obtained across the 5 independent training runs, finding that cycles are typically quite robust, displaying only minor changes (see Fig. 8 of Methods for four cycles found in independent training runs corresponding to Figs. 4C-F).
Pareto-optimal cycles for a quantum harmonic oscillator engine

We now consider a heat engine based on a collection of non-interacting particles confined in a harmonic potential [43] (Fig. 5A). The Hamiltonian is given by

Ĥ[u(t)] = p̂²/(2m) + (1/2) m u(t)² w_0² q̂²,

where m is the mass of the system, w_0 is a reference frequency, and p̂ and q̂ are the momentum and position operators. The control parameter u(t) allows us to change the frequency of the oscillator. Here, at every time step we let the agent choose which bath (if any) to couple to the oscillator. The coupling to the baths, characterized by the thermalization rates Γ_α, is modeled using the Lindblad master equation as in Ref. [43] (see "Physical model" in Materials and Methods for details). In contrast to the superconducting qubit case, c is held constant during training. Fig. 5 reports the results on the optimal trade-offs between extracted power and efficiency η_heat, the latter defined as the ratio between the extracted power and the input heat, in the same style as Fig. 4. In this setup, we compare our RL-based results to the well-known Otto cycle. The authors of Ref. [43] study this system by optimizing the switching times of an Otto cycle, i.e. the duration of each of its 4 segments, shown as dashed lines in Figs. 5D-E.

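For reference, the Otto-cycle baseline with its four tunable stroke durations amounts to a piecewise control schedule; the generator below is purely schematic (the stroke durations, the u-range, and which isochore couples to which bath are placeholder choices, not the optimized values of Ref. [43]):

```python
def otto_schedule(dt, t_ramp_up, t_hot, t_ramp_down, t_cold, u_min=0.5, u_max=1.0):
    """One period of an Otto cycle as a list of (u, d) actions on a grid of step dt.

    Four strokes: unitary compression (bath decoupled, u ramps up), an isochore
    coupled to the Hot bath (u fixed at u_max), unitary expansion, and an isochore
    coupled to the Cold bath (u fixed at u_min). The four durations are the
    quantities optimized in the Otto baseline.
    """
    steps = lambda t: max(1, round(t / dt))
    cycle = []
    n = steps(t_ramp_up)
    cycle += [(u_min + (u_max - u_min) * (k + 1) / n, "None") for k in range(n)]
    cycle += [(u_max, "Hot")] * steps(t_hot)
    n = steps(t_ramp_down)
    cycle += [(u_max - (u_max - u_min) * (k + 1) / n, "None") for k in range(n)]
    cycle += [(u_min, "Cold")] * steps(t_cold)
    return cycle

cycle = otto_schedule(dt=0.1, t_ramp_up=0.5, t_hot=1.0, t_ramp_down=0.5, t_cold=1.0)
assert len(cycle) == 5 + 10 + 5 + 10
assert all(u == 1.0 and d == "Hot" for u, d in cycle[5:15])  # hot isochore
```

The RL agent searches over arbitrary action sequences of this same (u, d) form, of which such four-stroke schedules are only a small subset.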
The RL method produces cycles with a larger return and with a better power-efficiency Pareto front with respect to the Otto cycle (Figs. 5B,C). The cycles found by the RL method significantly outperform the Otto engine in terms of delivered power. For c = 1, a high-power cycle is found (Fig. 5D and corresponding blue dots in Figs. 5B-C), but at the cost of a lower efficiency than the Otto cycles. However, at c = 0.5, the RL method finds a cycle that matches the maximum efficiency of the Otto cycles, while delivering a ∼30% higher power (Fig. 5E and corresponding blue dots in Figs. 5B-C). Remarkably, our model-free RL method also finds cycles with nearly the same power as the RL method of Ref. [69], but at almost twice the efficiency (see "Comparing with other methods" in Materials and Methods for details). As in Fig. 4, we see that a very small decrease in power can lead to a large efficiency increase.
Interestingly, as shown in Figs. 5D-E, the cycles found by the RL agent share many similarities with the Otto cycle: both alternate between the hot and cold bath (orange and blue portions) with a similar period. However, there are some differences: at c = 1, the RL cycle ramps the value of u while in contact with the bath, eliminating the unitary stroke (Fig. 5D). Instead, at c = 0.5, the RL agent employs a unitary stroke that is quite different from a linear ramp of u (Fig. 5E, green dots). As in the superconducting qubit case, the enhanced performance of the RL cycle may be interpreted as a mitigation of quantum friction [43,93].
Also in this setup, we verified that the discovered cycles are quite robust across the 5 independent training runs, displaying only minor changes (see Fig. 9 of Methods for two cycles found in independent training runs corresponding to Figs. 5D-E).

DISCUSSION
We introduced a model-free framework, based on reinforcement learning, to discover Pareto-optimal thermodynamic cycles that describe the best possible trade-offs between power and efficiency of out-of-equilibrium quantum thermal machines (heat engines and refrigerators). Our algorithm only requires monitoring the heat fluxes of the QTM, making it a model-free approach. It can therefore be used both for the theoretical optimization of known systems, and potentially for the direct optimization of experimental devices for which no model is known, in the absence of any measurement performed on the quantum system. Using state-of-the-art machine learning techniques, we demonstrated the validity of our method by applying it to two different prototypical setups. Our black-box method discovered elaborate cycles that outperform previously proposed cycles and are on par with a previous RL method that observes the full quantum state [69]. Up to minor details, the cycles found by our method are reproducible across independent training runs. Physically, we find that Otto cycles, commonly studied in the literature, are not generally optimal, and that optimal cycles balance a fast operation of the cycle with the mitigation of quantum friction.
Our method paves the way for a systematic use of RL in the field of quantum thermodynamics. Future directions include investigating larger systems to uncover the impact of quantum many-body effects on the performance of QTMs, optimizing systems in the presence of noise, and optimizing trade-offs that include power fluctuations [99,108-110].

METHODS
In this section we provide details on the optimization of the entropy production, on the reinforcement learning implementation, on the physical model used to describe the quantum thermal machines, on the training details, on the convergence of the method, on the comparison with other methods, and on the computation of the generation of coherence during the cycles.We also provide access to the full code that was used to generate the results presented in the manuscript, and the corresponding data.

Optimizing the entropy production
Here we discuss the relation between optimizing the power and the entropy production, or the power and the efficiency. We start by noticing that we can express the efficiency of a heat engine η_heat and the coefficient of performance of a refrigerator η_cool in terms of the averaged power and entropy production, i.e.

η_ν = η_ν^(C) / (1 + ⟨Σ⟩/(β_ν ⟨P_ν⟩)),    (9)

where ν = heat, cool, η_ν^(C) is the Carnot efficiency (coefficient of performance), and where we defined β_heat ≡ β_C and β_cool ≡ β_C - β_H. We now show that, thanks to this dependence of η_ν on ⟨P_ν⟩ and ⟨Σ⟩, if a cycle is a Pareto-optimal trade-off between high power and high efficiency, then it is also a Pareto-optimal trade-off between high power and low entropy production, up to a change of c. This means that if we find all optimal trade-offs between high power and low entropy production (as we do with our method if the Pareto front is convex), we will necessarily also have found all Pareto-optimal trade-offs between high power and high efficiency.
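The relation between efficiency, power, and entropy production can be checked from first principles: for a heat engine, eliminating J_H from P = J_H + J_C, Σ = -β_H J_H - β_C J_C, and η_heat = P/J_H yields η_heat = η_heat^(C) / (1 + Σ/(β_C P)). A quick numerical consistency check (the flux and temperature values are arbitrary illustrative numbers):

```python
# Illustrative heat-engine fluxes and inverse temperatures (arbitrary numbers).
beta_H, beta_C = 0.5, 1.0
J_H, J_C = 2.0, -1.5                    # heat drawn from hot bath, dumped into cold

P = J_H + J_C                           # extracted power
Sigma = -beta_H * J_H - beta_C * J_C    # entropy production (here positive, as required)
eta = P / J_H                           # efficiency from its definition

eta_carnot = 1.0 - beta_H / beta_C
# The same efficiency expressed through <P> and <Sigma>, with beta_heat = beta_C:
eta_from_relation = eta_carnot / (1.0 + Sigma / (beta_C * P))
assert abs(eta - eta_from_relation) < 1e-12
```

The refrigerator case works identically with P_cool = J_C, the Carnot COP β_H/(β_C - β_H), and β_cool = β_C - β_H.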
Mathematically, we want to prove that the cycles that maximize a weighted power-efficiency objective of the form c ⟨P_ν⟩/P_0 + (1 - c) η_ν, for some value of c ∈ [0, 1], also maximize the return in Eq. (5) for some (possibly different) value of c ∈ [0, 1].
To simplify the proof and the notation, we consider the following two functions,

F(a, b, θ) = a P(θ) - b Σ(P(θ), η(θ)),
G(a, b, θ) = a P(θ) + b η(θ),

where P(θ) and η(θ) represent the power and efficiency of a cycle parameterized by a set of parameters θ, a > 0 and b > 0 are two scalar weights, and Σ(P, η) = β_ν P (η_ν^(C)/η - 1) is obtained by inverting Eq. (9). We wish to prove the following: given some weights a_1 > 0 and b_1 > 0, let θ_1 be the value of θ that locally maximizes G(a_1, b_1, θ); then it is always possible to identify positive weights a_2 > 0, b_2 > 0 such that the same parameters θ_1 (i.e. the same cycle) are a local maximum of F(a_2, b_2, θ). In the following, we will use the signs of the partial derivatives of Σ(P, η), given in Eq. (13), and the Hessian H^(Σ) of Σ(P, η), given in Eq. (14). Proof: by assumption, θ_1 is a local maximum of G(a_1, b_1, θ). Denoting by ∂_i the partial derivative with respect to (θ)_i, we thus have the stationarity condition of Eq. (15). Now, let us compute the derivative in θ of F(a_2, b_2, θ_1), where a_2 > 0 and b_2 > 0 are two arbitrary positive coefficients. Therefore, if we choose a_2 and b_2 according to Eq. (17), thanks to Eq. (15) we have that the same parameters θ_1 that nullify the gradient of G also nullify the gradient of F, at the different choice of the weights given by Eq. (17). The invertibility of Eq. (17) (i.e. a non-null determinant of the matrix) is guaranteed by Eq. (13). We also have to make sure that if a_1 > 0 and b_1 > 0, then also a_2 > 0 and b_2 > 0. To do this, we invert Eq. (17), finding Eq. (19). It is now easy to see that the weights a_2 and b_2 are also positive using Eq. (13).
To conclude the proof, we show that θ_1 is a local maximum of F(a_2, b_2, θ) by showing that its Hessian is negative semi-definite. Since, by hypothesis, θ_1 is a local maximum of G(a_1, b_1, θ), the Hessian matrix of G is negative semi-definite. We now compute the Hessian of F (Eq. (21)), which involves H^(Σ), the Hessian of Σ(P, η) computed at P(θ_1) and η(θ_1). Since we are interested in studying the Hessian of F(a_2, b_2, θ_1) at the special point (a_2, b_2) previously identified, we substitute Eq. (19) into Eq. (21), yielding Eq. (22). We now prove that H^(F)_ij is negative semi-definite, since it is the sum of negative semi-definite matrices. By hypothesis, H^(G)_ij is negative semi-definite. Recalling Eq. (13) and that b_1 > 0, we then need to show that Q_ij is positive semi-definite. Plugging Eq. (14) into Eq. (22) yields Eq. (23), with R_ij defined in Eq. (25). We now show that if R_ij is positive semi-definite, then Q_ij is also positive semi-definite. By definition, Q_ij is positive semi-definite if, for any set of coefficients a_i, we have Σ_ij a_i Q_ij a_j ≥ 0. Assuming R_ij to be positive semi-definite, and using that β_ν, η > 0, this follows by defining x_i ≡ (∂_i η) a_i. We thus have to prove the positivity of R_ij. We do so by showing that it is the sum of three positive semi-definite matrices. Indeed, the first term in Eq. (25), 2P/η, is proportional to a matrix with 1 in all entries. Trivially, this matrix has one positive eigenvalue, while all other ones are null, so it is positive semi-definite. Finally, S_ij and its transpose have the same positivity, so we focus only on S_ij. S_ij is a matrix with all equal columns. This means that all its eigenvalues are null, except for a single one that we denote by λ. Since the trace of a matrix is equal to the sum of its eigenvalues, we have λ = Tr[S] = Σ_i S_ii. Using the optimality condition in Eq. (15), we see that each entry of S is positive, i.e. S_ij > 0.
Therefore λ > 0, thus S is positive semi-definite, concluding the proof that H^(F)_ij is negative semi-definite.
To conclude, we notice that we can always renormalize a 2 and b 2 , preserving the same exact optimization problem.This way, a value of c ∈ [0, 1] can be identified.

Reinforcement Learning Implementation
As discussed in the main text, our goal is to maximize the return ⟨r_c⟩(t) defined in Eq. (5). To solve the problem within the RL framework, we discretize time as t_i = i∆t. At every time step t_i, the aim of the agent is to learn an optimal policy that maximizes, in expectation, the time-discretized return ⟨r_c⟩_i. The time-discrete reward and return functions are given by:

r_{c,i} = (1/∆t) ∫_{t_{i-1}}^{t_i} r_c(t) dt,    (27)

⟨r_c⟩_i = Σ_{k=0}^∞ γ^k r_{c,i+k+1}.    (28)

Eq. (28) is the time-discrete version of Eq. (5), where the discount factor γ = exp(-κ∆t) determines the averaging timescale and expresses how much we are interested in future or immediate rewards. To be precise, plugging Eq. (27) into Eq. (28) gives ⟨r_c⟩(t) (up to an irrelevant constant prefactor) only in the limit ∆t → 0. However, also for finite ∆t, both quantities are time averages of the reward, so they are equally valid definitions to describe a long-term trade-off maximization.
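The correspondence between the continuous exponential moving average of Eq. (5) and the discounted sum can be checked numerically; the sketch below uses an invented reward signal sin²(t) and compares a truncated discounted sum against the closed-form exponential average:

```python
import math

kappa, dt = 0.05, 0.01
gamma = math.exp(-kappa * dt)        # discount factor of the time-discrete return

r = lambda t: math.sin(t) ** 2       # an arbitrary illustrative reward signal

# Time-discrete return: discounted sum of future rewards, truncated where
# gamma**K ~ e^-20 is negligible.
K = 40_000
discrete = sum(gamma ** k * r((k + 1) * dt) for k in range(K))

# Exponential moving average of Eq. (5), evaluated in closed form:
# kappa * int_0^inf exp(-kappa t) sin(t)^2 dt = 1/2 - kappa^2 / (2 (kappa^2 + 4)).
continuous = 0.5 - kappa ** 2 / (2 * (kappa ** 2 + 4))

# The two agree up to the constant prefactor (1 - gamma) ~ kappa * dt.
assert abs((1 - gamma) * discrete - continuous) < 2e-2 * continuous
```

For small ∆t, (1 - γ)/∆t → κ, which is the "irrelevant constant prefactor" mentioned above.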
As in Ref. [69], we use a generalization of the soft actor-critic (SAC) method, first developed for continuous actions [70,71], to handle a combination of discrete and continuous actions [72,73]. We further tune the method to stabilize convergence in a multi-objective scenario. Here we present an overview of our implementation of SAC, putting special emphasis on the differences with respect to the standard implementation; we refer to [70-73] for additional details. Our method, implemented with PyTorch, is based on modifications and generalizations of the SAC implementation provided by Spinning Up from OpenAI [111]. All code and data to reproduce the experiments are available online (see the Data Availability and Code Availability sections).
The SAC algorithm is based on policy iteration, i.e. it consists of iterating multiple times over two steps: a policy evaluation step and a policy improvement step. In the policy evaluation step, the value function of the current policy is (partially) learned, whereas in the policy improvement step a better policy is learned by making use of the value function. We now describe these steps in more detail.
In typical RL problems, the optimal policy π*(a|s) is defined as the policy that maximizes the expected return defined in Eq. (28), i.e.:

π* = argmax_π E_{s∼µ_π} E_π [ ⟨r_c⟩_0 | s_0 = s ],    (29)

where E_π denotes the expectation value choosing actions according to the policy π. The initial state s_0 = s is sampled from µ_π, i.e. the steady-state distribution of states that are visited by π. In the SAC method, the balance between exploration and exploitation [112] is achieved by introducing an entropy-regularized maximization objective. In this setting, the optimal policy π* is given by

π* = argmax_π E_{s∼B} E_π [ Σ_{k=0}^∞ γ^k ( r_{c,k+1} + α H(π(•|s_k)) ) | s_0 = s ],    (30)

where α ≥ 0 is known as the "temperature" parameter that balances the trade-off between exploration and exploitation, and

H(P) = E_{x∼P} [ -log P(x) ]    (31)

is the entropy of the probability distribution P. Notice that we replaced the unknown state distribution µ_π with B, which is a replay buffer populated during training by storing the observed one-step transitions (s_k, a_k, r_{k+1}, s_{k+1}). Developing on Ref. [69], we generalize this approach to a combination of discrete and continuous actions in the following way. Let us write an arbitrary action a as a = (u, d), where u is the continuous action and d is the discrete action (for simplicity, we describe the case of a single continuous action, though the generalization to multiple variables is straightforward). From now on, all functions of a are also to be considered as functions of u, d. We decompose the joint probability distribution of the policy as

π(a|s) = π_D(d|s) π_C(u|d, s),    (32)

where π_D(d|s) is the marginal probability of taking discrete action d, and π_C(u|d, s) is the conditional probability density of choosing action u, given action d (D stands for "discrete", and C for "continuous"). Notice that this decomposition is an exact identity, thus allowing us to describe correlations between the discrete and the continuous action. With this decomposition, we can write the entropy of a policy as

H(π(•|s)) = H_D(s) + E_{d∼π_D} [ H_C(d, s) ],    (33)

where

H_D(s) = H(π_D(•|s)),    H_C(d, s) = H(π_C(•|d, s))    (34)

correspond respectively to the entropy contribution of the discrete (D) and continuous (C) part. These two entropies take on values in different ranges:
while the entropy of a discrete distribution with |D| discrete actions is non-negative and upper bounded by log(|D|), the (differential) entropy of a continuous distribution can take on any value, including negative values (especially for peaked distributions). Therefore, we introduce a separate temperature for the discrete and continuous contributions, replacing the definition of the optimal policy in Eq. (30) with where α_C ≥ 0 and α_D ≥ 0 are two distinct "temperature" parameters. This is one of the differences with respect to Refs. [69-71]. Equation (35) defines our optimization objective. Accordingly, we define the value function Q_π(s, a) of a given policy π as Its recursive Bellman equation therefore reads As in Refs. [70, 71], we parameterize π_C(u|d, s) as a squashed Gaussian policy, i.e. as the distribution of the variable ξ ∼ N(0, 1), (38) where µ(d, s) and σ(d, s) represent respectively the mean and standard deviation of the Gaussian distribution, N(0, 1) is the normal distribution with zero mean and unit variance, and we assume that U = [u_a, u_b]. This is the so-called reparameterization trick.
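The hybrid policy decomposition and the reparameterization trick can be made concrete with a short numerical sketch. The snippet below is illustrative only (all numbers and the exact rescaling of tanh's range onto U are our assumptions, not the paper's code): it builds a hybrid policy from a discrete marginal π_D and Gaussian conditionals π_C, samples the continuous action through the squashing, and evaluates the two entropy contributions S_D and S_C.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hybrid policy at a fixed state s: |D| = 3 discrete actions,
# one continuous action u confined to U = [u_a, u_b]. Numbers are illustrative.
pi_D  = np.array([0.5, 0.3, 0.2])   # marginal pi_D(d|s)
mu    = np.array([0.1, -0.4, 0.7])  # Gaussian mean for each d
sigma = np.array([0.2, 0.5, 1.0])   # Gaussian std for each d
u_a, u_b = 0.0, 0.75

def sample_u(d, xi):
    """Reparameterization trick: u is a deterministic squashing of xi ~ N(0,1).
    Mapping tanh's (-1, 1) range onto [u_a, u_b] is our assumed convention."""
    return u_a + (u_b - u_a) * (np.tanh(mu[d] + sigma[d] * xi) + 1.0) / 2.0

samples = [sample_u(d, rng.standard_normal()) for d in range(3) for _ in range(300)]
assert all(u_a < u < u_b for u in samples)   # squashing keeps u inside U

# Entropy decomposition S = S_D + S_C:
S_D = -np.sum(pi_D * np.log(pi_D))           # 0 <= S_D <= log|D|
# S_C: differential entropy of each (unsquashed) Gaussian, weighted by pi_D;
# it can be negative for peaked (small sigma) conditionals.
S_C = np.sum(pi_D * 0.5 * np.log(2 * np.pi * np.e * sigma**2))
```

Note that S_D is bounded by log|D| while S_C is unbounded below, which is precisely why the two contributions are given separate temperatures.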
We now describe the policy evaluation step. In the SAC algorithm, we learn two value functions Q_ϕi(s, a) described by the learnable parameters ϕ_i, for i = 1, 2. Q_ϕ(s, a) is a function approximator, e.g. a neural network. Since Q_ϕi(s, a) should satisfy the Bellman Eq. (37), we define the loss function for Q_ϕi(s, a) as the mean square difference between the left- and right-hand sides of Eq. (37), i.e. where Notice that in Eq. (40) we replaced Q_π with min_{j=1,2} Q_ϕtarg,j, where ϕ_targ,j, for j = 1, 2, are target parameters which are not updated when minimizing the loss function; instead, they are held fixed during backpropagation, and then updated according to Polyak averaging, i.e.
where ρ_polyak is a hyperparameter. This change was shown to improve learning [70, 71]. In order to evaluate the expectation value in Eq. (40), we use the decomposition in Eq. (32) to write where we denote a′ = (u′, d′). Plugging Eq. (42) into Eq. (40) and writing the entropies explicitly as expectation values yields We then replace the expectation value over u′ in Eq. (43) with a single sampling u′ ∼ π_C(•|d′, s′) (therefore one sampling for each discrete action) performed using Eq. (38). This corresponds to performing a full average over the discrete action, and a single sampling of the continuous action.
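The target computation and Polyak averaging can be sketched as follows. This is a hedged illustration, not the paper's implementation: the function names are ours, the Q functions are stubbed as plain callables for a fixed next state s′, and the structure (full average over d′, one continuous sample per d′, minimum over two target critics, separate temperatures) follows the description above.

```python
import numpy as np

def compute_target(r, gamma, pi_D, sample_u, log_pi_C, q_targ_1, q_targ_2,
                   alpha_D, alpha_C):
    """Sketch of y(r, s') for a fixed next state s' (absorbed into the
    callables). pi_D[d] is the discrete marginal at s'; sample_u(d) draws
    one continuous action per discrete action (reparameterization trick)."""
    y = r
    for d, p_d in enumerate(pi_D):
        u = sample_u(d)                                  # single u' per d'
        q_min = min(q_targ_1(u, d), q_targ_2(u, d))      # clipped double-Q
        # entropy bonuses enter with separate temperatures alpha_D, alpha_C
        y += gamma * p_d * (q_min - alpha_D * np.log(p_d)
                            - alpha_C * log_pi_C(u, d))
    return y

def polyak_update(targ, params, rho):
    """phi_targ <- rho * phi_targ + (1 - rho) * phi, applied elementwise."""
    return rho * targ + (1 - rho) * params

# Sanity check with trivial stubs: one discrete action, zero temperatures.
y = compute_target(0.5, 0.9, [1.0], lambda d: 0.0, lambda u, d: 0.0,
                   lambda u, d: 1.0, lambda u, d: 1.0, 0.0, 0.0)
```

In the sanity check, y reduces to r + γ·Q = 0.5 + 0.9·1.0 = 1.4, i.e. the ordinary Bellman target.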
We now turn to the policy improvement step. Since we introduced two separate temperatures, we cannot use the loss function introduced in Refs. [70, 71]. Therefore, we proceed in two steps. Let us define the following function where Q_π_old(s, a) is the value function of some given "old policy" π_old, and π is an arbitrary policy. First, we prove that if a policy π_new satisfies for all values of s, then π_new is a better policy than π_old as defined in Eq. (35). Next, we will use this property to define a loss function that implements the policy improvement step. Equation (45) implies that We now use this inequality to show that π_new is a better policy. Starting from the Bellman equation (37) for Q_π_old, we have Eq. (47). Using a strategy similar to that described in Refs. [70, 112], in Eq. (47) we make repeated use of inequality (46) and of the Bellman equation for Q_π_old(s, a) to prove that the value function of π_new is better than or equal to the value function of π_old.
Let π_θ(a|s) be a parameterization of the policy function that depends on a set of learnable parameters θ. We define the following loss function Thanks to Eqs. (44) and (45), this choice guarantees that we find a better policy by minimizing L_π(θ) with respect to θ. In order to evaluate the expectation value in Eq. (48), as before we explicitly average over the discrete action and perform a single sample of the continuous action, and we replace Q_π_old with min_j Q_ϕj. Recalling the parameterization in Eq. (38), this yields We have defined and shown how to evaluate the loss functions L_Q(ϕ) and L_π(θ) that allow us to determine the value function and the policy [see Eqs. (39), (43) and (49)]. Now, we discuss how to automatically tune the temperature hyperparameters α_D and α_C. Ref. [71] shows that constraining the average entropy of the policy to a certain value leads to exactly the same SAC algorithm, with the addition of an update rule to determine the temperatures. Let H̄_D and H̄_C be respectively the fixed average values of the entropy of the discrete and continuous part of the policy. We can then determine the corresponding temperatures α_D and α_C by minimizing the following two loss functions As usual, we evaluate the entropies by explicitly taking the average over the discrete actions, and taking a single sample of the continuous action. To be more specific, we evaluate L_D by computing and L_C by computing and replacing the expectation value over u with a single sample.
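The temperature update implied by such constrained losses can be sketched as a single gradient step. This is the standard SAC temperature update (the paper's exact losses are Eqs. (51) and (52); the clipping at zero and the learning rate here are our choices for illustration): since the loss is linear in α with slope (H − H̄), gradient descent lowers α when the policy's entropy exceeds its target, and raises it otherwise.

```python
def update_temperature(alpha, entropy_current, entropy_target, lr=1e-3):
    """One SGD step on a temperature loss of the form L(alpha) = alpha*(H - H_bar).
    If the policy's entropy exceeds its target, alpha decreases (weaker
    exploration bonus), and vice versa. Clipping keeps alpha >= 0."""
    grad = entropy_current - entropy_target
    return max(0.0, alpha - lr * grad)

# Entropy above target: temperature shrinks, pushing the policy toward
# determinism; the same rule applies independently to alpha_D and alpha_C.
alpha_D = update_temperature(0.1, entropy_current=1.0, entropy_target=0.5)
```

Running the two updates independently is what lets the discrete entropy (bounded by log|D|) and the unbounded continuous differential entropy each track their own target.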
To summarize, the SAC algorithm consists of repeatedly cycling through a policy evaluation step, a policy improvement step, and a step where the temperatures are updated. The policy evaluation step consists of a single optimization step to minimize the loss functions L_Q(ϕ_i) (for i = 1, 2), given in Eq. (39), where y(r, s′) is computed using Eq. (43). The policy improvement step consists of a single optimization step to minimize the loss function L_π(θ) given in Eq. (49). The temperatures are then updated by performing a single optimization step to minimize L_D(α_D) and L_C(α_C), given respectively in Eqs. (51) and (52). In all loss functions, the expectation value over the states is approximated with a batch of experience sampled randomly from the replay buffer B.
We now detail how we parameterize π(a|s) and Q(s, a). The idea is to develop an efficient way to process the state, which can potentially be a long time-series of actions. To this aim, we introduce a "convolution block" as a building element for our NN architecture. The convolution block, detailed in Fig. 6, takes an input of size (C_in, L_in), where C_in is the number of channels (i.e. the number of parameters determining an action at every time-step) and L_in is the length of the time-series, and produces an output of size (C_out, L_out = L_in/2), thus halving the length of the time-series. Notice that we include a skip connection (right branch in Fig. 6) to improve trainability [102].
Using the decomposition in Eq. (32) and the parameterization in Eq. (38), the quantities that need to be parameterized are W(d|s), µ(d, s) and σ(d, s), introduced below.

FIG. 6. Schematic representation of the convolution block that takes as input a 1D time-series of size (C_in, L_in), where L_in is the length of the series and C_in is the number of channels, and produces an output of size (C_out, L_in/2). In this image L_in = 4. The output is produced by stacking a 1D convolution with kernel size and stride of 2, and a non-linearity (left branch). A residual connection (right branch), consisting only of linear operations, is added to improve trainability.

The architecture of the neural network that we use for the policy function is shown in Fig. 7A. The state, composed of the time-series s_i = (a_{i-N}, ..., a_{i-1}), which has shape (C_in, L_in = N), is fed through a series of log_2(N) convolutional blocks, which produce an output of shape (C_out, L = 1). The number of input channels C_in is determined by stacking the components of ⃗u (which, for simplicity, is a single real number u in this appendix) and by using a one-hot encoding of the discrete actions. We then feed this output, together with the last action, which has a privileged position, to a series of fully connected NNs with ReLU activations. Finally, a linear network outputs W(d|s), µ(d, s) and log(σ(d, s)), for all d = 1, ..., |D|.
The probabilities π_D(d|s) are then produced by applying the softmax operation to W(d|s). We parameterize the value function Q_ϕ(s, u, d) as in Fig. 7B. As for the policy function, the state s is fed through log_2(N) stacked convolution blocks, which reduce the input to shape (C_out, L = 1). This output, together with the action u, is fed into a series of fully-connected layers with ReLU activations. We then add a linear layer that produces |D| outputs, corresponding to the value of Q(s, u, d) for each d = 1, ..., |D|.
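A minimal numerical sketch of the convolution block of Fig. 6 is given below. It is our own toy rendition with random weights, assuming one particular linear form for the skip path (pairwise averaging followed by a 1×1 channel mixing); the paper's exact skip-path layout may differ. It shows the two branches and the length-halving behavior that lets log_2(N) stacked blocks reduce a series of length N = 128 to length 1.

```python
import numpy as np

def conv_block(x, w_conv, b_conv, w_skip):
    """Convolution-block sketch (cf. Fig. 6). x: (C_in, L_in) time-series.
    Left branch: 1D convolution, kernel size 2, stride 2, then ReLU.
    Right branch: linear skip (pairwise average + 1x1 channel mixing).
    Output: (C_out, L_in // 2), halving the series length."""
    c_in, L = x.shape
    pairs = x.reshape(c_in, L // 2, 2)                  # stride-2 windows
    conv = np.einsum('oik,ilk->ol', w_conv, pairs) + b_conv[:, None]
    left = np.maximum(conv, 0.0)                        # non-linearity (ReLU)
    skip = w_skip @ pairs.mean(axis=2)                  # linear operations only
    return left + skip

rng = np.random.default_rng(2)
C_in, C_out, N = 4, 8, 128
x  = rng.standard_normal((C_in, N))
w  = rng.standard_normal((C_out, C_in, 2)) * 0.1       # conv kernel (k=2)
b  = np.zeros(C_out)
ws = rng.standard_normal((C_out, C_in)) * 0.1          # skip 1x1 mixing
out = conv_block(x, w, b, ws)                          # (8, 64): length halved
```

Applying log_2(128) = 7 such blocks (with matching channel counts) would reduce the length from 128 to 1, as in the architecture above.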
Finally, we discuss a further change to the method, which we implemented in the superconducting qubit refrigerator case to improve convergence. The idea is the following. The return ⟨r_c⟩ is a convex combination of the power and of the negative entropy production. The first term is positive when the system is delivering the desired power, while the second term is strictly negative. Therefore, for c close to 1, the optimal value of the return is some positive quantity. Instead, as c decreases, the optimal value of the return decreases, getting closer to zero (this can be seen explicitly in Figs. 4A and 5B). However, a null return can also be achieved by a trivial cycle that consists of doing nothing, i.e. of keeping the control constant in time. Indeed, this yields both zero power and zero entropy production. Therefore, as c decreases, it becomes harder and harder for the RL agent to distinguish good cycles from these trivial solutions. We thus modify our method to allow us to smoothly change the value of c during training from 1 to the desired final value, which allows us to tackle the optimization problem by "starting from an easier problem" (c = 1) and gradually increasing its difficulty. This required the following modifications to the previously described method.
We introduce two separate value functions, one for each objective (P for the power, and Σ for the entropy production), where represent respectively the normalized average power and average entropy production during each time-step. Since the value functions in Eq. (53) are identical to Eq. (36) up to a change of the reward, they separately satisfy the same Bellman equation as in Eq. (37), with r_1 replaced respectively with r^(P)_1 and r^(Σ)_1. Therefore, we learn each value function by minimizing the same loss function L_Q given in Eq. (39), with r_i replaced with r^(P)_i or r^(Σ)_i. Both value functions are parameterized using the same architecture, but with separate and independent parameters. We now turn to the determination of the policy. Comparing the definition of r_i given in the main text with Eq. (54), we see that r_{i+1} = c r^(P)_{i+1} − (1 − c) r^(Σ)_{i+1}. Using this property, and comparing Eq. (36) with Eq. (53), we see that Therefore, we learn the policy by minimizing the same loss function as in Eq. (49), using Eq. (55) to compute the value function. To summarize, this method allows us to vary c dynamically during training. It requires learning two value functions, one for each objective, and storing in the replay buffer the two separate rewards r^(P)_i and r^(Σ)_i. At last, when we refer to the "final deterministic cycle", we sample from the policy function "switching off the stochasticity", i.e. choosing continuous actions u by setting ξ = 0 in Eq. (38), and deterministically choosing the discrete action with the highest probability.
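The two-critic construction can be sketched numerically. The snippet below reflects our reading of the reward as a convex combination r = c·r^(P) − (1 − c)·r^(Σ) of the normalized power and entropy-production rewards; since c enters linearly, the combined value at the current c can be assembled from the two separately learned critics (the idea behind Eq. (55)), which is what lets c be annealed during training without relearning from scratch. How the entropy-regularization terms are apportioned between the critics is not shown here.

```python
def combined_reward(r_P, r_Sigma, c):
    """Convex combination of normalized power and negative entropy
    production (our reading of the main-text reward; cf. Eq. (54))."""
    return c * r_P - (1.0 - c) * r_Sigma

def combined_q(q_P, q_Sigma, c):
    """Assemble the value at the current weight c from the two separately
    learned critics (sketch of the idea behind Eq. (55))."""
    return c * q_P - (1.0 - c) * q_Sigma

# At c = 1 only power matters, so a "do nothing" cycle (zero power, zero
# entropy production) scores 0 and is easily beaten; as c decreases, good
# returns are pulled toward 0, which motivates annealing c from 1 down.
assert combined_reward(0.5, 0.2, 1.0) == 0.5     # pure power objective
assert combined_reward(0.5, 0.2, 0.0) == -0.2    # pure dissipation penalty
```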

Physical model
As discussed in the main text, we describe the dynamics of the two analyzed QTMs employing the Lindblad master equation, which can be derived also for non-adiabatic drivings [107], in the weak system-bath coupling regime, performing the usual Born-Markov and secular approximations [104-106] and neglecting the Lamb-shift contribution. This approach describes the time evolution of the reduced density matrix of the quantum system, ρ(t), under the assumption of weak system-bath interaction. Setting ℏ = 1, the master equation reads where Ĥ[⃗u(t)] is the Hamiltonian of the quantum system, which depends explicitly on time via the control parameters ⃗u(t), [•, •] denotes the commutator, and , known as the dissipator, describes the effect of the coupling between the quantum system and bath α = H, C. We notice that, since the RL agent produces piece-wise constant protocols, we are not impacted by possible inaccuracies of the master equation subject to fast parameter driving [113], provided that ∆t is not smaller than the bath timescale. Without loss of generality, the dissipators can be expressed as [105, 106] where λ_α[d(t)] ∈ {0, 1} are functions that determine which bath is coupled to the quantum system, Â^(α)_{k,⃗u(t)} are the Lindblad operators, and γ^(α)_{k,⃗u(t)} are the corresponding rates. In particular, λ_H(Hot) = 1, λ_C(Hot) = 0, while λ_H(Cold) = 0, λ_C(Cold) = 1, and λ_H(None) = λ_C(None) = 0. Notice that both the Lindblad operators and the rates can depend on time through the value of the control ⃗u(t). Their explicit form depends on the details of the system, i.e.
on the Hamiltonian describing the dynamics of the overall system, including the baths and the system-bath interaction. Below, we provide the explicit form of Â^(α)_{k,⃗u(t)} and γ^(α)_{k,⃗u(t)} used to model the two setups considered in the manuscript. We adopt the standard approach to compute the instantaneous power and heat currents [114], which guarantees the validity of the first law of thermodynamics ∂U(t)/∂t = −P(t) + Σ_α J_α(t), the internal energy being defined as In the superconducting qubit refrigerator, we employ the model first put forward in Ref. [49], and further studied in Refs. [55, 63]. In particular, we consider the following Lindblad operators and corresponding rates (identifying k = ±): where |g_u(t)⟩ and |e_u(t)⟩ are, respectively, the instantaneous ground state and excited state of Eq. (7). The corresponding rates are given by γ where ∆ϵ_u(t) is the instantaneous energy gap of the system, and S_α(ω) is the noise power spectrum of bath α. Here ω_α, Q_α and g_α are the base resonance frequency, quality factor and coupling strength of the resonant circuit acting as bath α = H, C (see Refs. [49, 63] for details). As in Ref. [63], we choose ω_C = 2E_0∆ and ω_H = 2E_0√(∆² + 1/4), such that the C (H) bath is in resonance with the qubit when u = 0 (u = 1/2). The width of the resonance is governed by Q_α. The total coupling strength to bath α, plotted in Fig. 3F, is quantified by In the heat engine based on a quantum harmonic oscillator, following Ref. [43], we describe the coupling to the baths through the Lindblad operators Â^(α)_{−,u(t)} = â_u(t) and corresponding rates γ where we identify k = ±. Here â_u(t) = (1/√2)[√(mω_0u(t)) q̂ + i p̂/√(mω_0u(t))] and â†_u(t) are respectively the (control-dependent) lowering and raising operators, Γ_α is a constant rate setting the thermalization timescale of the system coupled to bath α, and n(x) = [exp(x) − 1]^(−1) is the Bose-Einstein distribution.
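The Lindblad dynamics and the heat-current convention above can be illustrated with a small numerical integrator. This is a generic toy example (a decaying qubit with a σ₋ jump operator and arbitrary parameters), not the paper's specific rates or controls; it shows the structure dρ/dt = −i[H, ρ] + D[ρ], trace preservation, and a heat current evaluated as J = Tr(D[ρ] Ĥ).

```python
import numpy as np

def lindblad_rhs(rho, H, jumps):
    """drho/dt = -i[H, rho] + sum_k gamma_k (A rho A^† - {A^†A, rho}/2)."""
    drho = -1j * (H @ rho - rho @ H)
    for gamma, A in jumps:
        AdA = A.conj().T @ A
        drho += gamma * (A @ rho @ A.conj().T - 0.5 * (AdA @ rho + rho @ AdA))
    return drho

def rk4_step(rho, H, jumps, dt):
    """One 4th-order Runge-Kutta step; each stage is traceless, so the
    trace of rho is preserved to machine precision."""
    k1 = lindblad_rhs(rho, H, jumps)
    k2 = lindblad_rhs(rho + 0.5 * dt * k1, H, jumps)
    k3 = lindblad_rhs(rho + 0.5 * dt * k2, H, jumps)
    k4 = lindblad_rhs(rho + dt * k3, H, jumps)
    return rho + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

# Toy qubit: H = (eps/2) sigma_z, decay via sigma_minus at rate gamma.
eps, gamma, dt = 1.0, 0.2, 0.01
H = 0.5 * eps * np.diag([1.0, -1.0]).astype(complex)   # index 0 = excited
sm = np.array([[0, 0], [1, 0]], dtype=complex)         # |g><e|
rho = np.array([[1, 0], [0, 0]], dtype=complex)        # start in |e><e|

for _ in range(500):
    rho = rk4_step(rho, H, [(gamma, sm)], dt)

# Heat current from the bath: J = Tr(D[rho] H); negative here, since the
# decaying qubit dumps energy into the bath.
J = np.trace(lindblad_rhs(rho, 0 * H, [(gamma, sm)]) @ H).real
```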

Training details
We now provide additional practical details and the hyperparameters used to produce the results of this manuscript.
In order to enforce sufficient exploration in the early stage of training, we do the following. As in Ref. [111], for a fixed number of initial steps, we choose random actions, sampling them uniformly within their range. Furthermore, for another fixed number of initial steps, we do not update the parameters, to allow the replay buffer to accumulate enough transitions. B is a first-in-first-out buffer, of fixed dimension, from which batches of transitions (s_k, a_k, r_{k+1}, s_{k+1}, a_{k+1}) are randomly sampled to update the NN parameters. After this initial phase, we repeat a policy evaluation step, a policy improvement step and a temperature update step n_updates times every n_updates steps. This way, the overall number of updates coincides with the number of actions performed on the QTM. The optimization steps for the value function and the policy are performed using the ADAM optimizer with the standard values of β_1 and β_2. The temperature parameters α_D and α_C are instead determined using stochastic gradient descent with learning rate 0.001. To favor exploratory behavior early in the training, and at the same time to end up with a policy that is approximately deterministic, we schedule the target entropies H̄_C and H̄_D. In particular, we vary them exponentially during each step according to H̄_a(n_steps) = H̄_a,end + (H̄_a,start − H̄_a,end) exp(−n_steps/H̄_a,decay), (62) where a = C, D, n_steps is the current step number, and H̄_a,start, H̄_a,end and H̄_a,decay are hyperparameters. In the superconducting qubit refrigerator case, we schedule the parameter c according to a Fermi distribution, i.e.

c(n_steps) = c_final + (1 − c_final)/{exp[(n_steps − n_c)/δ_c] + 1}, where c_final is the final value of the weight, and n_c and δ_c are hyperparameters setting respectively the midpoint and the width of the transition from c = 1 to c = c_final.
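The two annealing schedules (the exponential entropy-target schedule of Eq. (62) and the Fermi-type schedule of c) can be sketched as follows. The numerical values and the parameter names of the Fermi schedule (c_final, n_mid, width) are our own illustrative choices; the actual hyperparameters are listed in Tables I and II.

```python
import numpy as np

def entropy_target(n, H_start, H_end, H_decay):
    """Exponential annealing, Eq. (62): exploratory (high target entropy)
    early in training, nearly deterministic at the end."""
    return H_end + (H_start - H_end) * np.exp(-n / H_decay)

def c_schedule(n, c_final, n_mid, width):
    """Fermi-function annealing of the weight c from 1 down to c_final,
    so the agent first solves the easier c = 1 (pure power) problem."""
    return c_final + (1.0 - c_final) / (np.exp((n - n_mid) / width) + 1.0)

# Illustrative values: target entropy from 1.0 down to -4.0; c from 1 to 0.6.
h0 = entropy_target(0, 1.0, -4.0, 1000.0)        # starts at H_start
c0 = c_schedule(0, 0.6, 50_000, 5_000)           # starts near 1
```

Both schedules are monotone, so the optimization difficulty increases smoothly rather than abruptly.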
In the harmonic oscillator engine case, to improve stability while training at lower values of c, we do not vary c during training as we do in the superconducting qubit refrigerator case. Instead, we discourage the agent from never utilizing one of the two thermal baths by adding a negative reward if, within the last N = 128 actions describing the state, fewer than 25 describe a coupling to either bath. In particular, if the number of actions N_α with d = α, for α = Hot, Cold, is less than 25 in the state time-series, we add the following penalty to the reward This penalty has no impact on the final cycles, where N_α is much larger than 25.
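The mechanism can be sketched as follows. The penalty's functional form below (a linear ramp vanishing at the threshold) and the weight are our placeholders, since we only reproduce its key property: it is strictly negative when a bath is used fewer than 25 times in the last N = 128 actions, and exactly zero otherwise.

```python
N_MIN = 25   # minimum number of couplings to each bath per N = 128 actions

def bath_penalty(actions_d, bath, weight=1.0):
    """actions_d: the last N = 128 discrete actions; bath: 'Hot' or 'Cold'.
    Returns a non-positive number added to the reward. The linear ramp is
    a hypothetical shape, not the paper's exact expression."""
    n_alpha = sum(1 for d in actions_d if d == bath)
    if n_alpha >= N_MIN:
        return 0.0                          # no impact on good cycles
    return -weight * (N_MIN - n_alpha) / N_MIN

state = ['Hot'] * 60 + ['Cold'] * 60 + ['None'] * 8
penalty = bath_penalty(state, 'Hot') + bath_penalty(state, 'Cold')  # = 0.0
```

A cycle that couples to both baths often enough (as all good engine cycles do) is thus unaffected, while "single bath" degenerate strategies are suppressed.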
All hyperparameters used to produce the results of the superconducting qubit refrigerator and of the harmonic oscillator heat engine are provided in Tables I and II, respectively, where c refers to the weight at which we optimize the return.

Convergence of the RL approach
The training process involves several sources of stochasticity, such as the initial random steps, the stochastic sampling of actions from the policy function, and the random sampling of batches of experience from the replay buffer to compute an approximate gradient of the loss functions. We thus need to evaluate the reliability of our approach.
As shown in the main text, specifically in Figs. 4 and 5, we ran the full optimization 5 times. Out of 65 trainings in the superconducting qubit refrigerator case, only 4 failed, and out of the 55 in the harmonic oscillator engine case, only 2 failed, where by "failed" we mean that the final return was negative. In such cases, we ran the training an additional time.
Figs. 4A and 5B display an error bar corresponding to the standard deviation, at each value of c, computed over the 5 repetitions. Instead, in Figs. 4B and 5C we display one black dot for each individual training. As we can see, the overall performance is quite stable and reliable.
At last, we discuss the variability of the discovered cycles. The cycles shown in Figs. 4C-F and 5D-E were chosen by selecting the largest return among the 5 repetitions. In Figs. 8 and 9 we display cycles discovered in the last of the 5 repetitions, i.e. chosen without any post-selection. They correspond to the same setups and parameters displayed in Figs. 4C-F and 5D-E. As we can see, 5 out of the 6 displayed cycles are very similar to the ones displayed in Figs. 4C-F and 5D-E, with only slight variability. The only exception is Fig. 8B, where the cycle has a visibly shorter period and smaller amplitude than the one shown in Fig. 4D. Despite this visible difference in the cycle shape, the return of the cycle shown in Fig. 8B is 0.382, compared to 0.385 for the cycle shown in Fig. 4D.
We therefore conclude that, up to minor changes, the cycles are generally quite stable across multiple trainings.

Comparing with other methods
In Figs. 4 and 5 we compare the performance of our method against optimized trapezoidal cycles and optimized Otto cycles, respectively. In both cases, we also maximize the power using the RL method of Ref. [69]. We now detail how we perform this comparison.
In the refrigerator based on a superconducting qubit, we consider the trapezoidal cycle proposed in Refs. [49, 63], i.e. we fix its shape, with a = 2, and we optimize ⟨r_c⟩ with respect to the frequency Ω. In the heat engine based on a quantum harmonic oscillator, we fix an Otto cycle as described in Ref. [43], i.e. a trapezoidal cycle consisting of the 4 strokes shown in Figs. 5D-E as a dashed line, and we optimize over the duration of each of the 4 strokes. In particular, we first performed a grid search in the space of these four durations for c = 1. After identifying the largest power, we ran the Newton algorithm to further maximize the return. We then ran the Newton algorithm for all other values of c. The comparison with Ref. [69] was done using the source code provided in Ref. [69], with exactly the same hyperparameters that were used in Ref. [69].
In particular, in the case of the refrigerator based on a superconducting qubit, we re-ran the code using the hyperparameters reported in Table 1, column "Figs. 3, 4", of the Methods section of Ref. [69], and we trained for the same number of steps (500k). We then computed its power and coefficient of performance by evaluating the deterministic policy (which typically has better performance). In the heat engine case based on a quantum harmonic oscillator, we evaluated the performance of the cycle reported in Fig. 5a,c of Ref. [69], whose training hyperparameters are reported in Table 1, column "Fig. 5a", of the Methods section of Ref. [69].

Generation of coherence
In order to quantify the coherence generated in the instantaneous eigenbasis of the Hamiltonian in the refrigerator based on a superconducting qubit, we evaluated the time average of the relative entropy of coherence [115], defined as C(ρ(t)) = S(ρ_diag(t)) − S(ρ(t)), (66) where ρ_diag(t) is the density matrix, written in the instantaneous eigenbasis |g_u(t)⟩ and |e_u(t)⟩, with the off-diagonal terms set to zero.
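Eq. (66) is straightforward to evaluate numerically. The sketch below (our own generic implementation, working in whatever basis the density matrix is written in, here standing in for the instantaneous energy eigenbasis) computes the von Neumann entropies through eigenvalues and checks the two limiting cases: an incoherent state has C = 0, and a maximally coherent pure qubit state has C = log 2.

```python
import numpy as np

def von_neumann_entropy(rho):
    """S(rho) = -Tr(rho log rho), computed from the eigenvalues
    (zero eigenvalues contribute nothing)."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return float(-np.sum(evals * np.log(evals)))

def rel_entropy_of_coherence(rho):
    """C(rho) = S(rho_diag) - S(rho), Eq. (66), where rho_diag keeps only
    the diagonal of rho in the chosen (here: energy) eigenbasis."""
    rho_diag = np.diag(np.diag(rho))
    return von_neumann_entropy(rho_diag) - von_neumann_entropy(rho)

# Maximally coherent pure qubit state |+><+|: C = log(2).
plus = np.full((2, 2), 0.5, dtype=complex)
C_plus = rel_entropy_of_coherence(plus)

# Any diagonal (incoherent) state has C = 0.
C_diag = rel_entropy_of_coherence(np.diag([0.3, 0.7]).astype(complex))
```

The time average reported in Table III corresponds to averaging C(ρ(t)) over one period of the cycle.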
We compute the time average of the relative entropy of coherence generated by the final deterministic cycle found by the RL agent, and compare it to the coherence generated by a trapezoidal cycle operated at the same speed, i.e. with the same period. As we can see in Table III, the trapezoidal cycles generate twice as much coherence as the RL cycles shown in Figs. 4C-F, i.e. those corresponding to c = 1, 0.8, 0.6, 0.4.

CODE AND DATA AVAILABILITY
The code used to generate all results is available on GitHub (https://github.com/PaoloAE/paper_rl_blackbox_thermal_machines). All raw data generated with the accompanying code and used to produce the results in the manuscript is available on Figshare (https://doi.org/10.6084/m9.figshare.19180907).

FIG. 3.
FIG. 3. Training of the superconducting qubit refrigerator model to optimize ⟨r_c⟩ at c = 0.6. (A): Schematic representation of the energy levels of the qubit (horizontal black lines) that are controlled by u(t). The gray arrow represents the input power, while the colored arrows represent the heat fluxes. (B): Return ⟨r_c⟩_i computed over past rewards (black curve), running average of the cooling power ⟨P_cool⟩_i/P_0 (green curve), and of the negative entropy production −⟨Σ⟩_i/Σ_0 (orange curve), as a function of the training step. The dashed line represents the value of the return found by optimizing the period of a smoothed trapezoidal cycle. (C): Value of the weight c as a function of the step. It is varied during training from 1 to the final value 0.6 to improve convergence. (D): Actions chosen by the agent, represented by the value of u, as a function of the step, zoomed around the three black circles in panel (B). (E): Final deterministic cycle found by the agent (thick black dots) and smoothed trapezoidal cycle (thin dashed line), whose return is given by the dashed line in panel (B), as a function of time. (F): Coupling strength γ^(C)_u (blue curve) and γ^(H)_u (red curve) as a function of u (on the y-axis). The parameters used for training are N = 128, g_H = g_C = 1, β_H = 10/3, β_C = 2β_H, Q_H = Q_C = 4, E_0 = 1, ∆ = 0.12, ω_H = 1.028, ω_C = 0.24, U = [0, 0.75], ∆t = 0.98, γ = 0.997, P_0 = 6.62 · 10^−4 and Σ_0 = 0.037.
as a dashed line (see "Comparing with other methods" in Materials and Methods for details).

FIG. 4.
FIG. 4. Results for the optimization of the superconducting qubit refrigerator model. (A): Final value of the return ⟨r_c⟩, as a function of c, found using the RL method (black and blue points), and optimizing the period of a trapezoidal cycle (red dots). The error bars represent the standard deviation of the return computed over 5 independent training runs. (B): Corresponding values of the final average cooling power ⟨P_cool⟩ and of the coefficient of performance η_cool found using the RL method (black and blue dots), optimizing the trapezoidal cycle (red dots), and using the RL method of Ref. [69] (purple cross). Results for each of the 5 repetitions are shown as separate points to visualize the variability across multiple trainings. (C-F): Final deterministic cycles identified by the RL method (thick black dots), as a function of time, corresponding to the blue points in panels (A) and (B) (respectively for c = 1, 0.8, 0.6, 0.4, choosing the training run with the largest return). The dashed line represents the trapezoidal cycle that maximizes the return for the same value of c [not shown in panel (F) since no cycle yields a positive return]. The parameters used for training are chosen as in Fig. 3.

FIG. 5.
FIG. 5. Results for the optimization of the harmonic oscillator heat engine model. (A): Schematic representation of the energy levels of the particles (black horizontal lines) trapped in a harmonic potential (parabolic curve) whose amplitude is controlled by u(t). The gray arrow represents the extracted power, while the colored arrows represent the heat fluxes. (B): Final value of ⟨r_c⟩, as a function of c, found using the RL method (black and blue dots), and optimizing the Otto cycle (red dots). The error bars represent the standard deviation of the return computed over 5 independent training runs. (C): Corresponding values of the average power ⟨P_heat⟩/P_0 and of the efficiency η_heat found using the RL method (black and blue dots), optimizing the Otto cycle (red dots), and using the RL method of Ref. [69] (purple cross). Results for each of the 5 repetitions are shown as separate points to visualize the variability across multiple trainings. (D-E): Final deterministic cycle identified by the RL method (thick dots), as a function of time, corresponding to the blue points in panels (B) and (C) (respectively c = 1, 0.5, choosing the training run with the largest return). The color corresponds to the discrete choice d = {Hot, Cold, None} (see legend). The dashed line represents the Otto cycle that maximizes the return for the same value of c. The parameters used for training are N = 128, Γ^(H) = Γ^(C) = 0.6, β_H = 0.2, β_C = 2, ω_0 = 2, U = [0.5, 1] (to enable a fair comparison with Ref. [43]), ∆t = 0.2, γ = 0.999, P_0 = 0.175 and Σ_0 = 0.525.

FIG. 7.
FIG. 7. Neural network architecture used to parameterize the policy π(⃗u, d|s) (panel A) and to parameterize the value function Q(s, ⃗u, d) (panel B).
TABLE III. Coherence generated by the final deterministic cycles identified by the RL method (RL column) and by a trapezoidal cycle operated at the same speed (Trapez. column), at the values of c shown in the first column. These values correspond to the cycles shown in Figs. 4C-F.

TABLE I.
Hyperparameters used in the numerical calculations relative to the superconducting qubit refrigerator that are not reported in the caption of Fig. 3.