The neural state-space model
Learning dynamical systems from data has been a core discipline within control design for decades, including aircraft flight control20 and simulation21, and has historically been known as system identification22,50. However, due to computational limitations of the time, classical approaches have typically been restricted to linear models, often in the form of linear SSMs:
$$\dot{{{\bf{x}}}}=A{{\bf{x}}}+B{{\bf{a}}}$$
(1)
$${{\bf{o}}}=C{{\bf{x}}}+D{{\bf{a}}}$$
(2)
where A, B, C, and D are the matrices to be learned from datasets of observables, o, actions, a, and, possibly, states, x. We note that the controls literature typically uses the notation y in lieu of o and u in lieu of a, reflecting a difference in notation between the controls and RL communities; here we use RL notation for consistency. In the modern deep-learning era, this idea of learning dynamical systems from data was rediscovered from a different perspective with the advent of the neural differential equation (NDE)18:
$$\dot{{{\bf{x}}}}=N{N}_{\theta }({{\bf{x}}})$$
(3)
where it was discovered that, given datasets of x, a neural network, NNθ, can be used as a system of differential equations that is integrated forward in time with a differential equation solver, and adjoint back-propagation methods can then be used in conjunction with automatic differentiation to determine the gradient of the loss with respect to the network parameters θ16,17,18. The introduction of flexible machine learning frameworks has enabled the development of the field of SciML, based around the core idea of extending NDEs to include physics and other domain-specific structure16,18. One extension that completes the circle with the classical linear SSM is the NSSM, which reintroduces the concepts of actions and observations:
$$\dot{{{\bf{x}}}}(t)={f}_{\theta }({{\bf{x}}},{{\bf{a}}})$$
(4)
$${{\bf{o}}}(t)={O}_{\theta }({{\bf{x}}},{{\bf{a}}})$$
(5)
Thanks to the power of new, highly flexible machine learning frameworks such as JAX and the Julia SciML ecosystem, fθ and Oθ can be programmed to include arbitrary combinations of neural networks, physics formulas, and even classical data-driven models such as power laws, a capability we exploit in this work. The training process of an NSSM is shown in Fig. 9 and involves simulating the NSSM forward in time using an initial state x0 and a time series of actions a0:T from an experimental database. The error of the simulation results against the experimental data is computed, and adjoint methods and automatic differentiation are used to determine the gradient needed to reduce the loss. In this work, the differential equation solver package diffrax17 is used; it integrates multiple adjoint methods with the JAX automatic differentiation system, allowing backpropagation through all differential equation solvers in the package.

The NSSM, defined by the dynamics function fθ and observation function Oθ with parameters θ, is simulated forward in time, given an initial state x0 and an action trajectory a0:T, to generate a sequence of simulated observations, \({\hat{{{\bf{o}}}}}_{0:T}\). The simulated observations are compared with experimental observations via the loss functional \({{\mathcal{L}}}\), which is defined as the time-integrated value of an instantaneous loss function l. Adjoint methods in diffrax17 and JAX automatic differentiation then yield the gradient of the loss with respect to the model parameters, \({\nabla }_{\theta }{{\mathcal{L}}}\), which allows the optimizer to update the parameters θ.
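To make this training step concrete, the following is a minimal sketch, not the authors' implementation, of simulating an NSSM through diffrax and differentiating the resulting loss with JAX. The toy parameter layout, the `mlp` helper, and the array shapes for the action and observation trajectories are our own illustrative assumptions.

```python
import jax
import jax.numpy as jnp
import diffrax

def mlp(params, z):
    # Tiny MLP: params is a list of (W, b) pairs; GELU hidden activations.
    for W, b in params[:-1]:
        z = jax.nn.gelu(W @ z + b)
    W, b = params[-1]
    return W @ z + b

def simulate(theta, x0, t_grid, a_traj):
    # Interpolate the recorded action trajectory so it can be queried at solver times.
    actions = diffrax.LinearInterpolation(ts=t_grid, ys=a_traj)

    def f(t, x, args):
        # Dynamics function f_theta(x, a): a pure neural net here, purely for illustration.
        a = actions.evaluate(t)
        return mlp(theta["dynamics"], jnp.concatenate([x, a]))

    sol = diffrax.diffeqsolve(
        diffrax.ODETerm(f), diffrax.Tsit5(),
        t0=t_grid[0], t1=t_grid[-1], dt0=1e-3, y0=x0,
        saveat=diffrax.SaveAt(ts=t_grid),
    )
    # Observation function O_theta(x, a) applied along the simulated trajectory.
    obs = lambda x, a: mlp(theta["observation"], jnp.concatenate([x, a]))
    return jax.vmap(obs)(sol.ys, a_traj)

def loss(theta, x0, t_grid, a_traj, o_traj):
    # Time-averaged instantaneous loss l between simulated and experimental observations.
    o_hat = simulate(theta, x0, t_grid, a_traj)
    return jnp.mean((o_hat - o_traj) ** 2)

# Gradients flow back through the ODE solve via diffrax's adjoint machinery.
grad_fn = jax.jit(jax.grad(loss))
```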
The dynamics function fθ(x, a)
We begin by defining the following confinement laws:
$${\tau }_{n,pred}({{\bf{x}}},{{\bf{a}}})= {c}_{n}{I}_{p}^{{c}_{I,n}}{\bar{n}}_{e,20}^{{c}_{n,n}}{P}_{input}^{{c}_{P,n}}{\kappa }^{{c}_{\kappa,n}}{\epsilon }^{{c}_{\epsilon,n}}| {\dot{I}}_{p}{| }^{{c}_{{\dot{I}}_{p},n}}N{N}_{conf,0}({{\bf{x}}},{{\bf{a}}})\\ {({c}_{n,h}{\bar{n}}_{e,20}^{{c}_{n,n,h}}{P}_{input}^{{c}_{P,n,h}})}^{{\mbox{hmode}}({{\bf{x}}},{{\bf{a}}})}$$
(6)
$$\tau_{E,pred}({{{\mathbf{x}}}},{{{\mathbf{a}}}})= \underbrace{c_{E}I_p^{c_{I,E}}{{\bar{n}}}_{e,20}^{c_{n,E}}P_{input}^{c_{P,E}}\kappa^{c_{\kappa,E}}\epsilon^{c_{\epsilon,E}}|\dot{I}_p|^{c_{\dot{I}_p,E}}}_{{{{\mbox{L-mode}}}}}\underbrace{NN_{conf,1}({{{\mathbf{x}}}},{{{\mathbf{a}}}})}_{{{{\mbox{NN correction}}}}} \\ \underbrace{(c_{E,h}{{\bar{n}}}_{e,20}^{c_{n,E,h}}P_{input}^{c_{P,E,h}})^{{{{\mbox{hmode}}}}({{{\mathbf{x}}}},{{{\mathbf{a}}}})}}_{{{{\mbox{H-mode correction}}}}}$$
(7)
$$\,{{\mbox{hmode}}}\,({{\bf{x}}},{{\bf{a}}})=\,{{\mbox{tanhHeaviside}}}\,({P}_{input}-{c}_{h}{\bar{n}}_{e,20}^{{c}_{h,n}}{a}_{minor}^{{c}_{h,a}})$$
(8)
$$\,{\mbox{tanhClip}}\,(x)\equiv \tanh \left(\frac{2k}{\,{\mbox{max}}-{\mbox{min}}}(x-{\mbox{center}})\right)\frac{{\mbox{max}}-{\mbox{min}}}{2}+{\mbox{center}}\,$$
(9)
$$\,{{\mbox{tanhHeaviside}}}\,(x)\equiv \frac{1}{2}(\tanh (kx)+1)$$
(10)
where the parameters to be learned include all coefficients c* and the neural network parameters. The laws are structured as the product of an L-mode term, a neural network correction factor, and an H-mode correction factor. The L-mode term reflects standard confinement scalings, but with the introduction of an \(| {\dot{I}}_{p}|\) term, which was found to better capture the short-term effects of ramping plasma current. The neural network output includes a tanhClip final activation that constrains its output to the range [0.75, 1.25], thus controlling the maximum adjustment the network is allowed to make. The hmode function includes a tanhHeaviside function, which provides a smooth transition from one to zero once Pinput falls below the learned back-transition threshold, which is structured to reflect the Martin scaling51. Note that the use of the hmode function output as an exponent deactivates the H-mode correction term once hmode transitions from one to zero. While, in principle, the neural network should be able to learn the effects of H-mode implicitly, we found that adding an explicit H-mode correction factor helped improve model predictions in our low-data regime. The k factor controls the smoothness of both the tanhClip and tanhHeaviside functions.
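For reference, the bounding activations of Eqs. (9) and (10) can be written compactly; the snippet below is a minimal sketch (function names and default arguments are ours, not the paper's code), using the [0.75, 1.25] clip range quoted above for the confinement-law correction factor.

```python
import jax.numpy as jnp

def tanh_clip(x, lo=0.75, hi=1.25, k=1.0):
    # Eq. (9): smoothly squashes x into (lo, hi); k sets the steepness near the center.
    center = 0.5 * (lo + hi)
    return jnp.tanh(2.0 * k / (hi - lo) * (x - center)) * (hi - lo) / 2.0 + center

def tanh_heaviside(x, k=1.0):
    # Eq. (10): smooth step, ~0 for x << 0 and ~1 for x >> 0; used as the H-mode gate.
    return 0.5 * (jnp.tanh(k * x) + 1.0)
```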
These confinement laws are integrated into the following 0D energy and particle balance equations, a model that blends simple physics principles, power laws, and neural networks:
$$\frac{dW_{tot}}{dt}= -\underbrace{\frac{W_{tot}}{\tau_{E,pred}}}_{{{{\mbox{Transport}}}}}+\underbrace{I_p^2NN_{ohm,rad,0}({{{\mathbf{x}}}},{{{\mathbf{a}}}})}_{{{{\mbox{Ohmic Heating}}}}} - \underbrace{{{\bar{n}}}_{e,20}VNN_{ohm,rad,1}({{{\mathbf{x}}}},{{{\mathbf{a}}}})}_{{{{\mbox{Radiated Power}}}}} \\ +\underbrace{P_{NBI}+P_{ECRH}}_{{{{\mbox{Aux. Heating}}}}}$$
(11)
$$\frac{d({{\bar{n}}}_{e,20}V)}{dt}= -\underbrace{\frac{{{\bar{n}}}_{e,20}}{\tau_{n,pred}}}_{{{{\mbox{Transport}}}}}+\underbrace{c_{NBI}P_{NBI}}_{{{{\mbox{NBI Fueling}}}}}+\underbrace{c_{gas,0}\sigma(c_{gas,1}V_{gas}+c_{gas,2})}_{{{{\mbox{Gas Valve Fueling}}}}} \\ +\underbrace{NN_{wall}({{{\mathbf{x}}}},{{{\mathbf{a}}}}){e}^{-c_{wall}g_{HFS}}}_{{{{\mbox{Wall Effects}}}}}$$
(12)
where σ is the sigmoid function, NNohm,rad is a network that predicts two quantities; the first is multiplied by \({I}_{p}^{2}\) to serve as an Ohmic heating term, and the second is multiplied by density and volume to serve as the radiated power term. NNwall is included to account for possible wall fueling effects when in a limited configuration, and is multiplied by an exponential in the HFS gap to deactivate it when diverted. Additional simple constants are included to account for fueling from both NBI and gas puffing. We note that, in both cases, the included terms do not capture important state dependencies and time delays, but they proved sufficient for this use case. The dynamics of density times volume are predicted; in cases where density itself is used (e.g., to compute the Greenwald Fraction), the following volume approximation is used to recover density:
$$V\approx 2\pi {R}^{2}{\epsilon }^{2}\kappa \left(\pi -\left(\pi -\frac{8}{3}\right)\epsilon \right)$$
(13)
Since time derivatives of quantities, \({\dot{I}}_{p},\dot{\kappa },{\dot{a}}_{minor},\dot{\delta }\) are used as actions, their integrated values are also added as state variables with trivial dynamics.
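As an illustration of how the balance equations blend physics and learned terms, the sketch below restates the stored-energy balance of Eq. (11) and the volume approximation of Eq. (13). It is not the authors' implementation; the function names and argument layout are our own, and the NN outputs are passed in as precomputed scalars.

```python
import jax.numpy as jnp

def volume_approx(R, eps, kappa):
    # Eq. (13): approximate plasma volume from major radius, inverse aspect ratio, and elongation.
    return 2.0 * jnp.pi * R**2 * eps**2 * kappa * (jnp.pi - (jnp.pi - 8.0 / 3.0) * eps)

def dW_dt(W_tot, ne20, V, Ip, P_nbi, P_ecrh, tau_E_pred, nn_ohm, nn_rad):
    # Eq. (11): transport loss + Ohmic heating - radiated power + auxiliary heating,
    # where nn_ohm and nn_rad are the two outputs of NN_ohm,rad evaluated at (x, a).
    return (-W_tot / tau_E_pred
            + Ip**2 * nn_ohm
            - ne20 * V * nn_rad
            + P_nbi + P_ecrh)
```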
The observation function Oθ(x, a)
The observation function consists of several components: a NN predictor for γvgr, a profile predictor, and simple physics formulae to compute derived quantities:
$${O}_{\theta }({{\bf{x}}},{{\bf{a}}})\left\{\begin{array}{ll}{\beta }_{p}=\frac{8}{3}\frac{{W}_{tot}}{{\mu }_{0}{R}_{0}{I}_{p}^{2}}\quad \hfill &(14{{\rm{c}}})\\ {f}_{GW}=\frac{{\bar{n}}_{e,20}\pi {a}_{minor}^{2}}{{I}_{p,MA}} \hfill\quad &(14{{\rm{d}}})\\ {q}_{95}=\frac{4.1{a}_{minor}^{2}{B}_{0}}{{R}_{0}{I}_{p,MA}}\left(1+1.2(\kappa -1)+0.56{(\kappa -1)}^{2}\right) \left(1+0.09\delta+0.16{\delta }^{2}\right)\frac{1+0.45\delta \epsilon }{1-0.74\epsilon }\quad &(14{{\rm{e}}})\\ {\gamma }_{vgr}=N{N}_{vgr}({{\bf{x}}},{{\bf{a}}})\quad \hfill &(14{{\rm{f}}})\\ {T}_{e}(\rho ),{N}_{e}(\rho )=N{N}_{prof}({{\bf{x}}},{{\bf{a}}})\quad \hfill &(14{{\rm{g}}})\end{array}\right.$$
where βp is computed in accordance with the LIUQE definition52, fGW is the usual Greenwald Fraction9, q95 is the approximation given in ref. 53 with the squareness factor set to 1, NNvgr is a multi-layer perceptron (MLP) with GELU activation and a scaled sigmoid output, and NNprof is a neural-operator-based profile predictor, discussed further in the next subsection.
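The closed-form components of the observation function are simple enough to restate directly; the following sketch transcribes Eqs. (14c)-(14e) (function names and argument conventions are ours).

```python
import jax.numpy as jnp

def beta_p(W_tot, R0, Ip):
    # Eq. (14c): LIUQE-style poloidal beta from stored energy, major radius, and plasma current.
    mu0 = 4e-7 * jnp.pi
    return (8.0 / 3.0) * W_tot / (mu0 * R0 * Ip**2)

def f_GW(ne20, a_minor, Ip_MA):
    # Eq. (14d): Greenwald fraction from line-averaged density (1e20 m^-3), minor radius, and Ip in MA.
    return ne20 * jnp.pi * a_minor**2 / Ip_MA

def q95(a_minor, B0, R0, Ip_MA, kappa, delta, eps):
    # Eq. (14e): shape-corrected q95 approximation, squareness factor set to 1.
    shape = ((1 + 1.2 * (kappa - 1) + 0.56 * (kappa - 1) ** 2)
             * (1 + 0.09 * delta + 0.16 * delta**2)
             * (1 + 0.45 * delta * eps) / (1 - 0.74 * eps))
    return 4.1 * a_minor**2 * B0 / (R0 * Ip_MA) * shape
```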
Neural-operator-based profile predictor
Prior work at NSTX-U trained a neural network to successfully predict kinetic profile shapes using their averages plus zero-dimensional control parameters such as plasma current, shaping, and auxiliary heating. Building upon this prior work, we show that, on TCV data, kinetic profiles can be predicted to reasonable accuracy with a neural network using the stored energy Wtot, line-averaged electron density \({\bar{n}}_{e,20}\), and control parameters. The key implication is that accurate prediction of the time-dependent dynamics of just two scalars, Wtot and \({\bar{n}}_{e,20}\), implies reasonable prediction of the dynamics of kinetic profiles.
We leverage methods developed by the neural operator54,55 literature, which has found success for solving machine learning problems in scientific domains involving PDEs. Letting fin denote an input function and fout denote an output function, a neural operator \({{\mathcal{F}}}\) parameterized by θ maps an input function to an output function:
$${f}_{out}={{{\mathcal{F}}}}_{\theta }({f}_{in})$$
(15)
In practice, the functions involved are approximated using a set of basis functions; thus, the practical implementation results in a neural network operating on basis function coefficients. In this work, we make use of cubic B-spline basis functions to represent the kinetic profiles:
$${T}_{e}(\rho )={\sum }_{i=1}^{{n}_{basis}}{c}_{T,i}{B}_{i,3}(\rho )\quad {n}_{e,20}(\rho )={\sum }_{i=1}^{{n}_{basis}}{c}_{n,i}{B}_{i,3}(\rho )$$
(16)
We predict these profiles using a set of 0D scalars, all of which are control parameters except the stored energy Wtot and \({\bar{n}}_{e,20}\). The full set of input and output parameters is specified in Table 1. During training, the basis functions are evaluated on the ρ grid corresponding to the dataset, but arbitrary alternative grids can be used at inference time.
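The reconstruction step of Eq. (16) amounts to summing cubic B-spline basis functions weighted by the predicted coefficients. Below is a minimal sketch of that step; the knot layout, n_basis = 8, and the random stand-in coefficients are illustrative assumptions, not the values used in this work.

```python
import numpy as np
from scipy.interpolate import BSpline

n_basis, degree = 8, 3
# Clamped knot vector on rho in [0, 1] yielding n_basis cubic basis functions.
interior = np.linspace(0.0, 1.0, n_basis - degree + 1)
knots = np.concatenate([np.zeros(degree), interior, np.ones(degree)])

c_T = np.random.default_rng(0).uniform(size=n_basis)   # stand-in for NN_prof coefficient output
rho = np.linspace(0.0, 1.0, 101)                        # arbitrary inference-time grid
T_e = BSpline(knots, c_T, degree)(rho)                  # T_e(rho) = sum_i c_{T,i} B_{i,3}(rho)
```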
Training methods
Training of the NSSM involves two stages. First, NNohm,rad, NNvgr, and NNprof are trained independently of the rest of the model on time-independent samples to predict their respective quantities. These “pretrained” models are then integrated into the NSSM, where they are further trained jointly with the rest of the model through the time-dependent process specified in Fig. 9. The AdamW optimizer56 with an exponential-decay learning rate schedule is used for every training run. All NNs used in the dynamics function f, as well as NNvgr, are simple MLPs with GELU57 activations on the hidden layers and tanhClip final activations that constrain their outputs to reasonable ranges. The profile predictor is detailed in the previous subsection. Hyperparameters for the optimizer and model sizes are optimized via Bayesian optimization using the method implemented in the Weights and Biases platform58, which was also used for experiment tracking in this work. The final set of hyperparameters is detailed in Table S2 in the Supplementary Information.
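For readers unfamiliar with the JAX optimization stack, the optimizer setup described above can be expressed with optax as in the sketch below; the learning-rate numbers, decay settings, and stand-in parameters are placeholders rather than the tuned hyperparameters of Table S2.

```python
import jax.numpy as jnp
import optax

params = {"w": jnp.zeros(3)}  # stand-in for the NSSM parameters theta
schedule = optax.exponential_decay(init_value=1e-3, transition_steps=1_000, decay_rate=0.95)
optimizer = optax.adamw(learning_rate=schedule, weight_decay=1e-4)
opt_state = optimizer.init(params)

# One update step, given gradients from the adjoint backward pass:
grads = {"w": jnp.ones(3)}    # placeholder gradients
updates, opt_state = optimizer.update(grads, opt_state, params)
params = optax.apply_updates(params, updates)
```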
Training data distribution
The dataset used for training models in this work consists of the 442 most recent shots with rampdowns that are at least partially complete and have sufficient diagnostic availability, gathered with the Disruption and Event analysis framework for FUSion Experiments59. The initial training phase involved training on 311 shots of data, with the rest of the dataset used for validation. After the initial training phase, the model is further trained on a fine-tuning dataset of 44 shots. During this phase, all of the model weights except those in the τE and τn hybrid confinement laws described above are frozen. As shown in Fig. S8 in the Supplementary Information, the dataset contains only five shots of data anywhere near the relevant HP region.
Reward function
The reward function is designed to balance the priority of achieving a low plasma current and energy against the risk of disrupting the plasma, and is given by:
$$r({{{\mathbf{x}}}}(t),{{{\mathbf{a}}}}(t))= \underbrace{-c_{time}}_{{{{\mbox{Penalty for time}}}}} - \underbrace{c_{W}W_{tot}(t) - c_{I_p}I_{p}(t)}_{{{{\mbox{Penalty for current and energy}}}}} - \underbrace{\sum\limits_{i=1}^{n_{soft}} c_{soft}s_i({{{\mathbf{x}}}}(t))}_{{{{\mbox{Soft chance-constraints}}}}} \\ - \underbrace{\sum\limits_{i=1}^{n_{hard}} c_{hard}h_i({{{\mathbf{x}}}}(t))}_{{{{\mbox{Hard chance-constraints}}}}}$$
(17)
The reward function is active for every time step before hitting the goal state or the maximum allowed training episode time. The goal state is chosen to be a stored energy of 500 J and a plasma current of 40 kA since, for the 170 kA extrapolation test scenario, 40 kA approximately corresponds to the relative plasma current of an ITER 15 MA benign termination, which is defined as 3 MA60. A constant penalty term is active for every time step before achieving the goal to encourage time minimization. In addition, penalty terms that scale with plasma current and energy are included to further prioritize moving towards a safer state. To avoid disruptive limits such as high Greenwald fraction during the rampdown, penalty terms are added for states that violate user-specified constraints on key quantities correlated with disruptions.
One challenge with specifying constraint limits is the difference in severity of violating different constraints, and the sometimes weak correlations between physical quantities and disruptions. To address this issue, we partition constraints into “soft” constraints, which incur a small penalty to discourage, but not forbid, the algorithm from finding solutions that violate these limits, and “hard” constraints, which incur a large penalty to strictly discourage constraint violation. We note that while methods in the constrained optimization literature often mathematically express constraints separately from the objective function being optimized, most practical implementations of constrained optimization algorithms enforce constraints by rewriting them as penalty terms in the objective function61,62, an approach we also adopt. Stochastic optimization across a distribution of outcomes introduces a challenge: trying to avoid limits in every scenario will likely result in excessively slow and conservative solutions63, which poses its own risk. To address this challenge, we utilize chance constraints, a technique often employed in the autonomous driving literature64,65, and only activate a constraint if its violation probability exceeds a certain threshold. In this set of experiments, this threshold is chosen as 5%. Reward function parameters used for the final four shots are shown in Table S2 in the Supplementary Information.
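As an illustration (our own, not the authors' code), a chance-constrained soft penalty of this kind can be implemented by estimating the violation probability across the batch of sampled training environments and only applying the penalty once that probability exceeds the 5% threshold.

```python
import jax.numpy as jnp

def chance_constraint_penalty(values, limit, weight, threshold=0.05):
    # values: the constrained quantity (e.g. Greenwald fraction) across the batch
    # of sampled training environments at one time step.
    violation_prob = jnp.mean(values > limit)            # empirical violation probability
    excess = jnp.maximum(values - limit, 0.0)            # magnitude of each violation
    # Penalty is active only when the violation probability exceeds the threshold.
    return jnp.where(violation_prob > threshold, weight * jnp.mean(excess), 0.0)
```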
Uncertainty model
In experimental reality, the time evolution of plasma dynamics is highly nonlinear and subject to considerable uncertainty, as evidenced by the two same-scenario shots shown in Fig. 8, which begin at drastically different initial conditions. To design trajectories that are robust to large variability and ONEs, we defined an uncertainty model for the RL training environments and sampled from it for each training environment used during training. The uncertainty model includes random variables for both the initial state of the plasma during rampdown and disturbances/model uncertainties that affect the time-varying plasma dynamics. To account for the fact that an accidental H-L back-transition implies the plasma may start in either H- or L-mode, the initial state distribution is modeled as a bimodal mixture model, with a 50% chance of any given RL training environment starting in either H- or L-mode. In some cases, uncertainty distributions could easily be quantified from past experimental data (such as tracking error in the plasma current) or from model prediction accuracy (such as γvgr), but in other cases, the distribution was chosen in an ad-hoc fashion upon identifying additional sources of uncertainty in the experiment. Table S1 in the Supplementary Information summarizes the random variables, parameterized distributions, and quantification methods used in this work. As discussed earlier, this uncertainty model proved to be non-exhaustive in the experiment. In addition, the uncertainty model employed does not account for time-varying fluctuations in uncertain variables; future work should employ time-varying stochastic processes. Both of these limitations further highlight the need to advance experimental uncertainty quantification and robust control in the context of fusion plasma control.
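A minimal sketch of the bimodal initial-state sampling is given below; the per-mode Gaussian parameterization and the function signature are our own illustrative assumptions (the actual distributions are listed in Table S1).

```python
import jax
import jax.numpy as jnp

def sample_initial_state(key, h_mode_mean, l_mode_mean, std):
    # Each training environment starts in H-mode or L-mode with 50% probability.
    key_mode, key_noise = jax.random.split(key)
    in_h_mode = jax.random.bernoulli(key_mode, p=0.5)
    mean = jnp.where(in_h_mode, h_mode_mean, l_mode_mean)
    # Gaussian spread around the selected mode's nominal initial state.
    return mean + std * jax.random.normal(key_noise, shape=mean.shape)
```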
RL methods
Standard RL problems involve optimizing a policy π to map observations to actions:
$${{\bf{a}}}=\pi ({{\bf{o}}})$$
(18)
From this perspective, trajectory optimization can be viewed as policy optimization where the only observable is time:
$${{\bf{a}}}=\pi (t)$$
(19)
Given that time is the only observable, but different physical conditions exist in the parallel training environments that are unobservable to the policy, the reward maximization process yields a trajectory designed to succeed across the different conditions specified in the previous subsection. After an initial trial with Proximal Policy Optimization66, we found OpenAI-ES, an evolutionary strategy (ES) designed for policy optimization, to work better in practice67. This is possibly explained by the theoretical analysis given in the paper introducing OpenAI-ES, which suggests that RL problems with long time horizons and actions that have long-lasting effects may be better solved with ES approaches than with the dominant paradigm of policy gradient methods67. The policy π was parameterized by an MLP with two hidden layers of width 64, using ReLU activations with a hyperbolic tangent final activation to constrain the action space. A hyperparameter sweep of the architecture was not employed and would be worthwhile future work.
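The time-only policy π(t) described above can be sketched as follows. The architecture (two hidden layers of width 64, ReLU hidden activations, tanh output) follows the text; the initialization scheme and action dimension are our own illustrative choices.

```python
import jax
import jax.numpy as jnp

def init_policy(key, action_dim, hidden=64):
    # Two hidden layers of width 64, as stated in the text; scale 0.1 init is illustrative.
    k1, k2, k3 = jax.random.split(key, 3)
    return [
        (0.1 * jax.random.normal(k1, (hidden, 1)), jnp.zeros(hidden)),
        (0.1 * jax.random.normal(k2, (hidden, hidden)), jnp.zeros(hidden)),
        (0.1 * jax.random.normal(k3, (action_dim, hidden)), jnp.zeros(action_dim)),
    ]

def policy(params, t):
    # The only observable is time, so the "observation" is the scalar t.
    z = jnp.array([t])
    for W, b in params[:-1]:
        z = jax.nn.relu(W @ z + b)
    W, b = params[-1]
    return jnp.tanh(W @ z + b)   # tanh bounds the normalized action trajectory
```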
Deployment to TCV
Shape trajectories determined via RL were mapped to last-closed-flux-surface control points by re-scaling the flattop shape for the diverted phase and by using an analytic formula in the TCV MGAMS68 algorithm for the limited phase. Feed-forward coil currents and voltages to achieve the desired plasma current and shaping trajectories were then determined with the free-boundary equilibrium code FBT and the shot preparation algorithm MGAMS68,69, and the PNBI trajectory was programmed into the TCV supervisory control system SAMONE70,71.

