The neural state-space model
Learning dynamical systems from data has been a core discipline within control design for decades, including aircraft flight control20 and simulation21, and has historically been known as system identification22,50. However, due to computational limitations of the time, classical approaches have typically been restricted to linear models, often in the form of linear SSMs:
$$\dot{{{\bf{x}}}}=A{{\bf{x}}}+B{{\bf{a}}}$$
(1)
$${{\bf{o}}}=C{{\bf{x}}}+D{{\bf{a}}}$$
(2)
where A, B, C, and D are the matrices to be learned from datasets of observables, o, actions, a, and, possibly, states, x. We note that the controls literature typically uses the notation y in lieu of o and u in lieu of a, reflecting a difference in notation between the controls and RL communities; here we use RL notation for consistency. In the modern deep-learning era, this idea of learning dynamical systems from data was rediscovered from a different perspective with the advent of the neural differential equation (NDE)18:
$$\dot{{{\bf{x}}}}=N{N}_{\theta }({{\bf{x}}})$$
(3)
where it was discovered that, given datasets of x, a neural network, NNθ, can be used as a system of differential equations that is integrated forward in time with a differential equation solver, and adjoint back-propagation methods can then be used in conjunction with automatic differentiation to determine the gradient of the loss with respect to the network parameters θ16,17,18. The introduction of flexible machine learning frameworks has enabled the development of the field of SciML, based around the core idea of extending NDEs to include physics and other domain-specific structure16,18. One extension that completes the circle with the classical linear SSM is the NSSM, which reintroduces the concepts of actions and observations:
$$\dot{{{\bf{x}}}}(t)={f}_{\theta }({{\bf{x}}},{{\bf{a}}})$$
(4)
$${{\bf{o}}}(t)={O}_{\theta }({{\bf{x}}},{{\bf{a}}})$$
(5)
Thanks to the power of new, highly flexible machine learning frameworks such as JAX and the Julia SciML ecosystem, fθ and Oθ can be programmed to include arbitrary combinations of neural networks, physics formulas, and even classical data-driven models such as power laws, a capability we exploit in this work. The training process of an NSSM is shown in Fig. 9 and involves simulating the NSSM forward in time using an initial state x0 and a time series of actions a0:T from an experimental database. The error of the simulation results against the experimental data is computed, and adjoint methods and automatic differentiation are used to determine the gradient needed to reduce the loss. In this work, the differential equation solver package diffrax17 is used; it integrates multiple adjoint methods with the JAX automatic differentiation system, allowing backpropagation through all differential equation solvers in the package.

The NSSM, defined by the dynamics function fθ and observation function Oθ with parameters θ, is simulated forward in time, given an initial state x0 and an action trajectory a0:T, to generate a sequence of simulated observations, \({\hat{{{\bf{o}}}}}_{0:T}\). The simulated observations are compared with experimental observations via the loss functional \({{\mathcal{L}}}\), which is defined as the time-integrated value of an instantaneous loss function l. Adjoint methods in diffrax17 and JAX automatic differentiation then yield the gradient of the loss with respect to the model parameters, \({\nabla }_{\theta }{{\mathcal{L}}}\), which allows the optimizer to update the parameters θ.
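To make this training step concrete, the following is a minimal sketch, not the authors' implementation, of simulating an NSSM through diffrax and differentiating the resulting loss with JAX. The toy parameter layout, the `mlp` helper, and the array shapes for the action and observation trajectories are our own illustrative assumptions.

```python
import jax
import jax.numpy as jnp
import diffrax

def mlp(params, z):
    # Tiny MLP: params is a list of (W, b) pairs; GELU hidden activations.
    for W, b in params[:-1]:
        z = jax.nn.gelu(W @ z + b)
    W, b = params[-1]
    return W @ z + b

def simulate(theta, x0, t_grid, a_traj):
    # Interpolate the recorded action trajectory so it can be queried at solver times.
    actions = diffrax.LinearInterpolation(ts=t_grid, ys=a_traj)

    def f(t, x, args):
        # Dynamics function f_theta(x, a): a pure neural net here, purely for illustration.
        a = actions.evaluate(t)
        return mlp(theta["dynamics"], jnp.concatenate([x, a]))

    sol = diffrax.diffeqsolve(
        diffrax.ODETerm(f), diffrax.Tsit5(),
        t0=t_grid[0], t1=t_grid[-1], dt0=1e-3, y0=x0,
        saveat=diffrax.SaveAt(ts=t_grid),
    )
    # Observation function O_theta(x, a) applied along the simulated trajectory.
    obs = lambda x, a: mlp(theta["observation"], jnp.concatenate([x, a]))
    return jax.vmap(obs)(sol.ys, a_traj)

def loss(theta, x0, t_grid, a_traj, o_traj):
    # Time-averaged instantaneous loss l between simulated and experimental observations.
    o_hat = simulate(theta, x0, t_grid, a_traj)
    return jnp.mean((o_hat - o_traj) ** 2)

# Gradients flow back through the ODE solve via diffrax's adjoint machinery.
grad_fn = jax.jit(jax.grad(loss))
```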
The dynamics function fθ(x, a)
We begin by defining the following confinement laws:
$${\tau }_{n,pred}({{\bf{x}}},{{\bf{a}}})= {c}_{n}{I}_{p}^{{c}_{I,n}}{\bar{n}}_{e,20}^{{c}_{n,n}}{P}_{input}^{{c}_{P,n}}{\kappa }^{{c}_{\kappa,n}}{\epsilon }^{{c}_{\epsilon,n}}| {\dot{I}}_{p}{| }^{{c}_{{\dot{I}}_{p},n}}N{N}_{conf,0}({{\bf{x}}},{{\bf{a}}})\\ {({c}_{n,h}{\bar{n}}_{e,20}^{{c}_{n,n,h}}{P}_{input}^{{c}_{P,n,h}})}^{{\mbox{hmode}}({{\bf{x}}},{{\bf{a}}})}$$
(6)
$$\tau_{E,pred}({{{\mathbf{x}}}},{{{\mathbf{a}}}})= \underbrace{c_{E}I_p^{c_{I,E}}{{\bar{n}}}_{e,20}^{c_{n,E}}P_{input}^{c_{P,E}}\kappa^{c_{\kappa,E}}\epsilon^{c_{\epsilon,E}}|\dot{I}_p|^{c_{\dot{I}_p,E}}}_{{{{\mbox{L-mode}}}}}\underbrace{NN_{conf,1}({{{\mathbf{x}}}},{{{\mathbf{a}}}})}_{{{{\mbox{NN correction}}}}} \\ \underbrace{(c_{E,h}{{\bar{n}}}_{e,20}^{c_{n,E,h}}P_{input}^{c_{P,E,h}})^{{{{\mbox{hmode}}}}({{{\mathbf{x}}}},{{{\mathbf{a}}}})}}_{{{{\mbox{H-mode correction}}}}}$$
(7)
$$\,{{\mbox{hmode}}}\,({{\bf{x}}},{{\bf{a}}})=\,{{\mbox{tanhHeaviside}}}\,({P}_{input}-{c}_{h}{\bar{n}}_{e,20}^{{c}_{h,n}}{a}_{minor}^{{c}_{h,a}})$$
(8)
$$\,{\mbox{tanhClip}}\,(x)\equiv \tanh \left(\frac{2k}{\,{\mbox{max}}-{\mbox{min}}}(x-{\mbox{center}})\right)\frac{{\mbox{max}}-{\mbox{min}}}{2}+{\mbox{center}}\,$$
(9)
$$\,{{\mbox{tanhHeaviside}}}\,(x)\equiv \frac{1}{2}(\tanh (kx)+1)$$
(10)
where the parameters to be learned include all coefficients c* and the neural network parameters. The laws are structured as the product of an L-mode term, a neural network correction factor, and an H-mode correction factor. The L-mode term reflects standard confinement scalings, but with the introduction of an \(| {\dot{I}}_{p}|\) term, which was found to better capture the short-term effects of ramping plasma current. The neural network output includes a tanhClip final activation that constrains its output to the range [0.75, 1.25], thus controlling the maximum adjustment the network is allowed to make. The hmode function includes a tanhHeaviside function, which provides a smooth transition from one to zero once Pinput falls below the learned back-transition threshold, which is structured to reflect the Martin scaling51. Note that the use of the hmode function output as an exponent deactivates the H-mode correction term once hmode transitions from one to zero. While, in principle, the neural network should be able to learn the effects of H-mode implicitly, we found that adding an explicit H-mode correction factor helped improve model predictions in our low-data regime. The k factor controls the smoothness of both the tanhClip and tanhHeaviside functions.
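For reference, the bounding activations of Eqs. (9) and (10) can be written compactly; the snippet below is a minimal sketch (function names and default arguments are ours, not the paper's code), using the [0.75, 1.25] clip range quoted above for the confinement-law correction factor.

```python
import jax.numpy as jnp

def tanh_clip(x, lo=0.75, hi=1.25, k=1.0):
    # Eq. (9): smoothly squashes x into (lo, hi); k sets the steepness near the center.
    center = 0.5 * (lo + hi)
    return jnp.tanh(2.0 * k / (hi - lo) * (x - center)) * (hi - lo) / 2.0 + center

def tanh_heaviside(x, k=1.0):
    # Eq. (10): smooth step, ~0 for x << 0 and ~1 for x >> 0; used as the H-mode gate.
    return 0.5 * (jnp.tanh(k * x) + 1.0)
```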
These confinement laws are integrated into the following 0D energy and particle balance equations, a model that blends simple physics principles, power laws, and neural networks:
$$\frac{dW_{tot}}{dt}= -\underbrace{\frac{W_{tot}}{\tau_{E,pred}}}_{{{{\mbox{Transport}}}}}+\underbrace{I_p^2NN_{ohm,rad,0}({{{\mathbf{x}}}},{{{\mathbf{a}}}})}_{{{{\mbox{Ohmic Heating}}}}} - \underbrace{{{\bar{n}}}_{e,20}VNN_{ohm,rad,1}({{{\mathbf{x}}}},{{{\mathbf{a}}}})}_{{{{\mbox{Radiated Power}}}}} \\ +\underbrace{P_{NBI}+P_{ECRH}}_{{{{\mbox{Aux. Heating}}}}}$$
(11)
$$\frac{d({{\bar{n}}}_{e,20}V)}{dt}= -\underbrace{\frac{{{\bar{n}}}_{e,20}}{\tau_{n,pred}}}_{{{{\mbox{Transport}}}}}+\underbrace{c_{NBI}P_{NBI}}_{{{{\mbox{NBI Fueling}}}}}+\underbrace{c_{gas,0}\sigma(c_{gas,1}V_{gas}+c_{gas,2})}_{{{{\mbox{Gas Valve Fueling}}}}} \\ +\underbrace{NN_{wall}({{{\mathbf{x}}}},{{{\mathbf{a}}}}){e}^{-c_{wall}g_{HFS}}}_{{{{\mbox{Wall Effects}}}}}$$
(12)
where σ is the sigmoid function, NNohm,rad is a network that predicts two quantities; the first is multiplied by \({I}_{p}^{2}\) to serve as an Ohmic heating term, and the second is multiplied by density and volume to serve as the radiated power term. NNwall is included to account for possible wall fueling effects when in a limited configuration, and is multiplied by an exponential in the HFS gap to deactivate it when diverted. Additional simple constants are included to account for fueling from both NBI and gas puffing. We note that, in both cases, the included terms do not capture important state dependencies and time delays, but they proved sufficient for this use case. The dynamics of density times volume are predicted; in cases where density itself is used (e.g., to compute the Greenwald Fraction), the following volume approximation is used to recover density:
$$V\approx 2\pi {R}^{2}{\epsilon }^{2}\kappa \left(\pi -\left(\pi -\frac{8}{3}\right)\epsilon \right)$$
(13)
Since time derivatives of quantities, \({\dot{I}}_{p},\dot{\kappa },{\dot{a}}_{minor},\dot{\delta }\) are used as actions, their integrated values are also added as state variables with trivial dynamics.
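As an illustration of how the balance equations blend physics and learned terms, the sketch below restates the stored-energy balance of Eq. (11) and the volume approximation of Eq. (13). It is not the authors' implementation; the function names and argument layout are our own, and the NN outputs are passed in as precomputed scalars.

```python
import jax.numpy as jnp

def volume_approx(R, eps, kappa):
    # Eq. (13): approximate plasma volume from major radius, inverse aspect ratio, and elongation.
    return 2.0 * jnp.pi * R**2 * eps**2 * kappa * (jnp.pi - (jnp.pi - 8.0 / 3.0) * eps)

def dW_dt(W_tot, ne20, V, Ip, P_nbi, P_ecrh, tau_E_pred, nn_ohm, nn_rad):
    # Eq. (11): transport loss + Ohmic heating - radiated power + auxiliary heating,
    # where nn_ohm and nn_rad are the two outputs of NN_ohm,rad evaluated at (x, a).
    return (-W_tot / tau_E_pred
            + Ip**2 * nn_ohm
            - ne20 * V * nn_rad
            + P_nbi + P_ecrh)
```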
The observation function Oθ(x, a)
The observation function consists of several components: a NN predictor for γvgr, a profile predictor, and simple physics formulae to compute derived quantities:
$${O}_{\theta }({{\bf{x}}},{{\bf{a}}})\left\{\begin{array}{ll}{\beta }_{p}=\frac{8}{3}\frac{{W}_{tot}}{{\mu }_{0}{R}_{0}{I}_{p}^{2}}\quad \hfill &(14{{\rm{c}}})\\ {f}_{GW}=\frac{{\bar{n}}_{e,20}\pi {a}_{minor}^{2}}{{I}_{p,MA}} \hfill\quad &(14{{\rm{d}}})\\ {q}_{95}=\frac{4.1{a}_{minor}^{2}{B}_{0}}{{R}_{0}{I}_{p,MA}}\left(1+1.2(\kappa -1)+0.56{(\kappa -1)}^{2}\right) \left(1+0.09\delta+0.16{\delta }^{2}\right)\frac{1+0.45\delta \epsilon }{1-0.74\epsilon }\quad &(14{{\rm{e}}})\\ {\gamma }_{vgr}=N{N}_{vgr}({{\bf{x}}},{{\bf{a}}})\quad \hfill &(14{{\rm{f}}})\\ {T}_{e}(\rho ),{N}_{e}(\rho )=N{N}_{prof}({{\bf{x}}},{{\bf{a}}})\quad \hfill &(14{{\rm{g}}})\end{array}\right.$$
where βp is computed in accordance with the LIUQE definition52, fGW is the usual Greenwald Fraction9, q95 is the approximation given in ref. 53 with the squareness factor set to 1, NNvgr is a multi-layer perceptron (MLP) with GELU activation and a scaled sigmoid output, and NNprof is a neural-operator-based profile predictor, discussed further in the next subsection.
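The closed-form components of the observation function are simple enough to restate directly; the following sketch transcribes Eqs. (14c)-(14e) (function names and argument conventions are ours).

```python
import jax.numpy as jnp

def beta_p(W_tot, R0, Ip):
    # Eq. (14c): LIUQE-style poloidal beta from stored energy, major radius, and plasma current.
    mu0 = 4e-7 * jnp.pi
    return (8.0 / 3.0) * W_tot / (mu0 * R0 * Ip**2)

def f_GW(ne20, a_minor, Ip_MA):
    # Eq. (14d): Greenwald fraction from line-averaged density (1e20 m^-3), minor radius, and Ip in MA.
    return ne20 * jnp.pi * a_minor**2 / Ip_MA

def q95(a_minor, B0, R0, Ip_MA, kappa, delta, eps):
    # Eq. (14e): shape-corrected q95 approximation, squareness factor set to 1.
    shape = ((1 + 1.2 * (kappa - 1) + 0.56 * (kappa - 1) ** 2)
             * (1 + 0.09 * delta + 0.16 * delta**2)
             * (1 + 0.45 * delta * eps) / (1 - 0.74 * eps))
    return 4.1 * a_minor**2 * B0 / (R0 * Ip_MA) * shape
```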
Neural-operator-based profile predictor
Prior work at NSTX-U trained a neural network to successfully predict kinetic profile shapes using their averages plus zero-dimensional control parameters such as plasma current, shaping, and auxiliary heating. Building upon this prior work, we show that, on TCV data, kinetic profiles can be predicted to reasonable accuracy with a neural network using the stored energy Wtot, line-averaged electron density \({\bar{n}}_{e,20}\), and control parameters. The key implication is that accurate prediction of the time-dependent dynamics of just two scalars, Wtot and \({\bar{n}}_{e,20}\), implies reasonable prediction of the dynamics of kinetic profiles.
We leverage methods developed by the neural operator54,55 literature, which has found success for solving machine learning problems in scientific domains involving PDEs. Letting fin denote an input function and fout denote an output function, a neural operator \({{\mathcal{F}}}\) parameterized by θ maps an input function to an output function:
$${f}_{out}={{{\mathcal{F}}}}_{\theta }({f}_{in})$$
(15)
In practice, the functions involved are approximated using a set of basis functions; thus, the practical implementation results in a neural network operating on basis function coefficients. In this work, we make use of cubic B-spline basis functions to represent the kinetic profiles:
$${T}_{e}(\rho )={\sum }_{i=1}^{{n}_{basis}}{c}_{T,i}{B}_{i,3}(\rho )\quad {n}_{e,20}(\rho )={\sum }_{i=1}^{{n}_{basis}}{c}_{n,i}{B}_{i,3}(\rho )$$
(16)
We predict these profiles using a set of 0D scalars, all of which are control parameters except the stored energy Wtot and \({\bar{n}}_{e,20}\). The full set of input and output parameters is specified in Table 1. During training, the basis functions are evaluated on the ρ grid corresponding to the dataset, but arbitrary alternative grids can be used at inference time.
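The reconstruction step of Eq. (16) amounts to summing cubic B-spline basis functions weighted by the predicted coefficients. Below is a minimal sketch of that step; the knot layout, n_basis = 8, and the random stand-in coefficients are illustrative assumptions, not the values used in this work.

```python
import numpy as np
from scipy.interpolate import BSpline

n_basis, degree = 8, 3
# Clamped knot vector on rho in [0, 1] yielding n_basis cubic basis functions.
interior = np.linspace(0.0, 1.0, n_basis - degree + 1)
knots = np.concatenate([np.zeros(degree), interior, np.ones(degree)])

c_T = np.random.default_rng(0).uniform(size=n_basis)   # stand-in for NN_prof coefficient output
rho = np.linspace(0.0, 1.0, 101)                        # arbitrary inference-time grid
T_e = BSpline(knots, c_T, degree)(rho)                  # T_e(rho) = sum_i c_{T,i} B_{i,3}(rho)
```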
Training methods
Training of the NSSM involves two stages. First, NNohm,rad, NNvgr, and NNprof are trained independently of the rest of the model on time-independent samples to predict their respective quantities. These “pretrained” models are then integrated into the NSSM, where they are further trained jointly with the rest of the model through the time-dependent process specified in Fig. 9. The AdamW optimizer56 with an exponential-decay learning rate schedule is used for every training run. All NNs used in the dynamics function f, as well as NNvgr, are simple MLPs with GELU57 activations on the hidden layers and tanhClip final activations that constrain their outputs to reasonable ranges. The profile predictor is detailed in the previous subsection. Hyperparameters for the optimizer and model sizes are optimized via Bayesian optimization using the method implemented in the Weights and Biases platform58, which was also used for experiment tracking in this work. The final set of hyperparameters is detailed in Table S2 in the Supplementary Information.
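For readers unfamiliar with the JAX optimization stack, the optimizer setup described above can be expressed with optax as in the sketch below; the learning-rate numbers, decay settings, and stand-in parameters are placeholders rather than the tuned hyperparameters of Table S2.

```python
import jax.numpy as jnp
import optax

params = {"w": jnp.zeros(3)}  # stand-in for the NSSM parameters theta
schedule = optax.exponential_decay(init_value=1e-3, transition_steps=1_000, decay_rate=0.95)
optimizer = optax.adamw(learning_rate=schedule, weight_decay=1e-4)
opt_state = optimizer.init(params)

# One update step, given gradients from the adjoint backward pass:
grads = {"w": jnp.ones(3)}    # placeholder gradients
updates, opt_state = optimizer.update(grads, opt_state, params)
params = optax.apply_updates(params, updates)
```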
Training data distribution
The dataset used for training models in this work consists of the 442 most recent shots with rampdowns that are at least partially complete and have sufficient diagnostic availability, gathered with the Disruption and Event analysis framework for FUSion Experiments59. The initial training phase involved training on 311 shots of data, with the rest of the dataset used for validation. After the initial training phase, the model is further trained on a fine-tuning dataset of 44 shots. During this phase, all of the model weights except those in the τE and τn hybrid confinement laws described above are frozen. As shown in Fig. S8 in the Supplementary Information, the dataset contains only five shots of data anywhere near the relevant HP region.
Reward function
The reward function is designed to balance the priority of achieving a low plasma current and energy against the risk of disrupting the plasma, and is given by:
$$r({{{\mathbf{x}}}}(t),{{{\mathbf{a}}}}(t))= \underbrace{-c_{time}}_{{{{\mbox{Penalty for time}}}}} - \underbrace{c_{W}W_{tot}(t) - c_{I_p}I_{p}(t)}_{{{{\mbox{Penalty for current and energy}}}}} - \underbrace{\sum\limits_{i=1}^{n_{soft}} c_{soft}s_i({{{\mathbf{x}}}}(t))}_{{{{\mbox{Soft chance-constraints}}}}} \\ - \underbrace{\sum\limits_{i=1}^{n_{hard}} c_{hard}h_i({{{\mathbf{x}}}}(t))}_{{{{\mbox{Hard chance-constraints}}}}}$$
(17)
The reward function is active for every time step before hitting the goal state or the maximum allowed training episode time. The goal state is chosen to be a stored energy of 500 J and a plasma current of 40 kA since, for the 170 kA extrapolation test scenario, 40 kA approximately corresponds to the relative plasma current of an ITER 15 MA benign termination, which is defined as 3 MA60. A constant penalty term is active for every time step before achieving the goal to encourage time minimization. In addition, penalty terms that scale with plasma current and energy are included to further prioritize moving towards a safer state. To avoid disruptive limits such as high Greenwald fraction during the rampdown, penalty terms are added for states that violate user-specified constraints on key quantities correlated with disruptions.
One challenge with specifying constraint limits is the difference in severity of violating different constraints, and the sometimes weak correlations between physical quantities and disruptions. To address this issue, we partition constraints into “soft” constraints, which incur a small penalty to discourage, but not forbid, the algorithm from finding solutions that violate these limits, and “hard” constraints, which incur a large penalty to strictly discourage constraint violation. We note that while methods in the constrained optimization literature often mathematically express constraints separately from the objective function being optimized, most practical implementations of constrained optimization algorithms enforce constraints by rewriting them as penalty terms in the objective function61,62, an approach we also adopt. Stochastic optimization across a distribution of outcomes introduces a challenge: trying to avoid limits in every scenario will likely result in excessively slow and conservative solutions63, which poses its own risk. To address this challenge, we utilize chance constraints, a technique often employed in the autonomous driving literature64,65, and only activate a constraint if its violation probability exceeds a certain threshold. In this set of experiments, this threshold is chosen as 5%. Reward function parameters used for the final four shots are shown in Table S2 in the Supplementary Information.
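As an illustration (our own, not the authors' code), a chance-constrained soft penalty of this kind can be implemented by estimating the violation probability across the batch of sampled training environments and only applying the penalty once that probability exceeds the 5% threshold.

```python
import jax.numpy as jnp

def chance_constraint_penalty(values, limit, weight, threshold=0.05):
    # values: the constrained quantity (e.g. Greenwald fraction) across the batch
    # of sampled training environments at one time step.
    violation_prob = jnp.mean(values > limit)            # empirical violation probability
    excess = jnp.maximum(values - limit, 0.0)            # magnitude of each violation
    # Penalty is active only when the violation probability exceeds the threshold.
    return jnp.where(violation_prob > threshold, weight * jnp.mean(excess), 0.0)
```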
Uncertainty model
In experimental reality, the time evolution of plasma dynamics is highly nonlinear and subject to considerable uncertainty, as evidenced by the two same-scenario shots shown in Fig. 8, which begin at drastically different initial conditions. To design trajectories that are robust to large variability and ONEs, we defined an uncertainty model for the RL training environments and sampled from it for each training environment used during training. The uncertainty model includes random variables for both the initial state of the plasma during rampdown and disturbances/model uncertainties that affect the time-varying plasma dynamics. To account for the fact that an accidental H-L back-transition implies the plasma may start in either H- or L-mode, the initial state distribution is modeled as a bimodal mixture model, with a 50% chance of any given RL training environment starting in either H- or L-mode. In some cases, uncertainty distributions could easily be quantified from past experimental data (such as tracking error in the plasma current) or from model prediction accuracy (such as γvgr), but in other cases, the distribution was chosen in an ad-hoc fashion upon identifying additional sources of uncertainty in the experiment. Table S1 in the Supplementary Information summarizes the random variables, parameterized distributions, and quantification methods used in this work. As discussed earlier, this uncertainty model proved to be non-exhaustive in the experiment. In addition, the uncertainty model employed does not account for time-varying fluctuations in uncertain variables; future work should employ time-varying stochastic processes. Both of these limitations further highlight the need to advance experimental uncertainty quantification and robust control in the context of fusion plasma control.
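A minimal sketch of the bimodal initial-state sampling is given below; the per-mode Gaussian parameterization and the function signature are our own illustrative assumptions (the actual distributions are listed in Table S1).

```python
import jax
import jax.numpy as jnp

def sample_initial_state(key, h_mode_mean, l_mode_mean, std):
    # Each training environment starts in H-mode or L-mode with 50% probability.
    key_mode, key_noise = jax.random.split(key)
    in_h_mode = jax.random.bernoulli(key_mode, p=0.5)
    mean = jnp.where(in_h_mode, h_mode_mean, l_mode_mean)
    # Gaussian spread around the selected mode's nominal initial state.
    return mean + std * jax.random.normal(key_noise, shape=mean.shape)
```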
RL methods
Standard RL problems involve optimizing a policy π to map observations to actions:
$${{\bf{a}}}=\pi ({{\bf{o}}})$$
(18)
From this perspective, trajectory optimization can be viewed as policy optimization where the only observable is time:
$${{\bf{a}}}=\pi (t)$$
(19)
Given that time is the only observable, but different physical conditions exist in the parallel training environments that are unobservable to the policy, the reward maximization process yields a trajectory designed to succeed across the different conditions specified in the previous subsection. After an initial trial with Proximal Policy Optimization66, we found OpenAI-ES, an evolutionary strategy (ES) designed for policy optimization, to work better in practice67. This is possibly explained by the theoretical analysis given in the paper introducing OpenAI-ES, which suggests that RL problems with long time horizons and actions that have long-lasting effects may be better solved with ES approaches than with the dominant paradigm of policy gradient methods67. The policy π was parameterized by an MLP with two hidden layers of width 64, using ReLU activations with a hyperbolic tangent final activation to constrain the action space. A hyperparameter sweep of the architecture was not employed and would be worthwhile future work.
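The time-only policy π(t) described above can be sketched as follows. The architecture (two hidden layers of width 64, ReLU hidden activations, tanh output) follows the text; the initialization scheme and action dimension are our own illustrative choices.

```python
import jax
import jax.numpy as jnp

def init_policy(key, action_dim, hidden=64):
    # Two hidden layers of width 64, as stated in the text; scale 0.1 init is illustrative.
    k1, k2, k3 = jax.random.split(key, 3)
    return [
        (0.1 * jax.random.normal(k1, (hidden, 1)), jnp.zeros(hidden)),
        (0.1 * jax.random.normal(k2, (hidden, hidden)), jnp.zeros(hidden)),
        (0.1 * jax.random.normal(k3, (action_dim, hidden)), jnp.zeros(action_dim)),
    ]

def policy(params, t):
    # The only observable is time, so the "observation" is the scalar t.
    z = jnp.array([t])
    for W, b in params[:-1]:
        z = jax.nn.relu(W @ z + b)
    W, b = params[-1]
    return jnp.tanh(W @ z + b)   # tanh bounds the normalized action trajectory
```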
Deployment to TCV
Shape trajectories determined via RL were mapped to last-closed-flux-surface control points by re-scaling the flattop shape for the diverted phase and by using an analytic formula in the TCV MGAMS68 algorithm for the limited phase. Feed-forward coil currents and voltages to achieve the desired plasma current and shaping trajectories were then determined with the free-boundary equilibrium code FBT and the shot preparation algorithm MGAMS68,69, and the PNBI trajectory was programmed into the TCV supervisory control system SAMONE70,71.

