## Abstract

Physics-constrained machine learning is emerging as an important topic in the field of machine learning for physics. One of the most significant advantages of incorporating physics constraints into machine learning methods is that the resulting model requires significantly less data to train. By incorporating physical rules into the machine learning formulation itself, the predictions are expected to be physically plausible. The Gaussian process (GP) is perhaps one of the most common machine learning methods for small datasets. In this paper, we investigate the possibility of constraining a GP formulation with monotonicity on three different material datasets, one experimental and two computational. The monotonic GP is compared against the regular GP, and a significant reduction in the posterior variance is observed. The monotonic GP is strictly monotonic in the interpolation regime, but in the extrapolation regime, the monotonic effect fades as one goes beyond the training dataset. Imposing monotonicity on the GP comes at a small accuracy cost compared to the regular GP. The monotonic GP is perhaps most useful in applications where data are scarce and noisy, and monotonicity is supported by strong physical evidence.

## 1 Introduction

Physics-constrained machine learning is an important toolbox in the machine learning era, with many possible applications, including those in biology, materials science, astrophysics, and finance. Breakthroughs in deep learning have been recognized as some of the most revolutionary contributions to science, but its Achilles heel is the size of the dataset required to train huge deep learning architectures.

However, this requirement cannot always be satisfied, especially in fields involving experimentation, where data can be resource-intensive in terms of both time and labor. In this context, physics-constrained machine learning arises as an alternative solution, with equivalent or better accuracy, while requiring significantly less data to train. By incorporating some fundamental physics rules into the machine learning framework, the behavior of the machine learning predictions becomes more physically plausible. One common constraint is monotonicity, where the predictions are expected to behave monotonically with respect to one or several input variables.

Materials science is a field where theory has flourished over the last several decades and centuries. The process–structure–property relationship is perhaps one of the most well-studied topics in terms of theoretical analysis. Well before the computer era, materials scientists tended to agree that mathematics and physics are the right tools to bridge the gap between inputs and quantities of interest. With the emergence of computers, supercomputers, and subsequently deep learning, the main tool has changed significantly, starting with the Materials Genome Initiative [1]. However, brute-force applications of deep learning have suffered from the lack of data in materials science, where physics-constrained machine learning seems to be a more promising candidate. Furthermore, materials science theory suggests that there are many simple physical rules that one can take advantage of when constructing a machine learning framework, for example, the Hall–Petch relationship [2]. It should be noted that many of these rules or simple equations involve monotonicity. If these rules are successfully exploited, then physics-constrained machine learning could be achieved with much less data, while retaining the accuracy performance.

The Gaussian process (GP) stands out among other machine learning approaches and has been one of the most popular methods for small datasets across multiple disciplines. In particular, it is well known that GPs are often used as underlying surrogate models for Bayesian optimization, which is an efficient active machine learning method and very popular among materials scientists. For example, Tallman et al. [3,4] used a GP as a surrogate model to bridge microstructure crystallography and homogenized materials properties, where crystal plasticity finite element models are used to construct the database, and Yabansu et al. [5] applied GPs to capture the evolution of microstructure statistics. Tran and Wildey [6,7] also used a GP as a surrogate model for structure–property relationship and solved a stochastic inverse problem using the acceptance–rejection sampling method. More examples can be found in Refs. [8,9].

One of the main advantages of using a GP is its flexibility and, therefore, its ability to adopt extensions. Fernández-Godino et al. [10] proposed a novel approach to constrain symmetry on GPs. Linear operator constraints can be enforced through a novel covariance function, as demonstrated by Jidling et al. [11]. Agrell [12] presented an approach to constrain GPs such that a set of linear transformations of the process are bounded. Furthermore, Lange-Hegermann [13] constrained multi-output GP priors to the solution set of a system of linear differential equations subjected to boundary conditions. A comprehensive survey study of constrained GPs, including, but not limited to, bound constraints, non-negativity, monotonicity, linear partial differential operator constraints, and boundary conditions of a partial differential equation can be found in Swiler et al. [14].

The present work focuses on bound-type constraints. Riihimäki and Vehtari [15] proposed enforcing monotonicity to constrain GPs using binary classification, where the derivative is classified as either monotonically increasing or decreasing with respect to a set of specific input variables. Golchi et al. [16] extended this approach in a deterministic computer experiments setting and sampled from the exact joint posterior distribution rather than an approximation. More recently, Ustyuzhaninov et al. [17] presented a monotonicity constraint in a Bayesian nonparametric setting based on numerical solutions of stochastic differential equations, and Pensoneault et al. [18] presented a novel approach to construct a non-negative GP. Tan [19] proposed a B-spline model that linearly constrains the model coefficients to achieve monotonicity, which converges to the projection of the GP in *L*_{2}. Chen et al. [20] proposed an optimal recovery method that is equivalent to maximum a posteriori estimation for a GP constrained by a partial differential equation.

In this paper, we adopt the monotonic GP formulation from Riihimäki and Vehtari [15] and apply it to three different datasets, in both interpolation and extrapolation regimes, to survey the advantages and disadvantages of the monotonic GP relative to the regular GP. We focus on experimental and computational materials science applications where monotonicity is physically expected. Materials science, specifically experimental materials science, is well known for data scarcity due to its resource-intensive experiments. An overview of Gaussian processes and the incorporation of monotonicity constraints therein is given in Sec. 2, followed by two illustrative numerical examples in Sec. 3. Three case studies comparing the classical and monotonic GPs are given in Secs. 4–6. An overall discussion and conclusion are given in Sec. 7.

## 2 Overview of Gaussian Processes

### 2.1 Classical Gaussian Processes.

A thorough mathematical explanation and derivation of Gaussian processes can be found in the canonical text by Rasmussen and Williams [21], but we provide a brief overview here.

Consider a training dataset $\mathbf{X}=\{\mathbf{x}^{(i)}\}_{i=1}^{N}$ with noisy observations **y**, where $\mathbf{y}(\mathbf{x})=\mathbf{f}(\mathbf{x})+\epsilon$ and $\epsilon \sim \mathcal{N}(0,\sigma_n^2)$. We model the latent function **f**, which underlies **y**, with a zero-mean GP,

$$f(\mathbf{x}) \sim \mathcal{GP}\left(0, k(\mathbf{x}, \mathbf{x}')\right)$$

The entries of the covariance matrix $\mathbf{K}_{f,f}$, with $[\mathbf{K}_{f,f}]_{ij}=k(\mathbf{x}^{(i)},\mathbf{x}^{(j)})$, are defined by a covariance function. Although there are many covariance functions to choose from, in this paper, we focus on the squared exponential covariance function

$$k(\mathbf{x},\mathbf{x}') = \sigma_f^2 \exp\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(x_d - x'_d)^2}{\ell_d^2}\right)$$

where $\sigma_f^2$ is the signal variance and $\ell_d$ is the characteristic length-scale along dimension $d$. The hyper-parameters $(\sigma_f^2,\ell_1,\ldots,\ell_D,\sigma_n^2)$ maximizing the log-marginal likelihood of **y** are then given by

$$\log p(\mathbf{y}\,|\,\mathbf{X}) = -\frac{1}{2}\mathbf{y}^\top\left(\mathbf{K}_{f,f}+\sigma_n^2\mathbf{I}\right)^{-1}\mathbf{y} - \frac{1}{2}\log\left|\mathbf{K}_{f,f}+\sigma_n^2\mathbf{I}\right| - \frac{N}{2}\log 2\pi$$

At a testing point $\mathbf{x}_*$, the posterior prediction distribution, also called the testing distribution, is a Gaussian distribution, where the posterior mean and posterior variance are, respectively,

$$\mu(\mathbf{x}_*) = \mathbf{k}_*^\top\left(\mathbf{K}_{f,f}+\sigma_n^2\mathbf{I}\right)^{-1}\mathbf{y}$$

$$\sigma^2(\mathbf{x}_*) = k(\mathbf{x}_*,\mathbf{x}_*) - \mathbf{k}_*^\top\left(\mathbf{K}_{f,f}+\sigma_n^2\mathbf{I}\right)^{-1}\mathbf{k}_*$$

where $\mathbf{k}_* = [k(\mathbf{x}_*,\mathbf{x}^{(1)}),\ldots,k(\mathbf{x}_*,\mathbf{x}^{(N)})]^\top$. Because the squared exponential covariance function is infinitely differentiable, the covariances involving derivatives follow in closed form; for dimensions $g$ and $h$,

$$\frac{\partial^2 k(\mathbf{x},\mathbf{x}')}{\partial x_g\,\partial x'_h} = \frac{k(\mathbf{x},\mathbf{x}')}{\ell_g^2}\left(\delta_{g,h} - \frac{(x_g - x'_g)(x_h - x'_h)}{\ell_h^2}\right)$$

where $\delta_{g,h}=1$ if $g = h$ and 0 otherwise. At a testing point $\mathbf{x}_*$, the derivatives with respect to dimension $d$ of the posterior mean and posterior variance are, respectively,

$$\frac{\partial \mu(\mathbf{x}_*)}{\partial x_{*,d}} = \left(\frac{\partial \mathbf{k}_*}{\partial x_{*,d}}\right)^\top\left(\mathbf{K}_{f,f}+\sigma_n^2\mathbf{I}\right)^{-1}\mathbf{y}$$

$$\frac{\partial \sigma^2(\mathbf{x}_*)}{\partial x_{*,d}} = -2\left(\frac{\partial \mathbf{k}_*}{\partial x_{*,d}}\right)^\top\left(\mathbf{K}_{f,f}+\sigma_n^2\mathbf{I}\right)^{-1}\mathbf{k}_*$$

noting that the self-covariance term $k(\mathbf{x}_*,\mathbf{x}_*)=\sigma_f^2$ is constant for the squared exponential kernel and does not contribute to the derivative.
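The classical GP posterior can be sketched in a few lines of NumPy. The snippet below is a minimal illustration with fixed hyper-parameters and a single shared length-scale (no marginal-likelihood optimization), not the toolbox implementation used in this paper.

```python
import numpy as np

def sq_exp_kernel(X1, X2, sig_f=1.0, ell=1.0):
    """Squared exponential covariance with a single shared length-scale ell."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return sig_f**2 * np.exp(-0.5 * d2 / ell**2)

def gp_posterior(X, y, X_star, sig_f=1.0, ell=1.0, sig_n=1e-3):
    """Zero-mean GP posterior mean and (marginal) variance at test points X_star."""
    K = sq_exp_kernel(X, X, sig_f, ell) + sig_n**2 * np.eye(len(X))
    K_s = sq_exp_kernel(X_star, X, sig_f, ell)        # cross-covariance k_*
    K_ss = sq_exp_kernel(X_star, X_star, sig_f, ell)
    mean = K_s @ np.linalg.solve(K, y)
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mean, np.diag(cov)

# toy monotone data: with a small noise level, the posterior mean interpolates
X = np.linspace(0.0, 1.0, 8).reshape(-1, 1)
y = np.sin(2.0 * X[:, 0])
mu, var = gp_posterior(X, y, X)
```

With a small observation noise, the posterior mean nearly interpolates the training data, and the posterior variance collapses toward zero at the training locations.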

### 2.2 Monotonic Gaussian Processes.

In this work, we adopt the monotonic GP formulation from Riihimäki and Vehtari [15], which has been implemented in the GPstuff toolbox [23,24]. To constrain the classical GP to be monotonic, Riihimäki and Vehtari proposed a block covariance matrix structure similar to that of many multifidelity approaches, e.g., Refs. [25–27], and approximated the posterior using expectation propagation (EP) [28], which is arguably more efficient than Laplace’s method. For the sake of completeness, we assume a zero-mean GP and briefly summarize the formulation; readers are referred to the original work of Riihimäki and Vehtari [15] for further details. Roughly speaking, the crux of the approach is to augment the covariance matrix **K** with a block structure, as is done in multifidelity and gradient-enhanced GPs. The difference is that the augmented block encodes the binary classification of whether the GP is supposed to be monotonically increasing or decreasing at some locations. Naturally, the GP formulation lends itself to a binary classification problem via logistic regression or probit regression, the latter taken from the cumulative density function of a standard normal distribution, $\Phi(z) = \int_{-\infty}^{z} \mathcal{N}(t\,|\,0,1)\,dt$.

The monotonicity information is introduced at $M$ inducing locations $\mathbf{X}_m \in \mathbb{R}^{M\times D}$. At a location $\mathbf{x}^{(i)} \in \mathbf{X}_m$, the derivative of **f** is non-negative with respect to the input dimension $d_i$. The key idea is to assume a probit likelihood at the location $\mathbf{x}^{(i)}$ as

$$p\left(m^{(i)} \,\middle|\, f'^{(i)}_{d_i}\right) = \Phi\left(\frac{f'^{(i)}_{d_i}}{\nu}\right)$$

where $m^{(i)}$ denotes the derivative (monotonicity) observation at $\mathbf{x}^{(i)}$ and $\nu > 0$ controls the strictness of the constraint; at the locations $\mathbf{X}_m$, the function **f** is known to be monotonic. The joint prior for **f** and its derivatives **f**′ is given by

$$p(\mathbf{f},\mathbf{f}'\,|\,\mathbf{X},\mathbf{X}_m) = \mathcal{N}\left(\begin{bmatrix}\mathbf{f}\\ \mathbf{f}'\end{bmatrix} \,\middle|\, \mathbf{0}, \mathbf{K}_{\mathrm{joint}}\right), \qquad \mathbf{K}_{\mathrm{joint}} = \begin{bmatrix}\mathbf{K}_{f,f} & \mathbf{K}_{f,f'}\\ \mathbf{K}_{f',f} & \mathbf{K}_{f',f'}\end{bmatrix}$$

where the derivative blocks are obtained by differentiating the covariance function. The probit likelihood makes the posterior analytically intractable, so it is approximated with EP, where each probit site is replaced by a local Gaussian approximation whose parameters are updated iteratively; $\mu_{-i}$ and $\sigma_{-i}^2$ are the parameters of the cavity distribution in EP. By introducing $M$ monotonic inducing points, the cost complexity for optimizing the log-marginal likelihood increases from $\mathcal{O}(N^3)$ to $\mathcal{O}((N+M)^3)$. The posterior mean and posterior variance of the testing distribution are, respectively, given by

$$\mu(\mathbf{x}_*) = \mathbf{k}_{*,\mathrm{joint}}^\top\left(\mathbf{K}_{\mathrm{joint}} + \tilde{\boldsymbol{\Sigma}}\right)^{-1}\tilde{\boldsymbol{\mu}}$$

$$\sigma^2(\mathbf{x}_*) = k(\mathbf{x}_*,\mathbf{x}_*) - \mathbf{k}_{*,\mathrm{joint}}^\top\left(\mathbf{K}_{\mathrm{joint}} + \tilde{\boldsymbol{\Sigma}}\right)^{-1}\mathbf{k}_{*,\mathrm{joint}}$$

where $\mathbf{k}_{*,\mathrm{joint}}$ collects the covariances between the testing point and both the observations and the derivative inducing points, and $\tilde{\boldsymbol{\mu}}$ and $\tilde{\boldsymbol{\Sigma}}$ collect the Gaussian observation noise $\sigma_n^2$ on the observation block and the EP site parameters on the derivative block.
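The block structure of the joint covariance can be assembled directly from derivatives of the squared exponential kernel. The sketch below (one input dimension, fixed hyper-parameters) builds the joint covariance over $[\mathbf{f}, \mathbf{f}']$ and evaluates the probit likelihood for a derivative value; the EP loop itself is omitted, since GPstuff [23,24] provides it.

```python
import numpy as np
from math import erf, sqrt

def k_se(x1, x2, sf=1.0, ell=1.0):
    """SE kernel between two 1-D location vectors."""
    return sf**2 * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ell**2)

def k_f_df(x, xm, sf=1.0, ell=1.0):
    """cov(f(x), f'(xm)) = dk/dx' for the SE kernel."""
    return k_se(x, xm, sf, ell) * (x[:, None] - xm[None, :]) / ell**2

def k_df_df(xm, sf=1.0, ell=1.0):
    """cov(f'(xm), f'(xm')) = d^2 k / (dx dx') for the SE kernel."""
    r = xm[:, None] - xm[None, :]
    return k_se(xm, xm, sf, ell) * (1.0 - r**2 / ell**2) / ell**2

def k_joint(x, xm, sf=1.0, ell=1.0):
    """Block covariance over the stacked vector [f(x), f'(xm)]."""
    Kff = k_se(x, x, sf, ell)
    Kfd = k_f_df(x, xm, sf, ell)
    Kdd = k_df_df(xm, sf, ell)
    return np.block([[Kff, Kfd], [Kfd.T, Kdd]])

def probit(z):
    """Standard normal CDF, Phi(z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

x = np.linspace(0.0, 1.0, 5)       # observation locations
xm = np.linspace(0.0, 1.0, 3)      # derivative inducing locations X_m
K = k_joint(x, xm)
# p(m = 1 | f') = Phi(f'/nu): close to 1 for a clearly positive derivative
nu = 1e-6
lik = probit(2.0 / nu)
```

Because the derivative of a GP is itself a GP, the assembled block matrix is a valid (positive semi-definite) covariance, which is what allows the derivative "observations" to be treated like extra data.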

## 3 Numerical Examples

### 3.1 Example 1: Noisy Logistic Regression.

In this example, the training data are drawn from a logistic (sigmoid) function corrupted by an additive noise term *ɛ*, where *ɛ* is substantial enough to corrupt the monotonicity of the underlying function. Despite the fact that the noise is homoscedastic, i.e., the variance of the noise is constant throughout the bounded domain, the regular GP is not able to capture the function correctly. The monotonic GP, on the other hand, enforces monotonicity while ignoring the noise, resulting in a better approximation of the true function; the testing dataset is plotted as magenta squares.

### 3.2 Example 2: Heteroscedastic Hall–Petch Relationship.

The Hall–Petch relationship relates the yield strength $\sigma_Y$ to the average grain size $d$,

$$\sigma_Y = \sigma_0 + \frac{k}{\sqrt{d}}$$

where $\sigma_0$ and $k$ are material-dependent constants. In this example, $d$ is considered on the interval [15 $\mu$m, 350 $\mu$m]. For a fixed-size representative volume element, when $d$ is large, the number of grains reduces, and therefore, the noise of $\sigma_Y$ increases; the argument applies in reverse when $d$ is small. Now, the variance of the Monte Carlo estimator roughly decays at the rate of $N^{-1}$, where $N$ is the number of grains or samples. In this hypothetical example, to examine the behavior of the monotonic GP, we simply model the volume of a grain as proportional to $d^3$; therefore, a noise term with zero mean and $\mathcal{O}(d^3)$ variance is considered. Obviously, the variance of the noise term increases as $d$ increases, which is consistent with what we have observed in the literature [6].

Figure 2 compares the regular and monotonic GPs with 20 training data points, plotted as black dots. The regular GP overfits the training dataset, resulting in unphysical oscillatory behavior toward the large-*d* region. The monotonic GP, on the other hand, correctly identifies the trend and monotonically decreases as *d* increases, yielding a better approximation than the regular GP.
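The synthetic data-generation step for this example can be sketched as follows; the coefficients $\sigma_0$ and $k$ below are illustrative placeholders, not values from the paper, and the noise standard deviation scales as $d^{3/2}$ so that its variance is $\mathcal{O}(d^3)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def hall_petch(d, sigma0=80.0, k_hp=700.0):
    """Hall-Petch yield strength, sigma_Y = sigma_0 + k / sqrt(d) (illustrative coefficients)."""
    return sigma0 + k_hp / np.sqrt(d)

# grain sizes on [15, 350] micrometers, sorted for convenience
d = np.sort(rng.uniform(15.0, 350.0, size=20))
# zero-mean heteroscedastic noise whose variance grows as O(d^3), i.e., sd ~ d^{3/2}
noise_sd = 2e-4 * d**1.5
sigma_Y = hall_petch(d) + rng.normal(0.0, noise_sd)
```

The resulting samples decrease on average with $d$ while becoming noisier, which is exactly the regime where the monotonic GP is expected to help.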

## 4 Case Study: Fatigue Life Prediction Under Multiaxial Loading

### 4.1 Dataset.

The fatigue data are commonly modeled by a stress–life relationship of the form

$$\sigma_a = A \cdot N_f^{B}$$

where $A$ and $B$ are material-dependent coefficients, $\sigma_a$ is the stress amplitude, and $N_f$ is the experimental fatigue life. Experimental materials science is resource-intensive [34]; thus, practically speaking, fatigue experimental data are scarce and may not be strictly monotonic, as shown in Karolczuk and Słoński [32]. In the present work, the input of the GPs is $\sigma_a$, and the output of the GPs is $\log N_{\mathrm{exp}}$.

| $\sigma_a$ (MPa), training | $N_{\mathrm{exp}}$, training | $\sigma_a$ (MPa), testing | $N_{\mathrm{exp}}$, testing |
|---|---|---|---|
| 674 | 2908 | 427 | 77,730 |
| 558 | 8115 | 400 | 113,900 |
| 556 | 10,035 | 411 | 117,275 |
| 507 | 17,012 | 403 | 144,264 |
| 483 | 19,955 | 390 | 192,920 |
| 505 | 20,595 | 391 | 198,992 |
| 498 | 23,780 | 379 | 243,816 |
| 490 | 25,913 | 366 | 376,815 |
| 484 | 28,045 | 369 | 396,987 |
| 474 | 51,430 | 345 | 406,800 |
| 469 | 52,000 | 342 | 1,252,208 |
| 475 | 66,200 | 335 | 1,444,998 |
| | | 335 | 1,528,487 |


### 4.2 Results.

We implement and compare two GP variants: the regular GP and the monotonically constrained GP. Both are trained with a squared exponential kernel, and the hyper-parameters are optimized using the L-BFGS method [35] with a maximum of 6000 iterations. Figure 3 compares the accuracy of the two GPs against the testing dataset. The root-mean-square errors for the regular GP and the monotonic GP are 183.3026 and 2.0441, respectively. While the regular GP tends to underestimate the experimental data, the monotonic GP provides more accurate predictions. As shown in Fig. 4, outside the training domain, the regular GP is unable to capture the general trend and behaves wildly, even when interpolating the data points. In contrast, the monotonic GP captures the monotonic trend and extrapolates to a significant distance outside the training interval. It is important to note that the monotonic GP is only strictly monotonic within the training domain and may fail to maintain monotonicity far beyond the training regime. Indeed, in Fig. 4, the monotonic GP fails to maintain monotonicity when $\sigma_a \lesssim 350$ MPa. Figure 4 also shows a nonmonotonic behavior for the regular GP.
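The hyper-parameter optimization step can be sketched as minimizing the negative log-marginal likelihood with L-BFGS. The snippet below uses SciPy rather than the toolbox employed in the paper, and the stress-amplitude data are synthetic stand-ins for the experimental values, not the actual dataset.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(theta, X, y):
    # theta = log [sig_f, ell, sig_n]; log-parameterization keeps the values positive
    sf, ell, sn = np.exp(theta)
    d2 = (X[:, None, 0] - X[None, :, 0]) ** 2
    K = sf**2 * np.exp(-0.5 * d2 / ell**2) + (sn**2 + 1e-10) * np.eye(len(X))
    try:
        L = np.linalg.cholesky(K)
    except np.linalg.LinAlgError:
        return 1e10                     # penalize numerically unstable hyper-parameters
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(X) * np.log(2 * np.pi)

# synthetic stand-in: stress amplitude (MPa) vs. log fatigue life
rng = np.random.default_rng(1)
X = np.linspace(400.0, 700.0, 12).reshape(-1, 1)
y = 21.4 - 0.02 * X[:, 0] + 0.05 * rng.normal(size=12)

res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 100.0, 0.1]),
               args=(X, y), method="L-BFGS-B", options={"maxiter": 6000})
```

The optimizer returns hyper-parameters with a log-marginal likelihood at least as good as the starting guess; the monotonic GP adds the EP machinery of Sec. 2.2 on top of this step.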

It is also observed that the posterior variance of the monotonic GP is significantly smaller than that of the regular GP on the same training dataset. This holds true for both interpolation and extrapolation. The main reason is that the monotonic GP uses an extra "pseudo" training dataset by imposing *M* inducing locations in the input domain, as described in Sec. 2.2, and a training dataset with more data points reduces the posterior variance.

## 5 Case Study: Potts Kinetic Monte Carlo for Grain Growth

### 5.1 Model Description.

In the Potts kinetic Monte Carlo (kMC) model, the probability $P$ of a successful change in grain site orientation is calculated as

$$P = \begin{cases} \exp\left(-\dfrac{\Delta E}{k_B T_s}\right), & \Delta E > 0 \\ 1, & \Delta E \leq 0 \end{cases}$$

where $E$ is the total grain boundary energy calculated by summing all the neighbor interaction energies, $\Delta E$ can be regarded as the activation energy, $k_B$ is the Boltzmann constant, and $T_s$ is the simulation temperature. In the basic Potts model, the interaction energy between two pixels belonging to the same grain is zero, and $E$ is incremented by one for each dissimilar neighbor. From this acceptance rule, changes that decrease the system energy are preferred, and the total system energy decreases monotonically through grain coarsening. It is worth noting that the simulation temperature $T_s$ is not the real system temperature: $k_B T_s$ is an energy that defines the thermal fluctuation, i.e., noise, present in the kMC simulation [36]. Increasing the value of $k_B T_s$ results in microstructures with increasing grain boundary roughness.
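The acceptance rule translates directly into code; the following is a minimal sketch of the Metropolis-style probability, not the SPPARKS implementation.

```python
import math

def potts_acceptance(delta_E, kB_Ts):
    """Acceptance probability for a proposed grain-site orientation flip."""
    if delta_E <= 0.0:
        return 1.0               # energy-lowering (or neutral) changes always accepted
    return math.exp(-delta_E / kB_Ts)

# higher kB*Ts -> more thermal noise -> rougher grain boundaries
p_cold = potts_acceptance(1.0, 0.15)
p_hot = potts_acceptance(1.0, 0.85)
```

At low $k_B T_s$ almost no energy-raising flips are accepted, while at high $k_B T_s$ thermal fluctuations become frequent, consistent with the grain-boundary roughening described above.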

### 5.2 Results.

Using SPPARKS as the forward model, we build a dataset with *k*_{B}*T*_{s} ∈ {0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85} at multiple snapshots in time *t* ∈ {0.000, 56.879, 106.899, 152.482, 206.331, 251.886, 306.530, 353.403, 406.482, 450.629, 502.591, 554.926, 603.345, 653.470, 706.620, 756.473, 800.112, 855.231, 901.232, 950.060, 1005.970, 1054.070, 1102.890, 1155.320, 1200.360, 1251.300, 1302.020, 1353.410, 1400.000} Monte Carlo steps. In this case study, the average grain size, measured in pixel^{2}, is used as the quantity of interest. Similar to Tran et al. [39], we apply a filter of 50 pixel^{2} to eliminate small grains before computing the average grain size. The training and testing datasets are divided as follows: if *k*_{B}*T*_{s} ≤ 0.75 and *t* < 1000, the data point belongs to the training dataset; otherwise, it belongs to the testing dataset. Under this condition, a training dataset of 140 data points and a testing dataset of 92 data points are obtained. The inputs of the GP are (*k*_{B}*T*_{s}, *t*), and the output of the GP is the average grain size.
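The split rule can be written as a small predicate over the $(k_B T_s, t)$ grid; note that the conjunction of the two conditions is what reproduces the stated 140/92 partition (each grid pair contributes one data point).

```python
kTs_values = [0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85]
t_values = [0.000, 56.879, 106.899, 152.482, 206.331, 251.886, 306.530,
            353.403, 406.482, 450.629, 502.591, 554.926, 603.345, 653.470,
            706.620, 756.473, 800.112, 855.231, 901.232, 950.060, 1005.970,
            1054.070, 1102.890, 1155.320, 1200.360, 1251.300, 1302.020,
            1353.410, 1400.000]

def is_training(kTs, t):
    """Training if kB*Ts <= 0.75 and t < 1000 Monte Carlo steps; testing otherwise."""
    return kTs <= 0.75 and t < 1000.0

n_train = sum(is_training(kTs, t) for kTs in kTs_values for t in t_values)
n_test = len(kTs_values) * len(t_values) - n_train
```

With 8 temperatures and 29 snapshots, the 8 × 29 = 232 grid points split into 140 training and 92 testing points, so the testing set probes extrapolation in both $k_B T_s$ and $t$.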

It is noted that the data are naturally monotonic in time, as grains grow, which is shown in Fig. 5. At the boundary between the training and testing datasets, at *t* = 950.060 Monte Carlo steps and *k*_{B}*T*_{s} = 0.65, the average grain size is 1203.6605 pixel^{2}, whereas at *t* = 1005.970 Monte Carlo steps, for the same *k*_{B}*T*_{s}, it is 1279.0099 pixel^{2}. This is clearly shown in Fig. 7, where most testing predictions exceed 1250 pixel^{2}.

Figure 6 compares the regular GP and the monotonic GP in terms of extrapolation beyond the training regime for *k*_{B}*T*_{s} = 0.65, showing consistent behavior between the two. The posterior means of the regular GP and the monotonic GP are almost exactly the same. However, the posterior variance is significantly reduced in the monotonic GP, which can be explained by the *M* extra inducing points for monotonicity classification, as described in Sec. 2.2.

Figure 7 compares the regular GP and the monotonic GP in terms of accuracy, with error bars of $\mu \pm 1.645\sigma$, showing that the regular GP (*R*^{2} = 0.9730) is slightly more accurate than the monotonic GP (*R*^{2} = 0.9661). This suggests that imposing monotonicity on the GP formulation comes at a small accuracy cost.
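The $R^2$ values reported here follow the standard coefficient-of-determination formula; a minimal sketch:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination, R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot
```

$R^2 = 1$ indicates perfect prediction, while predicting the mean of the test data yields $R^2 = 0$, which makes the 0.9730 vs. 0.9661 gap between the two GPs easy to interpret.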

## 6 Case Study: Strain-Rate-Dependent Stress–Strain With Crystal Plasticity Finite Element

### 6.1 Model Description.

We adopt and extend the dataset from one of our previous studies [6], generated from Fe–22Mn–0.6C TWIP steel with the dislocation-density-based constitutive model. The material parameters are as described in Steinmetz et al. [40] and summarized in Section 6.2.3 and Tables 8 and 9 in [41]. The constitutive model was validated experimentally by Wong et al. [42], and for the sake of completeness, we briefly summarize the constitutive model here. The TWIP/TRIP steel constitutive model is parameterized in terms of dislocation density, ϱ, the dipole dislocation density, ϱ_{di}, the twin volume fraction, *f*_{tw}, and the *ɛ*-martensite volume fraction, *f*_{tr}. A model for the plastic velocity gradient with contribution of mechanical twinning and phase transformation was developed in Kalidindi [43] and is given by

$$\mathbf{L}_p = (1 - f_{\mathrm{tw}} - f_{\mathrm{tr}}) \sum_{\alpha=1}^{N_s} \dot{\gamma}^{\alpha}\, \mathbf{s}_s^{\alpha} \otimes \mathbf{n}_s^{\alpha} + \sum_{\beta=1}^{N_{\mathrm{tw}}} \dot{\gamma}_{\mathrm{tw}}^{\beta}\, \mathbf{s}_{\mathrm{tw}}^{\beta} \otimes \mathbf{n}_{\mathrm{tw}}^{\beta} + \sum_{\chi=1}^{N_{\mathrm{tr}}} \dot{\gamma}_{\mathrm{tr}}^{\chi}\, \mathbf{s}_{\mathrm{tr}}^{\chi} \otimes \mathbf{n}_{\mathrm{tr}}^{\chi}$$

where the matrix contains twins with volume fraction $f_{\mathrm{tw}}$ on $N_{\mathrm{tw}}$ twinning systems and $\epsilon$-martensite with volume fraction $f_{\mathrm{tr}}$ on $N_{\mathrm{tr}}$ transformation systems, $\mathbf{s}_s^{\alpha}$ and $\mathbf{n}_s^{\alpha}$ are unit vectors along the shear direction and shear plane normal of the $N_s$ slip systems $\alpha$, $\mathbf{s}_{\mathrm{tw}}^{\beta}$ and $\mathbf{n}_{\mathrm{tw}}^{\beta}$ are those of the $N_{\mathrm{tw}}$ twinning systems $\beta$, and $\mathbf{s}_{\mathrm{tr}}^{\chi}$ and $\mathbf{n}_{\mathrm{tr}}^{\chi}$ are those of the $N_{\mathrm{tr}}$ transformation systems $\chi$. The Orowan equation models the shear rate on the slip system $\alpha$ as

$$\dot{\gamma}^{\alpha} = \rho_s b_s \nu_0 \exp\left\{-\frac{Q_s}{k_B T}\left[1 - \left(\frac{\tau_{\mathrm{eff}}^{\alpha}}{\tau_{\mathrm{sol}}}\right)^{p_s}\right]^{q_s}\right\} \operatorname{sgn}\left(\tau^{\alpha}\right)$$

where $b_s$ is the length of the slip Burgers vector, $\nu_0$ is a reference velocity, $Q_s$ is the activation energy for slip, $k_B$ is the Boltzmann constant, $T$ is the temperature, $\tau_{\mathrm{eff}}$ is the effective resolved shear stress, $\tau_{\mathrm{sol}}$ is the solid solution strength, and $0 < p_s \leq 1$ and $1 \leq q_s \leq 2$ are fitting parameters controlling the glide resistance profile. Blum and Eisenlohr [44] model the evolution of the dislocation densities, particularly the generation of unipolar dislocation density and the formation of dislocation dipoles; the annihilation of dislocation dipoles is controlled by dislocation climb, where $D_0$ is the prefactor of the self-diffusion coefficient, $V_{\mathrm{cl}}$ is the activation volume for climb, and $Q_{\mathrm{cl}}$ is the activation energy for climb. Strain hardening is described in terms of a dislocation mean free path, denoted by $\Gamma$. The mean free path for slip accounts for the average grain size $D$, the average twin thickness $t_{\mathrm{tw}}$, and the average $\epsilon$-martensite thickness $t_{\mathrm{tr}}$, with $i_s$ a fitting parameter; the mean free paths for twinning and for transformation are computed analogously, with $i_{\mathrm{tw}}$ and $i_{\mathrm{tr}}$ being fitting parameters. The nucleation rates for twins and $\epsilon$-martensite are $\dot{N} = \dot{N}_0 P_{\mathrm{ncs}} P$, where the probability $P$ that a nucleus bows out to form a twin or $\epsilon$-martensite is governed by the fitting parameters $p_{\mathrm{tw}}$ and $p_{\mathrm{tr}}$, respectively. For more details, readers are referred to Roters et al. [41] (cf. Section 6.2.3) and Wong et al. [42].

### 6.2 Results.

In this study, we focus on the effect of strain rate on the stress–strain curve at various average grain sizes. The effect of strain rate has been experimentally validated in Benzing et al. [49], and numerical studies using crystal plasticity finite element models have been performed as well, such as those in Singh et al. [50]. The inputs of the GP are the average grain size, the strain rate, and the strain, i.e., $(D, \dot{\epsilon}, \epsilon)$, and the output of the GP is the stress $\sigma$. We independently vary the average grain size *D* ∈ {1.0 · 10^{−6}, 2.5 · 10^{−6}, 5.0 · 10^{−6}, 1.0 · 10^{−5}, 2.5 · 10^{−5}} m and the strain rate $\dot{\epsilon} \in \{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}, 10^{4}\}$ s^{−1}. Numerically unstable results are removed in post-processing. Some examples of homogenized stress–strain responses are shown in Fig. 9 for visualization purposes.

The training and testing datasets are divided as follows: if the average grain size *D* is less than 5 · 10^{−6} m or if the strain rate $\dot{\epsilon} = 10^{1}$ s^{−1}, the data point belongs to the training dataset; otherwise, it belongs to the testing dataset. Under this splitting condition, a training dataset of 340 data points and a testing dataset of 120 data points are obtained.

Figure 10 shows the accuracy comparison between the regular GP and the monotonic GP. The *R*^{2} coefficient for the regular GP is 0.9895, whereas that of the monotonic GP is slightly lower, 0.9860. Both GPs perform very well, but the monotonic GP is slightly worse. The regular GP predictions are also more consistent in the large-stress regime.

## 7 Discussion and Conclusion

In this paper, we compare the monotonic GP against the regular GP for various materials datasets, namely fatigue, grain growth, and homogenized stress–strain behaviors. The fatigue dataset is the only experimental dataset with noise, whereas the other two datasets, i.e., grain growth and homogenized stress–strain, are computationally generated by SPPARKS and DAMASK, respectively. We test out their predictive capabilities in both interpolation and extrapolation, but are mainly concerned with extrapolation.

For the fatigue example featured in Sec. 4, the experimental data do not adhere strictly to theoretical derivations. This is where the monotonic GP outperforms the regular GP significantly. In the other two examples, which feature datasets generated by computational models, the two GPs perform very closely, even though the regular GP performs slightly better than the monotonic GP. This can be explained by the fact that imposing monotonicity on the GP comes at a small cost in accuracy and numerical performance. It is also noted that the monotonic behavior of the monotonic GP is only guaranteed within the training domain. Beyond the training regime, the monotonicity effect imposed by introducing *M* inducing points fades away: the further the monotonic GP goes beyond the training regime, the more likely it is to violate the monotonicity constraints. Last but not least, the posterior variances of the monotonic GP are significantly smaller than those of the regular GP, while the posterior means are roughly the same.

Constructing an accurate and physically faithful surrogate model is important for uncertainty quantification [51]. Computationally efficient surrogate models are commonly used, for example, in sampling [3,4] or to solve a stochastic inverse problem in structure–property relationships [6,7]. Monotonicity is common in materials science, such as the famous Hall–Petch relationship [2], i.e., $\sigma_y = \sigma_0 + k D^{-1/2}$. Exploiting such physical constraints and insights may improve the efficiency of machine learning in the future.

The objective of this work is to examine the performance of the monotonic GP compared to the regular GP in different settings, with an emphasis on materials science. Our conclusion is that the monotonic GP is best suited for sparse and noisy settings, which are fairly common in experimental materials science, where the data acquisition process is expensive and intensive in terms of both time and labor, and the data are often corrupted by the noise induced by the finite-size effect of microstructures.

A drawback of the monotonic GP is its higher computational cost, which scales with the size of the expanded covariance matrix **K**_{joint}. We conclude that for datasets that are already monotonic, applying a monotonic GP formulation may be unnecessary and may lead to a slight degradation in performance. For datasets that are sparse and noisy, imposing monotonicity when it is appropriate may result in a substantial improvement in performance.

## Acknowledgment

This work was supported by the Laboratory Directed Research and Development program at Sandia National Laboratories, a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia LLC, a wholly owned subsidiary of Honeywell International Inc. for the U.S. Department of Energy’s National Nuclear Security Administration under Contract DE-NA0003525.

## Conflict of Interest

There are no conflicts of interest.

## Data Availability Statement

The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.

## Nomenclature

- $\mathbf{x}_*$ = testing input
- $f_*$ = testing output
- $i, j$ = dummy data point index
- $d, g, h$ = dummy dimensionality index
- **f** = noiseless output **f**(**x**)
- **f**′ = first derivative of **f**
- **x** = $[x_1, \ldots, x_D]$: training input of $D$ dimensionality
- **y** = noisy observation, **y**(**x**) = **f**(**x**) + *ɛ*, $\epsilon \sim \mathcal{N}(0, \sigma^2)$
- $D$ = input dimensionality
- $M$ = number of inducing data points for derivatives
- $N$ = number of data points
- **X** = $\{\mathbf{x}^{(i)}\}_{i=1}^{N}$: training dataset of size $N$, $\mathbf{X} \in \mathbb{R}^{N \times D}$
- $\mathbf{X}_m$ = derivative inducing points, $\mathbf{X}_m \in \mathbb{R}^{M \times D}$