The invention generally relates to neural networks, and in particular to a deep operator network.
Over time, mathematical models based on calculus have been employed to both predict and interpret physical phenomena around us, establishing fundamental laws that generalize well. Classical calculus and partial differential equations (PDEs) or their stochastic variants have been effective in the era of the “pursuit of simplicity,” but they may be inefficient or even inadequate to represent complex inter-connected systems or systems of systems with non-local interactions. Some of the hardest problems in science rely on being able to solve and evaluate complex ordinary or partial differential equations and integrals. While this is not an issue for simple systems where the equations are well studied and analytical solutions exist, the majority of open problems do not share these properties. In practice, the equations are often approximated from noisy experimental data, and their evaluation around new data points is obtained by running complex numerical simulations on large supercomputers; both options are expensive, computationally and financially.
What is needed is a system that can both estimate the equations from data, and evaluate them around new data points in a fast and reliable way.
The following presents a simplified summary of the innovation in order to provide a basic understanding of some aspects of the invention. This summary is not an extensive overview of the invention. It is intended to neither identify key or critical elements of the invention nor delineate the scope of the invention. Its sole purpose is to present some concepts of the invention in a simplified form as a prelude to the more detailed description that is presented later.
In an aspect, the invention features a method of operating a neural network device including constructing two sub-networks to encode input functions and location variables separately, and merging the two constructed sub-networks together to compute an output.
In another aspect, the invention features a network including a first neural network configured to encode a discrete function space, and a second neural network configured to encode a domain of output functions.
These and other features and advantages will be apparent from a reading of the following detailed description and a review of the associated drawings. It is to be understood that both the foregoing general description and the following detailed description are explanatory only and are not restrictive of aspects as claimed.
These and other features, aspects, and advantages of the present invention will become better understood with reference to the following description, appended claims, and accompanying drawings where:
The subject innovation is now described with reference to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It may be evident, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate describing the present invention.
One of the fundamental issues in machine learning research is system identification from data, and how to design deep neural networks (DNNs) that could represent and discover nonlinear operators, mapping input functions into output functions. These operators can be of explicit type, e.g., the Laplace transform, or of implicit type, e.g., solution operators of partial differential equations (PDEs). For solution operators, the inputs to the DNN could be functions representing boundary conditions selected from a properly designed input space V. Implicit type operators may also describe systems for which we do not have any mathematical knowledge of their form, e.g., in social dynamics, although we do not consider such cases in the present work.
In the literature, two types of implicit operators have been considered, i.e., dynamical systems, in the form of ordinary differential equations (ODEs), and systems described by PDEs, as follows.
For dynamical systems, different network architectures have been employed, including feed-forward NNs (FNNs), recurrent NNs (RNNs), reservoir computing, residual networks, autoencoders, neural ordinary differential equations, neural jump stochastic differential equations, and many others. However, they are only able to predict the evolution of a specific system (e.g., a Hamiltonian system) rather than identifying the system behavior for new unseen input signals.
In learning operators from PDEs with structured data, some works treat the input and output function as an image and then use convolutional NNs (CNNs) to learn the image-to-image mapping G, but this approach can be only applied to particular types of problems, where the sensors of the input function u are distributed on an equispaced grid. Also, the training data must include all the G(u)(y) values with y also on an equispaced grid.
For unstructured data, a modified CNN based on generalized moving least squares or graph kernel networks can be used to approximate specific operators. However, they are not able to learn general nonlinear operators. Also, some PDEs can be parameterized by unknown coefficients or an unknown forcing term, and then the unknown parts are identified from data. However, not all PDEs can be well parameterized.
Symbolic mathematics has also been applied to represent PDEs, although accuracy and robustness still need to be addressed.
In this work, we propose a new deep learning framework, DeepONet, to learn diverse continuous nonlinear operators without the aforementioned constraints. DeepONet is inspired directly by the theory that guarantees small approximation error (i.e., the error between the target operator and the class of neural networks of a given finite-size architecture). Moreover, the specific architectures we introduce exhibit small generalization errors (i.e., the error of a neural network for previously unseen data) for diverse applications that we systematically study herein.
Our proposal of approximating functionals and nonlinear operators with NNs goes beyond the universal function approximation and supervised data, or using the idea of physics-informed neural networks (PINNs).
Neural networks (NNs) are universal approximators of continuous functions. A NN with a single hidden layer can approximate accurately any nonlinear continuous operator. This universal approximation of operators is suggestive of the potential of NNs in learning from scattered data any continuous operator or complex system. Fully described below is a new NN with small generalization error, herein referred to as a deep operator network (“DeepONet”). As is described below, DeepONet includes a NN for encoding the discrete input function space (i.e., branch net) and another NN for encoding the domain of the output functions (i.e., trunk net). DeepONet can learn various explicit operators, e.g., integrals and fractional Laplacians, as well as implicit operators that represent deterministic and stochastic differential equations.
While calculus has served science well, new fundamental developments are slow and have not established any governing principles in some fields, e.g., in the emerging field of social dynamics. In the era of artificial intelligence and data analytics, the question is then if there are any more expressive representations that can be employed to accelerate scientific discovery. Classical calculus and partial differential equations (PDEs) or their stochastic variants, taught diligently in colleges around the globe, have been effective in the era of the “pursuit of simplicity,” but they may be inefficient or even inadequate to represent complex inter-connected systems or systems of systems with non-local interactions.
Here, we formulate an alternative calculus-agnostic simple and fast way of describing an operator and its dynamics implicitly based on a given data set. More broadly, with the anticipated frequent human-robot interactions in the not-too-distant future, we address the question on how to best educate the new generations on quantitative inference and how to endow new robots with the required intelligence. Following the current paradigm, one approach would be to teach robots calculus, i.e., one integral or differential operator at a time, and train them to synthesize and solve such PDEs at blazing speeds to make quantitative decisions, sense their environment, and communicate with humans; however, this seems computationally prohibitive. An alternative direction is to train them using new means in learning nonlinear operators and functionals that compactly express very complex dynamics without the need for resorting to prior knowledge of 17th-century calculus. Though the training of the new DNNs could be laborious, the computational expense for training is not a limiting factor, as for each problem we only need to train the network once, and the training is performed offline.
Here we develop new types of NNs that go beyond the universal function approximation and supervised data, and approximate functionals and nonlinear operators instead. As a first step, we resort to the universal operator approximation theorem, which states that a NN with a single hidden layer can approximate accurately any nonlinear continuous functional (a mapping from a space of functions into the real numbers) or (nonlinear) operator (a mapping from a space of functions into another space of functions). To wit, let G be an operator taking an input function u, with G(u) the corresponding output function. For any point y in the domain of G(u), the output G(u)(y) is a real number. Hence, the network takes inputs composed of two parts, u and y, and outputs G(u)(y).
Theorem 1 (Universal Approximation Theorem for Operator). Suppose that σ is a continuous non-polynomial function, X is a Banach Space, K1⊂X, K2⊂Rd are two compact sets in X and Rd, respectively, V is a compact set in C(K1), G is a nonlinear continuous operator, which maps V into C(K2). Then for any ε>0, there are positive integers n, p, m, constants cki, ξijk, θik, ζk∈R, wk∈Rd, xj∈K1, i=1, . . . , n, k=1, . . . , p, j=1, . . . , m, such that
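(The concluding display of Theorem 1 is not reproduced in the filed text; in the standard form of the universal approximation theorem for operators it reads:)

|G(u)(y) − Σ_{k=1}^{p} Σ_{i=1}^{n} c_i^k σ(Σ_{j=1}^{m} ξ_{ij}^k u(x_j) + θ_i^k) σ(w_k·y + ζ_k)| < ε

holds for all u∈V and y∈K2.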
This approximation theorem is indicative of the potential application of neural networks to learn nonlinear operators from data, i.e., similar to standard NN where we learn functions from data. However, this theorem does not inform us how to learn operators efficiently. The overall accuracy of NNs can be characterized by dividing the total error into three main types: approximation, optimization, and generalization errors. The universal approximation theorem only guarantees a small approximation error for a sufficiently large network, but it does not consider the important optimization and generalization errors at all, which are often dominant contributions to the total error in practice. Useful networks should be easy to train, i.e., to exhibit small optimization error, and generalize well to unseen data, i.e., to exhibit small generalization error.
To demonstrate the capability and effectiveness of learning nonlinear operators by neural networks, we set up the problem in as general a form as possible by using the weakest possible constraints on the sensors and training dataset. Specifically, the only condition required is that the sensor locations {x1, x2, . . . , xm} are the same, but not necessarily on a lattice, for all input functions u, while we do not enforce any constraints on the output locations y (FIG. 1B). However, even this constraint can be lifted and it is only used here for computational expediency. We provide a specific new network architecture, the deep operator network (“DeepONet”), to achieve small total errors. We demonstrate that, unlike fully-connected neural networks (FNNs), DeepONet significantly improves generalization based on a design of two sub-networks, the branch net for the input function and the trunk net for the locations at which to evaluate the output function. The key point is that we discover a new operator G as a NN, which is able to make inferences for quantities of interest given new and unseen data. If we wish to further interpret the type of operator G using the familiar classical calculus, we can project the results of G(u)(y) onto a dictionary containing first- or higher-order derivatives, gradients, Laplacians, etc., as is done currently with existing regression techniques. Returning to the aforementioned example, using the initial conditions as the function u(x), we can train the operator G offline, and then we can predict the dynamics for new unseen boundary conditions with an error of 10^−4 or less in a fraction of a second.
We consider two types of implicit operators, i.e., dynamical systems (e.g., in the form of ordinary differential equations, ODEs) and partial differential equations (PDEs). Some works used standard NNs to identify dynamical systems, but they only considered systems described by difference equations. Some other works predict the evolution of a specific dynamical system rather than identifying the system behavior for new unseen input signals. The network architectures they employed include FNNs, recurrent neural networks (RNNs), reservoir computing, residual networks, autoencoders, neural ordinary differential equations, and neural jump stochastic differential equations. For identifying PDEs, some works treat the input and output function as an image, and then use convolutional neural networks (CNNs) to learn the image-to-image mapping, but this approach can only be applied to particular types of problems, where the sensors of the input function u are distributed on an equispaced grid, and the training data must include all the G(u)(y) values with y also on an equispaced grid. In another approach without this restriction, PDEs are parameterized by unknown coefficients, and then only the coefficient values are identified from data. Alternatively, a generalized CNN based on generalized moving least squares can be used for unstructured data, but it can only approximate local operators and is not able to learn other operators like an integral operator. In the examples we present below, we explore other representations of the discrete function input space.
DeepONet architecture. We focus on learning operators in a more general setting, where the only requirement for the training dataset is the consistency of the sensors {x1, x2, . . . , xm} for input functions. In this general setting, the network inputs include two separate components: [u(x1), u(x2), . . . , u(xm)]T and y.
In high dimensional problems, y is a vector with d components, so the dimension of y no longer matches the dimension of u(xi) for i=1, 2, . . . , m. This also prevents us from treating u(xi) and y equally, and thus at least two sub-networks are needed to handle [u(x1), u(x2), . . . , u(xm)]T and y separately. Although the universal approximation theorem (Theorem 1) does not provide any guarantee on the total error, it still provides us with a network structure in Eq. (1). Theorem 1 only considers a shallow network with one hidden layer, so we extend it to deep networks to gain expressivity. Our architecture is shown in the accompanying drawings.
In practice, p is at least of the order of 10, and using many branch networks is inefficient. Hence, we merge all the branch networks into one single branch network (FIG. 1D), i.e., a single branch network outputs a vector [b1, b2, . . . , bp]T∈Rp. In the first DeepONet, each of the p branch networks outputs a scalar bk, and we refer to it as the stacked DeepONet; we refer to the merged version with a single branch network as the unstacked DeepONet.
DeepONet is a high-level network architecture that does not prescribe the architectures of its inner trunk and branch networks. To demonstrate the capability and good performance of DeepONet alone, we choose the simplest FNN as the architecture of the sub-networks in this detailed description. It is possible that using convolutional layers we could further improve accuracy. However, convolutional layers usually work for square domains with {x1, x2, . . . , xm} on an equispaced grid, so as an alternative, and for a more general setting, we may use the “attention” mechanism.
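By way of illustration only, a minimal sketch of an unstacked DeepONet with FNN sub-networks could look as follows in Python (PyTorch); the class name, number of layers, width, and the number p of branch/trunk outputs are illustrative assumptions rather than the claimed implementation:

```python
import torch
import torch.nn as nn


class DeepONetSketch(nn.Module):
    """Unstacked DeepONet sketch: a branch net encodes the sampled input
    function [u(x1), ..., u(xm)], a trunk net encodes the query location y,
    and the output approximates G(u)(y) as their inner product plus a bias."""

    def __init__(self, m: int, y_dim: int, p: int = 40, width: int = 40):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(m, width), nn.ReLU(),
                                    nn.Linear(width, p))
        self.trunk = nn.Sequential(nn.Linear(y_dim, width), nn.ReLU(),
                                   nn.Linear(width, p), nn.ReLU())
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, u_sensors: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        b = self.branch(u_sensors)   # branch outputs [b1, ..., bp], shape (batch, p)
        t = self.trunk(y)            # trunk outputs [t1, ..., tp], shape (batch, p)
        return (b * t).sum(dim=-1, keepdim=True) + self.bias  # ~ G(u)(y)
```

The element-wise product followed by the sum realizes the inner product of the p branch outputs [b1, . . . , bp] with the p trunk outputs, and the trainable scalar corresponds to the “DeepONet with bias” variant discussed below.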
Data generation. The input function u(x) plays an important role in operator identification; here, we mainly consider the following function spaces: Gaussian random fields (GRF), spectral representations, and formulating the input functions as images.
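By way of illustration, sampling a mean-zero GRF input function at m fixed sensor locations can be sketched as follows (NumPy; the domain [0, 1], the sensor count, and the length scale are illustrative assumptions):

```python
import numpy as np


def sample_grf(m: int = 100, length_scale: float = 0.2, seed: int = 0) -> np.ndarray:
    """Draw one sample [u(x1), ..., u(xm)] of a mean-zero Gaussian random field
    on [0, 1] at m equispaced sensors, using the Gaussian (RBF) covariance kernel
    k_l(x1, x2) = exp(-|x1 - x2|^2 / (2 l^2))."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, m)
    diff = x[:, None] - x[None, :]
    K = np.exp(-diff**2 / (2.0 * length_scale**2))
    # Cholesky factor of the (jittered) covariance turns white noise into a GRF sample.
    L = np.linalg.cholesky(K + 1e-10 * np.eye(m))
    return L @ rng.standard_normal(m)
```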
In order to estimate how many sensors are required to achieve accuracy ε, we consider the following ODE system:
Let G be the operator mapping the input u to the output s, i.e., G(u) satisfies
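(The displays for the ODE system and the operator G are not reproduced in the filed text; a reconstruction consistent with the function g, the interval [a, b], and the initial value s0 used below is:)

ds(x)/dx = g(s(x), u(x), x), x∈[a, b], s(a)=s0,

so that G(u)(x) = s0 + ∫_a^x g(G(u)(t), u(t), t) dt for x∈[a, b].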
Now, we choose uniformly m+1 points xj=a+j(b−a)/m, j=0, 1, . . . , m from [a, b], and define the function um (x) as follows:
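(The definition of um(x) is not reproduced in the filed text; the piecewise-linear interpolant of u at the points xj, consistent with the linear-interpolation argument of Section S3, is one consistent reconstruction:)

um(x) = u(xj) + [(u(xj+1) − u(xj))/(xj+1 − xj)](x − xj), for xj ≤ x ≤ xj+1, j=0, 1, . . . , m−1.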
Denote the operator mapping u to um by Lm, and let Um={Lm(u)|u∈V}, which is a compact subset of C[a, b], since V is compact and the continuous operator Lm preserves compactness. Obviously, Wm:=V∪Um as the union of two compact sets is also compact. Then, set W:=∪∞i=1 Wi, and the Lemma below points out that W is still a compact set. Since G is a continuous operator, G(W) is compact in C([a, b]; RK). For convenience of analysis, we assume that g(s, u, x) satisfies the Lipschitz condition with respect to s and u on G(W)×W, i.e., there is a constant c>0 such that
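(The displayed inequality is not reproduced in the filed text; a standard reconstruction consistent with the constant c introduced above is:)

|g(s1, u1, x) − g(s2, u2, x)| ≤ c(∥s1 − s2∥ + |u1 − u2|) for all (s1, u1), (s2, u2)∈G(W)×W and x∈[a, b].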
Note that this condition is easy to achieve, for instance, as long as g is differentiable with respect to s and u on G(W)×W.
For u∈V, um∈Um, there exists a constant κ(m, V) depending on m and compact space V, such that
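(The bound defining κ(m, V), not reproduced in the filed text, can be stated consistently with its use below as:)

max over x∈[a, b] of |u(x) − um(x)| ≤ κ(m, V) for all u∈V, where um=Lm(u).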
When V is a GRF with the Gaussian kernel, we have κ(m, V)∼1/(m²l²). Based on these concepts, we have the following theorem.
Theorem 2. Suppose that m is a positive integer making c(b−a)κ(m, V)e^{c(b−a)} less than ε. Then for any d∈[a, b], there exist W1∈R^{n×(m+1)}, b1∈R^{m+1}, W2∈R^{K×n}, b2∈R^{K}, such that
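(The concluding display of Theorem 2 is not reproduced in the filed text; based on the proof in Section S4, it is an approximation bound of the form below, with the precise tolerance on the right-hand side not reproduced here:)

|G(u)(d) − (W2·σ(W1·[u(x0), u(x1), . . . , u(xm)]^T + b1) + b2)| < ε, for all u∈V.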
Learning explicit operators. First, we consider a pedagogical example described by
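(The display for this pedagogical example is not reproduced in the filed text; consistent with the antiderivative operator referenced below and restated in Section S13 as Problem 1.A, it is:)

ds(x)/dx = u(x), with s(0) = s0, so that G(u)(x) = s0 + ∫_0^x u(t)dt.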
We train FNNs and residual neural networks (ResNets) to learn the antiderivative operator. The p-values of two-sided T-tests indicate whether the test error of the unstacked DeepONet with bias is significantly smaller than that of other networks.
(Table data missing or illegible when filed.)
To obtain the best performance of FNNs, we grid search over three hyperparameters: depth from 2 to 4, width from 10 to 2560, and learning rate from 0.0001 to 0.01. The mean squared errors (MSE) of the test dataset with learning rates 0.01, 0.001, and 0.0001 are shown in the accompanying drawings.
Compared to FNNs and ResNets, DeepONets have a much smaller generalization error and thus a smaller test error. Here we do not aim to find the best hyperparameters, and only test the performance of the two DeepONets listed in Table S3. The training trajectory of an unstacked DeepONet with bias is shown in the accompanying drawings.
In addition to the aforementioned integral operator, we consider integro-differential operators, namely, fractional differential operators, in order to demonstrate the flexibility of DeepONets to learn more complicated operators. The first fractional differential operator we learn is the 1D Caputo fractional derivative:
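(In standard form, identifying the operator output G(u)(y, α) with the Caputo derivative of order α∈(0, 1) of u evaluated at y, the 1D Caputo fractional derivative is:)

G(u)(y, α) = [1/Γ(1−α)] ∫_0^y u′(τ)(y−τ)^{−α} dτ, 0 < α < 1.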
Next, we will test DeepONets for the situation where the training dataset and test dataset are not sampled from the same function space. Specifically, the input functions in the training dataset are sampled from the space of GRF with the covariance kernel kl(x1, x2)=exp(−∥x1−x2∥²/(2l²)), where l is the length-scale parameter. After the network is trained, instead of testing functions also in this space, we will always use functions sampled from the space of GRF with l=0.1 for testing. To quantify the difference between two GRF spaces of different correlation lengths, we use the 2-Wasserstein (W2) metric to measure their distance. When l is large, the W2 metric is large, and the test error is also large. When l is smaller, the W2 metric and test error become smaller. If l=0.1, the training and test spaces are the same, and thus the W2 metric is 0. It is interesting that the test MSE grows exponentially with respect to the W2 metric.
The second fractional differential operator we learn is the 2D Riesz fractional Laplacian:
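(In standard singular-integral form, with the exact normalization constant C2,α not reproduced here, the 2D Riesz fractional Laplacian is:)

(−Δ)^{α/2}u(x) = C2,α p.v. ∫_{R²} [u(x) − u(y)]/∥x − y∥^{2+α} dy,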
where p.v. means principal value. The input and output functions are both assumed to be identically zero outside of a unit disk centered at the origin. The 2D fractional Laplacian reduces to the standard Laplacian Δ as the fractional order α goes to two. For learning this operator, we specify the V space to be the orthogonal space spanned by the Zernike polynomials, which are commonly used to generate or approximate functions defined on a unit disk.
An important question for the effectiveness of DeepONet is how fast it learns new operators. We investigate this question by learning a system of ODEs first and subsequently a nonlinear PDE. First, we consider the motion of a gravity pendulum with an external force described by
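(The display for the pendulum system is not reproduced in the filed text; a reconstruction consistent with the parameter k, the prediction time T, and the external force u(t) discussed in Section S8 is:)

ds1/dt = s2, ds2/dt = −k sin(s1) + u(t), with s1(0) = s2(0) = 0 and t∈[0, T].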
The test and generalization errors of networks with different widths are shown in the accompanying drawings.
Next, we learn an implicit operator in the form of a nonlinear diffusion-reaction PDE with a source term u(x) described by
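(The display for this system is not reproduced in the filed text; a reconstruction consistent with the source term u(x) and the operator mapping u to the solution s(x, t), with D and k denoting the diffusion coefficient and reaction-rate coefficient whose values are not reproduced here, is:)

∂s/∂t = D ∂²s/∂x² + k s² + u(x), x∈(0, 1), t∈(0, 1],

with zero initial and boundary conditions assumed.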
We investigate the error tendency with respect to the number of u samples and the value of P. When we use 100 random u samples, the test error decreases first as P increases.
Next, we demonstrate that we can learn high-dimensional operators, so here we consider a stochastic ODE and a stochastic PDE and present our main findings.
Before the error saturates, the rates of convergence with respect to both P and the number of u samples obey a polynomial law in most of the range.
Consider the population growth model
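(The display for the model is not reproduced in the filed text; consistent with its restatement in Section S13 as Problem 5, it is the linear growth equation driven by a random rate k(t, ω):)

dy(t, ω)/dt = k(t, ω) y(t, ω), t∈[0, T], y(0) = y0,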
k(t, ω)∼GP(k0(t), Cov(t1, t2)).
We use DeepONet to learn the operator mapping from k(t; ω) of different correlation lengths to the solution y(t; ω). We note that we do not assume we know the covariance function for training DeepONet. The main differences between this example and the previous examples are that 1) here the input of the branch net is a random process instead of a function, 2) the input of the trunk net contains both physical spaces and random spaces. Specifically, to handle the random process as the input, we employ the Karhunen-Loeve (KL) expansion,
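(In standard form, the truncated KL expansion referred to here, which produces the eigenpairs (λi, ei(t)) used below, is:)

k(t; ω) ≈ k0(t) + Σ_{i=1}^{N} √λi ei(t) ξi(ω),

where ξi(ω) are independent standard Gaussian random variables and (λi, ei) are the leading eigenvalues and eigenfunctions of the covariance function.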
[√λ1 e1(t), √λ2 e2(t), . . . , √λN eN(t)]∈R^{N×m}
We choose N=5, which is sufficient to conserve 99.9% of the stochastic energy. We train a DeepONet with a dataset of 10000 different k(t; ω) with l randomly sampled in [1, 2], and for each k(t; ω) we use only one realization. The test MSE is 8.0×10^−5±3.4×10^−5, and as an example the prediction for 10 different random samples from k(t; ω) with l=1.5 is shown in the accompanying drawings.
Next, we consider the following elliptic problem with multiplicative noise
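(The display for this problem is not reproduced in the filed text; a reconstruction consistent with the operator equation restated in Section S13 as Problem 6, with a deterministic forcing f and homogeneous Dirichlet boundary conditions assumed for illustration, is:)

−div(e^{b(x, ω)}∇u(x, ω)) = f(x), x∈(0, 1), with u(0) = u(1) = 0.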
We use DeepONet to learn the operator mapping from b(x; ω) of different correlation lengths to the solution u(x; ω). We train a DeepONet with a dataset of 10000 different b(x; ω) with l randomly sampled, and for each b(x; ω) we use only one realization. The test MSE is 2.0×10^−3±1.7×10^−3, and as an example the prediction for 10 different random samples from b(x; ω) with l=1.5 is shown in the accompanying drawings.
In summary, in DeepONets, we first construct two sub-networks to encode input functions and location variables separately, and then merge them together to compute the output.
DeepONet may be used for a number of applications, such as, for example, the following.
Multiple DeepONets can be configured together to represent one of a plurality of multiphysics problems. An example of multiple DeepONets is shown in the accompanying drawings.
One of the plurality of multiphysics problems includes forecasting, the forecasting comprising predicting a time and a space of a state of a system. One of the plurality of multiphysics problems includes interrogating a system with different input scenarios to optimize design parameters of the system. One of the plurality of multiphysics problems includes actuating a system to achieve efficiency/autonomy.
One of the plurality of multiphysics problems includes identifying system parameters and discovering unobserved dynamics.
The following supplemental information is helpful in understanding the present invention.
We list in the Table in
Let C(K) denote the Banach space of all continuous functions defined on a compact set K⊂X with sup-norm ∥f∥_{C(K)}=max_{x∈K}|f(x)|, where X is a Banach space. We first review the definition and sufficient condition of Tauber-Wiener (TW) functions, and the definition of a continuous operator.
Definition S1 (TW). If a function σ:R→R (continuous or discontinuous) satisfies that all the linear combinations Σ_{i=1}^{N} ciσ(λix+θi), λi∈R, θi∈R, ci∈R, i=1, 2, . . . , N, are dense in every C([a, b]), then σ is called a Tauber-Wiener (TW) function.
Theorem S2 (Sufficient condition for TW). Suppose that σ is a continuous function, and σ∈S′(R) (tempered distributions), then σ∈(TW), if and only if σ is not a polynomial.
It is easy to verify that all the activation functions commonly used nowadays, such as sigmoid, tanh and ReLU, are TW functions.
Definition S3 (Continuity). Let G be an operator between topological spaces X and Y. We call G continuous if for every ϵ>0, there exists a constant δ>0 such that
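(The displayed condition is not reproduced in the filed text; written with the norms of the Banach spaces used throughout, the standard ε–δ statement is:)

∥G(u) − G(v)∥_Y ≤ ϵ whenever ∥u − v∥_X ≤ δ, for all u, v in the domain of G.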
We recall the following two main theorems of approximating nonlinear continuous functionals and operators.
Theorem S4 (Universal Approximation Theorem for Functional). Suppose that σ∈(TW), X is a Banach Space, K⊂X is a compact set, V is a compact set in C(K), f is a continuous functional defined on V, then for any ϵ>0, there are a positive integer n, m points x1, . . . , xm∈K, and real constants ci, θi, ξij, i=1, . . . , n, j=1, . . . , m, such that
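(The concluding display of Theorem S4, in its standard form, is:)

|f(u) − Σ_{i=1}^{n} ci σ(Σ_{j=1}^{m} ξij u(xj) + θi)| < ϵ

for all u∈V.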
Theorem S5 (Universal Approximation Theorem for Operator). Suppose that σ∈(TW), X is a Banach Space, K1⊂X, K2⊂Rd are two compact sets in X and Rd, respectively, V is a compact set in C(K1), G is a nonlinear continuous operator, which maps V into C(K2), then for any ϵ>0, there are positive integers n, p, m, constants cki; ξijk, θik, ζk∈R, wk∈Rd, xj∈K1, i=1, . . . , n, k=1, . . . , p, j=1, . . . , m, such that
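(The concluding display of Theorem S5 is the same as that reconstructed for Theorem 1 above:)

|G(u)(y) − Σ_{k=1}^{p} Σ_{i=1}^{n} c_i^k σ(Σ_{j=1}^{m} ξ_{ij}^k u(x_j) + θ_i^k) σ(w_k·y + ζ_k)| < ϵ

for all u∈V and y∈K2.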
The necessary and sufficient conditions of the compactness are given by the following Arzelà-Ascoli Theorem.
Definition S6 (Uniform Boundedness). Let V be a family of real-valued functions on the set K. We call V uniformly bounded if there exists a constant M such that |f(x)|≤M for all f∈V and all x∈K.
Definition S7 (Equicontinuity). Let V be a family of real-valued functions on the set K. We call V equicontinuous if for every ϵ>0, there exists a δ>0 such that |f(x1)−f(x2)|<ϵ for all f∈V and all x1, x2∈K with ∥x1−x2∥<δ.
Theorem S8 (Arzelà-Ascoli Theorem). Let X be a Banach space, and K⊂X be a compact set. A subset V of C(K) is pre-compact (has compact closure) if and only if V is uniformly bounded and equicontinuous.
Here, we list the details of function spaces V and reference solutions.
Gaussian random fields (GRFs). In most examples, we use the mean-zero GRFs for the input function u(x):
u(x)˜GP(0,k1(x1,x2)),
Orthogonal polynomials. Let M>0 and let Pi be the i-th orthogonal polynomial basis function. We define the N-dimensional orthogonal polynomial space as
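(The display defining this space is not reproduced in the filed text; given that the coefficients ai are sampled from [−M, M] as described next, one consistent reconstruction is:)

{Σ_{i=0}^{N−1} ai Pi(x) : ai∈[−M, M], i=0, . . . , N−1}.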
We generate the samples of the input function u(⋅) from this space by randomly sampling the expansion coefficients ai from [−M, M].
Here, we consider four different bases Pi(⋅):
Reference solutions. The reference solutions of all ODE systems are obtained by a Runge-Kutta method, and the reference solutions of all deterministic PDEs are obtained by a second-order finite difference method.
The exact solution of the stochastic ODE is y = y0 e^{K(t)}, where K(t) = ∫_0^t k(s; ω) ds.
The exact solution of the stochastic PDE is
Numerical solutions are obtained by approximating k(t; ω) or b(x; ω) with their truncated Karhunen-Loeve expansions.
S3 Gaussian Random Field with the Gaussian Kernel
Suppose that X(t)˜GP(0, exp(−|x−y|2)). Then, by the Wiener-Khinchin Theorem, we have
Applying a linear interpolation Π1 on the interval [ti, ti+1], then
This part is the proof of Theorem 2. All the concepts and notations involved here are defined in the paragraph of Theorem 2.
Lemma. W:=∪∞i=1 Wi is compact.
Proof. At first, we prove that W is pre-compact. For any ε>0, by (3), there exists an m0 such that
Since Wm
Now for all u∈W and all x, y∈[a, b], |x−y|<δ, if u∈Wm
which shows the equicontinuity of W. In addition, it is obvious that W is uniformly bounded, so we know that W is pre-compact by applying the Arzelà-Ascoli Theorem.
Next we show that W is closed. Let {wi}∞i=1⊂W be a sequence which converges to a w0∈C[a, b]. If there exists an m such that {wi}⊂Wm, then w0∈Wm⊂W. Otherwise, there is a subsequence {Li
Next we show the proof of Theorem 2.
Proof. For u∈V and um∈Um, according to the bound (3) and the Lipschitz conditions of the right-hand side of the ODE, we can derive that the operator is Lipschitz continuous.
Here κ(m, V) is defined in (3). By Grönwall's inequality, we have
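(The resulting bound, consistent with the quantity appearing in the hypothesis of Theorem 2, is of the form:)

∥G(u) − G(um)∥_{C([a, b]; RK)} ≤ c(b−a)κ(m, V)e^{c(b−a)}.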
Define Sm={(u(x0), u(x1), . . . , u(xm))∈Rm+1|u∈V}. Then Sm is a compact set in Rm+1 as it is the image of a continuous map on the compact set V. As there is a bijective map between Sm and Um, we may define a vector-valued function ϕ on Sm by
Because G is a continuous operator over V, ϕ is a continuous function over Sm. For any ε>0, make m large enough so that c(b−a)κ(m, V)e^{c(b−a)} is less than ε. By the universal approximation theorem of neural networks for high-dimensional functions, there exist W1∈R^{n×(m+1)}, b1∈R^{m+1}, W2∈R^{K×n}, b2∈R^{K}, such that
In summary, choosing the value of m such that c(b−a)κ(m, V)e^{c(b−a)} is less than ε is sufficient to achieve accuracy ε.
Different metrics have been proposed to measure the distance between two Gaussian processes, and we will use the 2-Wasserstein metric in this study. Suppose we have two GPs f1∼GP(m1, k1) and f2∼GP(m2, k2) defined on X, where m1 and m2 are the mean functions, and k1 and k2 are the covariance functions. Each covariance function ki has an associated covariance operator Ki: L2(X)→L2(X) defined by
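(In standard form, the defining integral of the covariance operator is:)

(Ki φ)(x) = ∫_X ki(x, s) φ(s) ds, for φ∈L2(X).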
Then the 2-Wasserstein metric between the two Gaussian processes f1 and f2 is given by
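(The display is not reproduced in the filed text; for Gaussian processes the 2-Wasserstein distance has the standard Bures-Wasserstein form, stated here for completeness:)

W2(f1, f2)² = ∥m1 − m2∥²_{L2(X)} + Tr(K1) + Tr(K2) − 2 Tr((K1^{1/2} K2 K1^{1/2})^{1/2}).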
In practice, we may not be able to compute the 2-Wasserstein metric analytically. To estimate it numerically, we can estimate the trace of the operator from the trace of the corresponding covariance matrix. For example, let K̂1∈R^{N×N} be a covariance matrix of k1, i.e., the ith-row and jth-column entry of K̂1 is K̂1[i, j]=k1(xi, xj), where {x1, x2, . . . , xN} are N equispaced points in the domain of f1. Then we have Tr K1 ≈ Tr K̂1/N, and we can estimate other traces in the same way.
In all problems we use the ReLU activation function, except that for the fractional differential operators we use the hyperbolic tangent. We use the Adam optimizer with learning rate 0.001, except that for the Legendre transform it is 0.0001. The other parameters and network sizes are listed in the Tables found in the accompanying drawings.
Computational Cost (Hours) of Training of DeepONet and NN Baselines.
DeepONet size is listed in the below Table, and the network sizes of other models are chosen as the sizes with which the best accuracy is obtained. All networks are trained using 4 cores on an Intel Xeon Gold 6242 processor.
(Table data missing or illegible when filed.)
DeepONet size is listed in the below Table, and the network sizes of other models are chosen as the sizes with which the best accuracy is obtained. The numerical solvers used for each case are described. The network inferences and numerical solvers are performed using 1 core on an Intel Xeon Gold 6242 processor.
(Table data missing or illegible when filed.)
We consider the Legendre transform of a function f (x):
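(The display for the transform is not reproduced in the filed text; consistent with the operator written out in Section S13, item 4, it is, up to the conventional normalization factor:)

G(f)(n) = ∫_{−1}^{1} Pn(x) f(x) dx, n = 0, 1, 2, . . . ,

where Pn is the Legendre polynomial of degree n.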
We consider a nonlinear 1D ODE system
S8 Gravity Pendulum with an External Force
The number of sensors required to distinguish two input functions depends on the value of k, the prediction time T, and the input function space. For the case with the default parameters k=1, T=1 and l=0.2, when the number of sensors m is small, the error decays exponentially as we increase the number of sensors.
MSE∝4.6^{−#sensors}.
When m is already large, the effect of increasing m is negligible. The transition occurs at ˜10 sensors, as indicated by the arrow.
To predict s for a longer time, more sensors are required.
In Theorem 2, m should be large enough to make Te^{cT}κ(m, V) small. In Section S3 we show theoretically that m∝l^{−1} for the GRF function space with the RBF kernel, which is consistent with our computational result here. The factor Te^{cT} in the bound is loose compared to the computational results.
Here, we investigate the error tendency under different conditions, including network size and training dataset size. We take T=3 for the following experiments in this subsection. By varying the network width, we observe that there is a best width to achieve the smallest error.
Next, we investigate the effect of different function spaces, including GRF with different length scales l and the space of Chebyshev polynomials with different numbers of bases. For a fixed number of sensors, we observe that there exists a critical length scale, around which the training and test errors change rapidly, as indicated by the arrows in the accompanying drawings.
l∝m^{−1} and #Bases∝√m.
The inverse relation between l and m is consistent with the theoretical results in Section S3 and the computational results in Section 14.1. Therefore, when the function space complexity increases, one may increase m to capture the functions well.
S9 Diffusion-Reaction System with a Source Term
Consider the advection equation
To generate the training dataset, we solve the system using a finite difference method on a 100 by 100 grid, except for Case I with the analytical solution s(x, t)=f(sin²(π(x−t))). Then for each solution s(x, t) we randomly select P=100 points out of these 10000 (=100×100) grid points. The dataset size is equal to the product of P and the number of input function samples. We sample f(x) from a GRF with the Gaussian kernel of different length scales l, and we use 500 random f samples for training.
Consider the advection-diffusion equation
Below we detail the data generation and NN architectures for learning the 1D Caputo fractional derivative and the 2D fractional Laplacian.
The error (mean and standard deviation) tested on the space of GRF with the correlation length l=0.1 for DeepONets trained with GRF spaces of different correlation lengths l. The 2-Wasserstein metric between the GRF of l=0.1 and the GRF of different correlation lengths l. The test error grows exponentially with respect to the W2 metric.
The V spaces we consider are the Legendre polynomial space, the poly-fractonomial space, and GRF, all of which have been mentioned in Section S2. For the Legendre and poly-fractonomial spaces, we take M=2 and N=11. The expansion coefficients are taken from a Sobol sequence, a quasi-random sequence. For the GRF, we set l=0.1 and l=0.3 in the Gaussian kernel for the training and test sets, respectively.
Given the discrete version of the input function u(⋅), i.e., [u(x1), . . . , u(xm)]T, the evaluation point y, and the fractional order α, the reference solution G(u)(y, α) is computed by using the L1 scheme, a finite difference scheme for approximating the Caputo fractional derivative. We take 2000 finite difference points in order to ensure that the computed value of the output function is sufficiently accurate.
The training set is determined by the following parameters. The number of different samples of u(⋅) is Nu=10000. The number of equi-spaced “sensors” for u(⋅) is m=15. The numbers of different equi-spaced evaluation points y and α for the output function G(u)(y, α) are both Ny=Nα=10. The number of training points is calculated as Nu×Ny×Nα=10^6. The same parameters are taken for the test set.
The V space, where u(r, θ) is sampled from, is an orthogonal polynomial space spanned by the Zernike polynomial Zi(r, θ), where r and θ are the radius and angle in the polar coordinate, respectively. We take 15 basis functions {Z0, Z1, . . . , Z14}, namely, N=15, and set M=2 for sampling the expansion coefficients ai in Section S2.
Given the discrete version of the input function u(⋅), i.e., [u(x1), . . . , u(xm)]T, the evaluation point y, and the fractional order α, the reference solution G(u)(y, α) is computed by using the vector Grünwald-Letnikov formula for approximating Riesz fractional Laplacian. We take 16 Gauss quadrature points for evaluating the integral with respect to differentiation direction and take 2000 finite difference points to approximate fractional directional derivative along a specific differentiation direction.
For the DeepONet and CNN models, the raw data are the same. We choose Nu=5000, Nx=15×15=225, Ny=15×15=225, and Nα=10. But the raw data are organized in different ways for the two models. The DeepONet model organizes the data in a loose way, and thus the size of the training set is large, namely, Nu×Ny×Nα=1.125×10^7. In contrast, the CNN model leverages the spatial structure of the raw data and rearranges the data into a compact form (image). The size of the training set is only Nu=5000. Also, the parameters for the test set are the same as those for the training one.
For the CNN, the filter size is fixed to be 3×3, and the maxpooling window size is 2×2. The input images are shaped into a 4D array having the shape [Nu, Hx, Wx, 1], where the height (Hx) and width (Wx) of the image for the contour plot of one sample of u(⋅) are taken to be Hx=Wx=√Nx. The output images are arranged into another 4D array having the shape [Nu, Hy, Wy, Nα], where the height (Hy) and width (Wy) of the images for the contour plots of {G(u)(y, α1), G(u)(y, α2), . . . , G(u)(y, αNα)} are taken to be Hy=Wy=√Ny.
S13 Hölder Continuity of the Operators in this Patent
For all the explicit and implicit operators in our examples, the operators G are Hölder continuous.
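(The displayed condition is not reproduced in the filed text; with the constant C and exponent α referred to below, where α=1 corresponds to Lipschitz continuity, it can be written as:)

∥G(f) − G(g)∥_Y ≤ C∥f − g∥_X^α for all f, g in the input space, for some α∈(0, 1].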
Here C>0 depends on f and g and on the operator G. Here X and Y are Banach spaces, and they refer to the space of continuous functions on a compact set unless otherwise stated.
Let GN be the approximated G using DeepONet. Let fh be an approximation of f in Y, possibly by collocation or neural networks. Then
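(The bound that follows, not reproduced in the filed text, is the standard splitting of the total error by the triangle inequality and the Hölder condition:)

∥G(f) − GN(fh)∥_Y ≤ ∥G(f) − G(fh)∥_Y + ∥G(fh) − GN(fh)∥_Y ≤ C∥f − fh∥_X^α + ∥G(fh) − GN(fh)∥_Y.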
The explicit operators and their Lipschitz continuity (α=1) are presented below.
1. (Simple ODE, Problem 1.A) G(u)(x)=s0+∫_0^x u(s)ds.
2. (Caputo derivative, Problem 2) The operator is Lipschitz continuous with respect to its argument in weighted Sobolev norms, c.f. [47, Theorem 2.4].
3. (Integral fractional Laplacian, Problem 3) The operator is Lipschitz continuous with respect to its argument in weighted Sobolev norms, c.f. [14, Theorem 3.3].
4. (Legendre transform, Problem 7 in Equation (S6)). The Lipschitz continuity of the operator G(u)(n)=∫_{−1}^{1} Pn(x)u(x)dx can be seen as follows. For any non-negative integer n,
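(Since |Pn(x)| ≤ 1 on [−1, 1], the bound that completes this argument, not reproduced in the filed text, is:)

|G(u)(n) − G(v)(n)| ≤ ∫_{−1}^{1} |Pn(x)||u(x) − v(x)| dx ≤ 2∥u − v∥_{C([−1, 1])}.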
5. The linear operator from Problem 9 in (S7) is Lipschitz continuous with respect to the initial condition in the norm of the space of continuous functions, from the classical theory for linear parabolic equations.
The implicit operators from dynamical systems and differential equations are more complicated. The Hölder continuity of these operators is elaborated in the following text.
The implicit operator from Problem 1. The Lipschitz continuity of the operator has been shown in the proof of Theorem 2. For Problem 1.C, the solution is still bounded as long as the input u is not too large. With a bounded solution s, the nonlinear term s² is Lipschitz continuous, and thus the Lipschitz continuity of the operator is still valid, as in the proof of Theorem 2.
The implicit operator from Problem 4 is Lipschitz continuous with respect to the right-hand side u when u is not too large in value, from the classical theory for quasi-linear parabolic equations.
The implicit operator from Problem 5. dy(t, ω)=k(t, ω)y(t, ω) dt, y(0)=y0. Here k(t) is a Gaussian process with mean zero and a smooth covariance function. The exact solution is y=y0 exp(∫_0^t k(s)ds). Then the solution operator is Lipschitz continuous in the pathwise sense. Let k1(t) and k2(t) be the coefficients in Problem 5 and y1(t) and y2(t) the corresponding solutions. By the mean value theorem, |e^x−e^y|≤|x−y|(e^x+e^y) holds for all x, y∈R and
Then for any t∈[0, T]
∥G(k1)(⋅,ω)−G(k2)(⋅,ω)∥C[0,T]≤C(ω,t)|k1(⋅,ω)−k2(⋅,ω)|L∞([0,t]),
The implicit operator from Problem 6. The operator can be written as u=G(b) and ui=G(bi), i=1, 2 satisfy the following equations:
−div(e^{bi(x, ω)}∇ui(x, ω))=f(x), i=1, 2.
Then ∥ui(ω)∥H
−div(eb
Then by the stability of the elliptic equation, we have
Then by the mean value theorem
According to [4, Proposition 2.3], all the random variables
i=1, 2, have finite moments of any order. Then, we obtain the pathwise Lipschitz continuity of the operator G.
The implicit operator in Problem 8 is also Lipschitz continuous. The solution can be represented as s0(X−1(t, x)), where X(t, x) satisfies the following equation
dX(t, x)/dt=a(t, X(t, x)), X(0, x)=x.
As there exists a0>0 such that a(t, x)>a0>0 in the setting of Problem 8, X(t, x)>0 is away from zero. Denote the solution with coefficient a and initial condition u0 by u0(X−1) and the solution with coefficient b and initial condition v0 by v0(X−1). We then can derive the Lipschitz continuity of the operator from the following facts that
Referring now to
In this study, all our training datasets are obtained from numerical solvers, i.e., the trained DeepONet is a surrogate of the numerical solver. Hence, DeepONets cannot be better than the numerical solvers in terms of accuracy. However, the computational cost of running inference with DeepONet is significantly lower than that of the numerical solver.
More broadly, a DeepONet could represent a multiscale operator trained with data spanning several orders of magnitude in spatio-temporal scales, for example, trained by data from the molecular, mesoscopic and continuum regimes in fluid mechanics or other multiscale problems. We could also envision other types of composite DNNs for developing multiphysics operators, e.g., in electro-convection involving the evolution of the flow field and concentration fields of anions and cations due to continuous variation of imposed electric potentials. Indeed, in ongoing work, we have developed an extension of DeepONet for simulating this electro-convection multiphysics problem [8], where we show that DeepONet is significantly faster than a spectral element solver. We have obtained a similar speed-up and high accuracy in hypersonics for learning the aerodynamics coupled with the finite rate chemistry of multiple species. Learning such multiscale and multiphysics nonlinear operators will simplify computational modeling and will facilitate very fast predictions of complex dynamics for unseen new parameter values and new input functions (boundary/initial conditions/excitation) with good accuracy, if the generalization error is bounded.
This architecture is linked to several benefits. For example, a generalization capability: generalization here means the ability to extrapolate outside the region of seen (training) data. This ability is due to the feature expansion in the branch net and the extra overparameterization in the trunk net.
Incorporating the physics: This is possible because of the trunk net 200 (derivatives can be taken with respect to the trunk net 200 input, y). Other state-of-the-art architectures such as LSTM (long short-term memory) networks do not have this capability because they do not have the trunk net 200.
Less data demanding: This is because we can incorporate the physics, which replaces some of the previously required data with the physical model.
Adaptivity to non-uniform, incomplete, and scattered dataset: The trunk net 200 enables DeepONet to be trained and predict the output on scattered datasets unlike the existing state-of-the-art architectures.
The DeepONet architecture may be applied to many different applications. For example, it may be applied to forecasting applications. More specifically, airfoils, solar thermal systems, VIV, material damage, path planning, material processing applications, additive manufacturing, structural health monitoring and infiltration.
The DeepONet architecture may be applied to design applications. More specifically, airfoils, material damage and structural health monitoring.
The DeepONet architecture may be applied to control/autonomy applications. More specifically, airfoils, electro-convection and path planning.
The DeepONet architecture may be applied to identification/discovery. More specifically, VIV, material damage and electro-convection.
The DeepONet architecture may be applied to resin transfer molding (RTM).
It would be appreciated by those skilled in the art that various changes and modifications can be made to the illustrated embodiments without departing from the spirit of the present invention. All such modifications and changes are intended to be within the scope of the present invention except as limited by the scope of the appended claims.
This application claims benefit from U.S. Provisional Patent Application Ser. No. 63/145,783, filed Feb. 4, 2021, which is incorporated by reference in its entirety.
This invention was made with government support under grant number DE-SC00119453 awarded by the U.S. Department of Energy and grant number HR0011-20-9-0062 awarded by the Defense Advanced Research Projects Agency. The government has certain rights in the invention.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/US2022/015340 | 2/4/2022 | WO |
| Number | Date | Country | |
|---|---|---|---|
| 63145783 | Feb 2021 | US |