The present disclosure concerns the learning of phenomena representing real systems with parsimonious neural networks having very few connections.
The present disclosure applies in particular to the simulation of a static real system, for example to evaluate the response of the real system in new situations, but also to the simulation of a dynamic real system over long periods of time, for example to model the evolution of a real system. The dynamic model is based on a recurrent form of a feedforward neural network that is called a “recurrent pattern” in the following.
The present disclosure has an advantageous application in the simulation of complex physical systems in at least real time.
The present disclosure proposes a method for the learning of real phenomena by parsimonious neural networks having very few connections. This can concern physical, biological, chemical, or even computer phenomena.
State-of-the-art methods have been largely inspired by the biological brain, which is highly redundant. Redundancy helps protect the brain from the loss of neural cells, whether accidental or not. It turns out that redundancy in artificial neural networks plays a major role in the learning process.
The first cause of redundancy is linked to the organization of the neural network's topology into layers of neural cells. It is up to the user to define the number of layers and the number of cells per layer. This construction is done manually in a trial-and-error process. The neural network must be large enough to carry out the learning, but its size is not minimal and is necessarily redundant.
This redundant nature plays a major role in the learning process. Indeed, according to the publication by LeCun, Yann; Bengio, Yoshua; Hinton, Geoffrey (2015), “Deep learning,” Nature 521 (7553): 436-444, the learning process is not trapped by the local minima when the neural network is sufficiently large.
This fundamental property makes the gradient descent method a possible candidate for achieving learning. But this method, reputed to have a very low convergence rate (https://en.wikipedia.org/wiki/Gradient_descent), ensures a very good error descent at the start of the learning process. Hence the idea of stochastic gradient descent: Bottou, L (2010), Large-scale machine learning with stochastic gradient descent, in Proceedings of COMPSTAT2010 (pp. 177-186) Physica-Verlag HD, which reinforces this property by changing the error function with each iteration of the gradient. This involves applying an iteration of the gradient to each training sample in turn. Sometimes the stochastic gradient descent method is applied by small groups of samples. Stochastic gradient descent, like gradient descent, does not have good local convergence. The answer to this problem is redundancy. Indeed, due to this redundant nature, the learning process must stop prematurely to avoid the phenomenon of overfitting. The gradient and stochastic gradient descent methods are therefore used only within their area of effectiveness.
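As an illustration of the mini-batch variant described above, the following is a minimal sketch of stochastic gradient descent on a toy least-squares problem (the function name, hyperparameters, and data are our own choices, not part of the disclosure):

```python
import numpy as np

def sgd(w, X, Y, lr=0.1, epochs=200, batch=4, seed=0):
    """Mini-batch stochastic gradient descent on the squared error ||Xw - Y||^2."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    for _ in range(epochs):
        # each mini-batch defines its own error function for this iteration,
        # so the objective changes at each step of the gradient
        for idx in np.split(rng.permutation(n), n // batch):
            grad = 2.0 * X[idx].T @ (X[idx] @ w - Y[idx]) / len(idx)
            w = w - lr * grad
    return w

# toy linear data: the exact minimizer is w_true
rng = np.random.default_rng(1)
X = rng.normal(size=(32, 3))
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true
w = sgd(np.zeros(3), X, Y)
```

On this convex, noise-free problem the iteration converges; the good initial descent and the poor fine-grained local convergence discussed above are both typical of the method on larger problems.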
Finally, in a redundancy context, the large number of connection weights to be determined requires the use of massive amounts of data. The state of the art goes hand in hand with what is known as “big data.”
The state of the art represents a coherent construction based on redundancy. But the absence of local convergence shows that the state of the art is oriented towards qualitative learning: if the answer is greater than 0.5, it is rounded to 1, and if it is less than 0.5, it is rounded to 0. Quantitative responses have precision requirements which are not taken into account by these methods.
The present disclosure responds to the requirements of the emerging field of modeling complex physical systems by creating a digital copy, also called a Digital Twin, of the physical system, adapted to accurately predict the state of a physical system more quickly than the real system, and preferably thousands of times faster, so as to be able to simulate a large number of possible scenarios impacting the physical system before making the best decision for the real system.
The Digital Twin concept has been introduced in the following publications:
Most learning methods, when applied to quantitative phenomena, are generally limited to relatively simple cases which require only shallow models. In addition to neural methods, we can cite methods such as kriging and support vector machine regression:
These two extremely popular methods can be likened to shallow neural networks, having only three layers of neurons.
These methods, as well as neural networks with a low number of layers, cover most requirements in the field of modeling quantitative phenomena.
The need for deep and quantitative learning appears in special cases such as:
Although the state of the art is dominated by the manual determination of a neural network's topology, the question arises concerning determination of a topology adapted to the problem. The automatic search for an optimal topology is an old topic of research in the neural field. We can cite for example Attik, M., Bougrain, L., & Alexandre, F. (2005, September), Neural network topology optimization, in International Conference on Artificial Neural Networks (pp. 53-58) Springer, Berlin, Heidelberg, which is representative of pruning techniques for simplifying a network.
We can cite other topological optimization methods:
These are based on genetic algorithms, which are known to be very slow. Thanks to the computing resources now available, these methods are nevertheless increasingly used, operating on populations of redundant neural networks.
However, applications also exist for which the amount of available data is very limited (the term “small data” is then used), and in this case the redundant structures of neural networks cannot be used because they require more data than what is available.
Other approaches consist of creating a reduced model by relying on computation-intensive simulation software, which requires hours of calculation and which is not compatible with real time. These approaches consist of creating a space of reduced dimension onto which the parameters of the system are projected. For example, for the case of a dynamic system, by denoting as Xi the solution of a problem not reduced at time i, a solver must, to determine Xi+1 from Xi, solve a system of N equations of the type F(Xi, Xi+1)=0.
The number N is also the dimension of the vectors Xi and Xi+1. The implementation of a reduced model consists in determining a reduced orthonormal basis which is denoted U=(U1, U2, . . . , Un) where n<<N. We can therefore compress Xi by: xi=UTXi, where the xi are the coefficients of size n of Xi, in the reduced basis U, and we can decompress xi to obtain Xi as follows: Xi≈Uxi.
The reduced model consists of solving, at each time interval, a system F(Uxi, Uxi+1)=0 whose unknown xi+1 is of small size n. This system is solved in the least squares sense.
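The compression and decompression steps above can be sketched as follows, assuming (as is common practice, though not imposed by the text) that the reduced orthonormal basis U is obtained from a truncated SVD of a matrix of snapshots:

```python
import numpy as np

# snapshot matrix: columns are full states X_i of large dimension N
N, m = 500, 40
rng = np.random.default_rng(0)
modes = rng.normal(size=(N, 3))          # hidden low-rank structure
S = modes @ rng.normal(size=(3, m))      # snapshots lie in a 3-dimensional subspace

# reduced orthonormal basis U = (U1, ..., Un) with n << N, via truncated SVD
n = 3
U, _, _ = np.linalg.svd(S, full_matrices=False)
U = U[:, :n]

X = S[:, 0]          # a full state, dimension N
x = U.T @ X          # compression:   x_i = U^T X_i   (size n)
X_rec = U @ x        # decompression: X_i ≈ U x_i     (size N)
```

Because the snapshots here are exactly rank 3, the reconstruction is exact; on real data the truncation introduces an approximation error that grows as n is reduced.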
As schematically represented in
This reduced model approach has been proposed for example in the following publications:
This approach is not without disadvantages, however.
First, the reduced problem is highly unstable, which means that a small disturbance in the data leads to a large deviation from the solution. Therefore, approximating the state of a complex physical system with such a model is difficult.
In addition, the minimization of ∥F∥2 implies computing a residual, of large dimension N, a certain number of times, which can prove to be costly in computing time. However, because of the instability problem, the residual must be minimized with the greatest precision at each step. As a result, the current methods are insufficiently precise to describe non-linear complex physical systems, and too costly in computing time to be able to be used in real time in embedded systems.
The basic idea of these methods is to extract modeling information from the simulation software through the residual calculation. The approach of the present disclosure is parsimonious enough to capture the physical and biological phenomena conveyed by the data alone.
Thus there is no currently existing solution that allows accurately and quickly modeling a complex physical system, over long periods of time, in order to reproduce it in the form of a digital twin.
The present disclosure aims to remedy the shortcomings of the prior art described above, based on the use of redundant neural networks for learning real phenomena representing real systems.
In one or more embodiments, the present disclosure provides a method of dynamic simulation of a complex physical system, provided with excellent prediction capabilities over long periods of time and faster than the real time of the physical system.
In some embodiments, the present disclosure is applicable to both static and dynamic modeling of complex physical systems, and also applicable to the nonlinear compression of complex systems. Indeed, the compression ratio increases drastically with the depth of the network. This compression is the basis of the dynamic prediction over long periods of time.
Lastly, the present disclosure aims to provide a neural network structure adapted to the application which is later made thereof, this structure being parsimonious, in other words as reduced as possible in order to require a small amount of data for its learning.
More particularly, the present disclosure relates to a method of construction of a feedforward neural network, comprising a set of processing nodes and of connections between the nodes forming a topology organized in layers, such that each layer is defined by a set of simultaneously calculable nodes, and the input of a processing node of a layer can be connected to the output of a node of any of the previously calculated layers,
Advantageously, but optionally, the selected topology modification is the one, among the candidate modifications, which optimizes the variation of the error in comparison to the previous topology.
In one embodiment, the network error for a given topology is defined by J(Γ, W*) where
In one embodiment, the variation of the network error between a candidate topology and the previous topology is estimated by calculating the quantity: J(Γn, W̃n)−J(Γn−1, Wn−1*) where, abusing the notation, we denote
W̃n can then be initialized with the same connection weights as matrix Wn−1* for the connections common to the two topologies, and, in the case of an additive phase, a connection weight of zero for each link created during the additive phase.
In one embodiment, the estimation of the variation of the network error between a modified topology and the previous topology comprises the estimation of the network error according to the modified topology based on the Lagrange operator ℒ(Γ, W, X, Λ) applied to the connection weights of the neural network, where:
Advantageously, during an additive phase, the variation of the network error between a candidate topology and the previous topology is estimated by calculating the quantity: ℒ(Γn, Wn, X, Λ)−J(Γn−1, Wn−1*), where:
Advantageously, during a subtractive phase, the variation of the network error between a calculated topology and the previous topology is estimated by calculating the quantity: ℒ(Γn, Wn, X, Λ)−J(Γn−1, Wn−1*), where Wn=Wn−1*|Γn is the restriction of Wn−1* to the connections retained in the topology Γn
In one embodiment, the neural network is adapted to simulate a real system governed by an equation of the type Y=f(X) where X is an input datum and Y is a response of the physical system, and the error J of the neural network is defined as a function of the topology Γ and of the matrix W of connection weights of the network, by: J(Γ, W)=Σi=1M∥ƒΓ,W(Xi)−Yi∥2, where ƒΓ,W(Xi) is the output of the neural network, and Xi and Yi are respectively input and output data generated by measurements on the real system.
In one embodiment, the method comprises, once the topology modification has been selected, the determination of a matrix of connection weights of the network by a method of descending the error with respect to said matrix. This step teaches the network the topology obtained after the topological modification.
Unlike the state of the art, this learning process is based on a Gauss-Newton type of descent method having rapid convergence.
Advantageously, the topological optimization step is implemented as a function of mean errors of the neural network on training data on the one hand, and on validation data on the other hand, wherein:
In one embodiment, the neural network comprises at least one compression block suitable for generating compressed data, and a decompression block, the method comprising at least one topological optimization phase implemented on the compression block and decompression block, and further comprising, after topological optimization of the blocks, a learning phase on the entire neural network at fixed topology.
In this case, the initialization step of the neural network comprises:
The method may further comprise the iterative implementation of:
In one embodiment, the method further comprises the selection of the compression and decompression block and the addition of a modeling block, respectively as output from the compression block or as input to the decompression block, wherein at least one topological optimization phase is implemented on the modeling block, and a learning phase at fixed topology is implemented on the set comprising the modeling block and the compression or decompression block.
In one embodiment, the method further comprises the insertion, between the compression block and the decompression block, of a modeling block suitable for modeling the evolution of a dynamic system governed by an equation of the form Xi+1=F(Xi, Pi)+Gi, i≥0, where Xi is a measurable characteristic of the physical system at a given time, Pi describes the internal state of the physical system, and Gi describes an excitation, and the modeling block is suitable for calculating an output xi+1 of the form: xi+1=hΓ̂,Ŵ(xi, pi)+gi, i≥0 (17) where:
The present disclosure also relates to a neural network, characterized in that it is obtained by implementing the method according to the above description.
The present disclosure also relates to a computer program product, comprising code instructions for implementing the method according to the above description, when it is executed by a processor.
The present disclosure also relates to a method of simulation of a real system governed by an equation of type Y=f(X) where X is an input datum and Y is a response of the real system, comprising:
The present disclosure also relates to a method of simulation of a dynamic physical system governed by an equation of the form Xi+1=F(Xi, Pi)+Gi, i≥0 where Xi is a measurable quantity of the physical system at a given time, Pi describes the internal state of the physical system, and Gi describes an excitation, the method comprising the steps of:
In one embodiment, the method of simulation is implemented by means of a neural network constructed according to the method described above and comprising a compression block and a decompression block, and the steps of compression of Xi, application of a neural network, and decompression of xi+1 are respectively implemented by means of the compression block, the modeling block, and the decompression block of the constructed neural network.
Lastly, the present disclosure relates to a method of data compression, comprising:
The method of construction of a neural network according to the present disclosure makes it possible to obtain a neural network whose structure depends on the intended use or application, since the construction comprises a topological optimization phase which is governed by the error of the network on the training and validation data.
In other words, the method of construction simultaneously comprises the construction and training of the neural network, for a specific task. This allows a user of this method to have no need of specific mathematical knowledge in order to choose a neural network structure suitable for the targeted technical application.
More particularly, the method of construction according to the present disclosure makes it possible to construct a parsimonious neural network, meaning where any redundancy is eliminated, optimized for the intended task. This property is obtained by an incremental construction from a possibly minimal initial topology, in other words comprising a single hidden layer comprising a single neuron, then by implementing an iterative process comprising a learning step in the current state of the network, using a method of rapid local convergence, such as the Gauss-Newton method, and a step of topological modification of the network in order to improve the learning. In addition, the implementation of a topological optimization technique in the construction plays a double role:
The topological optimization method gives the neural network an innovative structure to the extent that a neuron of a layer, including the output layer, can be connected to a neuron of any previous layer, including the input layer. Indeed, when a physical phenomenon depends on a large number of parameters, most of these parameters contribute linearly to the response of the system. Hence the advantage of connecting the corresponding inputs directly to the output layer of the neural network. The effect of weakly non-linear parameters can be taken into account by a single intermediate layer between the input and the output, and so on.
The reduction in complexity of the neural network in fact improves its capacity for generalization (ability to give the right answer on unlearned data). This also makes it possible to alleviate learning difficulties (exploding gradients and vanishing gradients) by reducing the number of layers. Indeed, in a network structured in layers, certain cells may simply be used to duplicate previous cells in order to make them available for the next layer. This unnecessarily increases the complexity of the network.
This neural network, used for modeling a complex physical system, provides a very good quality of simulation for reduced computing times, in particular less than the real time of the physical system. The simulation model can be constructed from measurements made during normal operation of the physical system or during test phases.
In addition, the topological optimization of the network is advantageously carried out by the use of the Lagrange operator, or Lagrangian, applied to the connection weights of the neural network. This method makes it possible to calculate, in a particularly rapid manner, the effect of a topological modification of the network (addition/elimination of a neural cell, addition/elimination of a link), which makes it possible to quickly assess and select the best topological improvement of the neural network at each step.
The feedforward neural network is advantageously used, as a recurrent pattern, in the context of dynamic simulation of physical systems in order to predict a future state of the system on the basis of an initial state and possible source terms or excitations.
The neural network is advantageously combined with an approach in which the data representative of the state of the physical system are compressed. The dynamic model simulates the future state of the system on the compressed data, then decompresses the simulated data to return to real space. Unlike the state of the art concerning the reduced basis described above, the recursive loop is not done in real space but in the space of the compressed data, which eliminates noise in the data while ensuring better stability of the dynamic model. This also makes it possible to reduce computing times in the learning and simulation phases.
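The recursion in compressed space can be sketched as follows, with simple linear stand-ins for the trained compression, modeling and decompression blocks (purely illustrative assumptions of our own; in the disclosure these are the neural blocks described herein):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 50, 4
U = np.linalg.qr(rng.normal(size=(N, n)))[0]   # orthonormal columns
A = 0.9 * np.eye(n)                            # stable latent dynamics

encode = lambda X: U.T @ X                     # stand-in for the compression block
decode = lambda x: U @ x                       # stand-in for the decompression block
h = lambda x: A @ x                            # stand-in for the recurrent pattern

def simulate(X0, steps):
    x = encode(X0)              # compress once
    for _ in range(steps):      # the recursive loop stays in compressed space
        x = h(x)
    return decode(x)            # decompress only the final state

X0 = U @ rng.normal(size=n)     # initial state lying in the span of U
X10 = simulate(X0, 10)
```

The loop never returns to the full N-dimensional space between time steps, which is the point made above: noise outside the compressed subspace cannot re-enter the recursion.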
Topological optimization plays a major role in the management of dynamic models. Indeed, if we perform m iterations of a recurrent pattern having n layers, the learning difficulty is equivalent to that of a neural network having n×m layers. The present disclosure therefore makes it possible to reduce n, and consequently the number of calculations and their duration, in two different ways:
Other features, details and advantages of the present disclosure will become apparent from reading the following detailed description, and from analyzing the accompanying drawings, in which:
We will now describe a method of construction of a parsimonious neural network that can be used for modeling a physical system or phenomenon. This method, as well as the methods of data compression, and of simulating a static or dynamic system described below, are implemented by a computer 1 schematically represented in
The method has two phases: a phase of learning and constructing the model, and a simulation phase for applying the model. The two phases can be carried out on different equipment. Only the simulation phase is intended to occur in real time.
In what follows, a real system is any system whose state can be measured at least in part by sensors of physical quantities. Among the real systems, in particular one can list physical, biological, chemical, and computer systems.
It is assumed that the real system which is to be modeled is governed by a model of the type:
Y=ƒ(X) (1)
where X and Y are respectively input and output variables characterizing the state of the system.
For the construction of this model, we have a database of the type (Xi, Yi)i=1M, generated by measurements on the real system, the data being able to be stored in the memory 11, where:
This database is divided into two disjoint subsets, the first constituting a training database formed by the indices, for example, i=1, . . . , M1, M1<M, and the rest of the indices forming a validation database. The purpose of this distribution is to implement a method of cross-validation on the learning of the constructed neural network.
The objective of the method of modeling the physical system is to build an approximate model of (1) in the form:
Y≈ƒΓ,W(X) (2)
where ƒΓ,W is a simulation function calculated by a neural network defined by a topology Γ and a matrix or a list of matrices of connection weights W, so as to be able to simulate the output Y on the basis of an input variable X.
The topology Γ and the matrix W of connection weights are determined by minimization of an error function J of the neural network:
where J quantifies the error between an output of the neural network calculated on the basis of input data Xi and the corresponding target result Yi, calculated on the basis of training data:
J(Γ,W):=Σi=1M1∥ƒΓ,W(Xi)−Yi∥2 (4)
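Formula (4) can be evaluated directly; the following is a minimal sketch, with a stand-in function f playing the role of ƒΓ,W (the data pairs are arbitrary illustrations):

```python
import numpy as np

def J(f, data):
    """Squared-error loss of eq. (4): J = Σ_i ||f(X_i) − Y_i||²,
    summed over the training pairs (X_i, Y_i)."""
    return sum(np.sum((f(X) - Y) ** 2) for X, Y in data)

# toy check with a hypothetical 'network' f and two training pairs
f = lambda X: 2 * X
data = [(np.array([1.0, 0.0]), np.array([2.0, 0.0])),   # exact:  error 0
        (np.array([0.0, 1.0]), np.array([0.0, 1.0]))]   # off by 1 in one component
```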
Neural Network
Referring to
This neural network comprises a set of processing nodes, also called neurons, and of connections between processing nodes, each connection being weighted by a weighting coefficient, the processing nodes and connections forming a topology organized into layers.
Unlike a conventional neural network in which each layer takes its inputs from the outputs of the preceding one and is therefore only connected to the preceding layer, the neural network of the present disclosure is a computational graph in which each layer is defined by the set of nodes which can be calculated simultaneously, and the input of a processing node of one layer can be connected to the output of a processing node of any of the previously calculated layers.
Also as a consequence, the set of processing nodes calculating the outputs of the neural network, hereinafter referred to as the “set of output nodes,” does not form a layer because the output nodes may be calculated in different steps and be distributed among several layers.
In addition, the neural network is of the feedforward type, meaning it does not comprise any computation loop that returns the output of a processing node as input to the same node or to a node of a previous layer.
Finally, the training of the neural network is carried out during its construction, so as to adapt the structure of the neural network to the function that it is to calculate.
We denote as Xi, i=1, . . . , nc the layer formed by the cells that can be calculated simultaneously at step i, and as X̄i=(X0, . . . , Xi) the concatenation of the layers already calculated at step i. We posit X0=(Xi)i=1M1, which is of size n0×M1 and represents the state of the input layer (in other words we apply the neural network to all the data of the training database at once). We posit Y=(Yi)i=1M1, the target values corresponding to the input X0.
By denoting the number of layers of the neural network as nc, and by associating with layer i a number ni of processing nodes, we associate a matrix of connection weights Wi of size ni+1×Σj≤inj with each layer. The matrix Wi is very sparse. Most of its columns are zero, and those that are not zero contain many zeros. The set of connection weights of the entire neural network is then W=(W0, . . . , Wnc-1). Abusing the terminology, we will call this object a matrix.
The neural network then carries out the following calculations (hereinafter described as "the calculation algorithm") on the input data X0, where X̄i=(X0, . . . , Xi) denotes the concatenation of all layers computed up to step i:

X̄0=X0
For i=1 to nc,
Xi=ƒSI(Wi−1·X̄i−1)
X̄i=(X̄i−1, Xi);
End
where the function ƒSI is the Identity function for the output processing nodes and the sigmoid function ƒ(x)=1/(1+e−x) for the other processing nodes. Let us assume, for example, that the last row of X0 is formed of 1s. This means that the last cell in the input layer is a bias cell. In conventional architectures, each layer other than the output layer has a bias cell; in the architecture according to this disclosure, only the input layer has a bias cell, and cells in other layers can connect directly to it.
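The calculation algorithm can be sketched as follows (the array layout and the per-layer output mask are our own conventions for marking output nodes, not part of the disclosure):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(Ws, X0, output_mask):
    """Calculation algorithm: each W_i maps the concatenation of ALL
    previously computed layers (size sum of n_j for j <= i) to layer i+1.
    output_mask marks, per layer, the output cells (identity activation)."""
    Xbar = X0                                    # running concatenation of layers
    for Wi, is_out in zip(Ws, output_mask):
        Zi = Wi @ Xbar
        Xi = np.where(is_out, Zi, sigmoid(Zi))   # identity on output nodes only
        Xbar = np.concatenate([Xbar, Xi])
    return Xbar

# tiny illustration: 2 inputs plus a bias cell, one hidden cell, one output cell
X0 = np.array([0.5, -0.5, 1.0])                  # last entry: the single bias cell
W0 = np.array([[1.0, 1.0, 0.0]])                 # hidden cell reads both inputs
W1 = np.array([[0.0, 0.0, 2.0, 1.0]])            # output reads the bias AND the hidden cell
Xbar = forward([W0, W1], X0, [np.array([False]), np.array([True])])
```

Note that W1 has a column for every previously computed cell, including the input layer, which is how a cell connects directly to the bias cell or to any earlier layer.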
The neural network's error function J is then written:
J=∥OX̄nc−Y∥2

where X̄nc=(X0, . . . , Xnc) gathers all the cells of the network and O is the observation matrix making it possible to extract the output elements of X̄nc. Indeed, the number of cells of the last layer, denoted nnc, is less than or equal to the number of outputs of the neural network, nO; it is for this reason that the observation operator is applied to X̄nc, in other words to all cells in the network.
The topology Γ of the neural network is defined by the incidence matrices of the computation graph Γ=(M0, . . . , Mnc−1), where Mi is an incidence matrix which has the same size as Wi which is equal to 1 for the non-zero coefficients of Wi and zero elsewhere.
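For concreteness, an incidence matrix can be read directly off the corresponding weight matrix; a minimal sketch (array values are arbitrary):

```python
import numpy as np

# The topology Γ is just the sparsity pattern of the weights:
# M_i has a 1 wherever W_i is non-zero, and 0 elsewhere.
W0 = np.array([[0.7, 0.0, -1.2],
               [0.0, 0.0,  0.3]])
M0 = (W0 != 0).astype(int)
```

Conversely, multiplying a weight matrix elementwise by its incidence matrix restricts it to the topology, which is the restriction operation used in the subtractive phase.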
Returning to
The initialization step also comprises a determination of the optimal connection weights W1*, i.e., connection weights minimizing the error function J for the fixed initial topology Γ1, denoted J(Γ1, W1*). This determination is made by training the neural network on the training data.
Gradient backpropagation can be used for this purpose, but the quantitative and deep phenomena require the use of the zero-memory Gauss-Newton method described in
The zero-memory Gauss-Newton method combines gradient backpropagation with a method of gradient forward propagation. It makes it possible to improve local convergence considerably.
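The zero-memory variant itself is not detailed here; as a sketch of the underlying Gauss-Newton iteration on a generic least-squares problem (the toy residual, Jacobian, and parameters are our own illustration, not the patent's method):

```python
import numpy as np

def gauss_newton(w, residual, jac, iters=10):
    """Generic Gauss-Newton iteration for min_w ||r(w)||²:
    w <- w - (JᵀJ)⁻¹ Jᵀ r, with the step obtained in the least-squares sense."""
    for _ in range(iters):
        r, Jm = residual(w), jac(w)
        step, *_ = np.linalg.lstsq(Jm, r, rcond=None)   # solves Jm @ step ≈ r
        w = w - step
    return w

# toy nonlinear fit: r(w) = exp(w0 * t) + w1 - y, exact solution (0.5, 2.0)
t = np.linspace(0.0, 1.0, 20)
y = np.exp(0.5 * t) + 2.0
residual = lambda w: np.exp(w[0] * t) + w[1] - y
jac = lambda w: np.column_stack([t * np.exp(w[0] * t), np.ones_like(t)])
w = gauss_newton(np.array([0.0, 0.0]), residual, jac)
```

On zero-residual problems such as this one the iteration converges quadratically near the solution, which is the rapid local convergence exploited by the disclosure, in contrast to the slow local convergence of (stochastic) gradient descent.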
The method then comprises at least one phase of topological optimization 200 of the neural network, determined so as to reduce the error J of the network.
The topological optimization phase may comprise:
In addition, each topology modification 210, additive or subtractive, comprises a selection 212 among a plurality of candidate topological modifications, based on an estimation 211 of the variation in the network error between each topology modified according to a candidate modification and the previous topology, the selected topological modification being the one which optimizes the variation of the error relative to the preceding topology, the goal being to maximize the reduction of the error at each iteration. As we will see, however, subtractive topology modifications can, in a given iteration, induce an increase in the error J on the training data while still making it possible to improve the accuracy of the network by reducing its error on the validation data.
It remains to define the choice of candidate topological modifications. In the case of a subtractive phase, all nodes and links are candidates in turn for a topological modification.
In an additive phase, one can connect, by a link, two nodes which do not belong to the same layer and which are not already connected. Nodes can be added to any layer other than the input and output layers of the network. A new layer can also be created by inserting a node between two successive layers. A created node must be connected to the network by at least two links: at least one input link and at least one output link. The choice of which links to add can be made randomly; in an additive phase, if the network is large, one can choose a thousand candidate topological modifications taken at random. The estimate of the variation is calculated for these candidate perturbations. The best perturbations:
The variation in network error between a modified topology (candidate for iteration n) and the previous topology (iteration n−1) is measured with the optimal connection weights for each topology considered, meaning that it is written:
J(Γn,Wn*)−J(Γn−1,Wn−1*)
where Γn is the topology modified according to the candidate modification at iteration n, and Wn* is the matrix of optimal connection weights for this topology.
However, the computation of a matrix of optimal connection weights for a given topology is very long, and it is not easy to calculate this error variation for all candidate topological modifications considered.
We will therefore describe how to estimate this error variation rather than how to calculate it.
According to a first embodiment, for an additive phase, the connection weights Wn of the modified topology are initialized by:
This initialization does not worsen the error: we have J(Γn, Wn)=J(Γn−1, Wn−1*).
Then a few training iterations are carried out in order to improve Wn, and the variation of the error is estimated by: J(Γn, Wn)−J(Γn−1, Wn−1*), which is necessarily negative or zero. The purpose of the additive phase is to carry out learning.
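The fact that new links initialized at zero leave the output, and hence the error, unchanged can be seen on a toy pattern (the two-layer network and the added skip link are our own illustration):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# toy pattern before the additive phase: output = w_out · sigmoid(W_h @ X)
X = np.array([0.3, -0.7])
W_h = np.array([[1.0, -1.0]])
w_out = np.array([2.0])
y_before = w_out @ sigmoid(W_h @ X)

# additive phase: create a direct input -> output link, initialized at weight 0
w_skip = np.array([0.0, 0.0])            # new links always start at zero weight
y_after = w_out @ sigmoid(W_h @ X) + w_skip @ X
```

Training then only has to improve on this starting point, which is why the estimated variation for an additive modification is necessarily negative or zero.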
In the case of a subtractive phase, the connection weights Wn of the modified topology are initialized by Wn=Wn−1*|Γn, the restriction of Wn−1* to the links retained in Γn; several training iterations can then be performed in order to improve Wn.
The estimate of the error variation is then also: J(Γn, Wn)−J(Γn−1, Wn−1*).
This variation is necessarily positive or zero. Otherwise Wn−1* is not optimal. Indeed, matrix Wn would offer a better solution by setting the removed links to zero. This phase, which only increases the error, has the purpose of ensuring generalization: the prediction ability of the neural network with data that are not part of the training set. As the error function J increases, the average error on the validation data tends to decrease.
According to a more advantageous variant embodiment, the estimation of the error between a modified topology and the previous topology is carried out on the basis of the Lagrange operator, or Lagrangian, applied to the internal variables of the neural network that are the layers of the network X=(X0, . . . , Xnc), which is written:
ℒ(Γ,W,X,Λ)=J(Γ,W)+Σi tr(ΛiT(Xi−ƒSI(Wi−1·X̄i−1))) (5), with X̄i−1=(X0, . . . , Xi−1),
where Λ=(Λi), Λi being the Lagrange multiplier associated with the equation defining Xi. The multiplier Λi has the same size as Xi. The function tr is the trace, meaning the sum of the diagonal terms of a matrix. According to the calculation algorithm described above for the neural network, if W and X0 are known, it is possible to construct all the Xi and then all the Λi. The Λi are well-defined and are obtained by solving the equations:
∂Xiℒ(Γ,W,X,Λ)=0, i=0, . . . , nc.
We refer to the Appendix at the end of the description for the solving of these equations.
However, we can see that for any given W, if X is obtained by the calculation algorithm described above, then the terms under the summation symbol of equation (5) cancel each other out and we obtain the following equality:
J(Γ,W)=ℒ(Γ,W,XW,Λ) (6)
Thus, for any W, we have equality between the error of the neural network and the Lagrangian applied to it. From this we can deduce:
dWJ(Γ,W)δW=dWℒ(Γ,W,XW,Λ)δW (8)
where dW is the total derivative with respect to W and δW is the variation of W. Since J only depends on W via X, the total derivative is written:
dWJ(Γ,W)δW=∂WJ(Γ,W)δW+∂XJ(Γ,W)∂WXδW=2(OXnc−Y)∂WXδW. (9)
Here the total derivative dW takes into account ∂W, the partial derivative with respect to W, and takes into account the variation via the variable X. This expression is unusable because of the cost of computing ∂WX. According to equality (6), this derivative of J can also be calculated explicitly without having to calculate ∂WX. Indeed, by construction of Λ we have ∂ℒ/∂X=0, and therefore we obtain the following formula:
dWJ(Γ,W)δW=∂Wℒ(Γ,W,XW,Λ)δW (11)
The Λi are chosen so that the variation of the Lagrangian with respect to the Xi is zero: the Lagrangian behaves as if the variables Xi had been locally eliminated. It follows that, for any fixed W0, we can calculate XW0 and the associated multipliers Λ, and then use, for W close to W0, the approximation:
J(Γ,W)≈ℒ(Γ,W,XW0,Λ) (12)
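The equality (6) underlying this approximation can be checked numerically on a toy network (an illustrative sketch of our own, not the patent's code): when the layer is produced by the forward pass itself, the penalty term of the Lagrangian (5) vanishes, so the Lagrangian equals J for any choice of multiplier.

```python
import math

# Toy scalar network: X1 = tanh(w * X0), error J = (X1 - Y)^2.
# Scalar version of (5): L = J + lam * (X1 - f(w * X0)).

def forward(w, x0):
    return math.tanh(w * x0)

def J(w, x0, y):
    return (forward(w, x0) - y) ** 2

def lagrangian(w, x0, x1, lam, y):
    return (x1 - y) ** 2 + lam * (x1 - forward(w, x0))

w, x0, y = 0.7, 1.3, 0.5
x1 = forward(w, x0)                    # X computed by the network itself
for lam in (-3.0, 0.0, 42.0):          # equality (6) holds for any multiplier
    assert lagrangian(w, x0, x1, lam, y) == J(w, x0, y)
```

Since the penalty term is identically zero on the forward-pass trajectory, the multipliers remain free, and they can then be chosen to cancel the variation with respect to X, as described above.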
This result is advantageously transposed to the selection of a candidate topological modification which minimizes the error function. Indeed, for a subtractive topological modification in iteration n, the variation of the network error between a topology Γn resulting from a candidate modification and the previous topology Γn−1 can be estimated by calculating the quantity:
ℒ(Γn,Wn,X,Λ)−J(Γn−1,Wn−1*), (13)
where Wn=Wn−1*|Γn is the restriction of matrix Wn−1* to the connections retained in topology Γn.
In the case of an additive topological modification, the variation of the network error between a calculated topology and the previous topology is estimated by calculating the quantity:
ℒ(Γn,Wn,X,Λ)−J(Γn−1,Wn−1*) (14)
where Wn is a matrix of connection weights of the network after the candidate topological modification in iteration n, said matrix being initialized with the same connection weights as matrix Wn−1* for the same connections and a zero weight for each link created during the additive phase. With this initialization, the variation given by (14) is equal to zero. To estimate the potential variation after a learning phase, it is sufficient to minimize the Lagrangian with respect to the newly created links only. This is a form of application of the Pontryagin principle.
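This additive behavior can be sketched on the same toy linear model as before (an illustration of ours, not the patent's code): a new link enters with weight zero, so the error variation (14) is exactly zero at initialization, and optimizing the new weight alone can only lower the error.

```python
# Sketch: additive phase on J(w) = sum((w.x - y)^2).

def J(w, data):
    return sum((sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2 for x, y in data)

data = [([1.0, 1.0], 3.0), ([2.0, -1.0], 3.0), ([0.5, 3.0], 4.0)]

w_old = [2.0]                              # trained network using only input 0
data_old = [([x[0]], y) for x, y in data]
w_new0 = [w_old[0], 0.0]                   # added link initialized to zero

assert J(w_new0, data) == J(w_old, data_old)   # variation (14) is zero

# closed-form 1-D minimization over the new weight only
r = [y - w_old[0] * x[0] for x, y in data]     # residuals of the old network
z = [x[1] for x, _ in data]
v = sum(ri * zi for ri, zi in zip(r, z)) / sum(zi * zi for zi in z)
w_new = [w_old[0], v]

assert J(w_new, data) <= J(w_new0, data)       # the new link can only help
```

On this toy data the new link fits the residual exactly, driving the error to zero; in general the minimization over the created links alone gives the potential gain of the candidate addition.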
The error variation estimates (13) and (14) can be improved by updating Wn.
Returning to
The additive phases are implemented to lower the error value J on the training data. The subtractive phases are implemented if the error on the training data becomes less than the error on the validation data beyond a certain limit. This means in effect that the neural network has overfitted, which leads it to give wrong answers for data it has not learned (the validation data).
Finally, the topological optimization iterations stop when no topology modification leads to an improvement in the precision of the network, in other words when it no longer reduces the errors on the validation data or the training data after optimization of the connection weights.
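The phase-selection rule described above can be sketched as a small helper (illustrative thresholds and names of our own, not taken from the patent): prune when the validation error drifts above the training error beyond a margin, otherwise keep adding links.

```python
def next_phase(err_train, err_valid, overfit_margin=0.1):
    """Choose the next topological phase (hypothetical policy helper)."""
    if err_valid - err_train > overfit_margin:
        return "subtractive"   # overfitting detected: prune links
    return "additive"          # otherwise keep lowering the training error

assert next_phase(0.05, 0.30) == "subtractive"
assert next_phase(0.20, 0.22) == "additive"
```

The stopping criterion would then be a third branch: when neither phase improves the validation or training error after weight optimization, the iterations stop.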
For each topological optimization phase 200, once a topological modification has been selected, the method comprises the updating 213 of the matrix of connection weights of the network by a descent method of the gradient backpropagation type:
Wn←Wn−ρ∇J(Wn) (15)
where ρ is the learning rate. One can also use the zero memory Gauss-Newton method.
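Update rule (15) can be sketched on a toy scalar error (an illustration of ours, not the patent's code): with J(w)=(w−3)², whose gradient is 2(w−3), the iteration converges to the minimizer.

```python
# Minimal sketch of update (15): W <- W - rho * grad J(W).

def grad_J(w):
    return 2.0 * (w - 3.0)   # gradient of J(w) = (w - 3)^2

w, rho = 0.0, 0.1
for _ in range(100):
    w = w - rho * grad_J(w)

assert abs(w - 3.0) < 1e-6   # converged to the minimum of J
```

In the parsimonious setting described below, each such descent runs after a topological modification, which is why a fast-converging variant (such as the zero memory Gauss-Newton method mentioned above) matters.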
If we compare this approach with that of the prior art, we see that learning occurs after each topological modification; we therefore need a fast convergence algorithm. The prior art relies on redundancy to avoid local minima. In a parsimonious context, local minima are present, but the addition of new degrees of freedom locally modifies the error function J.
Represented in
One will observe that the neural network provided by the prior art software is organized by layers, each layer communicating only with the adjacent layers, and this neural network comprises 22,420 links. The network obtained by applying the above method comprises 291 links and the layers which are visible are only the graphic visualization of the processing nodes which can be calculated simultaneously. One will note that the processing nodes of a layer can communicate with the nodes of all previous layers.
Simulation Method
Once the neural network has been obtained and trained on the database (Xi, Yi)i=1 . . . M, it can then be applied to new data, denoted (Xi)i∈S, which may be theoretical data or data captured by one or more sensors on the physical system to be simulated, in order to generate results (Yi)i∈S. S represents the set of data for the simulation; it is therefore disjoint from the set of training and validation data indexed from 1 to M.
Typically, the (Xi)i∈S data are representative of certain quantities characterizing the state of the real system, these data being measurable, and the (Yi)i∈S data can be representative of other quantities characterizing the state of the physical system, data that may be more difficult to measure, hence the need to simulate them. The (Xi)i∈S data can include control data or actuator state data; the goal of the simulation can then be to determine the choice of (Xi)i∈S which produces the best response (Yi)i∈S from the system.
We can consider many possible applications, such as:
For these three examples, a simulation of each system was made by means of a neural network according to the above description, compared to a simulation by means of the prior art software already compared in the previous section.
In this comparison, the neural network according to the present disclosure is executed only once on each test case. Conversely, the prior art software requires specifying the number of layers, the number of cells per layer, and the weights of the links between the cells, so 50 trial-and-error tests were carried out with this prior art software. [Table 1] below shows the mean of the error, the standard deviation of the error, and the best error obtained; note that the error obtained by the neural network described above is always less than the best error obtained by the prior art software.
Another comparison can be made between the performance of the present disclosure applied to modeling a complex phenomenon involving fluid-structure interactions in the automotive field, and the performance obtained by a major player in the digital field by exploiting a solution available for purchase. The neural network obtained by the present disclosure for this application is shown in
Compression
The method described above for constructing a neural network can also be used for data compression.
In this regard, and with reference to
The construction of the compression neural network comprises a step 100 of initialization of a neural network which comprises:
The method then comprises a learning step 101 to train this initial neural network on the training database, then a subtractive phase 102 in accordance with a subtractive phase of the topological optimization step described above, to reduce the size of the hidden layer without adversely affecting the learning. We denote as Xi′ the compression of Xi at the hidden layer.
The method then comprises a step of subdivision 103 of the hidden layer into three layers of the same size, and a reiteration of the learning step 101 on the sub-network formed, and of the subtractive step 102 on the new central layer.
A compression block C which is formed by all the layers between the input layer and the central layer, and a decompression block D which is formed by all the layers between the central layer and the output layer, are then defined, and the topological optimization step 200 is implemented separately for each block.
The method then comprises a learning step 300 on the entire network thus formed. Steps 103 to 300 can then be iterated until it becomes impossible to reduce the size of the compressed vector without significantly aggravating the decompression error.
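The layer-subdivision loop of steps 101 to 103 can be sketched structurally (layer sizes only; the "subtractive phase" is replaced here by a hypothetical halving stand-in of our own, since the real pruning depends on the data): the central layer is repeatedly split into three copies and its middle shrunk, until it cannot shrink further.

```python
# Structural sketch of the compression construction (steps 101-103).
# "Pruning" is a stand-in that halves the central layer; we stop when the
# central layer would fall below a floor (the irreducible code size).

def build_compressor(n_in, floor=2):
    sizes = [n_in, n_in // 2, n_in]                      # initial in/hidden/out
    while sizes[len(sizes) // 2] // 2 >= floor:
        mid = len(sizes) // 2
        h = sizes[mid]
        sizes = sizes[:mid] + [h, h, h] + sizes[mid + 1:]  # step 103: subdivide
        sizes[len(sizes) // 2] = h // 2                    # step 102: prune (stub)
    return sizes

sizes = build_compressor(32)
# symmetric funnel: compression block on the left of the narrowest layer,
# decompression block on the right
assert sizes == [32, 16, 8, 4, 2, 4, 8, 16, 32]
```

The compression block C then corresponds to the layers left of the narrowest layer and the decompression block D to those right of it, each refined by topological optimization 200 as described above.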
The compression ratio obtained makes it possible to describe very complex structures with only a few variables. To illustrate the power of these nonlinear compression methods, we can give an example where Xi=ei, the ith element of the canonical basis. No compression is possible by classical linear methods. But we can see that the vectors Xi are parameterized by a single variable, the index i.
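The canonical-basis example can be made concrete with a hand-built nonlinear codec (our own illustration: the functions below are written by hand, not learned as in the method above): the index i is a perfect one-dimensional code for ei, even though no linear method can compress these vectors.

```python
# Hand-built nonlinear codec for X_i = e_i (canonical basis vectors).

def C(x):
    """'Compression': the index of the 1 is a 1-dimensional code."""
    return x.index(1.0)

def D(code, n):
    """'Decompression': rebuild the canonical vector from its index."""
    return [1.0 if k == code else 0.0 for k in range(n)]

n = 8
for i in range(n):
    e_i = [1.0 if k == i else 0.0 for k in range(n)]
    assert D(C(e_i), n) == e_i   # exact round trip through a single scalar
```

A linear projection to one dimension cannot separate the n basis vectors, but this nonlinear code does, which is the point of the example.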
Advantageously, the compression block and/or the decompression block thus created can be used to model a real system whose inputs and/or outputs are of large dimensionality.
In the case of an input of large dimensionality, we can for example insert a modeling block just after the compression block, to obtain a neural network comprising:
Here the decompression block only serves to ensure that the xi indeed represent the Xi by ensuring that Xi≈D(xi). In this case, the method of construction advantageously comprises at least one additional learning phase at fixed topology on the entire network f·C. This allows the decompression to be corrected according to the application, i.e., modeling. Indeed, the compression process ignores the goal of reaching Yi.
We can take the example of a system that models the risk of developing a disease, based on the genetic characteristics of an individual. The input data to the network can have hundreds of thousands of inputs, while the output is reduced to a single scalar. The best results obtained in this field are based on the process given above.
Outputs of large dimensionality result in a high compression ratio. This phenomenon can be explained by the cause-and-effect link that ties the Xi to the Yi. For example, we can insert a modeling block just before the decompression block, to obtain a neural network comprising:
It is advantageously possible to carry out a final training at fixed topology of the global network D·ƒ.
In the experimental approach, in particular for simulated experiments, the Xi can be of very large dimension and, by their very construction, non-compressible. The Yi, on the other hand, are generally compressible: the solving of partial differential equations has a regularizing effect. The act of constructing the model Yi=f(Xi) shows that ultimately, in a certain sense, the Xi are compressible: their effect on the Yi is compressible.
Dynamic System
The method of construction of a neural network can also be used for modeling a dynamic physical system, in which one seeks to determine a future state of a physical system based on information about its current state.
In this regard, a neural network is constructed comprising a compression block, a modeling block, and a decompression block, in which at least the compression block and the decompression block are neural networks constructed according to the method described above, using training and validation databases comprising pairs of the form (Xi, Xi)i=1 . . . M.
Here, each Xi represents the state of the system at successive times. If (zi)i=−p, . . . ,M represents the instantaneous state of the studied system, then
The bias is added to the data, for reasons explained above. In methods such as the ARMA method or NARX-type recurrent networks, the next step depends on the previous p+1 steps. The use of this technique improves the stability of the model, but it also increases the size of the model and reduces its capacity for generalization.
Compression of Xi makes it possible to reduce the size of the recurrent pattern, while increasing p to ensure better stability.
This compression has the advantage of filtering out the noise from Xi, which is essential in the context of measured data.
For modeling a dynamic physical system, with reference to
Xi+1=F(Xi,Pi)+Gi, i≥0 (16)
where Gi corresponds to one or more excitations representing the environment of the simulated system and Pi describes the internal state of the system.
The system is only known through a few measurements made over time:
χ=(X0,X1, . . . ,XM), G=(G0,G1, . . . ,GM) and P=(P0,P1, . . . ,PM).
The modeling block is advantageously a neural network suitable for reproducing a model of the form:
xi+1=h{circumflex over (Γ)},Ŵ(xi,pi)+gi, i≥0
x0=CX(X0) (17)
where:
In one embodiment, schematically shown in
The determination of h{circumflex over (Γ)},Ŵ is then done by solving the following optimization problem:
The minimization with respect to {circumflex over (Γ)} is advantageously carried out by the topological optimization step 200 described above, and for fixed {circumflex over (Γ)}, a zero memory Gauss-Newton technique is used to estimate W.
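The recursion (17) can be sketched with hypothetical stand-ins for the learned blocks (the functions C, h, and D below are trivial placeholders of our own, not the trained networks): the loop iterates entirely in the compressed space and decompresses only the resulting trajectory.

```python
# Stand-ins for the learned blocks (illustrative only):
def C(X):
    return 0.5 * X            # compression CX (here: a trivial scaling)

def h(x, p):
    return 0.9 * x + p        # recurrent pattern h

def D(x):
    return 2.0 * x            # decompression

def rollout(X0, P, G):
    """Iterate (17) in the compressed space, decompress the trajectory."""
    x = C(X0)                 # x0 = CX(X0)
    xs = [x]
    for p, g in zip(P, G):
        x = h(x, p) + g       # x_{i+1} = h(x_i, p_i) + g_i
        xs.append(x)
    return [D(v) for v in xs]

traj = rollout(1.0, [0.1, 0.1, 0.1], [0.0, 0.0, 0.0])
```

Keeping the loop in the compressed space is what makes the recurrent pattern small: only the code, not the full state, is fed back at each step.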
Otherwise, in the case where the number of parameters for P and G is higher, these parameters are also compressed to obtain
pi=CP(Pi)
gi=CG(Gi)
where:
This is compression induced by that of the Xi. Although Pi and Gi do not easily lend themselves to compression, their effect on the dynamic system is compressible.
This embodiment is schematically shown
The minimization with respect to {circumflex over (Γ)} is performed by the topological optimization step 200 described above, and for fixed {circumflex over (Γ)}, a zero memory Gauss-Newton technique is used to estimate W, CP and CG.
In this method, the recursive loop does not occur in the real space of the Xi but in the space of the compressed data. This compression reduces noise in the data and ensures better stability of the dynamic model, while reducing computation times in the training and simulation phases. Regardless of the method used to initialize W and possibly to update it, the number of topological modifications to be tested can increase very quickly with the size of the neural network. To limit the number of computations, we can randomly choose the configurations to be tested and retain only the one that gives the best estimated error reduction.
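The random candidate screening can be sketched as follows (an illustration of ours: the candidate set and the cost function are toy stand-ins for the real topological modifications and the error-variation estimates (13)/(14)):

```python
import random

def best_candidate(candidates, estimate, n_tests, seed=0):
    """Sample n_tests candidate modifications, keep the best estimate."""
    rng = random.Random(seed)
    sample = rng.sample(candidates, min(n_tests, len(candidates)))
    return min(sample, key=estimate)

# toy: candidates are links, estimate() scores the error variation of each
cands = list(range(100))

def est(c):
    return abs(c - 37)        # pretend candidate 37 gives the best reduction

pick = best_candidate(cands, est, 20, seed=1)   # screen only 20 of 100
assert pick in cands
assert best_candidate(cands, est, 100) == 37    # exhaustive screen finds 37
```

Screening a random subset trades a slightly worse pick for a large reduction in the number of estimates evaluated per iteration.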
By way of illustration, an example of a possible application that is in no way limiting is that of modeling the melting of a solid sodium block.
Referring to
Three experiments are carried out. During each experiment, the resistor is respectively powered by one of the three power profiles shown in
The response of this system is represented by nine temperature sensors 2, which provide only the value 0 if the temperature does not exceed the sodium melting point, and 1 if it does.
If we denote as zi the vector formed by the nine measurements at a time i, then Xi represents the state of the system at the successive times i and i−1:
A “digital twin” of this dynamic system is established based on data measured during the first experiment with the first power profile, and according to the method of simulation of a dynamic system described above, by first performing a compression of the Xi.
The compression results in a neural network comprising 18 inputs (two for each of the nine sensors) and 18 outputs. With reference to
A dynamic modeling block in the form of a recurrent neural network, of which the pattern is shown in
With reference to
We will notice from these figures that the position of the sodium melting front depends significantly on the excitation, and that the constructed model succeeds in predicting this position in the validation cases, which are those of
The derivative of the sum being equal to the sum of the derivatives, we establish the result for a single training datum: M=1.
This gives, for i=nc: 2(O Xnc−Y,O ϕ)=tr(ΛncTϕ), ∀ϕ,
Here (·,·) indicates the scalar product in the corresponding Euclidean space.
It follows that Λnc=2(O Xnc−Y)TO.
And we obtain, for i=nc−1, nc−2, . . . , 0: ΛiTϕ−Σj>iΛjT(ƒSI′(Wj−1i*Xi).*(Wj−1i*ϕ))=0, ∀ϕ, where Wj−1i represents the submatrix of Wj−1 which acts on the components of Xi. The notation .* designates the component-by-component product of two matrices of the same size.
By having ϕ pass through the elements of the canonical basis, we obtain the Λi, which can be written in the form Λi=Σj>i(diag(ƒSI′(Wj−1i*Xi))*(Wj−1i)T)Λj, for i=nc−1, . . . , 0, where diag(x) designates the diagonal matrix whose diagonal terms are the elements of vector x.
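The backward recursion for the multipliers can be checked numerically on a scalar chain network (a simplified sketch of ours: each layer connects only to the previous one and the observation operator O is taken as the identity), by comparing each Λi against a finite-difference derivative of J with respect to Xi.

```python
import math

def forward_from(x, ws, i):
    """Propagate a state at layer i to the output through weights ws[i:]."""
    for w in ws[i:]:
        x = math.tanh(w * x)
    return x

ws = [0.8, -1.2, 0.5]
y = 0.3

# forward pass: x_{i+1} = tanh(w_i * x_i)
xs = [1.0]
for w in ws:
    xs.append(math.tanh(w * xs[-1]))

# adjoint pass: scalar version of the recursion above (O = identity)
lam = [0.0] * len(xs)
lam[-1] = 2.0 * (xs[-1] - y)                    # Lambda_nc = 2 (X_nc - Y)
for i in range(len(ws) - 1, -1, -1):
    d = 1.0 - math.tanh(ws[i] * xs[i]) ** 2     # tanh'
    lam[i] = d * ws[i] * lam[i + 1]

# check Lambda_i == dJ/dX_i by central finite differences
eps = 1e-6
for i in range(len(xs)):
    jp = (forward_from(xs[i] + eps, ws, i) - y) ** 2
    jm = (forward_from(xs[i] - eps, ws, i) - y) ** 2
    assert abs((jp - jm) / (2 * eps) - lam[i]) < 1e-5
```

The multipliers thus coincide with the sensitivities of the error to each layer, which is what makes the Lagrangian-based estimates of the previous section cheap to evaluate.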
The various embodiments described above can be combined to provide further embodiments. These and other changes can be made to the embodiments in light of the above-detailed description. In general, in the following claims, the terms used should not be construed to limit the claims to the specific embodiments disclosed in the specification and the claims, but should be construed to include all possible embodiments along with the full scope of equivalents to which such claims are entitled. Accordingly, the claims are not limited by the disclosure.
Number | Date | Country | Kind |
---|---|---|---|
1860383 | Nov 2018 | FR | national |
1900572 | Jan 2019 | FR | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/FR2019/052649 | 11/7/2019 | WO |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2020/094995 | 5/14/2020 | WO | A |
Number | Name | Date | Kind |
---|---|---|---|
5636326 | Stork | Jun 1997 | A |
6038389 | Rahon | Mar 2000 | A |
7814038 | Repici | Oct 2010 | B1 |
9336483 | Abeysooriya | May 2016 | B1 |
10614354 | Aydonat | Apr 2020 | B2 |
11158096 | Schubert | Oct 2021 | B1 |
11544539 | Zhang | Jan 2023 | B2 |
11604957 | Schubert | Mar 2023 | B1 |
11616471 | Narayanaswamy | Mar 2023 | B2 |
20140067741 | Katayama | Mar 2014 | A1 |
20150106311 | Birdwell | Apr 2015 | A1 |
20180307983 | Srinivasa | Oct 2018 | A1 |
20180322388 | O'Shea | Nov 2018 | A1 |
20190080240 | Andoni | Mar 2019 | A1 |
20190130246 | Katayama | May 2019 | A1 |
20190130278 | Karras | May 2019 | A1 |
20190171936 | Karras | Jun 2019 | A1 |
20200167659 | Moon | May 2020 | A1 |
20200364545 | Shattil | Nov 2020 | A1 |
20210279553 | Zhu | Sep 2021 | A1 |
20210357555 | Liu | Nov 2021 | A1 |
20210390396 | Fan | Dec 2021 | A1 |
20210397770 | Bompard | Dec 2021 | A1 |
20220092240 | Chi | Mar 2022 | A1 |
Entry |
---|
Arifovic et al. (Using genetic algorithms to select architecture of a feedforward artificial neural network, Physica A 289 (2001) 574-594) (Year: 2001). |
Gao et al. ("Orthogonal Least Squares Algorithm for Training Cascade Neural Networks", IEEE, 2012, pp. 2629-2637) (Year: 2012). |
Ma et al. (A new strategy for adaptively constructing multilayer feedforward neural networks, Neurocomputing 51 (2003) 361-385) (Year: 2003). |
Zhang et al. ("Finding Better Topologies for Deep Convolutional Neural Networks by Evolution", 2018, arXiv, pp. 1-10) (Year: 2018). |
Huang et al., “Orthogonal Least Squares Algorithm for Training Cascade Neural Networks,” IEEE Transactions on Circuits and Systems—I: Regular Papers 59(11):2629-2637, Nov. 2012. |
Zhang et al., “Finding Better Topologies for Deep Convolutional Neural Networks by Evolution,” arXiv:1809.03242v1, Sep. 10, 2018, 10 pages. |
“ADAGOS,” Sep. 20, 2018, extracted from the internet on Jan. 13, 2020 from https://web.archive.org/web/20180920073216/https://www.adagos.com/, 3 pages. |
Attik et al., “Neural network topology optimization,” ICANN '05: Proceedings of the 15th International Conference on Artificial Neural Networks: Formal Models and their Applications—vol. Part II, Sep. 2005, pp. 53-58. (Abstract only). |
Balabin et al., “Support vector machine regression (SVR/LS-SVM)—an alternative to neural networks (ANN) for analytical chemistry? Comparison of nonlinear methods on near infrared (NIR) spectroscopy data,” Analysis 136:1703-1712, 2011. |
Bottou, “Large-Scale Machine Learning with Stochastic Gradient Descent,” Proceedings of Compstat 2010, 10 pages. |
Carlberg et al., “The GNAT method for nonlinear model reduction: effective implementation and application to computational fluid dynamics and turbulent flows,” Preprint submitted to Journal of Computational Physics, Oct. 29, 2018, 33 pages. |
Chinesta et al., “A Short Review on Model Order Reduction Based on Proper Generalized Decomposition,” Archives of Computational Methods in Engineering 18(4), 10 pages. |
Glaessgen et al., “The Digital Twin Paradigm for Future NASA and U.S. Air Force Vehicles,” Paper for the 53rd Structures, Structural Dynamics, and Materials Conference: Special Session on the Digital Twin, Apr. 2012, 14 pages. |
Lecun et al., “Deep learning,” Nature 521, May 28, 2015, pp. 436-444. |
Lophaven et al., “DACE—A Matlab Kriging Toolbox,” IMM—Informatics and Mathematical Modelling, Technical University of Denmark, Technical Report IMM-TR-2002-12, Version 2.0, Aug. 1, 2002, 28 pages. |
Mineu et al., “Topology Optimization for Artificial Neural Networks using Differential Evolution,” IJCCN 2020 International Joint Conference in Neural Networks, 8 pages. |
Nazghelichi et al., “Optimization of an artificial neural network topology using coupled response surface methodology and genetic algorithm for fluidized bed drying,” Computers and Electronics in Agriculture 75:84-91, 2011. |
Tuegel et al., “Reengineering Aircraft Structural Life Prediction Using a Digital Twin,” International Journal of Aerospace Engineering, Hindawi Publishing Corporation, vol. 2011, Article ID 154798, 14 pages. |
Number | Date | Country | |
---|---|---|---|
20210397770 A1 | Dec 2021 | US |