The invention relates to a system and computer-implemented method for training a neural network. The invention further relates to a trained neural network. The invention further relates to a system and computer-implemented method for using a trained neural network for inference, for example to control or monitor a physical system based on a state of the physical system which is inferred from sensor data. The invention further relates to a computer-readable medium comprising transitory or non-transitory data representing instructions for a processor system to perform either computer-implemented method.
Machine learned (‘trained’) models are widely used in many real-life applications, such as autonomous driving, robotics, manufacturing, building control, etc. For example, machine learnable models may be trained to infer a state of a physical system, such as an autonomous vehicle or a robot, etc., or the system's environment, e.g., the road on which the vehicle is travelling, the robot's workspace, etc., based on sensor data which is acquired by one or more sensors. Having inferred the state, the physical system may be controlled, e.g., using one or more actuators, or its operation may be monitored.
In many cases, neural networks with many layers (‘deep neural networks’) are the most successful models for a given task. However, the implementation of such deep neural networks typically requires a large amount of memory for parameters of the model, such as the weights per layer. In addition, the training itself of a deep neural network requires a large amount of memory, since in addition to the weights per layer, also a large amount of temporary data has to be stored for the forward passes (‘forward propagation’) and backward passes (‘backward propagation’) during the training. For example, the layer output of each individual layer (‘hidden state’) during forward propagation may need to be stored as temporary data as it may be used in the backward propagation. This way, the training of a deep neural network may require many gigabytes of memory, with the memory requirements being expected to further increase as the complexity of models increases. This may represent a serious bottleneck for training machine learnable models in the future, and may result in the training of machine learnable models on lower-spec (e.g., end-user) devices becoming infeasible due to the memory requirements. Such training on lower-spec devices may nevertheless be desired, for example for continual learning after deployment.
While it is known to share weights across some or all layers of a neural network, see, e.g., [1], thereby reducing the amount of data to be stored for the neural network's weights, the temporary data for the forward and backward passes typically still needs to be stored separately for each layer even if several layers have shared weights.
Another disadvantage, besides the large amount of data to be stored in memory, is that propagating through all the layers of a deep neural network during training, but in some cases also during subsequent use, may be computationally complex and thereby time consuming, resulting in lengthy training sessions and/or a high latency of the model during use. The latter may be particularly undesirable in real-time use.
It would be desirable to obtain a neural network, and a training of the neural network, which addresses at least one of the disadvantages mentioned above.
In accordance with a first aspect of the invention, a computer-implemented method and corresponding system are provided for training a neural network, as defined by claims 1 and 15, respectively. In accordance with a further aspect of the invention, a computer-implemented method is provided for using the trained neural network for inference, as defined by claim 12. In accordance with a further aspect of the invention, a computer-readable medium is provided comprising transitory or non-transitory data representing model data defining a trained neural network, as defined by claim 14. In accordance with a further aspect of the invention, as defined by claim 13, a computer-readable medium is provided comprising instructions for causing a processor system to perform the computer-implemented method of any one of claims 1 to 12.
The above measures may involve providing a neural network which comprises an iterative function (z[i+1]=ƒ(z[i], θ, c(x))). Such an iterative function is known in the field of machine learning to be representable by a stack of layers which have mutually shared weights. Namely, the iterative execution (also referred to as ‘iterative application’) of the individual layers of the stack of layers may establish the iterative function. In such a stack of layers, each layer except for the first layer may receive, as input, i) an output of the previous layer and ii) (a part of) an input to the stack of layers, being either the original input (x) to the neural network or a transformation of that input (c(x)), for example by one or more previous layers preceding the stack of layers in the neural network. The latter may also be referred to as a ‘passthrough’ from the input of the stack of layers to each individual layer, or as a ‘skip connection’ or ‘direct injection’ of this input. The first layer of the stack of layers may receive an initial activation as input, which may for example be an output of yet another layer of the neural network. By having mutually shared weights, such layers provide the same transformation in each layer (also known as ‘weight-tying’). Accordingly, the stack of layers may be executed by iteratively executing a same layer. In other words, a stack of weight-tied layers having depth L may be replaced, in the neural network and/or its training, by an iterative L-times execution of a same layer, and vice versa. Since both concepts (‘iterative execution of same layer’, ‘execution of stack of layers’) are functionally equivalent, a reference to one concept also includes the other concept, unless otherwise noted.
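By way of illustration, the following minimal PyTorch sketch (the linear layer, the tanh non-linearity and all sizes are illustrative assumptions, not features of the claimed method) executes one weight-tied layer L times, which is functionally equivalent to a stack of L layers with shared weights and input injection:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 16
layer = nn.Linear(2 * d, d)  # single layer; its weights theta are shared by all 'layers'

def f(z, cx):
    # one execution of the weight-tied layer: previous output z plus injected input c(x)
    return torch.tanh(layer(torch.cat([z, cx], dim=-1)))

cx = torch.randn(1, d)   # (transformed) network input c(x), injected into every layer
z = torch.zeros(1, d)    # initial activation z[0]
L = 30
for _ in range(L):       # iterative L-times execution == stack of depth L
    z = f(z, cx)
```

Because the same `layer` object is reused in each iteration, only one set of weights θ needs to be stored regardless of the depth L.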
While weight-tying imposes the limitation that the weights for each individual layer of the stack of layers are the same, it is nevertheless known to achieve results competitive with the state-of-the-art. Neural networks may entirely consist of such weight-tied layers but may in other embodiments also comprise a stack of such layers amongst other types of layers. Reference [1] uses such weight-tying in its neural network.
As described in the background section, during training, the iterative execution of a stack of layers still requires a sizable memory footprint, since the layer output of each individual layer (even if weight-tied) during forward propagation may need to be stored as temporary data as it may be used in the subsequent backward propagation.
The measures described in this specification replace, during training but in some embodiments also during subsequent use, the iterative execution of a same layer by the use of a numerical root-finding algorithm. Namely, the inventors have considered that the iterative execution of the same layer may result in a convergence to a fixed point after a certain number of executions, which may here and in the following also be referred to as an equilibrium point (z*), representing an equilibrium of the stack of layers (or iterative function) in which each following layer (or a further execution of the iterative function) would not substantially further change the output of the stack of layers (or of the iterative function). In other words, there may exist an equilibrium point which, when used as input to the iterative function (z* = ƒ(z*, θ, c(x))), is again provided as output of the iterative function. Such an equilibrium point may also be considered as a convergence point of the iterative function.
Instead of simply providing a stack of layers having a certain depth (i.e., executing a same layer a certain number of times), the measures described in this specification numerically determine the equilibrium point and provide the equilibrium point as a substitute output of the stack of layers, thereby effectively replacing the iterative execution of a same layer by the use of the root-finding algorithm. Indeed, the equilibrium point may be determined numerically since it is known that at such an equilibrium point, the iterative function ƒ(z*, θ, c(x)) minus its input (z*) is zero. Accordingly, a numerical root-finding algorithm may be used to find the root solution of the iterative function minus its input. For example, a root-finding algorithm may be used which is based on a computer-implementation of Newton's method or of a quasi-Newton method such as Broyden's method.
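As a toy numerical illustration (the small random contraction below is an assumption of the sketch, not part of the claimed subject-matter), the equilibrium point may be obtained by handing g(z) = ƒ(z) − z to an off-the-shelf quasi-Newton root-finder:

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)) / (4 * np.sqrt(8))  # small norm, so f is a contraction
c = rng.standard_normal(8)                          # plays the role of c(x)

def f(z):
    return np.tanh(W @ z + c)      # one weight-tied layer

def g(z):
    return f(z) - z                # the equilibrium point z* is a root of g

sol = root(g, np.zeros(8), method="broyden1")       # quasi-Newton root-finding
z_star = sol.x
assert np.allclose(f(z_star), z_star, atol=1e-4)    # z* = f(z*): a fixed point
```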
The numerical calculation of the equilibrium point may thereby be provided as a substitute to a stack of weight-tied layers, in that it may functionally correspond to such a stack of weight-tied layers but may be structurally different therefrom. As demonstrated in this specification, the replacement of the iterative application of a same layer by a numerical root-finding approach is feasible and has been found to greatly reduce the memory footprint during training while achieving similar accuracy as state-of-the-art prior art models. Thereby, the training of the neural network using the numerical root-finding approach is less memory intensive, or allows the training of deeper neural networks using a same memory footprint. Namely, the memory requirement of the root-finding algorithm is independent of the iteration depth (depth of the stack of layers). Advantageously, using the above measures, the training of machine learnable models on lower-spec (e.g., end-user) devices having limited memory may be facilitated, for example for enabling continual learning of a neural network even after deployment of the neural network. The numerical root-finding algorithm may be used as a substitute in the training of any neural network architecture, e.g., forward networks or recurrent networks, to replace a stack of weight-tied layers. In some cases, applying a root-finding algorithm may be computationally faster than iterative layer application.
It is noted that in the above and elsewhere, the term ‘equilibrium point’ and ‘fixed point’ include the point being an array or a vector of values. In addition, the root-finding algorithm may obtain an approximation of the equilibrium point, e.g., to a select degree. The ‘select degree’ may represent a convergence criterion, which may be predefined. The term ‘determining the equilibrium point’ thus includes determining an approximation thereof.
Optionally, the numerical root-finding algorithm is a computer-implementation of Newton's method or a computer-implementation of a quasi-Newton method or specifically a computer-implementation of Broyden's method. Any numerical root-finding algorithm may in principle be used, including Newton's method-based algorithms. However, the inventors have found that Broyden's method may be particularly efficient since it may avoid the computation of the exact inverse Jacobian at every intermediate Newton iteration.
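For reference, a minimal sketch of Broyden's method for this setting is given below (line search and the memory-efficient low-rank storage of practical implementations are omitted); it maintains an approximation B of the inverse Jacobian and never forms or inverts the exact Jacobian:

```python
import numpy as np

def broyden(g, z0, tol=1e-8, max_iter=100):
    """Find a root of g, keeping an approximation B of the inverse Jacobian."""
    z = z0.astype(float)
    gz = g(z)
    B = -np.eye(z.size)             # for g(z) = f(z) - z, -I is a natural start
    for _ in range(max_iter):
        if np.linalg.norm(gz) < tol:
            break
        dz = -B @ gz                # quasi-Newton step
        z_new = z + dz
        gz_new = g(z_new)
        dg = gz_new - gz
        # 'good Broyden' rank-one update of the inverse Jacobian (Sherman-Morrison)
        denom = dz @ B @ dg
        B += np.outer(dz - B @ dg, dz @ B) / denom
        z, gz = z_new, gz_new
    return z
```

Initializing B to −I makes the first step a plain fixed-point iteration z ← ƒ(z), after which the rank-one updates successively refine the inverse-Jacobian estimate.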
Optionally, performing the backward propagation part comprises: computing a gradient (δθ, δc) of a loss of the neural network at the equilibrium point (z*) with respect to the weights (θ) and the part of the input (c(x));
The derivatives which are indicated above may be implemented via their analytic equations or computed, e.g., via automatic differentiation tools. Accordingly, the backward propagation may be performed without having to store intermediate layer outputs in memory, which would otherwise be necessary for backpropagation in deep neural networks.
Optionally, computing the gradient (δθ, δc) comprises solving a linear system ((Jg|z*)^T x = −δz^T) and computing the gradient as a function of a solution of the linear system.
The inventors have found that the backpropagation of the backward gradient through the stack of layers may be replaced by solving the above linear system, which may involve one step of matrix multiplication that involves the Jacobian at equilibrium. Herein, the vector-Jacobian product may be efficiently computed via automatic differentiation tools for any x, without having to explicitly write out the Jacobian matrix. This may be a particularly efficient way of performing the backward propagation during training.
Optionally, solving the linear system comprises using a fast matrix-vector multiplication technique. Optionally, solving the linear system comprises using an instance of the numerical root-finding algorithm or another type of numerical root-finding algorithm. For example, Broyden's method may be used to solve the linear system ((Jg|z*)^T x + δz^T = 0).
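A sketch of this in PyTorch (assuming ƒ is contractive at z*, so that a plain fixed-point iteration may stand in for the quasi-Newton solver) may look as follows:

```python
import torch

def solve_backward_system(f, z_star, delta_z, tol=1e-6, max_iter=100):
    """Solve (J_g|z*)^T x = -delta_z^T for g(z) = f(z) - z, using autograd VJPs."""
    z_star = z_star.detach().requires_grad_()
    f_val = f(z_star)                  # builds a graph for one evaluation of f only
    x = torch.zeros_like(delta_z)
    for _ in range(max_iter):
        # vjp = x^T J_f, computed without ever materializing the Jacobian J_f
        (vjp,) = torch.autograd.grad(f_val, z_star, x, retain_graph=True)
        x_new = vjp + delta_z          # x^T (J_f - I) = -delta_z^T, rearranged
        if (x_new - x).norm() < tol:
            return x_new
        x = x_new
    return x
```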
Optionally, outputting the trained neural network comprises representing the stack of layers in the trained neural network by at least i) a data representation of a layer (z[i+1]=ƒ(z[i], θ, c(x))) of the stack of layers, and ii) a hyperparameter defining a number of layers of the stack of layers (z[i], i=0, 1, 2, . . . , L) at which the output of the stack of layers reaches or to a selected degree approximates the equilibrium point during forward propagation. Accordingly, the trained neural network may be output in a prior art manner, namely by defining the stack of layers and its weights and the depth of the stack of layers. The depth may be chosen as a hyperparameter so that during use of the trained neural network, the equilibrium point is reached or at least approximated to a sufficient degree.
Optionally, outputting the trained neural network comprises representing the stack of layers in the trained neural network by at least i) the mutually shared weights (θ), ii) an identifier or a data-representation of the numerical root-finding algorithm, and iii) one or more parameters for using the numerical root-finding algorithm to determine the equilibrium point. Instead of representing the iterative function by a stack of layers, the iterative function may be represented by the mutually shared weights found during training and by data which allows the equilibrium point to be determined during inference. Such data may take various forms. For example, the numerical root-finding algorithm itself may be included, e.g., as computer-readable instructions, or an identifier of the algorithm which allows the entity using the trained neural network for inference to identify the numerical root-finding algorithm to be used. In addition, parameters may be included so as to allow the entity using the trained neural network for inference to determine the equilibrium point during the forward pass. This represents an alternative to the iterative application of the same layer during inference, and may provide a higher accuracy (for example, if it is computationally infeasible to execute the same layer to a sufficiently high degree) and may in some cases be faster to compute. The latter may reduce the latency of the model during inference, and may be advantageous in applications in which a low latency is desirable, such as for example autonomous driving.
Optionally, outputting the trained neural network comprises representing the stack of layers in the trained neural network by at least i) a data representation of a layer (z[i+1]=ƒ(z[i], θ, c(x))) of the stack of layers, and ii) computer-readable instructions defining a convergence check for determining when an output obtained by an iterative execution of the layer reaches or to a selected degree approximates the equilibrium point. This represents yet another alternative to representing the iterative function by a stack of layers. Namely, the trained neural network may define one layer of the stack of layers but may additionally comprise computer-readable instructions which define a convergence check and which allow an entity using the trained neural network for inference to determine when an output obtained by an iterative execution of the layer reaches or to a selected degree approximates the equilibrium point. Accordingly, it may be ensured that the equilibrium point is approximated to a sufficient degree while avoiding unnecessary layer executions at runtime.
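A minimal sketch of such a runtime convergence check (the relative tolerance and the depth budget are illustrative choices):

```python
import torch

def run_stack_until_converged(layer, z0, cx, tol=1e-4, max_depth=200):
    """Iteratively execute one weight-tied layer until the output stops changing."""
    z = z0
    for depth in range(1, max_depth + 1):
        z_next = layer(z, cx)
        if (z_next - z).norm() <= tol * (z.norm() + 1e-9):  # convergence check
            return z_next, depth       # reached (approximate) equilibrium
        z = z_next
    return z, max_depth                # fallback: depth budget exhausted
```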
Optionally, the neural network is a feedforward neural network or a recurrent neural network. In general, the root-finding algorithm may be used in any neural network architecture to replace an iterative function represented by a stack of weight-tied layers.
Optionally, the training data is time-sequential data, and wherein the neural network is one of a group of: a Trellis network, a transformer network and a temporal convolution network. While the application to time-sequential data has been found to be advantageous, the applicability of the measures described in this specification is not limited to such type of data, but may also be used with other types of data, such as spatial data.
It will be appreciated by those skilled in the art that two or more of the above-mentioned embodiments, implementations, and/or optional aspects of the invention may be combined in any way deemed useful.
Modifications and variations of any system, any computer-implemented method or any computer-readable medium, which correspond to the described modifications and variations of another one of said entities, can be carried out by a person skilled in the art on the basis of the present description.
These and other aspects of the invention will be apparent from and elucidated further with reference to the embodiments described by way of example in the following description and with reference to the accompanying drawings.
It should be noted that the figures are purely diagrammatic and not drawn to scale. In the figures, elements which correspond to elements already described may have the same reference numerals.
The following describes, with reference to the drawings, a system 100 for training a neural network.
In some embodiments, the data storage 190 may further comprise a data representation 194 of an untrained version of the neural network which may be accessed by the system 100 from the data storage 190. It will be appreciated, however, that the training data 192 and the data representation 194 of the untrained neural network may also each be accessed from a different data storage, e.g., via a different subsystem of the data storage interface 180. Each subsystem may be of a type as is described above for the data storage interface 180. In other embodiments, the data representation 194 of the untrained neural network may be internally generated by the system 100 on the basis of design parameters for the neural network, and therefore may not explicitly be stored on the data storage 190.
The system 100 may further comprise a processor subsystem 160 which may be configured to, during operation of the system 100, provide an iterative function as a substitute for a stack of layers of the neural network to be trained. Here, respective layers of the stack of layers being substituted may have mutually shared weights and may receive as input an output of a previous layer, or for a first layer of the stack of layers, an initial activation, and a part of the input of the stack of layers. The processor subsystem 160 may be further configured to iteratively train the neural network using the training data 192. Here, an iteration of the training by the processor subsystem 160 may comprise a forward propagation part and a backward propagation part. The processor subsystem 160 may be configured to perform the forward propagation part by, amongst other operations defining the forward propagation part which may be performed, determining an equilibrium point of the iterative function at which the iterative function converges to a fixed point, wherein determining the equilibrium point comprises using a numerical root-finding algorithm to find a root solution for the iterative function minus its input, and by providing the equilibrium point as a substitute for an output of the stack of layers in the neural network. It is noted that the neural network and its training will be further described below.
The system 100 may further comprise an output interface for outputting a data representation 196 of the trained neural network, this data also being referred to as trained model data 196.
The method 200 is shown to comprise, in a step titled “PROVIDING DATA REPRESENTATION OF NEURAL NETWORK”, providing 210 a neural network, wherein the providing of the neural network comprises providing an iterative function as a substitute for a stack of layers of the neural network, wherein respective layers of the stack of layers being substituted have mutually shared weights and receive as input an output of a previous layer, or for a first layer of the stack of layers, an initial activation, and a part of the input of the stack of layers. The method 200 is further shown to comprise, in a step titled “ACCESSING TRAINING DATA”, accessing 220 training data for the neural network. The method 200 is further shown to comprise, in a step titled “ITERATIVELY TRAINING NEURAL NETWORK USING TRAINING DATA”, iteratively training 230 the neural network using the training data, which training 230 may comprise a forward propagation part and a backward propagation part. Performing the forward propagation part by the method 200 may comprise, in a step titled “DETERMINING EQUILIBRIUM POINT USING ROOT-FINDING ALGORITHM”, determining 240 an equilibrium point of the iterative function at which the iterative function converges to a fixed point, wherein determining the equilibrium point comprises using a numerical root-finding algorithm to find a root solution for the iterative function minus its input, and in a step titled “PROVIDING EQUILIBRIUM POINT AS SUBSTITUTE FOR OUTPUT OF STACK OF LAYERS”, providing 250 the equilibrium point as a substitute for an output of the stack of layers in the neural network. The method 200 may further comprise, after the training and in a step titled “OUTPUTTING TRAINED NEURAL NETWORK”, outputting 260 a trained neural network.
The following examples describe the neural network, including the training thereof in which a stack of layers is substituted by an iterative function and in which a root-finding algorithm is used to determine an equilibrium point at which the iterative function converges to a fixed point, in more detail. However, the actual implementation of the neural network and its training may be carried out in various other ways, e.g., on the basis of analogous mathematical concepts. For example, while the following describes both the forward passes and the backward passes being based on a numerical root-finding algorithm, in some embodiments, only the forward pass may be as described below while the backward pass may be performed in another manner, e.g., by backpropagation of algebra terms which are used in the root-finding algorithm so as to obtain a backpropagated algebraic expression. In other examples, Broyden's method may be replaced by a use of the so-called ‘Anderson acceleration’ technique to accelerate the convergence of a fixed point iteration and thereby to determine the equilibrium point of the iterative function. Various other embodiments are within reach of the skilled person based on this specification.
The following considers a deep neural network with hidden layers z and activations ƒ such that z[i+1] = ƒ(z[i], θi, c(x)) for i = 0, 1, 2, . . . , L, where the weights θi and the layer-independent input c(x) may both be tied across layers, i.e., θi = θ ∀i. Some of the activations ƒ may exhibit an attractor property, in that there may exist a fixed point z* such that z* = ƒ(z*, θ, c(x)) and lim i→∞ z[i] = z*.
In other words, the repeated application of ƒ for an initial activation z[0] may converge to a fixed point z*. The following describes replacing the iterated function application or the iterated function execution by the use of a numerical method, namely a numerical root-finding algorithm, to find the fixed point directly.
The forward pass, which may also be referred to as a forward propagation part of the training or simply as ‘inference’, may be briefly characterized as follows:
Input: weights θ ∈ R^n and fixed input c(x) ∈ R^k
Hyperparameters: base layer function ƒ: R^m × R^n × R^k → R^m
Algorithm:
1. Initialize memory z[0].
2. Define function g: z ↦ ƒ(z, θ, c(x)) − z.
3. Call subroutine z* = RootFind(λz.g(z), z[0]).
Output: z* ∈ R^m
RootFind may be computed via any Newton's method variant, e.g., the classic Newton-Raphson method, Broyden's method, Steffensen's method, etc.
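Transcribed to Python, the forward pass above is only a few lines; `root_find` is any solver with the signature shown, e.g., the Broyden sketch given earlier:

```python
def deq_forward(f, theta, c_x, z0, root_find):
    """Forward pass: return the equilibrium z* with z* = f(z*, theta, c(x))."""
    g = lambda z: f(z, theta, c_x) - z   # step 2: g has the equilibrium as its root
    z_star = root_find(g, z0)            # step 3: black-box root-finding subroutine
    return z_star                        # output: z* (approximate equilibrium)
```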
The backward pass, which may also be referred to as a backward propagation part or simply as ‘training’, may be briefly characterized as follows:
Input: backpropagated error δz ∈ R^m as well as z* ∈ R^m, c(x) ∈ R^k, weights θ ∈ R^n and base layer function ƒ: R^m × R^n × R^k → R^m from the forward pass.
If (Jg|z*)^−1 or an approximation thereof has already been computed in the forward pass in RootFind, (Jg|z*)^−1 or its approximation may be stored during the forward pass and used in the backward pass. For solving the linear system, any suitable method may be used, for example an indirect method that exploits fast matrix-vector products.
Broyden's method may be used to solve the linear system, as may any other Newton's method variant. In general, all derivatives indicated above may be implemented via their analytic equations or computed, e.g., via automatic differentiation tools.
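Combining the two passes, the following PyTorch sketch wraps the forward root-finding and the implicit backward pass in a custom autograd function (simplifying assumptions: a naive fixed-point iteration stands in for Broyden's method, ƒ is assumed contractive so that both solves converge, and θ and c(x) are assumed to require gradients):

```python
import torch

def fp_solve(g, z0, tol=1e-6, max_iter=200):
    # naive solver: z <- z + g(z); Broyden's method would be a drop-in replacement
    z = z0
    for _ in range(max_iter):
        gz = g(z)
        if gz.norm() < tol:
            break
        z = z + gz
    return z

class DEQFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, theta, cx, f, z0):
        with torch.no_grad():                    # no graph over the iterations
            z_star = fp_solve(lambda z: f(z, theta, cx) - z, z0)
        ctx.save_for_backward(z_star, theta, cx)
        ctx.f = f
        return z_star

    @staticmethod
    def backward(ctx, delta_z):                  # delta_z: backpropagated error
        z_star, theta, cx = ctx.saved_tensors
        f = ctx.f
        z_star = z_star.detach().requires_grad_()
        with torch.enable_grad():
            f_val = f(z_star, theta, cx)         # one layer evaluation only
            # solve u = u^T J_f + delta_z, i.e. u^T (J_g|z*) = -delta_z^T
            def residual(u):
                (vjp,) = torch.autograd.grad(f_val, z_star, u, retain_graph=True)
                return vjp + delta_z - u
            u = fp_solve(residual, torch.zeros_like(delta_z))
            # one more vector-Jacobian product yields the gradients w.r.t. theta, c(x)
            g_theta, g_cx = torch.autograd.grad(f_val, (theta, cx), u)
        return g_theta, g_cx, None, None         # no gradients for f and z0
```

A call z* = DEQFunction.apply(theta, cx, f, z0) followed by a loss on z* then backpropagates into θ and c(x) with the constant memory footprint described above.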
The following further describes the above measures within the context of the modeling of sequential data, i.e., x1:T. It will be appreciated, however, that the applicability is not limited to sequential data, but may be applied to spatial or any other type of data x as well. It is further noted that the following replaces c(x), referring to the initial input of the stack of layers which is used as a constant input to each individual layer, by the equivalent x.
As an introductory comment, it is noted that most modern feedforward deep neural networks (in the following also simply referred to as ‘networks’ or ‘nets’ or ‘models’) are built on the core concept of layers. In the forward pass, each network may consist of a stack of some L transformations, where L is the depth of the network. To update these networks, the backward passes may rely on backpropagating through the same L layers via the chain rule, which typically necessitates that the intermediate values of these layers are stored as temporary data. The value for L is usually a hyperparameter and is selected by model designers (e.g., ResNet-101). Among the many applications of deep networks, sequence modelling has witnessed continuous advances in deep architectures. Specifically, while recurrent networks have long been the dominant model for sequences, deep feedforward architectures based on temporal convolutions and self-attention have (re-) emerged to claim superior performance on a variety of sequence prediction tasks.
In very general terms, a deep feedforward sequence model may be written as the following iteration:
z1:T[i+1] = ƒθ[i](z1:T[i], x1:T) for i = 0, 1, 2, . . . , L  (1)
where i is the layer index; z1:T[i] is the hidden sequence of length T at layer i; x1:T is the input sequence, which is injected at each layer, i.e., the model explicitly includes skip connections, for reasons explained later; and ƒθ[i] is some nonlinear transformation which may typically enforce causality (e.g., future time points cannot influence past ones). The following is based on the use of the same transformation in each layer (known as weight tying, with ƒθ[i] = ƒθ ∀i), which is known to still achieve results competitive with the state-of-the-art.
The following further introduces a method that directly computes the fixed point z1:T* of a nonlinear transformation, e.g., the solution to the nonlinear system
z1:T* = ƒθ(z1:T*, x1:T).  (2)
This solution corresponds to the eventual hidden layer values of an infinite-depth network. But instead of finding this value, which may be an array or a vector of values and which may here and elsewhere be referred to as the ‘equilibrium point’, by iterating the model, the equilibrium point may be found directly via any black-box root-finding method. This approach may be referred to as a deep equilibrium model (DEQ) approach or simply ‘DEQ’.
The following shows that DEQ may directly differentiate through the fixed point equations via implicit differentiation, which may not require storing any intermediate activation values. In other words, one may backpropagate through the infinite-depth network while using only constant memory, equivalent to a single layer's activations. After describing the generic DEQ approach, the instantiation of DEQ is described in two feedforward sequence models: trellis networks (weight-tied temporal convolutions) and memory-augmented universal transformers (weight-tied multi-head self-attention), both of which have obtained state-of-the-art performance on various sequence tasks. It is further shown how both the forward and the backward passes may be implemented via quasi-Newton methods.
One may broadly consider the class of weight-tied deep sequence models (with passthrough connections from the input to each layer), which consist of the update
z1:T[i+1] = ƒθ(z1:T[i], x1:T), i = 0, 1, . . . , L−1, z1:T[0] = 0  (3)
It is noted that this model encapsulates classes such as the trellis network and the universal transformer (which is typically not written with passthrough connections, but this is a trivial modification). Such weight-tying is generally considered to come with three major benefits: 1) it acts as a form of regularization that stabilizes training and supports generalization; 2) it significantly reduces the model size; and 3) the network can be unrolled to any depth, typically with improved feature abstractions as depth increases. However, in practice almost all such models (and deep networks in general) may be stacked, trained and evaluated by unrolling a pre-determined, fixed number of layers. One critical issue contributing to this is the limited memory on training hardware: as the models may need to store the intermediate hidden units for backpropagation, one may hardly train them beyond a certain depth, which depth may in turn depend on the computing resources available.
In principle, the network may have an infinite depth, which is attained in the limit of unrolling a weight-tied network for a higher and higher number of layers. However, such weight-tied models tend to converge to a fixed point as depth increases towards infinity, as has been observed empirically. In other words, as each layer refines the previous layer by combining temporal features across the sequence, increasing depth towards infinity brings “diminishing returns”: each additional layer may have a smaller and smaller contribution until the network reaches an equilibrium state:

lim i→∞ z1:T[i] = lim i→∞ ƒθ(z1:T[i]; x1:T) ≡ ƒθ(z1:T*; x1:T) = z1:T*  (4)
The DEQ approach may comprise, instead of iteratively stacking ƒθ, directly solving for and differentiating through the equilibrium state.
The following discusses the forward pass of the training, and which also may be used for inference using the trained neural network. Unlike a conventional network where the output is just the Lth layer activations, the output of a DEQ is the equilibrium point itself. Therefore, the forward evaluation could be any procedure that solves for this equilibrium point. Conventional deep sequence networks, if they converge to an equilibrium, may be considered as one such method that uses the simplest fixed point iterations:
z1:T[i+1] = ƒθ(z1:T[i]; x1:T) for i = 0, 1, 2, . . .  (5)
One may alternatively also use other methods that provide faster convergence guarantees. For notational convenience, one may define gθ and re-write Eq. (4) as: gθ(z1:T; x1:T) = ƒθ(z1:T; x1:T) − z1:T → 0. The equilibrium state z1:T* may thus be the root of gθ, which may be solved for more easily with Newton's method or quasi-Newton methods (e.g., Broyden's method):
z1:T[i+1] = z1:T[i] − α B gθ(z1:T[i]; x1:T) for i = 0, 1, 2, . . .  (6)
where B is the Jacobian inverse (or a low-rank approximation thereof) at z1:T[i] and α is the step size. However, in general, any ‘black-box’ type of numerical root-finding algorithm may be used to solve for the equilibrium point in the forward pass, given an initial estimate z1:T[0] (which may be set to 0): z1:T* = RootFind(gθ; x1:T).
The following discusses the backward pass of the training. The use of a black-box RootFind may mean that one may no longer be able to rely on explicit backpropagation through the exact operations in the forward pass. While one could adapt the numerical root-finding algorithm (say, Newton's method) to obtain the equilibrium and then store and backpropagate through all the Newton iterations, the following describes an alternative procedure which may be simpler to implement, which may require only constant memory, and which assumes no knowledge of the black-box RootFind.
Let z1:T* ∈ R^(T×d) be an equilibrium hidden sequence with length T and dimensionality d, and y1:T ∈ R^(T×q) the ground-truth (target) sequence. Let h: R^d → R^q be any differentiable function and let ℒ: R^q × R^q → R be a loss function (where h and ℒ are applied in a vectorized manner) that computes

ℒ = ℒ(h(z1:T*), y1:T) = ℒ(h(RootFind(gθ; x1:T)), y1:T).  (7)
Then the loss gradient w.r.t. (⋅) (for instance, θ or x1:T) is:

∂ℒ/∂(⋅) = −δz (Jgθ^−1|z1:T*) ∂ƒθ(z1:T*; x1:T)/∂(⋅),  (8)

where δz = ∂ℒ/∂z1:T* is the backpropagated error and Jgθ^−1|z1:T* is the inverse Jacobian of gθ evaluated at z1:T*.
It has been found that the backward gradient through the “infinite” stacking may be represented as one step of matrix multiplication that involves the Jacobian at equilibrium. For instance, a stochastic gradient descent (SGD) update step on model parameters θ may be expressed as:

θ ← θ + α δz (Jgθ^−1|z1:T*) ∂ƒθ(z1:T*; x1:T)/∂θ.  (9)
Note that this result may be independent of the root-finding algorithm or the internal structure of the transformation ƒθ, and thus may not require any storage of the intermediate hidden states, which would otherwise be needed for deep backpropagation.
A challenge of enforcing the forward and backward passes described above may be the cost of computing the exact inverse Jacobian Jgθ^−1 and its multiplications at every intermediate Newton iteration. This may be addressed by Broyden's method, a quasi-Newton approach that makes low-rank updates to an approximation Bgθ of the inverse Jacobian.
Initially, one may set Bgθ[0] = −I, after which the approximation may be refined by rank-one (Sherman-Morrison) updates:

Bgθ[i+1] = Bgθ[i] + (Δz[i+1] − Bgθ[i] Δgθ[i+1]) (Δz[i+1])^T Bgθ[i] / ((Δz[i+1])^T Bgθ[i] Δgθ[i+1]),  (10)

where Δz[i+1] = z1:T[i+1] − z1:T[i] and Δgθ[i+1] = gθ(z1:T[i+1]; x1:T) − gθ(z1:T[i]; x1:T).
A similar idea may be used for the backward pass as well. Specifically, to compute the factor −δz (Jgθ^−1|z1:T*) appearing in Eq. (8), one may alternatively solve the linear system:

(Jgθ|z1:T*)^T x + δz^T = 0,  (11)
where the first term (i.e., a vector-Jacobian product) may be efficiently computed via autograd packages (e.g., in PyTorch) for any x, without explicitly writing out the Jacobian matrix. Such linear systems may generally be solved by any indirect method that leverages fast matrix-vector products. One may for example rely on Broyden's method (in general, other indirect methods would also suffice) to solve Eq. (11) and then directly backpropagate through the equilibrium via Eq. (8) in the backward pass.
A benefit of DEQ may be its extreme memory efficiency. Since any numerical root-finding algorithm may be used for both the forward and backward passes (e.g., Broyden's method), a DEQ may only need to store z1:T* (the equilibrium sequence), x1:T (the input-related, layer-independent variables), and ƒθ for the backward pass. Note that as one may only need the vector-Jacobian product (with dimension N×Td, where N is the minibatch size) in Eq. (11), one may never need to explicitly construct the Jacobian, which may otherwise be large on long and high-dimensional sequences (with dimension N×(Td)^2). Compared to other deep networks, DEQs may therefore offer a constant-memory alternative that enables models that previously required multiple GPUs and other techniques (e.g., half-precision or gradient checkpointing) to now fit easily into a single GPU.
The above analysis may be independent of the choice of ƒθ, and the memory benefit may be present regardless of the type of ƒθ. However, to find the equilibrium in a reliable and efficient manner, generally ƒθ may need to be stable and constrained. The two following instantiations are examples of stable transformations (the gated activation in TrellisNet and the layer normalization in the transformer constrain the output ranges). As both models are drastically different, this illustrates the compatibility of the DEQ approach with all three major families of existing deep sequence networks: transformers, RNNs and temporal convolutional networks (TCNs), as well as with any other weight-tied neural network.
The following describes an embodiment of the DEQ approach for a trellis network. Generally, TrellisNet is a TCN with two modifications. First, a linear transformation of the original input sequence x1:T is injected to the convolutional outputs at all layers. Second, the convolutional kernel weights are tied across the depth of the network (i.e., TrellisNet is a weight-tied TCN). This means one may write TrellisNet with convolutional kernel size k, dilation s, and non-linearity ψ in DEQ-form as
x̃1:T = input injection (i.e., linearly transformed inputs Conv1D(x1:T; Wx))
ƒθ(z1:T; x1:T) = ψ(Conv1D([u−(k−1)s:, z1:T]; Wz) + x̃1:T)
where u−(k−1)s: is typically: 1) the last (k−1)s elements of the previous sequence's output (if using history padding); or 2) simply zero-padding. [⋅,⋅] means concatenation along the temporal dimension. For ψ, the LSTM gated activation may be used.
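A condensed PyTorch sketch of such a transformation (simplifications: the LSTM-style gated activation is reduced to a single tanh-sigmoid gate, and zero padding replaces history padding):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrellisLikeCell(nn.Module):
    """Weight-tied, causal temporal convolution with input injection (illustrative)."""
    def __init__(self, d_in, d_hid, k=2, s=1):
        super().__init__()
        self.k, self.s = k, s
        self.inject = nn.Conv1d(d_in, 2 * d_hid, kernel_size=1)             # ~ W_x
        self.conv = nn.Conv1d(d_hid, 2 * d_hid, kernel_size=k, dilation=s)  # ~ W_z

    def forward(self, z, x):
        # z: (batch, d_hid, T), x: (batch, d_in, T); left-pad z for causality
        u = F.pad(z, ((self.k - 1) * self.s, 0))    # zero 'history' padding
        h = self.conv(u) + self.inject(x)           # add injected input x~
        a, b = h.chunk(2, dim=1)
        return torch.tanh(a) * torch.sigmoid(b)     # simplified gate, standing in for psi
```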
The following describes an embodiment of the DEQ approach for a weight-tied transformer. Instead of using convolutions or recurrence, a transformer network maps the input to a layer into Q (query), K (key) and V (value) components and computes the attention score between times ti and tj as [QK^T]i,j. This attention score is then normalized via softmax and multiplied with the V sequence to produce the output. Meanwhile, as the transformer is order-invariant, prior works have proposed to inject positional embeddings (PE) into the self-attention operation. Following this design, the universal transformer may “recurrently stack” the transformer's self-attention and transition function block through a number of layers.
Accordingly, one may write a weight-tied transformer in DEQ-form as
x̃1:T = input injection (i.e., linearly transformed inputs x1:T Wx)
ƒθ(z1:T; x1:T) = LN(ϕ(SelfAttention(z1:T WQKV + x̃1:T; PE1:T)))
where WQKV ∈ R^(d×3d) may produce the Q, K, V for the multi-head self-attention, and LN stands for layer normalization. Note that the above adds the input injection x̃1:T to Q, K, V in addition to the positional embedding, and initializes with z1:T[0] = 0. A 2-layer position-wise feedforward residual block may be used for ϕ. In addition, a memory-augmented transformer may be used, where [z−T′:*, z1:T] (i.e., with history padding of length T′) and relative positional embeddings PE−T′:T may be fed to the self-attention operation.
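A correspondingly condensed PyTorch sketch of the weight-tied transformer transformation (simplifications: positional embeddings, causal masking and the memory augmentation are omitted, and a standard multi-head attention stands in):

```python
import torch
import torch.nn as nn

class WeightTiedTransformerBlock(nn.Module):
    """One weight-tied self-attention transformation f_theta (illustrative)."""
    def __init__(self, d, n_heads=4):
        super().__init__()
        self.qkv = nn.Linear(d, 3 * d, bias=False)     # ~ W_QKV
        self.inject = nn.Linear(d, 3 * d, bias=False)  # input injection ~ W_x
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.phi = nn.Sequential(                      # 2-layer position-wise block
            nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, z, x):
        # inject the transformed input into Q, K and V, as in the DEQ-form above
        q, k, v = (self.qkv(z) + self.inject(x)).chunk(3, dim=-1)
        h = self.ln1(z + self.attn(q, k, v, need_weights=False)[0])
        return self.ln2(h + self.phi(h))               # ~ LN(phi(SelfAttention(...)))
```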
Further indicated by grey highlighting is the memory storage 470 which is needed at training time so as to be able to perform the subsequent backward propagation pass. In other words, the grey highlighting indicates variables which need to be kept in memory during and after the forward propagation pass so as to be able to perform the subsequent backward propagation pass. It can be seen that such memory storage may be needed for the input sequence x1:T, the temporary parameters of the iterative function z1:T[i+1]=ƒθ[i] (z1:T[i],x1:T) for i=0, 1, 2, . . . , L and the output z1:T[L] of the iterative function.
The DEQ approach is evaluated on both synthetic stress tests and realistic large-scale language modelling tasks (where complex long-term temporal dependencies are involved), using the two aforementioned instantiations of ƒθ (trellis network, weight-tied transformer). On both WikiText-103 (which contains >100M words and a vocabulary size of >260K) and the smaller Penn Treebank corpus (where stronger regularization is needed for conventional deep nets) for word-level language modelling, it is shown that DEQ achieves competitive performance even when compared to state-of-the-art methods (of the same model size, both weight-tied and non-weight-tied ones) while using significantly less memory.
Both instantiations of DEQ use Broyden's method to avoid direct computation of the inverse Jacobian, as described earlier. It is noted that the use of DEQ implicitly introduces a new “hyperparameter”: the stopping criterion for the Broyden iterations. During training, this tolerance ε of the forward and backward passes is set to ε = √T·10^−6 and ε = √T·10^−8, respectively. At inference, the tolerance is relaxed to ε = √T·10^−2. For the DEQ-TrellisNet instantiation, the settings described in [1] are roughly followed. For DEQ-Transformers, the relative positional embedding described in ‘Transformer-XL: Language Modeling with Longer-Term Dependency’ by Zihang Dai et al. is used, with sequences of length 150 at both training and inference on the WikiText-103 dataset. All experiments could run on a single RTX 2080 Ti GPU due to the low memory footprint of DEQ; however, 4 GPUs were used for the WikiText-103 experiments for faster computation.
Evaluations show that the DEQ-approach achieves strong performance on the long-range copy-memory task, as summarized in the following table.
Here, TCN refers to https://arxiv.org/abs/1803.01271, LSTM refers to ‘Long short-term memory’ (Hochreiter et al.), and GRU refers to https://arxiv.org/abs/1409.1259. The goal of the copy-memory task may be considered simple: to explicitly test a sequence model's ability to exactly memorize elements across a long period of time. As shown in the above table, a DEQ-based transformer demonstrates good memory retention over relatively long and low-dimensional sequences (T=400), with even better results than the LSTM/GRU.
An issue encountered in prior work that takes a continuous view of deep networks is the challenge of scaling these approaches to real, high-dimensional, large-scale datasets. In the following subsection, the DEQ approach is evaluated on real large-scale language datasets and its effectiveness as a practical sequence model is investigated.
Performance on Penn Treebank: following the set of hyperparameters used by [1] for TrellisNet, the DEQ-TrellisNet instantiation is evaluated on word-level language modelling with the PTB corpus. Note that without an explicit notion of “layer”, no auxiliary losses are added, unlike in [1]. As shown in the following table, when trained from scratch, the DEQ-TrellisNet achieves a test perplexity on par with the original deeply supervised TrellisNet. With reference to the table, it is noted that † the memory footprints are benchmarked on input sequence length 150 and batch size 15, which does not reflect the actual hyperparameters used; the values also do not include memory for word embeddings.
In the above and elsewhere, the Variational LSTM model refers to https://arxiv.org/abs/1512.05287, NAS Cell refers to https://arxiv.org/abs/1611.01578, the following NAS model refers to https://arxiv.org/abs/1707.05589, AWD-LSTM refers to https://arxiv.org/abs/1708.02182, DARTS refers to https://arxiv.org/abs/1806.09055, and the 60-layer TrellisNet model refers to https://arxiv.org/abs/1810.06682.
Performance on WikiText-103: on the much larger scale WT103 corpus (about 100× larger than PTB), the DEQ-TrellisNet achieves better test perplexity than the original deep TrellisNet. For the Transformer instantiation, the design of the Transformer-XL model (https://arxiv.org/abs/1901.02860) is followed, which may be considered state-of-the-art in language modelling. Specifically, comparisons are made to a “medium” Transformer-XL model (the largest released model that can fit on a GPU) and a “small” Transformer-XL model, while noting that the largest Transformer-XL model has massive memory requirements (due in part to very large embedding sizes, batch sizes, and sequence lengths, which would not be decreased by a DEQ) and can only be trained on a TPU. In the following table, it is shown that the DEQs yield competitive performance on par with state-of-the-art approaches of similar model size, outperforming many prior results while consuming much less memory during training (discussed below). († See earlier for more details.)
In the above and elsewhere, the Gated Linear ConvNet model refers to http://arxiv.org/abs/1612.08083, AWD-QRNN refers to https://arxiv.org/abs/1803.08240, Relational Memory Core refers to ‘Relational recurrent neural networks’ by Santoro et al., and the 70-layer TrellisNet model refers to https://arxiv.org/abs/1810.06682.
Memory Footprint of DEQ: For conventional deep networks with L layers, the training memory complexity may be O(L) since all intermediate activations are stored for backpropagation. In comparison, DEQs have an O(1) (i.e., constant) memory footprint. The reduced memory consumption is verified in the last column of the above tables, with controlled sequence lengths and batch sizes for fairness. On both instantiations, the DEQ approach leads to an over 80% (up to 88%) reduction in memory consumption by the model (excluding word embeddings, which are orthogonal to the comparison here). Note that the DEQ's memory footprint remains competitive even when compared with baselines that are not weight-tied (over 67% reduction), with similar or better accuracy.
Convergence to Equilibrium: the deep equilibrium model may be considered not to have “layers”. One factor that may affect the computation in DEQs is the number of Broyden iterations in forward/backward passes, where each forward Broyden step evaluates ƒθ once, and a backward step computes a vector-Jacobian product.
A corresponding graph plots the number of Broyden iterations of a deep-equilibrium model (DEQ)-based transformer against the training epoch 510 for both the forward propagation 530 and the backward propagation 540. It is found that, in general, the number of Broyden iterations gradually increases with the training epochs. Meanwhile, the backward propagation 540 may require far fewer iterations than the forward propagation 530, due to the linear system in Eq. (11).
Regarding the convergence to equilibrium, it is found that DEQs may converge to the sequence-level fixed point more efficiently, in many cases much more efficiently, than original weight-tied transformers. This is illustrated in the accompanying figures.
It is further noted that it has been found that stacking multiple DEQs may not create extra representational power over a single DEQ, or in other words, a single DEQ may provide a same representational power as a stacking of multiple DEQs.
The system 700 may further comprise a processor subsystem 760 which may be configured to, during operation of the system 700, apply the trained neural network to the input data 722 to obtain output data representing an inference by the trained neural network, wherein said applying may comprise determining the equilibrium point using the substitute for the stack of layers and providing the equilibrium point as a substitute for an output of the stack of layers in the neural network. Such output data may take various forms, and may in some examples be a direct output of the system 700. In other examples, which are also described in the following, the system 700 may output data which is derived from the inference of the trained neural network, instead of directly representing the inference.
It will be appreciated that the same considerations and implementation options apply for the processor subsystem 760 as for the processor subsystem 160 of the system 100 described above.
In some embodiments, the system 700 may comprise an actuator interface 740 for providing control data 742 to an actuator 40 in the environment 60. Such control data 742 may be generated by the processor subsystem 760 to control the actuator 40 based on one or more inferences, as may be generated by the trained neural network when applied to the input data 722. For example, the actuator 40 may be an electric, hydraulic, pneumatic, thermal, magnetic and/or mechanical actuator. Specific yet non-limiting examples include electrical motors, electroactive polymers, hydraulic cylinders, piezoelectric actuators, pneumatic actuators, servomechanisms, solenoids, stepper motors, etc. Such a type of control is described elsewhere in this specification.
In other embodiments (not shown in
In general, each system described in this specification, including but not limited to the system 100 and the system 700, may be embodied as, or in, a single device or apparatus, such as a workstation or a server.
The computer-implemented method 800 is shown to comprise, in a step titled “ACCESSING TRAINED NEURAL NETWORK”, accessing 810 a trained neural network as described elsewhere in this specification. The method 800 is further shown to comprise, in a step titled “ACCESSING INPUT DATA”, accessing 820 input data for the trained neural network. The method 800 is further shown to comprise, in a step titled “APPLYING TRAINED NEURAL NETWORK TO INPUT DATA”, applying 830 the trained neural network to the input data to obtain output data representing an inference by the trained neural network. Said applying 830 by the method 800 is shown to comprise, in a step titled “DETERMINING EQUILIBRIUM POINT”, determining 840 the equilibrium point using the substitute for the stack of layers, and in a step titled “PROVIDING EQUILIBRIUM POINT AS SUBSTITUTE FOR OUTPUT OF STACK OF LAYERS”, providing 850 the equilibrium point as a substitute for an output of the stack of layers in the neural network.
It will be appreciated that, in general, the operations or steps of the computer-implemented methods 200 and 800 may be performed in any suitable order, e.g., consecutively, simultaneously, or a combination thereof, subject to, where applicable, a particular order being necessitated, e.g., by input/output relations.
Each method, algorithm or pseudo-code described in this specification may be implemented on a computer as a computer-implemented method, as dedicated hardware, or as a combination of both. Instructions for the computer, e.g., executable code, may be stored on a computer-readable medium.
Examples, embodiments or optional features, whether indicated as non-limiting or not, are not to be understood as limiting the invention as claimed.
In accordance with an abstract of the specification, it is noted that a neural network may comprise an iterative function (z[i+1]=ƒ(z[i], θ, c(x))). Such an iterative function is known in the field of machine learning to be representable by a stack of layers which have mutually shared weights. As described in this specification, this stack of layers may during training be replaced by the use of a numerical root-finding algorithm to find an equilibrium of the iterative function in which a further execution of the iterative function would not substantially further change the output of the iterative function. Effectively, the stack of layers may be replaced by a numerical equilibrium solver. The use of the numerical root-finding algorithm is demonstrated to greatly reduce the memory footprint during training while achieving similar accuracy as state-of-the-art prior art models.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. Use of the verb “comprise” and its conjugations does not exclude the presence of elements or stages other than those stated in a claim. The article “a” or “an” preceding an element does not exclude the presence of a plurality of such elements. Expressions such as “at least one of” when preceding a list or group of elements represent a selection of all or of any subset of elements from the list or group. For example, the expression, “at least one of A, B, and C” should be understood as including only A, only B, only C, both A and B, both A and C, both B and C, or all of A, B, and C. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Priority application: EP 19190237.8, filed August 2019 (regional).