The present invention relates to a method for training a neural network, to a training system, to uses of the neural network thus trained, to a computer program, and to a machine-readable memory medium.
A method for training neural networks is described in “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580v1, Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, Ruslan R. Salakhutdinov (2012), in which feature detectors are randomly ignored during the training. This method is also known under the name “dropout.”
A method for training neural networks is described in “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” arXiv preprint arXiv:1502.03167v3, Sergey Ioffe, Christian Szegedy (2015), in which input variables are normalized in a layer for a small batch (“mini batch”) of training examples.
A method in accordance with an example embodiment of the present invention may have the advantage over the related art that overfitting of parameters of a neural network may be prevented particularly well.
Advantageous refinements and example embodiments of the present invention are disclosed herein.
With a sufficiently large number of training data, so-called “deep learning” methods, i.e., (deep) artificial neural networks, may be used to efficiently ascertain a map between an input space V0 and an output space Vk. This may, for example, be a classification of sensor data, in particular, image data, i.e., a mapping of sensor data or image data to classes. This is based on the approach of providing a k−1 number of hidden spaces V1, . . . , Vk−1. Furthermore, a k number of maps ƒi:Vi−1→Vi (i=1 . . . k) are provided between these spaces. Each of these maps ƒi is typically referred to as a layer. Such a layer ƒi is typically parameterized by weights wi ∈ Wi having a suitably selected space Wi. Weights w1, . . . , wk of the k number of layers ƒi are collectively also referred to as weights w ∈ W:=W1× . . . ×Wk, and the mapping from input space V0 to output space Vk is referred to as ƒw:V0→Vk, which results from the individual maps ƒi (with weights wi explicitly indicated as subscript) as $f_w(x) := f^k_{w_k} \circ \dots \circ f^1_{w_1}(x)$.
Given a probability distribution D which is defined on V0×Vk, the task of training the neural network is to determine weights w ∈ W in such a way that an expected value Φ of a cost function L,
$$\Phi[w] = E_{(x_D, y_D) \sim D}\left[ L(f_w(x_D), y_D) \right] \quad (1)$$
is minimized. In the process, cost function L denotes a measure for the distance between the image ƒw(xD) in output space Vk of an input variable xD, ascertained with the aid of function ƒw, and an actual output variable yD in output space Vk.
A “deep neural network” may be understood to mean a neural network including at least two hidden layers.
To minimize this expected value Φ, gradient-based methods may be utilized, which ascertain a gradient ∇Φ with respect to weights w. This gradient ∇Φ is usually approximated with the aid of training data (xj, yj), i.e., by ∇w L(ƒw(xj), yj), indices j being selected from a so-called epoch. An epoch is a permutation of labels {1, . . . , N} of the available training data points.
To expand the training data set, so-called data augmentation (also referred to as augmentation) may be utilized. In the process, it is possible to select an augmented pair (xα, yj) for each index j from the epoch instead of pair (xj, yj), input signal xj being replaced by an augmented input value xα ∈ α(xj) here. In the process, α(xj) may be a set of typical variations of input signal xj (including input signal xj itself) which leave a classification of input signal xj, i.e., the output signal of the neural network, unchanged.
This epoch-based sampling, however, is not entirely consistent with the definition from equation (1) since each data point is selected exactly one time during the course of an epoch. The definition from equation (1), in contrast, is based on independently drawn data points. This means that while equation (1) requires the data points to be drawn “with replacement,” the epoch-based sampling carries out a drawing of the data points “without replacement.” This may result in the requirements of mathematical convergence proofs not being met (because, when selecting N examples with replacement from a set of a N number of data points, the probability of selecting each of these data points exactly once is $N!/N^N$, which is less than 1/4 for N>2, while this probability is always equal to 1 in the case of epoch-based sampling).
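Purely by way of illustration, the probability in question may be evaluated numerically. The following minimal sketch assumes independent, uniform drawing with replacement; the function name `prob_each_exactly_once` is hypothetical and not part of the described method.

```python
import math


def prob_each_exactly_once(n: int) -> float:
    """Probability that n independent uniform draws (with replacement)
    from n data points select every data point exactly once: n! / n**n."""
    return math.factorial(n) / n**n


# The probability shrinks rapidly as the number of data points grows,
# whereas epoch-based sampling always yields probability 1.
for n in (2, 3, 5, 10):
    print(n, prob_each_exactly_once(n))
```

The rapid decay of this probability illustrates how far epoch-based sampling departs from the independent drawing assumed in equation (1).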
If data augmentation is utilized, this statistical effect may be further amplified since an element of set α(xj) is present in each epoch and, depending on augmentation function α, it cannot be excluded that α(xj) ≈ α(xi) for i ≠ j. Statistically correct mapping of the augmentations with the aid of set α(xj) is difficult since the effect does not have to be equally pronounced for each input datum xj. In this way, for example, a rotation may have no impact on circular objects, but may greatly impact general objects. As a result, the size of set α(xj) may be dependent on input datum xj, which may be problematic for adversarial training methods.
Finally, number N of the training data points is a variable which, in general, is difficult to set. If N is selected to be too large, the run time of the training method may be unduly extended. If N is selected to be too small, convergence cannot be guaranteed since mathematical proofs of the convergence, in general, are based on assumptions which are then not met. In addition, it is not clear at what point in time the training is to be reliably terminated. When taking a portion of the data points as an evaluation data set and determining the quality of the convergence with the aid of this evaluation data set, the result may be that overfitting of the weights w occurs with respect to the data points of the evaluation data set, which not only reduces the data efficiency, but may also impair the performance capability of the network when it is applied to data other than training data. This may result in a reduction of the so-called “generalizability.”
To reduce overfitting, a piece of information which is stored in the hidden layers may be randomly thinned with the aid of the “dropout” method mentioned at the outset.
To improve the randomization of the training process, it is possible, through the use of so-called batch normalization layers, to introduce statistical parameters μ and σ over so-called mini batches, which are probabilistically updated during the training process. During the inference, the values of these parameters μ and σ are selected as fixedly predefinable values, for example as estimated values from the training through extrapolation of the exponential decay behavior.
If the layer having index i is a batch normalization layer, the associated weights wi=(μi, σi) are not updated in the case of a gradient descent, i.e., these weights wi are thus treated differently than weights wk of the remaining layers k. This increases the complexity of an implementation.
In addition, the size of the mini batches is a parameter which in general influences the training result and thus, as a further hyperparameter, must be set as well as possible, for example within the scope of a (possibly complex) architecture search.
In a first aspect, the present invention thus relates to a method for training a neural network which, in particular, is configured to classify physical measuring variables. In accordance with an example embodiment of the present invention, parameters of the neural network are adapted as a function of an output signal of the neural network when an input signal is supplied, and as a function of an associated desired output signal, the adaptation of the parameters occurring as a function of an ascertained gradient, characterized in that components of the ascertained gradient are scaled as a function of the layer of the neural network to which the parameters corresponding to these components belong.
In this connection, “scaling” shall be understood to mean that the components of the ascertained gradient are multiplied with a factor which is dependent on the layer.
In particular, the scaling may take place as a function of a position, i.e., the depth, of this layer within the neural network.
The depth may, for example, be characterized, in particular given, by the number of layers through which a signal supplied to an input layer of the neural network has to propagate before it is present for the first time as an input signal at this layer.
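A layer-dependent scaling of this kind may be sketched as follows. The sketch assumes that the gradient is available as one list of components per layer, indexed by depth; the function name `scale_gradient_by_layer` and the factor 1/(1+depth) used in the example are hypothetical choices for illustration, since no specific factor is prescribed here.

```python
from typing import Callable, Dict, List


def scale_gradient_by_layer(
    grads: Dict[int, List[float]],
    factor: Callable[[int], float],
) -> Dict[int, List[float]]:
    """Multiply every gradient component by a factor that depends only on
    the depth of the layer to which the corresponding parameter belongs."""
    return {
        depth: [factor(depth) * g for g in components]
        for depth, components in grads.items()
    }


# Gradient components grouped by layer depth (0 = layer fed by the input layer).
grads = {0: [0.2, -0.4], 1: [1.0], 2: [0.5, 0.5]}

# Hypothetical factor: deeper layers receive smaller updates.
scaled = scale_gradient_by_layer(grads, lambda depth: 1.0 / (1 + depth))
print(scaled)
```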
In one refinement of the present invention, it may be provided that the scaling also occurs as a function of to which feature of a feature map the corresponding component of the ascertained gradient belongs.
In particular, it may be provided that the scaling occurs as a function of a size of a receptive field of this feature.
It was found that, in particular, in a convolutional network, weights of a feature map are cumulatively multiplied with pieces of information of the features of the receptive field, which is why overfitting may form for these weights. This is effectively suppressed by the described method.
In one particularly simple and efficient alternative, it may be provided that the scaling occurs as a function of the resolution of this layer, in particular as a function of a quotient of the resolution of this layer and the resolution of the input layer.
It was found that, in this way, the size of the receptive field may be approximated very easily and efficiently.
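By way of example only, the quotient of resolutions may be computed and applied as follows. That the quotient itself is used directly as the multiplicative factor is an assumption made for this sketch; any other function of the quotient could be substituted. The names `resolution_quotient` and `scale_by_resolution` are hypothetical.

```python
from typing import List


def resolution_quotient(layer_resolution: int, input_resolution: int) -> float:
    """Quotient of the resolution of a layer and the resolution of the input layer."""
    return layer_resolution / input_resolution


def scale_by_resolution(
    grad_components: List[float], layer_resolution: int, input_resolution: int
) -> List[float]:
    """Scale the gradient components of one layer.
    Assumption: the quotient itself serves as the scaling factor."""
    q = resolution_quotient(layer_resolution, input_resolution)
    return [q * g for g in grad_components]


# A layer with 8 features per axis in a network fed with 32 features per axis.
print(scale_by_resolution([4.0, -8.0], layer_resolution=8, input_resolution=32))
```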
In one further aspect, the present invention relates to a method in which the neural network is trained with the aid of a training data set, pairs including an input signal and an associated desired output signal being (randomly) drawn from the training data set for training, in order to adapt the parameters of the neural network as a function of the output signal of the neural network when the input signal is supplied, and as a function of the desired output signal, this drawing of pairs always occurring from the entire training data set.
In one preferred refinement of this aspect, it is provided that the drawing of pairs occurs regardless of which pair was previously drawn during the course of the training.
In other words, the sampling of pairs, i.e., data points, from the training data set corresponds to a “drawing with replacement.” This breaks with the existing paradigm that the training examples of the training data set are drawn by “drawing without replacement.” This “drawing with replacement” may initially appear to be disadvantageous since it cannot be guaranteed that every data point from the training data set is actually used within a given number of training examples.
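The difference between the two sampling schemes may be illustrated by the following minimal sketch. It assumes a training data set held as a list of pairs and a uniform drawing probability; the function names are hypothetical.

```python
import random


def draw_with_replacement(dataset, rng):
    """Draw one pair from the entire training data set,
    regardless of which pairs were drawn previously."""
    return dataset[rng.randrange(len(dataset))]


def epoch_order(dataset, rng):
    """Conventional epoch: a random permutation, each pair drawn exactly once."""
    order = list(dataset)
    rng.shuffle(order)
    return order


rng = random.Random(0)
dataset = [(f"x{j}", f"y{j}") for j in range(5)]

# Independent draws: a pair may occur several times or not at all.
iid_draws = [draw_with_replacement(dataset, rng) for _ in range(5)]
# Epoch-based sampling: every pair occurs exactly once.
epoch = epoch_order(dataset, rng)
print(iid_draws)
print(epoch)
```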
However, a guaranteed reliability of the trained system results, which is essential, in particular, for a safety-critical use.
Surprisingly, this advantage arises without having to tolerate a worsening in the performance capability achievable at the training end (e.g., during the classification of images). In addition, an interface to other sub-blocks of a training system with which the neural network is trainable is drastically simplified.
The drawn pairs may optionally also be further augmented. This means that a set of augmentation functions may be provided for some or all of the input signals included in the training data set (as a component of the pairs), to which the input signal may be subjected. The selection of the corresponding augmentation function may also take place randomly, preferably regardless of which pairs and/or which augmentation functions were previously drawn during the course of the training.
In one refinement of the present invention, it may be provided that the input signal of the drawn pair is augmented using augmentation function αi, i.e., that the input signal is replaced by its image under the augmentation function.
It is preferably provided in the process that augmentation function αi is selected, in particular randomly, from the set of possible augmentation functions α, this set being dependent on the input signal.
In the process, it may be provided that, during the random drawing of pairs from the training data set, a probability that a predefinable pair is drawn is dependent on a number of possible augmentation functions α of the input signal of this predefinable pair.
For example, the probability may be a predefinable variable. In particular, the probability is advantageously selected to be proportional to the number of possible augmentation functions. This makes it possible to adequately take into consideration that some augmentation functions leave the input signal unchanged, so that the cardinal number of the set (i.e., the number of the elements of the set) of the augmentation functions between the input signals may be very different. As a result of the adequate consideration, possible problems with adversarial training methods may be avoided.
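A drawing of this kind may be sketched as follows, assuming that the set of possible augmentation functions is available per input signal as a list. The helper names and the toy data are hypothetical; the second input signal is chosen so that reversing it would leave it unchanged, which is why only the identity is listed for it.

```python
import random


def identity(x):
    return x


def reverse(x):
    return list(reversed(x))


def draw_pair_and_augmentation(dataset, augmentations, rng):
    """Draw a pair with probability proportional to the number of its possible
    augmentation functions, then draw one of these functions uniformly."""
    weights = [len(augmentations[j]) for j in range(len(dataset))]
    j = rng.choices(range(len(dataset)), weights=weights, k=1)[0]
    alpha = rng.choice(augmentations[j])
    x, y = dataset[j]
    return j, alpha(x), y


rng = random.Random(0)
dataset = [([1, 2, 3], "a"), ([1, 1, 1], "b")]
augmentations = {0: [identity, reverse], 1: [identity]}

draws = [draw_pair_and_augmentation(dataset, augmentations, rng) for _ in range(3000)]
share_first = sum(1 for j, _, _ in draws if j == 0) / len(draws)
print(share_first)  # expected to be close to 2/3
```

With this weighting, every element of every set of augmented input signals is drawn with the same probability, irrespective of how many augmentations its original input signal admits.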
In another aspect of the refinements of the present invention, it may be provided that the adaptation of the parameters takes place as a function of an ascertained gradient and, for the ascertainment of the gradient, an estimated value m1 of the gradient is refined, by taking a successively increasing number of pairs which are drawn from the training data set into consideration, until a predefinable termination condition which is dependent on estimated value m1 of the gradient is met.
This means, in particular, that the adaptation of the parameters only takes place after the predefinable termination condition has been met.
This is in contrast to conventional methods from the related art, such as stochastic gradient descent, in which an averaging of the gradient always takes place over a predefinable mini batch. This mini batch has a predefinable size which may be set as a hyperparameter. By successively adding pairs from the training data set, it is possible in the described method to carry out the ascertainment until the gradient reliably points in the ascending direction.
In addition, the size of the mini batch is a hyperparameter to be optimized. As a result of being able to dispense with this optimization, the method is more efficient and more reliable since overfitting may be suppressed more effectively, and the batch size is dispensed with as a hyperparameter.
In particular, the predefinable termination condition may also be dependent on a covariance matrix C of estimated value m1 of the gradient.
In this way, it is possible to ensure particularly easily that the gradient reliably points in the ascending direction.
For example, the predefinable termination condition may encompass the condition whether estimated value m1 and covariance matrix C meet, for a predefinable confidence value λ, the condition $\langle m_1, C^{-1} m_1 \rangle \geq \lambda^2$.
A probabilistic termination criterion is thus introduced with this condition. In this way, it is possible to ensure with predefinable confidence that the gradient, with confidence value λ, points in the ascending direction.
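The successive refinement of the estimated value may be sketched as follows. This is a simplified illustration under stated assumptions, not a definitive implementation: covariance matrix C is approximated by its diagonal, C is taken to be the covariance of the mean estimate (sample variance divided by the number of samples), and the limits `min_samples` and `max_samples` as well as the callback `draw_sample_gradient` are hypothetical.

```python
import random


def estimate_gradient(draw_sample_gradient, lam, min_samples=2, max_samples=10_000):
    """Refine the mean gradient m1 sample by sample until
    <m1, C^-1 m1> >= lam**2 holds (diagonal approximation of C)."""
    n = 0
    mean, m2 = None, None  # running mean and sum of squared deviations (Welford)
    while n < max_samples:
        g = draw_sample_gradient()  # gradient of the cost for one drawn pair
        n += 1
        if mean is None:
            mean = [0.0] * len(g)
            m2 = [0.0] * len(g)
        for k, gk in enumerate(g):
            delta = gk - mean[k]
            mean[k] += delta / n
            m2[k] += delta * (gk - mean[k])
        if n >= min_samples:
            # Variance of the mean estimate per component (diagonal of C).
            var_of_mean = [m2k / (n - 1) / n for m2k in m2]
            if all(v > 0.0 for v in var_of_mean):
                stat = sum(m * m / v for m, v in zip(mean, var_of_mean))
                if stat >= lam**2:
                    break
    return mean, n


# Toy example: noisy per-sample gradients around the true gradient (1, -2).
rng = random.Random(0)
m1, n_used = estimate_gradient(
    lambda: [1.0 + rng.gauss(0.0, 1.0), -2.0 + rng.gauss(0.0, 1.0)], lam=3.0
)
print(m1, n_used)
```

In this sketch, the number of pairs taken into consideration is not fixed in advance but results from the noise level of the per-sample gradients and from confidence value λ.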
In another aspect of the refinements in accordance with the present invention, it may be provided that the neural network includes a scaling layer, the scaling layer mapping an input signal present at the input of the scaling layer in such a way to an output signal present at the output of the scaling layer that the output signal present at the output represents a rescaled signal of the input signal, parameters which characterize the rescaling being fixedly predefinable.
Preferably, it may be provided here that the scaling layer maps an input signal present at the input of the scaling layer in such a way to an output signal present at the output of the scaling layer that this mapping corresponds to a projection to a ball, center c and/or radius ρ of this ball being fixedly predefinable. As an alternative, it is also possible that these parameters, as well as other parameters of the neural network, may be adapted during the course of the training.
In the process, the mapping may be given by equation $y = \operatorname{argmin}_{N_1(y-c) \leq \rho} N_2(x-y)$, having a first norm N1 and a second norm N2.
In one refinement of the present invention, which may be computed particularly efficiently, it may be provided that first norm N1 and second norm N2 are selected to be identical.
As an alternative or in addition, first norm N1 may be an L∞ norm. This norm may also be computed particularly efficiently, in particular, also when first norm N1 and second norm N2 are selected to be dissimilar.
As an alternative, it may be provided that first norm N1 is an L1 norm. This selection of the first norm favors the sparsity of the output signal of the scaling layer. This is advantageous, for example, for the compression of neural networks since weights having the value 0 do not contribute to the output value of their layer.
A neural network including such a layer may thus be used in a particularly memory-efficient manner, in particular in conjunction with a compression method.
In the described variants for first norm N1, it may advantageously be provided that second norm N2 is an L2 norm. In this way, the methods may be implemented particularly easily.
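For two of the described selections of the norms, the projection to the ball has a simple closed form, which may be sketched as follows. The sketch assumes that the second norm is an L2 norm in both cases; the function names are hypothetical. If the first norm is also an L2 norm, the input signal is shrunk radially toward center c; if the first norm is an L∞ norm, each component is clipped to the interval [c−ρ, c+ρ].

```python
import math
from typing import List


def project_l2_ball(x: List[float], c: List[float], rho: float) -> List[float]:
    """Closest point (in L2 distance) to x within the L2 ball of radius rho around c."""
    d = [xk - ck for xk, ck in zip(x, c)]
    norm = math.sqrt(sum(dk * dk for dk in d))
    if norm <= rho:
        return list(x)  # x already lies in the ball
    s = rho / norm
    return [ck + s * dk for ck, dk in zip(c, d)]


def project_linf_ball(x: List[float], c: List[float], rho: float) -> List[float]:
    """Closest point (in L2 distance) to x within the L-infinity ball:
    component-wise clipping to [c - rho, c + rho]."""
    return [min(max(xk, ck - rho), ck + rho) for xk, ck in zip(x, c)]


print(project_l2_ball([3.0, 4.0], [0.0, 0.0], 1.0))
print(project_linf_ball([3.0, -4.0, 0.5], [0.0, 0.0, 0.0], 1.0))
```

Other combinations of norms, for example an L1 norm as the first norm, generally do not admit such a one-line solution and are not covered by this sketch.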
It is particularly advantageous in the process when equation y=argminN
Surprisingly, it was found that this method is particularly efficient when an input signal including many important, i.e., heavily weighted, features is present at the input of the scaling layer.
Specific embodiments of the present invention are described hereafter in greater detail with reference to the figures.
Sensor 30 is an arbitrary sensor, which detects a state of surroundings 20 and transmits it as sensor signal S. It may be an imaging sensor, for example, in particular, an optical sensor such as an image sensor or a video sensor, or a radar sensor, or an ultrasonic sensor, or a LIDAR sensor. It may also be an acoustic sensor, which receives structure-borne noise or voice signals, for example. The sensor may also be a position sensor (such as for example GPS), or a kinematic sensor (for example a single-axis or multi-axis acceleration sensor). A sensor which characterizes an orientation of actuator 10 in surroundings 20 (for example a compass) is also possible. A sensor which detects a chemical composition of surroundings 20, for example a lambda sensor, is also possible. As an alternative or in addition, sensor 30 may also include an information system which ascertains a piece of information about a state of the actuator system, such as for example a weather information system which ascertains an instantaneous or future state of the weather in surroundings 20.
Control system 40 receives the sequence of sensor signals S of sensor 30 in an optional receiving unit 50, which converts the sequence of sensor signals S into a sequence of input signals x (alternatively, it is also possible to directly adopt the respective sensor signal S as input signal x). Input signal x may, for example, be a portion or a further processing of sensor signal S. Input signal x may, for example, encompass image data or images, or individual frames of a video recording. In other words, input signal x is ascertained as a function of sensor signal S. Input signal x is supplied to a neural network 60.
Neural network 60 is preferably parameterized by parameters θ, for example encompassing weights w which are stored in a parameter memory P and provided thereby.
Neural network 60 ascertains output signals y from input signals x. Output signals y typically encode a piece of classification information of input signal x. Output signals y are supplied to an optional conversion unit 80, which ascertains activation signals A therefrom, which are supplied to actuator 10 to accordingly activate actuator 10.
Neural network 60 may, for example, be configured to detect persons and/or road signs and/or traffic lights and/or vehicles in the input signals (i.e., to classify whether or not they are present) and/or to classify them according to their type (which may take place area-by-area, in particular, pixel-by-pixel, in the form of a semantic segmentation).
Actuator 10 receives activation signals A, is accordingly activated, and carries out a corresponding action. Actuator 10 may include a (not necessarily structurally integrated) activation logic, which ascertains a second activation signal, with which actuator 10 is then activated, from activation signal A.
In further specific embodiments, control system 40 includes sensor 30. In still further specific embodiments, control system 40 alternatively or additionally also includes actuator 10.
In further preferred specific embodiments, control system 40 includes one or multiple processor(s) 45 and at least one machine-readable memory medium 46 on which instructions are stored which, when they are executed on processors 45, prompt control system 40 to execute the method for operating control system 40.
In alternative specific embodiments, a display unit 10a is provided as an alternative or in addition to actuator 10.
Sensor 30 may be one of the sensors mentioned in connection with
Neural network 60 may, for example, detect objects in the surroundings of the at least semi-autonomous robot from input signals x. Output signal y may be a piece of information which characterizes where in the surroundings of the at least semi-autonomous robot objects are present. Activation signal A may then be ascertained as a function of this piece of information and/or corresponding to this piece of information.
Actuator 10 preferably situated in motor vehicle 100 may, for example, be a brake, a drive or a steering system of motor vehicle 100. Activation signal A may then be ascertained in such a way that actuator or actuators 10 is/are activated in such a way that motor vehicle 100, for example, prevents a collision with the objects identified by neural network 60, in particular, when objects of certain classes, e.g., pedestrians, are involved. In other words, activation signal A may be ascertained as a function of the ascertained class and/or corresponding to the ascertained class.
As an alternative, the at least semi-autonomous robot may also be another mobile robot (not shown), for example one which moves by flying, swimming, diving or walking. The mobile robot may, for example, also be an at least semi-autonomous lawn mower or an at least semi-autonomous cleaning robot. Activation signal A may also be ascertained in these cases in such a way that the drive and/or steering system of the mobile robot is/are activated in such a way that the at least semi-autonomous robot, for example, prevents a collision with the objects identified by neural network 60.
In one further alternative, the at least semi-autonomous robot may also be a garden robot (not shown), which ascertains a type or a condition of plants in surroundings 20 using an imaging sensor 30 and neural network 60. Actuator 10 may then be an applicator of chemicals, for example. Activation signal A may be ascertained as a function of the ascertained type or the ascertained condition of the plants in such a way that an amount of the chemicals corresponding to the ascertained type or the ascertained condition is applied.
In still further alternatives, the at least semi-autonomous robot may also be a household appliance (not shown), in particular, a washing machine, a stove, an oven, a microwave or a dishwasher. Using sensor 30, for example an optical sensor, a state of an object treated with the household appliance may be detected, for example in the case of a washing machine, a state of the laundry situated in the washing machine. Using neural network 60, a type or a state of this object may then be ascertained and characterized by output signal y. Activation signal A may then be ascertained in such a way that the household appliance is activated as a function of the ascertained type or the ascertained state of the object. For example, in the case of the washing machine, the washing machine may be activated as a function of the material of which the laundry situated therein is made. Activation signal A may then be selected depending on which material of the laundry was ascertained.
Sensor 30 may be one of the sensors mentioned in connection with
As a function of the signals of sensor 30, control system 40 ascertains an activation signal A of personal assistant 250, for example by neural network 60 carrying out a gesture recognition. This ascertained activation signal A is then transmitted to personal assistant 250, which is thus accordingly activated. This ascertained activation signal A may then, in particular, be selected in such a way that it corresponds to a presumed desired activation by user 249. This presumed desired activation may be ascertained as a function of the gesture recognized by neural network 60. Control system 40 may then select activation signal A for the transmission to personal assistant 250 as a function of the presumed desired activation and/or corresponding to the presumed desired activation.
This corresponding activation may, for example, include that personal assistant 250 retrieves pieces of information from a database and presents them in a form receivable by user 249.
Instead of personal assistant 250, a household appliance (not shown), in particular, a washing machine, a stove, an oven, a microwave or a dishwasher may also be provided to be accordingly activated.
Artificial neural network 60 is configured to ascertain associated output signals y from input signals x supplied to it. These output signals y are supplied to assessment unit 180.
Assessment unit 180 may, for example, characterize a performance capability of neural network 60 with the aid of a cost function (loss function) ℒ which is dependent on output signals y and the desired output signals yT. Parameters θ may be optimized as a function of cost function ℒ.
In further preferred specific embodiments, training system 140 includes one or multiple processor(s) 145 and at least one machine-readable memory medium 146 on which instructions are stored which, when they are executed on processors 145, prompt training system 140 to execute the training method.
Output layer S5 may, for example, be an Argmax layer (i.e., a layer which, from a multitude of inputs having respective assigned input values, selects a designation of the input whose assigned input value is the greatest among these input values), and one or multiple of layers S1, S2, S3 may be convolutional layers, for example.
A layer S4 is advantageously designed as a scaling layer, which is designed to map an input signal x present at the input of scaling layer S4 in such a way to an output signal y present at the output of scaling layer S4 that output signal y present at the output is a rescaling of input x, parameters which characterize the rescaling being fixedly predefinable. Exemplary embodiments for methods which scaling layer S4 is able to carry out are described below in connection with
Furthermore, a feature, e.g., a pixel, (i, j)3 of second feature map z2 is shown. If the function which ascertains second feature map z2 from first feature map z1 is represented, for example, by a convolutional layer or a fully connected layer, it is also possible that a multitude of features of first feature map z1 is incorporated in the ascertainment of the value of this feature (i, j)3. However, it is also possible, of course, that only a single feature of first feature map z1 is incorporated in the ascertainment of the value of this feature (i, j)3.
In the process, “incorporate” may advantageously be understood to mean that a combination of values of the parameters which characterize the function with which second feature map z2 is ascertained from first feature map z1, and of values of first feature map z1 exists in such a way that the value of feature (i, j)3 depends on the value of the feature being incorporated. The entirety of these features being incorporated is referred to as area Be in
In turn, one or multiple feature(s) of input signal x is/are incorporated in the ascertainment of each feature (i, j)2 of area Be. The set of all features of input signal x which are incorporated in the ascertainment of at least one of features (i, j)2 of area Be is referred to as receptive field rF of feature (i, j)3. In other words, receptive field rF of feature (i, j)3 encompasses all those features of input signal x which are directly or indirectly (in other words: at least indirectly) incorporated in the ascertainment of feature (i, j)3, i.e., whose values may influence the value of feature (i, j)3.
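For a chain of convolutional layers, the size of such a receptive field along one axis may be computed with the usual recursion, sketched below. The sketch assumes plain convolutions described by kernel size and stride only, and disregards padding and dilation; the function name `receptive_field_size` is hypothetical.

```python
from typing import List, Tuple


def receptive_field_size(layers: List[Tuple[int, int]]) -> int:
    """Number of input features along one axis that may influence a single
    feature after the given (kernel_size, stride) layers."""
    rf = 1    # a feature of the input signal depends only on itself
    jump = 1  # distance, in input features, between neighboring features
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# Two 3-wide convolutions with stride 1: 5 input features per output feature.
print(receptive_field_size([(3, 1), (3, 1)]))
# A stride of 2 in the first layer enlarges the receptive field to 7.
print(receptive_field_size([(3, 2), (3, 1)]))
```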
Initially 1000, a training data set X encompassing pairs (xi, yi) made up of input signals xi and respective associated output signals yi is provided. A learning rate η is initialized, for example at η=1.
Furthermore, a first set G and a second set N are optionally initialized, for example when in step 1100 the exemplary embodiment of this portion of the method illustrated in
The initialization of first set G and of second set N may take place as follows: First set G, which encompasses those pairs (xi, yi) of training data set X which were already drawn during the course of a current epoch of the training method is initialized as an empty set. Second set N, which encompasses those pairs (xi, yi) of training data set X which were not yet drawn during the course of the current epoch is initialized by assigning all pairs (xi, yi) of training data set X to it.
Now 1100, a gradient g of characteristic ℒ with respect to parameters θ is estimated, i.e., g=∇θℒ, with the aid of pairs (xi, yi) made up of input signals xi and respective associated output signals yi of the training data set X. Exemplary embodiments of this method are described in connection with
Then 1200, a scaling of gradient g is optionally carried out. Exemplary embodiments of this method are described in connection with
Thereafter 1300, an adaptation of a learning rate η is optionally carried out. In the process, learning rate η may, for example, be reduced by a predefinable learning rate reduction factor Dη (e.g., Dη= 1/10) (i.e., η←η·Dη), provided a number of the passed-through epochs is divisible by a predefinable epoch number, for example 5.
Then 1400, parameters θ are updated with the aid of the ascertained and possibly scaled gradient g and learning rate η. For example, parameters θ are replaced by θ−η·g.
It is now 1500 checked, with the aid of a predefinable convergence criterion, whether the method has converged. For example, it may be decided based on an absolute change in parameters θ (e.g., between the last two epochs) whether or not the convergence criterion is met. For example, the convergence criterion may be met exactly when an L2 norm over the change of all parameters θ between the last two epochs is smaller than a predefinable convergence threshold value.
If it was decided that the convergence criterion is met, parameters θ are adopted as learned parameters (step 1600), and the method ends. If not, the method branches back to step 1100.
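The sequence of steps 1100 through 1600 may be sketched as the following loop. Several simplifications are assumptions of this sketch: an epoch is taken to be a fixed number of update steps, the convergence check is carried out once per epoch, and the gradient estimation and the optional scaling are passed in as callbacks. All function and parameter names are hypothetical.

```python
import math


def train(theta, estimate_gradient, scale=lambda g: g, eta=1.0, decay=0.1,
          decay_every=5, steps_per_epoch=10, tol=1e-6, max_epochs=100):
    prev = list(theta)
    for epoch in range(1, max_epochs + 1):
        for _ in range(steps_per_epoch):
            g = estimate_gradient(theta)                        # step 1100
            g = scale(g)                                        # step 1200 (optional)
            theta = [t - eta * gk for t, gk in zip(theta, g)]   # step 1400
        if epoch % decay_every == 0:                            # step 1300
            eta *= decay
        # Step 1500: L2 norm of the parameter change between the last two epochs.
        change = math.sqrt(sum((t - p) ** 2 for t, p in zip(theta, prev)))
        if change < tol:
            break
        prev = list(theta)
    return theta                                                # step 1600


# Toy cost 0.5 * ||theta - target||^2, whose gradient is theta - target.
target = [1.0, -2.0]
theta = train([0.0, 0.0], lambda th: [t - c for t, c in zip(th, target)], eta=0.5)
print(theta)
```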
Initially 1110, a predefinable number bs of pairs (xi, yi) of training data set X is to be drawn (without replacement), i.e., selected, and assigned to a batch B. Predefinable number bs is also referred to as a batch size. Batch B is initialized as an empty set.
For this purpose, it is checked 1120 whether batch size bs is greater than the number of pairs (xi, yi) which are present in second set N.
If batch size bs is not greater than the number of pairs (xi, yi) which are present in second set N, a bs number of pairs (xi, yi) are drawn 1130, i.e., selected, randomly from second set N, and added to batch B.
If batch size bs is greater than the number of pairs (xi, yi) which are present in second set N, all pairs of second set N whose number is denoted by s are drawn 1140, i.e., selected, and added to batch B, and those remaining, i.e., a bs−s number, are drawn, i.e., selected, from first set G and added to batch B.
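Steps 1110 through 1140 may be sketched as follows. How first set G and second set N are updated after the drawing is not spelled out in this passage; the sketch assumes that drawn pairs are moved from second set N to first set G, and it leaves any re-initialization for a new epoch to the caller. The function name `draw_batch` is hypothetical, and the pairs are assumed to be sortable so that the drawing is reproducible.

```python
import random


def draw_batch(G, N, bs, rng):
    """Draw bs pairs, giving preference to pairs not yet drawn in this epoch."""
    batch = []                                        # step 1110
    if bs <= len(N):                                  # step 1120
        drawn = rng.sample(sorted(N), bs)             # step 1130
        batch.extend(drawn)
        N.difference_update(drawn)                    # assumption: move N -> G
        G.update(drawn)
    else:
        s = len(N)
        batch.extend(sorted(N))                       # step 1140: all of N ...
        batch.extend(rng.sample(sorted(G), bs - s))   # ... plus bs - s pairs from G
        G.update(N)                                   # assumption: move N -> G
        N.clear()
    return batch


rng = random.Random(0)
X = [(j, j % 2) for j in range(5)]   # toy pairs (input, desired output)
G, N = set(), set(X)                 # nothing drawn yet in the current epoch
batches = [draw_batch(G, N, 2, rng) for _ in range(3)]
print(batches)
```

The sketch requires first set G to contain at least bs−s pairs in step 1140, which holds whenever the batch size does not exceed the size of training data set X.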
Subsequent to step 1130 or 1140, in step 1150, it is optionally decided for all parameters θ whether or not these parameters θ are to be ignored in this training pass. For this purpose, for example, a probability with which parameters θ of this layer are ignored is separately established for each layer S1, S2, . . . , S6. For example, this probability may be 50% for first layer S1 and be reduced by 10% with each subsequent layer.
With the aid of these established respective probabilities, it may then be decided for each of parameters θ whether or not it is ignored.
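A minimal sketch of step 1150 is given below. It reads the reduction "by 10%" as a reduction by ten percentage points per layer, which is an assumption, as are the function names and the representation of the decision as one Boolean per parameter.

```python
import random


def layer_probabilities(num_layers, first=0.5, step=0.1):
    """Probability of ignoring a parameter, per layer S1, S2, ...
    Assumption: the probability drops by `step` per layer, floored at zero."""
    return [max(0.0, first - step * k) for k in range(num_layers)]


def draw_ignore_masks(params_per_layer, probabilities, rng=random):
    """For each parameter of each layer, decide independently whether it is
    ignored in this training pass (True means ignored)."""
    return [
        [rng.random() < p for _ in range(count)]
        for count, p in zip(params_per_layer, probabilities)
    ]
```

With six layers, `layer_probabilities(6)` yields approximately 0.5, 0.4, 0.3, 0.2, 0.1 and 0.0, so that no parameter of the sixth layer is ignored.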
It is now 1155 optionally decided for each pair (xi, yi) of batch B whether or not the respective input signal xi is augmented. For each corresponding input signal xi which is to be augmented, an augmentation function is selected, preferably randomly, and applied to input signal xi. Input signal xi thus augmented then replaces the original input signal xi. If input signal xi is an image signal, the augmentation function may be a rotation by a predefinable angle, for example.
Thereafter 1160, the corresponding (and optionally augmented) input signal xi is selected for each pair (xi, yi) of batch B and supplied to neural network 60. Parameters θ of neural network 60 to be ignored are deactivated in the process during the ascertainment of the corresponding output signal, e.g., in that they are temporarily set to the value zero. The corresponding output signal y(xi) of neural network 60 is assigned to the corresponding pair (xi, yi). Depending on output signals y(xi) and the respective output signals yi of pair (xi, yi) as the desired output signal yT, a respective cost function ℒi is ascertained.
Then 1165, the complete cost function ℒ=Σi∈B ℒi is ascertained for all pairs (xi, yi) of batch B together, and the corresponding component of gradient g is ascertained for each of parameters θ not to be ignored, e.g., with the aid of backpropagation. For each of parameters θ to be ignored, the corresponding component of gradient g is set to zero.
Now, it is checked 1170 whether it was established, during the check in step 1120, that batch size bs is greater than the number of pairs (xi, yi) which are present in second set N.
If it was established that batch size bs is not greater than the number of pairs (xi, yi) which are present in second set N, all pairs (xi, yi) of batch B are added (1180) to first set G and removed from second set N. It is now checked (1185) whether second set N is empty. If second set N is empty, a new epoch begins (1186). For this purpose, first set G is again initialized as an empty set, and second set N is newly initialized in that all pairs (xi, yi) of training data set X are assigned to it again, and the method branches off to step 1200. If second set N is not empty, the method branches off directly to step 1200.
If it was established that batch size bs is greater than the number of pairs (xi, yi) which are present in second set N, first set G is re-initialized (1190) by assigning to it all pairs (xi, yi) of batch B, second set N is newly initialized by assigning to it again all pairs (xi, yi) of training data set X, and subsequently those pairs (xi, yi) which are also present in batch B are removed from second set N. Thereafter, a new epoch begins, and the method branches off to step 1200. With this, this portion of the method ends.
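The bookkeeping of steps 1170 through 1190 may be sketched as follows. The function name, the list representation of the sets, and the Boolean flag that carries the outcome of the batch size check are assumptions. The sketch also assumes that all pairs of training data set X are distinguishable from one another.

```python
def update_sets(X, N, G, batch, bs_exceeded_N):
    """Steps 1170 to 1190. Returns (new N, new G, new_epoch_started)."""
    if not bs_exceeded_N:
        G = G + batch                                 # step 1180: add B to G
        N = [pair for pair in N if pair not in batch] # and remove B from N
        if not N:                                     # step 1185: N empty?
            return list(X), [], True                  # step 1186: new epoch
        return N, G, False
    # step 1190: the batch wrapped around the end of the epoch
    G = list(batch)
    N = [pair for pair in X if pair not in batch]
    return N, G, True
```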
Thereafter 1121, a pair (xi, yi) is randomly selected from training data set X and, if necessary, is augmented. This may, for example, take place in such a way that, for each input signal xi of pairs (xi, yi) of training data set X, a μ(α(xi)) number of possible augmentations α(xi) is ascertained, and a position variable pi is assigned to each pair (xi, yi). If a random number φ ∈ [0; 1] is then drawn in a uniformly distributed manner, the position variable pi which meets the inequation chain
p_i ≤ φ < p_{i+1} (3)
may be selected. The associated index i then denotes the selected pair (xi, yi), and an augmentation αi of input variable xi may be drawn randomly from the set of possible augmentations α(xi) and be applied to input variable xi, i.e., the selected pair (xi, yi) is replaced by (αi(xi), yi).
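The selection according to inequation chain (3) may be sketched as follows. The sketch assumes that position variable pi is the normalized cumulative number of possible augmentations of all preceding pairs, so that a pair with many possible augmentations is selected proportionally more often. This construction, the function names, and the representation of augmentations as Python callables are assumptions made for illustration only.

```python
import bisect
import random


def position_variables(augmentation_counts):
    """Assumed construction: p_i is the share of all possible augmentations
    that belongs to the pairs preceding pair i (p_0 = 0, implicit p_n = 1)."""
    total = sum(augmentation_counts)
    positions, running = [], 0
    for count in augmentation_counts:
        positions.append(running / total)
        running += count
    return positions


def select_index(positions, phi):
    """Return the index i for which p_i <= phi < p_{i+1} holds."""
    return bisect.bisect_right(positions, phi) - 1


def draw_augmented_pair(pairs, augmentations, rng=random):
    """Step 1121: select a pair, then apply one of its augmentations.
    `augmentations[i]` is the list of callables applicable to input x_i."""
    positions = position_variables([len(a) for a in augmentations])
    i = select_index(positions, rng.random())
    x, y = pairs[i]
    alpha = rng.choice(augmentations[i])
    return alpha(x), y
```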
Input signal xi is supplied to neural network 60. Depending on the corresponding output signal y(xi) and output signal yi of pair (xi, yi) as the desired output signal yT, the corresponding cost function ℒi is ascertained. For parameters θ, a gradient d in this regard is ascertained, e.g., with the aid of backpropagation, i.e., d=∇θℒ(y(xi), yi).
Then (1131), iteration counter n, first variable m1 and second variable m2 are updated as follows:
Thereafter (1141), components Ca,b of a covariance matrix C are provided as
From this, using the (vector-valued) first variable m1, a scalar product S is formed, i.e.,
S = ⟨m_1, C^{−1} m_1⟩. (8)
It shall be understood that, for the sufficiently precise ascertainment of scalar product S using equation (8), not all entries of covariance matrix C or of its inverse C−1 must be present at the same time. It is more memory-efficient to determine only those entries Ca,b of covariance matrix C that are needed at the respective point during the evaluation of equation (8).
It is then checked (1151) whether this scalar product S meets the following inequation:
S ≥ λ², (9)
λ being a predefinable threshold value which corresponds to a confidence level.
If the inequation is met, the current value of first variable m1 is adopted as estimated gradient g (1161) and the method branches back to step 1200.
If the inequation is not met, the method can branch back to step 1121. As an alternative, it may also be checked (1171) whether iteration counter n has reached a predefinable maximum iteration value nmax. If this is not the case, the method branches back to step 1121; otherwise, zero vector 0 ∈ W is adopted (1181) as estimated gradient g, and the method branches back to step 1200. With this, this portion of the method ends.
As a result of this method, it is achieved that m1 corresponds to an arithmetic mean of the ascertained gradient d over the drawn pairs (xi, yi), and m2 corresponds to an arithmetic mean of a matrix product d·dT of the ascertained gradient d over the drawn pairs (xi, yi).
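Steps 1121 through 1181 may, for example, be sketched as follows in plain Python. The sketch keeps m1 and m2 as running arithmetic means, as stated above. The form of covariance matrix C used here, namely (m2 − m1·m1T)/(n−1), is an assumption, as are the function names, the callable `draw_gradient` standing in for step 1121, the small linear solver, and its pivot tolerance. For brevity, the sketch holds the full matrix C in memory.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting.
    Returns None if A is (numerically) singular; the tolerance is hypothetical."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            return None
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * z[k] for k in range(i + 1, n))
        z[i] = (M[i][n] - tail) / M[i][i]
    return z


def estimate_gradient(draw_gradient, dim, lam, n_max):
    """Average per-pair gradients until S = <m1, C^-1 m1> >= lam^2,
    or give up after n_max draws and return the zero vector."""
    n = 0
    m1 = [0.0] * dim
    m2 = [[0.0] * dim for _ in range(dim)]
    while n < n_max:                                   # step 1171
        d = draw_gradient()                            # step 1121
        n += 1                                         # step 1131: running means
        m1 = [m + (x - m) / n for m, x in zip(m1, d)]
        m2 = [[m2[a][b] + (d[a] * d[b] - m2[a][b]) / n for b in range(dim)]
              for a in range(dim)]
        if n < 2:
            continue                                   # covariance undefined
        # step 1141 (assumed form of the covariance of the mean)
        C = [[(m2[a][b] - m1[a] * m1[b]) / (n - 1) for b in range(dim)]
             for a in range(dim)]
        z = solve(C, m1)
        if z is None:
            continue                                   # C not invertible yet
        S = sum(a * b for a, b in zip(m1, z))          # equation (8)
        if S >= lam ** 2:                              # step 1151, inequation (9)
            return m1                                  # step 1161
    return [0.0] * dim                                 # step 1181
```

In this sketch, gradients that consistently point in one direction are accepted after few draws, whereas gradients that largely cancel one another lead to the zero vector once `n_max` is reached.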
Now (1210), a scaling factor Ωi,l is ascertained for each component gi,l of gradient g. For example, this scaling factor Ωi,l may be the size of receptive field rF of the feature of the feature map of the i-th layer corresponding to l. As an alternative, scaling factor Ωi,l may also be a ratio of the resolutions, i.e., the number of features, of the i-th layer in relation to the input layer.
Then (1220), each component gi,l of gradient g is scaled using scaling factor Ωi,l, i.e.,
gi,l←gi,l/Ωi,l. (10)
If scaling factor Ωi,l is given by the size of receptive field rF, overfitting of parameters θ may be avoided particularly effectively. If scaling factor Ωi,l is given by the ratio of the resolutions, this is a particularly efficient approximate estimation of the size of receptive field rF.
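The scaling according to equation (10) may be sketched as follows. The function name and the nested-list layout, in which the outer index denotes layer i and the inner index denotes feature l, are assumptions. How scaling factors Ωi,l are obtained (receptive field size or ratio of resolutions) is left outside the sketch.

```python
def scale_gradient(g, omega):
    """Equation (10): g[i][l] <- g[i][l] / omega[i][l] for every layer i
    and every component l of that layer."""
    return [
        [g_il / omega_il for g_il, omega_il in zip(g_i, omega_i)]
        for g_i, omega_i in zip(g, omega)
    ]
```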
Scaling layer S4 is configured to achieve a projection of input signal x present at the input of scaling layer S4 onto a ball having radius ρ and center c. This is characterized by a first norm N1(y−c), which measures a distance of center c from output signal y present at the output of scaling layer S4, and a second norm N2(x−y), which measures a distance of input signal x present at the input of scaling layer S4 from output signal y present at the output of scaling layer S4. In other words, output signal y present at the output of scaling layer S4 solves the equation
y = argmin_{y′: N1(y′−c) ≤ ρ} N2(x−y′)
Initially 2000, an input signal x present at the input of scaling layer S4, a center parameter c and a radius parameter ρ are provided.
Then (2100), an output signal y present at the output of scaling layer S4 is ascertained as
With this, this portion of the method ends.
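A minimal sketch of steps 2000 and 2100 is given below for the case in which first norm N1 and second norm N2 are both the Euclidean norm. This choice of norms is an assumption made for illustration only, as is the function name. Under this assumption, an input signal inside the ball is left unchanged, and an input signal outside the ball is moved along the line toward center c until it lies on the surface of the ball.

```python
import math


def project_euclidean_ball(x, c, rho):
    """Project x onto the Euclidean ball with center c and radius rho."""
    diff = [x_i - c_i for x_i, c_i in zip(x, c)]
    distance = math.sqrt(sum(d * d for d in diff))
    if distance <= rho:
        return list(x)                       # already inside the ball
    return [c_i + rho * d / distance for c_i, d in zip(c, diff)]
```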
First (3000), similarly to step 2000, input signal x present at the input of scaling layer S4, center parameter c and radius parameter ρ are provided.
Then (3100), components yi of output signal y present at the output of scaling layer S4 are ascertained as
i denoting the components here.
This method is particularly processing-efficient. With this, this portion of the method ends.
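A component-wise projection of this kind may be sketched as follows for the case in which first norm N1 is the maximum norm. This choice of norm and the function name are assumptions made for illustration only. Under this assumption, each component of the output signal is obtained independently of all others by limiting its deviation from center c to the interval from −ρ to ρ.

```python
def project_max_norm_ball(x, c, rho):
    """Clip each component of x to the interval [c_i - rho, c_i + rho]."""
    return [
        c_i + min(max(x_i - c_i, -rho), rho)
        for x_i, c_i in zip(x, c)
    ]
```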
First (4000), similarly to step 2000, input signal x present at the input of scaling layer S4, center parameter c and radius parameter ρ are provided.
Then (4100), a sign variable ϵi is ascertained as
and components xi of input signal x present at the input of scaling layer S4 are replaced by
xi←ϵi·(xi−ci). (15)
An auxiliary parameter γ is initialized to the value zero.
Then (4200), a set N is ascertained as N={i|xi>γ} and a distance dimension D=Σi∈N(xi−γ).
Then (4300), it is checked whether inequation
D>ρ (16)
is met.
If this is the case (4400), auxiliary parameter γ is replaced by
and the method branches back to step 4200.
If inequation (16) is not met (4500), components yi of output signal y present at the output of scaling layer S4 are ascertained as
y_i = c_i + ϵ_i·(x_i − γ)_+ (18)
Notation (·)+ usually denotes the positive part, i.e., (ξ)+=max(ξ, 0).
With this, this portion of the method ends. This method corresponds to Newton's method and is particularly processing-efficient, in particular when many of the components of input signal x present at the input of scaling layer S4 are important.
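Steps 4000 through 4500 may be sketched as follows. Two details are assumptions of the sketch: sign variable ϵi is taken to be +1 where xi ≥ ci and −1 otherwise, and the update of auxiliary parameter γ is taken to be the Newton step γ ← γ + (D − ρ)/|N|, |N| being the number of elements of set N. The function name and the iteration cap `max_iter`, which merely guards against rounding effects, are likewise hypothetical. Radius ρ is assumed to be positive.

```python
def project_with_auxiliary_parameter(x, c, rho, max_iter=1000):
    """Iteratively determine gamma and return output signal y per (18)."""
    # step 4100 (assumed sign variable) and equation (15)
    eps = [1.0 if x_i >= c_i else -1.0 for x_i, c_i in zip(x, c)]
    z = [e * (x_i - c_i) for e, x_i, c_i in zip(eps, x, c)]  # all z_i >= 0
    gamma = 0.0
    for _ in range(max_iter):
        active = [z_i for z_i in z if z_i > gamma]           # step 4200: set N
        D = sum(z_i - gamma for z_i in active)               # distance measure D
        if not D > rho:                                      # step 4300, (16)
            break
        gamma += (D - rho) / len(active)                     # step 4400 (assumed)
    # step 4500, equation (18), with (.)+ as the positive part
    return [c_i + e * max(z_i - gamma, 0.0)
            for c_i, e, z_i in zip(c, eps, z)]
```

For input signal (3, 1, 0), center (0, 0, 0) and radius 2, a single update yields γ = 1, and the output signal becomes (2, 0, 0), whose components sum to the radius in absolute value.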
It shall be understood that the neural network is not limited to feedforward neural networks, but that the present invention may equally be applied to any kind of neural network, in particular, recurrent networks, convolutional neural networks, autoencoders, Boltzmann machines, perceptrons or capsule neural networks.
The term “computer” encompasses arbitrary devices for processing predefinable processing rules. These processing rules may be present in the form of software, or in the form of hardware, or also in a mixed form made up of software and hardware.
It shall furthermore be understood that the methods may be implemented not only completely in software, as described. They may also be implemented in hardware, or in a mixed form made up of software and hardware.
Number | Date | Country | Kind |
---|---|---|---|
10 2018 222 345.9 | Dec 2018 | DE | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2019/082768 | 11/27/2019 | WO | 00 |