METHOD FOR TRAINING A NEURAL NETWORK

FIELD

The present invention relates to a method for training a neural network, to a training system, to uses of the neural network thus trained, to a computer program, and to a machine-readable memory medium.

BACKGROUND INFORMATION

A method for training neural networks is described in “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580v1, Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, Ruslan R. Salakhutdinov (2012), in which feature detectors are randomly ignored during the training. These methods are also known under the name “dropout.”

A method for training neural networks is described in “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” arXiv preprint arXiv:1502.03167v3, Sergey Ioffe, Christian Szegedy (2015), in which input variables are normalized in a layer for a small batch (“mini batch”) of training examples.

SUMMARY

A method in accordance with an example embodiment of the present invention may have the advantage over the related art that a guaranteeable reliability of the trained system results, which is, in particular, essential for safety-critical applications. Surprisingly, this advantage arises without having to tolerate a worsening in the performance capability achievable at the training end (e.g., during the classification of images).

Refinements and example embodiments of the present invention are described herein.

With a sufficiently large number of training data, so-called “deep learning” methods, i.e., (deep) artificial neural networks, may be used to efficiently ascertain a map between an input space V₀and an output space V_k. This may, for example, be a classification of sensor data, in particular, image data, i.e., a mapping of sensor data or image data to classes. This is based on the approach of providing a k−1 number of hidden spaces V₁, . . . , V_k-1. Furthermore, a k number of maps ƒⁱ:V_i−1→V_i(i=1 . . . k) are provided between these spaces. Each of these maps ƒⁱis typically referred to as a layer. Such a layer ƒⁱis typically parameterized by weights w_i∈Wⁱhaving a suitable selected space Wⁱ. Weights w₁, . . . , w_kof the k number of layers ƒⁱare collectively also referred to as weights w∈W:=W¹× . . . ×W^k, and the mapping from input space V₀to output space V_kis referred to as ƒ_w:V₀→V_k, which from the individual maps ƒⁱ(with weights w_iexplicitly indicated as subscript) results as ƒ_w(x):=ƒ_w_k^ko . . . oƒ_w₁¹(x).

At a given probability distribution D, which is defined as V₀×V_k, the task of training the neural network is to determine weights w∈W in such a way that an expected value Φ of a cost function L

Φ[w]=E_(x_D_,γ_D_)˜D[L(ƒ_w(x_D)y_D)] (1)

is minimized. In the process, cost function L denotes a measure for the distance between the map, ascertained with the aid of function ƒ_w, of an input variable x_Dto a variable ƒ_w(x_D) in output space V_kand an actual output variable y_Din output space V_k.

A “deep neural network” may be understood to mean a neural network including at least two hidden layers.

To minimize this expected value Φ, gradient-based methods may be utilized, which ascertain a gradient ∇Φ with respect to weights w. This gradient ∇Φ is usually approximated with the aid of training data (x_j,y_j), i.e., by ∇_wL(ƒ_w(x_j,y_j)), indices j being selected from a so-called epoch. An epoch is a permutation of labels {1, . . . , N} of the available training data points.

To expand the training data set, so-called data augmentation (also referred to as augmentation) may be utilized. In the process, it is possible to select an augmented pair (x_a,y_j) for each index j from the epoch instead of pair (x_j,y_j), input signal x_jbeing replaced by an augmented input value x_a∈α(x_j) here. In the process, α(x_j) may be a set of typical variations of input signal x_j(including input signal x_jitself) which leave a classification of input signal x_j, i.e., the output signal of the neural network, unchanged.

This epoch-based sampling, however, is not entirely consistent with the definition from equation (1) since each data point is selected exactly one time during the course of an epoch. The definition from equation (1), in contrast, is based on independently drawn data points. This means that while equation (1) requires the data points to be drawn “with replacement,” the epoch-based sampling carries out a drawing of the data points “without replacement.” This may result in the requirements of mathematical convergence proofs not being met (because, when selecting N examples from a set of a N number of data points, the probability of selecting each of these data points exactly once is less than

$e^{- \frac{N}{2}}$

(for N>2), while this probability is always equal to 1 in the case of epoch-based sampling.

If data augmentation is utilized, this statistical effect may be further amplified since an element of set α(x_j) is present in each epoch and, depending on augmentation function α, it cannot be excluded that α(x_j)≈α(x_i) for i≠j. Statistically correct mapping of the augmentations with the aid of set α(x_j) is difficult since the effect does not have to be equally pronounced for each input datum x_j. In this way, for example, a rotation may have no impact on circular objects, but may greatly impact general objects. As a result, the size of set α(x_j) may be dependent on input datum x_j, which may be problematic for adversarial training methods.

After all, number N of the training data points is a variable which, in general, is complex to set. If N is selected to be too large, the run time of the training method may be unduly extended, if N is selected to be too small, a convergence cannot be guaranteed since mathematical proofs of the convergence, in general, are based on assumptions which are then not met. In addition, it is not clear at what point in time the training is to be reliably terminated. When taking a portion of the data points as an evaluation data set and determining the quality of the convergence with the aid of this evaluation data set, the result may be that overfitting of the weights w occurs with respect to the data points of the evaluation data set, which not only reduces the data efficiency, but may also impair the performance capability of the network when it is applied to data other than training data. This may result in a reduction of the so-called “generalizability.”

To reduce overfitting, a piece of information which is stored in the hidden layers may be randomly thinned with the aid of the “dropout” method mentioned at the outset.

To improve the randomization of the training process, it is possible, through the use of so-called batch normalization layers, to introduce statistical parameters μ and σ over so-called mini batches, which are probabilistically updated during the training process. During the inference, the values of these parameters μ and σ are selected as fixedly predefinable values, for example as estimated values from the training through extrapolation of the exponential decay behavior.

If the layer having index i is a batch normalization layer, the associated weights w_i=(μ_i,σ_i) are not updated in the case of a gradient descent, i.e., these weights w_iare thus treated differently than weights w_kof the remaining layers k. This increases the complexity of an implementation.

In addition, the size of the mini batches is a parameter which in general influences the training result and thus, as a further hyperparameter, must be set as well as possible, for example within the scope of a (possibly complex) architecture search.

In a first aspect, the present invention thus relates to a method for training a neural network which is, in particular, configured to classify physical measuring variables, the neural network being trained with the aid of a training data set X, pairs including an input signal and an associated desired output signal being (randomly) drawn from the training data set for training, an adaptation of parameters of the neural network taking place as a function of an output signal of the neural network, when the output signal is supplied, and as a function of the desired output signal, this drawing of pairs always occurring from the entire training data set.

In one preferred refinement of this aspect of the present invention, it is provided that the drawing of pairs occurs regardless of which pair was previously drawn during the course of the training.

In other words, the sampling of pairs, i.e., data points, from the training data set corresponds to a “drawing with replacement.” This breaks with the existing paradigm that the training examples of the training data set are drawn by “drawing without replacement.” This “drawing with replacement” may initially appear to be disadvantageous since it cannot be guaranteed that every data point from the training data set is actually used within a given number of training examples.

With this, a guaranteeable reliability of the trained system results, which is essential, in particular, for a safety-critical use. Surprisingly, this advantage arises without having to tolerate a worsening in the performance capability achievable at the training end (e.g., during the classification of images). In addition, an interface to other sub-blocks of a training system with which the neural network is trainable is drastically simplified.

The drawn pairs may optionally also be further augmented. This means that a set of augmentation functions may be provided for some or all of the input signals included in the training data set (as a component of the pairs), to which the input signal may be subjected. The selection of the corresponding augmentation function may also take place randomly, preferably regardless of which pairs and/or which augmentation functions were previously drawn during the course of the training.

In one refinement of the present invention, it may be provided that the input signal of the drawn pair is augmented using augmentation function α_i, i.e., that the input signal is replaced by its image under the augmentation function.

It is preferably provided in the process that augmentation function α_iis selected, in particular randomly, from the set α of possible augmentation functions, this set being dependent on the input signal.

In the process, it may be provided that, during the random drawing of pairs from the training data set, a probability that a predefinable pair is drawn is dependent on a number of possible augmentation functions α of the input signal of this predefinable pair.

For example, the probability may be a predefinable variable. In particular, the probability is advantageously selected to be proportional to the number of possible augmentation functions. This makes it possible to adequately take into consideration that some augmentation functions leave the input signal unchanged, so that the cardinal number of the set (i.e., the number of the elements of the set) of the augmentation functions between the input signals may be very different. As a result of the adequate consideration, possible problems with adversarial training methods may be avoided. This may be understood as follows: With a given input signal, an adversarial input signal may be generated with the aid of a suitable augmentation function in the case of adversarial training methods, which has a sufficiently small distance of smaller than a maximum distance r from the given input signal. If two input signals are permitted, which have a small distance (smaller than twice the maximum distance) from one another, it is possible that the sets of the adversarial input signals overlap, so that the adversarial training methods may be overrepresented, provided that this overlap is not adequately taken into consideration. This is achieved by the described method.

In another aspect of the refinements of the present invention, it may be provided that the adaptation of the parameters takes place as a function of an ascertained gradient and, for the ascertainment of the gradient, an estimated value m₁of the gradient is refined, by taking a successively increasing number of pairs which are drawn from the training data set into consideration, until a predefinable termination condition which is dependent on estimated value m₁of the gradient is met.

This means, in particular, that the adaptation of the parameters only takes place after the predefinable termination condition has been met.

This is in contrast to conventional methods from the related art, such as stochastic gradient descent, in which an averaging of the gradient always takes place over a predefinable mini batch. This mini batch has a predefinable size which may be set as a hyperparameter. By successively adding pairs from the training data set, it is possible in the described method to carry out the ascertainment until the gradient reliably points in the ascending direction.

In addition, the size of the mini batch is a hyperparameter to be optimized. As a result of being able to dispense with this optimization, the method is more efficient and more reliable since overfitting may be suppressed more effectively, and the batch size is dispensed with as a hyperparameter.

In particular, the predefinable termination condition may also be dependent on a covariance matrix C of estimated value m₁of the gradient.

In this way, it is possible to ensure particularly easily that the gradient reliably points in the ascending direction.

For example, the predefinable termination condition may encompass the condition whether estimated value m₁and covariance matrix C for a predefinable confidence value λ meet condition custom-character m₁, C⁻¹_m₁≥λ².

A probabilistic termination criterion is thus introduced with this condition. In this way, it is possible to ensure with predefinable confidence that the gradient, with confidence value λ, points in the ascending direction.

In one further aspect of the refinements of the present invention, it may be provided that the components of the ascertained gradient are scaled as a function of to which layer of the neural network the parameters corresponding to these components belong.

In this connection, “scaling” shall be understood to mean that the components of the ascertained gradient are multiplied with a factor which is dependent on the layer.

In particular, the scaling may take place as a function of a position, i.e., the depth, of this layer within the neural network.

The depth may, for example, be characterized, in particular, given, by the number of layers through which a signal which is supplied to an input layer of the neural layer has to propagate before it is present for the first time as an input signal at this layer.

In one refinement of the present invention, it may be provided that the scaling also occurs as a function of to which feature of a feature map the corresponding component of the ascertained gradient belongs.

In particular, it may be provided that the scaling occurs as a function of a size of a receptive field of this feature.

It was found that, in particular, in a convolutional network, weights of a feature map are cumulatively multiplied with pieces of information of the features of the receptive field, which is why overfitting may form for these weights. This is effectively suppressed by the described method.

In one particularly simple and efficient alternative of the present invention, it may be provided that the scaling occurs as a function of the resolution of this layer. In particular, that it occurs as a function of a quotient of the resolution of this layer and the resolution of the input layer.

It was found that, in this way, the size of the receptive field may be approximated very easily and efficiently.

In another aspect of the refinements of the present invention, it may be provided that the neural network includes a scaling layer, the scaling layer mapping an input signal present at the input of the scaling layer in such a way to an output signal present at the output of the scaling layer that the output signal present at the output represents a rescaled signal of the input signal, parameters which characterize the rescaling being fixedly predefinable.

Preferably, it may be provided here that the scaling layer maps an input signal present at the input of the scaling layer in such a way to an output signal present at the output of the scaling layer that this mapping corresponds to a projection to a ball, center c and/or radius p of this ball being fixedly predefinable. As an alternative, it is also possible that these parameters, as well as other parameters of the neural network, may be adapted during the course of the training.

In the process, the mapping may be given by equation y=argmin_N₁_(y-c)≤ρN₂(x−y) using a first norm (N₁) and a second norm (N₂). The term “norm” shall be understood in the mathematical sense in the process.

In one refinement of the present invention which may be computed particularly efficiently, it may be provided that first norm N₁and second norm N₂are selected to be identical.

As an alternative or in addition, first norm N₁may be an L^∞ norm. This norm may also be computed particularly efficiently, in particular, also when first norm N₁and second norm N₂are selected to be dissimilar.

As an alternative, it may be provided that first norm N₁is an L norm. This selection of the first norm favors the sparsity of the output signal of the scaling layer. This is advantageous, for example, for the compression of neural networks since weights having the value 0 do not contribute to the output value of their layer.

A neural network including such a layer may thus be used in a particularly memory-efficient manner, in particular in conjunction with a compression method.

In the described variants for first norm N₁in accordance with example embodiments of the present invention, it may advantageously be provided that second norm N₂is an L²norm. In this way, the methods may be implemented particularly easily.

It is particularly advantageous in the process when equation y=argmin_N₁_(y-c)≤ρN₂(x−y) is solved with the aid of a deterministic Newton's method.

Surprisingly, it was found that this method is particularly efficient when an input signal including many important, i.e., heavily weighted, features is present at the input of the scaling layer.

Specific embodiments of the present invention are described hereafter in greater detail with reference to the figures.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 schematically shows a design of one specific embodiment of a control system, in accordance with the present invention.

FIG. 2 schematically shows one exemplary embodiment for controlling an at least semi-autonomous robot, in accordance with the present invention.

FIG. 3 schematically shows one exemplary embodiment for controlling a production system, in accordance with the present invention.

FIG. 4 schematically shows one exemplary embodiment for controlling a personal assistant, in accordance with the present invention.

FIG. 5 schematically shows one exemplary embodiment for controlling an access system, in accordance with the present invention.

FIG. 6 schematically shows one exemplary embodiment for controlling a monitoring system, in accordance with the present invention.

FIG. 7 schematically shows one exemplary embodiment for controlling a medical imaging system, in accordance with the present invention.

FIG. 8 schematically shows a training system, in accordance with the present invention.

FIG. 9 schematically shows a design of a neural network, in accordance with the present invention.

FIG. 10 schematically shows an information forwarding within the neural network, in accordance with the present invention.

FIG. 11 shows one specific embodiment of a training method in a flowchart, in accordance with the present invention.

FIG. 12 shows one specific embodiment of a method for estimating a gradient in a flowchart, in accordance with the present invention.

FIG. 13 shows one alternative specific embodiment of the method for estimating the gradient in a flowchart, in accordance with the present invention.

FIG. 14 shows one specific embodiment of a method for scaling the estimated gradient in a flowchart, in accordance with the present invention.

FIGS. 15a)-5c) show specific embodiments for implementing a scaling layer within the neural network in flowcharts, in accordance with the present invention.

FIG. 16 shows a method for operating the trained neural network in a flowchart, in accordance with an example embodiment of the present invention.

DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS

FIG. 1 shows an actuator 10 in its surroundings 20 in interaction with a control system 40. Actuator 10 and surroundings 20 are collectively also referred to as an actuator system. A state of the actuator system is detected at preferably regular intervals by a sensor 30, which may also be a multitude of sensors. Sensor signal S, or in the case of multiple sensors a respective sensor signal S, of sensor 30 is transmitted to control system 40. Control system 40 thus receives a sequence of sensor signals S. Control system 40 ascertains activation signals A therefrom, which are transferred to actuator 10.

Sensor 30 is an arbitrary sensor, which detects a state of surroundings 20 and transmits it as sensor signal S. It may be an imaging sensor, for example, in particular, an optical sensor such as an image sensor or a video sensor, or a radar sensor, or an ultrasonic sensor, or a LIDAR sensor. It may also be an acoustic sensor, which receives structure-borne noise or voice signals, for example. The sensor may also be a position sensor (such as for example GPS), or a kinematic sensor (for example a single-axis or multi-axis acceleration sensor). A sensor which characterizes an orientation of actuator 10 in surroundings 20 (for example a compass) is also possible. A sensor which detects a chemical composition of surroundings 20, for example a lambda sensor, is also possible. As an alternative or in addition, sensor 30 may also include an information system which ascertains a piece of information about a state of the actuator system, such as for example a weather information system which ascertains an instantaneous or future state of the weather in surroundings 20.

Control system 40 receives the sequence of sensor signals S of sensor 30 in an optional receiving unit 50, which converts the sequence of sensor signals S into a sequence of input signals x (alternatively, it is also possible to directly adopt the respective sensor signal S as input signal x). Input signal x may, for example, be a portion or a further processing of sensor signal S. Input signal x may, for example, encompass image data or images, or individual frames of a video recording. In other words, input signal x is ascertained as a function of sensor signal S. Input signal x is supplied to a neural network 60. Neural network 60 is preferably parameterized by parameters θ, for example encompassing weights w which are stored in a parameter memory P and provided thereby.

Neural network 60 ascertains output signals y from input signals x. Output signals y typically encode a piece of classification information of input signal x. Output signals y are supplied to an optional conversion unit 80, which ascertains activation signals A therefrom, which are supplied to actuator 10 to accordingly activate actuator 10.

Actuator 10 receives activation signals A, is accordingly activated, and carries out a corresponding action. Actuator 10 may include a (not necessarily structurally integrated) activation logic, which ascertains a second activation signal, with which actuator 10 is then activated, from activation signal A.

In further specific embodiments of the present invention, control system 40 includes sensor 30. In still further specific embodiments of the present invention, control system 40 alternatively or additionally also includes actuator 10.

In further preferred specific embodiments of the present invention, control system 40 includes one or multiple processor(s) 45 and at least one machine-readable memory medium 46 on which instructions are stored which, when they are executed on processors 45, prompt control system 40 to execute the method for operating control system 40.

In alternative specific embodiments of the present invention, a display unit 10a is provided as an alternative or in addition to actuator 10.

FIG. 2 shows one exemplary embodiment in which control system 40 is used for controlling an at least semi-autonomous robot, here an at least partially automated motor vehicle 100.

Sensor 30 may be one of the sensors mentioned in connection with FIG. 1, preferably one or multiple video sensor(s), preferably situated in motor vehicle 100, and/or one or multiple radar sensor(s) and/or one or multiple ultrasonic sensor(s) and/or one or multiple LIDAR sensor(s) and/or one or multiple position sensor(s) (for example GPS).

Neural network 60 may, for example, detect objects in the surroundings of the at least one semi-autonomous robot from input data x. Output signal y may be a piece of information which characterizes where in the surroundings of the at least semi-autonomous robot objects are present. Output signal A may then be ascertained as a function of this piece of information and/or corresponding to this piece of information.

Actuator 10 preferably situated in motor vehicle 100 may, for example, be a brake, a drive or a steering system of motor vehicle 100. Activation signal A may then be ascertained in such a way that actuator or actuators 10 is/are activated in such a way that motor vehicle 100, for example, prevents a collision with the objects identified by neural network 60, in particular, when objects of certain classes, e.g., pedestrians, are involved. In other words, activation signal A may be ascertained as a function of the ascertained class and/or corresponding to the ascertained class.

As an alternative, the at least semi-autonomous robot may also be another mobile robot (not shown), for example one which moves by flying, swimming, diving or walking. The mobile robot may, for example, also be an at least semi-autonomous lawn mower or an at least semi-autonomous cleaning robot. Activation signal A may also be ascertained in these cases in such a way that the drive and/or steering system of the mobile robot is/are activated in such a way that the at least semi-autonomous robot, for example, prevents a collision with the objects identified by neural network 60.

In one further alternative, the at least semi-autonomous robot may also be a garden robot (not shown), which ascertains a type or a condition of plants in surroundings 20 using an imaging sensor 30 and neural network 60. Actuator 10 may then be an applicator of chemicals, for example. Activation signal A may be ascertained as a function of the ascertained type or the ascertained condition of the plants in such a way that an amount of the chemicals corresponding to the ascertained type or the ascertained condition is applied.

In still further alternatives, the at least semi-autonomous robot may also be a household appliance (not shown), in particular, a washing machine, a stove, an oven, a microwave or a dishwasher. Using sensor 30, for example an optical sensor, a state of an object treated with the household appliance may be detected, for example in the case of a washing machine, a state of the laundry situated in the washing machine. Using neural network 60, a type or a state of this object may then be ascertained and characterized by output signal y. Activation signal A may then be ascertained in such a way that the household appliance is activated as a function of the ascertained type or the ascertained state of the object. For example, in the case of the washing machine, the washing machine may be activated as a function of the material of which the laundry situated therein is made. Activation signal A may then be selected depending on which material of the laundry was ascertained.

FIG. 3 shows one exemplary embodiment in which control system 40 is used for activating a manufacturing machine 11 of a manufacturing system 200, in that an actuator 10 controlling this manufacturing machine 11 is activated. Manufacturing machine 11 may, for example, be a machine for stamping, sawing, drilling and/or cutting.

Sensor 30 may be one of the sensors mentioned in connection with FIG. 1, preferably an optical sensor which, e.g., detects properties of manufacturing products 12. It is possible that actuator 10 controlling manufacturing machine 11 is activated as a function of the ascertained properties of manufacturing products 12, so that manufacturing machine 11 accordingly executes a subsequent processing step of these manufacturing products 12. It is also possible that sensor 30 ascertains the properties of manufacturing products 12 processed by manufacturing machine 11 and, as a function thereof, adapts an activation of manufacturing machine 11 for a subsequent manufacturing product.

FIG. 4 shows one exemplary embodiment in which control system 40 is used for controlling a personal assistant 250. Sensor 30 may be one of the sensors mentioned in connection with FIG. 1. Sensor 30 is preferably an acoustic sensor which receives voice signals of a user 249. As an alternative or in addition, sensor 30 may also be configured to receive optical signals, for example video images of a gesture of user 249.

As a function of the signals of sensor 30, control system 40 ascertains an activation signal A of personal assistant 250, for example in that the neural network carries out a gesture recognition. This ascertained activation signal A is then transmitted to personal assistant 250, and it is thus accordingly activated. This ascertained activation signal A may then, in particular, be selected in such a way that it corresponds to a presumed desired activation by user 249. This presumed desired activation may be ascertained as a function of the gesture recognized by neural network 60. Control system 40 may then, as a function of the presumed desired activation, select activation signal A for the transmission to personal assistant 250 and/or select activation A for the transmission to the personal assistant corresponding to the presumed desired activation 250.

This corresponding activation may, for example, include that personal assistant 250 retrieves pieces of information from a database, and renders them adoptable for user 249.

Instead of personal assistant 250, a household appliance (not shown), in particular, a washing machine, a stove, an oven, a microwave or a dishwasher may also be provided to be accordingly activated.

FIG. 5 shows one exemplary embodiment in which control system 40 is used for controlling an access system 300. Access system 300 may encompass a physical access control, for example a door 401. Sensor 30 may be one of the sensors mentioned in connection with FIG. 1, preferably an optical sensor (for example for detecting image or video data) which is configured to detect a face. This detected image may be interpreted with the aid of neural network 60. For example, the identity of a person may be ascertained. Actuator 10 may be a lock which releases, or does not release, the access control as a function of activation signal A, for example opens, or does not open, door 401. For this purpose, activation signal A may be selected as a function of the interpretation of neural network 60, for example as a function of the ascertained identity of the person. Instead of the physical access control, a logic access control may also be provided.

FIG. 6 shows one exemplary embodiment in which control system 40 is used for controlling a monitoring system 400. This exemplary embodiment differs from the exemplary embodiment shown in FIG. 5 in that, instead of actuator 10, display unit 10a is provided, which is activated by control system 40. For example, it may be ascertained by neural network 60 whether an object recorded by the optical sensor is suspicious, and activation signal A may then be selected in such a way that this object is represented highlighted in color by display unit 10a.

FIG. 7 shows one exemplary embodiment in which control system 40 is used for controlling a medical imaging system 500, for example an MRI, X-ray or ultrasound device. Sensor 30 may, for example, be an imaging sensor, and display unit 10a is activated by control system 40. For example, it may be ascertained by neural network 60 whether an area recorded by the imaging sensor is noticeable, and activation signal A may then be selected in such a way that this area is represented highlighted in color by display unit 10a.

FIG. 8 schematically shows one exemplary embodiment of a training system 140 for training neural network 60 with the aid of a training method. A training data unit 150 ascertains suitable input signals x, which are supplied to neural network 60. For example, training data unit 150 accesses a computer-implemented database in which a set of training data is stored and selects, e.g., randomly, input signals x from the set of training data. Optionally, training data unit 150 also ascertains desired, or “actual,” output signals y_Twhich are assigned to input signals x and supplied to an assessment unit 180.

Artificial neural network x is configured to ascertain associated output signals y from input signals x supplied to it. These output signals y are supplied to assessment unit 180.

Assessment unit 180 may, for example, characterize a performance capability of neural network 60 with the aid of a cost function (loss function) custom-character which is dependent on output signals y and the desired output signals y_T. Parameters θ may be optimized as a function of cost function .

In further preferred specific embodiments, training system 140 includes one or multiple processor(s) 145 and at least one machine-readable memory medium 146 on which instructions are stored which, when they are executed on processors 145, prompt control system 140 to execute the training method.

FIG. 9, by way of example, shows a possible design of neural network 60, which is a neural network in the exemplary embodiment. Neural network includes a multitude of layers S₁, S₂, S₃, S₄, S₅for ascertaining, from input signal x which is supplied to an input of an input layer S₁, output signal y which is present at an output of an output layer S₅. Each of layers S₁, S₂, S₃, S₄, S₅is configured to ascertain, from a (possibly multidimensional) input signal x, z₁, z₃, z₄, z₆which is present at an input of the particular layer S₁, S₂, S₃, S₄, S₅, a (possibly multidimensional) output signal z₁, z₂, z₄, z₅,y which is present at an output of the particular layer S₁, S₂, S₃, S₄, S₅. Such output signals are also referred to as feature maps, specifically in image processing. It is not necessary in the process for layers S₁, S₂, S₃, S₄, S₅to be situated in such a way that all output signals, which are incorporated as input signals in further layers, are each incorporated from a preceding layer into a directly following layer. Instead, skip connections or recurrent connections are also possible. It is also possible, of course, for input signal x to be incorporated in several of the layers, or for output signal x of neural network 60 to be made up of output signals of a multitude of layers.

Output layer S₅may, for example, be an Argmax layer (i.e., a layer which, from a multitude of inputs having respective assigned input values, selects a designation of the input whose assigned input value is the greatest among these input values), and one or multiple of layers S₁, S₂, S₃may be convolutional layers, for example.

A layer S₄is advantageously designed as a scaling layer, which is designed to map an input signal x present at the input of scaling layer S₄in such a way to an output signal y present at the output of scaling layer S₄that output signal y present at the output is a rescaling of input x, parameters which characterize the rescaling being fixedly predefinable.

Exemplary embodiments for methods which scaling layer S₄is able to carry out are described below in connection with FIG. 15.

FIG. 10 schematically illustrates the information forwarding within neural network 60. Shown schematically here are three multidimensional signals within neural network 60, namely input signal x as well as later feature maps z₁, z₂. In the exemplary embodiment, input signal x has a spatial resolution of n_x¹×n_y¹pixels, first feature map z₁has a spatial resolution of n_x²×n_y²pixels, and second feature map z₂has a spatial resolution of n_x³×n_y³pixels. In the exemplary embodiment, the resolution of second feature map z₂is lower than the resolution of input signal x; however, this is not necessarily the case.

Furthermore, a feature, e.g., a pixel, (i,j)₃of second feature map z₂is shown. If the function which ascertains second feature map z₂from first feature map z₁is represented, for example, by a convolutional layer or a fully connected layer, it is also possible that a multitude of features of first feature map z₁is incorporated in the ascertainment of the value of this feature (i,j)₃. However, it is also possible, of course, that only a single feature of first feature map z₁is incorporated in the ascertainment of the value of this feature (i,j)₃.

In the process, “incorporate” may advantageously be understood to mean that a combination of values of the parameters which characterize the function with which second feature map z₂is ascertained from first feature map z₁, and of values of first feature map z₁exists in such a way that the value of feature (i,j)₃depends on the value of the feature being incorporated. The entirety of these features being incorporated is referred to as area Be in FIG. 10.

In turn, one or multiple feature(s) of input signal x is/are incorporated in the ascertainment of each feature (i,j)₂of area Be. The set of all features of input signal x which are incorporated in the ascertainment of at least one of features (i,j)₂of area Be is referred to as receptive field rF of feature (i,j)₃. In other words, receptive field rF of feature (i,j)₃encompasses all those features of input signal x which are directly or indirectly (in other words: at least indirectly) incorporated in the ascertainment of feature (i,j)₃, i.e., whose values may influence the value of feature (i,j)₃.

FIG. 11 shows the sequence of a method for training neural network 60 according to one specific embodiment in a flowchart.

Initially 1000, a training data set X encompassing pairs (x_i,y_i) made up of input signals x_iand respective associated output signals y_iis provided. A learning rate q is initialized, for example at η=1.

Furthermore, a first set G and a second set N are optionally initialized, for example when in step 1100 the exemplary embodiment of this portion of the method illustrated in FIG. 12 is used. If, in step 1100, the exemplary embodiment of this portion of the method illustrated in FIG. 13 is to be used, the initialization of first set G and of second set N may be dispensed with.

The initialization of first set G and of second set N may take place as follows: First set G, which encompasses those pairs (x_i,y_i) of training data set X which were already drawn during the course of a current epoch of the training method is initialized as an empty set. Second set N, which encompasses those pairs (x_i,y_i) of training data set X which were not yet drawn during the course of the current epoch is initialized by assigning all pairs (x_i,y_i) of training data set X to it.

Now 1100, a gradient g of characteristic custom-character with respect to parameters θ is estimated, i.e., g=∇_θ, with the aid of pairs (x_i,y_i) made up of input signals x_iand respective associated output signals y_iof the training data set X. Exemplary embodiments of this method are described in connection with FIG. 12 or 13.

Then 1200, a scaling of gradient g is optionally carried out. Exemplary embodiments of this method are described in connection with FIG. 14.

Thereafter 1300, an adaptation of a learning rate η is optionally carried out. In the process, learning rate η may, for example, be reduced by a predefinable learning rate reduction factor Dη (e.g., Dη=1/10) (i.e., η←η·Dη), provided a number of the passed-through epochs is divisible by a predefinable epoch number, for example 5.

Then 1400, parameters θ are updated with the aid of the ascertained and possibly scaled gradient g and learning rate η. For example, parameters θ are replaced by θ−η·g.

It is now 1500 checked, with the aid of a predefinable convergence criterion, whether the method is converged. For example, it may be decided based on an absolute change in parameters θ (e.g., between the last two epochs) whether or not the convergence criterion is met. For example, the convergence criterion may be met exactly when a L²norm over the change of all parameters θ between the last two epochs is smaller than a predefinable convergence threshold value.

If it was decided that the convergence criterion is met, parameters θ are adopted as learned parameters (step 1600), and the method ends. If not, the method branches back to step 1100.

FIG. 12 illustrates, in a flowchart, an exemplary method for ascertaining gradient g in step 1100.

Initially 1110, a predefinable number bs of pairs (x_i,y_i) of training data set X is to be drawn (without replacement), i.e., selected, and assigned to a batch B. Predefinable number bs is also referred to as a batch size. Batch B is initialized as an empty set.

For this purpose, it is checked 1120 whether batch size bs is greater than the number of pairs (x_i,y_i) which are present in second set N.

If batch size bs is not greater than the number of pairs (x_i,y_i) which are present in second set N, a bs number of pairs (x_i,y_i) are drawn 1130, i.e., selected, randomly from second set N, and added to batch B.

If batch size bs is greater than the number of pairs (x_i,y_i) which are present in second set N, all pairs of second set N whose number is denoted by s are drawn 1140, i.e., selected, and added to batch B, and those remaining, i.e., a bs−s number, are drawn, i.e., selected, from first set G and added to batch B.

Subsequent to 1150 step 1130 or 1140, it is optionally decided for all parameters θ whether or not these parameters θ are to be ignored in this training pass. For this purpose, for example, a probability with which parameters θ of this layer are ignored is separately established for each layer S₁, S₂, . . . , S₆. For example, this probability may be 50% for first layer S₁and be reduced by 10% with each subsequent layer.

With the aid of these established respective probabilities, it may then be decided for each of parameters θ whether or not it is ignored.

It is now 1155 optionally decided for each pair (x_i,y_i) of batch B whether or not the respective input signal x_iis augmented. For each corresponding input signal x_iwhich is to be augmented, an augmentation function is selected, preferably randomly, and applied to input signal x_i. Input signal x_ithus augmented then replaces the original input signal x_i. If input signal x_iis an image signal, the augmentation function may be a rotation by a predefinable angle, for example.

Thereafter 1160, the corresponding (and optionally augmented) input signal x_iis selected for each pair (x_i,y_i) of batch B and supplied to neural network 60. Parameters θ of neural network 60 to be ignored are deactivated in the process during the ascertainment of the corresponding output signal, e.g., in that they are temporarily set to the value zero. The corresponding output signal y(x_i) of neural network 60 is assigned to the corresponding pair (x_i,y_i). Depending on output signals y(x_i) and the respective output signals y_iof pair (x_i,y_i) as the desired output signal y_T, a respective cost function custom-character is ascertained.

Then 1165, the complete cost function custom-character =Σ_i∈B_iis ascertained for all pairs (x_i,y_i) of batch B together, and the corresponding component of gradient g is ascertained for each of parameters θ not to be ignored, e.g., with the aid of backpropagation. For each of parameters θ to be ignored, the corresponding component of gradient g is set to zero.

Now, it is checked 1170 whether it was established, during the check in step 1000, that batch size bs is greater than the number of pairs (x_i,y_i) which are present in second set N.

If it was established that batch size bs is not greater than the number of pairs (x_i,y_i) which are present in second set N, all pairs (x_i,y_i) of batch B are added 1180 to first set G and removed from second set N. It is now checked 1185 whether second set N is empty. If second set N is empty, a new epoch begins (1186). For this purpose, first set G is again initialized as an empty set, and second set N is newly initialized in that all pairs (x_i,y_i) of training data set X are assigned to it again, and the method branches off to step 1200. If second set N is not empty, the method branches off directly to step 1200.

If it was established that batch size bs is greater than the number of pairs (x_i,y_i) which are present in second set N, first set G is re-initialized 1190 by assigning to it all pairs (x_i,y_i) of batch B, second set N is newly initialized by assigning to it again all pairs (x_i,y_i) of training data set X, and subsequently pairs (x_i,y_i) which are also present in batch B are removed. Thereafter, a new epoch begins, and the method branches off to step 1200. With this, this portion of the method ends.

FIG. 13 illustrates, in a flowchart, another exemplary method for ascertaining gradient g in step 1100. First, parameters of the method are initialized 1111. Hereafter, the mathematical space of parameters θ is denoted by W. If parameters θ thus encompass an np number of individual parameters, space W is an np-dimensional space, for example W= custom-character ^np. An iteration counter n is initialized to the value n=0, a first variable m₁is then set as m₁=0∈W (i.e., as np-dimensional vector), and a second variable as m₂=0∈W⊗W (i e., as np×np-dimensional matrix).

Thereafter 1121, a pair (x_i,y_i) is randomly selected from training data set X and, if necessary, is augmented. This may, for example, take place in such a way that, for each input signal x_iof pairs (x_i,y_i) of training data set X, a μ(α(x_i)) number of possible augmentations α(x_i) is ascertained, and to each pair (x_i,y_i) a position variable

$\begin{matrix} p_{i} = \frac{\sum_{j < i} p_{j}}{\sum_{j} p_{j}} & (2) \end{matrix}$

is assigned. If a random number φ∈[0;1] is then drawn in a uniformly distributed manner, position variable p_iwhich meets the inequation chain

p
_i
≤φ<p
_i+1 (3)

may be selected. The associated index i then denotes the selected pair (x_i,y_i), and an augmentation α_iof input variable x_imay be drawn randomly from the set of possible augmentations α(x_i) and be applied to input variable x_i, i.e., the selected pair (x_i,y_i) is replaced by (α_i(x_i),y).

Input signal x_iis supplied to neural network 60. Depending on the corresponding output signal y(x_i) and output signal y_iof pair (x_i,y_i) as the desired output signal Y_T, the corresponding cost function custom-character is ascertained. For parameters θ, a gradient d in this regard is ascertained, e.g., with the aid of backpropagation, i.e., d=∇₀(y(x_i),y_i).

Then 1131, iteration counter n, first variable m₁and second variable m₂are updated as follows:

$\begin{matrix} n \leftarrow n + 1 t = \frac{1}{n} & (4) \\ m_{1} = (1 - t) \cdot m_{1} + t \cdot d & (5) \\ m_{2} = (1 - t) \cdot m_{2} + t \cdot (d \cdot d^{T}) & (6) \end{matrix}$

Thereafter 1141, components C_a,bof a covariance matrix C are provided as

$\begin{matrix} C_{a, b} = \frac{1}{n} {(m_{2} - m_{1} \cdot m_{1}^{T})}_{a, b} . & (7) \end{matrix}$

From this, using the (vector-valued) first variable m₁, a scalar product S is formed, i.e.,

S=
custom-character
₁
,C
⁻¹
m
₁
. (8)

It shall be understood that for the sufficiently precise ascertainment of scalar product S using equation (8), not all entries of covariance matrix C or of the inverse C⁻¹must be present at the same time. It is more memory-efficient, during the evaluation of equation (8), to determine entries C_a,bof covariance matrix C needed then.

It is then checked 1151 whether this scalar product S meets the following inequation:

S≥λ
², (9)

λ being a predefinable threshold value which corresponds to a confidence level.

If the inequation is met, the current value of first variable m₁is adopted as estimated gradient g (step 1161) and the method branches back to step 1200.

If the inequation is not met, the method can branch back to step 1121. As an alternative, it may also be checked 1171 whether iteration counter n has reached a predefinable maximum iteration value n_max. If this is not the case, the method branches back to step 1121; otherwise, zero vector 0∈W is adopted 1181 as estimated gradient g, and the method branches back to step 1200. With this, this portion of the method ends.

As a result of this method, it is achieved that m₁corresponds to an arithmetic mean of the ascertained gradient d over the drawn pairs (x_i,y_i), and m₂corresponds to an arithmetic mean of a matrix product d·d^Tof the ascertained gradient d over the drawn pairs (x_i,y_i).

FIG. 14 shows one specific embodiment of the method for scaling gradient g in step 1200. Hereafter, each component of gradient g is denoted by a pair (ι,l), ι∈{1, . . . , k} denoting a layer of the corresponding parameter θ, and l∈{1, . . . , dim(V_i)} denoting a numbering of the corresponding parameter θ within the ι-th layer. If the neural network is designed, as illustrated in FIG. 10, for processing multidimensional input data x using corresponding feature maps z_ι in the ι-th layer, numbering l is advantageously given by the position of the feature in feature map z_iwith which the corresponding parameter θ is associated.

Now 1210, a scaling factor Ω_ι,lis ascertained for each component g_ι,lof gradient g. For example, this scaling factor Ω_ι,lmay be the size of receptive field rF of the feature of the feature map of the ι-th layer corresponding to l. As an alternative, scaling factor Ω_ι,lmay also be a ratio of the resolutions, i.e., the number of features, of the ι-th layer in relation to the input layer.

Then 1220, each component g_ι,lof gradient g is scaled using scaling factor Ω_ι,l, i.e.,

g
_ι,l
←g
_ι,l/Ω_ι,l. (10)

If scaling factor Ω_ι,lis given by the size of receptive field rF, overfitting of parameters θ may be avoided particularly effectively. If scaling factor Ω_ι,lis given by the ratio of the resolutions, this is a particularly efficient approximate estimation of the size of receptive field rF.

FIGS. 15a)-15c) illustrate specific embodiments of the method which is executed by scaling layer S₄.

Scaling layer S₄is configured to achieve a projection of input signal x present at the input of scaling layer S₄to a ball, having radius ρ and center c. This is characterized by a first norm N₁(y−c), which measures a distance of center c from output signal x present at the output of scaling layer S₄, and a second norm N₂(x−y), which measures a distance of input signal x present at the input of scaling layer S₄from output signal y present at the output of scaling layer S₄. In other words, output signal y present at the output of scaling layer S₄solves equation

y=argmin_N₁_(y-c)≤ρN₂(x−y). (11)

FIG. 15 a) illustrates a particularly efficient first specific embodiment for the case that first norm N₁and a second norm N₂are identical. They are denoted hereafter by ∥⋅∥.

Initially 2000, an input signal x present at the input of scaling layer S₄, a center parameter c and a radius parameter ρ are provided.

Then 2100, an output signal y present at the output of scaling layer S₄is ascertained as

$\begin{matrix} y = c + \frac{ρ \cdot (x - c)}{\max (ρ,  x - c )} . & (12) \end{matrix}$

With this, this portion of the method ends.

FIGS. 15b) and 15c) illustrate specific embodiments for particularly advantageously selected combinations of first norm N₁and second norm N₂.

FIG. 15b) illustrates a second specific embodiment for the case that, in condition 12 to be met, first norm N₁(⋅) is maximum norm and second norm N₂(⋅) is 2-norm∥⋅∥₂. This combination of norms may be computed particularly efficiently.

First 3000, similarly to step 2000, input signal x present at the input of scaling layer S₄, center parameter c and radius parameter ρ are provided.

Then 3100, components y_iof output signal y present at the output of scaling layer S₄are ascertained as

$\begin{matrix} y_{i} = {\begin{matrix} c_{i} + ρ & if x_{i} - c_{i} > ρ \\ c_{i} - ρ & if x_{i} - c_{i} < - ρ \\ x_{i} & else \end{matrix}, & (13) \end{matrix}$

i denoting the components here.

This method is particular processing-efficient. With this, this portion of the method ends.

FIG. 15c) illustrates a third specific embodiment for the case that, in condition 12 to be met, first norm N₁(⋅) is 1-norm∥⋅∥₁, and second norm N₂(⋅) is 2-norm∥⋅∥₂. As a result of this combination, as small components as possible are set to the value zero in input signal x present at the input of scaling layer S₄.

First 4000, similarly to step 2000, input signal x present at the input of scaling layer S₄, center parameter c and radius parameter ρ are provided.

Then 4100, a sign variable ϵ_iis ascertained as

$\begin{matrix} ϵ_{i} = {\begin{matrix} + 1 & if x_{i} \geq c_{i} \\ - 1 & if x_{i} < c_{i} \end{matrix} & (14) \end{matrix}$

and components x_iof input signal x present at the input of scaling layer S₄are replaced by

x
_i←ϵ_i·(x_i−c). (15)

An auxiliary parameter γ is initialized to the value zero.

Then 4200, a set N is ascertained as N={i|x_i>γ} and a distance dimension D=ΣΣ_i∈N(x_i−γ).

Then 4300, it is checked whether inequation

D>ρ (16)

is met.

If this is the case 4400, auxiliary parameter γ is replaced by

$\begin{matrix} γ \leftarrow γ + \frac{D - ρ}{\langle N \rangle}, & (17) \end{matrix}$

and the method branches back to step 4200.

If inequation (16) is not met 4500, components y_iof output signal y present at the output of scaling layer S₄is ascertained as

y
_i
=c
_i+ϵ_i·(x_i−γ)₊ (18)

Notation (⋅)₊ usually denotes

$\begin{matrix} {(ξ)}_{+} = {\begin{matrix} ξ & if ξ > 0 \\ 0 & else] \end{matrix} . & (19) \end{matrix}$

With this, this portion of the method ends. This method corresponds to a Newton's method and is particularly processing-efficient, in particular, when many of the components of input signal x present at the input of scaling layer S₄are important.

FIG. 16 illustrates one specific embodiment of a method for operating neural network 60. First 5000, the neural network is trained using one of the described methods. Then 5100, control system 40 is operated as described using neural network 60 thus trained. With this, the method ends.

It shall be understood that the neural network is not limited to feedforward neural networks, but that the present invention may equally be applied to any kind of neural network, in particular, recurrent networks, convolutional neural networks, autoencoders, Boltzmann machines, perceptrons or capsule neural networks.

The term “computer” encompasses arbitrary devices for processing predefinable processing rules. These processing rules may be present in the form of software, or in the form of hardware, or also in a mixed form made up of software and hardware.

It shall furthermore be understood that the methods cannot only be implemented completely in software as described. They may also be implemented in hardware, or in a mixed form made up of software and hardware.

METHOD FOR TRAINING A NEURAL NETWORK

Information

Publication Number

Date Filed

Date Published

Inventors

Original Assignees

CPC

International Classifications

Abstract

Description

Claims

Priority Claims (1)

PCT Information