This application claims priority under 35 U.S.C. § 119 to Patent Application No. DE 10 2021 212 727.4, filed on Nov. 11, 2021 in Germany, the disclosure of which is incorporated herein by reference in its entirety.
The disclosure relates to a method for generating training data for training a machine learning algorithm, and in particular to a method designed to generate additional training data in a simple manner and with low resource consumption.
Machine learning algorithms are based on statistical methods being used to train a data processing system in such a way that it can perform a particular task without it being originally programmed explicitly for this purpose. The goal of machine learning is to construct algorithms that can learn and make predictions from data. These algorithms create mathematical models with which data can be classified, for example.
A system to be modeled can, for example, be captured by means of measurements, wherein an empirical model can be created based on the measured values and a machine learning algorithm can be trained accordingly. In this case, however, situations may occur in which it is impossible to completely measure the process or system to be modeled. As a result, only partial data from a subspace may be available for the empirical modeling or the corresponding training of the machine learning algorithm, although process states that are not captured by these training data can also occur in operation.
As a solution to this problem, augmentation methods, i.e., methods for generating additional training data, have been proposed. However, with known augmentation methods, it proves disadvantageous that they are very complex and require many computer resources, in particular storage and computing capacities, so that they are difficult to realize with ordinary data processing systems.
A method for learning a data supplementation strategy for training a machine learning algorithm is known from the publication US 2019/0354895 A1. In that method, training data for training a machine learning algorithm are received and a plurality of data supplementation strategies are determined. In each case, a current data supplementation strategy is generated based on quality parameters of previous data supplementation strategies, the machine learning algorithm is trained based on the current data supplementation strategy, and quality parameters with respect to the current data supplementation strategy are determined after this training. A data supplementation strategy is subsequently selected based on the quality parameters of the individual data supplementation strategies.
The disclosure is thus based on the object of specifying an improved method for generating training data for training a machine learning algorithm.
The object is achieved by a method for generating training data for training a machine learning algorithm as disclosed herein. The object is also achieved by a control device for generating training data for training a machine learning algorithm as disclosed herein. Advantageous embodiments and developments emerge from the dependent claims and from the description with reference to the figures.
According to one embodiment of the disclosure, this object is achieved by a method for generating training data for training a machine learning algorithm, wherein the training data respectively comprise a data point and a data value associated with the data point, and wherein first training data are provided for training the machine learning algorithm, a manifold in which at least one part of the data points of the first training data is located is approximated, a structure of the at least one part of the data points of the first training data in the manifold is determined, and additional training data are generated based on the structure of the at least one part of the data points of the first training data in the manifold.
Data points are understood herein as information carriers or units of information representing input variables of the machine learning algorithm, i.e., data that can be processed by the machine learning algorithm.
Data values or function values are furthermore understood as information carriers and units of information respectively representing an output variable of the machine learning algorithm, i.e., an output variable generated by processing a corresponding input variable by means of the machine learning algorithm.
A manifold is understood to mean a structure with which points in an n-dimensional space can be represented or determined with (n-1) or fewer coordinates. Approximating a manifold in which at least one part of the data points of the first training data is located thus means that data points from an n-dimensional space corresponding to the at least one part of the data points of the first training data can be determined in the manifold with (n-1) or fewer coordinates.
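As a small illustration of this definition (an assumed example that is not taken from the disclosure), points on a unit circle lie in a two-dimensional space but can each be addressed with a single coordinate, the angle. The function name `circle_point` is hypothetical.

```python
import math


def circle_point(theta: float) -> tuple[float, float]:
    """Map one manifold coordinate (an angle) to a point in 2-D space."""
    return (math.cos(theta), math.sin(theta))


# Eight data points in a 2-D space, each fully determined by one coordinate.
angles = [2.0 * math.pi * k / 8 for k in range(8)]
points = [circle_point(t) for t in angles]
```

Here n = 2 and the manifold needs only n-1 = 1 coordinate per data point.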
The structure of the at least one part of the data points of the first training data in the manifold is furthermore understood as a connection, or a mathematical relationship, between the coordinates of the respective data points in the manifold.
Approximating a manifold in which at least one part of the data points of the first training data is located, wherein the additional training data is subsequently generated based on the approximated manifold, has the advantage that the number of dimensions or coordinates to be processed during the generation of additional training data can be significantly reduced and the effort associated with the generation of additional training data can thus be significantly simplified.
Generating the additional training data based on the structure of the at least one part of the data points of the first training data in the manifold moreover has the advantage that interdependencies of the training data are considered and the generated additional training data are consistent with the at least one part of the first training data, for example all have a certain property.
Overall, a method is thus specified with which the generation of additional training data can be significantly simplified and additional training data can be generated in a simple manner and with comparatively low resource consumption, e.g., low storage and/or computing capacities. For example, if the first training data are time points from time series, the effort associated with generating additional training data can be significantly simplified so that the method can in particular even be performed on control devices with limited computer resources.
Overall, an improved method for generating training data for training a machine learning algorithm is thus specified.
In one embodiment, the step of approximating the at least one manifold in which the at least one part of the data points of the first training data is located comprises, for each data point from the first training data, determining the nearest neighbors of the respective data point within the data points of the first training data, wherein, for each data point of the first training data, the structure of the at least one part of the data points of the first training data in the manifold is respectively determined based on the data point and the nearest neighbors of the respective data point. The approximation of the manifold or the generation of the additional training data can thus take place in a simple manner based on the respective nearest neighbors and especially a neighborhood graph, i.e., with comparatively low computer resources.
The structure of the at least one part of the data points of the first training data in the manifold may be determined, for example, based on a principal component analysis.
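A minimal sketch of how such a local structure could be computed is given below, assuming NumPy is available and assuming that the structure is taken to be the mean and the principal directions of a data point's neighborhood. The function name `local_structure` and the parameter `k` are hypothetical and do not come from the disclosure.

```python
import numpy as np


def local_structure(points: np.ndarray, index: int, k: int):
    """Return (mean, principal directions, variances) of a point's neighborhood."""
    # Euclidean distances from the selected data point to every data point.
    dists = np.linalg.norm(points - points[index], axis=1)
    # The point itself has distance 0; keep it together with its k nearest neighbors.
    neighborhood = points[np.argsort(dists, kind="stable")[: k + 1]]
    mean = neighborhood.mean(axis=0)
    # SVD of the centered neighborhood: rows of vt are the principal directions.
    _, singular_values, vt = np.linalg.svd(neighborhood - mean, full_matrices=False)
    variances = singular_values**2 / (len(neighborhood) - 1)
    return mean, vt, variances


# Data points on the line y = 2x: one principal direction carries all variance.
points = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
mean, directions, variances = local_structure(points, index=2, k=2)
```

In this example the first principal direction is parallel to (1, 2) and the variance along the second direction is numerically zero, which reflects that the data points occupy a one-dimensional manifold.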
In this case, for each data point of the first training data, the nearest neighbors can be determined based on the Euclidean norm.
The Euclidean norm or standard norm serves to determine the distance between two points or vectors, especially in a two- or three-dimensional space.
The nearest neighbors may thus also be respectively determined in a simple manner and with low consumption of computer resources.
The nearest neighbors being determined based on the Euclidean norm is, however, only one possible embodiment. Rather, the nearest neighbors may also be determined based on other methods for determining a distance between individual data points.
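The neighbor search can be sketched as follows, using only the Python standard library. The function names `nearest_neighbors` and `manhattan` are hypothetical; the Manhattan distance is shown only as one assumed example of another distance measure.

```python
import math


def nearest_neighbors(points, index, k, distance=math.dist):
    """Return the indices of the k points closest to points[index], excluding itself."""
    candidates = [i for i in range(len(points)) if i != index]
    candidates.sort(key=lambda i: distance(points[index], points[i]))
    return candidates[:k]


def manhattan(a, b):
    """Alternative distance measure: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
euclid = nearest_neighbors(points, 0, 2)                      # Euclidean norm
city = nearest_neighbors(points, 0, 2, distance=manhattan)    # other distance
```

Passing the distance as a parameter keeps the neighbor search independent of the norm that is chosen.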
Furthermore, for each data point in the additional training data, the method may comprise respectively determining a data value for the respective data point based on data values associated with the nearest neighbors of the respective data point.
In particular, since the nearest neighbors have already been determined during the generation of the additional training data, the corresponding data values can thus also be determined without great effort and with low computer resources.
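One possible way to derive such a data value is sketched below. The disclosure does not state how the neighbors' data values are combined, so taking their arithmetic mean is an assumption, and the function name `value_from_neighbors` is hypothetical.

```python
from statistics import fmean


def value_from_neighbors(values, neighbor_indices):
    """Assumed rule: average the data values of the already-known nearest neighbors."""
    return fmean(values[i] for i in neighbor_indices)


values = [0.0, 2.0, 4.0, 20.0]   # data values of the first training data
neighbor_indices = [1, 2]        # indices found during the neighbor search
estimate = value_from_neighbors(values, neighbor_indices)
```

Because the neighbor indices are reused, no additional distance computation is needed at this point.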
The first training data may furthermore be sensor data or data captured by a sensor.
A sensor, which is also referred to as a detector, (measurement or measuring) sensor or (measuring) transmitter, is a technical part that can qualitatively detect particular physical or chemical properties and/or the material characteristics of its surroundings or detect them quantitatively as a measured variable.
Thus, circumstances outside the actual data processing system on which the additional training data are generated can be captured in a simple manner and can be taken into account when generating the additional training data.
With a further embodiment of the disclosure, a method for training a machine learning algorithm is also specified, wherein first training data and additional training data are provided by a method described above for generating training data for training a machine learning algorithm, and wherein the machine learning algorithm is trained based on the first training data and the additional training data.
A method for training a machine learning algorithm which is based on training data generated by an improved method for generating training data for training a machine learning algorithm is thus specified. In particular, the method is based on a method for generating training data for training a machine learning algorithm, with which the generation of additional training data can be significantly simplified and additional training data can be generated in a simple manner and with comparatively low resource consumption, e.g., low storage and/or computing capacities. For example, if the first training data are time points from time series, the effort associated with generating additional training data can be significantly simplified so that the method for generating training data for training the machine learning algorithm can in particular even be performed on control devices with limited computer resources.
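Training on the combined data can be illustrated with a deliberately simple stand-in. The least-squares line fit below is an assumption chosen for brevity and is not the machine learning algorithm of the disclosure; the name `fit_line` and all data values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = a*x + b (a stand-in for any learner)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


# First training data and additional training data (illustrative values).
first_x, first_y = [0.0, 1.0, 2.0], [1.0, 3.0, 5.0]
extra_x, extra_y = [0.5, 1.5], [2.0, 4.0]

# The model is trained on the union of both data sets.
slope, intercept = fit_line(first_x + extra_x, first_y + extra_y)
```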
With a further embodiment of the disclosure, a method for controlling at least one function of a controllable system is also specified, wherein a machine learning algorithm is provided for controlling the at least one function of the controllable system, wherein the machine learning algorithm has been trained by a method described above for training a machine learning algorithm, and wherein the at least one function of the controllable system is controlled based on the machine learning algorithm.
The controllable system may, for example, be a robotic system, wherein the robotic system may, for example, be an injection system of an internal combustion engine. However, the robotic system may also be any other system that can be controlled based on a machine learning algorithm, e.g., a driver assistance system of a motor vehicle, a kitchen appliance or a washing machine.
A method is thus specified for controlling at least one function of a controllable system that is based on a machine learning algorithm that has been trained based on training data generated by an improved method for generating training data for training a machine learning algorithm. In particular, the training data in this case have been generated by a method for generating training data for training a machine learning algorithm, with which the generation of additional training data can be significantly simplified and additional training data can be generated in a simple manner and with comparatively low resource consumption, e.g., low storage and/or computing capacities. For example, if the first training data are time points from time series, the effort associated with generating additional training data can be significantly simplified so that the method for generating training data for training the machine learning algorithm can in particular even be performed on control devices with limited computer resources.
With a further embodiment of the disclosure, a control device for generating training data for training a machine learning algorithm is also specified, wherein the training data respectively comprise a data point and a data value, and wherein the control device comprises a provision unit designed to provide first training data, an approximation unit designed to approximate a manifold in which at least one part of the data points of the first training data is located, a determination unit designed to determine a structure of the at least one part of the data points of the first training data in the manifold, and a generation unit designed to generate additional training data based on the structure of the at least one part of the data points of the first training data in the manifold.
Overall, an improved control device for generating training data for training a machine learning algorithm is thus specified. In particular, a control device is specified with which the generation of additional training data can be significantly simplified and additional training data can be generated in a simple manner and with comparatively low resource consumption, e.g., low storage and/or computing capacities. For example, if the first training data are time points from time series, the effort associated with generating additional training data can be significantly simplified so that the control device can in particular even be a control device with limited computer resources.
In one embodiment, the approximation unit is designed to determine, for each data point from the first training data, the nearest neighbors within the data points of the first training data in order to approximate the manifold in which the at least one part of the data points of the first training data is located, wherein the determination unit is designed to respectively determine, for each data point from the first training data, the structure of the at least one part of the data points of the first training data in the manifold based on the data point and the nearest neighbors of the respective data point. The approximation of the manifold or the generation of the additional training data can thus take place in a simple manner based on the respective nearest neighbors and especially a neighborhood graph, i.e., with comparatively low computer resources.
The structure of the at least one part of the data points of the first training data in the manifold can in this case again be determined, for example, based on a principal component analysis, i.e., the determination unit can be designed accordingly.
The approximation unit can furthermore be designed to respectively determine, for each data point from the first training data, the nearest neighbors based on the Euclidean norm. The nearest neighbors may thus also be respectively determined in a simple manner and with low consumption of computer resources.
Moreover, the control device may comprise a determination unit designed to respectively determine, for each data point in the additional training data, a data value for the respective data point based on data values associated with the nearest neighbors of the respective data point. In particular, since the nearest neighbors have already been determined during the generation of the additional training data, the corresponding data values can thus also be determined without great effort and with low computer resources.
Again, the first training data may furthermore be sensor data or data captured by a sensor. Thus, circumstances outside the actual data processing system on which the additional training data are generated can be captured in a simple manner and can be taken into account when generating the additional training data.
With a further embodiment of the disclosure, a control device for training a machine learning algorithm is also specified, wherein the control device comprises a provision unit designed to provide first training data and additional training data, wherein the additional training data have been generated by a control device described above for generating training data for training a machine learning algorithm, and a training unit designed to train the machine learning algorithm based on the first training data and the additional training data.
A control device for training a machine learning algorithm, which is designed to train a machine learning algorithm based on training data generated by an improved method for generating training data for training a machine learning algorithm, is thus specified. In particular, the additional training data in this case are generated by a method for generating training data for training a machine learning algorithm, with which the generation of additional training data can be significantly simplified and additional training data can be generated in a simple manner and with comparatively low resource consumption, e.g., low storage and/or computing capacities. For example, if the first training data are time points from time series, the effort associated with generating additional training data can be significantly simplified so that the corresponding method for generating training data for training the machine learning algorithm can in particular even be performed on control devices with limited computer resources.
With a further embodiment of the disclosure, a control device for controlling at least one function of a controllable system is also specified, wherein the control device comprises a provision unit designed to provide a machine learning algorithm for controlling the at least one function of the controllable system, wherein the machine learning algorithm has been trained by a control device described above for training a machine learning algorithm, and a control unit designed to control the at least one function of the controllable system based on the machine learning algorithm.
A control device is thus specified for controlling at least one function of a controllable system that is based on a machine learning algorithm that has been trained based on training data generated by an improved method for generating training data for training a machine learning algorithm. In particular, the training data in this case have been generated by a method for generating training data for training a machine learning algorithm, with which the generation of additional training data can be significantly simplified and additional training data can be generated in a simple manner and with comparatively low resource consumption, e.g., low storage and/or computing capacities. For example, if the first training data are time points from time series, the effort associated with generating additional training data can be significantly simplified so that the method for generating training data for training the machine learning algorithm can in particular even be performed on control devices with limited computer resources.
In summary, it must be noted that the disclosure specifies a method for generating training data for training a machine learning algorithm, and in particular a method designed to generate additional training data in a simple manner and with low resource consumption.
The described embodiments and developments can be combined with one another as desired.
Other possible embodiments, developments and implementations of the disclosure also include not explicitly mentioned combinations of features of the disclosure described above or below with respect to exemplary embodiments.
The accompanying drawings are intended to provide a further understanding of the embodiments of the disclosure. They illustrate embodiments and, in connection with the description, serve to explain principles and concepts of the disclosure.
Other embodiments and many of the mentioned advantages become apparent from the drawings. The illustrated elements of the drawings are not necessarily shown to scale with respect to one another. In the figures:
In the figures of the drawings, identical reference numbers denote identical or functionally identical elements, parts or components, unless stated otherwise.
Machine learning algorithms are based on statistical methods being used to train a data processing system in such a way that it can perform a particular task without it being originally programmed explicitly for this purpose. The goal of machine learning is to construct algorithms that can learn and make predictions from data. These algorithms create mathematical models with which data can be classified, for example.
A system to be modeled can, for example, be captured by means of measurements, wherein an empirical model can be created based on the measured values and a machine learning algorithm can be trained accordingly. In this case, however, situations may occur in which it is impossible to completely measure the process or system to be modeled. As a result, only partial data from a subspace may be available for the empirical modeling or the corresponding training of the machine learning algorithm, although process states that are not captured by these training data can also occur in operation.
As a solution to this problem, augmentation methods, i.e., methods for generating additional training data, have been proposed. For example, it is known to augment data by Gaussian noise or image data by image processing methods. However, with known augmentation methods, it proves disadvantageous that they are very complex and require many computer resources, in particular storage and computing capacities, so that they are difficult to realize with ordinary data processing systems.
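For orientation, augmentation by Gaussian noise as mentioned above can be sketched in a few lines. This is an assumed, simplified illustration of that known approach and not part of the disclosed method; the function name `augment_with_gaussian_noise` and the parameter `sigma` are hypothetical.

```python
import random


def augment_with_gaussian_noise(points, sigma, seed=None):
    """Return a noisy copy of each data point (zero-mean Gaussian noise per coordinate)."""
    rng = random.Random(seed)
    return [tuple(x + rng.gauss(0.0, sigma) for x in p) for p in points]


points = [(0.0, 1.0), (2.0, 3.0)]
noisy = augment_with_gaussian_noise(points, sigma=0.1, seed=0)
```

Noise of this kind is added independently in every coordinate and therefore does not take into account any structure that the data points may share.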
Overall,
The first training data may, for example, be measured values that show relationships between input and output values of a function controlled by the machine learning algorithm and based on which the machine learning algorithm is to be trained.
Furthermore, the training data generated by the method 1 may also be used to test or validate an already trained machine learning algorithm.
According to the embodiment of
The structure of the at least one part of the data points of the first training data in the manifold is determined, according to the embodiments of
wherein x0 is the mean value of the N nearest neighbors of the respective data point, ζ1 to ζN are respectively the principal components, determined from the principal component analysis, of the neighbors 1 to N of the respective data point, and λ1 to λN are respectively coefficients.
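The formula to which these definitions refer is not reproduced in this text. A form that would be consistent with the symbols defined above, offered here as an assumption rather than as the original expression, is a local linear expansion around the neighborhood mean, where x (a symbol introduced here for illustration) denotes a point in the approximated manifold:

```latex
x = x_0 + \sum_{i=1}^{N} \lambda_i \, \zeta_i
```

Under this reading, choosing different coefficients λ1 to λN yields different points x near the respective data point.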
According to the embodiments of
According to the embodiments of
As
The data values associated with the nearest neighbors can in this case be read from the corresponding first training data.
According to the embodiments of
Steps 2, 3, 4, 5, and 6 may be performed repeatedly, particularly until sufficient training data for training the machine learning algorithm are available.
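The repeated execution can be sketched as a loop, assuming NumPy is available and assuming that the repeated steps correspond to neighbor search, structure determination by principal component analysis, generation of a new data point, and assignment of a data value. The function name `generate_additional_data`, the parameters `k`, `n_new`, and `scale`, the random choice of coefficients, and the averaging of neighbor values are all hypothetical choices made for this illustration.

```python
import numpy as np


def generate_additional_data(points, values, k, n_new, scale=0.5, seed=0):
    """Generate n_new (data point, data value) pairs from local neighborhoods."""
    rng = np.random.default_rng(seed)
    new_points, new_values = [], []
    # Repeat until the requested amount of additional training data exists.
    while len(new_points) < n_new:
        # Pick a data point and find its k nearest neighbors (Euclidean norm).
        i = rng.integers(len(points))
        order = np.argsort(np.linalg.norm(points - points[i], axis=1), kind="stable")
        neighbors = order[order != i][:k]
        # Local PCA on the neighbors: mean x0 and principal directions (rows of vt).
        x0 = points[neighbors].mean(axis=0)
        _, s, vt = np.linalg.svd(points[neighbors] - x0, full_matrices=False)
        std = s / np.sqrt(max(k - 1, 1))
        # Assumed rule: random coefficients scaled by the spread per direction.
        coeffs = rng.normal(0.0, scale, size=len(s)) * std
        new_points.append(x0 + coeffs @ vt)
        # Assumed rule: the new data value is the mean of the neighbors' values.
        new_values.append(values[neighbors].mean())
    return np.array(new_points), np.array(new_values)


# First training data on the line y = 2x; generated points stay on that line.
points = np.column_stack([np.arange(10.0), 2.0 * np.arange(10.0)])
values = points.sum(axis=1)
new_points, new_values = generate_additional_data(points, values, k=3, n_new=20)
```

Because the coefficients are scaled by the spread along each principal direction, directions in which the neighbors do not vary contribute almost nothing, so the generated data points remain close to the approximated manifold.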
As
Moreover,
The controllable system may, for example, be an injection system of an internal combustion engine, wherein the machine learning algorithm is designed in such a way that the respective opening and/or closing time point of the injection valve can be determined based on a data-based time point determination model.
Furthermore, the controllable system may, for example, be an analyzer, e.g., an analyzer for analyzing samples for the presence of viruses, wherein the method can be applied to corresponding image data.
The controllable system 11 may, for example, be a robotic system, wherein the robotic system may, for example, be an injection system of an internal combustion engine. However, the robotic system may also be any other system that can be controlled based on a machine learning algorithm, e.g., a driver assistance system of a motor vehicle, a kitchen appliance or a washing machine.
As
According to the embodiments of
The provision unit may, for example, be designed as a receiver, wherein the receiver is designed to receive the first training data, e.g., sensor data. The approximation unit, the determination unit and the generation unit may furthermore respectively be implemented, for example, based on code that is stored in a memory and can be executed by a processor.
According to the embodiments of
In particular, the approximation unit 16 is designed to respectively determine, for each data point from the first training data, the nearest neighbors based on the Euclidean norm.
As
Again, the application unit may furthermore be implemented, for example, based on code that is stored in a memory and can be executed by a processor.
According to the embodiments of
As
The further provision unit may, for example, again be designed as a receiver, wherein the receiver is designed to receive the generated additional training data and optionally also the first training data from the control device for generating training data for training the machine learning algorithm. Again, the training unit may furthermore be implemented, for example, based on code that is stored in a memory and can be executed by a processor.
As
The provision unit may, for example, again be designed as a receiver, wherein the receiver is designed to receive the trained machine learning algorithm from the control device for training the machine learning algorithm. The control unit may furthermore comprise corresponding actuators and/or may again at least in part be implemented, for example, based on code that is stored in a memory and can be executed by a processor.
Number | Date | Country | Kind |
---|---|---|---|
10 2021 212 727.4 | Nov 2021 | DE | national |