This nonprovisional application claims priority under 35 U.S.C. § 119(a) to German Patent Application No. 10 2020 129 675.4, which was filed in Germany on Nov. 11, 2020, and which is herein incorporated by reference.
The present invention relates to a method for reducing training data.
From Tianyang Wang, Jun Huan, Bo Li; "Data Dropout: Optimizing Training Data for Convolutional Neural Networks"; 2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI), it is known that artificial neural network (ANN) training can be improved by selectively reducing the training data. In this case, training is performed in two steps, wherein the first part of training is done with the full training dataset and the second part with a reduced dataset.
US 2021/0335018, which is incorporated herein by reference, is directed to a data generating device, training device, and a data generating method.
US 2021/0335469, which is incorporated herein by reference, is directed to a system for assigning concepts to a medical image that includes a visual feature module and a tagging module. The visual feature module is configured to obtain an image feature vector from the medical image. The tagging module is configured to apply a machine-learned algorithm to the image feature vector to assign a set of concepts to the image.
US 2021/0335061, which is incorporated herein by reference, is directed to techniques for monitoring and predicting vehicle health.
Vishesh Devgan et al., "Using a Novel Image Analysis Metric to Calculate Similarity of Input Image and Images Generated by WAE," 2019 IEEE, which is incorporated herein by reference, presents a quasi-Euclidean distance metric as a reliable means for measuring image similarity in certain cases.
It is therefore an object of the present invention to provide a more efficient method for reducing the training data, which preferably avoids training with the complete, non-reduced training data set.
According to an exemplary embodiment, a method is provided for reducing training data via a system comprising an encoder, wherein at least a portion of the training data forms a temporal sequence and is combined into a first set of training data, and the encoder maps input data to prototype feature vectors of a set of prototype feature vectors. In the method:

a) a first input datum is received from the first set of training data;

b) the first input datum is propagated through the encoder, wherein the input datum is assigned one or more feature vectors by the encoder, and, depending on the assigned feature vectors, a certain set of prototype feature vectors is determined and assigned to the first input datum;

c) an aggregated vector is created for the first input datum;

d) steps a) to c) are performed with a second input datum from the first set of training data, and a second aggregated vector is created for the second input datum;

e) at least the first and second aggregated vectors are compared, and a measure of similarity for the aggregated vectors is determined; and

f) the first input datum is flagged or removed from the first set of training data if the determined measure of similarity exceeds a threshold value, wherein the flagging or removal results in the first input datum not being used for a first training.
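Purely for illustration, steps a) through f) can be sketched as follows. The encoder function, the prototype feature vectors, the frame format, and the threshold value used here are hypothetical stand-ins and not part of the claimed subject matter; in particular, the nearest-prototype assignment and the cosine similarity are merely one possible choice.

```python
# Illustrative sketch of steps a)-f); encoder, prototypes, and frames
# are hypothetical placeholders, not the claimed implementation.

def nearest_prototype(feature, prototypes):
    """Step b): assign a feature vector to its closest prototype (by index)."""
    def dist(p):
        return sum((f - q) ** 2 for f, q in zip(feature, p))
    return min(range(len(prototypes)), key=lambda i: dist(prototypes[i]))

def aggregate(features, prototypes):
    """Steps b)-c): map each feature vector to a prototype index and build
    a histogram-style aggregated vector over the prototype indices."""
    counts = [0] * len(prototypes)
    for f in features:
        counts[nearest_prototype(f, prototypes)] += 1
    return counts

def cosine_similarity(u, v):
    """Step e): one possible measure of similarity between aggregated vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def reduce_sequence(frames, encode, prototypes, threshold=0.95):
    """Steps a)-f): keep a frame only if its aggregated vector is not too
    similar to the aggregated vector of the previously kept frame."""
    kept, prev = [], None
    for frame in frames:
        agg = aggregate(encode(frame), prototypes)
        if prev is None or cosine_similarity(agg, prev) <= threshold:
            kept.append(frame)   # frame carries sufficiently new information
            prev = agg
        # else: step f) - frame is flagged/removed as redundant
    return kept

# Toy example: one 2-D feature per frame, two prototype feature vectors
prototypes = [[0.0, 0.0], [1.0, 1.0]]
frames = [[0.1, 0.1], [0.1, 0.2], [0.9, 1.0]]
kept = reduce_sequence(frames, lambda fr: [fr], prototypes, threshold=0.95)
print(len(kept))  # -> 2 (the near-duplicate second frame is dropped)
```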
An advantage of the method according to the invention is that training data can be excluded from the training quickly and efficiently, thus improving the training success. The method also improves the possibility of efficiently performing preprocessing steps. These include, for example, the enrichment of the individual training data with additional information about their content (labeling). Since less data needs to be labeled after the method has been carried out, preprocessing is also more effective.
It is also advantageous that the encoder used according to the invention can be trained with non-prepared data, in particular non-labeled data. This is done, for example, when training an autoencoder that includes the encoder. This unsupervised machine learning is considerably less costly, since the very time-consuming step of labeling or annotating the training data can be dispensed with.
The first set of training data can include video, radar, and/or lidar frames.
Typical training data are, for example, sensor data, specifically data from imaging or environment-sensing sensors such as cameras, radar sensors, or lidar sensors. Typical training data are thus video, radar, or lidar frames.
The video, radar and/or lidar frames of the first set of training data can be temporal sequences of sensor data, in particular of sensor data recorded during a journey of a vehicle or sensor data artificially generated so that they simulate sensor data of a journey of a vehicle.
A frame represents a snapshot of the section of the scene captured by the sensor. These individual frames usually form sequences of temporally successive single frames. Training data of this type, i.e., temporal sequences of sensor data, is often recorded by vehicles. These vehicles move through ordinary road traffic in order to record sensor data typical for this situation. Alternatively, sensor data can also be generated artificially. For this purpose, a fictitious scene, e.g., of road traffic, can be generated in a simulation, and sensor data for a simulated vehicle can be calculated from it. This may be done for reasons of time, since simulations can run much faster than real driving. Likewise, situations that cannot easily be recreated in reality, such as emergency braking or even accidents, can easily be recreated in simulation. With this kind of sequential training data, it is very common that not all frames contain information important for the training, or that two frames contain practically only redundant information. An example is waiting at a red light in traffic. During the wait time, a large number of sensor data are recorded which, however, do not differ, or differ only insignificantly, in the aspects relevant for the training.
The first and second input data of the first set of training data can be consecutive data in the temporal sequence of the training data.
The training data of the first set of training data can be used to train an algorithm for the highly automated or autonomous control of vehicles.
In the context of developing algorithms for controlling highly automated or autonomous vehicles, a large amount of training data may be required. Since a large part of the algorithms is based on artificial intelligence and in particular deep neural networks, these must be trained with appropriate training data. Further training data is required for testing or validating the developed algorithms. The method according to the invention can be used in particular for the selection of relevant training data from a large set of training data, with which algorithms for highly automated or autonomous vehicles are then trained or tested.
In a further example of the method, steps a) to f) are performed directly when the training data is recorded or generated, and in step f) the first input datum is removed when the threshold value of the measure of similarity is exceeded.
Measured or generated training data, especially sensor data, usually require a large amount of storage space and, during a journey in road traffic, can be stored only in the vehicle and thus only to a limited extent. Wireless transmission of the sensor data is not possible in most cases due to its volume. To save storage space, the method according to the invention is therefore suitable for being carried out directly after the data has been acquired, so that data identified as redundant is removed immediately.
In a further example of the method, steps a) through f) are performed prior to training or preprocessing with the training data of the first set of training data.
Likewise, the method according to the invention can also be used at a later time to remove redundant training data from the data set to be used before training or before preprocessing. In particular, this saves time and computational resources in preparing the training data and provides better results during training. Preprocessing includes, in particular, labeling or annotating the training data to be used for training.
The aggregated vector can be a histogram vector that assigns to each prototype feature vector an integer representing the number of times that prototype feature vector was assigned.
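A minimal sketch of such a histogram vector is given below; the number of prototypes and the assigned indices are made-up example values, not values from the invention.

```python
# Hypothetical illustration of the histogram vector: for each prototype
# feature vector (identified here by its index), count how often the
# encoder assigned it to the input datum.
from collections import Counter

def histogram_vector(assigned_indices, num_prototypes):
    """Return, per prototype index, the number of times it was assigned."""
    counts = Counter(assigned_indices)
    return [counts.get(i, 0) for i in range(num_prototypes)]

# e.g. 4 prototypes; the encoder assigned prototypes 0, 2, 2, 3 to the frame
print(histogram_vector([0, 2, 2, 3], 4))  # -> [1, 0, 2, 1]
```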
In an example of the method, the measure of similarity in step e) is determined using a cosine similarity.
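As a sketch, the cosine similarity of two aggregated vectors can be computed as follows; the example vectors are arbitrary illustrative values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two aggregated vectors: 1.0 means the
    histograms are proportional (maximally similar), 0.0 means they share
    no assigned prototype."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

print(cosine_similarity([3, 4], [3, 4]))  # -> 1.0
print(cosine_similarity([1, 0], [0, 1]))  # -> 0.0
```

In step f), the first input datum would then be flagged or removed when this value exceeds the chosen threshold.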
In an example of the method, the measure of similarity in step e) comprises comparing the first, the second, and a third aggregated vector, wherein the third aggregated vector was generated using steps a) through c) with a third input datum from the first set of training data.
The encoder can be trained as part of an autoencoder.
The encoder can comprise a first set of prototype feature vectors learned during training of the autoencoder.
The encoder and/or the autoencoder can be implemented by, for example, a neural network, in particular a convolutional neural network.
The method according to the invention may also be present as a computer program product comprising program code which, when executed, performs the method.
The method according to the invention may be present in a computer system set up to perform the method.
Further scope of applicability of the present invention will become apparent from the detailed description given hereinafter. However, it should be understood that the detailed description and specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only, since various changes, combinations, and modifications within the spirit and scope of the invention will become apparent to those skilled in the art from this detailed description.
The present invention will become more fully understood from the detailed description given hereinbelow and the accompanying drawings which are given by way of illustration only, and thus, are not limitive of the present invention, and wherein:
The invention being thus described, it will be obvious that the same may be varied in many ways. Such variations are not to be regarded as a departure from the spirit and scope of the invention, and all such modifications as would be obvious to one skilled in the art are to be included within the scope of the following claims.
Number | Date | Country | Kind
---|---|---|---
10 2020 129 675.4 | Nov 2020 | DE | national