This application claims the benefit of China Patent Application No. 202210507817.2 filed May 10, 2022, the entire contents of which are incorporated herein by reference.
The disclosure relates to the technical field of artificial neural networks, and specifically provides a multi-task target detection method and device, an autonomous driving system, and a storage medium.
As one of the basic objectives in the field of computer vision, multi-task target detection has broad application prospects. A perception scenario of autonomous driving involves recognition and detection of multiple task targets. If a separate single-task learning model is used for each task target, the accumulated delay, the computing resource consumption, and the difficulty of deployment become unacceptable. A multi-task learning model instead uses a single shared backbone network to extract depth features, and each task is connected to its own head network that follows the depth features output by the backbone network, which effectively reduces the computing resource consumption and the delay.
However, during practical implementation of autonomous driving perception, due to data costs, most multi-task data sets are multi-source data sets, that is, each data sample has only one task annotation, and the data distributions of different tasks differ greatly. In addition, unlike the multi-task learning methods currently explored in the public domain, which use a uni-source data set and have a small number of tasks, there are generally a large number of tasks. When a multi-task model is trained using multi-source data, this results in a high possibility of task conflicts, a low convergence speed, and a poor model training effect, thus affecting the reliability of multi-task target detection.
To overcome the above disadvantages, the disclosure provides a multi-task target detection method and device, an autonomous driving system, and a storage medium, in order to solve or at least partially solve the technical problem that the reliability of multi-task target detection is affected by a high possibility of task conflicts between multi-source data, a low convergence speed, and a poor model training effect.
In a first aspect, the disclosure provides a multi-task target detection method, including:
Further, in the above multi-task target detection method, performing third-stage training on the intermediate trained model by using traffic scenario training images corresponding to all tasks in all task groups, to obtain the multi-task target detection model includes:
Further, in the above multi-task target detection method, each of the branch backbone networks is connected to the main backbone network and a head network for each task in a corresponding task group, respectively, to form a network path corresponding to each task; and
Further, in the above multi-task target detection method, the convergence conditions include:
Further, in the above multi-task target detection method, performing forward training and backward training on corresponding network paths based on traffic scenario training images corresponding to tasks in a current training stage, to obtain a gradient of the main backbone network, a gradient of a corresponding branch backbone network, and a gradient of a head network on each network path includes:
Further, in the above multi-task target detection method, updating parameters of a model in the current training stage based on the gradient of the main backbone network, the gradient of the corresponding branch backbone network, and the gradient of the head network on each network path includes:
Further, in the above multi-task target detection method, selecting traffic scenario training images corresponding to some of tasks in each task group to perform first-stage training on the model to be trained, to obtain an initial trained model includes:
In a third aspect, there is provided a multi-task target detection device, including a processor and a storage apparatus adapted to store a plurality of program codes, where the program codes are adapted to be loaded and run by the processor to perform the multi-task target detection method in any one of the above aspects.
In a fourth aspect, there is provided an autonomous driving system, including the multi-task target detection device as described above.
In a fifth aspect, there is provided a computer-readable storage medium having a plurality of program codes stored therein, where the program codes are adapted to be loaded and run by a processor to perform the multi-task target detection method in any one of the above technical solutions.
The above one or more technical solutions of the disclosure have at least one or more of the following beneficial effects:
In the technical solutions for implementing the disclosure, the model to be trained is designed as a structure in which the main backbone network shared by all the tasks, the branch backbone networks shared by different task groups, and the head network individually used for each task are connected in sequence. During model training, the traffic scenario training images corresponding to some of the tasks in each task group are selected to perform the first-stage training on the model to be trained, to obtain the initial trained model; the parameters of the main backbone network in the initial trained model are fixed, and the second-stage training is performed on the initial trained model by using the traffic scenario training images corresponding to all the tasks in each task group, to obtain the intermediate trained model; and then, the parameters of the main backbone network in the intermediate trained model are released, and the third-stage training is performed on the intermediate trained model by using the traffic scenario training images corresponding to all the tasks in all the task groups, to finally obtain the multi-task target detection model.
In this way, during the first-stage training, only some of the tasks in each group are used for training, which reduces the volume of data, effectively prevents the occurrence of task conflicts, increases the convergence speed, and shortens the training time. During the second-stage training, with the parameters of the main backbone network unchanged, the tasks on the paths through the branch backbone networks that follow the main backbone network may be trained separately without mutual interference, which optimizes the parameters of the head network, shortens the training time, and also makes it possible to use tasks from a multi-source data set for training without mutual interference, regardless of whether the tasks derive from a uni-source data set. During the third-stage training, the model obtained from the first two stages is already in a converged state, and therefore, using all the tasks for simultaneous training can not only further optimize the parameters of the main backbone network but can also achieve quick convergence to obtain the required multi-task target detection model. As a result, the occurrence of task conflicts can be reduced, the convergence speed can be increased, the model training effect can be improved, and a higher reliability can be finally achieved by inputting the image under detection into the multi-task target detection model trained based on the step-by-step method, and detecting and recognizing the target detection object.
Further, during the first-stage training, the task with the maximum volume of data in each task group is selected for training, which can avoid the problem of difficulty in convergence of tasks with a small volume of data due to simultaneous training of all the tasks.
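The structure described above, in which each task reaches its head network through the shared main backbone network and the branch backbone network of its task group, can be illustrated with a minimal Python sketch. The task group names, the task names, and the `build_network_paths` helper are hypothetical and chosen only for illustration; the disclosure does not prescribe them.

```python
from typing import Dict, List, Tuple

# A network path is (main backbone, branch backbone, head network).
NetworkPath = Tuple[str, str, str]


def build_network_paths(task_groups: Dict[str, List[str]]) -> Dict[str, NetworkPath]:
    """Map each task to the path: shared main backbone -> group branch -> task head."""
    paths: Dict[str, NetworkPath] = {}
    for group, tasks in task_groups.items():
        for task in tasks:
            if task in paths:
                raise ValueError(f"task {task!r} appears in more than one group")
            paths[task] = ("main_backbone", f"branch_{group}", f"head_{task}")
    return paths


# Hypothetical grouping of perception tasks (assumption, not from the disclosure).
task_groups = {
    "dynamic": ["vehicle", "pedestrian"],
    "static": ["lane_line", "traffic_light"],
}
paths = build_network_paths(task_groups)
```

In this sketch every path starts at the same main backbone, tasks in the same group share one branch, and each task has its own head.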
The disclosed content of the disclosure will become more readily understood with reference to the accompanying drawings. Those skilled in the art readily understand that these accompanying drawings are merely for illustrative purposes and are not intended to limit the scope of protection of the disclosure. In addition, similar components are represented by similar numbers in the figures, in which:
Some implementations of the disclosure are described below with reference to the accompanying drawings. Those skilled in the art should understand that these implementations are only used to explain the technical principles of the disclosure, and are not intended to limit the scope of protection of the disclosure.
In the description of the disclosure, a “module” or “processor” may include hardware, software, or a combination thereof. A module may include a hardware circuit, various suitable sensors, a communication port, and a memory, or may include a software part, for example, program code, or may be a combination of software and hardware. The processor may be a central processing unit, a microprocessor, a graphics processing unit, a digital signal processor, or any other suitable processor. The processor has a data and/or signal processing function. The processor may be implemented in software, hardware, or a combination thereof. A non-transitory computer-readable storage medium includes any suitable medium that may store program codes, for example, a magnetic disk, a hard disk, an optical disc, a flash memory, a read-only memory, or a random access memory. The term “A and/or B” indicates all possible combinations of A and B, for example, only A, only B, or A and B. The term “at least one of A or B” or “at least one of A and B” has a meaning similar to “A and/or B” and may include only A, only B, or A and B. The terms “a/an” and “this” in the singular form may also include the plural form.
In a multi-task target detection process, due to data costs, most multi-task data sets are multi-source data sets, that is, each data sample has an annotation for only one task, and the data distributions of different tasks differ greatly. In addition, unlike the multi-task learning methods currently explored in the public domain, which use a uni-source data set and have a small number of tasks, there are generally a large number of tasks. When a multi-task model is trained using multi-source data, this results in a high possibility of task conflicts, a low convergence speed, and a poor model training effect, thus affecting the reliability of multi-task target detection.
In view of this, the disclosure provides the following technical solutions in order to solve the above technical problems.
Referring to
In
As shown in
In step 101, an image under detection that contains at least one target detection object is obtained.
In a specific implementation process, the target detection object may include, but is not limited to, an obstacle, a pedestrian, a traffic light, and a lane line. The image under detection that contains the target detection object may be obtained by a vehicle-mounted camera, a millimeter wave radar, etc.
In step 102, the image under detection is input into a multi-task target detection model trained based on a step-by-step method, and the target detection object is detected and recognized.
In a specific implementation process, after the image under detection is input into the multi-task target detection model trained based on the step-by-step method, the multi-task target detection model first divides the image under detection according to characteristics of a task corresponding to the image under detection, and then sequentially inputs divided images under detection into corresponding network paths for detection and recognition, so as to obtain the target detection object.
In a specific implementation process, step-by-step training may be performed on a pre-built model to be trained to implement training of the multi-task target detection model. The specific training process is as shown in
In step 301, traffic scenario training images corresponding to some of tasks in each task group are selected to perform first-stage training on the model to be trained, to obtain an initial trained model.
In a specific implementation process, different tasks in each task group correspond to different volumes of data of traffic scenario training images. When all the tasks are used for simultaneous training, tasks with a smaller volume of data can hardly converge, and applying a resampling method to the tasks with a smaller volume of data may lead to slow convergence in training and a significant increase in the training time. Therefore, in this embodiment, only the traffic scenario training images corresponding to some of the tasks in each task group may be selected to perform the first-stage training on the model to be trained, such that parameters of the main backbone network in the model to be trained, parameters of the branch backbone network in the model to be trained, and parameters of the head network corresponding to the tasks for training in the model to be trained are updated to obtain the initial trained model.
In a specific implementation process, a traffic scenario training image corresponding to a task with a maximum volume of data in each task group may be selected to perform the first-stage training on the model to be trained, to obtain the initial trained model. As such, the traffic scenario training images corresponding to the tasks with a small volume of data are no longer used to train the model to be trained, thus preventing the problem of a longer training time caused by the difficulty in convergence of the tasks with a small volume of data; in addition, when the task with the maximum volume of data is used for training, the task with a large volume of data requires no resampling, and achieves a more accurate training result as compared to the tasks with a small volume of data.
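The selection of the task with the maximum volume of data in each task group can be sketched as follows. This is a minimal illustration only: the function name `select_first_stage_tasks`, the group and task names, and the image counts are all hypothetical.

```python
from typing import Dict


def select_first_stage_tasks(task_groups: Dict[str, Dict[str, int]]) -> Dict[str, str]:
    """For each task group, pick the task that has the most training images."""
    selected: Dict[str, str] = {}
    for group, image_counts in task_groups.items():
        if not image_counts:
            raise ValueError(f"task group {group!r} has no tasks")
        # max over task names, ranked by their number of training images
        selected[group] = max(image_counts, key=image_counts.get)
    return selected


# Hypothetical image counts per task (assumption, for illustration only).
task_groups = {
    "dynamic": {"vehicle": 120_000, "pedestrian": 45_000},
    "static": {"lane_line": 80_000, "traffic_light": 9_000},
}
first_stage_tasks = select_first_stage_tasks(task_groups)
```

Only the selected tasks contribute training images in the first stage; the remaining tasks join in the second stage.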
During the first-stage training, the training may be performed according to the following steps.
(1) Forward training and backward training are performed on corresponding network paths based on traffic scenario training images corresponding to a plurality of tasks in a current training stage, respectively, to obtain a gradient of the main backbone network, a gradient of a corresponding branch backbone network, and a gradient of a head network on each network path.
Specifically, based on a forward propagation algorithm, the forward training may be performed on corresponding network paths by using traffic scenario training images corresponding to the tasks in the current training stage, to obtain a loss of each network path; and based on a backward propagation algorithm, the backward training may be performed on corresponding network paths by using the losses, to determine the gradient of the main backbone network, the gradient of the corresponding branch backbone network, and the gradient of the head network on each network path.
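The forward and backward passes on a single network path can be illustrated with a deliberately simplified sketch. It assumes that each of the three networks on the path is a single scalar weight and that the loss is a squared error; real backbone and head networks are deep networks with detection losses, so the names and formulas here are illustrative assumptions rather than the method of the disclosure.

```python
from dataclasses import dataclass


@dataclass
class PathGradients:
    loss: float
    main: float    # gradient with respect to the main backbone weight
    branch: float  # gradient with respect to the branch backbone weight
    head: float    # gradient with respect to the head weight


def forward_backward(x: float, target: float,
                     w_main: float, w_branch: float, w_head: float) -> PathGradients:
    """Run one forward and one backward pass on a toy scalar network path."""
    # Forward pass: main backbone -> branch backbone -> head.
    f_main = w_main * x
    f_branch = w_branch * f_main
    prediction = w_head * f_branch
    error = prediction - target
    loss = 0.5 * error ** 2

    # Backward pass: chain rule from the loss back to each weight.
    g_head = error * f_branch
    g_branch = error * w_head * f_main
    g_main = error * w_head * w_branch * x
    return PathGradients(loss, g_main, g_branch, g_head)


result = forward_backward(x=2.0, target=10.0, w_main=1.0, w_branch=2.0, w_head=3.0)
```

Running the same function once per network path yields one loss and one set of three gradients per path.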
(2) Parameters of a model in the current training stage are updated based on the gradient of the main backbone network, the gradient of the corresponding branch backbone network, and the gradient of the head network on each network path until convergence conditions are met, and then a trained model in the current training stage is obtained.
Specifically, gradients of the main backbone network on all network paths may be superimposed to obtain a superimposed gradient of the main backbone network, and, for each branch backbone network, gradients of the branch backbone network on all network paths corresponding to that branch backbone network may be superimposed to obtain a superimposed gradient of the branch backbone network; and the parameters of the main backbone network may be updated based on the superimposed gradient of the main backbone network, parameters of each branch backbone network may be updated based on the superimposed gradient of that branch backbone network, and parameters of each head network may be updated based on the gradient of the head network on each network path.
It should be noted that after one iteration of training is completed, the next iteration is carried out directly; after a certain number of iterations, a test on a validation set is conducted to check the accuracy of the model in the current training stage, and the training does not stop until the convergence conditions are met, namely: the loss of each network path no longer decreases, and the accuracy of the test result no longer increases after the model in the current training stage is tested using traffic scenario training images corresponding to the validation set.
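The superposition of gradients can be sketched as below, assuming that "superimposed" means summed and that each network path reports scalar gradients. The dictionary layout, the `superimpose_gradients` helper, and the example values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def superimpose_gradients(
    path_grads: List[dict],
) -> Tuple[float, Dict[str, float], Dict[str, float]]:
    """Sum main-backbone gradients over all paths and branch gradients per task group."""
    main_total = 0.0
    branch_totals: Dict[str, float] = defaultdict(float)
    head_grads: Dict[str, float] = {}
    for grads in path_grads:
        main_total += grads["main"]                      # shared by every path
        branch_totals[grads["group"]] += grads["branch"]  # shared within a task group
        head_grads[grads["task"]] = grads["head"]        # one head per task, no summing
    return main_total, dict(branch_totals), head_grads


# Hypothetical per-path gradients (assumption, for illustration only).
path_grads = [
    {"group": "dynamic", "task": "vehicle", "main": 1.0, "branch": 0.5, "head": 0.1},
    {"group": "dynamic", "task": "pedestrian", "main": 2.0, "branch": 0.25, "head": 0.2},
    {"group": "static", "task": "lane_line", "main": -0.5, "branch": 1.0, "head": 0.3},
]
main_grad, branch_grads, head_grads = superimpose_gradients(path_grads)
```

The main backbone receives contributions from every path, each branch backbone only from the paths of its own task group, and each head only from its own path.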
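One possible way to check these convergence conditions is sketched below. It compares the two most recent validation checks and uses a small tolerance `tol`; both the comparison window and the tolerance are assumptions, since the disclosure does not specify how "no decrease" and "no increase" are measured.

```python
from typing import Dict, List


def has_converged(path_losses: Dict[str, List[float]],
                  val_accuracies: List[float],
                  tol: float = 1e-4) -> bool:
    """Return True when no path loss decreases and validation accuracy does not increase.

    path_losses maps each network path to its loss at successive validation checks;
    val_accuracies holds the validation accuracy at the same checks.
    """
    if len(val_accuracies) < 2:
        return False  # need at least two checks to compare
    losses_flat = all(
        len(history) >= 2 and history[-2] - history[-1] <= tol
        for history in path_losses.values()
    )
    accuracy_flat = val_accuracies[-1] - val_accuracies[-2] <= tol
    return losses_flat and accuracy_flat


still_training = has_converged({"vehicle": [1.0, 0.5], "lane_line": [0.8, 0.8]}, [0.70, 0.70])
finished = has_converged({"vehicle": [0.5, 0.5], "lane_line": [0.8, 0.8]}, [0.70, 0.70])
```

In practice a longer patience window is often used so that a single noisy validation check does not stop training early.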
In step 302, parameters of the main backbone network in the initial trained model are fixed, and second-stage training is performed on the initial trained model by using traffic scenario training images corresponding to all tasks in each task group, to obtain an intermediate trained model.
In a specific implementation process, during the first-stage training, only the traffic scenario training images corresponding to some of the tasks participate in model training, and consequently, the parameters of the main backbone network in the initial trained model, the parameters of the branch backbone network in the initial trained model, and parameters of each head network in the initial trained model are not optimal parameters, and the branch backbone network corresponding to each task group has not been trained using all the tasks in the task group. Therefore, the second-stage training is required after the initial trained model is obtained.
In a specific implementation process, a main backbone network is used by all the tasks, the main backbone network for the model to be trained has been updated through the first training stage, and a certain training effect has been achieved. Therefore, to shorten the training time, the parameters of the main backbone network in the initial trained model may be fixed to remain unchanged, second-stage training may be performed on the initial trained model by using the traffic scenario training images corresponding to all the tasks in each task group, and the parameters of the branch backbone network in the initial trained model and parameters of all the head networks in the initial trained model may be respectively updated to obtain the intermediate trained model.
In a specific implementation process, the second-stage training process can be performed with reference to the first-stage training process, and will not be repeated herein.
It should be noted that during the second-stage training, the parameters of the main backbone network in the initial trained model are fixed, such that even if the gradient of the main backbone network on each network path is obtained, the parameters of the main backbone network in the initial trained model will not be affected. As such, the branch backbone network on each network path and the head network connected to the branch backbone network may be trained simultaneously on respective different processors, which shortens the training time, and also makes it possible to use tasks from a multi-source data set for training without mutual interference, regardless of whether the tasks derive from a uni-source data set.
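The effect of fixing the main backbone parameters can be illustrated with a minimal parameter update sketch. It assumes plain gradient descent on scalar parameters; the `sgd_step` helper, the parameter names, and the values are hypothetical.

```python
from typing import Dict, Set


def sgd_step(params: Dict[str, float], grads: Dict[str, float],
             lr: float, frozen: Set[str]) -> Dict[str, float]:
    """Apply one gradient descent step, leaving frozen parameters unchanged."""
    return {
        name: value if name in frozen else value - lr * grads.get(name, 0.0)
        for name, value in params.items()
    }


# Hypothetical scalar parameters and gradients (assumption, for illustration only).
params = {"main": 1.0, "branch_dynamic": 2.0, "head_vehicle": 3.0}
grads = {"main": 0.5, "branch_dynamic": 1.0, "head_vehicle": -2.0}

# Second stage: the main backbone is fixed, so its gradient is ignored.
stage_two = sgd_step(params, grads, lr=0.5, frozen={"main"})
# Third stage: the main backbone is released and updated with the rest.
stage_three = sgd_step(params, grads, lr=0.5, frozen=set())
```

Because the main backbone does not change in the second stage, updates to one branch and its heads do not depend on updates to another branch, which is what allows the task groups to be trained on separate processors.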
In step 303, the parameters of the main backbone network in the intermediate trained model are released, and third-stage training is performed on the intermediate trained model by using traffic scenario training images corresponding to all tasks in all task groups, to obtain the multi-task target detection model.
In a specific implementation process, during the first-stage training and the second-stage training, only some of the tasks participate in training of the parameters of the main backbone network, and consequently, the parameters of the main backbone network are not optimal parameters. Therefore, after the intermediate trained model is obtained, the parameters of the main backbone network in the intermediate trained model may be released, and then third-stage training is performed on the intermediate trained model by using traffic scenario training images corresponding to all tasks in all the task groups, to further update the parameters of the main backbone network in the intermediate trained model, parameters of the branch backbone networks in the intermediate trained model, and parameters of the head networks corresponding to all tasks in the intermediate trained model, so as to obtain the multi-task target detection model.
In a specific implementation process, the third-stage training process can also be performed with reference to the first-stage training process, and will not be repeated herein.
It should be noted that the parameters of the main backbone network have been initially optimized through the first-stage training, and the parameters of the branch backbone networks and the parameters of the head networks corresponding to all the tasks are further optimized through the second-stage training, that is, the third-stage training is performed with convergence implemented in the first-stage training and the second-stage training. Therefore, all the tasks can implement quick convergence even if they are used for simultaneous training, and the parameters of the main backbone network in the multi-task target detection model are related to all the tasks and have a higher accuracy.
In addition, the parameters of the main backbone network have been initially optimized through the first-stage training, and therefore, this step may be completed at a low learning rate, to shorten the entire training time.
Specifically, a learning rate during second-stage training may be obtained; the learning rate during the second-stage training may be adjusted based on a preset learning rate reduction parameter, to obtain a reduced learning rate; and based on the reduced learning rate, third-stage training may be performed on the intermediate trained model by using all tasks in all the task groups, to obtain the multi-task target detection model. Usually, the preset learning rate reduction parameter may be set such that the learning rate is adjusted to 1/10 of the learning rate in the previous training stage.
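The learning rate adjustment can be sketched as follows, assuming that the preset learning rate reduction parameter acts as a multiplicative factor with a typical value of 1/10. The function name and the example learning rate are hypothetical.

```python
def reduced_learning_rate(stage_two_lr: float, reduction_factor: float = 0.1) -> float:
    """Scale the second-stage learning rate by the preset reduction factor."""
    if stage_two_lr <= 0:
        raise ValueError("learning rate must be positive")
    if not 0 < reduction_factor <= 1:
        raise ValueError("reduction factor must be in (0, 1]")
    return stage_two_lr * reduction_factor


# Hypothetical second-stage learning rate of 0.01 (assumption).
stage_three_lr = reduced_learning_rate(0.01)
```

With these example values the third stage trains at roughly 0.001, one tenth of the second-stage rate.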
According to the multi-task target detection method in this embodiment, the model to be trained is designed as a structure in which the main backbone network shared by all the tasks, the branch backbone networks shared by different task groups, and the head network individually used for each task are connected in sequence. During model training, the traffic scenario training images corresponding to some of the tasks in each task group are selected to perform the first-stage training on the model to be trained, to obtain the initial trained model; the parameters of the main backbone network in the initial trained model are fixed, and the second-stage training is performed on the initial trained model by using the traffic scenario training images corresponding to all the tasks in each task group, to obtain the intermediate trained model; and then, the parameters of the main backbone network in the intermediate trained model are released, and the third-stage training is performed on the intermediate trained model by using the traffic scenario training images corresponding to all the tasks in all the task groups, to finally obtain the multi-task target detection model.
In this way, during the first-stage training, only some of the tasks in each group are used for training, which reduces the volume of data, effectively prevents the occurrence of task conflicts, increases the convergence speed, and shortens the training time. During the second-stage training, with the parameters of the main backbone network unchanged, the tasks on the paths through the branch backbone networks that follow the main backbone network may be trained separately without mutual interference, which optimizes the parameters of the head network, shortens the training time, and also makes it possible to use tasks from a multi-source data set for training without mutual interference, regardless of whether the tasks derive from a uni-source data set. During the third-stage training, the model obtained from the first two stages is already in a converged state, and therefore, using all the tasks for simultaneous training can not only further optimize the parameters of the main backbone network but can also achieve quick convergence to obtain the required multi-task target detection model. As a result, the occurrence of task conflicts can be reduced, the convergence speed can be increased, the model training effect can be improved, and a higher reliability can be finally achieved by inputting the image under detection into the multi-task target detection model trained based on the step-by-step method, and detecting and recognizing the target detection object.
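The three training stages summarized above can be laid out as a simple plan. This sketch only records which tasks train together and whether the main backbone is trainable in each stage; the `build_stage_plan` helper, the notion of a "job" as a set of tasks trained together, and the example data are assumptions made for illustration.

```python
from typing import Dict, List


def build_stage_plan(task_groups: Dict[str, Dict[str, int]]) -> List[dict]:
    """Describe the three training stages for the given task groups.

    task_groups maps a group name to {task name: number of training images}.
    Each job is a list of tasks that are trained together.
    """
    largest = [max(counts, key=counts.get) for counts in task_groups.values()]
    per_group = [sorted(counts) for counts in task_groups.values()]
    all_tasks = sorted(task for counts in task_groups.values() for task in counts)
    return [
        # Stage 1: only the largest task of each group, all networks trainable.
        {"stage": 1, "main_backbone_trainable": True, "jobs": [sorted(largest)]},
        # Stage 2: main backbone fixed, so each group can be trained independently.
        {"stage": 2, "main_backbone_trainable": False, "jobs": per_group},
        # Stage 3: main backbone released, all tasks of all groups trained jointly.
        {"stage": 3, "main_backbone_trainable": True, "jobs": [all_tasks]},
    ]


# Hypothetical task groups and image counts (assumption, for illustration only).
plan = build_stage_plan({
    "dynamic": {"vehicle": 120_000, "pedestrian": 45_000},
    "static": {"lane_line": 80_000, "traffic_light": 9_000},
})
```

The plan makes the asymmetry between the stages explicit: the second stage is the only one in which the work splits into independent per-group jobs.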
It should be noted that, although the steps are described in a specific order in the above embodiments, those skilled in the art may understand that in order to implement the effects of the disclosure, different steps are not necessarily performed in such an order, but may be performed simultaneously (in parallel) or in other orders, and these changes shall all fall within the scope of protection of the disclosure.
Those skilled in the art can understand that all or some of the procedures in the method of the above embodiment of the disclosure may also be implemented by a computer program instructing relevant hardware. The computer program may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the above method embodiments may be implemented. The computer program includes computer program codes, which may be in a source code form, an object code form, an executable file form, some intermediate forms, or the like. The computer-readable storage medium may include: any entity or apparatus that can carry the computer program code, such as a medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory, a random access memory, an electric carrier signal, a telecommunications signal, and a software distribution medium. It should be noted that the content included in the computer-readable storage medium may be appropriately added or deleted depending on requirements of the legislation and patent practice in a jurisdiction. For example, in some jurisdictions, according to the legislation and patent practice, the computer-readable storage medium does not include an electric carrier signal and a telecommunications signal.
Further, the disclosure further provides a multi-task target detection device.
Further, the disclosure further provides an autonomous driving system, which may include the multi-task target detection device according to the above embodiments.
Further, the disclosure further provides a computer-readable storage medium. In an embodiment of the computer-readable storage medium according to the disclosure, the computer-readable storage medium may be configured to store a program for executing the multi-task target detection method according to the above method embodiment, where the program may be loaded and run by a processor to implement the above multi-task target detection method. For ease of description, only parts related to the embodiments of the disclosure are shown. For specific technical details that are not disclosed, reference may be made to the method part of the embodiments of the disclosure. The computer-readable storage medium may be a storage apparatus formed by various electronic devices. Optionally, the computer-readable storage medium in the embodiment of the disclosure is a non-transitory computer-readable storage medium.
Further, it should be understood that, because the configuration of modules is merely intended to illustrate function units of the apparatus in the disclosure, physical devices corresponding to these modules may be a processor itself, or part of software, part of hardware, or part of a combination of software and hardware in the processor. Therefore, the number of modules in the figure is merely an example.
Those skilled in the art can understand that the modules in the apparatus may be adaptively split or merged. Such a split or combination of specific modules does not cause the technical solutions to depart from the principle of the disclosure. Therefore, technical solutions after any such split or combination shall all fall within the scope of protection of the disclosure.
Heretofore, the technical solutions of the disclosure have been described with reference to the preferred implementations shown in the accompanying drawings. However, those skilled in the art can readily understand that the scope of protection of the disclosure is apparently not limited to these specific implementations. Those skilled in the art may make equivalent changes or substitutions to the related technical features without departing from the principle of the disclosure, and all the technical solutions with such changes or substitutions shall fall within the scope of protection of the disclosure.
| Number | Date | Country | Kind |
|---|---|---|---|
| 202210507817.2 | May 2022 | CN | national |