This application claims the benefit of priority to Chinese Patent Application No. 202410796725.X, filed on Jun. 19, 2024. The entire contents of this application are hereby incorporated herein by reference.
The present disclosure relates to the field of artificial intelligence technology, and in particular to the fields of computer vision technology and deep learning technology, which may be applied to scenarios such as autonomous driving. More specifically, the present disclosure relates to an information prediction method, a method of training an autonomous driving model, a device, a medium, and an autonomous driving vehicle.
With the rapid development of artificial intelligence and autonomous driving technologies, end-to-end autonomous driving systems have attracted much attention due to their simplified system architecture, reduced error accumulation, and global optimization capabilities.
For example, the autonomous driving system may predict a control signal for a vehicle by analyzing perception data. The vehicle may control a braking system according to the predicted control signal, so as to achieve autonomous driving of the vehicle. Through the settings of the autonomous driving system, requirements for convenient travel may be met to a certain extent.
The present disclosure provides an information prediction method, a method of training an autonomous driving model, a device, a medium, and an autonomous driving vehicle.
According to an aspect of the present disclosure, there is provided an information prediction method, including: acquiring perception data including image data acquired by a sensor in a vehicle and driving data of the vehicle; encoding the image data to obtain an image token sequence corresponding to the image data; encoding the driving data to obtain a driving feature corresponding to the driving data; and generating, using a generative model, a predicted token sequence corresponding to the image token sequence and a control information for the vehicle based on the driving feature and the image token sequence.
According to another aspect of the present disclosure, there is provided a method of training an autonomous driving model. The autonomous driving model includes an encoding layer and a generative model. The encoding layer includes a sequence encoding network and a driving data encoding network. The training method includes: encoding, using the sequence encoding network, image data in sample perception data to obtain an image token sequence corresponding to the image data; encoding, using the driving data encoding network, driving data in the sample perception data to obtain a driving feature corresponding to the driving data; generating, using the generative model, a predicted token sequence corresponding to the image token sequence and a predicted control information for a vehicle based on the driving feature and the image token sequence; and training the autonomous driving model according to the predicted token sequence and the image token sequence.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled with the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, are used to cause the at least one processor to implement the information prediction method provided by the present disclosure.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled with the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, are used to cause the at least one processor to implement the method of training the autonomous driving model provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions, where the computer instructions are used to cause a computer to implement the information prediction method or the method of training the autonomous driving model provided by the present disclosure.
According to another aspect of the present disclosure, there is provided an autonomous driving vehicle including the electronic device for implementing the information prediction method provided by the present disclosure.
It should be understood that the content described in this section is not intended to identify key or important features of embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will be easily understood through following specification.
The accompanying drawings are used for a better understanding of the present solution and do not constitute a limitation of the present disclosure.
Exemplary embodiments of the present disclosure will be described below with reference to the accompanying drawings, including various details of embodiments of the present disclosure to facilitate understanding, and they should be considered merely exemplary. Therefore, those skilled in the art should recognize that various changes and modifications may be made to embodiments described herein without departing from the scope and spirit of the present disclosure. Similarly, for clarity and conciseness, descriptions of well-known functions and structures have been omitted in the following description.
A main challenge faced by the end-to-end autonomous driving system is how to establish robust environmental perception and representation. For example, the ability to establish robust environmental perception and representation may be improved by introducing an aided perception task. Specifically, existing autonomous driving systems usually rely on multi-task aided planning, which improves the environmental representation by introducing an aided supervision signal. However, these methods often require accurate perception annotations that are expensive and difficult to obtain; that is, accurately annotated perception data is required so that the annotated data may serve as the aided supervision signal. The annotation process is therefore time-consuming and labor-intensive, which increases the deployment cost and limits the scalability and practical applicability of the autonomous driving model to a certain extent.
In order to solve the problems existing in the related technology, the present disclosure provides an information prediction method, an information prediction apparatus, a method of training an autonomous driving model, an apparatus of training an autonomous driving model, a device, a medium, a program product and an autonomous driving vehicle, which achieve control information prediction without relying on large-scale accurately annotated data. The application scenarios of the method and the apparatus provided by the present disclosure will first be described below in conjunction with the accompanying drawings.
As shown in the figure, the application scenario of this embodiment may include a vehicle 110 and a server 120. The vehicle 110 may be provided with an autonomous driving system and may communicate with the server 120.
In an embodiment, the vehicle 110 may be further integrated with various types of sensors communicatively coupled to the autonomous driving system, such as a vision-based camera and a radar-based ranging sensor. The vision-based camera may include a monocular camera, a binocular stereo vision camera, a panoramic vision camera, and an infrared camera, etc. The radar-based ranging sensor may include, for example, a laser radar, a millimeter wave radar, an ultrasonic radar, etc. For example, the autonomous driving system may process data acquired by various types of sensors, so as to determine environmental information of the vehicle 110 and control information for the vehicle 110, so that the vehicle may drive according to the control information.
In an embodiment, the server 120 may provide an autonomous driving model 130 for the autonomous driving system, for example. The autonomous driving system may process the image data acquired by a sensor according to the autonomous driving model 130, and predict the control information for the vehicle in combination with the current driving data of the vehicle, so that the vehicle may automatically drive according to the control information. For example, the autonomous driving model 130 may take the prediction of control information as a primary task, take the image generation task as an auxiliary task, and synchronously output an image token sequence and predicted control information. By decoding the output image token sequence, an image may be generated.
In an embodiment, the server 120 may train the autonomous driving model 130 using the annotated data, so that the autonomous driving model has the ability to predict the control information. The server 120 may perform self-supervised training on the autonomous driving model 130 using the image data, so that the autonomous driving model has the ability to generate images, so as to execute the auxiliary task, and rely on the auxiliary task to improve the accuracy of the autonomous driving model 130 for executing the primary task.
In an embodiment, the autonomous driving system in the vehicle 110 may also send the data acquired by the sensor to the server 120, and the server 120 generates images and predicts control information. And then, the server 120 sends the predicted control information to the autonomous driving system, and the autonomous driving system controls the movement of the vehicle according to the control information.
It should be noted that the information prediction method provided by the present disclosure may be implemented by the vehicle or the autonomous driving system in the vehicle, or by the server 120. Correspondingly, the information prediction apparatus provided by the present disclosure may be provided in the vehicle or the autonomous driving system included in the vehicle, or in the server 120. The method of training the autonomous driving model provided by the present disclosure may be implemented by the server 120. Correspondingly, the apparatus of training the autonomous driving model provided by the present disclosure may be provided in the server 120.
It should be understood that the number and type of vehicles 110 and servers 120 shown in the figure are merely schematic. There may be any number and type of vehicles and servers according to actual needs.
The information prediction method provided by the present disclosure will be described in detail below with reference to the accompanying drawings.
As shown in the figure, the information prediction method of this embodiment may include operations S210 to S240.
In operation S210, perception data is acquired. The perception data includes at least image data acquired by a sensor in a vehicle and driving data of the vehicle.
According to embodiments of the present disclosure, the perception data may be acquired in real-time in a process of performing the information prediction method. The driving data of the vehicle may include one or more of navigation data at the current moment, speed at the current moment, and control information for the vehicle at the current moment. The control information may include one or more of an accelerator pedal angle, a brake pedal angle, a steering wheel rotation direction and a steering wheel rotation angle. The driving data may be acquired, for example, by the vehicle control system in the vehicle, or by the autonomous driving system in the vehicle, which is not limited in the present disclosure. The image data acquired by the sensor may include, for example, an environmental image at the current moment captured by a camera in the vehicle.
In an embodiment, the acquired perception data may include image data and driving data acquired at historical moments (such as one or more historical moments), in addition to image data and driving data acquired at the current moment. The image data at the historical moments and the image data at the current moment may form an image data sequence, and the driving data at the historical moments and the driving data at the current moment may form a driving data sequence.
In operation S220, the image data is encoded to obtain an image token sequence corresponding to the image data.
According to embodiments of the present disclosure, for example, the environmental image may be partitioned to obtain a plurality of image patches. Each image patch may be encoded into a token. By encoding the plurality of image patches obtained through partitioning, a token sequence may be obtained, which is referred to as an image token sequence. For example, a two-dimensional convolutional layer may be used to encode each image patch to obtain a token.
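As a non-limiting illustration of the patch-wise encoding described above, the following sketch (in PyTorch; the patch size, channel sizes and class name are assumptions for illustration only, not part of the present disclosure) encodes each non-overlapping image patch into a token embedding with a single strided two-dimensional convolutional layer:

```python
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """Minimal sketch: split an image into patches and encode each patch into a token embedding
    using one 2D convolution whose stride equals the (assumed) patch size."""
    def __init__(self, patch_size: int = 16, in_channels: int = 3, embed_dim: int = 256):
        super().__init__()
        # kernel_size == stride == patch_size, so every output position corresponds
        # to exactly one non-overlapping image patch.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W) -> feature map: (B, embed_dim, H/patch, W/patch)
        feats = self.proj(image)
        # Flatten the spatial grid into a token sequence: (B, num_patches, embed_dim)
        return feats.flatten(2).transpose(1, 2)

tokens = PatchTokenizer()(torch.randn(1, 3, 256, 256))  # -> (1, 256, 256): 256 patch tokens
```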
When an image data sequence is acquired, each image data in the image data sequence may be encoded to obtain a token sequence, resulting in a total of a plurality of token sequences. Then, according to an acquisition order of image data, the plurality of token sequences are concatenated to obtain the image token sequence corresponding to the image data sequence.
In operation S230, the driving data is encoded to obtain a driving feature corresponding to the driving data.
For example, the driving data may be represented by a string of numbers, which may include a speed of the vehicle, two-dimensional coordinates of a target point in a navigation path, the pedal angle of the vehicle accelerator, and so on. For example, the driving data may be encoded by using an embedding layer, so as to map the driving data into a vector to obtain the driving feature.
When a driving data sequence is acquired, the driving data at each moment in the driving data sequence may be encoded to obtain a feature. A plurality of features are obtained in total for the driving data sequence. And then, according to the acquisition order of driving data, the plurality of features are concatenated to obtain the driving feature corresponding to the driving data sequence.
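A minimal sketch of this per-moment encoding and concatenation, assuming each driving-data record has already been arranged as a fixed-length numeric vector (the vector length and layer sizes below are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Assumed: each driving-data record (speed, target-point coordinates, pedal angle, ...) is
# arranged as a vector of 8 numbers; layer sizes are illustrative.
driving_encoder = nn.Sequential(nn.Linear(8, 128), nn.ReLU(), nn.Linear(128, 128))

driving_sequence = torch.randn(5, 8)                      # driving data at 5 consecutive moments
per_moment_features = driving_encoder(driving_sequence)   # one 128-dim feature per moment
driving_feature = per_moment_features.reshape(-1)         # concatenated in acquisition order
```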
In operation S240, based on the driving feature and the image token sequence, a predicted token sequence corresponding to the image token sequence and control information for the vehicle are generated using a generative model.
For example, the driving feature and the image token sequence may be concatenated to obtain an input feature, and the input feature may be input into the generative model to obtain the predicted token sequence and the control information. The control information may refer to a control signal of the vehicle, which may include at least one of: an accelerator pedal angle, a brake pedal angle, a steering wheel rotation direction, or a steering wheel rotation angle.
In this embodiment, the generative model is a model that executes multiple tasks. The multiple tasks executed include an image generation task and a control information prediction task. The image generation task may be an image reconstruction task or an image prediction task, and the image generation task is an auxiliary task relative to the control information prediction task.
For example, the generative model may be Variational Auto Encoders (VAEs), Diffusion Models, or Autoregressive Models, etc. The autoregressive model may be, for example, a model based on a Transformer architecture, which is not limited in the present disclosure.
In an embodiment, the image token sequence is set as a sequence X_t = (x_{t,1}, x_{t,2}, ..., x_{t,n}) corresponding to the image data at the tth moment. The predicted token sequence generated by the generative model may be, for example, a predicted sequence X_t' = (x'_{t,1}, x'_{t,2}, ..., x'_{t,n}) corresponding to the image data at the tth moment, or a predicted sequence X_{t+1} = (x_{t+1,1}, x_{t+1,2}, ..., x_{t+1,n}) corresponding to the image data at the (t+1)th moment.
After receiving the control information, the autonomous driving system in the vehicle may, for example, determine a driving parameter according to the control information, and control the vehicle driving according to the driving parameter.
In embodiments of the present disclosure, by introducing an auxiliary task of predicting the image token sequence to predict the control information for the vehicle, robust environmental perception and representation may be established in the process of predicting the control information. As the auxiliary task may be executed with the acquired image as a supervision signal, there is no need to rely on precise perceptual annotation, which may effectively reduce the difficulty of executing the auxiliary task, facilitate the wide deployment of the control information prediction task, and improve the scalability and practical applicability of the control information prediction task.
In an embodiment, in the process of encoding the image data into an image token sequence, some fine-grained information in the image may be lost, such as association information between image patches corresponding to two tokens and information about small-scale objects such as traffic lights, which affects the prediction accuracy of the control information. In order to further ensure the prediction accuracy of the control information and enable the auxiliary task to better assist the execution of the primary task (predicting the control information), this embodiment may extract an image feature of the image data, that is, extract a feature by taking the entire image data as a unit, and take the extracted feature as a part of the input data for the generative model, while still encoding the image data to obtain the image token sequence.
As shown in the figure, in this embodiment, the image data may be encoded to obtain an image token sequence 302 corresponding to the image data, a first convolutional network 312 may be used to extract an image feature of the image data to obtain a first feature vector 303 representing the image feature, and a driving data encoding network 313 may be used to encode the driving data to obtain a driving feature 305.
After obtaining the driving feature 305, the first feature vector 303 and the image token sequence 302, a generative model 320 may be used to execute the auxiliary task and the primary task based on these features, that is, to generate a predicted token sequence 306 and control information 307 for the vehicle. For example, the driving feature 305, the first feature vector 303 and the image token sequence 302 may be concatenated as an input feature, which is input into the generative model 320, and the predicted token sequence 306 and the control information 307 are output by the generative model 320.
In an embodiment, the generative model 320 may be an autoregressive model. When executing the task, the driving feature 305 and the first feature vector 303 may be concatenated and input into the generative model 320, and the generative model may predict to obtain a first token in the token sequence 306. Then, the driving feature 305, the first feature vector 303 and the first image token in the image token sequence 302 may be concatenated and input into the generative model 320, and the generative model may predict to obtain a second token in the token sequence 306. Following the same pattern, the token sequence 306 may be generated. Finally, the driving feature 305, the first feature vector 303 and the image token sequence 302 are concatenated and input into the generative model 320, and the generative model 320 may predict to obtain the control information 307.
In an embodiment, when the generative model 320 predicts a single token, the generated data is a probability vector, which includes a probability for each of a plurality of candidate tokens. This embodiment may use the token with the highest probability in the probability vector as the predicted token. The token corresponding to an image patch may be referred to as a visual vocabulary entry, and the number of probability values included in the probability vector may be equal to the number of predetermined visual vocabulary entries.
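The token-by-token prediction and the final control-information query described above may be organized as in the following sketch; the `generative_model` interface (returning per-step logits over the visual vocabulary and accepting a hypothetical `query_control` flag) is assumed for illustration and is not part of the present disclosure:

```python
import torch

def autoregressive_generate(generative_model, prefix_features, image_tokens):
    """Sketch of the decoding loop described above: the prefix (driving feature and first
    feature vector) is always provided, ground-truth image tokens are appended one by one,
    and each step's probability vector is reduced to its most likely visual-vocabulary token."""
    predicted_tokens = []
    context = list(image_tokens)
    for step in range(len(context)):
        logits = generative_model(prefix_features, context[:step])   # assumed interface
        probs = torch.softmax(logits, dim=-1)          # probability over the K visual words
        predicted_tokens.append(int(probs.argmax()))   # token with the highest probability
    # After the full image token sequence has been consumed, query the control information.
    control_information = generative_model(prefix_features, context, query_control=True)
    return predicted_tokens, control_information
```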
In an embodiment, the driving data encoding network 313 may, for example, adopt an embedding layer structure. This embodiment may also use any non-serialized network to replace the first convolutional network 312. The first convolutional network 312 may, for example, adopt network architectures such as VGG, ResNet, MobileNet, etc., and may adopt lightweight convolutional network architectures, etc. The present disclosure does not limit this.
The function of the first convolutional network 312 is to map the image data into an image feature. For example, the first feature vector may be obtained by flattening the image feature, which has a spatial resolution of 1/2^m of that of the original image.
In an embodiment, the driving data may include at least two parts of data: the driving parameter of the vehicle and the navigation data of the vehicle. This embodiment may use different networks to encode different parts of the driving data, so as to improve the expression ability of the encoded driving feature. This is because using different encoding principles for different types of data may improve encoding accuracy.
For example, navigation data may be a navigation map, that is, the navigation data may be image modal data. This embodiment may use the convolutional network to encode navigation data, so as to obtain a second feature vector representing the navigation data. For example, the driving parameter may include at least one of the following parameters: a vehicle speed, a scalar acceleration, a rotation angle, etc. This embodiment may use a fully connected layer or a multi-layer perceptron to encode the driving parameter, thereby obtaining a third feature vector representing the driving parameter. Correspondingly, the aforementioned driving feature includes the second feature vector and the third feature vector in this embodiment.
As shown in the figure, in an embodiment 400, the navigation data of the vehicle may be determined according to a navigation map 401 of the vehicle.
In other words, in an embodiment 400, the positions of at least two target points on the navigation path may be extracted from the navigation map 401 as the navigation data 402, thereby reducing the interference of information other than the navigation path in the navigation map 401 on the prediction task, and improving the prediction accuracy of control information.
For example, in the embodiment 400, after obtaining the navigation data 402, a mask image 403 representing the path formed by the at least two target points may be generated based on the positions of the at least two target points. Converting to a mask image facilitates using the convolutional network to extract features of the navigation path. After obtaining the mask image 403, the convolutional network 410 may be used to encode the mask image 403, thereby obtaining the second feature vector 404 representing the navigation data.
The convolutional network 410 that encodes the mask image 403 may be, for example, ResNet series network, VGG series network, etc. The present disclosure does not limit this. This embodiment may input the mask image 403 into the convolutional network 410, and an image feature of tensor form is output by the convolutional network. By unfolding the image feature, the second feature vector 404 may be obtained.
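A possible sketch of this step is shown below; the rasterization resolution, the assumption that target-point coordinates are normalized to [0, 1], and the encoder architecture are illustrative only and do not limit the present disclosure:

```python
import numpy as np
import torch
import torch.nn as nn

def waypoints_to_mask(points, size=128):
    """Sketch: rasterize target points on the navigation path into a single-channel mask image.
    Coordinates are assumed to be normalized to [0, 1]; a fuller version might also rasterize
    the segments connecting consecutive target points."""
    mask = np.zeros((size, size), dtype=np.float32)
    for x, y in points:
        mask[int(y * (size - 1)), int(x * (size - 1))] = 1.0   # mark each target point
    return torch.from_numpy(mask)[None, None]                   # shape (1, 1, size, size)

# Assumed lightweight convolutional encoder for the mask image (architecture is illustrative).
nav_encoder = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(4), nn.Flatten(),                      # unfold the feature into a vector
)

second_feature_vector = nav_encoder(waypoints_to_mask([(0.5, 0.9), (0.52, 0.6), (0.55, 0.3)]))
```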
In an embodiment, for example, in the process of encoding the image data to obtain an image token sequence, quantization processing may be performed, so as to discretize the obtained image token sequence, thereby facilitating probability calculation in the prediction task.
For example, in an embodiment 500, the encoder 510 may be used to encode the image data 501, so as to map pixel values of the image data to a more efficient representation space and achieve the purpose of compressing information, thereby improving the computational efficiency of the generative model. By using the encoder 510 for encoding, an encoded feature sequence 502 may be obtained. Subsequently, the quantizer 520 is used to quantize the encoded feature sequence 502, so that the continuous representation of the encoded feature sequence is mapped to integer values for a discretized representation. Through the quantization processing, the image token sequence 503 is obtained.
In an embodiment, the input of the encoder 510 may be a sequence of pixels obtained by partitioning the image. For example, the encoder 510 may adopt a two-dimensional convolutional layer structure. For example, if the processing of pixels by the encoder is represented by a function ε(·), the following equation (1) may be used to obtain the encoded feature sequence f 502: f = ε(I). It should be noted that, in the actual processing, the variable input to the function ε(·) is the pixel sequence obtained by partitioning the image I.
In an embodiment, the quantizer 520 may be, for example, a quantization layer that has pre-learned a codebook e ∈ R^{K×C}, where the codebook contains K vectors, K is the number of predetermined visual vocabulary entries, and C is the embedding dimension of each predetermined visual vocabulary entry. The quantizer 520 may use a quantization function Q(·) to perform quantization processing on the encoded feature sequence f 502. For example, for the encoded feature f(i,j) corresponding to the ith row and the jth column, the following equation (2) may be used for quantization processing to obtain a quantized feature x(i,j) corresponding to the encoded feature f(i,j), and the quantized feature x(i,j) serves as an image token: x(i,j) = argmin_{k∈[K]} ||f(i,j) − e(k)||₂. In this way, each encoded feature may be mapped to the vector in the codebook e that is closest to the encoded feature, that is, x(i,j) ∈ [K], where e(k) represents the kth vector of the codebook e.
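The nearest-codebook lookup of equation (2) may be sketched as follows; the feature-map size, codebook size and function name are illustrative assumptions:

```python
import torch

def vector_quantize(encoded_features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Sketch of the quantization in equation (2): map each encoded feature f(i, j) to the index
    of its nearest codebook vector e(k).
    encoded_features: (h, w, C) feature map from the encoder; codebook: (K, C)."""
    flat = encoded_features.reshape(-1, encoded_features.shape[-1])   # (h*w, C)
    distances = torch.cdist(flat, codebook)                           # (h*w, K) pairwise L2 distances
    tokens = distances.argmin(dim=-1)                                 # index of the closest codebook vector
    return tokens.reshape(encoded_features.shape[:2])                 # (h, w) grid of image tokens

image_tokens = vector_quantize(torch.randn(14, 14, 64), torch.randn(8192, 64))
```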
In an embodiment, the step of encoding image data to obtain an image token sequence may be implemented, for example, using data compression techniques such as Vector Quantization (VQ). For example, the encoder in the pre-trained Vector Quantized Generative Adversarial Network (VQGAN) or the encoder in the Vector Quantized Variational Auto Encoder (VQVAE) may be used to encode the image data, so as to map the image data to a discretized vector representation in a VQ space. The present disclosure does not limit this.
In an embodiment, the following equation (3) may be used to encode the image I_t at the tth moment to obtain a discretized image token sequence: x_t = q(I_t) = Q(ε(I_t)) ∈ [K]^n. The encoding network q(·) = Q(ε(·)) includes the encoder and the quantizer, and the encoding network exhibits, for example, a spatial down-sampling rate of 16. Here, n = h×w, where h and w are the height and the width of the down-sampled feature map of the image I, respectively. In this way, after processing by the encoding network, the bit compression ratio of the image data I is 16×16×3×8/log₂K, where 3 is the number of channels per pixel, and 8 is the number of bits occupied by each channel of a pixel.
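For instance, assuming an illustrative codebook size of K = 8192 (a value used here only for illustration and not part of the present disclosure), the bit compression ratio evaluates to 16 × 16 × 3 × 8 / log₂ 8192 = 6144 / 13 ≈ 472.6; that is, each 16×16 image patch, which originally occupies 6144 bits, is represented by a single token of about 13 bits.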
According to embodiments of the present disclosure, when using a generative model to execute prediction tasks, tags may be added into the sequence input to the generative model, so as to prompt the generative model to execute the auxiliary task and the primary task.
For example, if the auxiliary task is a task of regenerating the input image token sequence, that is, if the input image token sequence represents the image data at the tth moment, the auxiliary task may be a task of predicting the token sequence representing the image data at the tth moment. Then, a start tag may be added at a head position of the image token sequence obtained through the above embodiments, so that the generative model may predict the first token in the token sequence representing the image data at the tth moment based on this start tag. The start tag may be, for example, a pre-defined tag such as <c>, which is not limited in the present disclosure.
For example, a query tag may be added at a tail position of the image token sequence, so as to prompt the generative model that the image token sequence has been fully input and that the primary task, that is, predicting the control information for the vehicle, may be started. The query tag may be, for example, a pre-defined tag such as <a>. The present disclosure does not limit this.
In an embodiment, the tags may be added at the head position and the tail position of the image token sequence, so as to obtain the tagged token sequence. Then, for example, the driving feature obtained from the above embodiments may be concatenated with the tagged token sequence to obtain an input sequence of the generative model. And then, by inputting the input sequence into the generative model, the generative model may generate a predicted token sequence (i.e., a token sequence representing the image data at the tth moment) and control information.
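One possible way to assemble this input sequence is sketched below; the special-token ids, embedding dimension and function name are illustrative assumptions rather than part of the present disclosure:

```python
import torch
import torch.nn as nn

# Assumed special-token ids for the start tag <c> and the query tag <a> (illustrative values
# placed just after an assumed visual vocabulary of K = 8192 entries).
START_TAG_C, QUERY_TAG_A = 8192, 8193

def build_input_sequence(driving_feature: torch.Tensor, image_tokens: torch.Tensor,
                         token_embedding: nn.Embedding) -> torch.Tensor:
    """Sketch: add <c> at the head and <a> at the tail of the image token sequence,
    embed the tagged tokens, and prepend the driving feature to form the model input."""
    tagged = torch.cat(
        [torch.tensor([START_TAG_C]), image_tokens, torch.tensor([QUERY_TAG_A])])
    tagged_embeddings = token_embedding(tagged)                    # (n + 2, embed_dim)
    return torch.cat([driving_feature, tagged_embeddings], dim=0)  # driving feature comes first

sequence = build_input_sequence(torch.randn(2, 256), torch.randint(0, 8192, (196,)),
                                nn.Embedding(8194, 256))
```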
In an embodiment, when executing the auxiliary task and the primary task, for example, the first feature vector representing the image feature obtained from the above embodiments may also be considered.
In an embodiment, the driving data may include the driving parameter and navigation data described above. The second convolutional network may be used to encode the navigation data, and MLP may be used to encode the driving parameter.
For example, as shown in the figure, in an embodiment 600, the perception data may include image data 601, a driving parameter 602 and navigation data 603, and an autonomous driving model may be used to process the perception data, so as to predict the control information for the vehicle.
The autonomous driving model may include an encoding layer and a generative model 620. The encoding layer may include networks for encoding the image data 601, the driving parameter 602 and the navigation data 603, respectively.
In an embodiment, the encoding layer includes a sequence encoding network 611 for encoding the image data 601 to obtain the image token sequence 604 corresponding to the image data 601.
In an embodiment, the encoding layer may further include a driving data encoding network. The driving data encoding network is used to encode the driving data to obtain the driving feature. In an embodiment, the driving data may include the driving parameter 602 and the navigation data 603. The driving data encoding network may include a multi-layer perceptron MLP 612 and a second convolutional network 613. The multi-layer perceptron MLP 612 is used to encode the driving parameter 602 to obtain a third feature vector 605. The second convolutional network 613 is used to encode the navigation data 603 to obtain a second feature vector 606.
For example, taking a driving parameter including a speed at the tth moment as an example, the multi-layer perceptron MLP 612 may, for example, encode the driving parameter using the following equation (4) to obtain the third feature vector v_t 605: v_t = s(speed), where speed denotes the vehicle speed at the tth moment and s is the function used by the multi-layer perceptron.
For example, assuming that the navigation data includes the positions of at least two target points on the navigation path, and taking the principle described in the above embodiment 400 as an example, the second convolutional network 613 may use the following equation (5) to obtain the second feature vector r_t: r_t = r(M_t), where M_t is a mask image representing the path formed by the at least two target points, generated based on the positions of the at least two target points, and r is the function used by the second convolutional network 613.
In an embodiment, the encoding layer may further include a first convolutional network 614 for extracting the image feature of the image data 601 to obtain a first feature vector 607 representing the image feature. The first convolutional network 614 is a non-quantized branch that may be used to extract driving-related information from the image data 601. For example, the first convolutional network 614 may adopt a lightweight convolutional network c, so as to obtain an image feature of the image data with a spatial resolution of 1/64 of that of the original image, and the image feature may be flattened and used as prompt information to be considered when predicting the control information. For example, for the image data I_t at the tth moment, the first convolutional network 614 may use the following equation (6) to obtain the first feature vector c_t 607: c_t = Flatten(c(I_t)), where h_c and w_c represent the height and the width of the image feature at the spatial resolution of 1/64, respectively.
In an embodiment, in the process of encoding the perception data, learnable position information may be incorporated. The learnable position information is similar to the position embedding added to tokens when they are input into a Transformer architecture.
In an embodiment, after encoding the perception data, for example, the input sequence of the generative model 620 may be obtained based on each vector and sequence obtained by encoding.
For example, a start tag <c> may be added at the head position of the image token sequence 604, and a query tag <a> may be added at the tail position of the image token sequence 604, so as to obtain the tagged token sequence. Then, the obtained vectors and sequence may be sequentially input into the generative model in an order of (v_t, r_t, c_t, <c>, x_t, <a>), where the tth moment may refer to the current moment.
In an embodiment, the generative model 620 may, for example, use an autoregressive model, because the autoregressive model has the flexibility to accept diverse prompts, and such series models (such as the generative pre-training models GPT, etc.) have good scalability. In this embodiment, (v_t, r_t, c_t, <c>) may serve as the input sequence input into the generative model 620, and the generative model may obtain the probability vector of the first token in the predicted token sequence 608 by predicting based on the input sequence. This embodiment may use the token corresponding to the maximum probability value in the probability vector as the first token in the predicted token sequence. Then, this embodiment may input (v_t, r_t, c_t, <c>) and the first image token in x_t as the input sequence into the generative model 620, and the generative model may obtain the probability vector of the second token in the predicted token sequence 608 by predicting. This embodiment may use the token corresponding to the maximum probability value in the probability vector as the second token in the predicted token sequence. Following the same pattern, (v_t, r_t, c_t, <c>) and all image tokens other than the last image token in x_t may be input into the generative model, so as to obtain the last token in the predicted token sequence 608. Finally, this embodiment may input (v_t, r_t, c_t, <c>, x_t, <a>) as an input sequence into the generative model, and the generative model may obtain the control information 609 for the vehicle at the (t+1)th moment by predicting.
Through embodiments of the present disclosure, with the help of an autonomous driving model, end-to-end prediction of control information may be achieved. Moreover, as the image generation task is used as an auxiliary task to guide the implementation of the primary task, there is no need to rely on complex perception tasks and expensive perception annotations, and it is possible to provide a low-cost and scalable solution for end-to-end autonomous driving technology.
In order to facilitate the implementation of the information prediction method in embodiments of the present disclosure, the present disclosure further provides a method of training an autonomous driving model, which will be described in detail below with reference to the accompanying drawings.
As shown in the figure, the method of training the autonomous driving model of this embodiment may include operations S710 to S740.
In operation S710, a sequence encoding network is used to encode image data in sample perception data, so as to obtain an image token sequence corresponding to the image data.
According to embodiments of the present disclosure, the sample perception data includes image data acquired at historical moments and driving data of the vehicle at the moment when the image data is acquired.
The sequence encoding network may include, for example, a structure for partitioning an image and a two-dimensional convolutional layer, so as to encode the image data in the sample perception data using a principle similar to those described in the above operation S220 to obtain an image token sequence. In an embodiment, the sequence encoding network may use vector quantization compression technology to encode the image data.
In operation S720, a driving data encoding network is used to encode driving data in the sample perception data, so as to obtain a driving feature corresponding to the driving data.
According to embodiments of the present disclosure, the implementation principle of operation S720 is similar to the implementation principle of operation S220 described above, which will not be repeated here. The driving data encoding network may adopt an embedding layer structure, or as described in the above embodiment 600, which may include an MLP and a convolutional network.
In operation S730, based on the driving feature and the image token sequence, a generative model is used to generate a predicted token sequence corresponding to the image token sequence and a predicted control information for the vehicle.
According to embodiments of the present disclosure, the implementation principle of operation S730 is similar to the implementation principle of operation S240 described above. It should be noted that, in a case that the generative model is an autoregressive model, the generative model generates a series of probability vectors that correspond one-to-one with the tokens in the predicted token sequence. In this embodiment, the token corresponding to the maximum probability value in a probability vector may be used as the token corresponding to that probability vector in the predicted token sequence.
In operation S740, the autonomous driving model is trained according to the predicted token sequence and the image token sequence.
For example, this embodiment may determine the loss value of the autonomous driving model based on a difference between the predicted token sequence and the image token sequence. Then, the autonomous driving model is trained with the goal of minimizing the loss value.
In an embodiment, in a case that the generative model generates a series of probability vectors, the loss value of the autonomous driving model may be determined according to the probability value, in the ith probability vector, that corresponds to the ith image token in the image token sequence. For example, a classification loss function may be used to calculate the loss value.
For example, for an embodiment in which the generative model is an autoregressive model, considering that the principle of autoregressive modeling is that, given a series of discrete tags, the probability of generating the ith discrete tag depends on all tags located before the ith discrete tag, and taking the cross-entropy loss function as an example, this embodiment may use the following equation (7) to determine the loss value L_gen of the autonomous driving model based on the probability vectors:

L_gen = −Σ_{i=1}^{N} log P_{θ,θg}(x_{t,i} | x_{t,<i})    (7)

where P_{θ,θg}(x_{t,i} | x_{t,<i}) is the probability value for the ith image token x_{t,i} in the ith probability vector generated for the image at the tth moment. It should be noted that this probability value is also related to the driving data at the tth moment, which is not reflected in equation (7). N is the number of image tokens in the image token sequence.
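A sketch of the loss in equation (7), assuming the generative model outputs one vector of logits per predicted token (names and sizes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def token_generation_loss(predicted_logits: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
    """Sketch of equation (7): the i-th probability vector is supervised by the i-th image token,
    and the negative log-probabilities are summed over the N tokens of the sequence.
    predicted_logits: (N, K) one probability vector (as logits) per predicted token;
    image_tokens: (N,) ground-truth token indices from the sequence encoding network."""
    return F.cross_entropy(predicted_logits, image_tokens, reduction="sum")

loss_gen = token_generation_loss(torch.randn(196, 8192), torch.randint(0, 8192, (196,)))
```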
Through this embodiment, since the image data itself in the sample perception data is used as the supervision signal, there is no need for precisely annotated perception data, and therefore self-supervised training of the autonomous driving model may be achieved.
In an embodiment, when training the autonomous driving model, for example, the network parameters included in the sequence encoding network in the autonomous driving model may not be adjusted, so as to improve the stability of the image token sequence as the supervision signal, which is conducive to improving the training accuracy and training efficiency of the autonomous driving model.
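One way to keep the sequence encoding network fixed during training is sketched below; the attribute name `sequence_encoding_network` and the optimizer settings are assumptions for illustration only:

```python
import torch
import torch.nn as nn

def build_optimizer(autonomous_driving_model: nn.Module, lr: float = 1e-4) -> torch.optim.Optimizer:
    """Sketch: freeze the sequence encoding network (assumed attribute name) so the image token
    sequence used as the supervision signal stays stable, and optimize the remaining parameters."""
    for param in autonomous_driving_model.sequence_encoding_network.parameters():
        param.requires_grad = False   # tokenizer weights are not adjusted during training
    trainable = [p for p in autonomous_driving_model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```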
In an embodiment, the sample perception data may include, for example, perception data at several moments before the tth moment in addition to the perception data at the tth moment. For perception data at each moment, the image token sequence and the driving feature may be obtained in a similar manner. When the generative model obtains the predicted token sequence and control information at the (t+1)th moment by predicting, it is possible to simultaneously consider the perception data at several moments before the tth moment, so as to improve prediction accuracy.
For example, if the sample perception data includes perception data from the first moment to the Tth moment among historical moments, in the process of training the autonomous driving model, a predicted token sequence corresponding to the image data at the first moment may be obtained by predicting according to the perception data at the first moment. According to this predicted token sequence, a loss value may be obtained. Then, according to the perception data at the first moment and the perception data at the second moment, a predicted token sequence corresponding to the image data at the second moment is obtained by predicting. According to this predicted token sequence, a loss value may be obtained. Following the same pattern, T loss values may be obtained, and in this embodiment, a sum of the T loss values may be used as the loss value for the token sequence generated by the autonomous driving model. The autonomous driving model is trained with the goal of minimizing the loss value.
In an embodiment, a supervision signal may be added to the primary task (predicting control information) executed by the autonomous driving model. For example, the sample perception data may further include real control information. For example, if the sample perception data includes perception data at the tth moment, the perception data at the tth moment may have a label indicating the real control information at the (t+1)th moment. This embodiment may also consider a difference between the real control information and the predicted control information when training the autonomous driving model. For example, a regression loss function may be used to determine the loss value of the autonomous driving model for generating the predicted control information. For example, if the real control information is set as a_{t+1} and P_{θ,θa}(x_{≤t}) is the predicted control information, this embodiment may use the following equation (8) to calculate the loss value L_action for generating the predicted control information:

L_action = ||a_{t+1} − P_{θ,θa}(x_{≤t})||    (8)

It should be noted that the predicted control information P_{θ,θa}(x_{≤t}) is also related to the driving data at the tth moment, as well as the image data and the driving data before the tth moment, which are not reflected in the expression here. The regression loss function may include, for example, an L1 loss function, an L2 loss function, etc., which are not limited in the present disclosure.
For example, if the sample perception data includes perception data from the first moment to the Tth moment among historical moments, in the process of training the autonomous driving model, the control information at the second moment may be obtained by predicting according to the perception data at the first moment. According to the control information obtained by predicting and the real control information at the second moment, a loss value for generating control information may be obtained. Then, according to the perception data at the first moment and the perception data at the second moment, the control information at the third moment is obtained by predicting. According to the predicted control information and the real control information at the third moment, a loss value for generating control information may be obtained. Following the same pattern, T loss values may be obtained, and in this embodiment, a sum of T loss values may be used as the loss value of the autonomous driving model for generating the control information. The autonomous driving model is trained with the goal of minimizing the loss value.
In an embodiment, both the loss value of the autonomous driving model for generating control information and the loss value of the autonomous driving model for generating the predicted token sequence may be considered. For example, a weighted sum of the two loss values may be used as the total loss value of the autonomous driving model. The autonomous driving model is trained with the goal of minimizing the total loss value. The weights used for weighting may be set as desired in practice. For example, considering that the task of generating the predicted token sequence is an auxiliary task, a smaller weight may be assigned to the loss value. The present disclosure does not limit this.
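A sketch of the weighted combination described above; the weight value is an illustrative assumption (a smaller weight is given to the auxiliary loss, as suggested above):

```python
import torch

def total_loss(loss_action: torch.Tensor, loss_gen: torch.Tensor,
               gen_weight: float = 0.1) -> torch.Tensor:
    """Sketch of the weighted sum described above: the auxiliary token-generation loss is given
    a smaller (assumed) weight than the primary control-information loss."""
    return loss_action + gen_weight * loss_gen

loss = total_loss(torch.tensor(0.8), torch.tensor(25.0))  # illustrative values only
```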
In an embodiment, the above encoding layer may include a first convolutional network. The first convolutional network is similar to the first convolutional network in the above embodiment 600. This embodiment may use a first convolutional network to extract the image feature of the image data to obtain a first feature vector representing the image feature. Then, based on the driving feature, the first feature vector and the image token sequence, the generative model is used to generate the predicted token sequence and the predicted control information.
In an embodiment, the driving data may include a historical driving parameter of the vehicle and historical navigation data of the vehicle. The driving data encoding network may include the second convolutional network and the multi-layer perceptron as described in the above embodiment 600. The second convolutional network may be used to encode the historical navigation data to obtain a second feature vector representing the navigation data. By using the multi-layer perceptron to encode the historical driving parameter, a third feature vector representing the driving parameter may be obtained. The driving feature described above include the second feature vector and the third feature vector.
In an embodiment, the sequence encoding network includes an encoder and a quantizer. The sequence encoding network may adopt a principle similar to the principle described in the above embodiment 500 to encode the image data, so as to obtain the image token sequence.
In an embodiment, the principle of generating the predicted token sequence and the predicted control information by the generative model is similar to the principle described in the above embodiment 600, which will not be repeated here.
In an embodiment, the method of training the autonomous driving model may train the autonomous driving model through two tasks: one is the autoregressive next-token generation task (i.e., the task of generating the predicted token sequence), and the other is the action prediction task of the planner (i.e., the task of predicting control information such as braking, throttle, steering wheel, etc.). Except for different output heads, the two tasks share most of the network parameters. In this way, when the shared parameters are optimized for the generation task, the model will learn the dependency relationships between the input tokens well, thereby establishing a good environmental representation and facilitating the implementation of the action prediction task.
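The parameter sharing described above may be sketched as follows; the Transformer-based backbone, layer counts and head dimensions are illustrative assumptions and do not limit the present disclosure (a full implementation would also apply a causal attention mask):

```python
import torch
import torch.nn as nn

class DrivingGenerativeModel(nn.Module):
    """Sketch: one shared autoregressive backbone with two task-specific output heads,
    matching the parameter-sharing described above (sizes are illustrative assumptions)."""
    def __init__(self, embed_dim: int = 256, vocab_size_k: int = 8192, action_dim: int = 3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)   # shared parameters
        self.token_head = nn.Linear(embed_dim, vocab_size_k)         # next-token generation head
        self.action_head = nn.Linear(embed_dim, action_dim)          # control-information head

    def forward(self, input_sequence: torch.Tensor):
        # input_sequence: (B, L, embed_dim); a causal mask would be applied in a full implementation.
        hidden = self.backbone(input_sequence)
        return self.token_head(hidden), self.action_head(hidden[:, -1])

token_logits, action = DrivingGenerativeModel()(torch.randn(1, 200, 256))
```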
Based on the information prediction method provided in the present disclosure, the present disclosure further provides an information prediction apparatus. The apparatus will be described in detail below in conjunction with the accompanying drawings.
As shown in the figure, the information prediction apparatus 800 of this embodiment may include a data acquisition module 810, a first encoding module 820, a second encoding module 830 and a first generation module 840.
The data acquisition module 810 is used to acquire perception data including image data acquired by a sensor in a vehicle and driving data of the vehicle. In an embodiment, the data acquisition module 810 may be used to perform the operation S210 described above, which will not be repeated here.
The first encoding module 820 is used to encode the image data to obtain an image token sequence corresponding to the image data. In an embodiment, the first encoding module 820 may be used to perform the operation S220 described above, which will not be repeated here.
The second encoding module 830 is used to encode the driving data to obtain a driving feature corresponding to the driving data. In an embodiment, the second encoding module 830 may be used to perform the operation S230 described above, which will not be repeated here.
The first generation module 840 is used to generate a predicted token sequence corresponding to the image token sequence and a control information for the vehicle based on the driving feature and the image token sequence by using a generative model. In an embodiment, the first generation module 840 may be used to perform the operation S240 described above, which will not be repeated here.
According to embodiments of the present disclosure, the image token sequence is a discrete feature of the image data. The information prediction apparatus 800 may further include a first feature extraction module used to extract an image feature of the image data using a first convolutional network to obtain a first feature vector representing the image feature. The first generation module 840 described above may be specifically used to generate the predicted token sequence and the control information using the generative model based on the driving feature, the first feature vector and the image token sequence.
According to embodiments of the present disclosure, the driving data includes a driving parameter of the vehicle and navigation data of the vehicle. The second encoding module 830 described above may include a first encoding submodule and a second encoding submodule. The first encoding submodule is used to encode the navigation data using a second convolutional network to obtain a second feature vector representing the navigation data. The second encoding submodule is used to encode the driving parameter using a multi-layer perceptron to obtain a third feature vector representing the driving parameter. The driving feature includes a second feature vector and a third feature vector.
According to embodiments of the present disclosure, the navigation data may include positions of at least two target points on a navigation path. The first encoding submodule mentioned above may include an image generation unit and an image encoding unit. The image generation unit is used to generate a mask image representing a path formed by the at least two target points based on the positions of the at least two target points. The image encoding unit is used to encode the mask image using the second convolutional network to obtain the second feature vector.
According to embodiments of the present disclosure, the first encoding module 820 may include a third encoding submodule and a first quantization submodule. The third encoding submodule is used to encode the image data using an encoder to obtain an encoded feature sequence. The first quantization submodule is used to quantize the encoded feature sequence using a quantizer to obtain the image token sequence.
According to embodiments of the present disclosure, the first generation module may include a first tag addition submodule, a first sequence acquisition submodule, and a first generation submodule. The first tag addition submodule is used to add a start tag at a head position of the image token sequence and add a query tag at a tail position of the image token sequence, so as to obtain a tagged token sequence. The first sequence acquisition submodule is used to obtain an input sequence of the generative model based on the driving feature and the tagged token sequence. The first generation submodule is used to input the input sequence into the generative model to obtain the predicted token sequence and the control information generated by the generative model.
Based on the method of training the autonomous driving model provided in the present disclosure, the present disclosure further provides an apparatus of training an autonomous driving model. The apparatus will be described in detail below in conjunction with the accompanying drawings.
As shown in the figure, the apparatus 900 of training the autonomous driving model of this embodiment may include a third encoding module 910, a fourth encoding module 920, a second generation module 930 and a training module 940.
The third encoding module 910 is used to encode the image data in the sample perception data using a sequence encoding network, so as to obtain an image token sequence corresponding to the image data. In an embodiment, the third encoding module 910 may be used to perform the operation S710 described above, which will not be repeated here.
The fourth encoding module 920 is used to encode the driving data in the sample perception data using a driving data encoding network, so as to obtain a driving feature corresponding to the driving data. In an embodiment, the fourth encoding module 920 may be used to perform the operation S720 described above, which will not be repeated here.
The second generation module 930 is used to generate a predicted token sequence corresponding to the image token sequence and a predicted control information for the vehicle using a generative model based on the driving feature and the image token sequence. In an embodiment, the second generation module 930 may be used to perform the operation S730 described above, which will not be repeated here.
The training module 940 is used to train the autonomous driving model based on the predicted token sequence and the image token sequence. In an embodiment, the training module 940 may be used to perform the operation S740 described above, which will not be repeated here.
According to embodiments of the present disclosure, the sample perception data further includes a real control information. The training module 940 may further be used to train the autonomous driving model based on a difference between the real control information and the predicted control information.
According to embodiments of the present disclosure, the training module 940 is specifically used to train other model structures in the autonomous driving model other than the sequence encoding network.
According to embodiments of the present disclosure, the encoding layer further includes a first convolutional network. The apparatus 900 of training the autonomous driving model may further include a second feature extraction module used to extract an image feature of the image data using the first convolutional network to obtain a first feature vector representing the image feature. The second generation module 930 described above is specifically used to generate the predicted token sequence and the predicted control information using the generative model based on the driving feature, the first feature vector and the image token sequence.
According to embodiments of the present disclosure, the driving data includes a historical driving parameter of the vehicle and historical navigation data of the vehicle, and the driving data encoding network includes a second convolutional network and a multi-layer perceptron. The fourth encoding module 920 mentioned above may include a fourth encoding submodule and a fifth encoding submodule. The fourth encoding submodule is used to encode the historical navigation data using the second convolutional network to obtain a second feature vector representing the navigation data. The fifth encoding submodule is used to encode the historical driving parameter using the multi-layer perceptron to obtain a third feature vector representing the driving parameter. The driving feature includes the second feature vector and the third feature vector.
According to embodiments of the present disclosure, the sequence encoding network includes an encoder and a quantizer. The third encoding module 910 mentioned above may include a sixth encoding submodule and a second quantization submodule. The sixth encoding submodule is used to encode the image data using the encoder to obtain an encoded feature sequence. The second quantization submodule is used to quantize the encoded feature sequence using the quantizer to obtain the image token sequence. The sequence encoding network processes the image data based on a vector quantization compression technology.
According to embodiments of the present disclosure, the second generation module 930 may include a second tag addition submodule, a second sequence acquisition submodule, and a second generation submodule. The second tag addition submodule is used to add a start tag at a head position of the image token sequence and add a query tag at a tail position of the image token sequence, so as to obtain a tagged token sequence. The second sequence acquisition submodule is used to obtain an input sequence of the generative model based on the driving feature and the tagged token sequence. The second generation submodule is used to input the input sequence into the generative model, so as to obtain the predicted token sequence and the predicted control information generated by the generative model. The generative model includes an autoregressive model.
It should be noted that collecting, storing, using, processing, transmitting, providing, disclosing, and applying etc. of the personal information of the user involved in the present disclosure all comply with the relevant laws and regulations, are protected by essential security measures, and do not violate the public order and morals. In the technical solution of the present disclosure, the user's authorization or consent is acquired before the user's personal information is acquired or collected.
According to embodiments of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium, and a computer program product.
As shown in the figure, the electronic device 1000 includes a computing unit 1001, which may perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 1002 or loaded from a storage unit 1008 into a random access memory (RAM) 1003.
Various components in the electronic device 1000 are connected to the I/O interface 1005, including: an input unit 1006, such as a keyboard, a mouse, etc.; an output unit 1007, such as various types of displays, speakers, etc.; a storage unit 1008, such as a magnetic disk, an optical disk, etc.; and a communication unit 1009, such as a network card, a modem, a wireless communication transceiver, etc. The communication unit 1009 allows the electronic device 1000 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The computing unit 1001 may be various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 1001 include but are not limited to a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, and so on. The computing unit 1001 may perform the various methods and processes described above, such as the information prediction method. For example, in some embodiments, the information prediction method may be implemented as a computer software program that is tangibly contained on a machine-readable medium, such as a storage unit 1008. In some embodiments, part or all of a computer program may be loaded and/or installed on the electronic device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the information prediction method described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the information prediction method in any other appropriate way (for example, by means of firmware).
Various embodiments of the systems and technologies described herein may be implemented in a digital electronic circuit system, an integrated circuit system, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), an application specific standard product (ASSP), a system on chip (SOC), a complex programmable logic device (CPLD), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may be implemented by one or more computer programs executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be a dedicated or general-purpose programmable processor, which may receive data and instructions from a storage system, at least one input device and at least one output device, and may transmit the data and instructions to the storage system, the at least one input device, and the at least one output device.
Program codes for implementing the method of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or a controller of a general-purpose computer, a special-purpose computer, or other programmable data processing devices, so that when the program codes are executed by the processor or the controller, the functions/operations specified in the flowchart and/or block diagram are implemented. The program codes may be executed entirely on the machine, partly on the machine, partly on the machine and partly on a remote machine as a stand-alone software package, or entirely on the remote machine or server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store programs for use by or in combination with an instruction execution system, device or apparatus. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared or semiconductor systems, devices or apparatuses, or any suitable combination of the above. More specific examples of the machine-readable storage medium may include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
In order to provide interaction with users, the systems and techniques described here may be implemented on a computer including a display device (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user may provide input to the computer. Other types of devices may also be used to provide interaction with users. For example, feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input or tactile input).
The systems and technologies described herein may be implemented in a computing system including back-end components (for example, a data server), or a computing system including middleware components (for example, an application server), or a computing system including front-end components (for example, a user computer having a graphical user interface or web browser through which the user may interact with an implementation of the systems and technologies described herein), or a computing system including any combination of such back-end components, middleware components or front-end components. The components of the system may be connected to each other through any form or medium of digital data communication (for example, a communication network). Examples of the communication network include a local area network (LAN), a wide area network (WAN), and the Internet.
The computer system may include a client and a server. The client and the server are generally remote from each other and usually interact through a communication network. The relationship between the client and the server is generated by computer programs running on the respective computers and having a client-server relationship with each other. The server may be a cloud server, also referred to as a cloud computing server or a cloud host, which is a host product in a cloud computing service system that overcomes the defects of difficult management and weak business scalability existing in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
The present disclosure further provides an autonomous driving vehicle, which includes the electronic device for implementing the information prediction method provided by the present disclosure.
In an embodiment, the autonomous driving vehicle may further include a sensor for acquiring image data, and the electronic device may predict control information at a next moment based on the image data and driving data of the autonomous driving vehicle, so as to control the driving of the autonomous driving vehicle based on the control information.
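By way of illustration only, the in-vehicle prediction loop implied by this embodiment might look like the following sketch; the camera, vehicle-bus, and model interfaces are hypothetical placeholders rather than components defined by the present disclosure.

```python
# A purely illustrative sketch of the prediction loop described above; the
# camera, vehicle-bus and model interfaces are hypothetical placeholders.
def driving_loop(camera, vehicle_bus, autonomous_driving_model):
    while vehicle_bus.autonomous_mode_enabled():
        image = camera.capture()                         # image data from the sensor
        driving_data = vehicle_bus.read_driving_data()   # historical parameters and navigation data
        # Predict the control information for the next moment.
        control = autonomous_driving_model.predict(image, driving_data)
        vehicle_bus.apply_control(control)               # e.g. throttle, brake, steering
```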
It should be understood that steps of the processes illustrated above may be reordered, added or deleted in various manners. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as a desired result of the technical solution of the present disclosure may be achieved. This is not limited in the present disclosure.
The above-mentioned specific embodiments do not constitute a limitation on the scope of protection of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations and substitutions may be made according to design requirements and other factors. Any modifications, equivalent replacements and improvements made within the spirit and principles of the present disclosure shall be contained in the scope of protection of the present disclosure.