This application claims priority to United Kingdom Patent Application No. 2314928.9 filed on Sep. 28, 2023 and to United Kingdom Patent Application No. 2317029.3 filed on Nov. 6, 2023, wherein the entire contents of the foregoing applications are hereby incorporated by reference herein.
This invention relates to autonomous vehicles, such as autonomous cars, trucks, buses, vans, and such like. Particularly, though not exclusively, the present invention is concerned with improving decision making for ‘end-to-end’ methods of autonomous driving.
There has been a great deal of development in recent years in the field of autonomous vehicles, with ever more advanced self-driving vehicles becoming a reality. There are varying degrees of autonomy, often referred to as the ‘levels’ of automation, with these levels progressively outsourcing more of the driving functions from a human operator to a computer or ‘automated driver system’ (ADS). Considerable progress has been made towards truly autonomous vehicles which can drive in a real-world road environment without any human interaction.
There are a number of different approaches to autonomous driving, with different paradigms being used. Some current autonomous driving systems, known in the art per se, make use of a modular paradigm in which a series of discrete processing stages make driving decisions based on predetermined rules. These modules may utilize sensor fusion methods and route-locating algorithms, emphasizing environmental perception. However, these systems often struggle with complex situations, and their interpretation and response capabilities can be limited. Furthermore, such modular systems can suffer from error propagation.
Other solutions make use of ‘end-to-end’ methods, and there has been significant progress in end-to-end deep learning methods for autonomous systems in recent years. Those skilled in the art will appreciate that with an ‘end-to-end’ paradigm, inputs from the car's sensors are typically mapped directly to driving outputs. This approach may provide better performance with complex or unusual situations. End-to-end approaches may also be more reliable compared to modular approaches which can suffer from error propagation.
However, the Applicant has appreciated that a major challenge with modern autonomous driving systems is performance, particularly with respect to the speed of the decision-making process. In autonomous vehicle applications, it is critical that decisions can be made and acted upon swiftly and safely, in an environment in which the surrounding world can change quickly and in potentially unexpected ways. In order to plan what actions the vehicle should take, typically a number of possible future events are considered and planned for, before a decision is taken and acted upon.
Typical fully learned systems, known in the art per se, are too slow to deploy in a real-world vehicle because they require interleaving planning with rolling out future world dynamics. Such a system may typically require several seconds (at least) in order to make a decision. This performance is simply too slow to be usable in a practical system, in which a decision should be made within approximately 100 ms.
If the model is too slow, the autonomous vehicle would be making decisions too late for them to be relevant. By way of example, in general when a model makes a decision in the car it may need to output the next 2 to 5 seconds' worth of actions (which can be anything between 10 and 50 instantaneous actions; e.g. if the vehicle is to stop within the next 5 s from a speed of 10 mph, it would be expected to be at approximately 8 mph after 1 s, 6 mph after 2 s, and so on until reaching 0 mph at 5 s). These 2 to 5 seconds are typically not fully executed because another decision will be made shortly afterwards (after, say, 50 to 100 ms). Nevertheless, in order to plan the next 2 to 5 s, the model needs to somehow anticipate what is going to happen next and plan its trajectory accordingly.
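By way of purely illustrative example, the following Python sketch reproduces the arithmetic above: a 5-second stopping plan from 10 mph sampled at 10 Hz, of which only the first ~100 ms is executed before the next decision supersedes it. The function name and values are illustrative only.

# Illustrative only: a 5 s stopping plan from 10 mph, re-planned roughly every 100 ms.
def stopping_plan(speed_mph: float, horizon_s: float = 5.0, dt: float = 0.1):
    """Return (time_s, speed_mph) pairs decelerating linearly to 0 over the horizon."""
    steps = int(horizon_s / dt)
    return [(i * dt, speed_mph * (1 - i / steps)) for i in range(steps + 1)]

plan = stopping_plan(10.0)   # ~8 mph at 1 s, ~6 mph at 2 s, ..., 0 mph at 5 s
executed = plan[:1]          # only the first ~100 ms of the plan is acted upon
# ...before a new decision, made 50 to 100 ms later, replaces the rest of the plan.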
The Applicant has previously developed GAIA-1 (Generative Artificial Intelligence for Autonomy), which utilises a multi-modal approach that leverages video, text and action inputs to generate realistic driving videos. By training on a vast corpus of real-world UK urban driving data, the GAIA-1 model learns to predict the subsequent frames in a video sequence, resulting in an autoregressive (AR) prediction capability without needing any labels. This resembles the approach seen in large language models (LLMs).
GAIA-1 is a true world model that learns to understand and disentangle the important concepts of driving, including cars, trucks, buses, pedestrians, cyclists, road layouts, buildings, and traffic lights. What sets the GAIA-1 model apart from other models, known in the art per se, is its ability to provide fine-grained control over both ego-vehicle (i.e. the autonomous vehicle itself) behaviour and other essential scene features. Whether altering the ego-vehicle's behaviour or modifying the overall scene dynamics, this model offers unparalleled flexibility, making it an invaluable tool for accelerating the development of foundation models for autonomous driving.
The Applicant has appreciated the need for further developments to autonomous vehicles to improve performance such that learned models may be used in a practical, real time driving environment.
In accordance with a first aspect, embodiments of the present invention provide an autonomous driving system comprising:
The first aspect of the invention extends to a method of operating an autonomous driving system, the method comprising:
The first aspect of the invention also extends to a non-transitory computer-readable medium comprising instructions which, when executed by a processor, cause the processor to carry out a method of operating an autonomous driving system, the method comprising:
The first aspect of the invention further extends to a computer software product comprising instructions which, when executed by a processor, cause the processor to carry out a method of operating an autonomous driving system, the method comprising:
When viewed from a second aspect, embodiments of the present invention provide a method of training an autonomous driving system, the method comprising:
The second aspect of the invention also extends to a non-transitory computer-readable medium comprising instructions which, when executed by a processor, cause the processor to carry out a method of training an autonomous driving system, the method comprising:
The second aspect of the invention further extends to a computer software product comprising instructions which, when executed by a processor, cause the processor to carry out a method of training an autonomous driving system, the method comprising:
When viewed from a third aspect, embodiments of the present invention provide a method of training a machine learning driving decoder, the method comprising:
The third aspect of the invention also extends to a non-transitory computer-readable medium comprising instructions which, when executed by a processor, cause the processor to carry out a method of training a machine learning driving decoder, the method comprising:
The third aspect of the invention further extends to a computer software product comprising instructions which, when executed by a processor, cause the processor to carry out a method of training a machine learning driving decoder, the method comprising:
Thus it will be appreciated that embodiments of the present invention provide an advantageous arrangement affording enhanced model performance by training and deploying driving models that leverage world models. The model may be fine-tuned for driving tasks and subsequently applied within a vehicle for real-time decision-making.
The Applicant has appreciated that the type of world model that may otherwise be used for generative prediction (such as the Applicant's GAIA-1 model) already captures information about possible futures. This is because the world model is a neural network designed to acquire an implicit state, encompassing the current state of the world and its potential future evolutions. Embodiments of the present invention leverage this insight—rather than needing to iteratively roll out a future scene, plan a driving action, and then roll out the resultant driving scene (and so on), the existing ‘multiple possible futures’ implicitly represented in the world state can be used.
Thus, with the approach of the present invention, this implicit state provided by the pre-trained world model can serve as a latent state for a model responsible for controlling the autonomous vehicle—the Applicant refers to this driving model as GAIA-Drive. Initially, the network learns this implicit state through future prediction tasks involving video (and potentially text data, action data, and/or auxiliary sensor data, as outlined below). The implicit representation is trained to automatically anticipate future events within the operating environment of an autonomous vehicle, eliminating the necessity for explicit specification or undertaking explicit rollouts to estimate potential futures.
By avoiding the need to roll out multiple different futures and to plan iteratively, embodiments of the present invention allow a driving plan to be generated much faster than is possible with conventional approaches, known in the art per se. This may advantageously allow processing to occur sufficiently quickly that the learned model can be deployed in a real-world autonomous vehicle, with it being able to make decisions quickly enough to be used safely and reliably.
As outlined above, the driving plan is generated in a single forward pass operation. This means that the decoding process occurs in a singular forward pass, deliberately excluding any backward loops into the world model. This modification eliminates the need for the model to predict future events (removing the need to loop back into the world model, i.e. avoiding autoregressive prediction), streamlining the pathway directly to the driving decoder for immediate decision-making without future-oriented information generation. This novel approach enhances the efficiency and real-time applicability of the model within the context of driving scenarios. Compared to other learnable systems known in the art per se, embodiments of the present invention may advantageously allow efficient computation of a state of the world that captures the expectations of the future dynamics in one forward pass. This may effectively allow the model to be deployed to real world applications such as in an autonomous vehicle.
Once deployed in an autonomous vehicle, images are captured, encoded, and fed into the pre-trained world model. In some embodiments, one or more cameras are configured to capture the video input data and to provide said video input data to the image encoder.
The term ‘pre-trained’ is used herein because, from the perspective of the model that is used for the generation of driving plans, the world model is already trained. It should be noted, however, that in some embodiments of certain aspects of the invention, the training of that world model may form part of the invention. In some embodiments, the pre-trained world model is trained by:
When training the model for driving, the existing parameters of a pre-trained world model are employed as a foundation. The parameters of the driving decoder, a new component, are initialised, e.g. at random. This configuration allows the model to process video inputs (e.g. from the car's cameras) and does not require input of e.g. textual or action data.
The driving model leverages the knowledge of the pre-trained world model to process image features, creating comprehensive world modelling features. These features may, in some embodiments, be augmented with additional pertinent driving data such as information from a route map indicating the desired trajectory.
The fusion of world modelling features and relevant driving data undergoes decoding through the driving decoder, generating a driving plan. This driving plan could encompass various components, such as waypoints, speed adjustments, curvature, and driving indicators.
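By way of non-limiting illustration, a single decision cycle of this kind might be sketched as follows. This is a simplified, hypothetical Python/PyTorch example; the module names and the features() method are placeholders rather than an actual implementation.

import torch

@torch.no_grad()
def plan_once(frames, route_features, image_encoder, world_model, driving_decoder):
    """One decision cycle in a single forward pass, with no autoregressive rollout.

    frames: recent camera frames, shape (T, C, H, W); route_features: encoded route map.
    All module and method names are illustrative placeholders.
    """
    tokens = image_encoder(frames)               # encode and discretise each frame
    world_state = world_model.features(tokens)   # implicit state capturing expected future dynamics
    fused = torch.cat([world_state, route_features], dim=-1)
    return driving_decoder(fused)                # driving plan: waypoints, speeds, curvature, ...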
Embodiments of the present invention may provide various further improvements over explicit world state estimators for autonomous driving.
The invention provides a data-driven approach in which the model can learn to estimate world states and predict their evolution from real-world data. As a result, the invention may provide for more accurate modelling of complex real-world driving scenarios.
Additionally, the invention may provide improvements in adaptability and generalisation as the model can adapt to new scenarios and generalise to previously unseen situations. Assuming the model learns from a sufficiently large dataset, it can capture variations and edge cases, making it more robust and better equipped to handle novel situations.
Furthermore, the invention may exhibit a reduction in bias because the model learns from data, reducing the need for manual design and minimising the potential for human-induced biases. This leads to more objective and unbiased training and validation data.
In some embodiments, the driving plan comprises one or more of: a waypoint, a speed, a velocity, a curvature, a trajectory, an indicator signal, an acceleration value, a braking value, a parking brake value, a steering angle, a lighting setting, and a horn setting. It will be appreciated that the driving plan may include any or all of these. The driving plan may, additionally or alternatively, include any other suitable operational variable of the autonomous vehicle.
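Purely by way of illustration, such a driving plan could be represented by a simple data structure such as the following sketch; the fields, types, and units are examples only and are not limiting.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class DrivingPlan:
    """Illustrative (non-exhaustive) container for driving plan outputs."""
    waypoints: List[Tuple[float, float]] = field(default_factory=list)  # (x, y) positions in metres
    speeds: List[float] = field(default_factory=list)                   # target speed per waypoint, m/s
    curvatures: List[float] = field(default_factory=list)               # path curvature per waypoint, 1/m
    indicator: str = "none"                                             # "left", "right" or "none"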
As mentioned previously, the input to the driving decoder may be augmented with additional driving data, e.g. a route map indicating the desired trajectory. Thus in some embodiments, the driving decoder is further configured to receive a route plan input, wherein the driving decoder is further configured to generate the driving plan based on the route plan input. Thus, in certain embodiments, the method further comprises: i) receiving a route plan input; and ii) generating the driving plan based on the route plan input. The route plan may, in some such embodiments, comprise data representative of a route and/or a destination.
The driving decoder may itself be a trainable component. In some embodiments, the driving decoder is trained by:
Thus, in some embodiments, the method further comprises:
The driving decoder weights could be initialised to a predetermined set of values. In some embodiments, the driving decoder weights are initialised to a set of random values. It will be appreciated that the term ‘random’ as used herein covers both ‘truly’ random and pseudorandom values and either type may be used as appropriate. The random initialisation may, in some embodiments, be performed using a pseudorandom number generator.
Thus in a particular set of embodiments, the driving model can be said to include three components: a video encoder, a world model, and a driving decoder. The first two of these (the video encoder and the world model) may be pre-trained in an earlier training process, with the driving decoder being trained separately.
As outlined above, the inputs to the driving decoder at inference time may include additional data such as a route map. Similarly, a route map may be provided during the training of the driving decoder. Thus, in some embodiments, training the driving decoder further comprises receiving a training route plan input corresponding to the further video training data, wherein the driving decoder is further configured to generate the new driving plan based on the training route plan input, optionally wherein the training route plan input comprises data representative of a route and/or a destination.
Thus the method may, in a particular set of embodiments, further comprise:
While the world model is pre-trained, the Applicant has appreciated that the weights of the world model may also be updated when the driving decoder is being trained. Thus, in some embodiments, the method further comprises updating the set of world model weights based on the difference between the new driving plan and the driving plan prior corresponding to the further video training data. This allows for ‘fine-tuning’ of the world model.
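The following simplified, non-limiting sketch illustrates how such a training step might combine supervision of the driving decoder with optional fine-tuning of the world model weights. The names, the L1 loss against the driving plan prior, and the assumption that the optimizer holds both parameter sets are illustrative only.

import torch
import torch.nn.functional as F

def training_step(batch, image_encoder, world_model, driving_decoder, optimizer,
                  finetune_world_model: bool = True):
    """One illustrative update against the driving plan prior for this training example."""
    tokens = image_encoder(batch["frames"])
    with torch.set_grad_enabled(finetune_world_model):   # optionally fine-tune the world model
        world_state = world_model.features(tokens)        # hypothetical feature-extraction call
    plan = driving_decoder(torch.cat([world_state, batch["route"]], dim=-1))
    loss = F.l1_loss(plan, batch["prior_plan"])           # difference to the driving plan prior
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()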
When viewed from a fourth aspect, embodiments of the present invention provide a method of training an autonomous driving system, the method comprising:
This fourth aspect extends to a non-transitory computer-readable medium comprising instructions which, when executed by a processor, cause the processor to carry out a method of training an autonomous driving system, the method comprising:
This fourth aspect extends to a computer software product comprising instructions which, when executed by a processor, cause the processor to carry out a method of training an autonomous driving system, the method comprising:
When viewed from a fifth aspect, the present invention provides a world model for use in an autonomous driving system, said world model having a set of pre-trained world model weights, wherein the world model is configured to:
In certain embodiments of any of the foregoing aspects of the invention, a trainable image encoder may be used for encoding the image frames to generate the image encodings. In some embodiments, the method comprises training a set of image encoder weights of the image encoder. The image encoder may be pre-trained.
In some such embodiments, the trained set of image encoder weights of the image encoder are used for encoding the further image frames to generate further image encodings. In other words, the same image encoder weights used for encoding the images when training the world model are used for encoding images later for use with the resultant trained world model used in conjunction with the driving decoder. Thus the world model may be trained to produce the world state (which in turn is usable as a latent state for the driving model), and the image encoder may be trained to encode the images in a manner suitable for input to that world model. That trained image encoder and world model may then readily be used as part of the driving model, together with the driving decoder discussed previously.
Each image frame may be encoded and discretized into a plurality of tokens for input into the world model. The image frames may be downsampled by a rate D.
The Applicant has appreciated that different modalities may be used to train the world model, with additional modalities beyond video or image data alone providing data on which the world model can be trained, enhancing the patterns it is able to learn from the sequences of data it is provided during training.
In some embodiments, the step of training the world model further comprises:
These driving actions correspond to driving parameters, such as a speed, a curvature, amounts of acceleration or braking being applied, a steering angle, an acceleration rate, a braking rate, a curvature rate, or such like. The driving actions may be raw signals, or may be formatted in some suitable manner, e.g. provided in a vectorised form. The driving action data may comprise a scalar value representing each driving parameter (e.g. a scalar representing speed, a scalar representing curvature, etc.).
In some such embodiments, a trainable action encoder is used for encoding the driving actions to generate the action encodings, wherein the method comprises training a set of action encoder weights of the action encoder. The action encoder may be pre-trained.
In some embodiments, the step of training the world model further comprises:
The textual data provided may be a text-based description of a particular scenario. For example, textual training data may comprise text statements such as “I am approaching a crossing yielding to pedestrians” or “It is safe to move so I am now accelerating”.
In some such embodiments, a trainable text encoder is used for encoding the text data to generate the text encodings, wherein the method comprises training a set of text encoder weights of the text encoder. The text encoder may be pre-trained.
In some embodiments, the step of training the world model further comprises:
The auxiliary sensor data corresponds to measurements acquired from any auxiliary sensor(s) on the vehicle. Such auxiliary sensors may, in some embodiments, comprise one or more of: a radar sensor, a lidar sensor, an infrared sensor, a range sensor, a distance sensor, an ultrasonic sensor, a rain sensor, a temperature sensor, a pressure sensor, and a load sensor. The auxiliary sensor data may be raw signals, or may be formatted in some suitable manner, e.g. provided in a vectorised form. The auxiliary sensor data may comprise a scalar value representing each measurement (e.g. a scalar representing distance to another vehicle ahead of the autonomous vehicle, an array of infrared values, etc.).
In some such embodiments, a trainable auxiliary sensor encoder is used for encoding the auxiliary sensor data to generate the auxiliary sensor encodings, wherein the method comprises training a set of auxiliary sensor encoder weights of the auxiliary sensor encoder. The auxiliary sensor encoder may be pre-trained.
Each modality (images, actions, text, and/or auxiliary sensor data, as appropriate) is encoded separately and then the encodings are aligned temporally such that the correct time-based sequence of contemporaneous events is preserved. The world model is then trained by using an autoregressive transformer that models the sequence of tokenised, time-ordered encodings.
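By way of illustration only, the temporal alignment and interleaving of the per-modality encodings might be sketched as follows (a simplified, hypothetical example).

def interleave_modalities(text_enc, image_enc, action_enc):
    """Interleave per-time-step encodings in a fixed (text, image, action) order.

    Each argument is a list of length T whose t-th element holds the encodings for
    time step t. The result is one time-ordered token sequence for the world model.
    """
    sequence = []
    for c_t, z_t, a_t in zip(text_enc, image_enc, action_enc):
        sequence.extend([c_t, z_t, a_t])   # contemporaneous events stay together
    return sequence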
In some embodiments, the world model is configured to generate output tokens, wherein the output tokens are input to a video decoder, said video decoder being configured to generate a video output comprising a plurality of generated image frames from the output tokens. Thus, in accordance with such embodiments, the video decoder can decode the tokens output by the world model back into video, thus providing for synthetically generated realistic driving videos to be output. The video decoder may be a video diffusion decoder.
The video decoder may be a trainable component. Thus, in some embodiments, the method may further comprise training a set of video decoder weights of the video decoder.
It will be appreciated that the optional features described hereinabove in respect of embodiments of any aspect of the invention apply equally, where technically appropriate, to the other aspects of the invention outlined herein.
Where technically appropriate, embodiments of the invention may be combined.
Embodiments are described herein as comprising certain features/elements. The disclosure also extends to separate embodiments consisting or consisting essentially of said features/elements.
Technical references such as patents and applications are incorporated herein by reference.
Any embodiments specifically and explicitly recited herein may form the basis of a disclaimer either alone or in combination with one or more further embodiments.
In the context of this specification “comprising” is to be interpreted as “including”. Aspects of the invention comprising certain elements are also intended to extend to alternative embodiments “consisting” or “consisting essentially” of the relevant elements.
The term “vehicle” as used herein should be understood to mean any kind of vehicle intended to travel on roads where some mechanical and/or electrical propulsion is used to drive the vehicle, whether operated autonomously or not. This includes, but is not limited to: cars, motorbikes, trucks, buses, coaches, vans, lorries, campervans, motor caravans, minibuses, limousines, all-terrain vehicles (ATVs), tractors, and other such vehicles that are mechanically or electrically driven.
Where context allows (e.g. in respect of other vehicles detected by the autonomous vehicle), the term “vehicle” further extends to non-driven vehicles, i.e. those without mechanical or electrical propulsion. This includes, but is not limited to: bicycles, unicycles, tricycles, quadracycles, rickshaws, carts, wagons, horse-drawn carts or carriages, and other such vehicles that are not mechanically or electrically driven.
The term “data” is used in different contexts herein to refer to digital information, such as that represented by known bit structures within one or more programming languages. In use, data may refer to digital information that is stored as bit sequences within computer memory.
Certain machine learning models may operate on structured arrays of data of a predefined bit format. Using terms of the art, these may be referred to as “vectors”, as used herein.
However, the term “vector” is understood by those skilled in the art to cover multidimensional arrays or “tensors” as well. It should be noted that, for machine learning methods, multidimensional arrays, e.g. with a defined extent in multiple dimensions, may be “flattened” so as to be represented (e.g., within memory) as a sequence or vector of values stored according to the predefined format (e.g., n-bit integer or floating point number, signed or unsigned). Hence, the term “tensor” as used herein covers multidimensional arrays with one or more dimensions (e.g., vectors, matrices, volumetric arrays, etc.).
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
Certain embodiments of the present invention will now be described with reference to the accompanying drawings, in which:
Certain exemplary embodiments are described herein which relate to an autonomous vehicle, control systems for autonomous vehicles, and methods of training such control systems. It will be appreciated that the autonomous vehicle and its various components—as well as systems and components it interacts with—are complex technical systems, and so the illustrations and descriptions provided herein are simplified for ease of reference.
It will be appreciated that this description provides examples for reference purposes, and the scope of the invention is defined by the claims.
In Chapter I of the following description, we explain in detail a generative world model for autonomous driving, where such a world model may be used to enable autonomous driving in accordance with embodiments of the present invention. In particular, a detailed outline of the Applicant's GAIA-1 model is set out hereinbelow.
In Chapter II, the description explains how such a model may then be used for autonomous driving in conjunction with a driving decoder, in accordance with embodiments of the present invention.
Note that numerals within square brackets, e.g. “[1]”, refer to the references provided later.
Predicting future events is a fundamental and critical aspect of autonomous systems. Accurate future prediction enables autonomous vehicles to anticipate and plan their actions, enhancing safety and efficiency on the road. To achieve this, the development of a robust model of the world is imperative [1] and huge efforts have been made in the past to build such predictive world models for autonomous driving [2, 3, 4, 5, 6]. A world model [7, 8] learns a structured representation and understanding of the environment that can be leveraged for making informed decisions when driving.
However, current approaches have had significant limitations. World models have been successfully applied to control tasks both in simulation [9, 10, 11, 12, 13] and in real-world robotics tasks [14, 15]. These methods often rely on labelled data, which is challenging to obtain at scale, and models that work on simulated data may not fully capture the complexities of real-world scenarios. Furthermore, due to their low-dimensional representations, these models may struggle to generate highly realistic samples of future events, posing challenges in achieving a high level of fidelity in predictions for complex real-world applications such as autonomous driving.
Meanwhile, progress in generative image and video generation has harnessed the power of self-supervised learning to learn from large quantities of real-world data to generate remarkably realistic video samples [16, 17, 18]. Yet, a significant challenge persists in this domain: the difficulty of learning a representation that captures the expected future events. While such generative models excel at generating visually convincing content, they may fall short in learning representations of the evolving world dynamics that are crucial for precise future predictions and robust decision-making in complex scenarios.
In this work we introduce GAIA-1, a method designed with the goal of maintaining the benefits of both world models and generative video generation. It combines the scalability and realism of generative video models with the ability of world models to learn meaningful representations of the evolution into the future. GAIA-1 works as follows. First, we partition the model into two components: the world model and the video diffusion decoder. The world model reasons about the scene's high-level components and dynamics, while the diffusion model takes on the responsibility of translating latent representations back into high-quality videos with realistic detail.
For the world model, we use vector-quantized representations of video frames to discretize each frame, transforming them into a sequence of tokens. Subsequently, we reframe the challenge of predicting the future into predicting the next token in the sequence [10, 19]. This approach has been widely employed in recent years to train large language models [20, 21, 22, 23], and it is recognized for its effectiveness in enhancing model performance through the scaling of model size and data. It is possible to generate samples within the latent space of the world model through autoregressive generation.
The second component is a multi-task video diffusion decoder that is able to perform high-resolution video rendering as well as temporal upsampling to generate smooth videos from the information autoregressively generated by the world model. Similarly to large language models, video diffusion models have demonstrated a clear correlation between scale of training and overall performance, making both components of GAIA-1 suitable for effective compound scaling.
GAIA-1 is designed to be multimodal, allowing video, text and action to be used as prompts to generate diverse and realistic driving scenarios, as demonstrated in
GAIA-1 demonstrates the ability to manifest the generative rules of the real world. Emerging properties such as learning high-level structures, generalization, creativity, and contextual awareness indicate that the model can comprehend and reproduce the rules and behaviors of the world. Moreover, GAIA-1 exhibits understanding of 3D geometry, for example, by effectively capturing the intricate interplay of pitch and roll induced by road irregularities such as speed bumps. It showcases reactive behaviors of other agents, demonstrating the ability to understand causality in the decision-making of road users. Surprisingly, it shows the capability to successfully extrapolate beyond the training data, for example to driving outside of the boundaries of the road. See Section 7 for a comprehensive list of examples.
The power of GAIA-1's learned representations to predict future events, paired with control over both ego-vehicle dynamics and scene elements, is an exciting advance that paves the way for improving embodied intelligence and providing synthetic data to accelerate training and validation. World models, such as GAIA-1, are the basis for the ability to predict what might happen next, which is fundamentally important for decision-making in autonomous driving.
In this section we describe the model architecture of the trainable components of GAIA-1. The general architecture is presented in
GAIA-1 can leverage three different input modalities (video, text, action), which are encoded into a shared d-dimensional space.
Image tokens. Each image frame of a video is represented as discrete tokens. To achieve this, we use a pre-trained image tokenizer for discretization (for details about the pre-training see Section 2.2).
Formally, let us consider a sequence of T images (x1, . . . , xT), where each image xt in this sequence is discretized into n=576 discrete tokens using the pre-trained image tokenizer. We obtain a sequence denoted as (z1, . . . , zT), where each zt=(zt,1, . . . , zt,n) corresponds to the n=H/D×W/D discrete tokens of frame xt. Here, H and W represent the height and width of the input image, while D denotes the downsampling factor of the image tokenizer. These discrete tokens are then mapped to a d-dimensional space via an embedding layer that is trained alongside the world model.
Text tokens. At each time step t, we incorporate information from both text and action. Textual input is encoded using the pre-trained T5-large model [24], resulting in m=32 text tokens per time step. These tokens are mapped to a d-dimensional space through a linear layer that is trained in conjunction with the world model. This process yields a text representation denoted as ct=(ct,1, . . . , ct,m) ∈ ℝ^(m×d).
Action tokens. For actions, we consider l=2 scalar values (representing speed and curvature). Each scalar is independently mapped to the d-dimensional space via a linear layer that is trained with the world model. Consequently, the action at time step t is represented as at=(at,1, . . . , at,l) ∈ ℝ^(l×d).
For each time step, the input tokens are interleaved in the following order: text, then image, then action. The final input of the world model is therefore (c1, z1, a1, . . . , cT, zT, aT). To encode the position of the input tokens, we use a factorized spatio-temporal positional embedding. 1) A learnable temporal embedding is shared across all the tokens of a given time step, i.e. there are T temporal embeddings. 2) A learnable spatial embedding indicates the position of a token within a time step, i.e. there are m+n+l=610 spatial embeddings (m text tokens, n image tokens, and l action tokens) of dimension d=4096.
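A simplified, non-limiting sketch of such a factorized spatio-temporal positional embedding is given below, assuming a PyTorch-style implementation; the class is illustrative only, with default sizes taken from the training configuration described later in Section 4.2.

import torch
import torch.nn as nn

class FactorizedPositionalEmbedding(nn.Module):
    """Illustrative factorized embedding: T temporal + (m+n+l) spatial positions."""
    def __init__(self, num_steps: int = 26, tokens_per_step: int = 610, d: int = 4096):
        super().__init__()
        self.temporal = nn.Embedding(num_steps, d)        # shared by all tokens of a time step
        self.spatial = nn.Embedding(tokens_per_step, d)   # position of a token within a time step

    def forward(self) -> torch.Tensor:
        T, S = self.temporal.num_embeddings, self.spatial.num_embeddings
        t_idx = torch.arange(T).repeat_interleave(S)      # 0,0,...,0,1,1,...,1,...
        s_idx = torch.arange(S).repeat(T)                 # 0,1,...,S-1,0,1,...
        return self.temporal(t_idx) + self.spatial(s_idx) # shape (T * S, d)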
When modeling discrete input data with a sequence model, there is a trade-off between the sequence length and the vocabulary size. The sequence length refers to the number of discrete tokens that are needed to describe the data. The vocabulary size corresponds to the number of possible values a single token can take. For language, there are two obvious choices for tokens: characters and words. When using character-level tokens, the input data has a longer sequence length, and each individual token belongs to a smaller vocabulary, but conveys little meaning. When using word-level tokens, the input data has a shorter sequence length, and each token contains a lot of semantics but the vocabulary is extremely large. Most language models [25, 26, 24, 21, 27, 22] use byte-pair encoding (or equivalent) as a trade-off between character-level and word-level tokenization.
Likewise for video, we would like to reduce the sequence length of the input, while possibly making the vocabulary larger, but with tokens that are more semantically meaningful than raw pixels. We do this with a discrete image autoencoder [28]. There are two objectives we would like to achieve in this first stage:
We reduce the sequence length of the input data by downsampling each input image by a factor D=16 in both height and width. Each image xt of size H×W is described by n=H/D×W/D tokens with a vocabulary size K. Inspired by [29], we guide the compression towards meaningful representations by regressing to the latent features of a pre-trained DINO model [30], a self-supervised image model that is known to contain semantic information. See
The discrete autoencoder is a fully convolutional 2D U-Net [31]. The encoder Eθ quantizes the image features using nearest neighbor look-up from a learnable embedding table [28], resulting in image tokens zt=Eθ(xt). Note that the decoder is only used to train the image autoencoder; only the discrete encoder Eθ is part of the final GAIA-1 model. Because the decoder is trained on single images, it lacks temporal consistency when decoding to a video. For this reason we also train a video decoder that is described in Section 2.4.
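By way of illustration, the nearest neighbor look-up performed by the encoder Eθ may be sketched as follows; this is a simplified, hypothetical example of the quantization step rather than the exact implementation.

import torch

def quantize(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Nearest neighbor look-up into a learnable embedding table (illustrative).

    features: (n, d) encoder outputs for one image; codebook: (K, d) embedding table.
    Returns the (n,) integer image tokens z_t.
    """
    distances = torch.cdist(features, codebook)   # (n, K) pairwise L2 distances
    return distances.argmin(dim=-1)               # index of the closest code for each feature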
The training losses for the image autoencoder are the following:
As described in Section 2.1 the input of the world model is (c1, z1, a1, . . . , cT, zT, aT). The world model is an autoregressive transformer network that models the sequence input. Its training objective is to predict the next image token in the sequence conditioned on all past tokens, using causal masking in the attention matrix of the transformer blocks [35].
We randomly dropout conditioning tokens during training so that the world model can do (i) unconditional generation, (ii) action-conditioned generation, and (iii) text-conditioned generation.
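A simplified, illustrative sketch of this training objective (next-image-token prediction with random dropout of the conditioning tokens) is given below; the tensor layout, the null token id, and the dropout probabilities are assumptions made for illustration only.

import torch
import torch.nn.functional as F

def world_model_step(world_model, tokens, image_mask, text_mask, action_mask,
                     null_id=0, drop_prob=0.4):
    """tokens: (B, L) interleaved token ids; *_mask: (B, L) booleans marking each modality."""
    if torch.rand(()) < drop_prob:
        tokens = tokens.masked_fill(text_mask, null_id)    # drop text conditioning
    if torch.rand(()) < drop_prob:
        tokens = tokens.masked_fill(action_mask, null_id)  # drop action conditioning
    logits = world_model(tokens[:, :-1])                   # causally masked transformer, (B, L-1, V)
    loss = F.cross_entropy(logits.transpose(1, 2), tokens[:, 1:], reduction="none")
    return (loss * image_mask[:, 1:]).mean()               # train only on next image tokens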
To further reduce the sequence length of our world model we temporally subsample videos from 25 Hz to 6.25 Hz. This allows the world model to reason over longer periods without leading to intractable sequence lengths. To recover video predictions at full frame rate we perform temporal super-resolution using the video decoder described in Section 2.4.
Following the recent advances in image [36, 37] and video generation [16, 18] we use denoising video diffusion models for the GAIA-1 decoder. A naive approach of independently decoding each frame's tokens to pixel space results in a temporally inconsistent video output. Modeling the problem as denoising a sequence of frames during the diffusion process, where the model can access information across time, greatly improves temporal consistency of the output video.
We follow [38] and use a 3D U-Net with factorized spatial and temporal attention layers. During training, our video diffusion model is conditioned on the image tokens obtained by discretizing input images with the pre-trained image tokenizer Eθ. During inference, the diffusion model is conditioned on the predicted image tokens from the world model.
We train a single model jointly on both image and video generation tasks. Training on videos teaches the decoder to be temporally consistent, while training on images is crucial for the quality of individual frames [16] as it teaches the model to extract information from conditioning image tokens. We disable temporal layers when training on images.
To train our video diffusion decoder for multiple inference tasks we take inspiration from [17] where we can perform multiple tasks by masking certain frames or the conditioning image tokens. We choose to train a single video diffusion model for all tasks as it has been shown that multi-task training improves performance on individual tasks [17]. The tasks include image generation, video generation, autoregressive decoding, and video interpolation. Each task is sampled equally. For example, for the autoregressive generation task, we provide previously generated past frames as context and conditioning image tokens for frames we want to predict. We include both forward and backward autoregressive tasks. See
The video decoder is trained on the noise prediction objective. More specifically, we use the v-parameterization as proposed in [39] because it avoided unnatural color shifts and maintained long-term consistency as similarly found in [16]. In practice, we use a weighted average of L1 and L2 losses. The video decoder loss Lvideo is:
where:
Our training dataset consists of 4,700 hours at 25 Hz of proprietary driving data collected in London, UK between 2019 and 2023. This corresponds to approximately 420M unique images. During training we balance over a customizable set of features to control the distribution of data (
For the tokenizer we balanced over (latitude, longitude, weather category) to account for geography and visually distinct weather conditions ensuring our tokenizer can adequately represent a diverse range of scenes.
For the world model and the video diffusion model we balanced over (latitude, longitude, weather category, steering behavior category, speed behavior category), additionally considering speed and steering behaviors to ensure the dynamics of different behaviors are captured and sufficiently modelled by the world model and the temporal decoder.
Our validation dataset contains 400 hours of driving data from runs not included in the training set. The runs selected for validation are those that pass through predetermined geofences as well as a selection of randomly selected runs. We further split our validation set into strict geofences in order to analyze only those samples strictly within the validation geofence (i.e., roads never seen during training) and another geofence around our main data collection routes (i.e., roads seen during training) as a way to monitor overfitting and generalization.
In this section, we describe how the three trainable components of GAIA-1 were optimized. We provide details of hyperparameter configurations, hardware used and training times.
The image tokenizer (0.3B parameters) was trained on images of resolution H×W=288×512 (9/16 ratio). The spatial downsampling of the encoder is D=16, therefore each image is encoded as n=18×32=576 discrete tokens with a vocabulary size K=8192. The bit compression is therefore (288×512×3×8)/(576×13)≈473, since each image contains 288×512×3×8 bits of raw data and each of the 576 tokens requires log2 8192=13 bits.
The discrete autoencoder was optimised with AdamW [40] and a learning rate of 1×10−4, weight decay 0.01, beta coefficients (0.5, 0.9). The loss weights are λL
The model was trained for 200 k steps in 4 days with a batch size equal to 160, split across 32 A100 80 GB GPUs. We used 5 k of linear warm-up and 10 k of cosine decay to a final learning rate of 1×10−5.
The world model (6.5B parameters) was trained on video sequences of size T=26 at 6.25 Hz, which correspond to 4 s-long videos. The text was encoded as m=32 text tokens per time step, and the action as l=2 tokens. The total sequence length of the world model is therefore T×(m+n+l)=15860.
The world model was optimized with AdamW and a learning rate of 1×10−4, weight decay 0.1, beta coefficients (0.9, 0.95), norm gradient clipping 1.0. Training examples were either unconditioned, action-conditioned, or text conditioned. The ratios of these respective conditioning modes were 20%/40%/40%.
The model was trained for 100 k steps in 15 days, with 2.5 k of linear warm-up and 97.5 k of cosine decay reducing the learning rate by a factor of 10 over the course of training. The batch size was 128 split across 64 A100 80 GB GPUs. We used the FlashAttention v2 implementation [41] in the transformer module, as it offered significant advantages in terms of both memory utilization and inference speed. To optimize distributed training, we used the Deepspeed ZeRO-2 training strategy [42] with activation checkpointing.
The video decoder (2.6B) was trained on sequences of T′=7 images of resolution H×W=288×512 sampled from the dataset at either 6.25 Hz, 12.5 Hz or 25 Hz. The training tasks (
The video decoder was optimized with AdamW and a learning rate of 5×10−5, weight decay 0.01, beta coefficients (0.9, 0.99), norm gradient clipping 1.0. The model was trained for 300 k steps in 15 days, with 2.5 k of linear warm-up and 5 k of cosine decay to a final learning rate of 1×10−6. We used a weighted average of L1 and L2 losses with weights λL
In this section, we describe in more detail the inference procedure of the world model and the video decoder.
Sampling. The world model autoregressively predicts the next image token, conditioned on previous text, image and action tokens. Given the past tokens we perform n forward steps to generate one new image frame. At each step we must sample a token from the predicted logits to select the next token in our sequence. Empirically we observed that maximization-based sampling (i.e. argmax) generates futures that get stuck in a repetitive loop, similarly to language models [44]. Conversely, if we simply sample from the logits, the selected token can come from the unreliable tail of the probability distribution, which throws the model out-of-distribution, see
To encourage diversity as well as realism we employ top-k sampling to sample the next image token from the top-k most likely choices. The chosen value of k is a function of the number of tokens that constitute an image frame as well as the pre-learnt codebook (vocabulary) size.
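By way of illustration, top-k sampling of the next image token from the predicted logits may be sketched as follows (a standard, simplified example).

import torch

def sample_top_k(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Sample the next token from the k most likely choices (logits shape (K,) or (B, K))."""
    topk_logits, topk_indices = logits.topk(k, dim=-1)
    probs = torch.softmax(topk_logits, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)      # sample within the top-k set
    return topk_indices.gather(-1, choice).squeeze(-1)    # map back to vocabulary indices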
Our world model can be used to roll out possible futures given starting context as well as generating futures from scratch without any starting context. For long video generation, if the length of the video exceeds the context length of the world model, we employ a sliding window.
Text-conditioning. The video prediction can be prompted, and thus directed, with text. At training time, we condition our video sequences with text coming from either online narration or offline metadata sources. Because these text sources are imperfect, to improve the alignment between generated futures and the text prompt, we employ classifier-free guidance [45, 46] at inference time. The effect of guidance is to increase text-image alignment by decreasing the diversity of possible samples. More precisely, for each next token to predict, we compute logits conditioned on text as well as logits with no conditioning (unconditioned). At inference, we can then amplify the differences between the unconditioned and the text-conditioned logits with a scale factor to give the final logits used for sampling.
By substituting the unconditioned logits with those conditioned on another text prompt, we can perform “negative” prompting [47]. Pushing the logits away from the negative prompt and towards the positive one encourages the future tokens to include the “positive” prompt features while removing the “negative” ones.
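An illustrative sketch of the guidance computation is given below; substituting logits conditioned on a negative prompt for the unconditioned logits yields the negative prompting described above. The function is a simplified example only.

import torch

def guided_logits(cond_logits: torch.Tensor,
                  base_logits: torch.Tensor,
                  scale: float) -> torch.Tensor:
    """Classifier-free guidance over next-token logits (illustrative).

    base_logits are either unconditioned logits or, for negative prompting,
    logits conditioned on the negative prompt; scale > 1 strengthens adherence
    to the (positive) text prompt at the cost of sample diversity.
    """
    return base_logits + scale * (cond_logits - base_logits)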
We found it was important to schedule the scale factor used for guidance over tokens as well as frames. Scheduling over tokens allows some to be sampled with high guidance (hence adhering strongly to the prompt) and others to be sampled with low guidance (hence increasing sample diversity). Scheduling over frames allows for controlling the transition from earlier frames as well as mitigating compounding guidance over subsequent consecutive frames. In
To decode a sequence of generated tokens from the world model, we use the following video decoding method:
We use the DDIM sampler [48] with 50 diffusion steps. During autoregressive decoding, we see a trade-off between reflecting token information content in the generated video and temporal consistency. To balance between these two objectives, we calculate a weighted average of the two tasks [18].
where ϵθ^image(xt′, t′, z, m) denoises each frame individually as an image and ϵθ(xt′, t′, z, m) denoises the sequence of frames jointly as a video, the two outputs being combined as a weighted average with weight w. In practice, we simply switch the temporal layers on and off. We apply this weighted average randomly for each diffusion step with probability p=0.25 and weight w=0.5.
While exploring different inference approaches for video decoding we found that decoding video frames autoregressively backwards, starting from the end of the sequence, led to more stable objects and less flickering on the horizon. In our overall video decoding method, we thus decode the last T′ frames and autoregressively decode the remaining frames backwards from there.
The formulation of the world modeling task in GAIA-1 shares a commonality with the approach frequently used in large language models (LLMs). In both instances, the task is streamlined to focus on predicting the next token. Although this approach is adapted for world modeling in GAIA-1 rather than the traditional language tasks seen in LLMs, it is intriguing to observe that scaling laws [49, 21, 27], analogous to those observed in LLMs, are also applicable to GAIA-1. This suggests the broader applicability of scaling principles in modern AI models across diverse domains, including autonomous driving.
To explore scaling laws with GAIA-1, we predicted the final performance of the world model using models trained with less than 20× the compute. We evaluated those models on a held-out geofenced validation set by measuring cross-entropy. A power-law of the form
was then fitted to the data points. In
The models used to fit the power-law ranged from 10,000× to 10× smaller than the full model in terms of parameters (0.65M to 650M), as visualized in
It is worth noting that our extrapolation leads us to the conclusion that there is substantial potential for further improvement through the expansion of both data and computational resources.
In this section we showcase the capabilities and emerging properties of GAIA-1 through a series of qualitative examples. The comprehensive list of video examples can be found at [95].
GAIA-1 can generate stable long videos (minutes) entirely from imagination (
GAIA-1 has the ability to generate a variety of distinct future scenarios based on a single initial prompt. When presented with a brief video as context, it can generate numerous plausible and diverse outcomes by repeatedly sampling. GAIA-1 accurately models multiple potential future scenarios in response to the video prompt while maintaining consistency with the initial conditions observed in the video. As seen in
GAIA-1 can generate videos from text prompts only, completely imagining the scene. To demonstrate this we showcase how we can generate driving scenarios from text prompts that guide the model towards specific weather or lighting conditions in
Next, we present compelling examples where the model exhibits fine-grained control over the vehicle dynamics in the video. By leveraging this control, we can prompt the model to generate videos depicting scenarios that lie outside the bounds of the training data. This shows that GAIA-1 is able to disentangle the ego-vehicle dynamics from the surrounding environment and effectively generalize to unfamiliar scenarios. It provides explicit ability to reason about the impact of our actions on the environment (safety), it allows richer understanding of dynamic scenes (intelligence), it unlocks model-based policy learning (planning in the world model), and it enables exploration in closed-loop (by considering the world model as a neural simulator). To showcase this, we make GAIA-1 generate futures where the ego-vehicle steers left or right, deviating from its lane (
Video generative models. Video generative models are neural networks that can generate realistic video samples. They can be grouped into four categories: VAE-based (variational autoencoder [50]), GAN-based (generative adversarial network [51]), diffusion-based [52], and autoregressive-based [53].
Latent-variable video models (VAE-based) try to infer the underlying latent process that generated the videos [54, 55, 56, 57, 58]. One known limitation of those models is that they tend to generate blurry outputs due to limited representational power, inadequate choice of prior distribution, and the optimization of a lower-bound instead of the true likelihood. GAN-based methods produce more realistic videos [59, 60, 61, 62, 63, 64] but are known to suffer from training instability and a lack of generation diversity [65]. Diffusion-based methods have yielded significant enhancements in realism, controllability, and temporal consistency. They can operate either at the pixel level [38, 17, 66, 67, 68, 69, 16] or in the latent space of a pre-trained image tokenizer [70, 18, 71].
Diffusion models are expressive neural networks that can fit complex data distributions, but rely on a long Markov chain of diffusion steps to generate samples. Lastly, autoregressive-based methods are conceptually simple and rely on tractable exact likelihood optimization (fits the entire data distribution). Likewise, they can operate at the pixel level [72, 73], or in a discrete learned token space [74, 75, 76, 77]. A known limitation is the slow generation speed, but this issue could be alleviated by future research on parallel sampling [78, 79, 80], reducing the number of latent variables [81], and improvements in hardware accelerators.
World models. A world model is a predictive model of the future that learns a general representation of the world in order to understand the consequences of its actions [7, 8]. The main use cases are: pure representation learning, planning (look-ahead search), or learning a policy in the world model (neural simulator).
World modeling has been used as a pre-training task to learn a compact and general representation in a self-supervised way [82, 83]. Subsequently, using this representation as a state for traditional reinforcement learning (RL) algorithms significantly accelerated convergence speed. World models can also be utilized for look-ahead search, in order to plan by imagining the outcomes of future actions. They have proven to be highly effective in game environments or board games [9, 84]. Additionally, world models can be a solution to the sample efficiency issues of RL algorithms by acting as a simulator of the environment [7, 85, 86, 62, 13, 15, 87], although this assumes the world model is an accurate model of the environment.
A recent line of work suggests casting world modeling as a single sequence model, treating states, actions and rewards as simply a stream of data [10, 19, 14, 88, 12, 89]. The advantage of such a perspective is that world models can benefit from scaling properties of high-capacity sequence model architectures applied to large-scale unsupervised training [26]. This is the approach that GAIA-1 takes, leveraging those scaling properties to model complex environments such as real-world driving scenes.
Scaling. Large language models have shown clear benefits in scaling model size and data [90, 24, 26, 20, 21, 22, 23]. In particular, [49] showed predictable relationships between model/data size and loss over multiple orders of magnitude. [49] derived power laws for transformer based language models in order to optimally allocate the compute budget between the model and data size. Those laws were then refined by [27] by adapting the learning rate schedule when changing the dataset size. Another direction of research to improve the training efficiency of language models is data quality. [91] showed that the quality of the training data plays a critical role in the performance of language models in downstream tasks.
Transferring the scaling principles from large language models to the visual domain holds the potential for delivering consistent and expected performance improvements [92, 93, 43, 16, 94]. In this work, by casting the problem of world modeling as unsupervised sequence modeling, we have shown that scaling trends similar to those of language models also apply to world models.
GAIA-1 is a generative world model for autonomous driving. The world model uses vector-quantized representations to turn the task of future prediction into a next token prediction task, a technique that has been successfully employed in large language models. GAIA-1 has demonstrated its capability to acquire a comprehensive understanding of the environment, distinguishing between various concepts such as cars, trucks, buses, pedestrians, cyclists, road layouts, buildings, and traffic lights—all through self-supervision. Further, GAIA-1 harnesses the capabilities of video diffusion models to generate realistic driving scenarios, thereby functioning as an advanced neural simulator. GAIA-1 is a multimodal approach that enables the control of the ego-vehicle's actions and other scene attributes through a combination of textual and action-based instructions.
While our method demonstrated promising results that have the potential to push the boundaries of autonomous driving, it is important to acknowledge current limitations. For instance, the autoregressive generation process, while highly effective, does not yet run in real time. Nevertheless, it is noteworthy that this process lends itself well to parallelization, allowing for the concurrent generation of multiple samples.
The significance of GAIA-1 extends beyond its generative capabilities. World models represent a crucial step towards achieving autonomous systems that can understand, predict, and adapt to the complexities of the real world. Furthermore, by incorporating world models into driving models, we can enable them to better understand their own decisions and ultimately generalize to more real-world situations. Lastly, GAIA-1 can also serve as a valuable neural simulator, allowing the generation of unlimited data, including adversarial examples, for training and validating autonomous driving systems.
In Chapter I above, it is disclosed how to learn a world model. For ease of reference, this world modelling process is summarised in brief below, where this process acts as a pre-training stage resulting in a pre-trained world model suitable for use in the driving model outlined later. This driving model builds upon the foundation provided by GAIA-1. The architecture used in the pre-training phase corresponds to the architecture described previously with reference to
The input modalities include an input video 100, driving actions 102, and text 104. Additional modalities may be included, such as auxiliary sensor data (e.g. data from radar, lidar, infrared, or ultrasonic sensors, or such like). Where such additional modalities are included, their data is received and encoded in a similar way, with the resultant encodings being temporally aligned with the encodings from the other modalities, and with the corresponding sequence of tokens being modelled using the autoregressive transformer in the same way as set out above.
The input video 100 comprises a sequence of time-ordered image frames. The driving actions 102 correspond to driving parameters, such as amounts of acceleration or braking being applied, a steering angle, a speed, an acceleration rate, a braking rate, a curvature rate, or such like. The driving actions 102 may be raw signals, or may be formatted in some suitable manner, e.g. provided in a vectorised form. The text 104 may be a text-based description of a particular scenario. For example, textual training data may comprise text statements such as “I am approaching a crossing yielding to pedestrians” or “It is safe to move so I am now accelerating”.
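By way of a non-limiting illustration of such a vectorised form, the following minimal sketch shows one way a single timestep of driving parameters could be packaged before being passed to an action encoder; the field names, units, and example values are hypothetical and are not part of the disclosed system:

```python
from dataclasses import dataclass

import torch


@dataclass
class DrivingAction:
    """Hypothetical container for one timestep of driving parameters (illustrative only)."""
    speed: float           # m/s
    steering_angle: float  # radians, positive = left (assumed convention)
    acceleration: float    # m/s^2, negative values indicate braking

    def to_vector(self) -> torch.Tensor:
        # A simple vectorised form suitable for feeding to an action encoder.
        return torch.tensor([self.speed, self.steering_angle, self.acceleration])


# Example: gentle deceleration while driving straight ahead.
action = DrivingAction(speed=4.5, steering_angle=0.0, acceleration=-0.5)
print(action.to_vector())  # tensor([ 4.5000,  0.0000, -0.5000])
```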
It should be noted, however, that the training of the world model using modalities other than images (e.g. action and text) is optional. If driving is to be performed using image inputs only, then the world model need only have been trained using the video modality; however, improved performance may be obtained by additionally training the world model using the action and/or text modalities. If an auxiliary sensor modality is used, it may be used both for pre-training the world model and at run-time, using the auxiliary sensors on the vehicle if present.
Where multiple modalities are used, each modality 100, 102, 104 is encoded separately using an appropriate encoder. An image encoder 106 encodes the frames of the input video 100 to generate image encodings 108. An action encoder 110 encodes the driving actions 102 to generate action encodings 112. A text encoder 114 encodes the text input 104 to generate text encodings 116.
The encodings 108, 112, 116 are aligned temporally, which allows the correct action 102 and/or text 104 to be aligned to the part of the video 100 it refers to. After this alignment, a sequence of input tokens 118 has been created that represents the inputs 100, 102, 104.
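A minimal sketch of this per-modality encoding and temporal alignment is given below, assuming a shared embedding width and a simple interleaving of one image token and one action token per timestep; the encoder architectures shown are illustrative stand-ins for the encoders 106 and 110 rather than the actual components:

```python
import torch
import torch.nn as nn

D = 256  # assumed shared embedding width


class SimpleImageEncoder(nn.Module):
    """Illustrative stand-in for image encoder 106: one embedding per frame
    (a real encoder would typically emit several discrete tokens per frame)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, D))

    def forward(self, frames):          # frames: (T, 3, 64, 64)
        return self.net(frames)         # -> (T, D)


class SimpleActionEncoder(nn.Module):
    """Illustrative stand-in for action encoder 110."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(3, D)

    def forward(self, actions):         # actions: (T, 3)
        return self.net(actions)        # -> (T, D)


def interleave(image_enc, action_enc):
    """Temporal alignment: each timestep's image encoding is immediately
    followed by the action encoding it refers to."""
    T = image_enc.shape[0]
    tokens = torch.stack([image_enc, action_enc], dim=1)  # (T, 2, D)
    return tokens.reshape(T * 2, D)                       # (2T, D)


frames = torch.randn(8, 3, 64, 64)
actions = torch.randn(8, 3)
tokens = interleave(SimpleImageEncoder()(frames), SimpleActionEncoder()(actions))
print(tokens.shape)  # torch.Size([16, 256])
```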
A world model 120 based on an autoregressive transformer is used to model this sequence (similar to LLMs known in the art per se, such as GPT or Llama). During training, the weights of the world model 120 are learnt in the manner described hereinabove in Chapter I.
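By way of a hedged sketch, the next-token training objective used by such an autoregressive transformer can be illustrated as follows; the vocabulary size, model dimensions, and layer choices below are arbitrary placeholders rather than the configuration of the world model 120:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyWorldModel(nn.Module):
    """Illustrative autoregressive transformer over discrete tokens (stand-in for world model 120)."""
    def __init__(self, vocab_size=1024, d_model=256, n_layers=2, n_heads=4, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):  # tokens: (B, T) integer token ids
        B, T = tokens.shape
        x = self.tok(tokens) + self.pos(torch.arange(T, device=tokens.device))
        # Causal mask so that each position can only attend to earlier positions.
        mask = torch.triu(torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        x = self.blocks(x, mask=mask)
        return self.head(x)     # (B, T, vocab_size) next-token logits


model = TinyWorldModel()
seq = torch.randint(0, 1024, (2, 32))          # a batch of token sequences
logits = model(seq[:, :-1])
# Next-token objective: predict token t+1 from tokens up to and including t.
loss = F.cross_entropy(logits.reshape(-1, 1024), seq[:, 1:].reshape(-1))
print(float(loss))
```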
This world model 120 can predict one token at a time and generate output tokens 122 that correspond to the future frames. Given some past images (e.g. the two images shown in
The video decoder 124 can then decode the tokens 122 back into a video 126, where the image frames of that output video 126 have been synthetically generated based on the outputs of the world model 120.
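For completeness, the autoregressive generation loop referred to above can be sketched as follows; greedy token selection is assumed here purely for brevity, and the real system may use a different sampling strategy:

```python
import torch


@torch.no_grad()
def rollout(world_model, context_tokens, num_future_tokens):
    """Illustrative autoregressive rollout: repeatedly predict one token and append it.
    context_tokens: (1, T) token ids derived from the observed past frames."""
    seq = context_tokens.clone()
    for _ in range(num_future_tokens):
        logits = world_model(seq)                                 # (1, T, vocab)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)   # greedy choice, for brevity
        seq = torch.cat([seq, next_token], dim=1)
    return seq[:, context_tokens.shape[1]:]                       # only the newly generated future tokens

# e.g. future_tokens = rollout(model, seq[:, :16], num_future_tokens=16)
# A video decoder such as 124 would then map these discrete tokens back to image frames.
```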
As can be seen in
It should be noted that, for the purposes of generating a pre-trained world model suitable for use in autonomous driving (as explained below), the video decoder 124 is optional—there is no need to explicitly use the model for its generative video capabilities in order to drive; however, this capability is mentioned here for completeness.
The resultant pre-trained world model is, in accordance with embodiments of the present invention, then used for the purposes of autonomous driving. The model is fine-tuned and trained for the task of driving. That driving model—which we call GAIA-Drive—can then be used to perform inference in an autonomous vehicle.
The pre-training process set out above results in five trained components—the image encoder 106, the action encoder 110, the text encoder 114, the world model 120, and the video decoder 124. For the purposes of the driving model, the components of interest are the trained image encoder 106 and world model 120.
The driving model comprises three components—an image encoder 206, a pre-trained world model 220, and a driving decoder 240. If an auxiliary sensor encoder were included during training of the world model, then that trained auxiliary sensor encoder may also be used for the driving model; however, for ease of reference this is not shown in
When training this driving model, the parameters of the world model 220 are initialised with the pre-trained weights as above, i.e. the weights of the world model 120.
Similarly, the parameters of the image encoder 206 are also initialised with the pre-trained weights learned when training the image encoder 106.
The parameters of the driving decoder 240 are initialised at random because this is a new component that was not present during the pre-training process.
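A hedged sketch of this initialisation is shown below; the decoder architecture, dimensions, and checkpoint file names are hypothetical and serve only to illustrate that the image encoder 206 and world model 220 take pre-trained weights while the driving decoder 240 starts from random weights:

```python
import torch
import torch.nn as nn


class DrivingDecoder(nn.Module):
    """Illustrative stand-in for driving decoder 240: maps world-state features
    (plus optional route features) to a driving plan in a single forward pass."""
    def __init__(self, d_model=256, d_route=32, plan_dim=3, horizon=20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model + d_route, 512),
            nn.ReLU(),
            nn.Linear(512, horizon * plan_dim),
        )
        self.horizon, self.plan_dim = horizon, plan_dim

    def forward(self, features):                           # (B, d_model + d_route)
        out = self.net(features)
        return out.view(-1, self.horizon, self.plan_dim)   # e.g. (speed, curvature, accel) per step


# The image encoder and world model would be instantiated with the same architectures as in
# pre-training and loaded from the pre-training checkpoints (file names hypothetical), e.g.:
#   image_encoder.load_state_dict(torch.load("image_encoder_pretrained.pt"))
#   world_model.load_state_dict(torch.load("world_model_pretrained.pt"))
driving_decoder = DrivingDecoder()   # parameters left at their random initialisation
```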
For the driving model, only videos 200 are encoded, as there is no text available when running on a car, only inputs from the cameras. During training of the driving model, the videos 200 are pre-recorded training videos; at inference time, the videos are taken from cameras on the vehicle in real time. As the image encodings do not need to be temporally aligned with any other modality (images being the only input here), the image encodings 208 are used directly as the input tokens for the pre-trained world model 220.
The image features captured within the image-encoding-based input tokens 208 are processed using the world model 220, which produces output tokens relating to a world state 222 representative of the world modelling features. Critically, this world state 222 provides an implicit representation of the multiple possible futures that the world model 220 considers as a result of the image input 200 and what it learned during the earlier training phase. In other words, the output of the world model 220 is not itself a prediction of what will come next, but rather is a ‘fuzzy’ output representative of different possibilities based on sequences it has seen before, as applied to the new image input 200.
In addition to the world state 222 output by the world model 220, we can concatenate extra features that may be used for driving. In this particular example, we concatenate the world state 222 with driving input features 242 from the route map, which tells the model where to go (this may be a route and/or a destination, or similar).
The features of the world state 222 generated by the world model 220—together with the extra features 242 concatenated to these, if appropriate—are input to the driving decoder 240. The driving decoder 240 is configured to decode the inputs it is provided with into a driving plan 244. The driving plan 244 may include one or more of: a waypoint, a speed, a velocity, a curvature, a trajectory, an indicator signal, an acceleration value, a braking value, a parking brake value, a steering angle, a lighting setting, and a horn setting.
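The single forward pass from camera frames to a driving plan may be sketched as follows; the pooling of the world state into a single feature vector, and the shapes assumed, are illustrative assumptions rather than the actual implementation:

```python
import torch


def plan_single_pass(image_encoder, world_model_backbone, driving_decoder,
                     frames, route_features):
    """Illustrative single forward pass from camera frames to a driving plan 244.
    world_model_backbone is assumed to return per-token features (the world state 222),
    not sampled future tokens, i.e. no autoregressive rollout is performed."""
    tokens = image_encoder(frames)                            # (T, D) image-derived tokens
    world_state = world_model_backbone(tokens.unsqueeze(0))   # (1, T, D) world-state features
    pooled = world_state[:, -1]                               # latest feature vector (assumption)
    decoder_input = torch.cat([pooled, route_features], dim=-1)
    return driving_decoder(decoder_input)                     # (1, horizon, plan_dim)
```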
The decoding process carried out by the driving decoder 240 is carried out in one forward pass. This avoids the autoregressive prediction (as used with the world model 120 during the pre-training phase)—by way of comparison, note the removal of the dashed line that looped back into the world model 120 in the pre-training phase.
As the world state 222 is used as the input to the driving decoder 240, rather than a prediction of a specific future generated by the world model 220, the driving decoder 240 receives information about multiple possible futures.
This means that the image-encoding-based input tokens 208 are processed into the world model features 222 by the pre-trained world model 220 without any autoregression, so the world model features 222—which provide an implicit representation of different possible futures—can be generated quickly. The driving decoder 240 can then be trained to take these ‘fuzzy’ inputs from the world model and predict a driving plan 244 based on them.
To train the driving model, a set of driving decoder weights for the driving decoder 240 are first initialised to a set of random values. As above, the weights for the image encoder 206 and world model 220 are imported from the pre-training process.
Video training data is supplied as the image input 200, where the video training data comprises a plurality of time-varying further image frames, as with the videos described previously. As part of the training data, a driving plan prior corresponding to the video training data is provided; this driving plan prior is used to check the performance of the driving model.
The image encoder 206 encodes the image frames from the video training data to generate image encodings, which, as above, are supplied as input tokens to the pre-trained world model 220. The pre-trained world model 220 generates a world state 222 from the image encodings.
The world state 222 (and potentially a training route plan) is input to the driving decoder 240 which generates a new driving plan 244 based on the world state 222 and the current set of driving decoder weights.
The new driving plan 244 is compared to the driving plan prior to determine a difference between these, and the set of driving decoder weights is updated based on that difference. Optionally, the set of weights for the world model may also be updated based on the difference.
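A minimal sketch of one such update step is given below; the use of a mean-squared-error loss between the new driving plan and the driving plan prior is an assumption made purely for illustration, as is the choice of optimiser:

```python
import torch
import torch.nn.functional as F


def training_step(driving_decoder, optimizer, world_state, route_features, plan_prior):
    """One illustrative update of the driving decoder weights."""
    decoder_input = torch.cat([world_state, route_features], dim=-1)
    predicted_plan = driving_decoder(decoder_input)
    # Difference between the new driving plan and the driving plan prior;
    # a mean-squared-error loss is assumed here purely for illustration.
    loss = F.mse_loss(predicted_plan, plan_prior)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)


# Hypothetical optimiser setup: pass only the decoder's parameters to update the decoder alone,
# or additionally the world model's parameters if the world model is also to be fine-tuned.
# optimizer = torch.optim.Adam(driving_decoder.parameters(), lr=1e-4)
```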
This process may then be repeated with further training videos and corresponding driving plan priors as many times as desired, incrementally updating the driving decoder weights (and, potentially, the world model weights).
Once the training of the driving model is complete, it is ready to be used for inference.
Once trained, the driving model can be used at inference time with new video image input 200 taken from camera(s) on the autonomous vehicle in the real-world environment.
The autonomous driving system can be provided driving input features 242, e.g. relating to a route map. This may, for example, be a desired route or an end destination (e.g. a user's house, office, a supermarket, or such like).
As the vehicle sets off driving, its cameras will capture images of its surroundings. These images are provided as the image input 200, which are processed by the driving model as above to generate the driving plan 244. The autonomous vehicle is then operated according to the driving plan.
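At run time this amounts to a simple re-planning loop, sketched below under the assumption of a roughly 100 ms planning period; get_camera_frames and execute_plan are hypothetical callables interfacing with the vehicle's cameras and control system respectively:

```python
import time

import torch


@torch.no_grad()
def drive_loop(get_camera_frames, execute_plan, image_encoder, world_model_backbone,
               driving_decoder, route_features, period_s=0.1):
    """Illustrative re-planning loop: form a fresh driving plan roughly every 100 ms."""
    while True:
        start = time.monotonic()
        frames = get_camera_frames()                                   # latest camera frames as a tensor
        tokens = image_encoder(frames)
        world_state = world_model_backbone(tokens.unsqueeze(0))[:, -1]
        plan = driving_decoder(torch.cat([world_state, route_features], dim=-1))
        execute_plan(plan)                                             # act on the new driving plan
        time.sleep(max(0.0, period_s - (time.monotonic() - start)))
```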
The proposed approach advantageously leverages information implicitly embedded within a world model (of the type usable to create synthetic driving videos) to drive a vehicle in a real-world, practical environment with the constraints that environment imposes. This GAIA-Drive model builds on models such as GAIA-1 developed by the Applicant.
Whereas conventional approaches known in the art per se iteratively roll out a future scene and plan driving actions accordingly (and then repeat), embodiments of the present invention allow the driving plan to be formed in a single pass, providing for much faster processing. As speed is critical to the model being suitable for use in an autonomous vehicle, the approach provided by the present invention overcomes the issues associated with prior art approaches.
This data-driven approach allows the model to learn to estimate world states and predict their evolution from real-world data, which may provide more accurate modelling of complex real-world driving scenarios. The invention may provide improvements in adaptability and generalisation, as the model can adapt to new scenarios and generalise to previously unseen situations, making it more robust and better equipped to handle novel situations. Additionally, the model of the present invention may be subject to less bias: because the model learns from data, the need for manual design is reduced and the potential for human-induced biases is minimised, which may lead to more objective and unbiased training and validation data.
While specific embodiments of the present invention have been described in detail, it will be appreciated by those skilled in the art that the embodiments described in detail are not limiting on the scope of the claimed invention.