COMPUTER IMPLEMENTED METHOD FOR PROVIDING A PERCEPTION MODEL FOR ANNOTATION OF TRAINING DATA

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application for patent claims priority to European Patent Office Application Ser. No. 23206091.3, entitled “A COMPUTER IMPLEMENTED METHOD FOR PROVIDING A PERCEPTION MODEL FOR ANNOTATION OF TRAINING DATA” filed on Oct. 26, 2023, assigned to the assignee hereof, and expressly incorporated herein by reference.

TECHNICAL FIELD

The present inventive concept relates to the field of autonomous vehicles. In particular, it is related to methods and devices for annotation of training data for use in training of a perception model.

BACKGROUND

With the development of technology in recent years, image capturing and processing techniques have become widely used in different fields of technology. In particular, vehicles produced today are commonly equipped with some form of vision or perception system for enabling new functionalities. Moreover, an increasing portion of modern vehicles has advanced driver-assistance systems (ADAS) to increase vehicle safety and more generally road safety. ADAS—which for instance may be represented by adaptive cruise control (ACC), collision avoidance system, forward collision warning, lane support systems, etc.—are electronic systems that may aid a driver of the vehicle. Today, there is ongoing research and development within a number of technical areas associated to both the ADAS and the Autonomous Driving (AD) field. ADAS and AD may also be referred to under the common term Automated Driving System (ADS) corresponding to all of the different levels of automation as for example defined by the SAE J3016 levels (0-5) of driving automation.

Some functions of these systems can be implemented using simple rule-based techniques. However, to handle the complexity of real-world driving scenarios, which involves varying road conditions, unpredictability in human or non-human behavior, and rapidly changing environments, the use of machine learning models has proven to enhance the safety, capability and performance of the ADS. Machine learning models, such as deep learning models or neural networks are especially useful as part of the perception system of the ADS for e.g. detecting, identifying, or tracking objects in the surrounding environment of the vehicle.

Solving the perception tasks necessary to achieve autonomous driving with deep learning algorithms requires a vast quantity of labeled training data. Such datasets need to cover any imaginable scenario that might present itself while driving. Collecting the data is a relatively easy task. However, annotating the data to make it useful for training of a machine learning model is many orders of magnitude more expensive, as it typically requires human involvement. These problems are only made worse when moving to spatiotemporal models which require annotated sequence data, bringing a new dimension to the annotation cost. One of the holy grails in the development of AD is therefore to find ways of doing this in an automated manner. The present inventive concept provides techniques for acquiring high-fidelity annotation in a more automated manner, which can remove or drastically reduce the need for human involvement.

SUMMARY

The herein disclosed technology seeks to mitigate, alleviate, or eliminate one or more of the above-identified deficiencies and disadvantages in the prior art to address various problems relating to acquiring annotated training data. Recent advances in large language models have demonstrated the fact that deep learning is at its most powerful when there is no clear limitation to the scale of the model or the size of its input dataset. The inventors have realized that these aspects can be utilized also in other areas, such as in the field of autonomous driving development for annotation of data. The presently disclosed technology at least partly builds upon leveraging easy to collect data to train an offline model to be able to annotate training data which then can be used to train an online (or production) model used in a vehicle equipped with an automated driving system, ADS.

As stated above, data collection is orders of magnitude cheaper than annotations. For this reason, the presently disclosed technology leverages data that need not to be explicitly labeled. The essential problem then becomes defining an objective function for a model that enables it to leverage this vast quantity of unlabeled data, while building an understanding of the world around the vehicle that can be used to solve relevant AD tasks, such as object or lane tracking. The proposed technology allows one to train a model for offline auto-annotation which is limited only by the amount of raw data collected, e.g. by test vehicles or a fleet of vehicles, and the available computational resources for training, rather than the resources for human annotation. The proposed objective function for an offline perception model for subsequent annotation of training data is herein selected as the problem of predicting missing sensor data of a sensor data sequence, based on the available sensor data of the sensor data sequence.

Various aspects and embodiments of the disclosed invention are defined below and in the accompanying independent and dependent claims.

According to a first aspect, there is provided a computer-implemented method for providing an offline perception model for subsequent annotation of training data. The training data can be used in training of an online perception model. The method comprises training a foundation model, using a first training dataset, to predict sensor data pertaining to a physical environment for a time instance of a sequence of time instances, based on sensor data for remaining time instances of said sequence of time instances. The method further comprises forming the offline perception model by adding a task-specific layer to the trained foundation model. The task-specific layer is configured to perform a perception task of the offline perception model. The method further comprises fine-tuning the offline perception model, using a second training dataset, to perform the perception task. The second training dataset comprises training data annotated for said perception task.

According to a second aspect, there is provided a computer program product comprising instructions which when the program is executed by a computing device, causes the computing device to carry out the method according to any embodiment of the first aspect. According to an alternative embodiment of the second aspect, there is provided a (non-transitory) computer-readable storage medium. The non-transitory computer-readable storage medium stores one or more programs configured to be executed by one or more processors of a processing system, the one or more programs comprising instructions for performing the method according to any embodiment of the first aspect. Any of the above-mentioned features and advantages of the first aspect, when applicable, apply to the second aspect as well. In order to avoid undue repetition, reference is made to the above.

According to a third aspect, there is provided a device for providing an offline perception model for subsequent annotation of training data. The training data may then be used in training of an online perception model. The device comprises control circuitry. The control circuitry is configured to train a foundation model, using a first training dataset, to predict sensor data pertaining to a physical environment for a time instance of a sequence of time instances, based on sensor data for remaining time instances of said sequence of time instances. The control circuitry is further configured to form the offline perception model by adding a task-specific layer to the trained foundation model. The task-specific layer is configured to perform a perception task of the offline perception model. The control circuitry is further configured to fine-tune the offline perception model, using a second training dataset, to perform the perception task. The second training dataset comprises training data annotated for said perception task. Any of the above-mentioned features and advantages of the other aspects, when applicable, apply to this third aspect as well. In order to avoid undue repetition, reference is made to the above.

According to a fourth aspect, there is provided a computer-implemented method for annotating data for use in subsequent training of an online perception model. The online perception model is configured to perform a perception task of a vehicle equipped with an automated driving system. The method comprises obtaining sensor data pertaining to a physical environment. The method further comprises determining a perception output by inputting the obtained sensor data into an offline perception model provided by the method according to any embodiment of the first aspect. The method further comprises storing the sensor data together with the perception output as annotation data for subsequent training of the online perception model. Any of the above-mentioned features and advantages of the other aspects, when applicable, apply to this fourth aspect as well. In order to avoid undue repetition, reference is made to the above.

According to a fifth aspect, there is provided a computer program product comprising instructions which when the program is executed by a computing device, causes the computing device to carry out the method according to any embodiment of the fourth aspect. According to an alternative embodiment of the fifth aspect, there is provided a (non-transitory) computer-readable storage medium. The non-transitory computer-readable storage medium stores one or more programs configured to be executed by one or more processors of a processing system, the one or more programs comprising instructions for performing the method according to any embodiment of the fourth aspect. Any of the above-mentioned features and advantages of the other aspects, when applicable, apply to this fifth aspect as well. In order to avoid undue repetition, reference is made to the above.

According to a sixth aspect, there is provided a device for annotating data for use in subsequent training of an online perception model. The online perception model is configured to perform a perception task of a vehicle equipped with an automated driving system. The device comprises control circuitry. The control circuitry is configured to obtain sensor data pertaining to a physical environment. The control circuitry is further configured to determine a perception output by inputting the obtained sensor data into an offline perception model provided by the method according to any embodiment of the first aspect. The control circuitry is further configured to store the sensor data together with the perception output as annotation data for subsequent training of the online perception model. Any of the above-mentioned features and advantages of the other aspects, when applicable, apply to this sixth aspect as well. In order to avoid undue repetition, reference is made to the above.

The term “non-transitory,” as used herein, is intended to describe a computer-readable storage medium (or “memory”) excluding propagating electromagnetic signals, but are not intended to otherwise limit the type of physical computer-readable storage device that is encompassed by the phrase computer-readable medium or memory. For instance, the terms “non-transitory computer readable medium” or “tangible memory” are intended to encompass types of storage devices that do not necessarily store information permanently, including for example, random access memory (RAM). Program instructions and data stored on a tangible computer-accessible storage medium in non-transitory form may further be transmitted by transmission media or signals such as electrical, electromagnetic, or digital signals, which may be conveyed via a communication medium such as a network and/or a wireless link. Thus, the term “non-transitory”, as used herein, is a limitation of the medium itself (i.e., tangible, not a signal) as opposed to a limitation on data storage persistency (e.g., RAM vs. ROM).

The disclosed aspects and preferred embodiments may be suitably combined with each other in any manner apparent to anyone of ordinary skill in the art, such that one or more features or embodiments disclosed in relation to one aspect may also be considered to be disclosed in relation to another aspect or embodiment of another aspect. Moreover, any advantages mentioned in connection with one aspect, when applicable, applies to the other aspects as well.

As stated previously, the presently disclosed technology may be advantageous in that it enables annotation of training data with less need for human involvement. Aside from a relatively small training dataset of annotated data (second training dataset compared to first training dataset), the offline perception model used for annotating data can be provided using vast amounts of data which do not require explicit annotations. Thus, one can simply drive vehicles with appropriate sensor setups around and collect the relevant sensor data. Thus, high fidelity annotations of training data, which can be used in training of the online perception model, cab subsequently be generated in a more time efficient way, and in quantities much greater than was previously feasible with today's technologies, which in turn can improve the subsequent training of the online perception model. An effect of utilizing implicitly annotated data and/or semi-supervised learning of the foundation model makes it easy to collect these vast amounts of data with little to no effort. Moreover, by deploying the perception model used for data annotation as an offline model allows for pushing the model and dataset size to new heights. An effect of the above aspects may be that the offline perception model can learn the complex task of predicting sensor data missing from a sensor data sequence provided as input to the model. Solving this task on a sufficiently large and varied dataset can result in a model that can understand the dynamics of the environment and learn the temporal evolution of the scene. Thus, the offline perception model provided by the present technology may be more powerful (e.g. in the sense of capability, accuracy and general performance) than any auto-annotation model trained only on a limited set of human labeled data in accordance with what is known today. The provided offline perception model is also more powerful than models trained with contrastive loss or classification tasks since these objectives do not directly supersede the relevant AD tasks which the online perception model is intended to perform. In other words, a model can learn to minimize a contrastive loss without actually having a general understanding of object tracking or other perception tasks.

Finally, the proposed objective function for training of the offline perception model may be better efficient compared to other solutions, in the sense that the input and outputs of the model can comprise similar amounts of information.

Further embodiments are defined in the dependent claims. It should be emphasized that the term “comprises/comprising” when used in this specification is taken to specify the presence of stated features, integers, steps, or components. It does not preclude the presence or addition of one or more other features, integers, steps, components, or groups thereof.

These and other features and advantages of the disclosed technology will, in the following, be further clarified with reference to the embodiments described hereinafter.

BRIEF DESCRIPTION OF THE DRAWINGS

The above aspects, features and advantages of the disclosed technology, will be more fully appreciated by reference to the following illustrative and non-limiting detailed description of example embodiments of the present disclosure, when taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a schematic flowchart representation of a method for providing an offline perception model for subsequent annotation of training data, according to some embodiments.

FIG. 2 is a schematic flowchart representation of a method for annotating data for use in subsequent training of an online perception model, according to some embodiments.

FIG. 3 is a schematic illustration of a device for providing an offline perception model for subsequent annotation of training data, according to some embodiments.

FIG. 4 is a schematic illustration of a device for annotating data for use in subsequent training of an online perception model, according to some embodiments.

FIG. 5 is a schematic illustration of a vehicle, in accordance with some embodiments.

FIG. 6 is a schematic illustration of a system in accordance with some embodiments.

FIG. 7A to 7E illustrates, by way of example, sensor data for a sequence of time instances.

DETAILED DESCRIPTION

The present disclosure will now be described in detail with reference to the accompanying drawings, in which some example embodiments of the disclosed technology are shown. The disclosed technology may, however, be embodied in other forms and should not be construed as limited to the disclosed example embodiments. The disclosed example embodiments are provided to fully convey the scope of the disclosed technology to the skilled person. Those skilled in the art will appreciate that the steps, services and functions explained herein may be implemented using individual hardware circuitry, using software functioning in conjunction with a programmed microprocessor or general purpose computer, using one or more Application Specific Integrated Circuits (ASICs), using one or more Field Programmable Gate Arrays (FPGA) and/or using one or more Digital Signal Processors (DSPs).

It will also be appreciated that when the present disclosure is described in terms of a method, it may also be embodied in apparatus comprising one or more processors, one or more memories coupled to the one or more processors, where computer code is loaded to implement the method. For example, the one or more memories may store one or more computer programs that causes the apparatus to perform the steps, services and functions disclosed herein when executed by the one or more processors in some embodiments.

It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting. It should be noted that, as used in the specification and the appended claim, the articles “a”, “an”, “the”, and “said” are intended to mean that there are one or more of the elements unless the context clearly dictates otherwise. Thus, for example, reference to “a unit” or “the unit” may refer to more than one unit in some contexts, and the like. Furthermore, the words “comprising”, “including”, “containing” do not exclude other elements or steps. It should be emphasized that the term “comprises/comprising” when used in this specification is taken to specify the presence of stated features, integers, steps, or components. It does not preclude the presence or addition of one or more other features, integers, steps, components, or groups thereof. The term “and/or” is to be interpreted as meaning “both” as well and each as an alternative.

It will also be understood that, although the term first, second, etc. may be used herein to describe various elements or features, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the embodiments. The first element and the second element are both elements, but they are not the same element.

As used herein, the wording “one or more of” a set of elements (as in “one or more of A, B and C” or “at least one of A, B and C”) is to be interpreted as either a conjunctive or disjunctive logic. Put differently, it may refer either to all elements, one element or combination of two or more elements of a set of elements. For example, the wording “one or more of A, B and C” may be interpreted as A or B or C, A and B and C, A and B, B and C, or A and C.

Throughout the present disclosure, reference is made to machine learning models (or just “models”). By the wording “machine learning model” it is herein meant any form of machine learning algorithm, such as deep learning models, neural networks, or the like, which is able to learn and adapt from input data and subsequently make predictions, decisions, or classifications based on new data. In general, the machine learning model, as used herein, may be any neural network based model which operates on sensor data of an autonomous vehicle. In the following, the wording “perception model” and “foundation model” will be used to distinguish between more specific types of machine learning model, or to define the purpose of the machine learning models.

Deployment of a machine learning model typically involves a training phase where the model learns from labeled or unlabeled training data to achieve accurate predictions during the subsequent inference phase. The training data (and input data during inference) may e.g. be an image, or sequence of images, LIDAR data (i.e. a point cloud), radar data etc. Furthermore, the training/input data may comprise a combination or fusion of one or more different data types. The training/input data may for instance comprise both an image depicting a scene of a surrounding environment of the vehicle, and corresponding LIDAR point cloud of the same scene.

The machine learning model may be implemented in some embodiments using publicly available suitable software development machine learning code elements, for example, such as those which are available in Pytorch, TensorFlow, and Keras, or in any other suitable software development platform, in any manner known to be suitable to someone of ordinary skill in the art.

The wording “perception model” herein refers to a computational system or algorithm designed to perceive or interpret an environment depicted in sensor data, such as digital images, video frames, LIDAR data, radar data, ultrasonic data, or other types of visual data relevant for driving of the vehicle. In other words, the perception model may be designed to detect, locate, identify and/or recognize instances of specific objects within the sensor data, vehicle lanes, relevant signage, appropriate navigation paths, etc. Thus, the perception model may be configured to perform a perception task of an automated driving system, ADS, of a vehicle. Examples of perception tasks include, but are not limited to object detection, object classification, lane estimation, and free-space estimation. More specifically, the machine learning model may be an object detection model, an object classification model, a lane estimation model, or a free-space estimation model. The perception model may employ a combination of advanced techniques from computer vision, machine learning, and pattern recognition to analyze the visual sensor data and output e.g. bounding boxes or regions of interest around objects of interest present in the input imagery. The perception model may be further configured to classify what type of object is detected. The perception model may encompass different architectures, including but not limited to convolutional neural networks (CNNs), recurrent neural networks (RNNs), transformers, and other existing or future alternatives.

The output of the perception model may be used in a downstream task or by a downstream system of the ADS, such as in trajectory prediction, path planning, or emergency brake systems. In some embodiments, the perception model may be part of an end-to-end model configured to (as opposed to above) perform both a perception task and a downstream task. For example, the machine learning model may perform trajectory prediction or path planning based on the sensor data directly.

Moreover, in the following, distinction will be made between an “online” perception model, and an “offline” perception model. This distinction should be understood as referring to how or where the perception model is deployed. The online perception model should be construed as a perception model deployed at the edge, i.e. directly on an edge device, in this case an ADS equipped vehicle. The online perception model may thus be seen as a production model deployed in the vehicle. In other words, the computations of the online perception model are performed locally, close to the data source. In contrast, the offline perception model refers to a perception model deployed e.g. at a remote server (also referred to as cloud server, central server, back-office server, fleet server, or back-end server). Moreover, as opposed to the online perception model, the offline perception model is not used in a production scenario (i.e. in a live real-case scenario). Instead, the offline perception model can be run independently, during a development process. Due to their computational environments, the online perception model typically has a simpler or less computational heavy architecture than the offline perception model, since it is run at the edge, with limited memory and computation resources. The offline perception model on the other hand may be a larger and more complex model, as it may be deployed on a server with more available computational resources. In fact, there may be no clear limit to the size of the offline perception model as it could even be parallelized across several computational devices.

The wording “foundation model” herein refers to a machine learning model that can serve as a base or core architecture upon which more specialized or customized machine learning models are built. The foundation model may also be commonly known as a “base model” or “general-purpose model”. The foundational model is typically pre-trained (often by self-supervised or semi-supervised learning) on a vast and diverse dataset at scale to learn general patterns, features, or representations of data. These learned representations can be leveraged and fine-tuned for a wide range of specific tasks, such as natural language processing, image recognition, recommendation systems, and various other applications. Foundation models are typically characterized by their large model size, including a vast number of trainable parameters. The model size and complexity contribute to its ability to capture intricate patterns and representations from extensive datasets. As a non-limiting example, the foundation model may build upon a convolutional neural network (CNN), such as a Residual Neural Network (commonly known in the art as ResNet), as well as one or more transformer models. For example, images captured by one or more cameras of the vehicle may be fed to the CNN to encode them. Alternatively, a vision transformer may be used. Then a LIDAR point cloud and/or radar scan corresponding to the physical environment depicted in the image(s) may be encoded by the CNN or a different model. The encoded image(s), LIDAR point cloud, and/or radar scan may be fed to the transformer model, which can build a unified abstract representation of the physical environment. The transformer model may further take into account encoded sensor data, or the sensor data itself, of previous time instances. As a non-limiting example, the so called BEVFormer (presented by Li et al.) may be used. The unified abstract representation may then be further processed by the above mentioned transformer model, or a further transformer model, before providing an output of the foundation model. In summary, arbitrary large models (e.g. CNNs) can be used to encode the sensor data. One or more transformer models or arbitrary size may then be used to interpret the encoded sensor data. Training such a foundation model can be done end-to-end. In other words, the entire model can be trained simultaneously as a whole. It goes without saying that the above example of a foundation model structure is only to be seen as a non-limiting example, as many alternatives are also possible, as readily appreciated by the person skilled in the art.

In essence, a foundation model can employ a transfer learning approach where knowledge gained from one domain or task can be transferred and adapted to improve performance in another domain or task. The concept of a foundation model plays a crucial role in the efficiency and effectiveness of machine learning systems, enabling faster development and improved performance across a spectrum of applications through the reuse of learned features and representations.

In the present disclosure, the foundation model is a generative model. By the wording “generative model” it is herein meant a machine learning model with the task of generating data based on a certain input. In the present case, the generative model is trained to predict (i.e. generate or reconstruct) sensor data of a certain time instance (or a sequence of sensor data, i.e. sensor data over a set of time instances), based on sensor data for other neighboring time instances. This is a challenging objective, which if solved on a sufficiently large set of data provides a machine learning model that has learnt to understand the behavior and features of all objects on the road, as well as the road itself, and the surrounding environment. More details regarding the online and offline perception model, the foundation model, and the training thereof will become apparent from the following detailed description.

FIG. 1 is a schematic flowchart representation of a computer-implemented method 100 for providing an offline perception model for subsequent annotation of training data. The offline perception model may also be referred to as an annotation model. The training data may in turn be used in subsequent training of an online perception model. The method 100 may be performed by a device 300 as described below in connection with FIG. 3. More generally, the method 100 may be performed by any suitable computing device, such as a remote server. Advantageously, the server is a device having more available computational resources than an ADS equipped vehicle. This may facilitate deployment of a more computational heavy offline perception model. The online perception model trained on the annotated training data can instead be deployed in the vehicle.

The wording “annotation” as used herein, refers to the process of adding some form of metadata or tags to data to make it understandable and usable for machine learning algorithms. The metadata can be used to enrich the sensor data in this case, to make it useful for training and evaluating machine learning models. This can include associating labels for identifying e.g. an object in the image, or determining bounding boxes or assigning segmentation data. The wording “labelling” or “labels” can thus be seen as a subset of data annotation. More specifically, it refers to the process of assigning one or more labels or categories to data instances (such as sensor data). For example, in image classification, labeling involves tagging images with their respective classes (e.g., cat, dog, or car).

Below, the different steps of the method 100 are described in more detail. Even though illustrated in a specific order, the steps of the method 100 may be performed in any suitable order as well as multiple times. Thus, although FIG. 1 may show a specific order of method steps, the order of the steps may differ from what is depicted. In addition, two or more steps may be performed concurrently or with partial concurrence. Such variation will depend on the software and hardware systems chosen and on designer choice. All such variations are within the scope of the invention. Likewise, software implementations could be accomplished with standard programming techniques with rule-based logic and other logic to accomplish the various steps. Further variants of the method 100 will become apparent from the present disclosure. The herein mentioned and described embodiments are only given as examples and should not be limiting to the present invention. Other solutions, uses, objectives, and functions within the scope of the invention as claimed below described patent claims should be apparent for the person skilled in the art.

The method 100 comprises training S102 a foundation model, using a first training dataset, to predict sensor data pertaining to a physical environment for a time instance of a sequence of time instances, based on sensor data for remaining time instances of said sequence of time instances. In other words, the foundation model may be trained to predict sensor data in a sensor data sequence, based on the remaining sensor data of the sensor data sequence. Put differently, the foundation model is trained to generate or reconstruct the sensor data pertaining to the physical environment for said time instance, based on the sensor data pertaining to the physical environment at the other time instances of the sequence of time instances. FIG. 7A to 7E illustrates, by way of example, sensor data 702a-e for a sequence 700 of time instances. The sensor data illustrated in broken lines corresponds to withheld (or masked) sensor data (can also be described as gaps in the sensor data sequence) which is not available as input to the foundation model. Moreover, the sensor data to which the arrow points, corresponds to the sensor data of the time instance which is to be predicted. The examples of FIG. 7A to 7E illustrates just some principles of how the foundation model may be trained. It goes without saying that the training of the foundation model may be done in many different ways. Further, the principles of the different examples may be combined in any suitable manner. As one example, the principles of FIG. 7B of predicting the sensor data 702c based on both past and future time instances are applicable also to the principles illustrated in FIG. 7C to 7E. In general, the gaps in the sensor data sequence can be synchronized between sensors so that all information is hidden for the time instance to be predicted. Alternatively, or in combination, only a subset of sensors for that time instance may be hidden. The same holds for the sensor data of the remaining time instance of the sequence of time instances. Moreover, it is to be noted that the principles described in FIG. 7A to 7E can be expanded to predicting the sensor data for a plurality of time instances. As an example, in the case illustrated in FIG. 7A, this may translate into predicting e.g. the sensor data for the time instance t-1 and t, based on the sensor data for time instances t-n to t-2. All such variants fall within the scope of the present inventive concept.

The remaining time instances of the sequence of time instances may comprise past and/or future time instances of the time instance of which sensor data is predicted. Thus, in some embodiments, the foundation model is trained to predict the sensor data of a future time instance based on sensor data of a number of past time instances. This is illustrated in FIG. 7A in which the sensor data denoted 702e of the time instance denoted t is to be predicted. In this example, the sensor data 702e of time instance t is predicted based on sensor data 702a-d of the n past time instances, t-n to t-1. It is to be noted that the number n can be any positive integer. In some embodiments, the foundation model is trained to predict the sensor data of a past time instance, based on sensor data of a number of future time instances, i.e. opposite from what is shown in FIG. 7A. In this scenario the sensor data denoted 702a of time instance t-n would be predicted based on the sensor data 702b-e of the remaining time instances. In some embodiments, the foundation model is trained to predict the sensor data of said time instance, based on the sensor data of both past and future time instances. This is illustrated in FIG. 7B in which the sensor data denoted 702c of time instance t is predicted based on sensor data 702a-b of n past time instances, and on sensor data 702d-e of m future time instances. The numbers n and m can be any positive integer. Moreover, the numbers n and m may be the same, or different values. Using both past and future time instances may allow the model to obtain a more complex understanding of how a physical environment can develop with time.

It is to be noted that in some embodiments, the foundation model is trained to predict sensor data for a plurality of time instances. In other words, the foundation model may be trained to predict a sensor data sequence, where the sensor data sequence comprises sensor data for the plurality of time instances. Thus, the foundation model may be trained to predict sensor data pertaining to the physical environment for one or more time instances of the sequence of time instances, based on sensor data for remaining time instances of said sequence of time instances.

The first training dataset may comprise a number of training samples. Each training sample may consist of a sensor data sequence (i.e. sensor data for a sequence of time instances). The first training dataset can be readily collected by capturing sensor data of a vehicle for a sequence of time instances. Thereby, a vast amount of data can be collected in a simple way. When training the foundation model, the sensor data corresponding to a time instance to be predicted can be withheld (or masked) from the foundation model, and used as ground truth. In other words, the foundation model can be fed with the sensor data of the remaining time instances, to try to reconstruct the withheld sensor data. A comparison of an output of the model with the ground truth may then serve as basis for learning and improving the foundation model in predicting the sensor data. The first training dataset can thus be described as an implicitly annotated data set. By the wording “implicitly annotated”, it is herein meant a low-level annotation which can be obtained without any human or automated annotation processes. Put differently, the sensor data can be implicitly annotated in the sense that it is annotated without having to process or analyze the contents of the sensor data. Instead, the sensor data (or parts of it) of a time instance to be predicted need just to be withheld from the foundation model input. In some regards, the first training dataset can be seen as an unannotated training dataset, as no formal labels are associated with the training samples. The supervision signal instead comes from the data itself, rather than from an external source, such as human or other automated annotation processes. This allows the first training dataset to be much larger than would be feasible for an explicitly annotated training dataset (such as the second training dataset described below).

By using a first training dataset as described above, the collection of vast amounts of training data can be done with little to no effort. Using implicitly annotated data means that there is virtually no limit in how much training data can be collected for the training of the foundation model, since, if having a fleet of vehicles, it is only limited by storage and transfer of data between vehicles and a central server.

The training S102 of the foundation model can thus be performed using self-supervised learning. Self-supervised learning herein refers to a machine learning paradigm where the model learns from unlabeled data by creating its own supervision signal or labels (may also be referred to implicit labels or implicit annotation data). In traditional supervised learning, models are trained on labeled datasets, where each input is paired with a corresponding target label. In traditional unsupervised learning, the model is trained exclusively on unlabeled datasets. However, in self-supervised learning, the model generates its own labels or representations from the input data without any human supervision. In this case, the self-created supervision signal corresponds to the withheld (or masked) sensor data associated with the time instance of which the foundation model is to predict the sensor data. Using self-supervised learning may be advantageous in its ability to leverage vast amounts of unlabeled data, which is often more readily than labeled data, while still enjoying at least some benefits of having a supervision signal associated with the data.

The wording “predicting”, as in “predicting sensor data” should herein be construed as generating or reconstructing the sensor data for said time-instance. In other words, the predicted sensor data can be seen as machine-generated data, resembling actual sensor data captured by real-world sensors, and corresponding to what the foundation model believes the sensor data of said time-instance to look like.

The surrounding physical environment of the vehicle can be understood as a general area around the vehicle in which objects (such as other road users, landmarks, obstacles, etc.) can be detected and identified by vehicle sensors (radar sensor, LIDAR sensor, camera(s), etc.), i.e. within a sensor range of the vehicle. The sensor data pertains to the physical environment in the sense that the sensor data reflects one or more properties of the physical environment, e.g. by depicting one or more objects in the physical environment.

The sensor data may be collected by on-board sensors of an ADS equipped vehicle. Thus, the sensor data may pertain to a surrounding physical environment of the vehicle. The sensor data may comprise one or more types of sensor data out of a group comprising image data, LIDAR data, radar data and ultrasonic data. The image data may e.g. be one or more images or image frame(s). The LIDAR data may be a point cloud. In some embodiments, the sensor data to be predicted is LIDAR data. The LIDAR data may be predicted based on LIDAR data for the remaining time instances, or based on sensor data comprising one or more types of sensor data. Training the offline perception model to predict LIDAR data may be advantageous in that the offline perception model can focus on learning how objects in the physical environment look and behave dynamically, without having to also understand complex physics of light is tasked to predict images for instance. The task of predicting new LIDAR point clouds may therefore be more readily achieved, than predicting new camera images, while still superseding the intended perception task for which the annotated training data is to be used for. This may therefore provide a balance between still achieving a highly capable model, without being overly complex or computationally heavy.

By using only one type of sensor data, the foundation model may be trained to better understand the world from that point of view. By using more than one type of sensor data, the foundation model may learn also how different sensor data types relate to each other.

The sensor data may be raw sensor data. Alternatively, the sensor data may be processed or fused sensor data of two or more different types of sensor data. FIG. 7A and 7B illustrate two examples where only one type of sensor data is used. In other words, the sensor data 702a-e may be any one of image data, LIDAR data, radar data or ultrasonic data. Thus, the foundation model may be trained S102 to predict one type of sensor data. However, as explained above, the sensor data may comprise more than one type of sensor data. Thus, the foundation model may be trained S102 to predict at least one type of the sensor data. FIG. 7C to 7E shows examples where the sensor data 702a-e comprises more than one type of sensor data, such as three types of sensor data as shown. It goes without saying that the sensor data may comprise at least two types of sensor data out of the above mentioned group of sensor data types. In case the sensor data comprises two or more types of sensor data, the foundation model may be trained S102 to predict a type of sensor data of the two or more types of sensor data for the time instance of the sequence of time instances, based on sensor data for remaining time instances of said sequence of time instances. In other words, the sensor data for the predicted time instance t may be predicted based only on sensor data of the remaining time instances. This is illustrated in FIG. 7C, in which all sensor data of the predicted time instance t is withheld. Moreover, in case the sensor data comprises two or more types of sensor data, the foundation model may be trained to predict any number of the types of sensor data. For example, it may be trained to predict only one type of sensor data, or all types of sensor data.

In case the sensor data comprises two or more types of sensor data, the foundation model may be trained S102 to predict a type of the two or more types of sensor data for the time instance of the sequence of time instances, based on sensor data for remaining time instances of said sequence of time instances, and further based on remaining types of the two or more types of sensor data for the predicted time instance. Two examples of this is shown in FIG. 7D and 7E. In FIG. 7D, only one type of the sensor data 702e for the predicted time instance t is withheld. The foundation model may then be trained to predict this one type, based on all sensor data of the remaining time instances t-n to t-1, and the available sensor data of the predicted time instance t. In some embodiments, parts of the sensor data of one or more types of sensor data may be withheld. The foundation model may then be trained to predict the part(s) of the sensor data. In general, some form of arbitrary masking may be applied to the sensor data sequence, which the foundation model is then trained to learn to predict. This allows the foundation model to learn a deep understanding of how objects and the physical environment evolve with time. As a non-limiting example, based on a tenth the points from a past LIDAR point cloud, a bottom half of a future image, and a third of the ultrasonic data of the time instance to be predicted, the foundation model may be trained to predict the radar data (or parts of the radar data) of the time instance to be predicted. FIG. 7E shows a slightly different scenario, in which the sensor data type to be predicted of the time instance t is also withheld from the remaining time instances which is used as input to the offline perception model. For example, the offline perception model may be trained to predict LIDAR data of the time instance to be predicted based on other types of sensor data of the remaining time instances. This may be advantageous in that it forces the offline perception model to also learn how the different sensor data types relate to each other.

In another concrete non-limiting example, the foundation model may be trained to predict LIDAR point-clouds into the future (i.e. based on only previous time instances). As input to the foundation model, all available sensor data may be used. Using all sensors in the input may allow the model to learn the most information about the world. For instance, in LIDAR data, a car at a distance of 200 m might be unrecognizable, in general, but combined with the camera the model can know that a single point of the LIDAR point-cloud corresponds to a car. Having this one LIDAR point the model can now also know much more about the location of the car. Moreover, predicting the future LIDAR data requires fully understanding the scene dynamics etc. Thus, making it a suitable task for the foundation model. A camera image also has this property, but properly learning the physics of all the lighting, reflections, etc. can be a harder task than learning to model the LIDAR data.

The method 100 further comprises forming S104 the offline perception model by adding a task-specific layer to the trained foundation model. The task-specific layer is configured to perform a perception task of the offline perception model. The task-specific layer may comprise one or more sub-layers needed for performing the perception task. The perception task may be one of object detection, object classification, object tracking, lane estimation, free-space estimation, trajectory prediction, obstacle avoidance, path planning, scene classification and traffic sign classification.

The step of forming S104 the offline perception model may be seen as transforming the trained foundation model into a task-specific model. In addition to adding the task-specific layer, the structure of the foundation model may be additionally modified to accommodate this. For example, an output layer of the foundation model may be removed, and replaced by the task-specific layer. Moreover, an input layer of the foundation model may be modified or replaced. This may for example be the case if the foundation model and the offline perception model will take different types of data as input. After having added the task specific layer(s) appropriate for making predictions according to the intended perception task, the foundation model may be frozen before subsequently fine-tuning the model, as described below. Ways of converting a trained foundation model into a task-specific model (also known as transfer learning) are to be considered well-known in the art. Any suitable way may be used in this case.

The method 100 further comprises fine-tuning S106 the offline perception model, using a second training dataset, to perform the perception task. The second training dataset comprises training data annotated for said perception task. Fine-tuning the offline perception model allows the already trained, and then modified, foundation model to be adapted to the perception task. Fine-tuning may involve training a part of the offline perception model, such as the task specific layer. Before doing so, the foundation model being part of the offline perception model may be frozen, so that its trainable parameters do not change during the fine-tuning process. Thereby, the fine-tuning of the offline perception model allows trainable parameters (e.g. model weights) of the task specific layer to be learned. Alternatively, the entire offline perception model may be trained during the fine-tuning. In other words, one or more trainable parameters of the foundation model may be updated during fine tuning of the offline perception model.

In contrast to the first training dataset which may be an implicitly annotated dataset, the second training dataset may be an explicitly annotated dataset. In other words, the second training dataset can comprise training samples with explicit labels determined e.g. through human or auto annotation procedures. Due to the offline perception model comprising the already trained foundation model, the second training dataset can be several orders of magnitude smaller than the first training dataset while still being able to achieve higher accuracy and performance compared to a perception model trained only on the second dataset. In other words, the second training dataset may be several orders of magnitude smaller than the first training dataset. Fine-tuning S106 of the offline perception model may be performed by supervised learning. More specifically, fine-tuning S106 of the offline perception model may be performed by supervised learning using explicitly annotated data.

Executable instructions for performing these functions are, optionally, included in a non-transitory computer-readable storage medium or other computer program product configured for execution by one or more processors.

Generally speaking, a computer-accessible medium may include any tangible or non-transitory storage media or memory media such as electronic, magnetic, or optical media—e.g., disk or CD/DVD-ROM coupled to computer system via bus. The terms “tangible” and “non-transitory,” as used herein, are intended to describe a computer-readable storage medium (or “memory”) excluding propagating electromagnetic signals, but are not intended to otherwise limit the type of physical computer-readable storage device that is encompassed by the phrase computer-readable medium or memory. For instance, the terms “non-transitory computer-readable medium” or “tangible memory” are intended to encompass types of storage devices that do not necessarily store information permanently, including for example, random access memory (RAM). Program instructions and data stored on a tangible computer-accessible storage medium in non-transitory form may further be transmitted by transmission media or signals such as electrical, electromagnetic, or digital signals, which may be conveyed via a communication medium such as a network and/or a wireless link.

FIG. 2 is a schematic flowchart representation of a computer-implemented method 200 for annotating data for use in subsequent training of an online perception model, according to some embodiments. The online perception model is configured to perform a perception task of a vehicle equipped with an automated driving system. The method 200 can be performed by a device 400 as described below in connection with FIG. 4.

Below, the different steps of the method 200 are described in more detail. Even though illustrated in a specific order, the steps of the method 200 may be performed in any suitable order as well as multiple times. Thus, although FIG. 2 may show a specific order of method steps, the order of the steps may differ from what is depicted. In addition, two or more steps may be performed concurrently or with partial concurrence. For example, the steps denoted S208 and S210 may be performed independently of each other. Such variation will depend on the software and hardware systems chosen and on designer choice. All such variations are within the scope of the invention. Likewise, software implementations could be accomplished with standard programming techniques with rule-based logic and other logic to accomplish the various steps. Further variants of the method 200 will become apparent from the present disclosure. The herein mentioned and described embodiments are only given as examples and should not be limiting to the present invention. Other solutions, uses, objectives, and functions within the scope of the invention as claimed below described patent claims should be apparent for the person skilled in the art.

The method 200 comprises obtaining S202 sensor data pertaining to a physical environment. The senor data may be collected by one or more onboard sensors of a vehicle. The sensor data may thus pertain to a surrounding physical environment of the vehicle having collected the sensor data. The vehicle may be a vehicle provided with the online perception model. Alternatively, the vehicle may be a different vehicle configured for sensor data collection.

The wording “obtaining” is throughout the present disclosure to be interpreted broadly and encompasses receiving, retrieving, collecting, acquiring, and so forth directly and/or indirectly between two entities configured to be in communication with each other or further with other external entities. However, in some embodiments, the term “obtaining” is to be construed as determining, deriving, forming, computing, etc. Thus, as used herein, “obtaining” may indicate that a parameter is received at a first entity/unit from a second entity/unit, or that the parameter is determined at the first entity/unit e.g. based on data received from another entity/unit. In some embodiments, the sensor data is obtained by being received from the vehicle having collected the sensor data. The vehicle may be part of a fleet of vehicles configured to collect sensor data for use as training data. It is to be noted that the vehicle having collected the sensor data need not to be the same vehicle as being provided with the online perception model. In some embodiments, the sensor data is obtained by being retrieved from a database. In other words, the database may comprise sensor data already collected by one or more vehicles, or by any other collecting means.

The method 200 further comprises determining S204 a perception output by inputting the obtained sensor data into an offline perception model provided by any embodiment of the method 100 described above in connection with FIG. 1. In other words, the perception output may be determined by feeding the sensor data to the trained offline perception model. The offline perception model may be trained to perform the same perception task as the online perception model. Thus, the perception output may be a same type of output as the online perception model would output. More specifically, the perception output may e.g. comprise bounding boxes of objects detected in the sensor data, labels of identified objects, and/or a segmentation of the sensor data etc. The perception output of the offline perception model may thus be used as annotation data for the sensor data inputted to the offline perception model. The perception output may be used as annotation data directly. However, in some embodiments, the perception output may be further processed before being used as annotation data.

The offline perception model as provided according to what is described above can, thanks to its high performance, be able to perceive objects also in new or previously unseen scenarios or environments, thus making it possible to provide annotation data to a wide variety of scenes. This means that the offline perception model becomes more capable for annotating data, than previous attempts of achieving auto annotation models, which are merely trained on a limited training dataset of explicitly annotated data. As a non-limiting example, the offline perception model may, despite being fine-tuned on the second training dataset only comprising examples of tractors in a country-side environment, be able to recognize tractors in a city-environment depicted in the obtained sensor data, at least partly due to the trained foundation model being part of the model architecture of the offline perception model. Another kind of auto-annotation model trained only on a training dataset like the second training dataset describe above, may not be able to recognize a tractor in such a new scenario. It is to be appreciated that this simplified example merely serves for the purpose of illustrating the principles of the presently disclosed technology, and may not be representative of an actual case.

The method 200 further comprises storing S206 the sensor data together with the perception output as annotation data for subsequent training of the online perception model of the vehicle. The above mentioned steps may be repeated for additional sensor data to obtain a training dataset of annotated sensor data. This training dataset may then be used in training the online perception model using supervised learning.

The method 200 may further comprise transmitting S208 the perception data together with the perception output to the vehicle for subsequent training of the online perception model in the vehicle. Thus, a local model of the online perception model may be trained at the edge, i.e. by the vehicle.

The method 200 may further comprise training S210 the online perception model on the stored sensor data together with the perception output, thereby generating an updated online perception model. Thus, a global (or master) model of the online perception model may be trained e.g. by a centralized server. The method 200 may then further comprise transmitting the updated online perception model to a vehicle, or fleet of vehicles.

FIG. 3 is a schematic illustration of a device 300 for providing an offline perception model for subsequent annotation of training data for use in training of an online perception model, in accordance with some embodiments. The device 300 may be configured to perform the method 100 as described in connection with FIG. 1.

The device 300 as described herein for the purpose of this patent application, refers to a computer system, or a networked device configured to provide various computing services, data storage, processing capabilities, or resources to clients or users over a communication network. In the present case, the wording “clients” refers to connected vehicles (such as the vehicle 400 described below) of a fleet of vehicles. Thus, the device 300 as described herein may refer to a general computing device. The device 300 may be a server such as a remote server, cloud server, central server, back-office server, fleet server, or back-end server. Even though the device 300 is herein illustrated as one device, the device 300 may be a distributed computing system, formed by a number of different devices.

The device 300 comprises control circuitry 302. The control circuitry 302 may physically comprise one single circuitry device. Alternatively, the control circuitry 302 may be distributed over several circuitry devices.

As shown in the example of FIG. 3, the device 300 may further comprise a transceiver 306 and a memory 308. The control circuitry 302 being communicatively connected to the transceiver 306 and the memory 308. The control circuitry 302 may comprise a data bus, and the control circuitry 302 may communicate with the transceiver 306 and/or the memory 308 via the data bus.

The control circuitry 302 may be configured to carry out overall control of functions and operations of the device 300. The control circuitry 302 may include a processor 304, such as a central processing unit (CPU), microcontroller, or microprocessor. The processor 304 may be configured to execute program code stored in the memory 308, in order to carry out functions and operations of the device 300. The control circuitry 302 is configured to perform the steps of the method 100 as described above in connection with FIG. 1. The steps may be implemented in one or more functions stored in the memory 308.

The transceiver 306 is configured to enable the device 300 to communicate with other entities, such as vehicles or other devices. The transceiver 306 may both transmit data from and receive data to the device 300.

The memory 308 may be a non-transitory computer-readable storage medium. The memory 308 may be one or more of a buffer, a flash memory, a hard drive, a removable media, a volatile memory, a non-volatile memory, a random access memory (RAM), or another suitable device. In a typical arrangement, the memory 308 may include a non-volatile memory for long-term data storage and a volatile memory that functions as system memory for the device 300. The memory 308 may exchange data with the circuitry 302 over the data bus. Accompanying control lines and an address bus between the memory 308 and the circuitry 302 also may be present.

Functions and operations of the device 300 may be implemented in the form of executable logic routines (e.g., lines of code, software programs, etc.) that are stored on a non-transitory computer readable recording medium (e.g., the memory 308) of the device 300 and are executed by the circuitry 302 (e.g., using the processor 304). Put differently, when it is stated that the circuitry 302 is configured to execute a specific function, the processor 304 of the circuitry 302 may be configured execute program code portions stored on the memory 308, wherein the stored program code portions correspond to the specific function. Furthermore, the functions and operations of the circuitry 302 may be a stand-alone software application or form a part of a software application that carries out additional tasks related to the circuitry 302. The described functions and operations may be considered a method that the corresponding device is configured to carry out, such as the method 100 discussed above in connection with FIG. 1. In addition, while the described functions and operations may be implemented in software, such functionality may as well be carried out via dedicated hardware or firmware, or some combination of one or more of hardware, firmware, and software. In the following, the function and operations of the device 300 is described.

The control circuitry 302 is configured to train a foundation model, using a first training dataset, to predict sensor data pertaining to a physical environment for a time instance of a sequence of time instances, based on sensor data for remaining time instances of said sequence of time instances. This may be performed e.g. by execution of a training function 310.

The control circuitry 302 is further configured to form the offline perception model by adding a task-specific layer to the trained foundation model. The task-specific layer is configured to perform a perception task of the offline perception model. This may be performed e.g. by execution of a forming function 312.

The control circuitry 302 is further configured to fine-tune the offline perception model, using a second training dataset, to perform the perception task. The second training dataset comprises training data annotated for said perception task. This may be performed e.g. by execution of a fine-tuning function 314.

It should be noted that the principles, features, aspects, and advantages of the method 100 as described above in connection with FIG. 1, are applicable also to the device 300 as described herein. In order to avoid undue repetition, reference is made to the above.

FIG. 4 is a schematic illustration of a device 400 for annotating data for use in subsequent training of an online perception model. The online perception model is configured to perform a perception task of a vehicle equipped with an automated driving system. The device 400 may be configured to perform the method 200 as described in connection with FIG. 2. The device 400 described in connection with FIG. 4 and the device 300 described above in connection with FIG. 3 may be incorporated in a single device, such as a common server.

The device 400 as described herein for the purpose of this patent application, refers to a computer system, or a networked device configured to provide various computing services, data storage, processing capabilities, or resources to clients or users over a communication network. In the present case, the wording “clients” refers to connected vehicles (such as the vehicle 400 described below) of a fleet of vehicles. Thus, the device 400 as described herein may refer to a general computing device. The device 400 may be a server such as a remote server, cloud server, central server, back-office server, fleet server, or back-end server. Even though the device 400 is herein illustrated as one device, the device 400 may be a distributed computing system, formed by a number of different devices.

The device 400 comprises control circuitry 402. The control circuitry 402 may physically comprise one single circuitry device. Alternatively, the control circuitry 402 may be distributed over several circuitry devices.

As shown in the example of FIG. 4, the device 400 may further comprise a transceiver 406 and a memory 408. The control circuitry 402 being communicatively connected to the transceiver 406 and the memory 408. The control circuitry 402 may comprise a data bus, and the control circuitry 402 may communicate with the transceiver 406 and/or the memory 408 via the data bus.

The control circuitry 402 may be configured to carry out overall control of functions and operations of the device 400. The control circuitry 402 may include a processor 404, such as a central processing unit (CPU), microcontroller, or microprocessor. The processor 404 may be configured to execute program code stored in the memory 408, in order to carry out functions and operations of the device 400. The control circuitry 402 is configured to perform the steps of the method 200 as described above in connection with FIG. 2. The steps may be implemented in one or more functions stored in the memory 408.

The transceiver 406 is configured to enable the device 400 to communicate with other entities, such as vehicles or other devices. The transceiver 406 may both transmit data from and receive data to the device 400.

The memory 408 may be a non-transitory computer-readable storage medium. The memory 408 may be one or more of a buffer, a flash memory, a hard drive, a removable media, a volatile memory, a non-volatile memory, a random access memory (RAM), or another suitable device. In a typical arrangement, the memory 408 may include a non-volatile memory for long-term data storage and a volatile memory that functions as system memory for the device 400. The memory 408 may exchange data with the circuitry 402 over the data bus. Accompanying control lines and an address bus between the memory 408 and the circuitry 402 also may be present.

Functions and operations of the device 400 may be implemented in the form of executable logic routines (e.g., lines of code, software programs, etc.) that are stored on a non-transitory computer readable recording medium (e.g., the memory 408) of the device 400 and are executed by the circuitry 402 (e.g., using the processor 404). Put differently, when it is stated that the circuitry 402 is configured to execute a specific function, the processor 404 of the circuitry 402 may be configured execute program code portions stored on the memory 408, wherein the stored program code portions correspond to the specific function. Furthermore, the functions and operations of the circuitry 402 may be a stand-alone software application or form a part of a software application that carries out additional tasks related to the circuitry 402. The described functions and operations may be considered a method that the corresponding device is configured to carry out, such as the method 200 discussed above in connection with FIG. 2. In addition, while the described functions and operations may be implemented in software, such functionality may as well be carried out via dedicated hardware or firmware, or some combination of one or more of hardware, firmware, and software. In the following, the function and operations of the device 400 is described.

The control circuitry 402 is configured to obtain sensor data pertaining to a physical environment. The sensor data may pertain to a surrounding physical environment of a vehicle having collected the sensor data. This may be performed e.g. by execution of an obtaining function 410.

The control circuitry 402 is further configured to determine a perception output by inputting the obtained sensor data into an offline perception model provided by the method 100 as described above in connection with FIG. 1. This may be performed e.g. by execution of a determining function 412.

The control circuitry 402 is further configured to store the sensor data together with the perception output as annotation data for subsequent training of the online perception model of the vehicle. This may be performed e.g. by execution of a storing function 414.

The control circuitry 402 may be further configured to transmit the perception data together with the perception output to the vehicle for subsequent training of the online perception model in the vehicle. This may be performed e.g. by execution of a transmitting function 416.

The control circuitry 402 may be further configured to train the online perception model on the stored sensor data together with the perception output, thereby generating an updated online perception model. This may be performed e.g. by execution of a training function 418.

It should be noted that the principles, features, aspects, and advantages of the method 200 as described above in connection with FIG. 2, are applicable also to the device 400 as described herein. In order to avoid undue repetition, reference is made to the above.

FIG. 5 is a schematic illustration of a vehicle 500 in accordance with some embodiments. The vehicle 500 is equipped with an Automated Driving System (ADS) 510. As used herein, a “vehicle” is any form of motorized transport. For example, the vehicle 500 may be any road vehicle such as a car (as illustrated herein), a motorcycle, a (cargo) truck, a bus, a smart bicycle, etc.

The vehicle 500 comprises a number of elements which can be commonly found in autonomous or semi-autonomous vehicles. It will be understood that the vehicle 500 can have any combination of the various elements shown in FIG. 5. Moreover, the vehicle 500 may comprise further elements than those shown in FIG. 5. While the various elements are herein shown as located inside the vehicle 500, one or more of the elements can be located externally to the vehicle 500. Further, even though the various elements are herein depicted in a certain arrangement, the various elements may also be implemented in different arrangements, as readily understood by the skilled person. It should be further noted that the various elements may be communicatively connected to each other in any suitable way. The vehicle 500 of FIG. 5 should be seen merely as an illustrative example, as the elements of the vehicle 500 can be realized in several different ways.

The vehicle 500 comprises a control system 502. The control system 502 is configured to carry out overall control of functions and operations of the vehicle 500. The control system 502 comprises control circuitry 504 and a memory 506. The control circuitry 502 may physically comprise one single circuitry device. Alternatively, the control circuitry 502 may be distributed over several circuitry devices. As an example, the control system 502 may share its control circuitry 504 with other parts of the vehicle. The control circuitry 502 may comprise one or more processors, such as a central processing unit (CPU), microcontroller, or microprocessor. The one or more processors may be configured to execute program code stored in the memory 506, in order to carry out functions and operations of the vehicle 500. The processor(s) may be or include any number of hardware components for conducting data or signal processing or for executing computer code stored in the memory 506. In some embodiments, the control circuitry 504, or some functions thereof, may be implemented on one or more so-called system-on-a-chips (SoC). As an example, the ADS 510 may be implemented on a SoC. The memory 506 optionally includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid-state memory devices; and optionally includes non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid-state storage devices. The memory 506 may include database components, object code components, script components, or any other type of information structure for supporting the various activities of the present description.

In the illustrated example, the memory 506 further stores map data 508. The map data 508 may for instance be used by the ADS 510 of the vehicle 500 in order to perform autonomous functions of the vehicle 500. The map data 508 may comprise high-definition (HD) map data. It is contemplated that the memory 508, even though illustrated as a separate element from the ADS 510, may be provided as an integral element of the ADS 510. In other words, according to some embodiments, any distributed or local memory device may be utilized in the realization of the present inventive concept. Similarly, the control circuitry 504 may be distributed e.g. such that one or more processors of the control circuitry 504 is provided as integral elements of the ADS 510 or any other system of the vehicle 500. In other words, according to an exemplary embodiment, any distributed or local control circuitry device may be utilized in the realization of the present inventive concept.

The vehicle 500 further comprises a sensor system 520. The sensor system 520 is configured to acquire sensory data about the vehicle itself, or of its surroundings. The sensor system 520 may for example comprise a Global Navigation Satellite System (GNSS) module 522 (such as a GPS) configured to collect geographical position data of the vehicle 500. The sensor system 520 may further comprise one or more sensors 524. The one or more sensor(s) 524 may be any type of on-board sensors, such as cameras, LIDARs and RADARs, ultrasonic sensors, gyroscopes, accelerometers, odometers etc. The one or more sensor(s) 524 may thus be used for collecting sensor data sequences pertaining to the physical surrounding environment of the vehicle 500 to be used as training data for the foundation model. Moreover, the one or more sensor(s) 524 can be used collect sensor data pertaining to the surrounding physical environment of the vehicle 500 to be used in fine-tuning of the offline perception model, and/or training of the online perception model, after being annotated. It should be appreciated that the sensor system 520 may also provide the possibility to acquire sensory data directly or via dedicated sensor control circuitry in the vehicle 500.

The vehicle 500 further comprises a communication system 526. The communication system 526 is configured to communicate with external units, such as other vehicles (i.e. via vehicle-to-vehicle (V2V) communication protocols), remote servers (e.g. cloud servers as the devices described above in connection with FIGS. 3 and 4), databases or other external devices, i.e. vehicle-to-infrastructure (V2I) or vehicle-to-everything (V2X) communication protocols. The communication system 526 may communicate using one or more communication technologies. The communication system 526 may comprise one or more antennas. Cellular communication technologies may be used for long-range communication such as to remote servers or cloud computing systems. In addition, if the cellular communication technology used has low latency, it may also be used for V2V, V21 or V2X communication. Examples of cellular radio technologies are GSM, GPRS, EDGE, LTE, 5G, 5G NR, and so on, also including future cellular solutions. However, in some solutions mid to short-range communication technologies may be used such as Wireless Local Area (LAN), e.g. IEEE 802.11 based solutions, for communicating with other vehicles in the vicinity of the vehicle 500 or with local infrastructure elements. ETSI is working on cellular standards for vehicle communication and for instance 5G is considered as a suitable solution due to the low latency and efficient handling of high bandwidths and communication channels.

The communication system 526 may further provide the possibility to send output to a remote location (e.g. remote server, operator or control center) by means of the one or more antennas. Moreover, the communication system 526 may be further configured to allow the various elements of the vehicle 500 to communicate with each other. As an example, the communication system may provide a local network setup, such as CAN bus, I2C, Ethernet, optical fibers, and so on. Local communication within the vehicle may also be of a wireless type with protocols such as Wi-Fi®, LoRa, Zigbee, Bluetooth, or similar mid/short range technologies.

The vehicle 500 further comprises a maneuvering system 520. The maneuvering system 528 is configured to control the maneuvering of the vehicle 500. The maneuvering system 528 comprises a steering module 530 configured to control the heading of the vehicle 500. The maneuvering system 528 further comprises a throttle module 532 configured to control actuation of the throttle of the vehicle 500. The maneuvering system 528 further comprises a braking module 534 configured to control actuation of the brakes of the vehicle 500. The various modules of the steering system 528 may receive manual input from a driver of the vehicle 500 (i.e. from a steering wheel, a gas pedal and a brake pedal respectively). However, the maneuvering system 528 may be communicatively connected to the ADS 510 of the vehicle, to receive instructions on how the various modules should act. Thus, the ADS 510 can control the maneuvering of the vehicle 500.

As stated above, the vehicle 500 comprises an ADS 510. The ADS 510 may be part of the control system 502 of the vehicle. The ADS 510 is configured to carry out the functions and operations of the autonomous functions of the vehicle 500. The ADS 510 can comprise a number of modules, where each module is tasked with different functions of the ADS 510.

The ADS 510 may comprise a localization module 512 or localization block/system. The localization module 512 is configured to determine and/or monitor a geographical position and heading of the vehicle 500, and may utilize data from the sensor system 520, such as data from the GNSS module 522. Alternatively, or in combination, the localization module 512 may utilize data from the one or more sensors 524. The localization system may alternatively be realized as a Real Time Kinematics (RTK) GPS in order to improve accuracy.

The ADS 510 may further comprise a perception module 514 or perception block/system. The perception module 514 may refer to any commonly known module and/or functionality, e.g. comprised in one or more electronic control modules and/or nodes of the vehicle 500, adapted and/or configured to interpret sensory data-relevant for driving of the vehicle 500—to identify e.g. obstacles, vehicle lanes, relevant signage, appropriate navigation paths etc. The perception module 514 may thus be adapted to rely on and obtain inputs from multiple data sources, such as automotive imaging, image processing, computer vision, and/or in-car networking, etc., in combination with sensory data e.g. from the sensor system 520. The online perception model for performing a perception task of the vehicle may be provided as part of the ADS 510, or more specifically as part of the perception module 514.

The localization module 512 and/or the perception module 514 may be communicatively connected to the sensor system 520 in order to receive sensor data from the sensor system 520. The localization module 512 and/or the perception module 514 may further transmit control instructions to the sensor system 520.

The ADS may further comprise a path planning module 516. The path planning module 516 is configured to determine a planned path of the vehicle 500 based on a perception and location of the vehicle as determined by the perception module 514 and the localization module 512 respectively. A planned path determined by the path planning module 516 may be sent to the maneuvering system 528 for execution.

The ADS may further comprise a decision and control module 518. The decision and control module 518 is configured to perform the control and make decisions of the ADS 510. For example, the decision and control module 518 may decide on whether the planned path determined by the path-planning module 516 should be executed or not. The decision and control module 518 may be further configured to detect any deviating behavior of the vehicle, such as deviations from the planned path, or expected trajectory of the path planning module 516. This includes both evasive maneuvers performed by the ADS 510 and by a driver of the vehicle.

It should be understood that parts of the described solution may be implemented either in the vehicle 500, in a system located external to the vehicle, or in a combination of internal and external to the vehicle; for instance, in a server in communication with the vehicle, a so called cloud solution. The different features and principles of the embodiments may be combined in other combinations than those described. Further, the elements of the vehicle 500 (i.e. the systems and modules) may be implemented in different combinations than those described herein.

FIG. 6 illustrates, by way of example, a distributed system 600 according to some embodiments. The system 600 should be seen as a non-limiting example of a realization of the herein disclosed aspects of the present inventive concept. For instance, the system 600 is configured to perform the method 100 as described above in connection with FIG. 1. The system 600 may be further configured to perform the method 200 as described above in connection with FIG. 2. Thus, any features or principles described above are applicable also to the system 600 as described herein and vice versa, unless otherwise stated.

The system 600 comprises a server 602 (or remote, cloud, central, back-office, fleet, or back-end server), referred to in the following as the remote server 602 or just server 602. The server 602 may be the device 300 as described in connection with FIG. 3, and/or the device 400 as described in connection with FIG. 4. In other words, the server 602 may be configured to perform the functions of the above described devices. Thus, the server 602 may be configured to perform the method 100 as described in connection with FIG. 1, and/or the method 200 described in connection with FIG. 2. As illustrated, the server 602 may be provided in the cloud, i.e. as a cloud-implemented server.

The system 600 further comprises one or more vehicles 604a-c, also referred to as a fleet of vehicles. The one or more vehicles 604a-c may be vehicles 500 as described above in connection with FIG. 5. Thus, the one or more vehicles 604a-c may be used for sensor data collection for use as training data. Moreover, the one or more vehicles 604a-c may be provided with the online perception model as described in the foregoing.

The one or more vehicles 604a-c are communicatively connected to the remote server 602 for transmitting and/or receiving data 606 between the vehicles and the server. The one or more vehicles 604a-c may be further communicatively connected to each other. The data 606 may be any kind of data, such as communication signals, or sensor data. The communication may be performed by any suitable wireless communication protocol. The wireless communication protocol may e.g. be long range communication protocols, such as cellular communication technologies (e.g. GSM, GPRS, EDGE, LTE, 5G, 5G NR, etc.) or short to mid-ranged communication protocols, such as Wireless Local Area (LAN) (e.g. IEEE 802.11) based solutions. The sever 602 comprises a suitable memory and control circuitry, for example, one or more processors or processing circuitry, as well as one or more other components such as a data interface and transceiver. The server 602 may also include software modules or other components, such that the control circuity can be configured to execute machine-readable instructions loaded from memory to implement the steps of the method to be performed.

The fleet illustrated in FIG. 6 comprises three vehicles, a first, second and third vehicle 604a-c, by way of example. The system 600 may however comprise any number of vehicles 604a-c. In the following, the system 600 will be described mainly with reference to the first vehicle 604a. It is to be understood that the principles apply to any vehicle of the fleet of vehicles.

In the following, an example of how the system 600 may perform the techniques according to some embodiments will be described. For further details regarding the different steps, reference is made to FIGS. 1 and 2 above to avoid undue repetition.

In a first scenario, the server 602 performs the process of providing the offline perception model for subsequent annotation of training data. In such case, the server 602 performs the functions of the device 300 as described above in connection with FIG. 3.

In a second scenario, the server 602 performs the process of annotating data for use in subsequent training of the online perception model. As explained above, the vehicle 604a may collect sensor data of a physical surrounding environment of the vehicle. The vehicle 604a may then transmit the collected sensor data to the server 602. Upon receiving the sensor data, the server may determine a perception output by inputting the obtained sensor data into the offline perception model. The server may then store the sensor data together with the perception output as annotation data for subsequent training of the online perception model. This process may be repeated for sensor data received from the fleet of vehicles until a sufficiently large dataset has been formed. The online perception model may then be trained (or re-trained) using the dataset. The online perception model may be trained at the edge, i.e. at the vehicle 604a. The sensor data together with the associated perception output may then be transmitted to the vehicle 604a. Alternatively, the online perception model may be trained at the server 602. An updated version of the online perception model after training may then be transmitted to the vehicles of the fleet of vehicles.

The above-described process of the system 600 is to be understood as a non-limiting example of the presently disclosed technology for improved understanding. Further variants are apparent from the present disclosure and readily realized by the person skilled in the art.

The present invention has been presented above with reference to specific embodiments. However, other embodiments than the above described are possible and within the scope of the invention. Different method steps than those described above, performing the methods by hardware or software, may be provided within the scope of the invention. Thus, according to an exemplary embodiment, there is provided a non-transitory computer-readable storage medium storing one or more programs configured to be executed by one or more processors of a vehicle control system, the one or more programs comprising instructions for performing the methods according to any one of the above-discussed embodiments. Alternatively, according to another exemplary embodiment a cloud computing system can be configured to perform any of the methods presented herein. The cloud computing system may comprise distributed cloud computing resources that jointly perform the methods presented herein under control of one or more computer program products.

It should be noted that any reference signs do not limit the scope of the claims, that the invention may be at least in part implemented by means of both hardware and software, and that the same item of hardware may represent several “means” or “units”.

COMPUTER IMPLEMENTED METHOD FOR PROVIDING A PERCEPTION MODEL FOR ANNOTATION OF TRAINING DATA

Information

Publication Number

Date Filed

Date Published

Inventors

Original Assignees

CPC

International Classifications

Abstract

Description

Claims

Priority Claims (1)