The present disclosure generally relates to the fields of computer technology and data processing technology, and also relates to the technical fields of autonomous driving, electronic map, deep learning, image processing, and the like.
Localization is a fundamental task in a self-driving system of a vehicle, and a localization model or localization system is a basic module in the self-driving system. Precise localization of a vehicle is not only an input required by a path planning module in the self-driving system, but can also be applied to simplify a scene interpretation and classification algorithm of an environment perception module. To exploit high definition maps (also referred to as HD maps) as priors for robust environment perception and safe motion planning, the localization system of a vehicle is typically required to reach centimeter-level accuracy.
The present disclosure provides a technical solution for vehicle localization, more specifically a method for vehicle localization, an apparatus for vehicle localization, an electronic device and a computer readable storage medium.
According to a first aspect of the present disclosure, there is provided a method for vehicle localization. The method comprises: obtaining an image descriptor map corresponding to a captured image of an external environment of a vehicle and a predicted pose of the vehicle when the captured image is captured, the image descriptor map comprising descriptors of points in the captured image. The method also comprises: obtaining a set of reference descriptors and a set of spatial coordinates corresponding to a set of keypoints in a reference image of the external environment, the reference image being pre-captured by a capturing device. The method also comprises: determining a plurality of sets of image descriptors corresponding to the set of spatial coordinates when the vehicle is in a plurality of candidate poses, respectively, the plurality of sets of image descriptors belonging to the image descriptor map, the plurality of candidate poses being obtained by offsetting the predicted pose. The method also comprises: determining a plurality of similarities between the plurality of sets of image descriptors and the set of reference descriptors. The method further comprises: updating the predicted pose based on the plurality of candidate poses and the plurality of similarities corresponding to the plurality of candidate poses.
According to a second aspect of the present disclosure, there is provided an electronic device. The electronic device comprises at least one processor and a memory communicatively connected to the at least one processor. The memory stores instructions executable by the at least one processor, and the instructions when executed by the at least one processor cause the at least one processor to: obtain an image descriptor map corresponding to a captured image of an external environment of a vehicle and a predicted pose of the vehicle when the captured image is captured, the image descriptor map comprising descriptors of points in the captured image. The instructions when executed by the at least one processor also cause the at least one processor to: obtain a set of reference descriptors and a set of spatial coordinates corresponding to a set of keypoints in a reference image of the external environment, the reference image being pre-captured by a capturing device. The instructions when executed by the at least one processor also cause the at least one processor to: determine a plurality of sets of image descriptors corresponding to the set of spatial coordinates when the vehicle is in a plurality of candidate poses, respectively, the plurality of sets of image descriptors belonging to the image descriptor map, the plurality of candidate poses being obtained by offsetting the predicted pose. The instructions when executed by the at least one processor also cause the at least one processor to: determine a plurality of similarities between the plurality of sets of image descriptors and the set of reference descriptors. The instructions when executed by the at least one processor further cause the at least one processor to: update the predicted pose based on the plurality of candidate poses and the plurality of similarities corresponding to the plurality of candidate poses.
According to a third aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions. The computer instructions cause a computer to perform the method of the first aspect of the present disclosure.
Embodiments of the present disclosure can improve localization accuracy and robustness of a vehicle visual localization algorithm, thereby boosting performance of a vehicle localization system.
It should be appreciated that this Summary is not intended to identify key features or essential features of the embodiments of the present disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will be made apparent by the following description.
Through reading the following detailed description with reference to the accompanying drawings, the above and other objectives, features, and advantages of embodiments of the present disclosure will become more comprehensible. Several embodiments of the present disclosure will be illustrated in the drawings by way of example, without limitation. Therefore, it should be appreciated that the drawings are provided for better understanding of the technical solution of the present application, without constituting limitations to the present application.
Throughout the drawings, the same or similar reference signs refer to the same or similar elements.
Example embodiments of the present application will now be described in connection with the drawings in the following, including various details of those embodiments for better understanding, which should be regarded as exemplary only. Thus, it would be appreciated by those skilled in the art that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Moreover, descriptions of well-known functionalities and structures are omitted in the following description for clarity and brevity.
As aforementioned, localization is a fundamental task in a self-driving system of a vehicle. To exploit high definition maps as priors for robust environment perception and safe motion planning, the localization system of an unmanned vehicle may be required to reach centimeter-level accuracy. Despite many decades of research, building a long-term, precise and reliable localization system using low-cost sensors, such as automotive and consumer-grade global positioning system (GPS)/inertial measurement unit (IMU) and cameras, is still an open-ended and challenging problem.
Traditional solutions for visual localization of a vehicle are mainly divided into two categories. One category of traditional solutions accomplishes vehicle localization by matching local keypoints in a high definition map with corresponding keypoints in a real-time (also referred to as “online”) image captured by the vehicle. Generally speaking, this category of traditional solutions leverages a conventional approach or machine learning-based approach for extracting keypoints from a high definition map to build a sparse keypoint map. When performing online localization of a vehicle, a pose of the vehicle is computed by determining a “three-dimensional and two-dimensional (3D-2D)” correspondence relation between keypoints in the sparse keypoint map and keypoints in the online image captured by the vehicle.
However, compared to Light Detection and Ranging (LiDAR) sensors, cameras of a vehicle are passive sensors, meaning that they are more susceptible to changes in the appearance of an object, which may be caused by varying lighting conditions or changes in viewpoint. Accordingly, in this category of traditional solutions, handcrafted point features suffer from unreliable feature matching under large lighting or viewpoint changes, eventually leading to localization failure of the vehicle. Even when using the most recent deep features, local 3D-2D matching is prone to fail under strong visual changes in practice due to the lack of repeatability in the keypoint detector, thereby impacting the final vehicle localization result. In addition, repeated structures may exist in some natural environments that a vehicle may encounter, and such repeated structures often prevent one-to-one keypoint matching from achieving good results.
The other category of traditional solutions achieves vehicle localization using human-made objects, where specific appearances and semantic meanings in an environment or scene are encoded, such as lane markings, road signs, road curbs, poles, and the like. Those features are typically considered relatively stable and can be easily recognized as they are built by humans for specific purposes and also used by human drivers to aid their driving behavior. Based on such concepts, in this category of traditional solutions, various human-made elements, such as lane markings, poles, and the like, are used for localization. Specifically, types of the artificial elements for localization may be predetermined by humans and stored in a high definition map. When performing online localization of a vehicle, the artificial elements in the high definition map may be compared with the artificial elements detected by the vehicle in real time to obtain a pose of the vehicle.
Nonetheless, this category of traditional solutions is only suitable for environments with rich human-made features and easily fails in scenarios that lack such features, for example, road sections with worn-out markings under poor maintenance, rural streets with no lane markings, or other open spaces without clear signs. In addition, these carefully selected semantic signs or markings typically cover only a small area in an image. This leads to an obvious design paradox in this category of traditional solutions: it suffers from the frequent absence of distinctive human-made features for vehicle localization, but at the same time deliberately abandons rich and important information in an image by relying solely on human-made features. Moreover, since the high definition map elements for vehicle localization are defined manually, considerable manual labor for identification and marking is required. Further, some elements (for example, a curved tree trunk at a roadside) are hard to define for vehicle localization. Furthermore, labor-intensive adjustments are required for matching the high definition map elements for vehicle localization with online elements.
In view of the foregoing research and analysis, embodiments of the present disclosure propose a technical solution for vehicle localization, and specifically provide a method, electronic device and computer storage medium for vehicle localization to at least partly solve the above technical problems and other potential technical problems in the traditional solutions.
As used herein, vehicle localization refers to determining a position and a posture of a vehicle, which are collectively referred to as a pose. In the technical solution for vehicle localization provided by the present disclosure, a computing device of a vehicle (or another computing device) may obtain an image (also referred to herein as a captured image) of an environment external to the vehicle, captured by an imaging device of the vehicle, and a predicted pose of the vehicle when the vehicle is capturing the image. The accuracy of the predicted pose may be less than a predetermined threshold and thus cannot be applied to applications (for example, autonomous driving) requiring high accuracy localization. Then, based on the captured image and a reference image of the external environment, the computing device may update the predicted pose of the vehicle to ultimately obtain a predicted pose with accuracy greater than the predetermined threshold, for use in applications requiring high accuracy localization.
In order to update the predicted pose of the vehicle, on one hand, the computing device may process the captured image of the external environment to obtain an image descriptor map of the captured image. In the context of the present disclosure, a descriptor map of an image may refer to a map formed by descriptors corresponding to respective image points in the image. In other words, in a position corresponding to a certain image point (for example, a pixel) of the image, a descriptor of the image point is recorded in the descriptor map.
On the other hand, the computing device may obtain a reference image of the external environment captured by a capturing device (for example, a capturing vehicle for a high definition map, or the like). During the pre-capturing of the external environment performed by the capturing device, spatial coordinate information associated with the reference image may also be collected. As such, the computing device may obtain spatial coordinates corresponding to image points in the reference image, such as three-dimensional spatial coordinates. In this event, the computing device may select a set of keypoints for aiding vehicle localization from all image points in the reference image, and may further obtain a set of reference descriptors and a set of spatial coordinates corresponding to the set of keypoints. The set of reference descriptors includes descriptors corresponding to respective keypoints in the set of keypoints, and the set of spatial coordinates includes spatial coordinates corresponding to respective keypoints in the set of keypoints.
As indicated, the predicted pose of the vehicle obtained by the computing device is not a real pose of the vehicle, but approximates the real pose of the vehicle to a certain extent. In other words, the real pose of the vehicle may be considered as “adjacent to” the predicted pose of the vehicle. In light of this idea, in embodiments of the present disclosure, the computing device may obtain a plurality of “candidate poses” for the real pose of the vehicle by offsetting the predicted pose. Then, the computing device can determine the updated predicted pose of the vehicle based on the plurality of candidate poses.
To this end, for a certain candidate pose of the plurality of candidate poses, the computing device may assume that it is the real pose of the vehicle. Under this assumption, in the image descriptor map of the captured image, the computing device may determine a set of image descriptors corresponding to the set of spatial coordinates. Since there are a plurality of candidate poses, the computing device can determine a plurality of sets of image descriptors respectively corresponding to the plurality of candidate poses in the same manner. Thereafter, the computing device may determine a plurality of similarities between the plurality of sets of image descriptors and the set of reference descriptors, and update the predicted pose based on the plurality of candidate poses and the respective plurality of similarities.
The technical solution of the present disclosure provides a novel visual localization framework, which can be used, for example, for autonomous driving of a vehicle, and which relies neither on artificial elements in a map (for example, a high definition map) for localization nor on selection of local keypoints in the map, thereby avoiding the inherent deficiencies and problems of the two above-mentioned categories of traditional solutions. In addition, the technical solution of the present disclosure can significantly improve the accuracy and robustness of vehicle localization, for example, yielding centimeter-level precision under various challenging lighting conditions. Some example embodiments of the present disclosure will be described below with reference to the drawings.
As shown, the example road in
In the context of the present disclosure, the external environment 105 of the vehicle 110 may include or contain all objects, targets or elements outside the vehicle 110. For example, the external environment 105 may include the road boundary lines 102 and 104, the lane markings 106 and 108, the trees 112, the traffic light 114, and the like, as shown in
In some embodiments, the vehicle 110 may capture a captured image 130 of the external environment 105 via an imaging device (not shown) and provide it to a computing device 120 of the vehicle 110. It should be noted that the imaging device as used herein may be an imaging device fixedly mounted on the vehicle 110, an imaging device handheld by a passenger within the vehicle 110, an imaging device outside the vehicle 110, or the like. Embodiments of the present disclosure do not restrict the specific positional relation between the imaging device and the vehicle 110. For convenience of description, the imaging device for capturing the external environment 105 of the vehicle 110 will be referred to as the imaging device of the vehicle 110 in the following. However, it should be appreciated that embodiments of the present disclosure are equally applicable to a situation where the imaging device is not fixedly mounted on the vehicle 110.
In general, the imaging device of the vehicle 110 may be any device having an imaging function. Such an imaging device includes, but is not limited to, a camera, a video camera, a camcorder, a driving recorder, a surveillance probe, a movable device having an image capturing or video recording function, and the like. For instance, in the example of
In addition to obtaining the captured image 130, the computing device 120 may obtain a predicted pose 150 of the vehicle 110 when capturing the captured image 130. As used herein, the pose of the vehicle 110 may refer to a position where the vehicle 110 is located and a posture that the vehicle 110 has. In some embodiments, the pose of the vehicle 110 may be represented by six degrees of freedom (DoF). For example, the position of the vehicle 110 can be represented by a horizontal coordinate (x coordinate), a longitudinal coordinate (y coordinate) and a vertical coordinate (z coordinate) of the vehicle 110 in a predetermined reference coordinate system, and the posture of the vehicle 110 may be represented by a pitch angle relative to a horizontal axis (x axis), a roll angle relative to a longitudinal axis (y axis) and a yaw angle relative to a vertical axis (z axis). It should be appreciated that representing the pose of the vehicle 110 by a horizontal coordinate, a longitudinal coordinate, a vertical coordinate, a pitch angle, a yaw angle and a roll angle is provided only as an example. Embodiments of the present disclosure are equally applicable to a situation where the pose of the vehicle 110 is expressed or described in any other manner. For example, the position of the vehicle 110 can also be represented by latitude, longitude and altitude coordinates, and the pitch angle, the yaw angle and the roll angle may be described in other equivalent manners.
In some circumstances, the measurement of some of the six degrees of freedom may be implemented through known, well-developed approaches. For example, the vertical coordinate, the pitch angle and the roll angle of the vehicle 110 on the road may be estimated or determined in a simpler way in practice. For example, a consumer-grade inertial measurement unit is able to precisely estimate the roll angle and the pitch angle, owing to the non-negligible gravity. As another example, after the vehicle 110 is successfully located horizontally, the altitude of the vehicle 110 may be estimated or determined by reading a Digital Elevation Model (DEM) map. Therefore, in some implementations, embodiments of the present disclosure may focus only on the determination of three degrees of freedom (namely, the horizontal axis, the longitudinal axis and the yaw angle axis) in the pose of the vehicle 110. However, it should be appreciated that embodiments of the present disclosure may be equally applicable to the determination of all six degrees of freedom in the pose of the vehicle 110, or to the determination of more or fewer degrees of freedom in the pose of the vehicle 110.
In the context of the present disclosure, the pose of the vehicle 110 and the pose of the imaging device of the vehicle 110 may be regarded as having a fixed conversion relation, that is, the two can be deduced from each other based on the conversion relation. The specific conversion relation may be dependent on how the imaging device is provided on or in the vehicle 110. As a result, although the pose of the imaging device determines in which direction and angle the captured image 130 is captured and impacts the image features in the captured image 130, the captured image 130 may be used to determine the pose of the vehicle 110 due to the fixed conversion relation. Accordingly, in the context of the present disclosure, the pose of the vehicle 110 and the pose of the imaging device are not substantially distinguished from each other unless otherwise indicated, and the two are considered to be consistent in the sense of the embodiments of the present disclosure. For example, when the vehicle 110 is in different poses, the objects presented in the captured image 130 of the external environment 105 captured by the vehicle 110 are varied. For example, the positions and angles of the respective objects in the captured image 130 may be changed. As such, the image features of the captured image 130 may embody the pose of the vehicle 110.
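To make the fixed conversion relation concrete, the following is a minimal sketch, assuming a planar (x, y, yaw) pose and a hypothetical mounting offset of the imaging device; the 1.5 m offset, the matrix convention and all names here are illustrative assumptions rather than values from the disclosure.

```python
# Minimal sketch (not from the disclosure): relating a planar (x, y, yaw)
# vehicle pose to the pose of its imaging device through one fixed conversion.
import numpy as np

def pose_to_matrix(x, y, yaw):
    """3x3 homogeneous transform for a planar pose (yaw in radians)."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, x],
                     [s,  c, y],
                     [0.0, 0.0, 1.0]])

# Assumed fixed mounting of the imaging device: 1.5 m ahead of the vehicle
# origin along the longitudinal (y) axis, with no extra rotation.
T_camera_in_vehicle = pose_to_matrix(0.0, 1.5, 0.0)

def camera_pose_from_vehicle_pose(T_vehicle_in_world):
    # Because the conversion is fixed, either pose can be deduced from the other.
    return T_vehicle_in_world @ T_camera_in_vehicle

T_vehicle_in_world = pose_to_matrix(10.0, 10.0, np.deg2rad(10.0))
T_camera_in_world = camera_pose_from_vehicle_pose(T_vehicle_in_world)
```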
In some embodiments, the accuracy of the predicted pose 150 of the vehicle 110 obtained by the computing device 120 may be less than a predetermined threshold and thus cannot be used in applications requiring high localization accuracy, for example, autonomous driving of the vehicle 110, and the like. Therefore, the computing device 120 may need to update the predicted pose 150, so as to obtain the updated predicted pose 180 with accuracy greater than the predetermined threshold for use in applications requiring high localization accuracy, for example, autonomous driving of the vehicle 110, and the like. In some embodiments, the predicted pose 150 of the vehicle 110 may be determined roughly in other less accurate localization manners. Then, the rough predicted pose may be updated to an accurate predicted pose. In other embodiments, the predicted pose 150 of the vehicle 110 may be obtained through the technical solution of the present disclosure. In other words, the technical solution for vehicle localization of the present disclosure can be used iteratively to update the predicted pose of the vehicle 110.
In order to update the predicted pose 150, the computing device 120 may obtain a reference image 140 of the external environment 105, in addition to obtaining the captured image 130. The reference image 140 of the external environment 105 may be pre-captured by a capturing device. For example, in some embodiments, the capturing device may be a capturing vehicle for generating a high definition map. In other embodiments, the capturing device may be any other surveying and mapping device for collecting data for a road environment. It should be noted that when the capturing device is capturing the reference image 140 of the external environment 105, other measurement information associated with the reference image 140 may be collected as well, for example, spatial coordinate information corresponding to image points in the reference image 140.
In the context of the present disclosure, a high definition map typically refers to an electronic map having high accuracy data. For example, the high accuracy used herein, on one hand, means that the high definition electronic map has high absolute coordinate accuracy. The absolute coordinate accuracy refers to the accuracy of a certain target on the map relative to a corresponding real object in the external world. On the other hand, road traffic information elements contained in the high definition map are more abundant and finer. As another example, the absolute accuracy of the high definition map is generally at the sub-meter level, namely, it has accuracy within 1 meter, and the relative accuracy in the horizontal direction (for example, the relative position accuracy between lanes or between a lane and a lane marking) is often much higher. In addition, in some embodiments, the high definition map includes not only high accuracy coordinates but also a precise road shape, a slope and curvature of each lane, heading, elevation, and roll data.
In some embodiments, the high definition map can depict not only a road but also a number of lanes on the road, so as to truly reflect the actual road condition.
As shown in
It should be noted that, although described with the example environment 100 including the vehicle 110 in
In some embodiments, the computing device 120 may include any device that can implement a computing function and/or a control function, which may be any type of fixed computing device, movable computing device or portable computing device, including but not limited to, a dedicated computer, a general-purpose computer, a desktop computer, a laptop computer, a notebook computer, a netbook computer, a tablet computer, a multimedia computer, a mobile phone, a general-purpose processor, a microprocessor, a microcontroller, or a state machine. The computing device 120 may be implemented as an individual computing device or a combination of computing devices, for example, a combination of a Digital Signal Processor (DSP) and a microcontroller, a plurality of microprocessors, a combination of one or more microprocessors and a DSP core, or any other similar configurations.
It should be noted that, although the computing device 120 is depicted as being arranged inside the vehicle 110 in
In addition, it should be appreciated that
As described above, prior to performing the example process 200, the computing device 120 may obtain a captured image 130 of the external environment 105 captured by an imaging device (not shown) of the vehicle 110. Then, at block 210 of the example process 200, the computing device 120 may obtain an image descriptor map 160 corresponding to the captured image 130. In some embodiments, the image descriptor map 160 may include descriptors of respective image points in the captured image 130. For example, in the image descriptor map 160, a position corresponding to an image point in the captured image 130 records a descriptor of the image point. In some embodiments, a descriptor of an image point is extracted from an image block where the image point is located (for example, an image block with a center at the image point), and the descriptor may be represented by a multidimensional vector. For example, descriptors of respective pixels in the captured image 130 may be represented using 8-dimensional vectors to form the image descriptor map 160. A pixel of the captured image 130 is only an example of an image point of the captured image 130. In other embodiments, an image point may also refer to an image unit larger or smaller than a pixel. In addition, it is only an example to represent a descriptor of an image point using an 8-dimensional vector, and embodiments of the present disclosure are equally applicable to a descriptor represented using a vector in any number of dimensions.
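Purely as an illustration of the data layout (the image resolution and the random placeholder values below are assumptions), an image descriptor map can be held as an H × W × D array, with the 8-dimensional descriptor of an image point read out by indexing its pixel position:

```python
# Illustration of the data layout only: an image descriptor map held as an
# H x W x D array, one D-dimensional descriptor per image point.
import numpy as np

H, W, D = 480, 640, 8                                   # assumed resolution, D = 8
image_descriptor_map = np.random.rand(H, W, D).astype(np.float32)  # placeholder

row, col = 200, 320                                     # an image point (pixel)
descriptor = image_descriptor_map[row, col]             # its 8-dimensional descriptor
assert descriptor.shape == (D,)
```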
The computing device 120 may obtain the image descriptor map 160 corresponding to the captured image 130 in any appropriate manner. For example, for a certain image point in the captured image 130, the computing device 120 may extract a descriptor of the image point from an image block where the image point is located, according to a predetermined feature extraction algorithm. Likewise, the computing device 120 may extract a descriptor of each image point in the captured image 130 to obtain the image descriptor map 160. In other embodiments, the computing device 120 may input the captured image 130 into a trained machine learning model and then gain the image descriptor map 160 at the output of the machine learning model. For ease of description, the machine learning model for extracting a descriptor map from an image may also be referred to as a feature extraction model or a Local Feature Embedding (LFE) module as used herein. Since the feature extraction model is trained using training data, the image descriptor map 160 extracted by the trained feature extraction model can be more suitable for locating the vehicle 110. For example, the feature extraction model may be trained based on a difference between an estimated pose of the vehicle 110 obtained ultimately through the example process 200 and a real pose of the vehicle 110, such that the image descriptor map 160 generated using the trained feature extraction model can improve the localization accuracy of the vehicle 110. Reference will be made to
In
As aforementioned, the feature extraction model 310 may be trained based on a difference between the estimated pose and the real pose of the vehicle 110. More specifically, the feature extraction model 310 may be trained based on a set of training images of the external environment 105 and a set of training descriptor maps obtained from the set of training images. The set of training descriptor maps may be used in the example process 200 to generate the ultimately determined updated predicted pose (namely, an estimated pose) of the vehicle 110, and the feature extraction model 310 producing the set of training descriptor maps therefore can be optimized based on the difference between the estimated pose and the real pose of the vehicle 110. The feature extraction model 310 trained in this manner can improve the localization accuracy of the vehicle 110.
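The disclosure does not prescribe a particular loss. Purely as a hedged sketch of the idea that training is driven by the difference between the estimated pose and the real pose, an L1-type pose loss in PyTorch could look as follows; the loss choice, the optimizer and the model names are assumptions:

```python
# Hedged sketch of a training signal driven by the pose difference.
import torch

def pose_difference_loss(estimated_pose, real_pose):
    """estimated_pose, real_pose: tensors of shape (3,) holding (x, y, yaw)."""
    return torch.nn.functional.l1_loss(estimated_pose, real_pose)

# Hypothetical end-to-end optimization of the feature extraction model jointly
# with the other modules of the localization system:
# params = list(feature_extraction_model.parameters()) + list(pose_updating_model.parameters())
# optimizer = torch.optim.Adam(params, lr=1e-3)
# loss = pose_difference_loss(estimated_pose, real_pose)
# loss.backward(); optimizer.step()
```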
In some embodiments, since in the example process 200, the feature extraction model 310 may be used to process an image of the external environment 105 captured from the vehicle 110 or an image of the external environment 105 captured by a capturing device, the set of training images may be captured by the imaging device of the vehicle 110, pre-captured by a capturing device, or the combination of both. For example, in some embodiments, the feature extraction model 310 may be a part of the localization system for locating the vehicle 110, and the localization system may include machine learning models for other functions. In those embodiments, the computing device 120 may implement, based on the difference between the estimated pose of the vehicle 110 determined by the localization system and the real pose of the vehicle 110, an end-to-end training of the feature extraction model 310 together with other machine learning models. Such embodiments will be detailed hereinafter with reference to
In general, the feature extraction model 310 may be implemented using a convolutional neural network, for example, a deep learning-based convolutional neural network of any appropriate architecture. In some embodiments, considering that the feature extraction model 310 is used for visually locating the vehicle 110, the feature extraction model 310 may be designed to extract good local feature descriptors from the image of the external environment 105, so as to achieve accurate and robust visual localization of the vehicle 110. More specifically, the descriptors extracted by the feature extraction model 310 from the image of the external environment 105 may have robustness. That is, despite appearance changes caused by varying lighting conditions, or changes in viewpoint, season or the like, feature matching can still be achieved to complete visual localization of the vehicle 110. To this end, in some embodiments, the feature extraction model 310 may be implemented using a convolutional neural network based on a feature pyramid network. Reference will be made to
Referring back to
In some embodiments, the predicted pose 150 may be an updated predicted pose 150 obtained after the computing device 120 previously updated a predicted pose of the vehicle 110 using the example process 200. Then, the computing device 120 may use the example process 200 again to further update the predicted pose 150. In other words, the computing device 120 may iteratively use the example process 200 to update the predicted pose of the vehicle 110, so as to gradually approach the real pose of the vehicle 110 from the rough predicted pose of the vehicle 110 and thus obtain a more accurate predicted pose of the vehicle 110 with localization accuracy greater than the predetermined threshold.
In other embodiments, the predicted pose 150 may also be obtained by the computing device 120 using other measurement means. For example, the computing device 120 may obtain an incremental motion estimation of the vehicle 110 from an IMU sensor and then accumulate it onto the localization result obtained for the frame preceding the captured image 130, so as to estimate the predicted pose 150 when the vehicle 110 is capturing the captured image 130. As another example, at the initial stage of the example process 200, the computing device 120 may obtain the predicted pose 150 of the vehicle 110 using a GPS positioning technology (outdoors), an image retrieval technology or a Wi-Fi fingerprint identification technology (indoors), and the like. In some other embodiments, the computing device 120 may obtain the predicted pose 150 of the vehicle 110 when capturing the captured image 130 in any other appropriate manner.
As described above with reference to
In the case that the capturing device pre-captures a set of reference images of the external environment 105 (for example, a video or a series of reference images), the computing device 120 may need to determine the reference image 140 corresponding to the captured image 130 from the set of reference images, namely, to look for the reference image 140 in the set of reference images. For example, the computing device 120 may directly compare the captured image 130 with each reference image in the set of reference images and then select the reference image closest to the captured image 130 as the reference image 140. As another example, when capturing the set of reference images, the capturing device can record a pose of the capturing device when capturing each reference image. In this event, the computing device 120 may select, from the set of reference images, a reference image whose capturing pose is closest to the predicted pose 150 of the vehicle 110 as the reference image 140. Selecting the reference image closest to the captured image 130, or the reference image closest to the captured image 130 in capturing pose, is provided only as an example. In other embodiments, the computing device 120 may select a reference image relatively close to the captured image 130 in image content or in capturing pose as the reference image 140, for example, a reference image whose difference from the captured image 130 is less than a predetermined threshold, and so on. More generally, the computing device 120 may obtain the reference image 140 corresponding to the captured image 130 in any other appropriate manner.
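A small sketch of the pose-based retrieval described above, assuming each reference image stores a planar (x, y, yaw) capturing pose; the data layout and the purely positional distance are assumptions (the yaw difference could also be taken into account):

```python
# Sketch: choosing the reference image whose recorded capturing pose is
# closest to the predicted pose of the vehicle.
import numpy as np

def select_reference_image(predicted_pose, reference_poses):
    """predicted_pose: (x, y, yaw); reference_poses: iterable of (x, y, yaw).
    Returns the index of the reference image with the closest capturing pose."""
    xy = np.asarray([p[:2] for p in reference_poses], dtype=np.float64)
    dists = np.linalg.norm(xy - np.asarray(predicted_pose[:2]), axis=1)
    return int(np.argmin(dists))
```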
After acquiring the reference image 140 of the external environment 105, the computing device 120 may obtain the set of spatial coordinates 145 and the set of reference descriptors 147 corresponding to the set of keypoints 143 in the reference image 140. In some embodiments, the computing device 120 or another entity (for example, another computing device) may have generated and stored a set of keypoints, a set of reference descriptors and a set of spatial coordinates in association for each reference image in the set of reference images of the external environment 105. In the context of the present disclosure, image data including such data or information may also be referred to as a localization map. In this situation, using the reference image 140 as an index, the computing device 120 may retrieve in the localization map the set of keypoints 143, the set of spatial coordinates 145 and the set of reference descriptors 147 corresponding to the reference image 140. Reference will be made to
In other embodiments, the computing device 120 may not have a pre-stored localization map, or may be unable to obtain the localization map. In such a circumstance, the computing device 120 may first extract the set of keypoints 143 from the reference image 140 and then obtain the set of spatial coordinates 145 and the set of reference descriptors 147 associated with the set of keypoints 143. More specifically, the computing device 120 may employ various appropriate keypoint selection algorithms for selecting the set of keypoints 143 from a set of points in the reference image 140. In some embodiments, in order to avoid the impact of uneven distribution of the set of the keypoints 143 in the reference image 140 on the subsequent localization effect of the vehicle 110, the computing device 120 may select, based on a Farthest Point Sampling (FPS) algorithm, the set of keypoints 143 from the set of points in the reference image 140 to achieve uniform sampling of the set of points in the reference image 140.
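A minimal farthest point sampling sketch over candidate 2D image coordinates is given below, only to illustrate how uniform coverage can be obtained; it is not claimed to be the disclosure's exact implementation:

```python
# Minimal farthest point sampling (FPS) sketch for uniform keypoint selection.
import numpy as np

def farthest_point_sampling(points, num_keypoints):
    """points: (N, 2) array of candidate image coordinates; returns keypoints."""
    points = np.asarray(points, dtype=np.float64)
    n = len(points)
    selected = [0]                                  # start from an arbitrary point
    min_dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(1, min(num_keypoints, n)):
        idx = int(np.argmax(min_dist))              # farthest from the selected set
        selected.append(idx)
        min_dist = np.minimum(min_dist, np.linalg.norm(points - points[idx], axis=1))
    return points[selected]
```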
To obtain the set of reference descriptors 147 associated with the set of keypoints 143, the computing device 120 may first determine a reference descriptor map of the reference image 140 and then obtain, from the reference descriptor map, a plurality of reference descriptors (namely, the set of reference descriptors 147) corresponding to respective keypoints of the set of keypoints 143. For this purpose, the computing device 120 may generate the reference descriptor map from the reference image 140 in the same manner as that for generating the image descriptor map 160 from the captured image 130, as described above. In some embodiments, in a similar way as the embodiments described in
In addition to the set of reference descriptors 147, the computing device 120 may obtain the set of spatial coordinates 145 corresponding to the set of keypoints 143 of the reference image 140, for example, three-dimensional coordinates of three-dimensional spatial points corresponding to respective keypoints in the set of keypoints 143. It is worth noting that, since the reference image 140 of the external environment 105 is pre-captured by the capturing device, the capturing device may obtain three-dimensional coordinate information (for example, a point cloud) of various objects in the external environment 105 simultaneously when capturing the reference image 140. Accordingly, the computing device 120 can determine, based on projection, three-dimensional reconstruction, or the like, a spatial coordinate corresponding to each point in the reference image 140. For the set of keypoints 143 in the reference image 140, the computing device 120 may determine a plurality of spatial coordinates (namely, the set of spatial coordinates 145) corresponding to respective keypoints in the set of keypoints 143. Reference will be made to
As an example,
As discussed above, the predicted pose 150 of the vehicle 110 may be a relatively inaccurate pose with accuracy less than a predetermined threshold. However, considering that the predicted pose 150 is gained through measurement or the example process 200, there may not be a significant difference between the predicted pose 150 and the real pose of the vehicle 110. In other words, the predicted pose 150 of the vehicle 110 may be regarded as “neighboring” the real pose. More specifically, in the embodiments of the present disclosure, if the pose of the vehicle 110 is regarded as a point in a multidimensional (for example, six-dimensional) space, it would be considered that the real pose of the vehicle 110 is neighboring the predicted pose 150 in the six-dimensional space. In a simplified case, assuming that the vertical coordinate, pitch angle and roll angle in the pose of the vehicle 110 are known, it should be considered that the pose of the vehicle 110 is a point in a three-dimensional space (including an x coordinate, a y coordinate and a yaw angle), and the real pose of the vehicle 110 is neighboring the predicted pose 150 in the three-dimensional space. As a result, assuming that the predicted pose 150 is a point in a multidimensional space, the computing device 120 may select a plurality of points neighboring the point and then update the predicted pose 150 based on the plurality of points, in order to obtain an updated predicted pose 180 much closer to the real pose of the vehicle 110.
Referring back to
More specifically, in the case that the predicted pose 150 includes three degrees of freedom, namely the horizontal axis, the longitudinal axis and the yaw angle axis, the computing device 120 may take the horizontal coordinate, the longitudinal coordinate and the yaw angle of the predicted pose 150 as a center and offset from the center in the three dimensions of the horizontal axis, the longitudinal axis and the yaw angle axis, using respective predetermined offset units and within respective predetermined maximum offset ranges, so as to determine the plurality of candidate poses 155. For example, assume that the predicted pose 150 of the vehicle 110 has a horizontal coordinate of 10 m, a longitudinal coordinate of 10 m, and a yaw angle of 10°, which can be represented as (10 m, 10 m, 10°). Then, one of the plurality of candidate poses 155 obtained by offsetting the predicted pose 150 may be (10.5 m, 10 m, 10°), indicating that the candidate pose is offset by 0.5 m along the horizontal axis relative to the predicted pose 150 and remains unchanged in the longitudinal coordinate and the yaw angle. In this way, the computing device 120 may perform offsetting uniformly in the vicinity of the predicted pose 150 in a fixed manner to obtain the plurality of candidate poses 155, thereby increasing the probability that the plurality of candidate poses 155 cover the real pose of the vehicle 110. In addition, when the example process 200 is used iteratively to determine the pose of the vehicle 110 with accuracy meeting the requirement, obtaining the candidate poses 155 by offsetting uniformly in the vicinity of the predicted pose 150 can accelerate convergence of the localization results of the vehicle 110 to that pose.
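The following sketch enumerates candidate poses by offsetting (x, y, yaw) uniformly around the predicted pose; the specific offset units (0.5 m, 0.5 m, 1°) and maximum offset ranges (1 m, 1 m, 2°) are illustrative values only, not values prescribed by the disclosure:

```python
# Sketch of generating candidate poses by uniform offsets along the horizontal
# axis, the longitudinal axis and the yaw angle axis around the predicted pose.
import itertools
import numpy as np

def generate_candidate_poses(predicted_pose,
                             step=(0.5, 0.5, 1.0),
                             max_offset=(1.0, 1.0, 2.0)):
    """predicted_pose: (x [m], y [m], yaw [deg]); returns an (N, 3) array."""
    axes = [np.arange(-m, m + 1e-9, s) for s, m in zip(step, max_offset)]
    candidates = [np.add(predicted_pose, offset)
                  for offset in itertools.product(*axes)]
    return np.asarray(candidates)

# For a predicted pose of (10 m, 10 m, 10 degrees), one of the resulting
# candidates is (10.5, 10.0, 10.0), offset by 0.5 m along the horizontal axis.
candidates = generate_candidate_poses((10.0, 10.0, 10.0))
```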
Moreover, it should be noted that the predetermined offset units and the predetermined maximum offset ranges used herein may be determined based on the specific system environment and accuracy requirements. For example, if the computing device 120 iteratively updates the predicted pose 150 using the example process 200, the predetermined offset units and the predetermined maximum offset ranges may be reduced gradually over the iterations. This is because the predicted pose of the vehicle 110 becomes more precise with an increasing number of iterations and accordingly gets closer to the real pose of the vehicle 110. In some embodiments, in order to better represent and process data associated with the plurality of candidate poses 155 (for example, probabilities of the plurality of candidate poses 155 being the real pose, and the like), the plurality of candidate poses 155 may be represented in the form of three-dimensional cubes centered at the predicted pose 150. Reference will be made to
Likewise, as another example, the small cube 155-N representing the Nth candidate pose 155-N is offset by a predetermined maximum offset from the small cube 150 in a negative direction of the horizontal axis, offset by a predetermined maximum offset from the small cube 150 in a positive direction of the longitudinal axis, and offset by a predetermined maximum offset from the small cube 150 in a negative direction of the axis of the yaw angle. In this way, the plurality of candidate poses 155 obtained through offsetting the predicted pose 150 may be represented in the form of small cubes included in the cube 600. In some embodiments, cost volumes of the candidate poses 155 represented in a similar form may be processed advantageously through a 3D Convolutional Neural Network (3D CNN). Reference will be made to
Referring back to
Using these projection parameters or data, the computing device 120 may project the first spatial coordinate 145-1 in the set of spatial coordinates 145 onto the captured image 130, so as to determine a projection point 710 of the first spatial coordinate 145-1. Thereafter, in the image descriptor map 160 of the captured image 130, the computing device 120 may determine an image descriptor 715 corresponding to the projection point 710 to obtain an image descriptor of the set of image descriptors 165-1. Likewise, for other spatial coordinates in the set of spatial coordinates 145, the computing device 120 may determine image descriptors corresponding to these spatial coordinates and thus obtain the set of image descriptors 165-1. It should be pointed out that, although it is described herein that the computing device 120 first projects the set of spatial coordinates 145 onto the captured image 130 and then determines the corresponding set of image descriptors 165-1 from the image descriptor map 160, such a manner is merely an example, without suggesting any limitation to the scope of the present disclosure. In other embodiments, the computing device 120 may project the set of spatial coordinates 145 directly onto the image descriptor map 160 to determine the set of image descriptors 165-1 corresponding to the set of spatial coordinates 145.
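A hedged sketch of this projection step follows: a spatial coordinate is transformed into the camera frame implied by one candidate pose and projected through assumed pinhole intrinsics K, and the descriptor at the resulting (generally non-integer) projection point is read from the descriptor map by bilinear interpolation, one possible way to handle the "between image points" case discussed below. K, the transform and the interpolation choice are assumptions, not values from the disclosure:

```python
import numpy as np

def project_point(p_world, T_camera_from_world, K):
    """p_world: (3,) spatial coordinate; T_camera_from_world: 4x4 world-to-camera
    transform for one candidate pose; K: 3x3 pinhole intrinsics. Returns (u, v)."""
    p_cam = T_camera_from_world[:3, :3] @ p_world + T_camera_from_world[:3, 3]
    uvw = K @ p_cam
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

def sample_descriptor(descriptor_map, u, v):
    """Bilinear interpolation of an H x W x D descriptor map at pixel (u, v).
    Boundary handling is kept minimal for brevity."""
    h, w, _ = descriptor_map.shape
    u0 = int(np.clip(np.floor(u), 0, w - 1)); u1 = min(u0 + 1, w - 1)
    v0 = int(np.clip(np.floor(v), 0, h - 1)); v1 = min(v0 + 1, h - 1)
    du, dv = u - u0, v - v0
    top = (1 - du) * descriptor_map[v0, u0] + du * descriptor_map[v0, u1]
    bottom = (1 - du) * descriptor_map[v1, u0] + du * descriptor_map[v1, u1]
    return (1 - dv) * top + dv * bottom
```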
In addition, it is worth noting that, in some embodiments, the projection point 710 of the first spatial coordinate 145-1 in the captured image 130 may correspond exactly to an image point in the captured image 130, and the image descriptor 715 corresponding to the first spatial coordinate 145-1 thus can be determined directly from the image descriptor map 160. Nonetheless, in other embodiments, the projection point 710 of the first spatial coordinate 145-1 in the captured image 130 may not correspond directly to a certain image point in the captured image 130, but falls among a plurality of image points in the captured image 130. In those embodiments, the computing device 120 may determine the image descriptor 715 corresponding to the projection point 710 based on a plurality of descriptors in the image descriptor map 160 corresponding to the plurality of image points around the projection point 710. Reference will be made to
Referring back to
Generally, the computing device 120 may determine the first similarity 170-1 between the first set of image descriptors 165-1 and the set of reference descriptors 147 in any appropriate manner. For example, the computing device 120 may compute the first similarity 170-1 as a difference between a mean value of the first set of image descriptors 165-1 and a mean value of the set of reference descriptors 147. As another example, the computing device 120 may compute the first similarity 170-1 based on some descriptors of the first set of image descriptors 165-1 and corresponding descriptors of the set of reference descriptors 147. As a further example, the computing device 120 may determine a plurality of differences between corresponding descriptors in the first set of image descriptors 165-1 and the set of reference descriptors 147 and then determine the first similarity 170-1 based on the plurality of differences. In the following, the first set of image descriptors 165-1 will be taken as an example to illustrate determining the first similarity 170-1 in such a manner.
As aforementioned, the first set of image descriptors 165-1 includes a plurality of image descriptors which correspond to respective spatial coordinates in the set of spatial coordinates 145. On the other hand, the set of spatial coordinates 145 and the set of reference descriptors 147 are also in a correspondence relation. In other words, the first set of image descriptors 165-1 and the set of reference descriptors 147 both correspond to the set of spatial coordinates 145. For example, referring to
More specifically, for the first set of image descriptors 165-1 among the plurality of sets of image descriptors 165, the computing device 120 may determine a plurality of differences between respective image descriptors in the first set of image descriptors 165-1 and corresponding reference descriptors in the set of reference descriptors 147. For example, in the case that the image descriptors and the reference descriptors are represented in the form of an n-dimensional vector, for each pair of corresponding “image descriptor-reference descriptor,” the computing device 120 may calculate the difference between the two descriptors as an L2 distance between the two paired descriptors. Using the L2 distance between descriptors to represent a difference between descriptors is merely an example, without suggesting any limitation to the scope of the present disclosure. In other embodiments, the computing device 120 may also utilize any other appropriate metric to represent a difference between two descriptors.
Subsequent to determining the plurality of differences associated with the plurality of descriptor pairs between the first set of image descriptors 165-1 and the set of reference descriptors 147, the computing device 120 may determine, based on the plurality of differences, a similarity between the first set of image descriptors 165-1 and the set of reference descriptors 147, namely the first similarity 170-1 of the plurality of similarities 170. For example, in a straightforward manner, the computing device 120 may sum up the plurality of differences to obtain a total difference of the plurality of descriptor pairs for representing the first similarity 170-1. In other embodiments, the computing device 120 may obtain the first similarity 170-1 from the above-mentioned plurality of differences in any other appropriate manner as long as the plurality of differences are taken into consideration for obtaining the first similarity 170-1. For example, the computing device 120 may perform averaging, weighted averaging, or weighted summing on the plurality of differences, and average or sum up some differences falling within a predetermined range, or the like.
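A sketch of one such similarity, assuming the k-th image descriptor and the k-th reference descriptor correspond to the same spatial coordinate; negating the summed L2 distances is an assumption so that a larger value means a closer match:

```python
# Sketch: similarity between a set of image descriptors and the set of
# reference descriptors as the negated sum of per-pair L2 distances.
import numpy as np

def set_similarity(image_descriptors, reference_descriptors):
    """Both inputs: (K, D) arrays whose k-th rows form one descriptor pair."""
    dists = np.linalg.norm(image_descriptors - reference_descriptors, axis=1)
    return -float(np.sum(dists))
```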
At block 250, after obtaining the plurality of similarities 170 corresponding to the plurality of candidate poses 155, the computing device 120 may update the predicted pose 150 based on the plurality of candidate poses 155 and the plurality of similarities 170, in order to obtain the updated predicted pose 180. It should be appreciated that the plurality of similarities 170 actually embody the degrees to which the plurality of candidate poses 155 approach the real pose of the vehicle 110 when capturing the captured image 130. For example, the first similarity 170-1 corresponding to the first candidate pose 155-1 may reflect the degree to which the first candidate pose 155-1 approaches the real pose of the vehicle 110. In other words, the plurality of similarities 170 may be considered as embodying the probabilities that the plurality of candidate poses 155 are the real pose of the vehicle 110, respectively. As such, the computing device 120 may update the predicted pose 150 based on the plurality of candidate poses 155 and the respective plurality of similarities 170, namely, may determine a new predicted pose as a more accurate updated predicted pose 180.
As an example, the computing device 120 may determine, from the plurality of similarities 170, respective probabilities of the plurality of candidate poses 155 being the real pose of the vehicle 110. For example, the computing device 120 may normalize the plurality of similarities 170 such that the sum of the plurality of normalized similarities 170 is equal to 1. The computing device 120 may then take the plurality of normalized similarities 170 as the respective probabilities of the plurality of candidate poses 155. It should be appreciated that the normalization of the plurality of similarities 170 herein is merely an example, without suggesting any limitation to the scope of the present disclosure. In other embodiments, the computing device 120 may apply other appropriate computing manners (for example, a weighted normalization of the plurality of similarities 170, or the like) to obtain, from the plurality of similarities 170, the respective probabilities of the plurality of candidate poses 155 being the real pose.
After determining the respective probabilities of the plurality of candidate poses 155 being the real pose, the computing device 120 may determine, from the plurality of candidate poses 155 and their respective probabilities, an expected pose of the vehicle 110 as the updated predicted pose 180. As such, all the candidate poses 155 contribute to the ultimately updated predicted pose 180 according to their respective probabilities, so as to enhance the accuracy of the updated predicted pose 180. It is to be appreciated that determining the expected pose as the updated predicted pose 180 is merely an example, without suggesting any limitation to the scope of the present disclosure. In other embodiments, the computing device 120 may determine the updated predicted pose 180 in other appropriate manners. For example, the computing device 120 may directly determine the candidate pose having the greatest probability as the updated predicted pose 180, or determine the updated predicted pose 180 based on the several candidate poses having the highest probabilities, and so on. In addition, it should be pointed out that, if the plurality of candidate poses 155 are represented in the form of offsets relative to the predicted pose 150 in the example process 200, the computing device 120 may obtain the updated predicted pose 180 by offsetting the predicted pose 150 by an offset determined according to the example process 200.
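A sketch of this update step; the softmax below is only one way to normalize the similarities into probabilities that sum to one, and is an assumption rather than the disclosure's prescribed normalization:

```python
# Sketch: candidate-pose similarities -> probabilities -> expected pose.
import numpy as np

def update_predicted_pose(candidate_poses, similarities):
    """candidate_poses: (N, 3) array of (x, y, yaw); similarities: (N,) array."""
    s = np.asarray(similarities, dtype=np.float64)
    probs = np.exp(s - s.max())
    probs /= probs.sum()                        # probability of each candidate pose
    return probs @ np.asarray(candidate_poses)  # expected (x, y, yaw)
```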
In the embodiments as described above, at blocks 230 through 250 of the example process 200, the computing device 120 updates the predicted pose 150 by processing, step by step, the set of spatial coordinates 145, the set of reference descriptors 147, the predicted pose 150, the image descriptor map 160, and other data, so as to gain the updated predicted pose 180. In other embodiments, the computing device 120 may complete the processing operations at blocks 230 through 250 in a modular way (namely, a processing module for performing a pose updating function may be built to process the data as mentioned above), thereby obtaining the updated predicted pose 180. In the context of the present disclosure, the processing module may also be referred to as a pose updating model or Feature Matching (FM) module. In some embodiments, the pose updating model may be implemented using a machine learning model based on deep learning. Reference will be made below to
In some embodiments, the pose updating model 810 is a deep learning model trained using training data, and a more accurate updated predicted pose 180 thus can be determined using the trained pose updating model 810. As an example, the pose updating model 810 may be trained based on a difference between an estimated pose of the vehicle 110 finally obtained through the example process 200 and the real pose of the vehicle 110, such that the updated predicted pose 180 generated by the trained pose updating model 810 can be closer to the real pose of the vehicle 110. For instance, in some embodiments, the pose updating model 810 may be a part of a localization system for locating the vehicle 110, and the localization system may include machine learning models for other functions, such as the feature extraction model 310 as described above. In those embodiments, the computing device 120 can implement, based on the difference between the estimated pose of the vehicle 110 determined by the localization system and the real pose of the vehicle 110, an end-to-end training of the feature extraction model 310 together with the pose updating model 810. Reference will be made to
As mentioned in the description with reference to
In the example of
In some embodiments, the convolutional layers 902, 904, 906, 912 and 918 may be two-dimensional (2D) convolutional layers while the residual blocks 908, 910, 914, 916, 920 and 922 may each include two 3×3 convolutional layers. Therefore, the encoder 950 may include 17 convolutional layers in total. Moreover, in some embodiments, the convolutional layer 906 may have 64 channels, 3 kernels and a stride size of 2 while the convolutional layers 912 and 918 may each have 128 channels, 3 kernels and a stride size of 2. The residual blocks 908 and 910 may each have 64 channels and 3 kernels while the residual blocks 914, 916, 920 and 922 may each have 128 channels and 3 kernels.
In the decoder 960, following the convolutional layer 924, two upsampling layers 926 and 928 are applied to generate or hallucinate higher resolution features from coarser but semantically stronger features. Through the above-mentioned lateral connection layers 930 and 932, the features of the same resolution from the encoder 950 may be merged to enhance those features in the decoder 960. The outputs of the decoder 960 may be feature maps at different resolutions relative to the original image (namely, the captured image 130). In some embodiments, the convolutional layer 924 may be a 2D convolutional layer, which may have 32 channels, a kernel size of 1 and a stride size of 1. In some embodiments, the lateral connection layers 930 and 932 may each be a 2D convolutional layer, each of which may have 32 channels, a kernel size of 1 and a stride size of 1.
The outputs of the decoder 960 may be fed into a network head 934 which may be responsible for extracting descriptors and outputting the image descriptor map 160. In some embodiments, the network head 934 may include two convolutional layers, such as 2D convolutional layers. The preceding convolutional layer may have 32 channels, a kernel size of 1 and a stride size of 1, while the subsequent convolutional layer may have 8 channels, a kernel size of 1 and a stride size of 1. In some embodiments, feature descriptors in the image descriptor map 160 output via the network head 934 may be represented as D-dimensional vectors. These feature descriptors can still achieve robust matching under severe changes in object appearance caused by varying lighting conditions or viewpoint conditions. For example, the image descriptor map 160 may be represented as a three-dimensional (3D) tensor in R^{(H/s) × (W/s) × D}, where H and W represent the resolutions in height and width of the input captured image 130, s ∈ {2, 4, 8} is a scale factor, D = 8 is the descriptor dimension size in the image descriptor map 160, and R denotes the set of real numbers.
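A compact Keras sketch of a feature-pyramid-style descriptor network of this kind is given below; it is assembled from the layer sizes quoted above and simplified (for example, only one output resolution is kept), so it should be read as an illustration rather than a reproduction of the encoder 950, the decoder 960 and the network head 934.

    import tensorflow as tf
    from tensorflow.keras import layers

    def residual_block(x, channels):
        # Two 3x3 convolutions with a skip connection, as in a standard residual block.
        y = layers.Conv2D(channels, 3, padding="same", activation="relu")(x)
        y = layers.Conv2D(channels, 3, padding="same")(y)
        return layers.ReLU()(layers.Add()([x, y]))

    def build_descriptor_network(height, width, descriptor_dim=8):
        image = layers.Input((height, width, 3))

        # Encoder: strided convolutions followed by residual blocks (simplified).
        c1 = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(image)
        c1 = residual_block(residual_block(c1, 64), 64)             # 1/2 resolution
        c2 = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(c1)
        c2 = residual_block(residual_block(c2, 128), 128)           # 1/4 resolution
        c3 = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(c2)
        c3 = residual_block(residual_block(c3, 128), 128)           # 1/8 resolution

        # Decoder: 1x1 convolution, upsampling and lateral connections.
        p3 = layers.Conv2D(32, 1)(c3)
        p2 = layers.Add()([layers.UpSampling2D()(p3), layers.Conv2D(32, 1)(c2)])
        p1 = layers.Add()([layers.UpSampling2D()(p2), layers.Conv2D(32, 1)(c1)])

        # Network head: two 1x1 convolutions producing D-dimensional descriptors.
        head = layers.Conv2D(32, 1, activation="relu")(p1)
        descriptor_map = layers.Conv2D(descriptor_dim, 1)(head)     # (H/2) x (W/2) x D

        return tf.keras.Model(image, descriptor_map)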
With the example feature pyramid network architecture as depicted in
As mentioned in the description with reference to block 220 of the example process 200, in some embodiments, the computing device 120 or another entity (for example, another computing device) may generate and store a set of keypoints, a set of reference descriptors and a set of spatial coordinates associated with each reference image in the set of reference images of the external environment 105. As used herein, a map that is associated with the external environment 105 and that includes data or content such as the sets of keypoints, sets of reference descriptors and sets of spatial coordinates of a plurality of reference images may be referred to as a localization map. Therefore, in those embodiments, for the set of keypoints 143 of the reference image 140, the computing device 120 may obtain the corresponding set of spatial coordinates 145 and the corresponding set of reference descriptors 147 from the localization map of the external environment 105. References will be made to
In some embodiments, each reference image in the set of reference images 1120 may include a set of keypoints, which may be stored in the localization map 1130. Moreover, the localization map 1130 further stores a set of reference descriptors and a set of spatial coordinates associated with the set of keypoints. For example, the localization map 1130 may store the set of keypoints 143 in the reference image 140, as well as the set of reference descriptors 147 and the set of spatial coordinates 145 associated with the set of keypoints 143, and the set of spatial coordinates 145 may be determined by projecting the laser radar point cloud 510 onto the reference image 140.
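Purely for illustration, one entry of such a localization map could be organized as in the following Python sketch; the field names and the flat list layout are assumptions rather than the storage format used by the localization map 1130.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class LocalizationMapEntry:
        """Data stored for one reference image in the localization map."""
        reference_pose: np.ndarray         # pose at which the reference image was captured
        keypoints: np.ndarray              # (K, 2) pixel coordinates of the keypoints
        reference_descriptors: np.ndarray  # (K, D) descriptors associated with the keypoints
        spatial_coordinates: np.ndarray    # (K, 3) 3D coordinates, e.g. from projecting a
                                           # laser radar point cloud onto the reference image

    # The localization map itself can then simply be a collection of such entries.
    localization_map: list[LocalizationMapEntry] = []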
Referring to
At block 1020 of
As aforementioned, in the localization map 1130, the set of keypoints 143 of the reference image 140, the set of spatial coordinates 145 and the set of reference descriptors 147 are stored in association. Therefore, at block 1030 of
Through the example process 1000, if the localization map 1130 is available for locating the vehicle 110, the computing device 120 can directly retrieve, based on the reference image 140 corresponding to the captured image 130, the set of spatial coordinates 145 and the set of reference descriptors 147. There is no need to utilize the feature extraction model 310 to generate the set of reference descriptors 147, or to adopt a three-dimensional reconstruction method or the like to obtain the set of spatial coordinates 145. In this way, the computing load and overhead of the computing device 120 of the vehicle 110 can be reduced significantly, and the computing device 120 can spend much less time on locating the vehicle 110.
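Continuing the illustrative sketch above, the retrieval step could then amount to a simple lookup such as the following, where the reference image whose capture pose is nearest to the predicted pose is selected and its stored data are returned directly; the nearest-pose criterion is an assumption used only to show that no descriptor extraction or three-dimensional reconstruction is needed at this stage.

    import numpy as np

    def retrieve_reference_data(localization_map, predicted_pose):
        """Pick the reference entry closest to the predicted pose and return its data."""
        distances = [np.linalg.norm(entry.reference_pose[:2] - predicted_pose[:2])
                     for entry in localization_map]
        entry = localization_map[int(np.argmin(distances))]
        return entry.spatial_coordinates, entry.reference_descriptors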
As mentioned in the description with reference to
It is worth noting that, in some embodiments, the localization map 1130 may be generated and stored by the computing device 120 of the vehicle 110 based on various data captured by the capturing vehicle 1110, so as to visually locate the vehicle 110 based on the captured image 130. In other embodiments, the localization map 1130 may be generated and stored by a computing device other than the computing device 120, or may be stored in another device. In this event, the computing device 120 may obtain the localization map 1130 from the device storing the localization map 1130 for visually locating the vehicle 110 based on the captured image 130.
Referring back to
Referring back to
Referring back to
Through the example operation process 1200, the computing device 120 or another computing device can generate the localization map 1130 associated with the external environment 105 of the vehicle 110 efficiently and in a centralized manner, such that when locating the vehicle 110 based on the captured image 130, the computing device 120 can retrieve related data and information for locating the vehicle 110 using the localization map 1130 as an input. As such, the computing loads and overhead of the computing device 120 for locating the vehicle 110 may be reduced greatly, and the computing device 120 may spend significantly less time on locating the vehicle 110. In addition, since the example operation process 1200 generates the localization map 1130 based on the feature extraction model 310 and the keypoint sampling module 1220, the localization map 1130 can be optimized by optimizing the feature extraction model 310 and the keypoint sampling module 1220 to improve the localization accuracy of the vehicle 110.
As mentioned in the description with reference to
At block 1310, the computing device 120 may assume that the vehicle 110 is in the first candidate pose 155-1 among the plurality of candidate poses 155. Based on the first candidate pose 155-1, the computing device 120 may project the set of spatial coordinates 145 onto the captured image 130, such that the computing device 120 can determine a set of projection points of the set of spatial coordinates 145, namely the projection points respectively corresponding to the spatial coordinates in the set of spatial coordinates 145. For example, referring to
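The projection of the set of spatial coordinates under a candidate pose can be pictured with a standard pinhole-camera sketch such as the one below; the intrinsic matrix K and the world-to-camera transform derived from the candidate pose are assumed inputs, and the snippet is only illustrative of block 1310.

    import numpy as np

    def project_points(spatial_coords, camera_from_world, K):
        """Project 3D spatial coordinates onto the image plane.

        spatial_coords:    (N, 3) world coordinates of the keypoints.
        camera_from_world: (4, 4) rigid transform derived from a candidate pose.
        K:                 (3, 3) camera intrinsic matrix.
        Returns (N, 2) sub-pixel projection points.
        """
        homogeneous = np.hstack([spatial_coords, np.ones((len(spatial_coords), 1))])
        points_cam = (camera_from_world @ homogeneous.T)[:3]   # 3 x N, camera frame
        pixels = K @ points_cam                                 # perspective projection
        return (pixels[:2] / pixels[2]).T                       # divide by depth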
Referring back to
Referring back to
Referring back to
Through the example process 1300, even though there is no descriptor in the image descriptor map 160 directly corresponding to the projection point 710 of the spatial coordinate 145-1 in the captured image 130, the computing device 120 may reasonably determine the descriptor 715 for the projection point 710. The computing device 120 may in turn reasonably determine the set of image descriptors 165-1 corresponding to the first candidate pose 155-1. Further, the computing device 120 may reasonably determine the plurality of sets of image descriptors 165 corresponding to the plurality of candidate poses 155. In this way, the ultimate accuracy of the pose of the vehicle 110 determined based on the plurality of candidate poses 155 can be improved.
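One common way to derive the descriptor of a sub-pixel projection point from the descriptors of its neighboring points is bilinear interpolation, sketched below; the present disclosure does not prescribe this particular weighting, so the snippet should be read as an assumed example.

    import numpy as np

    def descriptor_at(descriptor_map, u, v):
        """Bilinearly interpolate a descriptor at sub-pixel location (u, v).

        descriptor_map: (H, W, D) image descriptor map.
        (u, v):         column and row of the projection point
                        (assumed to lie inside the image).
        """
        x0, y0 = int(np.floor(u)), int(np.floor(v))
        x1, y1 = x0 + 1, y0 + 1
        wx, wy = u - x0, v - y0
        # Weighted combination of the four neighboring descriptors.
        return ((1 - wx) * (1 - wy) * descriptor_map[y0, x0]
                + wx * (1 - wy) * descriptor_map[y0, x1]
                + (1 - wx) * wy * descriptor_map[y1, x0]
                + wx * wy * descriptor_map[y1, x1])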
As mentioned in the description with reference to
For example, referring to
Then, the plurality of cost volumes 1408 may be input into a trained Three-Dimensional Convolutional Neural Network (3D CNN) 1410 for regularization, thereby obtaining a plurality of regularized cost volumes 1412. For example, after the cost volume 1408-1 is processed by the 3D CNN 1410, a regularized cost volume 1412-1 can be obtained. In some embodiments, the 3D CNN 1410 may include three convolutional layers 1410-1, 1410-2 and 1410-3. The convolutional layers 1410-1 and 1410-2 may each have 8 channels, a kernel size of 1 and a stride size of 1, while the convolutional layer 1410-3 may have 1 channel, a kernel size of 1 and a stride size of 1. It should be appreciated that the specific numerical values related to the 3D CNN 1410 described herein are merely an example, without suggesting any limitation to the scope of the present disclosure. In other embodiments, the 3D CNN 1410 may be of any appropriate structure.
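Using the layer sizes quoted above, a minimal Keras version of such a regularization network could look as follows; treating the keypoint index as the batch dimension and the particular grid size of the cost volumes are assumptions made only for illustration.

    import tensorflow as tf
    from tensorflow.keras import layers

    # Regularization network applied to each cost volume: three 3D convolutions
    # with 8, 8 and 1 output channels (kernel size 1, stride 1), as quoted above.
    regularizer = tf.keras.Sequential([
        layers.Conv3D(8, 1, activation="relu"),
        layers.Conv3D(8, 1, activation="relu"),
        layers.Conv3D(1, 1),
    ])

    # Hypothetical input: one cost volume per keypoint over the candidate-pose grid,
    # shaped (num_keypoints, nx, ny, nyaw, channels); the keypoint index is treated
    # as the batch dimension so that the same 3D CNN regularizes every volume.
    cost_volumes = tf.random.uniform((128, 11, 11, 11, 8))
    regularized = regularizer(cost_volumes)            # (128, 11, 11, 11, 1)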
Next, the computing device 120 may input the plurality of regularized cost volumes 1412 into a first dimension reduction and summation unit 1414 for dimension reduction (also referred to as marginalization) along the keypoint dimension, so as to obtain a similarity cube 1416. For example, the first dimension reduction and summation unit 1414 may add up the data recorded in corresponding small cubes of the plurality of regularized cost volumes 1412 to obtain the similarity cube 1416. It is to be appreciated that directly summing up the plurality of regularized cost volumes 1412 is only an example, and the first dimension reduction and summation unit 1414 may obtain the similarity cube 1416 in any other appropriate manner. For example, the first dimension reduction and summation unit 1414 may perform averaging, weighted summing, weighted averaging, or the like, on the data recorded in corresponding small cubes of the plurality of regularized cost volumes 1412. In some embodiments, the first dimension reduction and summation unit 1414 may be implemented using the “reduce_sum” function in the deep learning system “TensorFlow.” In other embodiments, the first dimension reduction and summation unit 1414 may be implemented using other similar functions in the TensorFlow system or other deep learning system.
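Continuing the same illustrative shapes as above, the marginalization over the keypoint dimension could then be written with the "reduce_sum" function as follows.

    import tensorflow as tf

    # regularized: (num_keypoints, nx, ny, nyaw, 1) regularized cost volumes from above.
    # Summing over the keypoint axis (after squeezing the channel axis) yields one
    # similarity cube over the candidate-pose grid.
    similarity_cube = tf.reduce_sum(tf.squeeze(regularized, axis=-1), axis=0)  # (nx, ny, nyaw)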
The way of representation of the plurality of candidate poses 155 in the similarity cube 1416 may be identical to that as depicted in
Thereafter, the computing device 120 may input the similarity cube 1416 into a normalization unit 1418 for normalization, so as to obtain a probability distribution cube 1420. In some embodiments, the normalization unit 1418 may be implemented using the “softmax” function in the deep learning system “TensorFlow.” In other embodiments, the normalization unit 1418 may be implemented using other similar functions in the TensorFlow system or other deep learning system. The way of representation of the plurality of candidate poses 155 in the probability distribution cube 1420 may be identical to that as depicted in
The computing device 120 may then input the probability distribution cube 1420 into a second dimension reduction and summation unit 1422 to obtain the updated predicted pose 180. For example, the second dimension reduction and summation unit 1422 may compute, based on the plurality of candidate poses 155 and a plurality of probabilities corresponding thereto, an expected pose of the vehicle 110 as the updated predicted pose 180. It is to be appreciated that directly computing an expected pose based on the probabilities of the plurality of candidate poses 155 is only an example, and the second dimension reduction and summation unit 1422 may obtain the updated predicted pose 180 in other appropriate manners. For example, the probabilities of the plurality of candidate poses 155 can be weighted and then an expected pose can be computed by the second dimension reduction and summation unit 1422. In some embodiments, the second dimension reduction and summation unit 1422 may be implemented using the “reduce_sum” function in the deep learning system “TensorFlow.” In other embodiments, the second dimension reduction and summation unit 1422 may be implemented using other similar functions in the TensorFlow system or other deep learning system.
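The normalization and the final marginalization over the candidate poses can likewise be expressed in a few lines of TensorFlow, as in the illustrative continuation below; the offset grids and their ranges are hypothetical placeholders for the candidate pose offsets.

    import tensorflow as tf

    # Normalize the similarity cube into a probability distribution over the candidate poses.
    probability_cube = tf.reshape(
        tf.nn.softmax(tf.reshape(similarity_cube, [-1])), tf.shape(similarity_cube))

    # Hypothetical candidate offsets along x, y and yaw (grid resolution is assumed).
    dx = tf.linspace(-0.5, 0.5, 11)
    dy = tf.linspace(-0.5, 0.5, 11)
    dyaw = tf.linspace(-2.0, 2.0, 11)
    grid_x, grid_y, grid_yaw = tf.meshgrid(dx, dy, dyaw, indexing="ij")

    # Expected offset: probability-weighted sum over the whole candidate-pose grid.
    expected_offset = tf.stack([
        tf.reduce_sum(probability_cube * grid_x),
        tf.reduce_sum(probability_cube * grid_y),
        tf.reduce_sum(probability_cube * grid_yaw),
    ])
    # Applying expected_offset to the predicted pose yields the updated predicted pose.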
With the pose updating model 810 as depicted in
As mentioned in the description with reference to
As shown in
Subsequent to obtaining the set of keypoints 143, the computing device 120 may determine the set of spatial coordinates 145 and the set of reference descriptors 147 corresponding to the set of keypoints 143 in the localization map 1130. Thereafter, the computing device 120 may input the set of spatial coordinates 145 and the set of reference descriptors 147 into the pose updating model 810. On the other hand, the computing device 120 may input the captured image 130 into the feature extraction model 310 to obtain the image descriptor map 160. Then, the computing device 120 may also input the image descriptor map 160 into the pose updating model 810. Furthermore, the computing device 120 may input the predicted pose 150 of the vehicle 110 into the pose updating model 810. Based on the set of spatial coordinates 145, the set of reference descriptors 147, the predicted pose 150 and the image descriptor map 160, the pose updating model 810 may output the updated predicted pose 180.
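Pulling these pieces together, the inference flow just described can be summarized in a short, purely illustrative sketch; the function names and the way data are passed to the models are assumptions.

    def localize(captured_image, predicted_pose, localization_map,
                 feature_extraction_model, pose_updating_model):
        # Retrieve the stored spatial coordinates and reference descriptors from the
        # localization map (see the retrieval sketch earlier in this description).
        spatial_coords, reference_descriptors = retrieve_reference_data(
            localization_map, predicted_pose)
        # Extract the image descriptor map from the captured image.
        descriptor_map = feature_extraction_model(captured_image[None])
        # The pose updating model consumes all inputs and outputs the updated pose.
        return pose_updating_model(
            (spatial_coords, reference_descriptors, descriptor_map, predicted_pose))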
It can be seen that the localization system built from the feature extraction model 310 and the pose updating model 810 provides a novel visual localization framework. In some embodiments, based on the localization system, an end-to-end Deep Neural Network (DNN) may be trained to extract machine learning-based feature descriptors, select keypoints from a localization map, perform feature matching between the selected keypoints and images captured by the vehicle 110 in real time, and infer the real pose of the vehicle 110 through a differentiable cost volume. Compared to the traditional solutions, the architecture of the localization system enables joint training of the various machine learning models or networks in the localization system by backpropagation and optimization towards the eventual goal of minimizing the absolute localization error. Furthermore, the localization system bypasses, in an efficient way, the repeatability crisis of keypoint detectors in the traditional solutions.
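As a non-limiting sketch of such joint training by backpropagation, the TensorFlow snippet below minimizes the difference between the estimated pose and the real pose across the whole localization system; the model name, the batch layout and the L1 form of the pose loss are assumptions made only for illustration.

    import tensorflow as tf

    # `localization_system` is a hypothetical tf.keras.Model chaining the feature
    # extraction model and the pose updating model; `batch` supplies a captured
    # image, reference data, a predicted pose and the ground-truth (real) pose.
    optimizer = tf.keras.optimizers.Adam(1e-4)

    def train_step(localization_system, batch):
        with tf.GradientTape() as tape:
            estimated_pose = localization_system(
                (batch["image"], batch["reference_descriptors"],
                 batch["spatial_coords"], batch["predicted_pose"]),
                training=True)
            # Loss: difference between the estimated pose and the real pose
            # (an L1 distance is assumed here for illustration).
            loss = tf.reduce_mean(tf.abs(estimated_pose - batch["real_pose"]))
        grads = tape.gradient(loss, localization_system.trainable_variables)
        optimizer.apply_gradients(zip(grads, localization_system.trainable_variables))
        return loss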
In addition, by utilizing an end-to-end deep neural network to select keypoints, the localization system can find abundant features that are salient, distinctive and robust in a scene. The capability of fully exploiting these robust features enables the localization system to achieve centimeter-level localization accuracy, which is comparable to the latest LiDAR-based localization approaches and substantially better than other vision-based approaches in terms of both robustness and accuracy. The strong performance makes the localization system ready to be integrated into a self-driving car, providing precise localization results using low-cost sensors. The experimental results demonstrate that the localization system can achieve competitive localization accuracy when compared to the LiDAR-based localization solutions under various challenging circumstances, leading to a potential low-cost localization solution for autonomous driving.
As shown in
The first determining module 1630 may be configured to determine a plurality of sets of image descriptors corresponding to the set of spatial coordinates when the vehicle is in a plurality of candidate poses, respectively. The plurality of sets of image descriptors belong to the image descriptor map. The plurality of candidate poses are obtained by offsetting the predicted pose. The second determining module 1640 may be configured to determine a plurality of similarities between the plurality of sets of image descriptors and the set of reference descriptors. The updating module 1650 may be configured to update the predicted pose based on the plurality of candidate poses and the plurality of similarities corresponding to the plurality of candidate poses.
In some embodiments, the first obtaining module 1610 may include an input module configured to input the captured image into a feature extraction model to obtain the image descriptor map. The feature extraction model is trained based on a set of training images of the external environment and a set of training descriptor maps obtained from the set of training images. The set of training descriptor maps is determined based on a difference between the updated predicted pose and a real pose of the vehicle.
In some embodiments, the second obtaining module 1620 may include: a reference image set obtaining module configured to obtain a set of reference images of the external environment, each of the set of reference images comprising a set of keypoints as well as a set of reference descriptors and a set of spatial coordinates associated with the set of keypoints, the set of spatial coordinates being determined by projecting a laser radar point cloud onto the reference image; a selection module configured to select, from the set of reference images, the reference image corresponding to the captured image based on the predicted pose; and a reference descriptor set and spatial coordinate set obtaining module configured to obtain the set of reference descriptors and the set of spatial coordinates stored in association with the set of keypoints in the reference image.
In some embodiments, the first determining module 1630 may include: a projection point set determining module configured to determine a set of projection points of the set of spatial coordinates by projecting the set of spatial coordinates onto the captured image based on a first candidate pose of the plurality of candidate poses; a neighboring point determining module configured to determine, for a projection point of the set of projection points, a plurality of points neighboring the projection point in the captured image; a descriptor determining module configured to determine a plurality of descriptors of the plurality of points in the image descriptor map; and an image descriptor obtaining module configured to determine a descriptor of the projection point based on the plurality of descriptors to obtain a first image descriptor of a set of image descriptors corresponding to the first candidate pose among the plurality of sets of image descriptors.
In some embodiments, the second determining module 1640 may include: a difference determining module configured to determine, for a first set of image descriptors among the plurality of sets of image descriptors, a plurality of differences between a plurality of image descriptors of the first set of image descriptors and corresponding reference descriptors of the set of reference descriptors; and a similarity determining module configured to determine, based on the plurality of differences, a similarity between the first set of image descriptors and the set of reference descriptors as a first similarity of the plurality of similarities.
In some embodiments, the updating module 1650 may include: a probability determining module configured to determine, based on the plurality of similarities, probabilities that the plurality of candidate poses are the real pose, respectively; and an expected pose determining module configured to determine, based on the plurality of candidate poses and the probabilities, an expected pose of the vehicle as the updated predicted pose.
In some embodiments, the apparatus 1600 may further include a candidate pose determining module configured to determine the plurality of candidate poses by taking a horizontal coordinate, a longitudinal coordinate and a yaw angle of the predicted pose as a center and by offsetting from the center in three dimensions of a horizontal axis, a longitudinal axis and a yaw angle axis, with respective predetermined offset units and within respective predetermined maximum offset ranges.
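An illustrative enumeration of such a candidate pose grid is sketched below; the offset units and maximum offset ranges shown are placeholders, since the disclosure leaves them as predetermined values.

    import numpy as np

    def candidate_pose_offsets(unit_xy=0.05, max_xy=0.25, unit_yaw=0.1, max_yaw=0.5):
        """Enumerate candidate poses as offsets around the predicted pose.

        Offsets are taken along the horizontal axis, the longitudinal axis and the
        yaw angle axis, in steps of the given offset units and within the given
        maximum offset ranges (all values here are illustrative placeholders).
        """
        dx = np.arange(-max_xy, max_xy + 1e-9, unit_xy)
        dy = np.arange(-max_xy, max_xy + 1e-9, unit_xy)
        dyaw = np.arange(-max_yaw, max_yaw + 1e-9, unit_yaw)
        grid = np.stack(np.meshgrid(dx, dy, dyaw, indexing="ij"), axis=-1)
        return grid.reshape(-1, 3)   # one (dx, dy, dyaw) offset per candidate pose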
In some embodiments, the apparatus 1600 may also include a keypoint set selection module configured to select, based on a farthest point sampling algorithm, the set of keypoints from a set of points in the reference image.
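For reference, a plain farthest point sampling routine over two-dimensional keypoint locations is sketched below; applying it to pixel coordinates is an assumption, as the sampling could equally operate on the associated spatial coordinates.

    import numpy as np

    def farthest_point_sampling(points, num_samples):
        """Select a well-spread subset of points using farthest point sampling.

        points:      (N, 2) candidate point coordinates in the reference image.
        num_samples: number of keypoints to keep.
        """
        selected = [0]                                      # start from an arbitrary point
        distances = np.linalg.norm(points - points[0], axis=1)
        for _ in range(1, num_samples):
            next_index = int(np.argmax(distances))          # point farthest from the selection
            selected.append(next_index)
            distances = np.minimum(
                distances, np.linalg.norm(points - points[next_index], axis=1))
        return points[selected]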
The following components in the device 1700 are connected to the I/O interface 1705: an input unit 1706 such as a keyboard, a mouse and the like; an output unit 1707 including various kinds of displays, loudspeakers and the like; a storage unit 1708 including a magnetic disk, an optical disk and the like; and a communication unit 1709 including a network card, a modem, a wireless communication transceiver and the like. The communication unit 1709 allows the device 1700 to exchange information/data with other devices through a computer network such as the Internet and/or various kinds of telecommunications networks.
Various processes and processing described above, for example, the example process 200, 1000 or 1300, may be executed by the processing unit 1701. For example, in some embodiments, the example process 200, 1000 or 1300 may be implemented as a computer software program that is tangibly included in a machine readable medium, for example, the storage unit 1708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1700 via the ROM 1702 and/or the communication unit 1709. When the computer program is loaded into the RAM 1703 and executed by the CPU 1701, one or more steps of the example process 200, 1000 or 1300 as described above may be executed.
As used herein, the term “includes” and its variants are to be read as open terms that mean “includes, but is not limited to.” The term “based on” is to be read as “based at least in part on.” The term “one embodiment” and “the embodiment” are to be read as “at least one embodiment.” The terms “first,” “second,” and the like may refer to different or same objects. Other definitions, explicit and implicit, may be included in the context.
As used herein, the term “determining” covers various acts. For example, “determining” may include operation, calculation, processing, derivation, investigation, search (for example, search through a table, a database or a further data structure), identification and the like. In addition, “determining” may include receiving (for example, receiving information), accessing (for example, accessing data in the memory) and the like. Further, “determining” may include resolving, selecting, choosing, establishing and the like.
It will be noted that the embodiments of the present disclosure can be implemented in software, hardware, or a combination thereof. The hardware part can be implemented by a special logic; the software part can be stored in a memory and executed by a suitable instruction execution system such as a microprocessor or special purpose hardware. Those skilled in the art should appreciate that the above apparatus and method may be implemented with computer executable instructions and/or in processor-controlled code, and for example, such code is provided on a carrier medium such as a programmable memory or an optical or electronic signal bearer.
Further, although operations of the present methods are described in a particular order in the drawings, this does not require or imply that these operations must be performed in that particular order, or that a desired outcome can only be achieved by performing all of the operations shown. On the contrary, the execution order of the steps depicted in the flowcharts may be varied. Alternatively, or in addition, some steps may be omitted, a plurality of steps may be merged into one step, or a step may be divided into a plurality of steps for execution. It should be appreciated that features and functions of two or more devices according to the present disclosure can be implemented in combination in a single device. Conversely, various features and functions that are described in the context of a single device may also be implemented in multiple devices.
Although the present disclosure has been described with reference to various embodiments, it should be understood that the present disclosure is not limited to the disclosed embodiments. The present disclosure is intended to cover various modifications and equivalent arrangements included in the spirit and scope of the appended claims.
The various embodiments described above can be combined to provide further embodiments. Aspects of the embodiments can be modified to provide yet further embodiments.
These and other changes can be made to the embodiments in light of the above-detailed description. In general, in the following claims, the terms used should not be construed to limit the claims to the specific embodiments disclosed in the specification and the claims, but should be construed to include all possible embodiments along with the full scope of equivalents to which such claims are entitled. Accordingly, the claims are not limited by the disclosure.