Autonomous mobile robots are making positive contributions by providing smart and autonomous services in manufacturing, healthcare, and logistics systems. Smart robots can improve the efficiency and effectiveness of these systems by performing tasks autonomously, such as handling materials, monitoring the ambient environment, and conducting complex surgeries with high accuracy and precision. In other roles, they may serve as assistants for tasks such as cleaning, washing, cooking, and securing a home or office building. They can also deliver packages and lunch boxes to people in an office building. These robots can be designed to operate autonomously, navigating through complex indoor or outdoor environments without significant human intervention.
The robots can utilize indoor or outdoor localization techniques to ensure accurate navigation in the environment. Outdoor localization can use GPS technology, which is effective only in open spaces and faces significant challenges in urban canyons and environments with limited satellite visibility. Indoor localization utilizes methods such as Wi-Fi fingerprinting, infrared sensors, beacons, tags, inertial measurement units, and cameras to determine a robot's position within a building. One way to localize a robot accurately in both types of environments is to use a perception system that does not rely on specific external sensors such as beacons, tags, or RFID. One such method is visual localization, which can estimate the pose of a robot from its camera by analyzing the image of the scene it is currently viewing using a visual localization pipeline. The visual localization pipeline may start with an image retrieval step. This step can provide a first coarse-grained localization estimate based on camera pose information extracted from similar images of the scene or location that are retrieved from an image database containing images of the environment in which the robot is operating. The image retrieval component of the visual localization pipeline can be sensitive to changes in the environment as observed by the robot. Sometimes changes in the appearance of a visual scene due to different poses cannot be recognized from pure visual information. Challenges in visual localization may include changes in lighting conditions, occlusions, day-night changes, as well as weather and seasonal variations. Therefore, it is important to make image retrieval models robust against these changes in image appearance in order to produce accurate localization predictions. Addressing these challenges will improve the reliability and accuracy of autonomous mobile robots in real-world environments with changing ambient conditions.
Some embodiments of the present disclosure relate to the use of generative AI models to generate synthetic images for augmenting a training dataset, to validate the synthetic images using geometric filters, and to improve the training process of an image retrieval model aimed at accurate long-term visual localization. A computer-implemented method includes accessing an initial dataset of real images of an environment from an image database. The initial dataset may have at least one matching image and a set of non-matching images for each image of the initial dataset. The matching image represents a scene of the environment similar to that contained in the image of the initial dataset.
The method further includes identifying a set of matching pairs in the initial dataset. A matching pair comprises images that capture similar aspects or scenes of the environment. For each matching pair in the initial dataset, a set of tentative pairs is generated. Each tentative pair includes at least one synthetic image. The tentative pairs may be generated by forming image pairs that combine either the first image of the matching pair with the synthetic variants of the second image of the matching pair, or the synthetic variants of the first image of the matching pair with the second image of the matching pair.
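As a minimal illustration of this pairing scheme (the function and variable names are hypothetical, not taken from the disclosure), tentative pairs can be built as follows:

```python
# Illustrative sketch: build tentative pairs from a matching pair (first, second)
# and per-prompt synthetic variants of each of its two images.
def make_tentative_pairs(first_img, second_img, synth_of_first, synth_of_second):
    """synth_of_first / synth_of_second map a text prompt to the synthetic
    variant of the corresponding real image."""
    pairs = []
    # Real first image paired with each synthetic variant of the second image.
    for prompt, synth in synth_of_second.items():
        pairs.append((first_img, synth, prompt))
    # Each synthetic variant of the first image paired with the real second image.
    for prompt, synth in synth_of_first.items():
        pairs.append((synth, second_img, prompt))
    return pairs
```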
The method further includes identifying a set of domain shifts that correspond to potential changes in the same environment represented in the image database. A text prompt may be generated for each of the identified domain shifts. The set of domain shifts may include indoor and/or outdoor changing conditions of the environment. The outdoor changing conditions may correspond to weather, season, or time-of-day related information that may also be reflected in the scenes of the environment. The indoor changing conditions may include holiday or event related changes, such as Christmas, Kwanzaa, Hanukkah, Easter, sales events, and similar public or religious holidays, that may add, remove, or dislocate objects and thereby change the indoor scenes within the environment. A synthetic image may be generated for each image of the initial dataset and for each of the set of domain shifts by processing the text prompt with the generative AI model. In some instances, the synthetic images may be generated by simulating each of the set of domain shifts using a 3D game engine.
To determine whether a tentative pair is valid, geometric correspondences can be computed between the images of the tentative pair inside an area of interest. The tentative pair may be considered a valid pair (or valid synthetic pair) based on the degree of geometric correspondence. In some instances, a geometric transformation can be computed between the first image and the second image of a matching pair of the initial dataset. The method may further compute, for each of the tentative pairs corresponding to the matching pair, the number of local geometric correspondences that satisfy the geometric transformation. The tentative pair corresponding to the matching pair may be considered valid if the absolute or relative number of local geometric correspondences inside the area of interest of the two images exceeds a threshold.
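A minimal sketch of this validity test is shown below, assuming the geometric transformation is modeled as a RANSAC-estimated homography (one possible choice); the match arrays and thresholds are illustrative assumptions:

```python
import cv2
import numpy as np

def is_valid_tentative_pair(matches_real, matches_tentative,
                            reproj_px=3.0, min_abs=20, min_rel=0.5):
    """matches_* are hypothetical (N, 4) arrays of (x1, y1, x2, y2) point
    correspondences inside the area of interest of the two images."""
    if len(matches_real) < 4:
        return False
    # Geometric transformation estimated from the real matching pair.
    H, _ = cv2.findHomography(matches_real[:, :2], matches_real[:, 2:],
                              cv2.RANSAC, reproj_px)
    if H is None:
        return False
    # Count tentative-pair correspondences that satisfy this transformation.
    src = matches_tentative[:, :2].reshape(-1, 1, 2).astype(np.float32)
    proj = cv2.perspectiveTransform(src, H).reshape(-1, 2)
    err = np.linalg.norm(proj - matches_tentative[:, 2:], axis=1)
    inliers = int((err < reproj_px).sum())
    # Valid if the absolute or relative number of consistent correspondences
    # exceeds a threshold.
    return inliers >= min_abs or inliers / max(len(matches_real), 1) >= min_rel
```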
The valid pairs can be combined with the initial dataset to generate an extended dataset. A contrastive loss function can be used to train an image retrieval model with weights assigned to each matching pair and its corresponding valid pairs in the extended dataset. The weights can be proportional to the degree of geometric correspondence inside the area of interest between the images of the matching pair or the valid pair. To improve training of the image retrieval model, domain randomization may be applied to increase resilience to domain shifts. For domain randomization, a random cropping mechanism may be applied first, followed by a data augmentation technique that mixes multiple augmentations, such as AugMix. The image retrieval model may be trained using an optimizer such as the AdamW optimizer for a longer time with a cosine learning rate schedule. The trained image retrieval model can be used to retrieve a set of matching images from the image database by processing a query image captured by a camera device, in order to localize a position and/or orientation of the camera device.
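A hedged PyTorch sketch of such a training setup is given below; the transform, hyperparameters, model, and loss function are placeholders and assumptions, not the disclosed implementation:

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torchvision import transforms

# Domain randomization: random cropping first, then mixed augmentations (AugMix).
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.AugMix(),
    transforms.ToTensor(),
])

def train_retrieval_model(model, loader, loss_fn, epochs=100, lr=1e-4):
    """Train with AdamW and a cosine learning-rate schedule, as suggested above."""
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=1e-2)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs * len(loader))
    for _ in range(epochs):
        for batch in loader:              # tuples of (query, positive, negatives)
            loss = loss_fn(model, batch)  # e.g., a weighted contrastive loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()
    return model
```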
The area of interest can be determined by computing a matching function which returns a set of matches that are geometrically consistent between two images of the matching pair. The matching function can be any function which takes two images as input and provides a set of correspondences as an output. For example, if the task is in 3D, then the matching function may perform matching over 3D models. The area of interest is composed of image pixels corresponding to the set of matches. In some embodiments, the area of interest is determined based on identical sets of 3D co-observations between two images of the matching pair. In this case, the area of interest is composed of image pixels corresponding to matching 3D co-observations of the two images.
In some embodiments of the present disclosure, a first set of correspondences can be determined between the images of the matching pair using a matching function. Similarly, a second set of correspondences can be determined between the images of the tentative pair corresponding to the matching pair using the same matching function. Common geometric correspondences may then be identified between the first set of correspondences and the second set of correspondences. A geometric consistency score can be computed for the tentative pair as the ratio of the common geometric correspondences to the first set of correspondences of the matching pair.
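A small sketch of this score is shown below, assuming correspondences are represented as hashable tuples so that common matches can be found by set intersection (the actual matching function and representation are left open):

```python
def geometric_consistency_score(matching_pair_corrs, tentative_pair_corrs):
    """Ratio of correspondences shared by the tentative pair to the
    correspondences found in the original matching pair."""
    common = set(matching_pair_corrs) & set(tentative_pair_corrs)
    return len(common) / max(len(matching_pair_corrs), 1)
```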
The method can further include generating a set of synthetic tuples based on the set of tentative pairs corresponding to the matching pair. Each tuple of the set of synthetic tuples may comprise the images of the tentative pair together with the set of non-matching images for the tentative pair. A subset of k tuples may be selected from the set of synthetic tuples for training the image retrieval model. In some instances, the subset of k tuples is selected randomly, whereas in other instances the subset of k tuples is selected based on the geometric consistency score. A synthetic tuple with a lower geometric consistency score has a higher probability of selection. The contrastive loss function may be computed on the subset of k tuples to update the model parameters.
The technique disclosed in the present disclosure can be utilized for any AI task that has the following properties: 1) model training is performed on image pairs, 2) the domain shifts are known and can be synthesized by generative AI models, and 3) geometric consistency is important for the task.
In some embodiments, a system is provided that includes one or more data processors and a non-transitory computer readable storage medium containing instructions which, when executed on the one or more data processors, cause the one or more data processors to perform part or all of one or more methods disclosed herein.
In some embodiments, a computer-program product is provided that is tangibly embodied in a non-transitory machine-readable storage medium and includes instructions configured to cause one or more data processors to perform part or all of one or more methods or processes disclosed herein.
In some embodiments, a system is provided that includes one or more means to perform part or all of one or more methods or processes disclosed herein.
The terms and expressions which have been employed are used as terms of description and not of limitation, and there is no intention in the use of such terms and expressions of excluding any equivalents of the features shown and described or portions thereof, but it is recognized that various modifications are possible within the scope of the invention claimed. Thus, it should be understood that although the present invention as claimed has been specifically disclosed by embodiments and optional features, modification and variation of the concepts herein disclosed may be resorted to by those skilled in the art, and that such modifications and variations are considered to be within the scope of this invention as defined by the appended claims.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee. The present disclosure is described in conjunction with the appended figures.
The present disclosure describes embodiments relating to an enhanced image retrieval component for long-term visual localization under extreme domain shifts. In some embodiments, techniques are provided to utilize generative AI models to generate synthetic images for augmenting a training dataset, to validate the synthetic images using geometric filters, and to improve the training process of an image retrieval model aimed at accurate long-term visual localization.
Visual localization refers to the task of estimating the camera pose for a given view of a known scene and is a core component of the perception system of autonomous vehicles and robotic platforms. For example, given a view of the known scene that is currently observed by a robot, often a 2D image captured by a camera, the visual localization task is to provide an accurate estimate of the six-degree-of-freedom camera pose to compute the robot's position by using visual data only. The visual localization pipeline may comprise an image retrieval component and a pose estimator. The quality of the retrieval component is affected by the appearance of landmarks, which is altered by changing conditions such as weather, season, occlusions, or time of day. Visual localization under such varying environmental conditions is referred to as long-term visual localization with changing conditions, or long-term visual localization with extreme domain shifts.
The image retrieval component is responsible for retrieving images, based on their similarity with the given view (also referred to herein as a query image), from an image database containing images of the environment in which the robot is intended to operate. In some embodiments, the image retrieval component may include a feature extractor and a feature comparator. The feature extractor may comprise handcrafted local descriptors, bag-of-words representations, or more sophisticated aggregation techniques such as Fisher Vectors or Aggregated Selective Match Kernels (ASMK). Moreover, deep learning-based models such as
Convolutional Neural Network (CNN) encoders can also be used to generate a global descriptor per image for the retrieval task. In this case, feature vectors can be generated either directly or by aggregating local activations of the deep model. In some instances, the feature comparator may include ASMK to evaluate the similarities between the feature vectors of the query image and the retrieved images. ASMK is a matching process defined over a selective matching kernel of local features and is a more precise matching function than comparing global representations.
The pose estimator then estimates the pose for the query image by using the camera pose information assigned to the retrieved images. In some instances, only top k retrieved images may be used to predict the camera pose. The pose can be predicted either by interpolating or approximating the poses from the top k retrieved images, or by using a complex method for increased precision. The complex method may include utilization of the local or global 3D maps created via Structure from Motion (SfM) such as COLMAP and the query image pose can be estimated by registering it in the 3D map.
To enhance the performance of the retrieval component, some embodiments of the present invention use text-to-image generation techniques to alter images from an initial dataset and expand the initial dataset with a set of synthetic variants. The synthetic variants can be generated for some or all of the images in the initial dataset for several meaningful domain shifts that can be described with words. The synthetic variants can be generated by using generative AI models such as DALL-E or Stable Diffusion to synthesize different challenging scenarios of the environment. In addition, images of the initial dataset can be altered via a textual prompt by using methods like InstructPix2Pix, ControlNet, or EmuEdit. In some instances, the synthetic images or variants may be generated by simulating different domain shifts using a 3D game engine. The synthetic variants together with the initial dataset may be used to train generic representations that can transfer across a broad range of classification tasks.
Text prompts can be selected to describe meaningful domain shifts. For example, prompts for indoor localization may include alterations due to holiday or event related changes such as Christmas, Kwanzaa, Hanukkah, Easter, sales events, and the like. For outdoor scenarios, the text prompts may capture changing conditions related to weather, seasons, or time of day. Moreover, the generation of synthetic images may happen on-the-fly during batch construction for training of a model or can be implemented as a preprocessing step.
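As an illustration, a prompt set along the following lines could cover the indoor and outdoor domain shifts mentioned above; the exact prompt wording is an assumption, not taken from the disclosure:

```python
# Hypothetical example prompts describing domain shifts.
INDOOR_PROMPTS = [
    "decorated for Christmas",
    "decorated for Hanukkah",
    "during a large sales event with promotional displays",
]
OUTDOOR_PROMPTS = [
    "on a snowy winter day",
    "during heavy rain",
    "at night",
    "at dusk in autumn",
]
```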
In some aspects of the present disclosure, a subset of the synthetic images may be selected by using geometric filtering to achieve better training of the image retrieval model. Some of the generated synthetic images may not be suitable for learning representations for visual localization, as the characteristics of the scene may be altered during the generation process. Moreover, while the synthetic image generation step deals with each image of the initial dataset individually, the training process of the image retrieval model focuses on image pairs. Therefore, geometric filtering can be used to automatically verify the validity of an image pair.
For example, consider a training dataset D of images suitable for landmark retrieval, in which matching pairs can be found. The term ‘matching pair’ as used herein refers to images depicting the same scene or landmark. The training dataset D can be seen as a set of training tuples built from matching pairs (q, p) composed of a query and a positive image, together with a small set of M negatives, U_qp = {q, p, n_1, . . . , n_M}. The set of M negatives comprises images that capture a different scene or landmark. For the matching pair (q, p), (q̃_t, p) may represent a corresponding synthetic pair in which the query is replaced by a synthetic variant q̃_t obtained with a text prompt t. As used herein, the terms “synthetic pair” and “tentative pair” are used interchangeably. Consider a matching function c(·,·) such that C_qp = c(q, p) is a set of correspondences, i.e., a set of geometrically consistent local matches between q and p. For the synthetic pair (q̃_t, p), the geometric correspondences may be computed by using the matching function, keeping only those correspondences that also exist in the original pair (q, p). A geometric consistency score may then compare the number of preserved correspondences with the number of geometric correspondences ∥C_qp∥ in the matching pair (q, p). The synthetic pair may be considered a valid synthetic pair (or valid pair) if its geometric consistency score exceeds a threshold τ. The geometric consistency score may also be used to rank the synthetic pairs according to the level of preservation of their geometry.
In some embodiments, the generative model can be assumed to produce synthetic variants that do not shift or alter local features in terms of their geometry. Therefore, the geometric transformation between an image and its variant should be an identity transformation. A self-consistency check can be performed to validate each synthetic image. For example, a pair can be created between the original image and its synthetic variant, i.e., (q, q̃_t) for each text prompt t. If the geometric transformation of the pair is the identity transformation, then the synthetic image is a valid image.
Further, an improved training process of the image retrieval model is provided based on an extended dataset. The extended dataset may include real and synthetic images. The image retrieval model may be a deep learning-based model such as a CNN encoder, or can be any known model such as HOW or FIRe. Existing image retrieval models are normally trained using a contrastive loss function (as defined in Equation 1) on a set of training tuples comprising real images only. In the present disclosure, due to the use of synthetic images, the loss may be computed over original as well as synthetic tuples.
In Equation 1, f(x) represents the aggregated feature vector for image x and can be computed as a weighted average of all local features. Moreover, u is a margin hyper-parameter, and [·]+ denotes the positive part function max(0, ·).
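Assuming the standard contrastive formulation used by models such as HOW, Equation 1 plausibly takes a form along these lines:

```latex
\mathcal{L}(U_{qp}) \;=\; \big\lVert f(q) - f(p) \big\rVert^{2}
\;+\; \sum_{m=1}^{M} \Big[\, u - \big\lVert f(q) - f(n_m) \big\rVert \,\Big]_{+}^{2}
```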
The technique disclosed in the present disclosure can be utilized on top of existing image retrieval approaches, or with any other AI task for which model training is based on image pairs and the known domain shifts can be synthesized by generative AI models while preserving geometry. To improve training of the image retrieval model, domain randomization is applied to increase resilience to domain shifts. For this, a random cropping mechanism may be applied first, followed by a data augmentation technique that mixes multiple augmentations, such as AugMix. Further, an optimizer (such as the AdamW optimizer) is used to facilitate training the image retrieval model for a longer time with a cosine learning rate schedule. As used herein, the term “Ret4Loc” may refer to image retrieval models trained using the training settings mentioned above.
The model can be trained in different ways by using the synthetic variants. In some instances, the model training process may involve random selection of k tuples from the set of synthetic tuples generated from the original tuple. This approach may be called Ret4Loc-Synth. In other instances, a filtered set of tuples may be used that contains valid synthetic tuples only. The model may then be trained using k tuples selected randomly from the filtered set, an approach referred to herein as Ret4Loc-Synth+. In yet other instances, geometry-aware sampling can be performed to select k tuples from the filtered set of tuples. Each tuple in the filtered set may have a geometric consistency score s. The sampling of a valid synthetic tuple Ũ_qp^t may be performed proportionally to 1/s(q̃_t, p). Hence, tuples with a larger drop in correspondences are picked with a higher probability. This weighting scheme may favor valid tuples that are “harder” and is referred to as geometry-aware sampling. The models trained using geometry-aware sampling are referred to herein as Ret4Loc-Synth++.
In some embodiments, the robot 110 may capture an image (also referred to herein as a query image) and send the image to the cloud server 130 via any wireless internet access, such as a 5G network or Wi-Fi. The cloud server 130 may run a visual localization service and process the query image received from the robot 110. The visual localization service may execute the visual localization method using the images from the image database 120. The image database 120 may comprise images of the environment in which the robot 110 moves around. In response to the query image, the visual localization service running on the cloud server 130 may return the pose estimate of the robot 110. Multiple robots can be connected to the cloud server 130 and utilize the visual localization service to get an accurate estimate of their current position. The cloud server 130 then acts as a shared brain for the robots 110.
In some other embodiments, if the environment is not large or does not require multiple robots, the robot 110 may host the image database 120 in its memory and run the visual localization method on its onboard computational unit. For example, consider a vacuum cleaner robot cleaning a small home. The different aspects or scenes of the environment (i.e., the home) can be captured with a small number of images, which can easily be accommodated in the built-in memory of the robot 110. Many modern robotic vacuum cleaners are designed to operate autonomously without requiring a constant internet connection for basic functionalities like cleaning and navigation. The core algorithms for Simultaneous Localization and Mapping (SLAM) and obstacle avoidance are executed locally on the robot 110, without the need for a continuous internet connection or a connection to the cloud server 130.
To synthesize such challenging scenarios, generative models 220 can be used. The generative models 220 may include text-to-image generation models, for example, DALL-E or Stable Diffusion and the like. In addition, models that alter an image via a text prompt may also be used as the generative models 220, including but not limited to InstructPix2Pix, EMU Edit, or ControlNet. By using the generative models 220, multiple synthetic variants may be generated for every image in the initial dataset 210 based on the set of text prompts 215. For example, let g(·) be a generative model that takes as input an image x and a text prompt t, and produces x̃_t = g(x; t), a synthetic variant of image x with respect to text prompt t. For the training set D and a set of T textual prompts T = {t_1, . . . , t_T}, an extended dataset can be generated which contains the original images as well as their T synthetic variants x̃_t = g(x; t), for every image x in D and for every t ∈ {1, . . . , T}.
In some instances, an image validator 225 can also be used to evaluate the self-consistency of the synthetic images before saving them into the image database 120. The image validator 225 may include a feature extractor 230 and a geometric matcher 235. The feature extractor 230 may include a local feature detector and descriptor, e.g., R2D2 or SIFT, and can be used to extract local features of the image. In some instances, local features may already be available for a dataset with 3D SfM maps, since the map construction process is based on such local features. For each image in the initial dataset 210, pairs can be created between the original image and its synthetic images. For example, for an image x, the image pairs may be represented as (x, x̃_t) for every text prompt t.
The geometric matcher 235 may compare the feature vectors of an image with those of its synthetic variant. The geometric matcher 235 may compute a geometric transformation between the images of a pair. The synthetic image may be considered valid if the geometric transformation of the image pair (x, x̃_t) is the identity transformation. The synthetic variants created by the generative model 220 can be considered valid if they did not shift or alter local features in terms of geometry, in which case the geometric transformation between every image x and its variant x̃_t may be the identity transformation.
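A hedged sketch of such a self-consistency check is shown below, using SIFT (one of the local descriptors mentioned above) and a RANSAC homography; the tolerances and the choice of a homography as the transformation model are illustrative assumptions:

```python
import cv2
import numpy as np

def passes_self_consistency(image, synthetic, tol_linear=0.05, tol_shift_px=5.0):
    """image / synthetic: uint8 arrays of the original image and its variant."""
    gray = lambda im: cv2.cvtColor(im, cv2.COLOR_BGR2GRAY) if im.ndim == 3 else im
    sift = cv2.SIFT_create()
    k1, d1 = sift.detectAndCompute(gray(image), None)
    k2, d2 = sift.detectAndCompute(gray(synthetic), None)
    if d1 is None or d2 is None:
        return False
    matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(d1, d2)
    if len(matches) < 4:
        return False
    src = np.float32([k1[m.queryIdx].pt for m in matches])
    dst = np.float32([k2[m.trainIdx].pt for m in matches])
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    if H is None:
        return False
    A = H / H[2, 2]
    # Valid if the estimated transformation is (close to) the identity:
    # negligible rotation/scale/shear and only a small pixel shift.
    return (np.allclose(A[:2, :2], np.eye(2), atol=tol_linear)
            and np.all(np.abs(A[:2, 2]) < tol_shift_px))
```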
In Equation 1, Nt may represent the total number of text prompts. To determine whether any pair in the set of tentative pairs P̃(q, p) is valid, a pair validator 315 may be utilized to validate each pair from the set of tentative pairs P̃(q, p).
According to an example embodiment, an area of interest may be determined initially using the matching pair (q, p) 305. The area of interest can be an arbitrary subset of image pixels, akin to a segmentation mask. For example, the area of interest between the images of the matching pair 305 may comprise those pixels which describe the scene (e.g., a landmark) and not occluding objects such as people. Depending on the nature of the initial dataset 210, the area of interest may be either known or easy to determine automatically. Common visual localization datasets, for example, come with 3D SfM maps. Images from the matching pair 305 can be registered, and a set of 3D co-observations that the two images share can define a suitable area of interest. In some other instances, the initial dataset 210 comprises a generic set of image pairs, such as landmark image sets, which do not carry extra information. For this type of dataset, any matching framework, such as RANSAC, deep learning-based alternatives (e.g., DeepMatching), or even diffusion-based dense matching, can be utilized as the matching function 405. These matching frameworks may have the feature extractor 230 built in.
In some instances, the generative model 220 may be assumed not to shift or alter local features in terms of geometry. Then the geometric transformation between every image x and its variant x̃_t may be the identity transformation. Hence, any verified geometric transformation between (q, p) may also remain preserved for every pair in P̃(q, p). The geometric transformation may be available for every matching pair (q, p) of the initial dataset 210 as a by-product of the process of computing the area of interest. The geometric transformation can be extracted either via a pose registration process on the 3D SfM maps or computed directly during geometric matching. The geometric transformation for every tentative pair (containing at least one synthetic image) can be obtained in a similar manner. The validity of the tentative pair may be determined by calculating, using a geometric filter 410, the number of local geometric correspondences that abide by the known transformation. This process can be much faster in practice as it only requires extracting local features for each of the images and then matching them per pair. A tentative pair from P̃(q, p) may be declared valid if the absolute or relative number of local keypoint correspondences inside the areas of interest of the two images exceeds a threshold τ.
In other aspects of the present disclosure, a geometric consistency score may be computed to assess the degree to which a synthetic pair preserves the location characteristics shared across the matching images. For example, (q̃_t, p) represents the synthetic pair of the matching pair (q, p) 305 in which the query image q is replaced by a synthetic variant q̃_t obtained with text prompt t. Let c(·,·) denote the matching function 405 such that C_qp = c(q, p) is a set of correspondences, i.e., a set of geometrically consistent local matches between q and p. The geometric correspondences of the synthetic pair (q̃_t, p) may be computed with the same matching function and intersected with the set of correspondences C_qp of the matching pair (q, p) 305. The geometric consistency score may be defined as in Equation 3.
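Consistent with the description above and with the ratio used in the summary, Equation 3 plausibly has the following form, where ∥·∥ denotes the number of correspondences:

```latex
s(\tilde{q}_t, p) \;=\;
\frac{\big\lVert\, c(\tilde{q}_t, p) \cap C_{qp} \,\big\rVert}{\big\lVert C_{qp} \big\rVert}
```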
The geometric consistency score may be used to rank synthetic pairs according to the level of preservation of their geometry.
The pose estimator 550 may then estimate the pose for the query image 520 by using the camera pose information assigned to the retrieved images. In some instances, only the top-k images 540 may be used to predict the camera pose. The pose can be predicted either by interpolating or approximating the poses from the top-k images 540, or by using a complex method for increased precision. The complex method may include utilization of local or global 3D maps created via Structure from Motion (SfM) such as COLMAP and the query image 520 pose can be estimated by registering it in the 3D map.
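For the simpler interpolation route, a naive sketch of pose approximation by an equal-weighted barycenter of the top-k poses (the EWB protocol mentioned in the example implementation below) might look as follows; the quaternion averaging is a simple approximation and assumes the retrieved orientations are close to one another:

```python
import numpy as np

def ewb_pose(positions, quaternions):
    """positions: (k, 3) camera centers; quaternions: (k, 4) unit quaternions."""
    t = np.asarray(positions, dtype=np.float64).mean(axis=0)
    q = np.asarray(quaternions, dtype=np.float64).copy()
    # q and -q encode the same rotation; align signs to the first quaternion.
    q[np.sum(q * q[0], axis=1) < 0] *= -1
    q_mean = q.mean(axis=0)
    return t, q_mean / np.linalg.norm(q_mean)
```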
In some embodiments, the image retriever 530 may utilize other image retrieval approaches for place recognition, such as techniques based on handcrafted local descriptors and bag-of-words representations, or more sophisticated aggregation techniques, e.g., Fisher Vectors or ASMK. With the rise of deep learning, retrieval techniques started using one global descriptor per image for retrieval, either produced directly or obtained by aggregating local activations of the deep model. For example, deep metric learning may be applied on large sets of landmark images mined from the web in order to learn global features that excel at the task of place recognition. Retrieval techniques that utilize CNN-based local features typically outperform other techniques.
The image retriever 530 can execute any image retrieval approach, including but not limited to HOW and FIRe. The retrieval approaches that best correlate with localization accuracy are HOW and FIRe. They employ a global contrastive loss to learn a model whose local features are then used with match kernels such as ASMK to perform image retrieval. ASMK is a matching process defined over selective matching kernels of local features. ASMK is a much stricter and more precise matching function than comparing global representations.
For example, if a tuple U_qp corresponds to the matching pair (q, p) 305, then a synthetic tuple Ũ_qp^t = (q̃_t, p, ñ_1^t, . . . , ñ_M^t) for variant t (or corresponding to text prompt t) can be generated by replacing q with q̃_t and each negative n_m with its variant ñ_m^t. When the synthetic tuple 810 is used in the contrastive loss of HOW (Equation 1), the first part of the loss, which uses the synthetic q̃_t with an unaltered p, may bring the representations of the original positive image p and of the synthetic variant of the query q̃_t close to each other. At the same time, the loss pushes the synthetic variant of the query q̃_t away from the representations of all the synthetic negative images ñ_m^t in the domain corresponding to the selected text prompt. Therefore, the query feature should be invariant to the different domain shifts described by the set of text prompts 215 and, simultaneously for any domain shift, the query feature should be sufficiently different from its associated negatives.
In some embodiments (e.g., Ret4Loc-Synth), a training dataset can be extracted from the extended dataset that may comprise both the original tuple 820 and one or more synthetic tuples 810 sampled from the set of possible ones for the different prompts. A tuple selector 830 may select a subset of tuples from the training dataset by uniformly sampling K text prompts among the T options to select the synthetic tuples 810. For example, consider a set of K synthetic tuples {Ũ_qp^1, . . . , Ũ_qp^K} corresponding to K≥1 selected variants. Then an extended set of tuples Ũ = {U_qp, Ũ_qp^1, . . . , Ũ_qp^K} may contain the original tuple 820 U_qp and the K synthetic tuples. The extended set of tuples may be used to compute a loss function 840 to train the image retrieval model, such as the encoder 610.
In some instances, the loss function 840 may be computed for the extended set of tuples Ũ by adding the individual losses incurred from the original tuple 820 and the synthetic tuples 810 as defined in Equation 4.
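From this description, Equation 4 plausibly reads as a sum of the per-tuple losses:

```latex
\mathcal{L}(\tilde{\mathcal{U}}) \;=\; \mathcal{L}(U_{qp})
\;+\; \sum_{k=1}^{K} \mathcal{L}\big(\tilde{U}_{qp}^{\,k}\big)
```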
In some other instances, a set of features can be aggregated from all variants of each image in the original tuple 820 independently, and then a single value of the loss function 840 can be computed on the aggregated features. The sets of features may be obtained from the encoder 610 output. For example, Q, P, and N_m may represent the sets of features extracted for the query, the positive, and each image of the set of M negatives, respectively, that appear in the extended set of tuples Ũ. An aggregation function φ may include simple averaging. The aggregation function φ may produce a single aggregated feature vector for each set of features Q, P, or N_m. The loss function 840 can be computed using the aggregated vectors as defined in Equation 5. By computing the loss function 840 using Equation 5, the images in the original tuple 820 and the synthetic tuples 810 (i.e., the original and synthetic query, positive, and negative images) may influence each other; the tuples are not treated independently.
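One plausible reading of Equation 5, reusing the contrastive form assumed above for Equation 1 with the aggregated vectors, is:

```latex
\mathcal{L}_{\varphi} \;=\; \big\lVert \varphi(Q) - \varphi(P) \big\rVert^{2}
\;+\; \sum_{m=1}^{M} \Big[\, u - \big\lVert \varphi(Q) - \varphi(N_m) \big\rVert \,\Big]_{+}^{2}
```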
In some embodiments of the present disclosure (e.g., Ret4Loc-Synth+), the geometric consistency score s can be used to select the synthetic tuples 810 to be used during training of the encoder 610 (or other known image retrieval models). The geometric consistency score s may vary in the range [0, 1] and can measure the percentage of correspondences remaining for the matching pair after the query is altered. If the geometric consistency score s of a synthetic tuple is less than the threshold τ, then the synthetic tuple may be discarded. The synthetic tuples 810 having a geometric consistency score s greater than the threshold τ may be considered valid tuples. Increasing the threshold τ leads to the selection of fewer synthetic tuples 810. During training of the image retrieval model, the tuple selector 830 may restrict the selection of a subset of k tuples to picking only from the valid tuples.
In another aspect of the present disclosure (e.g., Ret4Loc-Synth++), each tuple in the filtered set of tuples may have a geometric consistency score s which correlates with the level of local appearance preservation. The geometric consistency score s may be utilized by the tuple selector 830 to compute the probability of sampling variants. The sampling of a valid synthetic tuple Ũ_qp^t may be performed proportionally to 1/s(q̃_t, p). Hence, the tuples with a larger drop in correspondences are picked more often. This weighting scheme may favor valid tuples that are “harder” and may be referred to as geometry-aware sampling. The geometry-aware sampling may only work well in conjunction with tuple filtering, which ensures that variants with very low geometric consistency are never considered.
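A small sketch of such geometry-aware sampling is shown below, assuming each valid synthetic tuple carries its consistency score s in (0, 1]; names are illustrative:

```python
import numpy as np

def sample_synthetic_tuples(valid_tuples, scores, k, rng=None):
    """Draw up to k valid tuples with probability proportional to 1/s,
    so that "harder" variants (lower scores) are picked more often."""
    rng = rng or np.random.default_rng()
    weights = 1.0 / np.asarray(scores, dtype=np.float64)
    probs = weights / weights.sum()
    idx = rng.choice(len(valid_tuples), size=min(k, len(valid_tuples)),
                     replace=False, p=probs)
    return [valid_tuples[i] for i in idx]
```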
The I/O interface devices 1225 allow user interaction with the example computing system 1200. Input interface devices may include a keyboard, pointing devices such as a mouse, trackball, touchpad, or graphics tablet, a scanner, a touchscreen incorporated into the display, audio input devices such as voice recognition systems, microphones, and/or other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into the example computing system 1200 or onto the communication network 1230. Output interface devices may include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem may include a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem may also provide non-visual display such as via audio output devices. In general, use of the term “output device” is intended to include all possible types of devices and ways to output information from the example computing system 1200 to the user or to another machine or computing device.
The communication interface 1215 provides an interface to the communication networks 1230 and is coupled to corresponding interface devices in other computing devices. Some of the examples of the communication interfaces 1215 are a modem, digital subscriber line (“DSL”) card, cable modem, network interface card, wireless network card, or other interface device capable of wired, fiber optic, or wireless data communications.
Storage systems store programming and data constructs that provide the functionality of some, or all of the modules described herein. These software modules are generally executed by the processor 1210 alone or in combination with other processors. The memory 1205 used in the example computing system 1200 can include several memories including a main random-access memory (RAM) for storage of instructions and data during program execution, a mass storage device that provides persistent storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, a read only memory (ROM) in which fixed instructions are stored, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations may be stored in the mass storage system, or in other machines accessible by the processor(s) 1210 via the I/O interface 1220.
The example computing system 1200 can be of varying types including a workstation, server, computing cluster, blade server, server farm, or any other data processing system or computing device. Due to the ever-changing nature of computers and networks, the description of the example computing system 1200 depicted in
At block 1315, a text prompt is generated for each of the plurality of domain shifts. At block 1320, a synthetic image is generated for each image of the initial dataset 210 by processing the text prompt with a generative model. At block 1325, matching pairs in the initial dataset 210 are identified, each matching pair comprising images that capture similar aspects or scenes of the environment. At block 1330, tentative pairs are determined for each matching pair in the initial dataset; the tentative pairs are generated either by combining the first image of the matching pair with the synthetic variant of the second image of the matching pair, or by combining the synthetic variant of the first image with the second image of the matching pair. At block 1335, each of the tentative pairs is determined to be valid or not based on a degree of geometric correspondence. At block 1340, an extended dataset is created by combining the initial dataset 210 with the valid pairs. At block 1345, a contrastive loss function is accessed to train an image retrieval model using the extended dataset. At block 1350, the trained image retrieval model outputs an accurate set of matching images to localize a position and/or orientation of the camera device 105 based on the input query image.
An example implementation of the disclosed methods to achieve enhanced image retrieval models (or Ret4Loc models) is provided on five visual localization datasets and one place recognition dataset. Of these six datasets, four are for outdoor and two for indoor localization. Although some parts of the disclosed methods explicitly target outdoor localization, Ret4Loc models also give state-of-the-art results for indoor localization.
The SfM-120k dataset was used to train all Ret4Loc models. The SfM-120k dataset was augmented by generating 11 synthetic variants for each image with the process described in
The public InstructPix2Pix model was used for generating synthetic variants. Twenty inference sampling steps were used for diffusion, and the text prompt and image guidance scales were set to 10 and 1.6, respectively. LightGlue was utilized as the matching function 405 to estimate the geometric consistency score s that was used for filtering or sampling synthetic variants.
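With the diffusers library, such a generation step could look roughly as follows; the checkpoint name, file paths, and the example prompt are assumptions for illustration, while the sampling steps and guidance scales follow the values stated above:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

# Publicly available InstructPix2Pix checkpoint (assumed here).
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("database_image.jpg").convert("RGB")   # illustrative path
variant = pipe(
    "on a snowy winter day",       # text prompt describing the domain shift
    image=image,
    num_inference_steps=20,        # 20 diffusion sampling steps
    guidance_scale=10,             # text-prompt guidance scale
    image_guidance_scale=1.6,      # image guidance scale
).images[0]
variant.save("database_image_snow.jpg")
```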
The Ret4Loc models were built on top of the HOW codebase. The models were trained as explained in the description of
To measure visual localization performance, the Kapture localization toolbox was used. In the retrieval step of this toolbox, a retrieval model produces a shortlist of the top-k nearest database images per query. For this example implementation, the retrieval model was replaced with the public HOW or FIRe models, or with one of the disclosed Ret4Loc Models. The rest of the pipeline, i.e., the pose estimation step, was shared across all experiments, and follows two of the protocols: a) pose approximation with equal weighted barycenter (EWB) using the poses of the top-k retrieved database images; or b) pose estimation with a global 3D-map (SfM) and R2D2 local features.
Two methods designed for landmark image retrieval, HOW and FIRe, were found to be state-of-the-art for localization. In TABLE 1, the results are presented for several Ret4Loc-HOW models trained under different setups. The results are reported for different flavors of Ret4Loc model training, with and without the use of synthetic variants and geometric consistency. In TABLE 1, VMix refers to the use of variant mixing (Equation 5) instead of summing the different losses (Equation 4). Similarly, Filt. refers to synthetic tuple filtering and GaS to geometry-aware sampling.
It can be observed that the basic Ret4Loc-HOW setup (row 2) already brings consistent gains over HOW. Using synthetic variants during training further improves performance even without any filtering (rows 3-4). In addition, mixing the variants generally improves performance (row 3 vs. row 4) in all datasets apart from RobotCar-Night. By inspecting the complete set of results on the six datasets, overall consistent gains can be observed when incorporating synthetic data during training. From rows 5 and 6, it appears that the top performance can be obtained in all datasets except RobotCar-Night by incorporating geometric information, either for filtering or for sampling the synthetic variants.
The alignment loss was measured between the feature of a real image and each of its 11 synthetic variants, and the uniformity loss was measured across all features. In
The devices and/or apparatuses described herein may be implemented through hardware components, software components, and/or a combination thereof. For example, a device may be implemented utilizing one or more general-purpose or special-purpose computers, such as processors, controllers, arithmetic and logic units (ALUs), application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), micro-controllers, microprocessors, programmable logic units (PLUs), or any other electronic device designed to perform the functions described above. The processing device may run an operating system (OS) and one or more software applications that run on the OS. The processing device also may access, store, manipulate, process, and create data in response to execution of the software. For simplicity, the description of a processing device is used as singular; however, one skilled in the art will appreciate that a processing device may include multiple processing elements and multiple types of processing elements. For example, a processing device may include multiple processors or a processor and a controller. In addition, different processing configurations are possible, such as parallel processors.
Furthermore, when implemented in software, firmware, middleware, scripting language, and/or microcode, the program code or code segments to perform the necessary tasks can be stored in a machine-readable medium such as a storage medium. The software may include a computer program, a piece of code, an instruction, or some combination thereof, for independently or collectively instructing or configuring the processing device to operate as desired. Software and data may be embodied permanently or temporarily in any type of machine, component, physical or virtual equipment, computer storage medium or device, or in a propagated signal wave capable of providing instructions or data to or being interpreted by the processing device. The software also may be distributed over network coupled computer systems so that the software is stored and executed in a distributed fashion. In particular, the software and data may be stored by one or more computer readable recording mediums.
Some embodiments of the present disclosure include a system including one or more data processors. In some embodiments, the system includes a non-transitory computer readable storage medium containing instructions which, when executed on the one or more data processors, cause the one or more data processors to perform part or all of one or more methods and/or part or all of one or more processes disclosed herein. Some embodiments of the present disclosure include a computer-program product tangibly embodied in a non-transitory machine-readable storage medium, including instructions configured to cause one or more data processors to perform part or all of one or more methods and/or part or all of one or more processes disclosed herein. The media may also include, alone or in combination with the program instructions, data files, data structures, and the like. The media may continuously store computer executable programs or may temporarily store the same for execution or download. Also, the media may be several types of recording or storage devices in a form in which one or a plurality of hardware components are combined. Without being limited to media directly connected to a computer system, the media may be distributed over the network. Examples of the media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROM and DVDs; magneto-optical media such as floptical disks; and hardware devices that are specially configured to store and perform program instructions, such as ROM, RAM, flash memory, and the like. Software codes can be stored in a memory that can be implemented within the processor or external to the processor. As used herein the term “memory” refers to any type of long term, short term, volatile, nonvolatile, or other storage medium and is not to be limited to any memory or number of memories, or type of media upon which memory is stored. Examples of a program instruction may include a machine language code produced by a compiler and a higher-level language code executable by a computer using an interpreter.
The terms and expressions which have been employed are used as terms of description and not of limitation, and there is no intention in the use of such terms and expressions of excluding any equivalents of the features shown and described or portions thereof, but it is recognized that various modifications are possible within the scope of the invention claimed. Thus, although the present invention as claimed has been specifically disclosed by embodiments and optional features, modification and variation of the concepts herein disclosed may be resorted to by those skilled in the art, and that such modifications and variations are considered to be within the scope of this invention as defined by the appended claims.
The present description provides preferred exemplary embodiments only, and is not intended to limit the scope, applicability, or configuration of the disclosure. Rather, the present description of the preferred exemplary embodiments will provide those skilled in the art with an enabling description for implementing various embodiments. It is understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope as set forth in the appended claims.
Specific details are given in the present description to provide a thorough understanding of the embodiments. However, it will be understood that the embodiments may be practiced without these specific details. For example, circuits, systems, networks, processes, and other components may be shown as components in block diagram form in order not to obscure the embodiments in unnecessary detail. In other instances, well-known circuits, processes, algorithms, structures, and techniques may be shown without unnecessary detail to avoid obscuring the embodiments.
This application claims the priority to and the benefit of U.S. Provisional Patent Application No. 63/614,113, filed on Dec. 22, 2023, which is hereby incorporated by reference in its entirety for all purposes.