The disclosed implementations relate generally to 3-D reconstruction and more specifically to systems and methods for generating dimensionally coherent training data.
3-D building models and visualization tools can produce significant cost savings. Using accurate 3-D models of properties, homeowners, for instance, can estimate and plan every project. With near real-time feedback, contractors could provide customers with instant quotes for remodeling projects. Interactive tools can enable users to view objects (e.g., buildings) under various conditions (e.g., at different times, under different weather conditions). 3-D models may be reconstructed from various input image data, but excessively large image inputs, such as video input, may require costly computing cycles and resources to manage, whereas image sets with sparse data fail to capture adequate information for realistic rendering or accurate measurements for 3-D models.
Traditional methods generate a dataset for interest point detection training by taking at least one image, annotating it with reference points, and applying a series of homographies to the image to produce warped/synthetic images with correspondingly warped reference points of the original image(s). The trained network may then recognize the same feature in multiple images when deployed in the field because it has learned the various ways a feature may appear from different perspectives. Feature detection across images is valuable for determining how a camera has changed position across images. By training a network with homographies of images, the network may learn, in a self-supervised fashion, more of the ways a given feature can appear. In other words, instead of receiving thousands of images of a feature generated and annotated by humans, the network is biased to spot and learn that a particular feature can look a certain way in millions of automatically generated views while “knowing” it is the same feature.
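By way of illustration only, this conventional 2D warping can be sketched in a few lines. The following is a minimal sketch assuming OpenCV and NumPy; the image file, the annotated reference points, and the jitter magnitude are hypothetical placeholders rather than values from this disclosure.

```python
import cv2
import numpy as np

# Sketch of the conventional 2D approach: warp one annotated image with a
# random homography so the same reference points appear from a synthetic view.
image = cv2.imread("house.jpg")                     # hypothetical source image
points = np.array([[120.0, 80.0], [340.0, 95.0]])   # hypothetical annotated reference points
h, w = image.shape[:2]

# Perturb the image corners to define a random homography.
src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
jitter = np.random.uniform(-0.1, 0.1, src.shape) * [w, h]
dst = (src + jitter).astype(np.float32)
H = cv2.getPerspectiveTransform(src, dst)

# Warp the image and its reference points with the same homography so the
# warped points remain the ground-truth annotations of the warped image.
warped_image = cv2.warpPerspective(image, H, (w, h))
warped_points = cv2.perspectiveTransform(points.reshape(-1, 1, 2), H).reshape(-1, 2)
# (warped_image, warped_points) forms one synthetic training pair.
```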
While these traditional methods can work for correspondence training where the feature should be similarly constrained across different views, there are several shortcomings. For example, the conventional methods only generate additional synthetic images in the same 2D data format, and do not convey or preserve additional details that would be beneficial for training a network expected for use in a 3D reconstruction pipeline. While geometries with only 2D descriptions (e.g., single planar surfaces) may suffice under such a training method, more complex geometries such as a 3D object (e.g., a house) are not well suited to this technique. A homography will not transform surfaces with different orientations while maintaining the ground truth relations between those surfaces. These techniques also treat the entire image equally with respect to its homography, without appreciating that only a certain object within an image may matter. Because the entire image is warped, the technique introduces variability for non-relevant data, or incoherence of the data in general. Warping a 2D image transforms all pixels according to the change in the degree of freedom; because all pixels reside on the 2D plane of the image, any 3D relationships of an object in the image are subsumed by the change. In other words, 3D geometries and relationships of non-coplanar surfaces that should not transform according to a 2D transform are inaccurately displayed by such a technique. This may lead to false positives or false negatives in real-world scenarios relying on a model trained in this way. For example, during training set creation a warped image may crop out portions of the object of interest, disproportionately show the sky, or minimize portions of an object (e.g., a house) observed in the image. Inappropriate cropping may also result; for example, an image rotated 45 degrees away from the render camera reduces the effective size of the image that camera can view, and the 2D image perturbation provides no information to “fill” the loss within the render camera's frustum. Taken together, the trained network is not exposed to realistic observations during training and is therefore prone to errors when evaluating data in real time after training.
Accordingly, there is a need for systems and methods for generating improved training data for feature matching algorithms. According to some implementations, 2D images are first transformed into a 3D visualization that preserves the original visual data and the spatial representations and relationships they captured. Training data may be generated efficiently by perturbing (e.g., warping) the 3D visualization to generate additional synthetic images or synthetic views.
Feature matching across images is valuable, as identifying matched features across images informs the transformation of camera positions between images, and therefore the camera pose(s). With camera poses solved for, it is possible to reconstruct in 3D coordinates the geometry of content within the 2D image at a given camera pose. For example, given an image of a building or a house, it is preferable to train a computer to identify many different points in this image, and many different points of that house from other camera positions, and then to determine which of the detected points are the same. Some implementations use networks trained on synthetic images with different perspectives of a given point or feature, such that the trained network can identify that same feature when it appears in a different camera's view, despite its different visual appearance from that view.
In some implementations, the problem of data incoherence while training a network by perturbing image sets is solved by instead training on perturbations of a 3D representation of an object created from the image set. Data incoherence may manifest as generating synthetic images or data that are not possible in the real world and therefore will never actually be viewed by an imager using the network trained on the incoherent synthetic data. Data incoherence may also manifest as decoupling spatial realities of the image content in order to produce a warped image for training (e.g., the homography of the image breaks an actual dimensional relationship between features of the imaged object). Networks trained on such data are therefore prone to inaccurate matches when deployed, as the training data may conflict with observed data. Maintaining spatial relationships of features in the images improves the network's ability to appropriately match observed features in a real-world setting. Stated differently, the problem of false positives or false negatives from a network trained on 2D data is solved by training for feature matches among spatially coherent perturbations of a 3D visual representation. The problem of variability in feature matching training from perturbed 2D images is solved by generating 3D visual representations of an object based on planar reconstruction of visual data. In some examples, the problem of occlusions interfering with generating robust visual data for a 3D visual representation is solved by generating sub-masks or visibility masks of façades within relevant images and applying texture data from images with un-occluded pixels of a respective sub-mask.
Though the techniques as described here relate to generating training data from and for building structures, the techniques may apply to other 3-D objects as well.
Systems, methods, devices, and non-transitory computer readable storage media for generating training data from and for building structures are disclosed.
(A1) In one aspect, a method is provided for generating training data of a building structure for a feature matching network. The method includes obtaining a geometric multi-dimensional model (sometimes referred to as a geometric model or a 3D model) of a building that includes a camera solution and a plurality of images taken by respective cameras at poses within the camera solution. In some implementations, the geometric model is a CAD model, or other model type depicting geometric relationships and figures but not visual information (e.g., a brick wall may appear as a blank rectangular plane or a rectangular plane “filled” or annotated with pixel information to simulate brick rather than using actual brick imagery such as from the plurality of images). The method also includes, for a plurality of façades of the model: applying a minimum bounding box to a respective façade to obtain a respective façade slice that is a 2-D plane represented in the 3-D coordinate system of the model; and projecting visual data of at least one camera in the camera solution that viewed the respective façade onto the respective façade slice. The method also includes photo-texturing the projected visual data on each façade slice to generate a visual 3-D representation of the building; and generating a training dataset by perturbing the visual 3-D representation.
(A2) In some implementations of A1, the method further includes determining cameras within the camera solution that viewed the respective façade by reprojecting boundaries of the minimum bounding box into one or more images of the plurality of images within the camera solution and detecting alignment of the bounding box with a façade. Because the bounding box is fit to a façade, it is represented in the same coordinate space as the model on which that façade lies; reprojecting the bounding box into other images of the camera solution associated with that model therefore necessarily uses the same coordinate system. A bounding box reprojected into an image that cannot view the façade the bounding box was fit to will not align to any geometry as viewed from that image.
(A3) In some implementations of any of A1-A2, a camera view of a respective façade includes one or more occlusions, such as from architectural elements of the building object itself. The method further includes generating cumulative texture information for the respective façade based on visual information of one or more additional cameras of the camera solution.
(A4) In some implementations of A3, generating the cumulative texture information includes projecting visual information from the plurality of images to a visibility mask for the respective façade slice, and aggregating the visual information.
(A5) In some implementations of any of A1-A2, projecting visual data of the cameras that viewed the respective façade includes: for each camera that viewed the respective façade, transforming image data of the respective façade to generate a respective morphed image, wherein a plane of the respective façade is oriented orthogonal to an optical axis of the respective camera; and merging visual data from each morphed image to generate cumulative visual data for the respective façade slice.
(A6) In some implementations of A5, transforming the image data of the respective façade uses a homogeneous transformation.
(A7) In some implementations of A5, generating the respective morphed image orients a plane of the respective façade orthogonal to an optical axis of a virtual camera viewing the transformed image data.
(A8) In some implementations of A5, merging the visual data includes: selecting a base template from amongst visibility masks for the respective façade, where the base template has the largest volume of observed pixels when compared to the other visibility masks; applying visual data from an image of a set of partially occluded images that shows the respective façade; and importing visual information for pixels from other images corresponding to unobserved area(s) of the base template.
(A9) In some implementations of any of A1-A2, the perturbing is performed relative to one or more cameras of the camera solution, each perturbation performed on the building in that position.
(A10) In some implementations of any of A1-A2, the perturbing is performed relative to one or more virtual cameras viewing the visual 3-D representation.
(A11) In some implementations of any of A1-A2, the perturbing includes one or more of: moving, rotating, zooming, and combinations thereof. A plurality of images of the model at perturbed positions, less the image information of the cameras within the camera solution, is taken to generate the training set of spatially coherent 3D visual representations of the building object. Common features across each perturbed image can then be identified, and a network trained to recognize feature matches across images given camera changes (simulated by the perturbed 3D representation).
(A12) In some implementations of any of A1-A2, the method further includes capturing new images of the visual 3-D representation from each perturbed position.
(A13) In some implementations of any of A1-A2, the method further includes performing feature matching by identifying common features across images of the training dataset for determining camera position transformations between images and camera poses.
(B1) In another aspect, a method is provided for generating training data for feature matching among images of a building structure. In some implementations, the training data is embedded with 3D spatial relationships among the geometry of the building structure. The method includes obtaining a geometric multi-dimensional model (sometimes referred to as a geometric model or a 3D model) of a building that includes a camera solution and a plurality of images taken by respective cameras at poses within the camera solution. In some implementations, the geometric model is a CAD model, or other model type depicting geometric relationships and figures but not visual information (e.g., a brick wall may appear as a blank rectangular plane or a rectangular plane “filled” or annotated with pixel information to simulate brick rather than using actual brick imagery such as from the plurality of images). The method also includes generating a façade slice for each of a plurality of façades of the model, and identifying each camera of the camera solution that observes or captured the respective façade. The method also includes, for each identified camera for a respective façade, identifying pixels in that camera's image that comprise visual data associated with the façade slice. The pixels may be identified according to a visibility mask, such as a reprojection of the façade slice as one or more segmentation masks for the image. The method also includes generating an aggregate façade view of cumulative visual information associated with the façade slice. The visual data (e.g., pixels) for the façade slice from each identified camera are combined and photo-textured to the façade slice, or to the geometric model, to generate a 3D representation of the building structure. The 3D representation can then be perturbed to a variety of positions and transforms, and additional images of the perturbed 3D representation taken at each perturbation; each additional image may be used as a training data image in a resultant training dataset.
(B2) In some implementations of B1, the façade slice is isolated as a 2D representation of the respective façade, and defined in the same 3D coordinate system as the 3D model.
(B3) In some implementations of B2, the façade is isolated to create a façade slice by applying a bounding box to the model for the respective façade and cropping the content within the bounding box. In some implementations, the bounding box is fit to an already isolated façade slice.
(B4) In some implementations of B3, cameras that observe a respective façade are identified by reprojecting a bounding box for the façade slice into the plurality of images and recording which cameras observe the reprojected bounding box.
(B5) In some implementations of any of B1-B4, the visibility mask is a classification indicating one or more classified pixels related to a façade slice.
(B6) In some implementations of any of B1-B4, the method for generating the aggregate façade view further includes generating one or more visual data templates using the images from the identified cameras that observe the respective façade. A visual data template may be a modified image that displays the visual data (e.g., the observed pixels within an image) according to a camera's visibility mask. In other words, the pixels of an image that coincide with classified pixels of a visibility mask are used to create the modified image.
(B7) In some implementations of B6, each visual data template is transformed to a common perspective.
(B8) In some implementations of B7, the transformation is a homogeneous transform that transforms the modified image such that its plane is orthogonal to an optical axis of a camera (e.g., virtual camera) viewing or displaying the image.
(B9) In some implementations of B7, aligning the visual data templates to a common perspective aligns associated bounding boxes for the visual data templates, such as the bounding box associated with the façade slice that governed the visibility mask for the visual data template.
(B10) In some implementations of any of B1-B4, a base visual data template is selected for a given façade or façade slice. In some implementations, a visual data template is selected as the base visual data template when it has the most observed pixels, e.g. its associated visibility mask has the highest quantity of classified observed pixels, of a given façade.
(B11) In some implementations of B10, the pixels from additional visual data templates are added to the base template. In this way, portions of a façade slice that are unobserved by the base visual data template are filled in by the pixels of the additional visual data templates.
(B12) In some implementations of any of B1-B4, perturbing the 3D model is performed by translating, rotating, zooming, or similar actions relative to a camera position of the camera solution.
(B13) In some implementations of any of B1-B4, perturbing the 3D model is performed by translating, rotating, zooming, or similar actions relative to a virtual camera position (e.g., a render camera position other than a position of a camera within the camera solution).
(B14) In some implementations of any of B1-B4, perturbing the 3D model is performed by translating, rotating, zooming, or combinations thereof.
(B15) In some implementations of any of B1-B4, feature matches across the captured images of a perturbed 3D representation that form the training dataset are obtained and form part of a data file for a trained network to use in a deployed setting.
(B16) In some implementations of any of B1-B4, each photo-textured façade slice is reassembled according to the 3D coordinates of the 3D model to create a 3D representation of the visual data captured by the images of the camera solution.
In another aspect, a computer system includes one or more processors, memory, and one or more programs stored in the memory. The programs are configured for execution by the one or more processors. The programs include instructions for performing any of the methods described herein.
In another aspect, a non-transitory computer readable storage medium stores one or more programs configured for execution by one or more processors of a computer system. The programs include instructions for performing any of the methods described herein.
Like reference numerals refer to corresponding parts throughout the drawings.
Reference will now be made to various implementations, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention and the described implementations. However, the invention may be practiced without these specific details or in alternate sequences or combinations. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the implementations.
An image capture device 104 communicates with the computing device 108 through one or more networks 110. The image capture device 104 provides image capture functionality (e.g., taking photos) and communications with the computing device 108. In some implementations, the image capture device is connected to an image preprocessing server system (not shown) that provides server-side functionality (e.g., preprocessing images, such as creating textures, storing environment maps (or world maps) and images, and handling requests to transfer images) for any number of image capture devices 104.
In some implementations, the image capture device 104 is a computing device, such as a desktop, laptop, smartphone, or other mobile device, from which users 106 can capture images (e.g., take photos), discover, view, edit, or transfer images. In some implementations, the users 106 are robots or automation systems that are pre-programmed to capture images of the building structure 102 at various angles (e.g., by activating the image capture device 104). In some implementations, the image capture device 104 is a device capable of (or configured to) capture images and generate (or dump) world map data for scenes. In some implementations, the image capture device 104 is an augmented reality camera or a smartphone capable of performing the image capture and world map generation functions. In some implementations, the world map data includes (camera) pose data, tracking states, or environment data (e.g., illumination data, such as ambient lighting).
In some implementations, a user 106 walks around a building structure (e.g., the house 102), and takes pictures of the building 102 using the device 104 (e.g., an iPhone) at different poses (e.g., the poses 112-2, 112-4, 112-6, 112-8, 112-10, 112-12, 112-14, and 112-16). Each pose corresponds to a different perspective or a view of the building structure 102 and its surrounding environment, including one or more objects (e.g., a tree, a door, a window, a wall, a roof) around the building structure. Each pose alone may be insufficient to generate a reference pose or reconstruct a complete 3-D model of the building 102, but the data from the different poses can be collectively used to generate reference poses and the 3-D model or portions thereof, according to some implementations. In some instances, the user 106 completes a loop around the building structure 102. In some implementations, the loop provides validation of data collected around the building structure 102. For example, data collected at the pose 112-16 is used to validate data collected at the pose 112-2.
At each pose, the device 104 obtains (118) images of the building 102, and/or world map data (described below) for objects (sometimes called anchors) visible to the device 104 at the respective pose. For example, the device captures data 118-1 at the pose 112-2, the device captures data 118-2 at the pose 112-4, and so on. As indicated by the dashed lines around the data 118, in some instances, the device fails to capture the world map data, illumination data, or images. For example, the user 106 switches the device 104 from a landscape to a portrait mode, or receives a call. In such circumstances of system interruption, the device 104 fails to capture valid data or fails to correlate data to a preceding or subsequent pose. Some implementations also obtain or generate tracking states (further described below) for the poses that signify continuity data for the images or associated data. The data 118 (sometimes called image related data 274) is sent to a computing device 108 via a network 110, according to some implementations.
Although the description above refers to a single device 104 used to obtain (or generate) the data 118, any number of devices 104 may be used to generate the data 118. Similarly, any number of users 106 may operate the device 104 to produce the data 118.
In some implementations, the data 118 is collectively a wide-baseline image set that is collected at sparse positions (or poses 112) around the building structure 102. In other words, the data collected may not be a continuous video of the building structure or its environment, but rather still images or related data with substantial rotation or translation between successive positions. In some implementations, the data 118 is a dense capture set, wherein the successive frames and poses 112 are taken at frequent intervals. Notably, in sparse data collection such as wide-baseline capture, there are fewer features common among the images, and deriving a reference pose is more difficult or not possible. Additionally, sparse collection also produces fewer corresponding real-world poses, and filtering these, as described further below, to candidate poses may reject too many real-world poses such that scaling is not possible.
The computing device 108 obtains the image-related data 274 (which may include a geometric model of the building that in turn includes a camera solution and images of the building from camera positions within the camera solution) via the network 110. Based on the data received, the computing device 108 generates training datasets of the building structure 102. As described below in reference to
The computer system 100 shown in
The communication network(s) 110 can be any wired or wireless local area network (LAN) or wide area network (WAN), such as an intranet, an extranet, or the Internet. It is sufficient that the communication network 110 provides communication capability between the image capture devices 104, the computing device 108, or external servers (e.g., servers for image processing, not shown). Examples of one or more networks 110 include local area networks (LAN) and wide area networks (WAN) such as the Internet. One or more networks 110 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VOIP), Wi-MAX, or any other suitable communication protocol.
The computing device 108 or the image capture devices 104 are implemented on one or more standalone data processing apparatuses or a distributed network of computers. In some implementations, the computing device 108 or the image capturing devices 104 also employ various virtual devices or services of third party service providers (e.g., third-party cloud service providers) to provide the underlying computing resources or infrastructure resources.
The memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. The memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. The memory 206, or alternatively the non-volatile memory within the memory 206, includes a non-transitory computer readable storage medium. In some implementations, the memory 206, or the non-transitory computer readable storage medium of the memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
The above description of the modules is only used for illustrating the various functionalities. In particular, one or more of the modules may be combined in larger modules to provide similar functionalities.
In some implementations, an image database management module (not shown) manages multiple image repositories, providing methods to access and modify image-related data that can be stored in local folders, NAS, or cloud-based storage systems. In some implementations, the image database management module can also search online/offline repositories. In some implementations, offline requests are handled asynchronously, with large delays of hours or even days if the remote machine is not enabled. In some implementations, an image catalog module (not shown) manages permissions and secure access for a wide range of databases.
Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, or modules, and thus various subsets of these modules may be combined or otherwise re-arranged in various implementations. In some implementations, memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.
Although not shown, in some implementations, the computing device 108 further includes one or more I/O interfaces that facilitate the processing of input and output associated with the image capture devices 104 or external server systems (not shown). One or more processors 202 obtain images and information related to images from image-related data 274 (e.g., in response to a request to generate training datasets for a building), processes the images and related information, and generates training datasets. I/O interfaces facilitate communication with one or more image-related data sources (not shown, e.g., image repositories, social services, or other cloud image repositories). In some implementations, the computing device 108 connects to image-related data sources through I/O interfaces to obtain information, such as images stored on the image-related data sources.
Memory 256 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 256, optionally, includes one or more storage devices remotely located from one or more processing units 122. Memory 256, or alternatively the non-volatile memory within memory 256, includes a non-transitory computer readable storage medium. In some implementations, memory 256, or the non-transitory computer readable storage medium of memory 256, stores the following programs, modules, and data structures, or a subset or superset thereof:
Examples of the image capture device 104 include, but are not limited to, a handheld computer, a wearable computing device, a personal digital assistant (PDA), a tablet computer, a laptop computer, a cellular telephone, a smartphone, an enhanced general packet radio service (EGPRS) mobile phone, a media player, a navigation device, a portable gaming console, a desktop computer, or a combination of any two or more of these data processing devices or other data processing devices. In some implementations, the image capture device 104 is an augmented-reality (AR)-enabled device that captures augmented reality maps (AR maps, sometimes called world maps). Examples include Android devices with ARCore, or iPhones with ARKit modules.
In some implementations, the image capture device 104 includes (e.g., is coupled to) a display 254 and one or more input devices (e.g., camera(s) or sensors 258). In some implementations, the image capture device 104 receives inputs (e.g., images) from the one or more input devices and outputs data corresponding to the inputs to the display for display to the user 106. The user 106 uses the image capture device 104 to transmit information (e.g., images) to the computing device 108. In some implementations, the computing device 108 receives the information, processes the information, and sends processed information to the display 116 or the display of the image capture device 104 for display to the user 106.
Some implementations generate a 3D representation of an object. Each pixel of the representation is based on at least one pixel from the 2D image(s) used to construct the 3D representation. Instead of, or in addition to, generating a 3D representation of an object based on geometry detected in a 2D image, some implementations generate a dataset for feature matching using the original visual data in addition to the geometric representation. By generating a 3D-accurate representation of the visual data, any perturbing and modification of the representation, such as rotating, or zooming in on a rotated view, that otherwise randomizes views for training, will not only focus the training on the object of relevance in the original 2D images, but also preserve the 3D relationships (e.g., 3D appearances) of all features and points for that object in the representation. Together, these lead to a stronger dataset, resulting in a more robust network when actually deployed, as described next. More accurate training data, meaning data that more accurately depicts real-world environments/scenes, leads to increased accuracy in the field. In other words, a camera is never going to view a warped façade in real life, so training data based on warped images could lead networks to predict false positives or false negatives when feature matching actual observed inputs, because the training data included inherently or purposefully inaccurate data.
Techniques described herein can be used to develop new feature matching techniques and to generate training data by perturbing (e.g., by warping) a 3D representation of an object to produce additional synthetic images in a cheap, fast, and geometrically accurate manner. Feature matching across images is useful because identifying matched features across images informs the transformation of camera positions between images, and therefore the camera pose(s). With camera poses solved for, the techniques can be used to reconstruct in 3D coordinates the geometry of content within the 2D image at a given camera pose.
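By way of illustration only, the following sketch shows how matched features between two images can inform the relative camera transformation, using off-the-shelf ORB features and essential-matrix decomposition in OpenCV. The file names and the intrinsic matrix K are assumptions, and this generic classical pipeline is shown only to illustrate the role of feature matches; it is not the disclosed training technique itself.

```python
import cv2
import numpy as np

# Two views of the same building; file names and intrinsics are illustrative.
img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)
K = np.array([[1000.0, 0, 640], [0, 1000.0, 360], [0, 0, 1]])  # assumed intrinsic matrix

# Detect and match features across the two images.
orb = cv2.ORB_create(2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)

pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# The essential matrix relates the matched points; decomposing it recovers the
# relative rotation R and (unit-scale) translation t between the two cameras.
E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
```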
Some implementations render the house in 3D using the cameras' visual data, such that a training set built from that 3D representation inherently comprises 3D data for its underlying features in addition to the visual appearance in such synthetic rendering, as opposed to a 2D feature in a synthetic image as in the prior art. This is performed by “texturing” (applying visual data of the image) a plurality of façade masks derived from a ground truth model.
Some implementations obtain a geometric multi-dimensional model of a building structure, where the model includes a camera solution and the image data used to generate that model. This is sometimes called the ground truth data. Some implementations apply a minimum bounding box to each façade in that model. The façade itself is a 2D plane, even though it lives in a 3D coordinate system. This isolates the façade.
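A minimal sketch of isolating a façade slice with an in-plane bounding box follows, assuming NumPy and a façade given as a planar polygon of 3D vertices. The function name and the axis-aligned (rather than minimum-area) box are simplifying assumptions for illustration.

```python
import numpy as np

def facade_bounding_box(vertices_3d):
    """Fit an in-plane bounding box to a planar facade.

    vertices_3d: (N, 3) array of the facade's corner points from the geometric
    model. Returns the four 3-D corners of the bounding box, which lie on the
    facade's plane; a minimum-area (rotated) box could be substituted for a
    tighter fit.
    """
    v = np.asarray(vertices_3d, dtype=float)
    origin = v[0]
    # Build an orthonormal basis (u, w) spanning the facade plane.
    u = v[1] - origin
    u /= np.linalg.norm(u)
    normal = np.cross(u, v[2] - origin)
    normal /= np.linalg.norm(normal)
    w = np.cross(normal, u)

    # Project every vertex onto the in-plane basis and take the 2-D extents.
    coords = (v - origin) @ np.column_stack([u, w])     # (N, 2) in-plane coordinates
    (umin, wmin), (umax, wmax) = coords.min(0), coords.max(0)

    # The facade slice is the rectangle [umin, umax] x [wmin, wmax], expressed
    # back in the model's 3-D coordinate system.
    corners_2d = np.array([[umin, wmin], [umax, wmin], [umax, wmax], [umin, wmax]])
    return origin + corners_2d @ np.vstack([u, w])
```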
Some implementations subsequently gather, and in some implementations project, visual data of cameras that viewed that façade for that façade slice. To ascertain which visual data to project, some implementations reproject the boundaries of the bounding box back into the images; this shows which images can “see” that façade. For example, bounding boxes 306, 308, and 310, each based on the bounding box that generated façade slice 304, are reprojected into the images as shown in
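The following sketch illustrates one way such a reprojection check could be implemented with OpenCV's projection utilities. The camera-solution data structure, the simple in-frame/in-front test, and the function name are assumptions for illustration rather than the disclosed method.

```python
import cv2
import numpy as np

def cameras_viewing_facade(bbox_corners_3d, camera_solution, image_size):
    """Return the cameras whose images contain the reprojected bounding box.

    camera_solution is assumed to be an iterable of (rvec, tvec, K, dist) tuples
    from the solved poses; this layout is an illustrative assumption. A camera
    that cannot see the facade projects the box outside its frame (or behind
    the camera), so it is filtered out here.
    """
    w, h = image_size
    viewers = []
    for cam_id, (rvec, tvec, K, dist) in enumerate(camera_solution):
        pts, _ = cv2.projectPoints(np.float32(bbox_corners_3d), rvec, tvec, K, dist)
        pts = pts.reshape(-1, 2)
        # Depth test: the box must lie in front of the camera, not behind it.
        R, _ = cv2.Rodrigues(rvec)
        depths = (R @ np.asarray(bbox_corners_3d, float).T + np.reshape(tvec, (3, 1)))[2]
        in_front = np.all(depths > 0)
        in_frame = np.all((pts[:, 0] >= 0) & (pts[:, 0] < w) &
                          (pts[:, 1] >= 0) & (pts[:, 1] < h))
        if in_front and in_frame:
            viewers.append(cam_id)
    return viewers
```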
Across all of these photos or images, each image may be subject to some occlusion, such that no single image provides a full, unobstructed view of the façade and all of its visual appearance (a view that is otherwise available in the synthetic façade slice taken from the geometric model). In other words, simply taking one image and reprojecting its visual data within the bounding box onto the façade slice may not account for every pixel of the façade slice. With a set of partially occluded images, it is necessary to generate cumulative texture information. Some implementations achieve this by accumulating visual information for a façade slice from at least one image. This may be done by photo-texturing a façade slice with visual information from an image and then assembling each photo-textured façade slice with other photo-textured façade slices in 3-D space according to the arrangement of façades in the original geometric 3-D model. In some implementations, this is performed by generating cumulative visual information for a façade slice according to visibility masks associated with the façade slice. In some implementations, the original geometric 3-D model is photo-textured with the cumulative visual information, e.g., to obtain an aggregate façade view, for each respective façade.
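As one illustrative sketch of the visibility-mask idea (observed façade pixels rendered white against a black mask), the code below rasterizes a reprojected façade into a binary mask with OpenCV. Occlusion by other surfaces is deliberately omitted here, as noted in the comments; the function name and signature are assumptions.

```python
import cv2
import numpy as np

def facade_visibility_mask(facade_corners_3d, rvec, tvec, K, dist, image_size):
    """Rasterize the reprojected facade into a binary visibility mask.

    A minimal sketch: the facade polygon is projected into the camera and
    filled as white pixels against a black mask. Occlusion by other facades is
    not handled here; a fuller implementation would also z-buffer every facade
    of the model and keep only the closest surface at each pixel.
    """
    w, h = image_size
    mask = np.zeros((h, w), dtype=np.uint8)
    pts, _ = cv2.projectPoints(np.float32(facade_corners_3d), rvec, tvec, K, dist)
    polygon = np.round(pts.reshape(-1, 2)).astype(np.int32)
    cv2.fillPoly(mask, [polygon], 255)   # observed facade pixels are white (255)
    return mask
```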
In some implementations, a façade slice is reprojected in an image, as a visibility mask for the façade in question, and may be within a broader segmentation mask of the image. In some implementations, only when a façade has line of sight to the image's render camera will it appear within the segmentation mask (e.g. as white against the broader black segmentation mask in the illustrated examples). For example,
After a façade's visibility is determined with respect to a frame, the image data of a particular image observing that façade may be applied. In some implementations, pixels in an image that correspond to pixels in a visibility mask are used to generate the visual data for a façade slice. In some implementations, the visual data is transformed to a common perspective. In some implementations, the transformation is according to the bounding box of each image observing the façade, such that the transformation aligns the boundaries of the respective bounding boxes. In some implementations, the transformation is such that the plane of the façade is orthogonal to the render camera's optical axis. Some implementations use a homogeneous transformation for this purpose, more specifically a homography.
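A minimal sketch of rectifying one camera's view of a façade to a common fronto-parallel perspective follows, assuming OpenCV. The corner ordering, the output size, and the function name are illustrative assumptions.

```python
import cv2
import numpy as np

def rectify_facade_view(image, mask, projected_bbox_corners, out_size=(512, 512)):
    """Warp one camera's view of a facade to a common fronto-parallel frame.

    projected_bbox_corners: the four reprojected bounding-box corners in this
    image, ordered top-left, top-right, bottom-right, bottom-left (an assumed
    convention). The homography maps them to a canonical rectangle, so every
    camera's visual data template ends up aligned to the same perspective, with
    the facade plane orthogonal to the (virtual) render camera's optical axis.
    """
    out_w, out_h = out_size
    src = np.float32(projected_bbox_corners)
    dst = np.float32([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]])
    H = cv2.getPerspectiveTransform(src, dst)

    # Warp both the image and its visibility mask with the same homography so
    # observed/unobserved pixels stay registered to the rectified template.
    rectified = cv2.warpPerspective(image, H, out_size)
    rectified_mask = cv2.warpPerspective(mask, H, out_size)
    rectified[rectified_mask == 0] = 0   # keep only pixels observed by this camera
    return rectified, rectified_mask
```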
The visual data from each of these images are now merged. Some implementations choose the visibility mask described above that has the largest observed pixel volume for the façade or façade slice as a base template. For example,
In some implementations, pixels from the other images that would be in the unobserved area of the base template are applied or merged. For example, in
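The merging step described above might look like the following NumPy sketch, in which the base template is the rectified view whose visibility mask has the most observed pixels, and remaining holes are filled from the other aligned templates. The function name and data layout are assumptions for illustration.

```python
import numpy as np

def merge_facade_templates(templates, masks):
    """Merge rectified visual-data templates for one facade slice.

    templates: list of aligned, rectified images of the same shape.
    masks: their visibility masks, nonzero where the facade was observed.
    The template whose mask has the most observed pixels becomes the base, and
    pixels it did not observe are filled in from the remaining templates.
    """
    counts = [np.count_nonzero(m) for m in masks]
    base_idx = int(np.argmax(counts))
    merged = templates[base_idx].copy()
    observed = masks[base_idx] > 0

    for i, (template, mask) in enumerate(zip(templates, masks)):
        if i == base_idx:
            continue
        fill = (~observed) & (mask > 0)   # pixels unobserved so far but seen here
        merged[fill] = template[fill]
        observed |= fill
    return merged, observed
```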
This 3D representation may be moved around, rotated, and/or zoomed, relative to the render camera or display camera (e.g., a virtual camera). In some implementations, a real camera position from the camera solution is used. A plurality of images of the house in that position may be captured at each perturbation. Examples in
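For illustration, a perturbation of the render camera relative to the photo-textured representation could be generated as in the sketch below, assuming OpenCV/NumPy camera conventions (rvec/tvec extrinsics and a floating-point intrinsic matrix K). The perturbation magnitudes are arbitrary assumptions, and rendering the perturbed view is left to any renderer of choice.

```python
import cv2
import numpy as np

def perturb_camera(rvec, tvec, K, rot_deg=10.0, trans_frac=0.05, zoom_range=0.15):
    """Produce one perturbed render-camera pose around a base pose.

    The perturbation rotates, translates, and zooms the render camera by small
    random amounts; the 3-D photo-textured representation itself is untouched,
    so every synthetic view remains spatially coherent. Rendering the
    representation from the returned pose yields one image of the training set.
    """
    # Small random rotation composed with the base rotation.
    axis = np.random.randn(3)
    axis /= np.linalg.norm(axis)
    angle = np.deg2rad(np.random.uniform(-rot_deg, rot_deg))
    dR, _ = cv2.Rodrigues((axis * angle).reshape(3, 1))
    R, _ = cv2.Rodrigues(rvec)
    new_rvec, _ = cv2.Rodrigues(dR @ R)

    # Small translation jitter proportional to the camera's distance.
    new_tvec = tvec + np.linalg.norm(tvec) * trans_frac * np.random.randn(*np.shape(tvec))

    # Zoom by scaling the focal lengths of the intrinsic matrix.
    scale = 1.0 + np.random.uniform(-zoom_range, zoom_range)
    new_K = K.copy()
    new_K[0, 0] *= scale   # fx
    new_K[1, 1] *= scale   # fy
    return new_rvec, new_tvec, new_K
```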
Referring now back to
In some implementations, the perturbing is performed (416) relative to one or more cameras of the camera solution and one or more images of the plurality of images, at each perturbation performed on the building in that position. In some implementations, the perturbing includes (418) one or more of: moving, rotating, zooming, and combinations thereof.
In some implementations, the method further includes determining (410) cameras that viewed the respective façade by reprojecting boundaries of the minimum bounding box into one or more images of the plurality of images.
Referring next to
Referring next to
Referring next to
In this way, the techniques provided herein generate training data for feature matching algorithms.
The foregoing description, for purpose of explanation, has been described with reference to specific implementations. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The implementations were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various implementations with various modifications as are suited to the particular use contemplated.
This application claims priority to U.S. Provisional Patent Application No. 63/313,257, filed Feb. 23, 2022, entitled “Systems and Methods for Generating Dimensionally Coherent Training Data,” which is incorporated by reference herein in its entirety.
International Application: PCT/US2023/013749, filed Feb. 23, 2023 (WO).
Related Provisional Application: No. 63/313,257, filed Feb. 2022 (US).