The present disclosure relates to methods, storage media, and systems for generating a three-dimensional model associated with a building.
Three-dimensional models of a building may be generated based on two-dimensional digital images taken of the building. The digital images may be taken via aerial imagery, specialized-camera equipped vehicles, or by a user with a camera from a ground-level perspective when the images meet certain conditions. The three-dimensional building model is a digital representation of the physical, real-world building. An accurate three-dimensional model may be used to derive various building measurements or to estimate design and renovation costs.
However, generating an accurate three-dimensional model of a building from two-dimensional images, one that is useful for deriving building measurements, can require significant time and resources. Current techniques are computationally expensive and prone to false positives and false negatives when matching features across the two-dimensional digital images used. Thus, current techniques prevent an end-user from rapidly obtaining such a three-dimensional model.
One aspect of the present disclosure relates to a method for generating three-dimensional data. The method comprises obtaining a plurality of images depicting a building, wherein individual images are taken at individual positions about an exterior of the building, and wherein the images are associated with camera properties reflecting extrinsic and/or intrinsic camera parameters; determining, via a machine learning model, semantic labels for the images which are associated with a plurality of elements of the building, wherein the semantic labels are associated with two-dimensional positions in the images; estimating three-dimensional positions associated with the plurality of elements, wherein estimating is based on the camera properties and one or more epipolar constraints; and generating a three-dimensional representation of at least a portion of the building with the estimated three-dimensional positions.
One aspect of the present disclosure relates to a system comprising one or more processors and non-transitory computer storage media storing instructions that when executed by the one or more processors, cause the processors to perform operations. The operations comprise obtaining a plurality of images depicting a building, wherein individual images are taken at individual positions about an exterior of the building, and wherein the images are associated with camera properties reflecting extrinsic and/or intrinsic camera parameters; determining, via a machine learning model, semantic labels for the images which are associated with a plurality of elements of the building, wherein the semantic labels are associated with two-dimensional positions in the images; estimating three-dimensional positions associated with the plurality of elements, wherein estimating is based on the camera properties and one or more epipolar constraints; and generating a three-dimensional representation of at least a portion of the building, with the estimated three-dimensional positions.
One aspect of the present disclosure relates to non-transitory computer storage media storing instructions that when executed by a system of one or more processors, cause the processors to perform operations. The operations comprise obtaining a plurality of images depicting a building, wherein individual images are taken at individual positions about an exterior of the building, and wherein the images are associated with camera properties reflecting extrinsic and/or intrinsic camera parameters; determining, via a machine learning model, semantic labels for the images which are associated with a plurality of elements of the building, wherein the semantic labels are associated with two-dimensional positions in the images; estimating three-dimensional positions associated with the plurality of elements, wherein estimating is based on the camera properties and one or more epipolar constraints; and generating a three-dimensional representation of at least a portion of the building with the estimated three-dimensional positions.
One aspect of the present disclosure relates to a method for confirming a semantic label prediction in an image. The method comprises obtaining a plurality of images depicting a scene, wherein individual images comprise co-visible aspects with at least one other image in the plurality of images, and wherein the images are associated with camera properties reflecting extrinsic and/or intrinsic camera parameters; determining, via a machine learning model, semantic labels for the images which are associated with a plurality of elements of the scene, wherein the semantic labels are associated with two-dimensional positions in the images; and validating a first semantic label in a first image by satisfying an epipolar constraint of the first semantic label according to the first semantic label in a second image and satisfying the epipolar constraint of the first semantic label in the second image according to the first semantic label in the first image.
One aspect of the present disclosure relates to non-transitory computer storage media storing instructions that when executed by a system of one or more processors, cause the processors to perform operations. The operations comprise obtaining a plurality of images depicting a scene, wherein individual images comprise co-visible aspects with at least one other image in the plurality of images, and wherein the images are associated with camera properties reflecting extrinsic and/or intrinsic camera parameters; determining, via a machine learning model, semantic labels for the images which are associated with a plurality of elements of the scene, wherein the semantic labels are associated with two-dimensional positions in the images; and validating a first semantic label in a first image by satisfying an epipolar constraint of the first semantic label according to the first semantic label in a second image and satisfying the epipolar constraint of the first semantic label in the second image according to the first semantic label in the first image.
One aspect of the present disclosure relates to a method for generating three-dimensional data. The method comprises obtaining a plurality of images depicting an object, wherein individual images are taken at individual positions about the object, and wherein the images are associated with camera properties reflecting extrinsic or intrinsic camera parameters; determining, via a machine learning model, semantic labels for the images which are associated with a plurality of elements of the object, wherein the semantic labels are associated with two-dimensional positions in the images; estimating three-dimensional positions associated with the plurality of elements, wherein estimating is based on camera properties of a selected camera pair from the plurality of images validated by a visual property of an image associated with a nonselected camera from the plurality of images and one or more epipolar constraints; and generating a three-dimensional representation of at least a portion of the object with the estimated three-dimensional positions.
One aspect of the present disclosure relates to one or more non-transitory storage media storing instructions that when executed by a system of one or more processors, cause the processors to perform operations. The operations comprise obtaining a plurality of images depicting an object, wherein individual images are taken at individual positions about the object, and wherein the images are associated with camera properties reflecting extrinsic and/or intrinsic camera parameters; determining, via a machine learning model, semantic labels for the images which are associated with a plurality of elements of the object, wherein the semantic labels are associated with two-dimensional positions in the images; estimating three-dimensional positions associated with the plurality of elements, wherein estimating is based on camera properties of a selected camera pair from the plurality of images validated by a visual property of an image associated with a nonselected camera from the plurality of images and one or more epipolar constraints; and generating a three-dimensional representation of at least a portion of the object with the estimated three-dimensional positions.
These and other features, and characteristics of the present technology, as well as the methods of operation and functions of the related elements of structure and the combination of parts and economies of manufacture, will become more apparent upon consideration of the following description and the appended claims with reference to the accompanying drawings, all of which form a part of this specification, wherein like reference numerals designate corresponding parts in the various figures. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended as a definition of the limits of the invention. As used in the specification and in the claims, the singular form of ‘a’, ‘an’, and ‘the’ include plural referents unless the context clearly dictates otherwise.
In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be appreciated, however, that the present disclosure may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present disclosure.
This specification describes techniques to generate a three-dimensional model of at least a portion of a building, such as a home or other dwelling. As used herein, the term building refers to any three-dimensional object, man-made or natural. Buildings may include, for example, houses, offices, warehouses, factories, arenas, and so on. As will be described, images of the building may be obtained, such as via a user device (e.g., a smartphone, tablet, camera), as an end-user moves about an exterior of the building. Thus, the images may be taken from different vantage points about the building. Analysis techniques, such as machine learning techniques, may then be used to label elements depicted in the images. Example elements may include roof elements such as eaves, ridges, rakes, and so on. Correspondences between depictions of these elements may then be used to generate the three-dimensional model. As will be described, the model may be analyzed to inform building measurements (e.g., roof facet pitch, roof facet area, and so on).
One example technique to generate three-dimensional models of buildings relies upon matching features between images using descriptor-based matching. For this example technique, descriptors such as scale-invariant feature transform (SIFT) descriptors may be used to detect certain elements in the images. These SIFT descriptors may then be matched between images to identify portions of the images which are similar in appearance.
As may be appreciated, there may be distinctions in the appearance of a same building-element (e.g., an apex of a roof) in images due to variations in the image viewpoint, variations in lighting, and so on. Due to these distinctions, incorrect matches may be identified. Additionally, there may be a multitude of candidate matches with no clear way to identify a correct match. Using such descriptor-based matching may therefore require substantial post-processing techniques to filter incorrect matches. Since descriptor-based matching relies upon appearance-based matching, descriptor-based matching may be an inflexible technique to determine correspondence between images which may lead to inaccuracies in three-dimensional model generation.
In contrast, the techniques described herein leverage a machine learning model to classify, or otherwise label, building-specific elements in images. For example, the machine learning model may be trained to output building-specific labels to portions of an input image. As an example, a forward pass through the machine learning model may output a two-dimensional image position associated with a particular class. As another example, the model may output a bounding box about a portion of an image which is associated with a class. As another example, the output may reflect an assignment of one or more image pixels as forming part of a depiction of a building-specific element. As another example, the machine learning model may generate a mask which identifies a building-specific element (e.g., a contour or outline of the element).
Since the machine learning model may be trained using thousands or millions of training images and labels, the model may be resilient to differences in appearances of building-specific elements. Additionally, the machine learning model may accurately label the same building-specific element across images with lighting differences, differences in perspective, and so on. In this way, the labels may represent viewpoint invariant descriptors which may reliably characterize portions of images as depicting specific building-specific elements.
Due to complexity of certain buildings, such as the complexity of roofs with multiple roof facets, there may be a multitude of a certain type of building-specific element depicted in an image. Thus, when identifying and matching a location of a building-specific element in the image with another location of the element in a different image, there may be multiple potential label matches. As will be described, epipolar matching may be utilized to refine the potential label matches. For example, the epipolar matching may use intrinsic and/or extrinsic camera parameters associated with the images. In this example, the same building-specific element may be identified between images using epipolar geometry.
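Merely for illustrative purposes, the following is a minimal sketch, assuming calibrated pinhole cameras with known world-to-camera poses, of how such an epipolar constraint may be evaluated between two labeled 2D positions; the function and variable names are illustrative and not part of the disclosure:

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix so that skew(v) @ w == np.cross(v, w)."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def fundamental_from_poses(K1, R1, t1, K2, R2, t2):
    """Fundamental matrix mapping points in image 1 to epipolar lines in image 2.

    Poses are world-to-camera: x_cam = R @ X_world + t.
    """
    R = R2 @ R1.T                 # relative rotation, camera 1 -> camera 2
    t = t2 - R @ t1               # relative translation
    E = skew(t) @ R               # essential matrix
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)

def epipolar_distance(F, x1, x2):
    """Pixel distance (in image 2) from x2 to the epipolar line of x1."""
    l = F @ np.array([x1[0], x1[1], 1.0])     # line a*x + b*y + c = 0
    return abs(l @ np.array([x2[0], x2[1], 1.0])) / np.hypot(l[0], l[1])
```

A small epipolar distance between a labeled position in one image and the epipolar line of a same-labeled position in another image indicates that the two labels may correspond to the same real-world element.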
In some embodiments, there may be a substantial number of images of a building. For example, there may be 4, 5, 7, 12, and so on, images of the building taken at various positions about the exterior of the building. A subset of the images may depict the same building-specific element. For example, a particular roof feature may be visible in the front of the building, with the subset depicting the front. To determine a three-dimensional location of the building-specific element, and as will be described, a reprojection technique may be employed. For example, a three-dimensional location for the element may be determined using a first image pair of the subset. This location may then be reprojected into remaining images of the subset. As an example, the location may be identified in a third image of the subset. A reprojection error may then be determined between that location in the third image and a portion of the third image labeled as depicting the element. Similarly, reprojection errors may be determined for all, or a portion of, the remaining images in the subset.
A sum, or combination, of the above-described reprojection errors may be determined for each image pair. That is, the sum may reflect the reprojection error associated with a three-dimensional location of a particular building-specific element as determined from each image pair. In some embodiments, the image pair, and thus the three-dimensional location of the element, associated with the lowest reprojection error may be selected for the three-dimensional model.
In this way, three-dimensional locations of building-specific elements may be determined. The elements may be connected, in some embodiments, to form the three-dimensional model of at least a portion of the building. In some embodiments, logic (e.g., domain-specific logic associated with buildings) may be used. As an example, a system may form a roof ridge by connecting two apex points. As another example, the system may connect an eave between two eave end points. As another example, in some embodiments, an eave or ridge line may have no, or a small, z-axis change. Thus, if an eave or ridge line has an angle, the system may cancel it (e.g., remove it) from the model. Optionally, camera information may be used to align the model geometry. For example, the z-axis may correspond to a gravity vector.
The above and additional disclosure will now be described in more detail.
At block 102, one or more images are received or accessed. As described above, the images may depict an exterior of a building (e.g., a home). The images may be obtained from cameras positioned at different locations, or differently angled at a same location, about the exterior. For example, the images may depict a substantially 360-degree view of the building. As another example, the images may depict a front portion of the building from different angles. The images may optionally be from a similar distance to the building, such as a center of the building (e.g., the images may be obtained from a circle surrounding the building). The images may also be from different distances to the building, such as illustrated in
A data capture device, such as a smartphone or a tablet computer, can capture the images. Other examples of data capture devices include drones and aircraft. The images can include ground-level images, aerial images, or both. The aerial images can include orthogonal images, oblique images, or both. The images can be stored in memory or in storage.
The images, in some embodiments, may include information related to camera extrinsics (e.g., pose of the data capture device, including position and orientation, at the time of image capture), camera intrinsics (e.g., camera constant, scale difference, focal length, and principal point), or both. The images can include image data (e.g., color information) and depth data (e.g., depth information). The image data can be from an image sensor, such as a charge coupled device (CCD) sensor or a complementary metal-oxide semiconductor (CMOS) sensor, embedded within the data capture device. The depth data can be from a depth sensor, such as a LiDAR sensor or time-of-flight sensor, embedded within the data capture device.
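For illustration, the intrinsic parameters listed above may be arranged into the standard 3x3 pinhole matrix used in the projection and epipolar computations described later; the following minimal sketch uses illustrative values that are not part of the disclosure:

```python
import numpy as np

def intrinsic_matrix(fx, fy, cx, cy, skew=0.0):
    """3x3 pinhole intrinsic matrix from focal lengths (in pixels) and principal point."""
    return np.array([[fx, skew, cx],
                     [0.0,  fy, cy],
                     [0.0, 0.0, 1.0]])

# Example: a smartphone-style camera as reported by the data capture device.
K = intrinsic_matrix(fx=3000.0, fy=3000.0, cx=2016.0, cy=1512.0)
```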
Referring back to
The segmented image can include one or more semantically labeled elements which describe a two-dimensional (2D) position (e.g., X, Y). For example, the 2D position of a roof apex may be determined. As another example, the 2D positions associated with an eave or ridge may be determined. In some embodiments, the 2D positions may represent eave endpoints of an eave (e.g., an eave line or segment) or ridge endpoints of a ridge (e.g., a ridge line or segment). The labeled elements can also describe a segment (e.g., (X1, Y1) to (X2, Y2)), or polygon (e.g., area) for classified elements within the image, and associated classes (e.g., data related to the classified elements). Thus, for certain element classes the segmentation may indicate two-dimensional positions associated with locations of the element classes. As an example, and as described above, an element class may include an eave point (e.g., eave endpoint). For this example, the two-dimensional location of the eave point may be determined (e.g., a center of a bounding box about the eave point). For other element classes the segmentation may indicate a segment and/or area (e.g., portion of an image). For example, a gable may be segmented as a segment in some embodiments. As another example, a window may be segmented as an image area.
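One possible, merely illustrative, data structure for carrying such labeled elements (point-like, line-like, or area-like) through the later matching steps is sketched below; the field names are assumptions and not part of the disclosure:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point2D = Tuple[float, float]

@dataclass
class LabeledElement:
    """A semantically labeled element detected in a single image."""
    image_id: str
    element_class: str                                   # e.g. "apex", "eave_endpoint", "ridge"
    point: Optional[Point2D] = None                      # point-like classes (apex, eave endpoint)
    segment: Optional[Tuple[Point2D, Point2D]] = None    # line-like classes (eave, rake, ridge)
    polygon: Optional[List[Point2D]] = None              # area-like classes (window, roof facet)
    confidence: float = 1.0                              # likelihood the label belongs to the class
```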
In some embodiments, each semantically labeled element is a viewpoint invariant descriptor when such element is visible across multiple images and is appropriately constrained by rotational relationships such as epipolar geometry. In some embodiments, each semantically labeled element can include a probability or confidence metric that describes the likelihood that the semantically labeled element belongs to the associated class.
As will be described, a machine learning model may be used to effectuate the segmentation. For example, the machine learning model may include a convolutional neural network which is trained to label portions of images according to the above-described classifications. The system may compute a forward pass through the model and obtain output reflecting the segmentation of the image into different classifications. As described above, the output may indicate a bounding box about a particular classified element. The output may also identify pixels which are assigned as forming a particular classified element. The output, in some embodiments, may be an image or segmentation mask which identifies the particular classified element.
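The disclosure does not require any particular network architecture. Merely for illustrative purposes, the following sketch (using PyTorch, with illustrative layer sizes) shows a forward pass through a small fully convolutional model that outputs one channel of logits per element class; thresholding the per-channel probabilities yields segmentation masks:

```python
import torch
import torch.nn as nn

class ElementSegmenter(nn.Module):
    """Minimal fully convolutional segmenter: one output channel per element class."""
    def __init__(self, num_classes: int = 6):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(64, num_classes, kernel_size=1)   # per-class logits

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W) -> logits: (B, num_classes, H, W)
        return self.head(self.backbone(image))

model = ElementSegmenter(num_classes=6)
logits = model(torch.rand(1, 3, 256, 256))        # forward pass through the model
masks = torch.sigmoid(logits) > 0.5               # per-channel segmentation masks
```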
Among the segmentation channels 404 are rakes (e.g., lines culminating in apexes on roofs), eaves (e.g., lines running along roof edges distal to a roof's ridge), posts (e.g., vertical lines of facades such as at structure corners), fascia (e.g., structural elements following eaves), and soffit (e.g., the surface of a fascia that faces the ground). Many more sub-elements, and therefore channels, are possible; ridge lines, apex points, and surfaces are part of a non-exhaustive list.
In some embodiments, the segmentation channels may be aggregated. For example, knowing that a sub-structure such as a gable is a geometric or structural representation of architectural features such as rakes and posts, a new channel may be built that is a summation of the output of the rake channel and the post channel, resulting in a representation similar to mask 306 of
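For example, such an aggregated channel may be built by a per-pixel summation of the constituent channels. The following is a minimal sketch; clipping to [0, 1] is merely one possible normalization:

```python
import numpy as np

def aggregate_channels(rake_channel: np.ndarray, post_channel: np.ndarray) -> np.ndarray:
    """Build a 'gable' channel by summing rake and post activations, clipped to [0, 1]."""
    return np.clip(rake_channel + post_channel, 0.0, 1.0)
```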
In some embodiments, a channel is associated with an activation map for data in an image (pre- or post-capture) indicating a model's prediction that a pixel in the image is attributable to a particular classification of a broader segmentation mask. The activation maps are, then, an inverse function of a segmentation mask trained for multiple classifications. By selectively isolating or combining single activation maps, new semantic information, masks, and bounding boxes can be created for sub-structures or sub-features in the scene within the image.
As described above, a machine learning model may be used to segment an image. For example, a neural network may be used. In some embodiments, the neural network may be a convolutional neural network which includes a multitude of convolutional layers optionally followed by one or more fully-connected layers. The neural network may effectuate the segmentation, such as via outputting channels or subchannels associated with individual classifications.
Use of a neural network enables representations across an input image to influence prediction of related classifications, while still maintaining one or more layers, or combinations of filters or kernels, which are optimized for a specific classification. In other words, a joint prediction of multiple classes is enabled by this system. While the presence of points and lines within an image can be detected, shared representations across the network's layers can lend themselves to more specific predictions; for example, two apex points connected by lines can predict or infer a rake more directly with the spatial context of the constituent features. In some embodiments, each subchannel in the final layer output is compared during training to a ground truth image of those same classified labels and any error in each subchannel is propagated back through the network. This results in a trained model that outputs N channels of segmentation masks corresponding to target labels of the aggregate mask. Merely for illustrative purposes, the six masks depicted among the segmentation channels 404 reflect a six-classification output of such a trained model.
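As one illustration of the per-subchannel comparison described above, the error for each channel may be computed against its ground-truth mask and summed before backpropagation. The sketch below uses PyTorch and a binary cross-entropy loss, which is merely one possible choice and not mandated by the disclosure:

```python
import torch
import torch.nn.functional as F

def per_channel_loss(logits: torch.Tensor, ground_truth: torch.Tensor) -> torch.Tensor:
    """Compare each predicted subchannel to its ground-truth mask and sum the errors.

    logits, ground_truth: tensors of shape (B, N, H, W), N being the number of target labels.
    """
    per_channel = F.binary_cross_entropy_with_logits(
        logits, ground_truth, reduction="none").mean(dim=(0, 2, 3))   # one error per channel
    return per_channel.sum()   # backpropagating this propagates each channel's error
```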
In some embodiments, output from the machine learning model is further refined using filtering techniques. Keypoint detection such as the Harris corner algorithm, line detection such as Hough transforms, or surface detection such as concave hull techniques can clean noisy output.
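Merely as an example of such filtering, and assuming OpenCV is available, noisy point channels and line channels may be cleaned along the following lines; the threshold and parameter values are illustrative assumptions:

```python
import cv2
import numpy as np

def clean_point_channel(prob_map: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Sharpen a noisy point channel (e.g., apexes) with a Harris corner response."""
    mask = (prob_map > threshold).astype(np.float32)
    response = cv2.cornerHarris(mask, blockSize=2, ksize=3, k=0.04)
    return response > 0.01 * response.max()

def clean_line_channel(prob_map: np.ndarray, threshold: float = 0.5):
    """Fit line segments (e.g., eaves, rakes) to a noisy line channel with a Hough transform."""
    mask = ((prob_map > threshold) * 255).astype(np.uint8)
    return cv2.HoughLinesP(mask, rho=1, theta=np.pi / 180, threshold=50,
                           minLineLength=30, maxLineGap=10)
```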
Referring to
As discussed above, the segmented element 504 output may be grouped with other such elements or refined representations and applied to a scene. Grouping logic is configurable for desired sub-structures, architectural features, or architectural sub-features. For example, a rake output combined with a post output can produce a gable output, despite no specific output for that type of sub-structure.
Referring back to
As another illustrative example,
In some embodiments, grouping of architectural features or architectural sub-features may be configurable or automated. Users may select broad categories for groups (such as gable or roof) or configure unique groups based on use case. As the activation maps represent low order components, configuration of unique groups comprising basic elements, even structurally unrelated elements, can enable more responsive use cases. Automated grouping logic may be done with additional machine learning techniques. Given a set of predicted geometric constraints, such as lines or points generally or classified lines or points, a trained neural network can output grouped structures (e.g., primitives) or sub-structures.
Whereas the House Elements head of network 800 may use a combination of transpose convolution layer and upsampling layer, the House Structures head may use a series of fully connected (‘fc’) layers to identify structural groupings within an image. This output may be augmented with the House Elements data, or the activation map data from the previously discussed network (e.g., network 802), to produce classified data within a distinct group. In other words, the R-CNN network 800 can discern multiple subcomponents or sub-structures within a single parent structure to avoid additional steps to group these subcomponents or sub-structures after detection into an overall target.
The above avoids fitting a bounding box for all primitives or sub-structures and distinguishes to which sub-structure any one architectural feature or architectural sub-feature may group. As an example which uses gable detection as an illustrative use case, the R-CNN network 800 may identify a cluster of architectural features first and then assign them as grouped posts to appropriate rakes to identify distinct sub-structures comprising those features, as opposed to predicting that all rakes and posts in an image indicate “gable pixels.”
Specifically,
Classical feature detection such as scale-invariant feature transform (SIFT), features from accelerated segment test (FAST), speeded up robust features (SURF), binary robust independent elementary features (BRIEF), oriented FAST and rotated BRIEF (ORB), SuperPoint, or their combinations rely on the visual appearance of a particular element, which can substantially change across images and degrade detection and matching. This problem is compounded with wide baseline changes that impart ever-increasing viewpoint changes of a scene, while introducing additional scene variables like lighting variability in addition to rotation changes. For example, traditional feature detection for a circular object may be trained to identify the center of the circle. In this example, as a camera undergoes rotational change the circular object as depicted in one image will gradually be depicted as an oval, then as an ellipse, and finally more like a line in images obtained by the rotating camera. Thus, the feature descriptor becomes weaker and weaker, and false positives more likely, or feature matching may otherwise outright fail. If new visual information, or occluding objects with similar feature appearances as to the feature from the previous frame, were to enter the field of view, the feature matching may be prone to make false positive matches.
Referring back to
In some embodiments, the camera poses can be generated or updated based on a point cloud or a line cloud. The point cloud can be generated based on one or more of the images, the camera intrinsics, and the camera extrinsics. The point cloud can represent co-visible points across the images in a three-dimensional (3D) coordinate space. The point cloud can be generated by utilizing one or more techniques, such as, for example, structure-from-motion (SfM), multi-view stereo (MVS), simultaneous localization and mapping (SLAM), and the like. In some embodiments, the point cloud is a line cloud. A line cloud is a set of data line segments in a 3D coordinate space. Line segments can be derived from points using one or more techniques, such as, for example, Hough transformations, edge detection, feature detection, contour detection, curve detection, random sample consensus (RANSAC), and the like. In some embodiments, the point cloud or the line cloud can be axis aligned. For example, the Z-axis can be aligned to gravity, and the X-axis and the Y-axis can be aligned to one or more aspects of the building structure and/or the one or more other objects, such as, for example, walls, floors, and the like.
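As an illustrative sketch of such axis alignment, and assuming a measured gravity direction is available (e.g., from the capture device's inertial sensors), a point cloud may be rotated so that the Z-axis corresponds to the gravity vector; the helper names are illustrative:

```python
import numpy as np

def rotation_aligning(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Rotation matrix that rotates unit vector a onto unit vector b (Rodrigues' formula)."""
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
    v = np.cross(a, b)
    c = float(np.dot(a, b))
    if np.isclose(c, -1.0):              # opposite vectors: rotate 180 degrees about an orthogonal axis
        axis = np.cross(a, [1.0, 0.0, 0.0])
        if np.linalg.norm(axis) < 1e-8:
            axis = np.cross(a, [0.0, 1.0, 0.0])
        axis /= np.linalg.norm(axis)
        return 2.0 * np.outer(axis, axis) - np.eye(3)
    vx = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    return np.eye(3) + vx + vx @ vx * (1.0 / (1.0 + c))

def align_cloud_to_gravity(points: np.ndarray, gravity: np.ndarray) -> np.ndarray:
    """Rotate an (N, 3) cloud so the measured gravity direction maps to -Z (i.e., +Z is 'up')."""
    R = rotation_aligning(gravity, np.array([0.0, 0.0, -1.0]))
    return points @ R.T
```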
Referring back to
For example, the system obtains a pair of the images. The system determines labeled elements which are included in the pair. For example, the system may identify that a first image of the pair includes an element labeled as a ridge endpoint. As another example, the system may also determine that the remaining image in the pair includes an element labeled as a ridge endpoint. Since these images are from different vantage points, and indeed may even depict the building from opposite views (e.g., a front view and a back view), the system identifies whether these elements correspond to the same real-world element in 3D space. As will be described below, for example in
While the above-described pair of images may be used to determine a 3D position for a real-world element, the system may determine distinct 3D positions for this element when analyzing remaining pairs of images. For example, a different pair of images may include the semantic label and based on epipolar geometry a 3D position of the element may be determined. This 3D position may be different, for example slightly different in a 3D coordinate system, as compared to the 3D position determined using the above-described pair of images. In other words, a plurality of image pairs may produce a plurality of candidate 3D positions for the same element. Variability in candidate 3D positions may be the result of variability or error in the 2D position of the segmented element as from step 104, or errors in the camera poses as from step 106, or a combination of the two that would lead to variability in the epipolar matching.
To select a robust candidate 3D position, the system uses a reprojection score associated with each of the 3D positions determined for a real-world element. With respect to a first pair of images, the system determines a first 3D position. Using camera properties, the first 3D position may be projected onto remaining images that observe the same classified element. The difference between the projected location in each remaining image and the location in the remaining image of the element may represent a reprojection error. The sum, or combination, of these reprojection errors for the remaining images may indicate the reprojection score associated with the first 3D position. Subsequently, all, or a subset of, the remaining image pairs that observe the element are similarly analyzed to determine reprojection errors associated with their resultant 3D positions. The 3D position with the lowest reprojection score may be selected as the 3D position of the element.
At block 1102, the system matches a semantically labeled element in one image with elements associated with the same semantic label in at least one other image of a set of images. In some examples, a semantically labeled element is matched by finding the similarly labeled element in the at least one other image that conforms to an epipolar constraint. In some examples, an initial single image is selected. At least one other image that observes a specific semantically labeled element in common with the selected initial image is then sampled with the initial image for this analysis.
Reference will now be made to
Viewpoint invariant descriptor-based matching as described herein enables feature matching across camera pose changes for which traditional feature matching, such as appearance-based matching (e.g., descriptor matching), is inaccurate. For example, using viewpoint invariant descriptor-based matching, an element of a roof which is depicted in an image of the front of a building may be matched with that element as depicted in a different image of the back of the building. Because traditional descriptors use appearance-based matching, and as the perspective and scene information changes with the camera pose change, the confidence of traditional feature matching drops and detection and matching reduces or varies. An element that is objectively the same may look quite different in images given the different perspectives, lighting conditions, or neighboring pixel changes. Semantically labeled elements, on the other hand, obviate these appearance-based variables by employing consistent labeling regardless of variability in appearance. Using secondary localization techniques for matching, such as epipolar constraints or mutual epipolar constraints, tightens the reliability of the match.
In some examples, the density of similarly labeled semantically labeled elements may result in a plurality of candidate matches which fall along, or which are close to, an epipolar line. In some examples, a semantically labeled element with the shortest distance to the epipolar line (e.g., as measured by the pixel distance of the image on which the epipolar line is projected) is selected as the matching element. The selection of a candidate match may therefore, as an example, be inversely proportional to a distance metric from an epipolar line. While equidistant candidate matches may occur, the mutual epipolar constraint of multiple candidates facilitates identifying a single optimized match among a plurality of candidate matches, in addition to the false positive filtering the mutual constraint already imposes.
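Merely for illustrative purposes, the mutual epipolar constraint discussed above may be evaluated as sketched below, assuming a fundamental matrix F_12 relating the two images (for example, computed from camera properties as in the earlier sketch); a candidate pair is kept only if each element is the nearest to the other's epipolar line, within an illustrative pixel tolerance:

```python
import numpy as np

def epi_dist(F, x_src, x_dst):
    """Pixel distance from x_dst to the epipolar line of x_src under fundamental matrix F."""
    l = F @ np.array([x_src[0], x_src[1], 1.0])
    return abs(l @ np.array([x_dst[0], x_dst[1], 1.0])) / np.hypot(l[0], l[1])

def mutual_epipolar_match(F_12, points_1, points_2, max_dist=3.0):
    """Match same-labeled elements between two images using a mutual epipolar constraint.

    points_1, points_2: lists of (x, y) pixel positions carrying the same semantic label.
    Returns index pairs (i, j) where j is nearest to i's epipolar line in image 2 AND
    i is nearest to j's epipolar line in image 1 (F_21 = F_12.T), within max_dist pixels.
    """
    F_21 = F_12.T
    matches = []
    for i, p1 in enumerate(points_1):
        d12 = [epi_dist(F_12, p1, p2) for p2 in points_2]
        j = int(np.argmin(d12))
        d21 = [epi_dist(F_21, points_2[j], q) for q in points_1]
        if int(np.argmin(d21)) == i and d12[j] <= max_dist:
            matches.append((i, j))
    return matches
```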
In some embodiments, the system determines element matches between each pair, or greater than a threshold number of pairs, of images. Thus, the system may identify each image which depicts a particular element. For example, a particular roof apex may be depicted in a subset of the images. In this example, the images may be paired with each other. The system may determine that the roof apex is depicted in a first pair that includes a first image and a second image. Subsequently, the system may determine that the roof apex is depicted in a second pair which includes the first image and a third image. This process may continue such that the system may identify the first image, second image, third image, and so on, as depicting the roof apex. The system may therefore obtain information identifying a subset of the set of images which depict the element.
Returning to
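Triangulation of a matched element from a pair of images may, for example, use a standard linear (direct linear transformation) formulation. The following is a minimal sketch assuming 3x4 projection matrices of the form K[R|t]; it is one possible implementation, not the only one contemplated:

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linear (DLT) triangulation of a matched 2D element observed in two views.

    P1, P2: 3x4 projection matrices (K @ [R | t]); x1, x2: matched (x, y) pixel positions.
    Returns the 3D point in world coordinates.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```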
The 3D position of the element determined in block 1104, which may be determined using a pair of the images, may then be reprojected into the remaining images of the subset of the set of images at block 1106. For example, using camera properties associated with the remaining images, the system identifies a location in each remaining image which corresponds to the 3D position.
At block 1108, the system calculates a reprojection error for each identified image with a reprojected triangulated position. In some examples, the reprojection error is calculated based on a Euclidean distance between the pixel coordinates of the 2D position of the specific semantically labeled element in the image and a 2D position of the reprojected triangulated specific semantically labeled element in the image. As described herein, the 2D position may refer to a particular pixel associated with a semantic label. The 2D position may also refer to a centroid of a bounding box positioned about a portion of an image associated with a semantic label.
With respect to the above, the difference between the projected 2D position and the 2D position of the element may represent the reprojection error. The difference may be determined based on a number of pixels separating the projected 2D position and the semantically labeled 2D position of the element (e.g., a number of pixels forming a line between the positions).
At block 1110, the system calculates a reprojection score for the triangulated specific semantically labeled element based on the calculated reprojection errors. Calculating the reprojection score can include summation of the reprojection errors across all images in the set of images.
In some embodiments, blocks 1104 through 1110 are then repeated by pairing every other image of images that can view a specific semantically labeled element with the initial selected image (e.g., the blocks may be performed iteratively). For example, if the system initially identified at block 1102 that three images produced a match for specific semantically labeled elements and blocks 1104 through 1110 were performed for a first and second image, then the process is repeated using the first and third image.
At block 1112, the system selects an initial 3D position of the specific semantically labeled element based on the aggregate calculated reprojection scores. This produces the triangulation with the lowest reprojection error, relative to an initial image only. Even though the triangulated point was reprojected across images and each image was eventually paired with the initial image through the iteration of blocks 1102 through 1110, the initial image may be understood to be the common denominator for the pairing and triangulation resulting in that initial 3D position.
In some examples, blocks 1104 through 1112 are further repeated selecting a second image in the image set that observes the semantically labeled element. The triangulation and reprojection error measurements are then performed again to produce another initial 3D position relative to that specific image. This iteration of blocks continues until each image has been used as the base image for analysis against all other images. This process of RANSAC-inspired sampling produces robust estimation of 3D data using optimization of only a single image pair. This technique overcomes the more computationally resource heavy bundle adjustment and its use of gradient descent to manage reprojection errors of several disparate points across many camera views.
Blocks 1102 through 1112 may produce multiple initial selections for 3D positions for the same specific semantically labeled element. In some embodiments, the multiple selections of initial 3D positions can be reduced to a single final 3D position at block 1114, such as via clustering. For example, the cumulative initial 3D positions are collapsed into a final 3D position. The final 3D position/point can be calculated based on the mean of all the initial 3D positions for a semantically labeled element or based on only those within a predetermined distance of one another in 3D space.
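As one hedged illustration of collapsing the initial 3D positions into a final 3D position, the estimates may be averaged after discarding those beyond a predetermined distance; measuring that distance from the median estimate, as below, is merely one possible choice of reference:

```python
import numpy as np

def collapse_positions(initial_positions, max_dist=0.25):
    """Collapse multiple initial 3D estimates of the same element into one final position.

    Keeps only estimates within max_dist (same units as the 3D coordinates) of the
    median estimate, then returns the mean of the survivors.
    """
    pts = np.asarray(initial_positions, dtype=float)      # shape (N, 3)
    center = np.median(pts, axis=0)
    keep = np.linalg.norm(pts - center, axis=1) <= max_dist
    return pts[keep].mean(axis=0) if keep.any() else pts.mean(axis=0)
```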
In some examples, rather than produce a series of initial 3D positions based on a common denominator image and aggregating into a final 3D position, process 1100 culminates with selecting a single image pair that produces the lowest reprojection score.
For example, in some examples if there are four images that observe a particular apex point, process 1100 may be run to determine which image, when paired with a first image, produces the lowest reprojection error among the other images; the sequence is then repeated to determine which image, when paired with the second image, produces the lowest reprojection error; and so on with the third and fourth images, with the triangulated points from each iteration aggregated into a final position. In some examples, rather than leverage a plurality of initial 3D positions from multiple image pairs, only the image pair among those four images that produces the lowest reprojection score among any other image pairs is selected.
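Merely for illustrative purposes, the single-best-pair variant described above may be sketched as follows; the `triangulate` argument stands in for any two-view triangulation routine (such as the DLT sketch above), and the names are illustrative rather than part of the disclosure:

```python
import numpy as np

def project(P, X):
    """Project a 3D point X with a 3x4 projection matrix P to (x, y) pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def best_pair_position(P_list, x_list, triangulate):
    """Pair-sampling selection: triangulate the element from every image pair, reproject
    the result into the remaining views that observe it, and keep the candidate 3D
    position with the lowest summed reprojection error.

    P_list: 3x4 projection matrices of the images observing the element.
    x_list: the element's labeled (x, y) position in each of those images.
    """
    best_X, best_score = None, np.inf
    n = len(P_list)
    for i in range(n):
        for j in range(i + 1, n):
            X = triangulate(P_list[i], P_list[j], x_list[i], x_list[j])
            score = sum(np.linalg.norm(project(P_list[k], X) - np.asarray(x_list[k]))
                        for k in range(n) if k not in (i, j))
            if score < best_score:
                best_X, best_score = X, score
    return best_X, best_score
```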
Returning to
In some embodiments, the 3D positions of the semantically labeled elements are connected based on associated classes and geometric constraints related to the associated classes. Examples of geometric constraints include: rake lines connecting ridge end points with eave end points; ridge lines connecting ridge end points; rake lines being neither vertical nor horizontal; eave lines or ridge lines being aligned to a horizontal axis; or eave lines being parallel or perpendicular to other eave lines. In this way, 3D lines are produced from the 3D data of the associated semantically labeled element(s).
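As an illustration of applying one such geometric constraint, a candidate eave or ridge line may be leveled to the horizontal, or rejected, based on its slope; the threshold value below is an illustrative assumption:

```python
import numpy as np

def level_line(p1, p2, max_slope_deg=5.0):
    """Enforce the 'eave/ridge lines are horizontal' constraint on a candidate 3D line.

    If the line's slope relative to the horizontal plane is within max_slope_deg,
    snap both endpoints to their mean height; otherwise reject the line (return None).
    """
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    run = np.linalg.norm((p2 - p1)[:2])
    rise = abs(p2[2] - p1[2])
    slope_deg = np.degrees(np.arctan2(rise, run))
    if slope_deg > max_slope_deg:
        return None                       # too steep to be an eave/ridge; cancel it from the model
    z = (p1[2] + p2[2]) / 2.0
    return np.array([p1[0], p1[1], z]), np.array([p2[0], p2[1], z])
```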
In some embodiments, generating the 3D model can include determining one or more faces based on the associated semantically labeled elements. The faces can be polygons, such as, for example, rectangles. In some embodiments, the faces can be determined based on the line segments connecting the semantically labeled elements. In some embodiments, the faces can be determined utilizing polygon surface approximation techniques, for example with the 3D positions of the semantically labeled elements and associated classes as input. In some embodiments, determining the faces can include deduplicating overlapping faces, for example, based on the 3D position of the faces.
In some embodiments, determining the faces can include calculating a score for each face, where the score is based on the number of multiple estimated final 3D positions for the same specific semantically labeled element that correspond to the vertices of the faces. For example, a cluster size can be determined based on the number of multiple estimated final 3D positions for the same specific semantically labeled element, and the score for a face can be calculated as the sum of the cluster sizes associated with the semantically labeled elements that are the vertices of the face.
In some embodiments, generating the 3D building model can include determining a set of mutually consistent faces based on the one or more faces. The set of mutually consistent faces includes faces that are not inconsistent with one another. Faces in a pair of faces are consistent with each other if the faces share an edge, do not overlap, and do not intersect. The set of mutually consistent faces can be determined based on pairwise evaluation of the faces to determine consistency (or inconsistency) between the faces in the pair of faces. In some embodiments, generating the 3D building model can include determining a maximally consistent set of mutually consistent faces based on the set of mutually consistent faces. The maximally consistent set of mutually consistent faces is a subset of the set of mutually consistent faces that maximizes the scores of the faces.
In some embodiments, generating the 3D building model can include generating one or more measurements related to the 3D building model. In some embodiments, the measurements can be generated based on one or more of the associations of the semantically labeled elements and the faces. The measurements can describe lengths of the line segments connecting the semantically labeled elements, areas of the faces, and the like.
In some embodiments, generating the 3D building model can include scaling the 3D building model. In some embodiments, the 3D building model is correlated with an orthographic (top down) scaled image of the building structure, and the 3D building model is scaled based on the correlated orthographic image. For example, at least two vertices of the 3D building model are correlated with at least two points of the orthographic image, and the 3D building model is scaled based on the correlated orthographic image. In some embodiments, the 3D building model is correlated with a scaled oblique image of the building structure, and the 3D building model is scaled based on the correlated oblique image. For example, at least two vertices of the 3D building model are correlated with at least two points of the oblique image, and the 3D building model is scaled based on the correlated oblique image.
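For example, a scale factor may be derived from the two correlated points as sketched below, assuming the correlated image points are expressed in world units (e.g., meters); the exact correlation procedure is not prescribed here and the names are illustrative:

```python
import numpy as np

def scale_from_correspondence(model_v1, model_v2, image_p1, image_p2):
    """Scale factor from two model vertices correlated with two points of a scaled
    orthographic (or oblique) image, where the image points are in world units."""
    model_len = np.linalg.norm(np.asarray(model_v2, float) - np.asarray(model_v1, float))
    image_len = np.linalg.norm(np.asarray(image_p2, float) - np.asarray(image_p1, float))
    return image_len / model_len

# Applying the scale to all model vertices (an (N, 3) array) is then a multiplication:
# scaled_vertices = vertices * scale_from_correspondence(v1, v2, p1, p2)
```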
In some embodiments, the 3D building model (e.g., 3D representation), or portion thereof (e.g., roof), may be output in a user interface presented to a user. For example, an application may be executed via a user device of the user. In this example, the application may be used to present the model and associated measurements. Additionally, measurements may be derived based on the model such as the pitch of each roof facet, or the area of the roof facet. Pitch may represent the rise over run of the roof face and may be determined based on the model, e.g., by calculating the change in elevation of the roof facet per unit of lateral distance. As an example, calculating the rise may include calculating the change in elevation of the roof facet (e.g., from its lowest to its highest point) and calculating the run may include calculating the distance the roof facet extends in a horizontal (x or y-axis) direction, with the rise and run forming the sides of a triangle and with the surface of the facet forming the hypotenuse. In some embodiments, the area may be calculated from measurements of the distance that each side of the facet extends. In some embodiments, the pitch and/or area of each roof facet may be presented in the user interface, for example positioned proximate to the roof facet in the model.
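Merely for illustrative purposes, pitch and area of a planar roof facet may be computed from its final 3D vertices along the following lines; approximating the run by the horizontal distance between the facet's highest and lowest vertices is one possible simplification of the rise-over-run description above:

```python
import numpy as np

def facet_pitch(vertices):
    """Pitch (rise over run) of a planar roof facet given its 3D vertices as an (N, 3) array."""
    v = np.asarray(vertices, float)
    rise = v[:, 2].max() - v[:, 2].min()                 # change in elevation of the facet
    hi, lo = v[np.argmax(v[:, 2])], v[np.argmin(v[:, 2])]
    run = np.linalg.norm(hi[:2] - lo[:2])                # horizontal (lateral) extent
    return rise / run if run > 0 else float("inf")

def facet_area(vertices):
    """Area of a planar polygonal facet via the 3D shoelace (cross-product) formula."""
    v = np.asarray(vertices, float)
    total = np.zeros(3)
    for i in range(len(v)):
        total += np.cross(v[i], v[(i + 1) % len(v)])
    return 0.5 * np.linalg.norm(total)
```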
Computer system 1500 also includes a main memory 1506, such as a random access memory (RAM), cache and/or other dynamic storage devices, coupled to I/O Subsystem 1502 for storing information and instructions to be executed by processor 1504. Main memory 1506 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 1504. Such instructions, when stored in storage media accessible to processor 1504, render computer system 1500 into a special-purpose machine that is customized to perform the operations specified in the instructions.
Computer system 1500 further includes a read only memory (ROM) 1508 or other static storage device coupled to I/O Subsystem 1502 for storing static information and instructions for processor 1504. A storage device 1510, such as a magnetic disk, optical disk, or USB thumb drive (Flash drive), etc., is provided and coupled to I/O Subsystem 1502 for storing information and instructions.
Computer system 1500 may be coupled via I/O Subsystem 1502 to an output device 1512, such as a cathode ray tube (CRT) or LCD display (or touch screen), for displaying information to a computer user. An input device 1514, including alphanumeric and other keys, is coupled to I/O Subsystem 1502 for communicating information and command selections to processor 1504. Another type of user input device is control device 1516, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 1504 and for controlling cursor movement on output device 1512. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane. In some embodiments, the same direction information and command selections as cursor control may be implemented via receiving touches on a touch screen without a cursor.
Computing system 1500 may include a user interface module to implement a GUI that may be stored in a mass storage device as computer executable program instructions that are executed by the computing device(s). Computer system 1500 may further, as described below, implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 1500 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 1500 in response to processor(s) 1504 executing one or more sequences of one or more computer readable program instructions contained in main memory 1506. Such instructions may be read into main memory 1506 from another storage medium, such as storage device 1510. Execution of the sequences of instructions contained in main memory 1506 causes processor(s) 1504 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
Various forms of computer readable storage media may be involved in carrying one or more sequences of one or more computer readable program instructions to processor 1504 for execution. For example, the instructions may initially be carried on a magnetic disk or solid state drive of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line, cable, using a modem (or optical network unit with respect to fiber). A modem local to computer system 1500 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on I/O Subsystem 1502. I/O Subsystem 1502 carries the data to main memory 1506, from which processor 1504 retrieves and executes the instructions. The instructions received by main memory 1506 may optionally be stored on storage device 1510 either before or after execution by processor 1504.
Computer system 1500 also includes a communication interface 1518 coupled to I/O Subsystem 1502. Communication interface 1518 provides a two-way data communication coupling to a network link 1520 that is connected to a local network 1522. For example, communication interface 1518 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 1518 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN (or a WAN component to communicate with a WAN). Wireless links may also be implemented. In any such implementation, communication interface 1518 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
Network link 1520 typically provides data communication through one or more networks to other data devices. For example, network link 1520 may provide a connection through local network 1522 to a host computer 1524 or to data equipment operated by an Internet Service Provider (ISP) 1526. ISP 1526 in turn provides data communication services through the world wide packet data communication network now commonly referred to as the “Internet” 1528. Local network 1522 and Internet 1528 both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link 1520 and through communication interface 1518, which carry the digital data to and from computer system 1500, are example forms of transmission media.
Computer system 1500 can send messages and receive data, including program code, through the network(s), network link 1520 and communication interface 1518. In the Internet example, a server 1530 might transmit a requested code for an application program through Internet 1528, ISP 1526, local network 1522 and communication interface 1518.
The received code may be executed by processor 1504 as it is received, and/or stored in storage device 1510, or other non-volatile storage for later execution.
All of the processes described herein may be embodied in, and fully automated via, software code modules executed by a computing system that includes one or more computers or processors. The code modules may be stored in any type of non-transitory computer-readable medium or other computer storage device. Consequently, various electronic storage media discussed herein may be understood to be types of non-transitory computer readable media in some implementations. Some or all the methods may be embodied in specialized computer hardware.
Many other variations than those described herein will be apparent from this disclosure. For example, depending on the implementation, certain acts, events, or functions of any of the algorithms described herein can be performed in a different sequence or can be added, merged, or left out altogether (for example, not all described acts or events are necessary for the practice of the algorithms). Moreover, in certain implementations, acts or events can be performed concurrently, for example, through multi-threaded processing, interrupt processing, or multiple processors or processor cores or on other parallel architectures, rather than sequentially. In addition, different tasks or processes can be performed by different machines and/or computing systems that can function together.
The various illustrative logical blocks, modules, and engines described in connection with the implementations disclosed herein can be implemented or performed by a machine, such as a processing unit or processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A processor can be a microprocessor, but in the alternative, the processor can be a controller, microcontroller, or state machine, combinations of the same, or the like. A processor can include electrical circuitry configured to process computer-executable instructions. In another implementation, a processor includes an FPGA or other programmable device that performs logic operations without processing computer-executable instructions.
This application claims priority to U.S. Prov. Patent App. No. 63/271,197 titled “SYSTEMS AND METHODS IN 3D RECONSTRUCTION WITHOUT DESCRIPTORS” and filed on Oct. 24, 2021, the disclosure of which is hereby incorporated herein by reference in its entirety. This application also incorporates by reference the following applications, U.S. patent application Ser. No. 17/118,370, International Application No. PCT/US20/48263, and International Application No. PCT/US22/14164.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/US2022/078558 | 10/21/2022 | WO |

Number | Date | Country
---|---|---
63/271,197 | Oct. 24, 2021 | US