Example embodiments relate to real object and camera localization for interactive 3D mapping applications.
Virtual 3D maps can be made to represent real world environments. One form of a virtual 3D map is by way of a point cloud map (also known as point cloud). A point cloud map is a set of points in space. The points may represent a 3D shape or object, and can also represent background environment and fixed infrastructure. One way to represent the point cloud map is to give each point a set of Cartesian coordinates in 3D space. For example, an object can be represented by a plurality of points that represent the object.
In some products, a LiDAR scanner can be used to capture a point cloud of an object, which is then to be converted to a virtual version of the object. The virtual version of the object can then be viewed on a display device. When the object is represented as a point cloud, a large amount of data can be required to transmit the point cloud. The problem is compounded when color information is transmitted, which can require substantially more data than a grayscale or black and white version.
For real-time localization applications, the large amount of data transfer required to acquire and transmit the point cloud can hinder or lag the real-time virtual representation of the object.
It would be advantageous to provide an object and camera localization system and localization method for real-time mapping of the real world.
It would be advantageous to reduce the amount of data transfer required for generating real-time maps of the real world.
It would be advantageous to track a location of the camera within the real-time mapping of the real world.
It would be advantageous to provide rule based modules to generate object localization and camera localization.
It would be advantageous to provide machine learning models to generate object localization and camera localization.
Additional difficulties of conventional localization and mapping systems may be appreciated in view of the Detailed Description, herein below.
Example embodiments relate to a localization system and localization method for mapping of real objects and users in the real world. The mapping can be done in real-time or near real-time to the detection of the real objects by a camera device. An example of the localization method can be applied to the scenario where a user wishes to track a digital twin of a real object. The user can interact with the digital twin in a virtual reality (VR), augmented reality (AR), mixed reality (MR), or other 3D environments.
An example of the localization method can be applied to the scenario of convenient and automated mapping of the real world. A camera device can be used to capture one or more images of an object. The localization method can be used to calculate a cuboid and a centroid of an object. The camera device can send the location of the centroid without sending an entire 3D model of the object (e.g. point cloud map). When the object is moved, the localization method can use the centroid, the cuboid, or anchor points, to track the movement of the cuboid in order to determine the new location of the object. A 3D map can be generated that takes into account the new location of the object. An advantage of the localization method is that entire 3D models or point clouds are not required to be transmitted by the camera device at every stage of the localization method.
In another example embodiment, the images, camera locations and camera orientations are received from a third party mapping service, such as Google Maps™. The localization method can be used to generate, from the images, the camera locations and the camera orientations: the location and pose of objects in the real world in order to generate a virtual 3D map which contains 3D models of the objects with the respective location and pose.
In another example embodiment, the camera device can be used to capture an image of a stationary real object with a known location and known pose in the real world. A point of view pose of the stationary real object is generated from the one or more images. Using the one or more images and the point of view pose of the stationary real object, a location of the camera device in real world 3D coordinates is determined using the localization method.
Another example embodiment is a localization method, comprising: receiving at least one image which includes an object; generating for each image, using a positioning module: a camera location in real world coordinates of real 3-Dimensional (3D) space, a camera orientation, and a camera distance to the object; generating, using an image 2D object detection module and each image: i) an object label of the object detected in that image, ii) a bounding box of the object in that image, and iii) feature points in that image; generating, using a cuboid generator, the bounding box for each image, the camera location for each image, the camera orientation for each image, and the camera distance to the object for each image: a cuboid in the real world coordinates of the real 3D space which bounds the object in the real world coordinates of the real 3D space; generating, using an anchor point generator, the feature points of the at least one image, and the cuboid: anchor points in the real world coordinates of the real 3D space of the object which are contained in the cuboid; and generating, using the object label, the anchor points, and at least one of the cuboid in the real world coordinates of the real 3D space, a centroid of the cuboid, or the bounding box of the object with at least one of the images: a 3D map which includes the object located in the real world coordinates in a virtual 3D space.
Another example embodiment is a localization method, comprising: receiving at least one image which includes an object; generating for each image, using a positioning module: a camera location in real world coordinates of real 3-Dimensional (3D) space, a camera orientation, and a camera distance to the object; generating, using an image 2D object detection module and each image: i) an object label of the object detected in that image, ii) a bounding box of the object in that image, and iii) feature points in that image; generating, using a cuboid generator, the bounding box for each image, the camera location for each image, the camera orientation for each image, and the camera distance to the object for each image: a cuboid in the real world coordinates of the real 3D space which bounds the object in the real world coordinates of the real 3D space; generating, using an anchor point generator, the feature points of the at least one image, and the cuboid: anchor points in the real world coordinates of the real 3D space of the object which are contained in the cuboid; and outputting the object label, the anchor points, and at least one of the cuboid in the real world coordinates of the real 3D space, a centroid of the cuboid, or the bounding box of the object with at least one of the images, for generating a 3D map which includes the object located in the real world coordinates in a virtual 3D space.
In an example embodiment of any of the above localization methods, the localization method further comprises: generating, using a centroid generator and the cuboid: the centroid of the cuboid in real world coordinates of the real 3D space.
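By way of a non-limiting illustration, a centroid generator could compute the centroid as the mean of the cuboid's corner points. The following is a minimal sketch under that assumption; the eight-corner representation and the name `cuboid_centroid` are hypothetical and are not taken from the embodiments.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def cuboid_centroid(corners: List[Point3D]) -> Point3D:
    """Return the mean of the eight corner points of a cuboid.

    Assumption: the cuboid is represented by its 8 corners in real world
    coordinates. For any cuboid (axis-aligned or rotated) the mean of the
    corners coincides with its geometric center.
    """
    if len(corners) != 8:
        raise ValueError("a cuboid is expected to have exactly 8 corners")
    n = len(corners)
    return (
        sum(c[0] for c in corners) / n,
        sum(c[1] for c in corners) / n,
        sum(c[2] for c in corners) / n,
    )


# Example: a 2 m x 1 m x 1 m box with one corner at the origin.
corners = [(x, y, z) for x in (0.0, 2.0) for y in (0.0, 1.0) for z in (0.0, 1.0)]
centroid = cuboid_centroid(corners)
```

In this sketch only the three centroid coordinates, rather than the corners or a point cloud, would need to be sent onward.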
In an example embodiment of any of the above localization methods, the localization method further comprises: generating, using the cuboid generator, the at least one image, the bounding box for that image, the camera location for that image, the camera orientation for that image, the camera distance to the object for that image: at least one vanishing point in the real world coordinates of the real 3D space for that object; and wherein the generating the cuboid in the real world coordinates of the real 3D space further uses the at least one vanishing point in the real world coordinates of the real 3D space for the object.
In an example embodiment of any of the above localization methods, the generating the cuboid in the real world coordinates of the real 3D space includes transforming the cuboid from camera 3D coordinates to the real world coordinates of the real 3D space.
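As a non-limiting illustration of such a transformation, a point in camera 3D coordinates can be mapped to real world coordinates with a rigid transform built from the camera orientation and the camera location. The sketch below assumes the orientation is available as a 3x3 camera-to-world rotation matrix; for brevity it builds that matrix from yaw only, which is a simplification. The names `yaw_rotation` and `camera_to_world` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def yaw_rotation(yaw_rad: float) -> List[List[float]]:
    """Rotation about the z (up) axis; a simplified stand-in for a full orientation."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def camera_to_world(
    p_cam: Vec3, rotation: Sequence[Sequence[float]], camera_location: Vec3
) -> Vec3:
    """Apply p_world = R * p_cam + t, where t is the camera location."""
    return tuple(
        sum(rotation[i][j] * p_cam[j] for j in range(3)) + camera_location[i]
        for i in range(3)
    )


# Example: camera at (10, 5, 1.5), rotated 90 degrees about the up axis.
R = yaw_rotation(math.radians(90.0))
corner_world = camera_to_world((1.0, 0.0, 0.0), R, (10.0, 5.0, 1.5))
# Applying the same transform to all eight corners transforms the whole cuboid.
```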
In an example embodiment of any of the above localization methods, the generating the anchor points in the real world coordinates of the real 3D space includes transforming the feature points in the respective image to the anchor points in camera 3D coordinates and transforming the anchor points in the camera 3D coordinates to the real world coordinates of the real 3D space.
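The anchor points are described as being contained in the cuboid. As a non-limiting illustration, once feature points have been transformed to real world coordinates, containment could be tested by projecting each point onto the cuboid's axes. The sketch assumes the cuboid is given by a center, three orthonormal axes, and half-extents; this representation and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def inside_cuboid(
    point: Vec3, center: Vec3, axes: Sequence[Vec3], half_extents: Vec3
) -> bool:
    """True if the point lies within the (possibly rotated) cuboid.

    Assumption: `axes` are three orthonormal unit vectors of the cuboid.
    """
    offset = tuple(p - c for p, c in zip(point, center))
    return all(abs(dot(offset, axis)) <= half for axis, half in zip(axes, half_extents))


def select_anchor_points(
    points: List[Vec3], center: Vec3, axes: Sequence[Vec3], half_extents: Vec3
) -> List[Vec3]:
    """Keep only the 3D feature points that fall inside the cuboid."""
    return [p for p in points if inside_cuboid(p, center, axes, half_extents)]


# Example: a 2 m x 1 m x 1 m axis-aligned cuboid centered at (1.0, 0.5, 0.5).
center = (1.0, 0.5, 0.5)
axes = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
half_extents = (1.0, 0.5, 0.5)
feature_points_3d = [(0.5, 0.5, 0.5), (1.9, 0.9, 0.1), (3.0, 0.5, 0.5), (1.0, 0.5, 1.2)]
anchor_points = select_anchor_points(feature_points_3d, center, axes, half_extents)
```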
In an example embodiment of any of the above localization methods, the localization method further comprises: detecting a plane of a floor, generating a height from a camera device to the floor, and wherein the transforming of the anchor points in the camera 3D coordinates to the real world coordinates of the real 3D space includes scaling the object based on the height of the camera device to the floor.
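As a non-limiting illustration of height-based scaling, suppose the anchor points and the detected floor plane are expressed in camera 3D coordinates with an unknown scale, and a metric height of the camera device above the floor is available. The ratio of the two heights can then serve as a scale factor. This is one possible reading only; the plane representation `(a, b, c, d)` for `a*x + b*y + c*z + d = 0` and the function names are hypothetical.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Plane = Tuple[float, float, float, float]  # a*x + b*y + c*z + d = 0


def camera_height_above_plane(plane: Plane) -> float:
    """Distance from the camera origin (0, 0, 0) to the floor plane."""
    a, b, c, d = plane
    return abs(d) / math.sqrt(a * a + b * b + c * c)


def scale_points(points: List[Vec3], unscaled_height: float, metric_height: float) -> List[Vec3]:
    """Scale camera-frame points so the camera-to-floor distance matches the metric height."""
    if unscaled_height <= 0.0:
        raise ValueError("the floor plane must not pass through the camera origin")
    s = metric_height / unscaled_height
    return [(x * s, y * s, z * s) for x, y, z in points]


# Example: floor detected 0.5 units below the camera; the assumed metric height is 1.5 m.
floor_plane = (0.0, 1.0, 0.0, -0.5)
unscaled = camera_height_above_plane(floor_plane)
scaled_anchor_points = scale_points([(0.2, 0.1, 1.0)], unscaled, metric_height=1.5)
```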
In an example embodiment of any of the above localization methods, the localization method further comprises: generating, using a pose estimation module, the at least one image, the camera location, the camera orientation, the camera distance to the object, and the bounding box of the object in each image: a pose of the object in the real world coordinates of the real 3D space; and outputting the pose of the object for the generating the 3D map which includes the object having the pose in the real world coordinates in the virtual 3D space.
In an example embodiment of any of the above localization methods, the localization method further comprises: generating, using a front detection module, the object label, the bounding box for each image, and the at least one image: front identifying information of the object; and wherein the generating the pose of the object in the real world coordinates of the real 3D space further uses the front identifying information of the object.
In an example embodiment of any of the above localization methods, the front identifying information includes: a point of view of a 3D model of the object, a front bounding box of a front of the object, an image of the front of the object, a 3D model or point cloud map of only the front of the object, the anchor points of the front of the object, or descriptive text of the front of the object.
In an example embodiment of any of the above localization methods, the localization method further comprises: retrieving, using the object label and an object database: front identifying information of the object; and wherein the generating the pose of the object in the real world coordinates of the real 3D space further uses the front identifying information of the object.
In an example embodiment of any of the above localization methods, the localization method further comprises: generating, using the pose estimator module, the object label, the at least one image, the bounding box of the object in each image: a point of view pose of the object from a line of sight between the camera location and the object; and wherein the generating the pose of the object in the real world coordinates of the real 3D space further uses the point of view pose of the object.
In an example embodiment of any of the above localization methods, the generating of the 3D map includes determining, using a mapping module, a change in the pose and updating the object already in the 3D map with the change in the pose.
In an example embodiment of any of the above localization methods, the localization method further comprises: determining that the pose is different than a stored pose of the object and outputting an instruction to move the object in the real 3D space to the stored pose.
In an example embodiment of any of the above localization methods, the generating the pose of the object in the real world coordinates of the real 3D space further uses the anchor points in the real world coordinates of the real 3D space of the object which are contained in the cuboid.
In an example embodiment of any of the above localization methods, the localization method further comprises: generating, using a front detection module, front identifying information which identifies a face of the cuboid as being a front of the object; and wherein the generating the 3D map uses the front identifying information of the object.
In an example embodiment of any of the above localization methods, the at least one image, the camera location, and the camera orientation are received from a third party mapping service.
In an example embodiment of any of the above localization methods, the image includes a stationary real object, wherein the generating the camera location and the camera orientation comprises: generating, using the image 2D object detection module and the image: i) a second object label of the stationary real object detected in that image, and ii) a second bounding box of the stationary real object in that image; generating, using a pose estimator module, the image, the second object label and the second bounding box: a point of view pose of the stationary real object; retrieving, using the second object label: a known cuboid in the real world coordinates of the real 3D space of the stationary real object and a known pose in the 3D space of the stationary real object; and generating, using the positioning module, the image, the second object label, the second bounding box, the point of view pose, the known cuboid in the real world coordinates of the real 3D space, and the known pose in the real world coordinates of the real 3D space: the camera location in the real world coordinates of the real 3D space, and the camera orientation in the real world coordinates of the real 3D space.
In an example embodiment of any of the above localization methods, the outputting does not output a 3D model or point cloud map of the object.
In an example embodiment of any of the above localization methods, the at least one image includes a plurality of images.
In an example embodiment of any of the above localization methods, the object label is unique to the object.
In an example embodiment of any of the above localization methods, the positioning module includes a global positioning system (GPS), a local positioning system (LPS), and/or a Light Detection And Ranging (LiDAR) scanner.
In an example embodiment of any of the above localization methods, the localization method further comprises: performing, using a mapping module, the object label, the anchor points, and the at least one of the cuboid, the centroid, or the bounding box of the object with at least one of the at least one image: the generating of the 3D map which includes the object located in the real world coordinates in the virtual 3D space.
In an example embodiment of any of the above localization methods, the generating of the 3D map includes the mapping module determining a change in a location of the cuboid or the centroid and the mapping module updating the object already in the 3D map with the change in the location.
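As a non-limiting illustration, a mapping module could keep the last known centroid per object label and update the map only when a newly reported centroid differs by more than a tolerance. The dictionary-based map, the tolerance value, and the name `update_object_location` are hypothetical.

```python
import math
from typing import Dict, Tuple

Vec3 = Tuple[float, float, float]


def update_object_location(
    map_db: Dict[str, Vec3], label: str, centroid: Vec3, tolerance_m: float = 0.05
) -> bool:
    """Store the new centroid if the object is new or has moved; return True on update.

    Assumption: a small tolerance absorbs measurement noise so that a
    stationary object does not trigger a map update on every frame.
    """
    previous = map_db.get(label)
    if previous is not None and math.dist(previous, centroid) <= tolerance_m:
        return False
    map_db[label] = centroid
    return True


# Example: a 1 cm jitter is ignored, a 2 m move updates the map.
map_db = {"chair-17": (2.0, 3.0, 0.45)}
jitter_updated = update_object_location(map_db, "chair-17", (2.01, 3.0, 0.45))
moved_updated = update_object_location(map_db, "chair-17", (4.0, 3.0, 0.45))
```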
In an example embodiment of any of the above localization methods, the generating of the 3D map includes the mapping module retrieving, using the object label: a 3D model of the object; wherein the 3D map includes the 3D model of the object in the real world coordinates in the virtual 3D space.
In an example embodiment of any of the above localization methods, the mapping module is in a camera device that captured the at least one image.
In an example embodiment of any of the above localization methods, the localization method further comprises: determining that the cuboid or the centroid is different than a location of a stored cuboid or stored centroid of the object and outputting an instruction to move the object in the real 3D space to the location of the stored cuboid or the stored centroid.
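As a non-limiting illustration, the instruction to move the object could carry the offset from the detected centroid to the stored centroid. The message fields, the tolerance, and the name `move_instruction` are hypothetical.

```python
import math
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


def move_instruction(
    label: str, current: Vec3, stored: Vec3, tolerance_m: float = 0.05
) -> Optional[dict]:
    """Return an instruction to restore the stored location, or None if already in place."""
    offset = tuple(s - c for s, c in zip(stored, current))
    distance = math.sqrt(sum(d * d for d in offset))
    if distance <= tolerance_m:
        return None
    return {
        "object_label": label,
        "move_by": offset,       # displacement in real world coordinates
        "distance_m": distance,  # straight-line distance to the stored location
        "target": stored,
    }


# Example: the object is detected 5 m away from its stored centroid.
instruction = move_instruction("table-2", current=(4.0, 3.0, 0.0), stored=(1.0, 7.0, 0.0))
```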
In an example embodiment of any of the above localization methods, the localization method further comprises: displaying the 3D map on a display device.
In an example embodiment of any of the above localization methods, the localization method is performed by a camera device that captured the at least one image.
In an example embodiment of any of the above localization methods, the positioning module includes a positioning model that includes a first convolutional neural network (CNN); and/or the image 2D object detection module includes an image 2D object detector model that includes a second CNN.
Another example embodiment is a localization method, comprising: receiving at least one image which includes an object; generating for each image, using a positioning module: a camera location in real world coordinates of real 3-Dimensional (3D) space, a camera orientation, and a camera distance to the object; generating, using a scanner or photogrammetry: a point cloud map of the object; generating, using an image 2D object detection module and each image: i) an object label of the object detected in that image, ii) a bounding box of the object in that image, and iii) feature points in that image; generating, using a cuboid generator, the bounding box for each image, the camera location for each image, the camera orientation for each image, the camera distance to the object for each image: a cuboid in the real world coordinates of the real 3D space which bounds the object in the real 3D space; generating, using an anchor point generator, the feature points of the at least one image, and the cuboid: anchor points in the real world coordinates of the real 3D space of the object which are contained in the cuboid; storing in an object database: the object label, the cuboid, the anchor points, and at least one of the point cloud map or a 3D model of the object generated from the point cloud map; and storing in a map database: the object label, the cuboid, and the anchor points.
In an example embodiment of any of the above localization methods, the localization method further comprises: generating, using a centroid generator and the cuboid: a centroid of the cuboid; and storing in the map database: the centroid.
In an example embodiment of any of the above localization methods, the localization method further comprises: generating, using a pose estimation module, the at least one image, the camera location, the camera orientation, the camera distance to the object, and the bounding box of the object in each image: a pose of the object in the real world coordinates of the real 3D space; and storing the pose of the object in the map database.
Another example embodiment is a localization method, comprising: receiving an image which includes a stationary real object; generating, using an image 2D object detection module and the image: i) an object label of the stationary real object detected in that image, and ii) a bounding box of the stationary real object in that image; generating, using a pose estimator module, the image, the object label and the bounding box: a point of view pose of the stationary real object; retrieving, using the object label: a known cuboid in real world coordinates of real 3-Dimensional (3D) space of the stationary real object and a known pose in the real world coordinates of the real 3D space of the stationary real object; and generating, using a positioning module, the image, the object label, the bounding box, the point of view pose, the known cuboid in the real world coordinates of the real 3D space, and the known pose in the real world coordinates of the real 3D space: a camera location in the real world coordinates of the real 3D space, and a camera orientation in the real world coordinates of the real 3D space.
In an example embodiment of any of the above localization methods, the localization method further comprises: retrieving, using the object label, a known centroid of the known cuboid; generating, using the image, a centroid of the stationary real object; and wherein the generating the camera location in the real world coordinates of the real 3D space and the camera orientation in the real world coordinates of the real 3D space uses the known centroid of the known cuboid.
In an example embodiment of any of the above localization methods, the generating the camera location in the real world coordinates of the real 3D space and the camera orientation in the real world coordinates of the real 3D space does not require sensors other than a camera device.
In an example embodiment of any of the above localization methods, the localization method further comprises: generating, using the image 2D object detection module and the image: iii) feature points of the image; generating, using an anchor point generator, the image, the feature points of the image, the object label of the stationary real object, and the bounding box: anchor points in camera 3D coordinates; retrieving, using the object label: known anchor points in the real world coordinates of the real 3D space of the object; and wherein the generating the camera location in the real world coordinates of the real 3D space and the camera orientation in the real world coordinates of the real 3D space further uses the anchor points in the camera 3D coordinates and the known anchor points in the real world coordinates of the real 3D space of the object.
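As a non-limiting illustration of how anchor points in camera 3D coordinates and known anchor points in real world coordinates can together yield a camera location and orientation, the sketch below aligns the two point sets with a rigid transform. It is a deliberately simplified variant that assumes the camera is upright, so that only the yaw and the translation are unknown, and it uses a closed-form 2D least-squares alignment in the horizontal plane. The embodiments do not specify this technique; the name `locate_camera` is hypothetical.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def locate_camera(anchors_cam: List[Vec3], anchors_world: List[Vec3]) -> Tuple[float, Vec3]:
    """Estimate camera yaw (radians) and camera location from matched anchor points.

    Model (assumption): p_world = Rz(yaw) * p_cam + t, with z as the up axis.
    The camera origin maps to t, so t is the camera location in world coordinates.
    """
    n = len(anchors_cam)
    if n < 2 or n != len(anchors_world):
        raise ValueError("need at least two matched anchor points")
    cam_mean = [sum(p[i] for p in anchors_cam) / n for i in range(3)]
    world_mean = [sum(p[i] for p in anchors_world) / n for i in range(3)]
    sin_sum = cos_sum = 0.0
    for c, w in zip(anchors_cam, anchors_world):
        cx, cy = c[0] - cam_mean[0], c[1] - cam_mean[1]
        wx, wy = w[0] - world_mean[0], w[1] - world_mean[1]
        sin_sum += cx * wy - cy * wx  # proportional to sin(yaw)
        cos_sum += cx * wx + cy * wy  # proportional to cos(yaw)
    yaw = math.atan2(sin_sum, cos_sum)
    c_, s_ = math.cos(yaw), math.sin(yaw)
    rotated_mean = (
        c_ * cam_mean[0] - s_ * cam_mean[1],
        s_ * cam_mean[0] + c_ * cam_mean[1],
        cam_mean[2],
    )
    location = tuple(world_mean[i] - rotated_mean[i] for i in range(3))
    return yaw, location


# Example: synthesize known world anchors from a chosen camera pose, then recover the pose.
true_yaw, true_location = math.radians(30.0), (4.0, 2.0, 1.5)
anchors_cam = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.5)]
ct, st = math.cos(true_yaw), math.sin(true_yaw)
anchors_world = [
    (ct * x - st * y + true_location[0], st * x + ct * y + true_location[1], z + true_location[2])
    for x, y, z in anchors_cam
]
yaw, location = locate_camera(anchors_cam, anchors_world)
```

A full six-degree-of-freedom solution would instead estimate a general 3D rotation; the upright-camera restriction is used here only to keep the sketch short.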
In an example embodiment of any of the above localization methods, the receiving the image comprises receiving exactly one image for the generating the point of view pose of the stationary real object.
Another example embodiment is a localization method, comprising: receiving, from a camera device: i) an object label for a real object, ii) anchor points in real world coordinates of real 3D space of the real object, iii) at least one of a cuboid in the real world coordinates of the real 3D space, a centroid of the cuboid, or a respective bounding box of the real object with at least one image; retrieving, using the object label: a 3D model of the real object; generating, using a mapping module, the 3D model of the real object, the anchor points, and the at least one of the cuboid in the real world coordinates of the real 3D space, the centroid of the cuboid, or the respective bounding box of the real object with the at least one image: a 3D map for an immersive extended reality (XR) application which includes the 3D model of the real object located in the real world coordinates in a virtual 3D space.
In an example embodiment of any of the above localization methods, the localization method further comprises: receiving, from the camera device: a pose of the real object; wherein the generating the 3D map uses the pose; and wherein the 3D map includes the 3D model of the real object with the pose in the virtual 3D space.
In an example embodiment of any of the above localization methods, the localization method further comprises: determining, using the mapping module, a change in the pose; and updating, using the mapping module, the 3D model of the real object already in the 3D map with the changed pose.
In an example embodiment of any of the above localization methods, the localization method further comprises: determining, using the mapping module, a change in a location in the real world coordinates of the real 3D space of the cuboid or the centroid; and updating, using the mapping module, the 3D model of the real object already in the 3D map with the change in the location.
In an example embodiment of any of the above localization methods, the receiving from the camera device does not include a 3D model of the real object.
Another example embodiment is a localization method, including: receiving an image which includes an object; generating a screen normal of the image in 2D space; generating, using an image 2D object detection module and the image: i) an object label of the object detected in the image, ii) line segments in the 2D space of respective edges of the object detected in the image, and iii) a bounding box in the 2D space of the object; generating, using a cuboid generator, the image, and the bounding box for the image: i) a plurality of candidate cuboids in the 2D space which bound the object, ii) respective vanishing points for each of the plurality of candidate cuboids, iii) a respective vanishing point center of the respective vanishing points, and iv) a respective vanishing point angle from a screen center to the respective vanishing point center in the 2D space; calculating, for each of the plurality of candidate cuboids, an angle difference in the 2D space between the respective vanishing point angle and the screen normal; selecting the candidate cuboids having the angle difference in the 2D space which are within an angle threshold; generating, for each of the selected candidate cuboids, using the cuboid generator, for the image: i) the selected candidate cuboid in the 2D space into real world coordinates of real 3D space, and ii) a respective cost function value between the selected candidate cuboid in the real world coordinates of the real 3D space and the object in the real 3D space; and associating with the object: i) the object label, and ii) the selected candidate cuboid in the real world coordinates of the real 3D space having the respective cost function value that is optimal.
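As a non-limiting illustration of the angle-based selection, the sketch below computes the vanishing point center of each candidate cuboid, the angle from the screen center to that center, and the difference from the angle of the screen normal, keeping candidates within the threshold. Treating the screen normal as a 2D direction vector, the dictionary layout of a candidate, and the function names are assumptions; the 20 degree default follows the example threshold given elsewhere in this description.

```python
import math
from typing import Dict, List, Tuple

Vec2 = Tuple[float, float]


def angle_deg(vec: Vec2) -> float:
    """Direction of a 2D vector in degrees."""
    return math.degrees(math.atan2(vec[1], vec[0]))


def angle_difference_deg(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, in [0, 180]."""
    d = abs(a - b) % 360.0
    return 360.0 - d if d > 180.0 else d


def select_candidates(
    candidates: List[Dict], screen_center: Vec2, screen_normal: Vec2, threshold_deg: float = 20.0
) -> List[Dict]:
    normal_angle = angle_deg(screen_normal)
    selected = []
    for cand in candidates:
        vps = cand["vanishing_points"]
        # Vanishing point center: mean of the candidate's vanishing points.
        center = (sum(p[0] for p in vps) / len(vps), sum(p[1] for p in vps) / len(vps))
        vp_angle = angle_deg((center[0] - screen_center[0], center[1] - screen_center[1]))
        if angle_difference_deg(vp_angle, normal_angle) <= threshold_deg:
            selected.append(cand)
    return selected


# Example in pixel coordinates (y grows downward); the screen normal is assumed to point "up".
candidates = [
    {"id": "A", "vanishing_points": [(300.0, -400.0), (340.0, -380.0), (330.0, -420.0)]},
    {"id": "B", "vanishing_points": [(900.0, 240.0), (1000.0, 260.0), (950.0, 220.0)]},
]
selected = select_candidates(candidates, screen_center=(320.0, 240.0), screen_normal=(0.0, -1.0))
selected_ids = [c["id"] for c in selected]
```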
In an example embodiment of any of the above localization methods, the localization method further includes tracking the selected candidate cuboid having the respective cost function value that is optimal.
In an example embodiment of any of the above localization methods, the localization method further includes: generating a 3D normal orthogonal to a detected plane in the real world coordinates of the real 3D space; generating, for each of the selected candidate cuboids, a respective centroid and a respective centroid angle from a camera position to the respective centroid; and second calculating, for each of the selected candidate cuboids, a second angle difference in the real world coordinates of the real 3D space between the respective centroid angle and the 3D normal, wherein the selecting further includes second selecting from the selected candidate cuboids having the second angle difference in the real world coordinates of the real 3D space which are within a second angle threshold.
In an example embodiment of any of the above localization methods, the line segments are generated using a fast line detector function.
In an example embodiment of any of the above localization methods, the cost function value is based on distance error, angle alignment error, or shape error.
In an example embodiment of any of the above localization methods, the cost function value is based on all of the distance error, the angle alignment error, and the shape error.
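As a non-limiting illustration, a cost function value based on all three errors could be formed as a weighted sum, with the lowest value treated as optimal. The weighted-sum form, the weights, and the error values below are hypothetical; how each error is measured is not shown.

```python
from typing import Dict, List, Tuple


def cuboid_cost(
    distance_error: float,
    angle_error: float,
    shape_error: float,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> float:
    """Weighted sum of the three error terms (assumed form; lower is better)."""
    w_d, w_a, w_s = weights
    return w_d * distance_error + w_a * angle_error + w_s * shape_error


def best_candidate(candidates: List[Dict]) -> Dict:
    """Select the candidate cuboid with the minimum cost function value."""
    return min(
        candidates,
        key=lambda c: cuboid_cost(c["distance_error"], c["angle_error"], c["shape_error"]),
    )


# Example with made-up error values for two selected candidate cuboids.
candidates = [
    {"id": "A", "distance_error": 0.2, "angle_error": 0.1, "shape_error": 0.3},
    {"id": "B", "distance_error": 0.1, "angle_error": 0.1, "shape_error": 0.1},
]
best = best_candidate(candidates)
```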
In an example embodiment of any of the above localization methods, the angle threshold is on or about 20 degrees.
Another example embodiment is a localization system, including: at least one processor; and memory containing instructions which, when executed by the at least one processor, cause the processor to perform the localization method of any of the above.
Another example embodiment is a camera device, including: a camera; at least one processor; and memory containing instructions which, when executed by the at least one processor, cause the processor to perform the localization method of any of the above.
Another example embodiment is a non-transitory memory containing instructions which, when executed by at least one processor, cause the at least one processor to perform the localization method of any of the above.
Another example embodiment is a computer program product generated by a machine learning training process, the computer program product comprising instructions stored in a non-transitory computer-readable medium which, when executed by at least one processor, cause the at least one processor to perform the localization method of any of the above.
Reference will now be made, by way of example, to the accompanying drawings which show example embodiments, and in which:
Similar reference numerals may have been used in different figures to denote similar components.
Example embodiments relate to a localization system and localization method for real-time mapping of real objects and camera devices in the real world.
The term “real” is used to refer to any thing or location in the real or physical world, e.g. a real object, a real location. The term “virtual” is used to refer to any thing or location in the digital or virtual world, e.g. a virtual object, a virtual location. A “digital twin” is a virtual thing which corresponds to a real thing. The virtual thing has a virtual appearance and virtual location that is indicative of the counterpart real thing which has a real appearance and real location. The location can include longitude and latitude or northing and easting in real world coordinates. In some examples, the location can also include height in real world coordinates. The location can have more refined details with reference to local coordinates within a location, once the general longitude and latitude of the location are known. The digital twin can be a virtual object of a real object. The digital twin can be a virtual location (environment) of a real location (environment). The digital twin of the real world can be virtual objects and virtual locations (environment) of the real objects and real locations (environment).
The digital twin of an object can also mirror a pose of the real twin of the object, the direction where the real twin is facing in the real world. In an example, the direction can include pitch (gyroscopes and/or accelerometers), yaw (compass), and roll (gyroscopes and/or accelerometers). In another example coordinate scheme for the pose, the direction can include azimuth directions (compass) and elevation directions. The pose can be in real world coordinates of the real world. The digital twin can also mirror a state or other characteristics of the real twin. In some examples, the digital twin is continuously updated to reflect the real twin. In some examples, the real twin is continuously updated to reflect changes made to the digital twin. The digital twin can include the last known attribute of the real twin. Reference to “known” can include the last known attribute of the real object being stored in memory or a database, such as an object database or a map database.
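As a non-limiting illustration of a pose expressed as a facing direction, the sketch below converts a yaw (compass heading) and a pitch (elevation) into a unit direction vector. The east-north-up axis convention, with yaw measured clockwise from north, is an assumption, and roll is omitted because it does not change the facing direction. The name `facing_direction` is hypothetical.

```python
import math
from typing import Tuple


def facing_direction(yaw_deg: float, pitch_deg: float) -> Tuple[float, float, float]:
    """Unit vector (east, north, up) for a compass heading and an elevation angle."""
    yaw, pitch = math.radians(yaw_deg), math.radians(pitch_deg)
    east = math.cos(pitch) * math.sin(yaw)
    north = math.cos(pitch) * math.cos(yaw)
    up = math.sin(pitch)
    return (east, north, up)


# Example: an object facing due east and level with the horizon.
direction = facing_direction(yaw_deg=90.0, pitch_deg=0.0)
```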
An example of the localization method can be applied to the scenario of convenient and automated mapping of objects in the real world. A camera device can be used to capture one or more images of an object, and generate an object label of the object, a bounding box of the object in each image, and feature points in each image. The localization method can be used to calculate a cuboid in 3D space of the object and a centroid of the cuboid. The localization method can be used to generate, using the feature points and the cuboid: anchor points in 3D space which are contained in (overlap with) the cuboid in 3D space. The camera device can send the object label and location of the centroid without sending an entire 3D model (or point cloud map). When the object is moved, the localization method can track the movement of the centroid in order to determine the new location of the object. A new virtual map can be generated that takes into account the new location of the object. In some example embodiments, the point cloud is initially captured by the camera device for localization of the object, and afterwards the centroid can be used for localization. An advantage of the localization method is that 3D models (and point clouds) are not required to be transmitted by the camera device to generate 3D maps when there is a change in the location of the real object.
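As a non-limiting illustration of the reduced data transfer, the sketch below compares the size of a message carrying only an object label and a centroid with the raw size of an uncolored point cloud. The JSON encoding, the field names, the 100,000 point count, and the 4 bytes per coordinate are hypothetical figures chosen for the comparison.

```python
import json
from typing import Tuple


def centroid_message(label: str, centroid: Tuple[float, float, float]) -> bytes:
    """Encode the object label and centroid as a compact JSON message (assumed format)."""
    return json.dumps({"object_label": label, "centroid": list(centroid)}).encode("utf-8")


def point_cloud_size_bytes(num_points: int, bytes_per_coordinate: int = 4) -> int:
    """Raw size of an uncolored point cloud: three coordinates per point."""
    return num_points * 3 * bytes_per_coordinate


message = centroid_message("chair-17", (2.0, 3.0, 0.45))
cloud_bytes = point_cloud_size_bytes(100_000)
print(f"centroid message: {len(message)} bytes; point cloud: {cloud_bytes} bytes")
```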
In example embodiments, the virtual map is used in an extended reality (XR) platform (e.g., virtual reality (VR), augmented reality (AR) or mixed reality (MR) platform) in order to know where each object is located in the virtual map. The XR platform can include an immersive 3D map in which the user views or navigates the 3D map from a point of view (POV) within the 3D map.
In some examples, the pose of the object is determined by the localization method, for example the direction in the real world where the object is facing. The generated 3D map can also represent the pose of the object, and any changes in the pose of the object. In example embodiments, the virtual map in the XR platform includes the pose of the object in the virtual 3D map. An advantage of the localization method is that 3D models (and point clouds) are not required to be transmitted by the camera device to generate 3D maps when there is a change in the pose of the real object.
An example of the localization method can be applied to the scenario of event venue setup, in which real objects (stage, seating, tables, audiovisual equipment, etc.) are required to be set up at particular locations. When a particular real object is moved, the event organizers may desire to track the real object in real-time and the digital twin version of the real object, using the camera device to capture images of the object. As well, event organizers may wish to have a convenient XR (e.g., AR or MR) platform in order to know where each object should be located for event setup, by pointing the camera device at the venue so that the camera device displays the placement of the objects in AR or MR. As well, the localization method can be used to identify where particular people should be located at the scheduled times during the event. In another example, the virtual 3D map (including the objects) prevails over the real world. When a real object location differs from the virtual object location, an instruction is sent to the user, a machine, or the object for the real object to return to the desired location in order to match the virtual object location. The localization method can also be used to track the people or performers of the event, who may be required to be at particular locations (which may differ, for example, at particular times).
An example of the localization method can be applied in the field of robotics, for example to the scenario of industrial automation operations or manufacturing. For example, objects such as the component parts of the product being manufactured, and assembly devices (such as robots which can include robotic arms and robotic machines) can be required to be at particular locations at particular times in order to perform the operations or manufacturing. When a particular object or robot is moved, manufacturers may desire to track the real object in real-time and the digital twin version of the real object, using the camera device to capture images of the object. As well, manufacturers may wish to have a convenient XR platform in order to know where each object should be located for setup of the industrial automation operations or manufacturing.
An example of the localization method can be applied to the scenario of gaming and sports. For example, objects such as the game elements and individual participants (players or athletes) can be tracked for purposes of 3D mapping and analytics. When a particular object is moved, the physical object can be tracked in real-time and the digital twin version of the physical object can be visualized in the 3D map, by using the camera device to capture images of the object. Game and sport outcomes can be automatically calculated based on the particular location of the object (e.g., a ball going into a goal, a player at a finish line). As well, the players, viewers and coaches may wish to have a convenient XR platform in order to know the location of each object. Similar applications can apply to other sports, gaming and gambling applications.
An example of the localization method can be applied to the scenario of road mapping, such as for autonomous vehicle driving or GPS mapping purposes. For example, real-time 3D maps of the real world are required for accurate and safe navigation for autonomous driving or GPS navigation. For example, objects such as the traffic lights, obstacles, lane markings, and other vehicles may be required to be tracked in order to perform accurate mapping and navigation. When a particular object is moved, the real-time mapping for the vehicle operation may require the tracking of the real object and the digital twin version of the real object, in real-time. A camera device can be used to capture images of the object for the real-time mapping purposes. The camera device can be onboard the autonomous vehicle in some examples, such as integrated or mounted at the dashboard. The camera device can be off the autonomous vehicle in some examples, for example as a street level camera, aerial satellite cameras, or drone cameras, etc.
In another example embodiment, the camera device generates the location of the camera device, also known as camera tracking. For example, the camera device can be used to capture an image of a stationary real object with a known location and pose in real world coordinates. A pose of the stationary real object is determined from the one or more images. Using the image and the pose of the stationary real object, and a known cuboid of the stationary real object, a location of the camera device is generated using the localization method. In any of the above-described example scenarios, the location of the camera device can be tracked for interactive and immersive 3D applications, for example to show the point of view (POV) from the camera device or to show the virtual camera device in the virtual world as an avatar.
Another example embodiment is a localization method, comprising: receiving at least one image which includes an object; generating for each image, using a positioning module: a camera location in real world coordinates of real 3-Dimensional (3D) space, a camera orientation, and a camera distance to the object; generating, using an image 2D object detection module and each image: i) an object label of the object detected in that image, ii) a bounding box of the object in that image, and iii) feature points in that image; generating, using a cuboid generator, the bounding box for each image, the camera location for each image, the camera orientation for each image, and the camera distance to the object for each image: a cuboid in the real world coordinates of the real 3D space which bounds the object in the real world coordinates of the real 3D space; generating, using an anchor point generator, the feature points of the at least one image, and the cuboid: anchor points in the real world coordinates of the real 3D space of the object which are contained in the cuboid; and generating, using the object label, the anchor points, and at least one of the cuboid in the real world coordinates of the real 3D space, a centroid of the cuboid, or the bounding box of the object with at least one of the images: a 3D map which includes the object located in the real world coordinates in a virtual 3D space.
Another example embodiment is a localization method, comprising: receiving at least one image which includes an object; generating for each image, using a positioning module: a camera location in real world coordinates of real 3-Dimensional (3D) space, a camera orientation, and a camera distance to the object; generating, using an image 2D object detection module and each image: i) an object label of the object detected in that image, ii) a bounding box of the object in that image, and iii) feature points in that image; generating, using a cuboid generator, the bounding box for each image, the camera location for each image, the camera orientation for each image, and the camera distance to the object for each image: a cuboid in the real world coordinates of the real 3D space which bounds the object in the real world coordinates of the real 3D space; generating, using an anchor point generator, the feature points of the at least one image, and the cuboid: anchor points in the real world coordinates of the real 3D space of the object which are contained in the cuboid; and outputting the object label, the anchor points, and at least one of the cuboid in the real world coordinates of the real 3D space, a centroid of the cuboid, or the bounding box of the object with at least one of the images, for generating a 3D map which includes the object located in the real world coordinates in a virtual 3D space.
Another example embodiment is a localization method, comprising: receiving at least one image which includes an object; generating for each image, using a positioning module: a camera location in real world coordinates of real 3-Dimensional (3D) space, a camera orientation, and a camera distance to the object; generating, using a scanner or photogrammetry: a point cloud map of the object; generating, using an image 2D object detection module and each image: i) an object label of the object detected in that image, ii) a bounding box of the object in that image, and iii) feature points in that image; generating, using a cuboid generator, the bounding box for each image, the camera location for each image, the camera orientation for each image, and the camera distance to the object for each image: a cuboid in the real world coordinates of the real 3D space which bounds the object in the real 3D space; generating, using an anchor point generator, the feature points of the at least one image, and the cuboid: anchor points in the real world coordinates of the real 3D space of the object which are contained in the cuboid; storing in an object database: the object label, the cuboid, the anchor points, and at least one of the point cloud map or a 3D model of the object generated from the point cloud map; and storing in a map database: the object label, the cuboid, and the anchor points.
Another example embodiment is a localization method, comprising: receiving an image which includes a stationary real object; generating, using an image 2D object detection module and the image: i) an object label of the stationary real object detected in that image, and ii) a bounding box of the stationary real object in that image; generating, using a pose estimator module, the image, the object label and the bounding box: a point of view pose of the stationary real object; retrieving, using the object label: a known cuboid in real world coordinates of real 3-Dimensional (3D) space of the stationary real object and a known pose in the real world coordinates of the real 3D space of the stationary real object; and generating, using a positioning module, the image, the object label, the bounding box, the point of view pose, the known cuboid in the real world coordinates of the real 3D space, and the known pose in the real world coordinates of the real 3D space: a camera location in the real world coordinates of the real 3D space, and a camera orientation in the real world coordinates of the real 3D space.
As shown in
The mapping device 108 can be used to generate maps, such as virtual 3D maps that represent real maps of the real world. For example, real world coordinates of the real world (e.g. planet Earth) in real 3D space are represented (matched) by the virtual world coordinates (i.e. the same real world coordinates) in virtual 3D space. The mapping device 108 can be used to perform aspects of the localization method, including localization of objects in the real world. The mapping device 108 can be used to perform localization of the camera device 104 and tracking of the movements of the camera device 104. The mapping device generates virtual 3D maps of the real world that can include the objects and the camera device 104 in virtual coordinates that match the real coordinates (e.g., the virtual coordinates in the virtual world are equivalent to the real coordinates of the real world). As shown in
In examples, the camera device 104, the display device 106, and the mapping device 108 can communicate over communication links and communication sessions. The communication links can be wireless or wired. In an example, each of the communication links can include a WebSocket protocol to provide continuous two-way communication.
In an example, the mapping device 108 is part of the camera device 104. In an example, the mapping device 108 is part of the display device 106. In an example, the mapping device 108 is a standalone device which is separate from the camera device 104 and the display device 106.
In some examples, the virtual 3D map is crowdsourced in that multiple camera devices 104 can send information to the mapping device 108 regarding the objects for generating updates of the location of the objects in the virtual 3D map. In an example, there is real-time collaboration between the camera devices 104 to perform the localization method. In an example, the real-time collaboration can be performed in a peer-to-peer manner, without a processing server, such as in the case when the functions of the mapping device 108 are performed by one or more of the camera devices 104. In an example, the real-time collaboration can be performed with a client/server relationship, for example the mapping device 108 is a server that communicates with the camera devices 104.
In an example, the camera device 104 is a single stationary camera or multiple stationary cameras located at different locations in a same real 3D space of the real world. For example, the localization method can perform depth estimation using the stationary camera or multiple stationary cameras. The localization method can synchronize the stationary cameras to estimate depth information by comparing frames of a same scene. In an example scenario, the stationary cameras are mounted on a ceiling of a room. In an example scenario, the stationary cameras are mounted at two or more corners of a room. In some examples, such stationary cameras have features such as pan, zoom, and/or tilt in order to provide additional images with different depth information for performing examples of the localization method.
In
As illustrated in
The feature points are features detected in the image by the camera device 104, represented by the white circles. Feature points, also known as feature edge points, Kanade-Lucas-Tomasi (KLT) corners or Harris corners, are identified visual features of particular edges detected from the image 102. In an example, Google ARCore is used to generate the feature points. As shown in
Referring to
In example embodiments, using the cuboid 206, the camera device 104 can generate or calculate the centroid 208 of the cuboid 206. The centroid 208 is the center of the cuboid 206 in 3D space. The centroid 208 can be represented by real world coordinates and the same virtual world coordinates. In example embodiments, using the centroid 208 and the object label, the localization method can generate a 3D map which includes the virtual object 202 in the exact same (within suitable thresholds) virtual location as the real object 202 in the real world. For example, the display device 106 can be used to view and interact with the virtual object 202 in the 3D map in real-time. The centroid 208 can be compared with the last known centroid 208 of the virtual object 202, and the 3D map of the virtual object 202 can be generated or updated accordingly. A distance threshold for the movement of the centroid 208 can be used in some examples to determine whether the real object 202 has actually moved, in which the distance threshold can vary depending on the application, the size of the real object 202, or the particular real world environment.
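As a non-limiting illustration, the centroid calculation and the distance-threshold test can be sketched in Python as follows. The sketch assumes the cuboid is described by two opposite diagonal corners in real world coordinates; the function names and the threshold value of 0.1 (in the same units as the coordinates) are hypothetical.

```python
import math


def cuboid_centroid(corner_a, corner_b):
    """Center of a cuboid given two opposite diagonal corners (x, y, z)."""
    return tuple((a + b) / 2.0 for a, b in zip(corner_a, corner_b))


def has_moved(last_centroid, new_centroid, distance_threshold):
    """True when the centroid displacement exceeds the distance threshold."""
    return math.dist(last_centroid, new_centroid) > distance_threshold


last_known = cuboid_centroid((0.0, 0.0, 0.0), (2.0, 4.0, 6.0))
observed = cuboid_centroid((0.0, 0.0, 0.1), (2.0, 4.0, 6.0))
print(has_moved(last_known, observed, distance_threshold=0.1))
```

A small shift in one corner moves the centroid by only half that amount, so the example stays within the threshold and the object is treated as not having moved.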
In some examples, the cuboid 206 can be defined for consistency purposes such that the front of the object 202 is a front face of the cuboid 206 perpendicular to horizontal ground, and a bottom of the object 202 is a bottom face of the cuboid 206 parallel to horizontal ground. In other examples, arbitrary faces of the object 202 are used to generate the cuboid 206. For example, different cuboids 206 at various different angles for the same object 202 can have a centroid 208 that does not substantially vary for practical purposes of determining whether the known centroid 208 of the real object 202 has moved (or not). For example, any variances in defining the cuboid 206 and the associated centroid 208 would still have the centroid 208 within the distance threshold in the situation when the object 202 has not moved.
In some examples, the camera device 104 is used to generate anchor points (also known as anchors) in 3D space of the feature points. An anchor point is in real world coordinates (of real 3D space of the real world) for a particular feature point of an image 102. In some examples, one anchor point represents overlapping feature points that were generated from different images 102, because those feature points were actually the same feature taken from different orientations of the camera device 104. An anchor point can be represented as a sphere (e.g. a white sphere) or a dot in 3D space. In an example, the anchor points can be generated by the camera device 104 using ARAnchorManager from AR Foundation. In some examples, each anchor point of the object is individually trackable. In examples, movements of the object, or a part of the object, can be tracked using the anchor points.
In an example, tracking an object includes generating a path in 3D space of where the object has moved. In an example, tracking an object also includes generating a direction of movement of the object. In an example, tracking an object also includes generating a speed of movement of the object (e.g., current speed and/or average speed). In an example, tracking can be used by the mapping module 114 or the localization method to predict a future location of the object.
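As a non-limiting illustration, the tracking quantities named above (path, direction, current speed, average speed, and a predicted future location) can be sketched in Python from time-stamped centroid observations. The function names are hypothetical, at least two observations with distinct timestamps are assumed, and the constant-velocity prediction is one simple assumption rather than a technique stated in the description.

```python
import math


def track(samples):
    """Summarize motion from time-ordered (t_seconds, (x, y, z)) samples."""
    path = [p for _, p in samples]
    (t0, p0), (t1, p1) = samples[-2], samples[-1]
    step = [b - a for a, b in zip(p0, p1)]
    dist = math.sqrt(sum(c * c for c in step))
    # Unit vector of the most recent movement (zero vector if stationary).
    direction = tuple(c / dist for c in step) if dist > 0 else (0.0, 0.0, 0.0)
    current_speed = dist / (t1 - t0)
    # Average speed over the whole path length and elapsed time.
    total = sum(math.dist(a, b) for a, b in zip(path, path[1:]))
    average_speed = total / (samples[-1][0] - samples[0][0])
    return path, direction, current_speed, average_speed


def predict(last_point, direction, speed, dt):
    """Constant-velocity extrapolation of the location dt seconds ahead."""
    return tuple(p + d * speed * dt for p, d in zip(last_point, direction))


samples = [(0.0, (0.0, 0.0, 0.0)), (1.0, (1.0, 0.0, 0.0)), (2.0, (3.0, 0.0, 0.0))]
path, direction, current_speed, average_speed = track(samples)
print(direction, current_speed, average_speed)
print(predict(path[-1], direction, current_speed, dt=1.0))
```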
In some examples, the camera device 104 is used to determine a pose of the object 202. The pose is the direction or orientation where the front of the real object 202 is facing in the real world. In an example, the pose can include pitch, yaw (compass), and roll. In other examples, the pose can include azimuth directions (compass) and elevation directions.
For example, the camera device 104 can generate front identifying information from one or more of the images 102, which is described in greater detail herein, e.g. as in
Example embodiments of determining the real pose are generally described herein in relation to identifying the front (front face) of the object 202. It would be understood that faces other than the front face can be used to determine the pose, for example the rear face (e.g. when following behind an object such as a mobile robot), top face (e.g. when the camera device 104 is overhead), or bottom face (e.g., tracking aerial vehicles from the ground during takeoff, in flight, and landing), etc.
In some examples, the camera device 104 is used to determine a point of view (POV) pose and POV distance from the POV of the camera device 104 to the object 202. For example, the POV pose is the pose from the POV of the camera device 104 when the camera device 104 captured the image 102, and can include pitch, yaw (compass), and roll. In examples, the POV pose is in relation to coordinates of the camera, e.g. directly forward from the camera is zero degrees. For example, the POV distance is the distance between the object and the camera device 104 when the camera device 104 captured the image 102. For example, the distance is measured between the camera device 104 and a perpendicular plane (bounding box) at the closest point of the object 202 to the camera device 104.
In some examples, the anchor points within the bounding box are used to generate the POV pose of the object 202 from the POV of the camera device 104. The camera device 104 can generate, using the location and orientation of the camera device, the pose of the object 202 in the real world which is then used by the mapping device 108 for generating the virtual pose of the virtual object in the virtual map.
In some examples, the camera device 104 captures the images 102 using video capture. A video can include a plurality of video frames, which are the images 102. For example, a user or a machine can activate a video record function of the camera device 104 and move the camera device 104 to the first location, the second location, and the third location (and/or other locations). The video can then be used by extracting the images 102 (video frames), which are then used by the localization method to perform localization of the object 202. The video can be recorded and then processed by the localization method at a later time, or can be processed in real-time. In some examples, audio from the video can be used to assist the localization method in generating the object label, for example animal noises, human voices, vehicles, etc.
The coordinate system can include the camera 3D coordinate system (shown in dashed lines in
The coordinate alignment is as follows: X world=X camera; Y world=Z camera; Z world=Y camera.
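As a non-limiting illustration, the axis alignment stated above can be sketched in Python as a relabeling of the Y and Z axes. The function name is hypothetical, and the sketch covers only the axis relabeling; any translation (such as a camera height offset) or rotation is out of scope.

```python
def camera_to_world_axes(p_camera):
    """Relabel camera axes to world axes:
    X world = X camera, Y world = Z camera, Z world = Y camera."""
    x, y, z = p_camera
    return (x, z, y)


print(camera_to_world_axes((1.0, 2.0, 3.0)))
```

Because the mapping only swaps two axes, applying it twice returns the original coordinates, so the same function also maps world axes back to camera axes.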
The transToWorld is a 4×4 matrix represented as:
In the 4×4 matrix, variable l_camToFloor represents camera height, and _arCameraManager is the ARCamera component managed by ARFoundation's ARCameraManager.
In other examples, e.g. when using software other than ARFoundation, the transToWorld is a 4×4 matrix represented as:
In an example, the camera height is calculated (generated) by pointing the camera device 104 (e.g., shooting a ray from the center of the screen or the camera 522) towards a detected plane on the floor (horizontal ground). For example, the camera height is generated by ARPlaneManager from ARFoundation or other software.
Referring again to
The mapping device 108 sends the virtual 3D map to the display device(s) 106, which can display the virtual 3D map in an interactive fashion. In examples, the virtual 3D map is an immersive 3D virtual map, which means that the display 160 displays the 3D map from a point of view as if the user is within the 3D map. A user who views the 3D map on the display device 106 can navigate to the same virtual coordinates as the real coordinates of the real object 202, in order to view the virtual object 202. The virtual object 202 has the same virtual pose as the real pose of the real object 202 (e.g., pitch, yaw (relative to virtual compass direction such as virtual North), roll).
In an example embodiment of the localization method, the mapping module 114 upon receiving the object label, the anchor points, and the remaining information, can determine from the map database 112 if there is any change in the last known centroid (or cuboid) location or the last known pose of the object. If there is no change in the centroid (or cuboid) location or the pose, then the virtual 3D map does not need to be updated for the object 202. If there is a change in the centroid (or cuboid) location or the pose, then the virtual 3D map can be updated to accommodate such differences in the virtual 3D map, for display by the display devices 106.
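As a non-limiting illustration, the update decision described above can be sketched in Python with an in-memory dictionary standing in for the map database 112. The record layout, the function names, and the threshold values are hypothetical, and the pose is reduced to a single yaw angle in degrees for brevity.

```python
import math


def angle_diff_deg(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def update_map(map_db, label, centroid, yaw_deg,
               dist_thresh=0.1, yaw_thresh=5.0):
    """Store the new location and pose only if they differ from the last
    known values by more than the thresholds. Returns True if updated."""
    known = map_db.get(label)
    if known is not None:
        moved = math.dist(known["centroid"], centroid) > dist_thresh
        turned = angle_diff_deg(known["yaw_deg"], yaw_deg) > yaw_thresh
        if not (moved or turned):
            return False  # no change: the virtual 3D map is left as-is
    map_db[label] = {"centroid": tuple(centroid), "yaw_deg": yaw_deg}
    return True


map_db = {}
print(update_map(map_db, "chair", (0.0, 0.0, 0.0), 0.0))     # first sighting
print(update_map(map_db, "chair", (0.01, 0.0, 0.0), 358.0))  # within thresholds
print(update_map(map_db, "chair", (1.0, 0.0, 0.0), 0.0))     # object moved
```

The yaw comparison wraps around 360 degrees so that, for example, 358 degrees and 0 degrees are treated as 2 degrees apart.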
Therefore, referring again to
In some examples, the mapping device 108 compares the centroid or cuboid location and the pose with the last known centroid or cuboid location and pose of the object 202 as contained in the map database 112. In some examples, the mapping device 108 sends the new centroid or cuboid location and/or the pose to a 3D renderer on the mapping device 108 or the display device 106, which updates the location and/or the pose of the virtual object 202 in the 3D map.
It would be appreciated that, for the localization method at this stage, the camera device 104 does not need to send an entire point cloud map of the real object to the mapping device 108 or the display device 106.
In some examples, a 3D model (e.g. full point cloud) of the 3D object 202 can be initially generated by using the camera device 104, which is described in greater detail in relation to
In some examples, the mapping module 114 and the 3D model generator 116 are combined in the same module.
The memory 302 can be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 302 may store programs such as a mapping module 114 and a 3D model generator 116. The memory 302 can be a non-transitory memory. The modules, e.g. mapping module 114 and 3D model generator 116, can include software stored in the memory 302, hardware, or a combination of software and hardware. The memory 302 can store one or more databases, such as the object database 110 and the map database 112. The object database 110 can store, for each object: 1) an object label, 2) a 3D model, 3) front identifying information, 4) anchor points, 5) cuboid of the object, 6) centroid of the object. In examples, these virtual data stored in the object database 110 can also be called “known” data. The 3D model is a virtual object that represents a respective real object. In examples, the 3D model includes a point cloud, which is a virtual object representation of a real object. The object database 110 can store a cuboid of the object and/or a centroid of the cuboid. In some examples, the cuboid is in 3D space in coordinates of the real world (e.g., using opposite diagonal coordinates). In some examples, the cuboid is not in coordinates of the real world but rather is the dimensions of the cuboid, with one face of the cuboid being indicated as the front face using the front identifying information. Similarly, in an example, the centroid of the cuboid can be in 3D space in coordinates of the real world. The centroid can also be defined in relation to the cuboid (e.g. relative to one corner coordinate of the cuboid), not in real world coordinates. The anchor points also contain pertinent information regarding the object location, size, and pose, and can be used by the mapping module 114 to generate any changes in the object location and/or pose. 
For example, upon receipt of anchor points of an object from the camera device 104, the mapping module 114 can compare the anchor points to known anchor points (for example using a partial amount or all of the anchor points of the object). In an example, such as when the cuboid or centroid is not received by the mapping module 114, the mapping module 114 can generate the cuboid in 3D space of the object using the received anchor points, for example by generating the cuboid in 3D space that tightly bounds the anchor points. The mapping module 114 can also generate (e.g. calculate) the centroid from the cuboid.
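As a non-limiting illustration, generating a cuboid that tightly bounds received anchor points, and the centroid of that cuboid, can be sketched in Python as follows. The sketch assumes an axis-aligned cuboid described by its minimum and maximum corners, which is a simplification; the description does not limit the cuboid to being axis-aligned, and the function names are hypothetical.

```python
def bounding_cuboid(anchor_points):
    """Axis-aligned cuboid (min corner, max corner) that tightly bounds
    a non-empty collection of (x, y, z) anchor points."""
    xs, ys, zs = zip(*anchor_points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def centroid_of(corner_min, corner_max):
    """Center of the cuboid, midway between its opposite corners."""
    return tuple((lo + hi) / 2.0 for lo, hi in zip(corner_min, corner_max))


anchors = [(0.0, 0.0, 0.0), (2.0, 1.0, 0.0), (1.0, 3.0, 4.0)]
corner_min, corner_max = bounding_cuboid(anchors)
print(corner_min, corner_max, centroid_of(corner_min, corner_max))
```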
In an example, the mapping module 114 can use the object label to retrieve a known cuboid of the object and, using the anchor points and/or the cuboid received from the camera device 104, can determine whether the known cuboid has moved. The mapping module 114 can use the object label to retrieve a known centroid of the object and, using the anchor points and/or the centroid received from the camera device 104 (or calculating the centroid from a received cuboid), can determine whether the known centroid has moved. The mapping module 114 can use the object label to retrieve a known pose of the object and, using the anchor points and/or the pose received from the camera device 104, can determine whether the known pose has changed.
In some examples, the object database 110 stores the known anchor points in 3D space of each object.
In some examples, the object database 110 stores front identifying information of each object. The front identifying information can be used to determine the pose of the object. In an example, the front identifying information can be a particular POV of the 3D model of the object. Another example of the front identifying information is an image of the front of the object. Another example of the front identifying information is a 3D model or point cloud map of only the front of the object. Another example of the front identifying information is descriptive text of the front of the object, e.g. “face” or “nose” for a human, or “hood” for a car. In an example, the front identifying information can be the identification of particular anchor points of the front of the object. In other examples, faces other than the front face are stored in the object database 110 and can be used to generate the orientation, for example back face, top face, bottom face, left face, or right face, etc. Example embodiments are described in respect to the front face where indicated, for clarity and convenience.
The object label is the name of the object. In some examples, every real object in the real world that is being captured by the localization method has a unique object label (and associated 3D model) in the object database 110. In some examples, an object label and 3D model can be used for more than one real world object, such as a row of identical chairs for purposes of event planning in an event venue. In the latter example, the swapping of two identical chairs may not need to be reflected in the virtual map of the virtual world, and such real objects (e.g. chairs) are given the same object label of “chair” and the same 3D model. However, in other examples the identical chairs can have unique object labels and have respective unique 3D models which are stored in the object database 110, for individual tracking by the localization method. In an example, such unique object labels can be stored in metadata in the object database 110 and the map database 112. In an example, the metadata includes a timestamp as to when the image of the object was captured by the camera device 104. The metadata can also be used for communication messages between any of the camera device 104, the mapping device 108, and the display device 106.
The object database 110 and the map database 112 can also store the location and orientation (POV) of the camera device 104. For example, the 3D map can include a 3D model of the virtual camera device 104, with the same location and orientation as the real camera device 104. The 3D model can be a marker, an avatar, or a point cloud that resembles the camera device 104 or resembles the user.
The map database 112 can also store historical information of the location and pose information, for example, the location and pose of each object in the real world and virtual world at different times, also known as tracking. The historical information can include the location and orientation of the camera device 104. The historical information can include the particular POV directions displayed by the display device(s) 106 of the virtual 3D map, at different times. The map database 112 can store the known cuboid, centroid, and anchor points of the object (in 3D space of real world coordinates).
The map database 112 and the object database 110 can be the same database in some examples, not shown. The pose can also be stored in the object database 110 in some examples.
The processor 304 can be a general central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), a Tensor Processing Unit (TPU), or one or more integrated circuits. The processor 304 may be an integrated circuit chip with a signal processing capability. In an implementation process, steps of the localization method as described herein can be performed by an integrated logical circuit in a form of hardware or by an instruction in a form of software in the processor 304. In addition, the processor 304 can be a general purpose processor, a digital signal processor (DSP), an ASIC, a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or a transistor logic device, or a discrete hardware assembly. The processor 304 can implement or execute the methods, steps, and logical block diagrams that are described in example embodiments. The general purpose processor can be a microprocessor, or the processor may be any conventional processor or the like. The steps of the localization method described with reference to the example embodiments may be directly performed by a hardware decoding processor, or may be performed by using a combination of hardware in the decoding processor and a software module. The software module may be located in a mature storage medium in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 302. The processor 304 reads information from the memory 302, and completes, by using hardware in the processor 304, some or all of the steps of the localization method.
The mapping module 114 and the 3D model generator 116 can each include a machine learning model. In some examples, the machine learning model includes one or more NNs. In some examples, the NNs include CNNs. In some examples, the mapping module 114 and the 3D model generator 116 use rules based models. In some examples, the mapping module 114 and the 3D model generator 116 respond to manual feedback from a user.
The communications interface 306 implements communication between the mapping device 108 and another device or communications network by using wireless or wired communication. In some examples, training data may be obtained by using the communications interface 306.
In an example, the processor 304, with the communications interface 306, executes a live data application program interface (API) to communicate with feeds and third party services e.g. updated maps of the real world, updated libraries of 3D models of objects.
It should be noted that, although the memory 302, the processor 304, and the communications interface 306 are shown in the mapping device 108 in
In some examples, the mapping device 108 includes a server or a cloud server. In some examples, the mapping device 108 includes third party machine learning processing services and cloud storage such as Amazon Web Services (AWS)™, Microsoft Azure™, and Google Cloud™. For example, the object database 110 and/or the map database 112 can be on a separate storage server or the cloud storage. In other examples, not shown, the mapping module 114 or the 3D model generator 116 is executed by the camera device 104, the display device 106, or other devices.
The example display device 106 in
The 3D display module 410 can receive, from the mapping device 108, the 3D map for display on the display 412. The particular format of the 3D map can be a 3D model, a point cloud map, or data for the 3D display module 410 to reconstruct the 3D map (which can include retrieving 3D object models or data from other sources or the object database 110). In some examples, the data received from the mapping device 108 can include a change in a virtual object, such as the location or the pose, and the 3D display module 410 can move the virtual object in the virtual world based on the data. Therefore, in some examples, entire point cloud maps do not need to be transmitted to the display device 106 for such changes.
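As a minimal sketch of how such a change could be applied without resending the point cloud map, the following Python example assumes a hypothetical update message that carries only the object label, the new location, and the new pose. The names `VirtualObject`, `ObjectUpdate`, and `apply_update` are illustrative assumptions and are not part of the described embodiments.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class VirtualObject:
    location: Vec3  # virtual world coordinates
    pose: Vec3      # (yaw, pitch, roll) in degrees


@dataclass
class ObjectUpdate:
    # Hypothetical message: only the change is sent, not the 3D model.
    object_label: str
    location: Vec3
    pose: Vec3


def apply_update(scene: Dict[str, VirtualObject], update: ObjectUpdate) -> None:
    """Move an existing virtual object in the virtual world."""
    obj = scene[update.object_label]
    obj.location = update.location
    obj.pose = update.pose


scene = {"chair-001": VirtualObject((0.0, 0.0, 0.0), (0.0, 0.0, 0.0))}
apply_update(scene, ObjectUpdate("chair-001", (1.5, 2.0, 0.0), (90.0, 0.0, 0.0)))
```

In this sketch the display side already holds the 3D model of the object, so the update is a few numbers rather than an entire point cloud map.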
In some examples, the 3D display module 410 is executed by a particular 3D video platform, such as a mobile platform, streaming platform, web platform, gaming platform, application plug-ins, etc. The display device 106 can include input/output (I/O) interface devices 408 for interacting with the user. For example, the user can manually interact with and manipulate the 3D map.
In an example embodiment, the display 412 is a computer monitor. For example, a rectilinear subset view (frame) of a 3D map which includes the one or more virtual objects is output to the computer monitor and the user can change the view being displayed on the display 412 using an input device (one of the I/O interface devices 408). In some examples, the display device 106 receives the 3D map from the mapping device 108 and displays the particular rectilinear subset view (frame) on the display 412. In other examples, the display device 106 only receives the particular rectilinear subset view (frame) from the mapping device 108 that is the particular POV of the display 412.
In accordance with an example embodiment, the display 412 is a 360-degree display. For example, as shown in
In accordance with an example embodiment, the user can use the input device to virtually navigate to the virtual location of the virtual object. The virtual location has virtual world coordinates that correspond to real world coordinates in the world (e.g. planet Earth). In some examples, the user can walk while wearing the display device 106 to change virtual locations. In some examples, the user can walk on a 360-degree VR treadmill (not shown here) to change virtual locations. The particular POV of the 3D map is displayed on the display 412 depending on the virtual location and virtual orientation of the display device 106 in relation to the virtual 3D map, e.g. in relation to the virtual location and virtual pose of the virtual object.
In accordance with an example embodiment, the display device 106 can be used for augmented reality (AR). For example, Google ARCore can be used. In an example of AR, the user can physically move to the same real location as the real object. The display 412 shows or acknowledges the virtual object when the display device 106 faces the real object in the real world. Additional augmenting information can be displayed by the display 412 on the 360-degree display, which overlays or modifies the object on the 360-degree display or the rectilinear subset view of the 3D map. In some examples, for example for AR, the camera device 104 is the same device as the display device 106 (and sometimes also the same device as the mapping device 108).
Another example of the 360-degree display is a room with projections on a curved wall or multiple walls, also known as “cave projection”. The user is immersed in the room and can see different POV angles by turning their head or controlling the desired view to the front of the room. In various examples, the POV of the projection can be selected (navigated) through the input device or can be the same POV as the location of the camera device 104. The cave projection can also display coordinates of the particular location of the POV being displayed, for example using real world coordinates of latitude and longitude (and optionally height).
In examples, the camera device 104 includes sensors 520 which are used to detect information from the real world. In an example, the sensors 520 can be used to determine a location and an orientation (e.g., pitch, roll, yaw) of the camera device 104. The sensors 520 can include: global positioning system (GPS), local positioning system (LPS), range detector or scanner such as LiDAR to determine the camera distance to objects or points of the objects, barometric pressure sensor to determine a height (Z axis) of the camera device 104, compass to determine orientation of the camera device 104 in relation to North, and/or accelerometers to determine orientation of the camera device 104. The GPS and/or the LPS can be used to generate the location of the camera device 104. The range detector can be used to determine a distance between the camera device 104 and the object being captured by the camera 522.
The range detector such as LiDAR can be used by the camera device 104 to determine the camera distance to objects or points of the objects. For example, the distance is measured between the camera device 104 and a perpendicular plane (bounding box) at the closest point of the object to the camera device 104.
In some examples, the range detector is not required by the camera device 104 and the distance between the camera device 104 and the object can instead be generated using photogrammetry, such as: knowing the size or cuboid size of the object using the object database 110, and/or calculating a vanishing point of the object. In some examples, Google ARCore can be used. In some examples, a combination of photogrammetry and at least one of the sensors 520 can be used by the positioning module 518 to determine the distance.
In some examples, Google ARCore can be used to determine the distance from the camera device 104 to the particular feature points in the image (e.g., for purposes of generating the anchor points in 3D space of the feature points). In some examples, the range detector can be used to determine the distance from the camera device 104 to the particular feature points in the image.
The positioning module 518 can be used to generate a location and orientation of the camera device 104. In some examples, the positioning module 518 uses data from the sensors 520. In some examples, the positioning module 518 uses data from the GPS and/or the LPS. In some examples, the user is localized and presumed to be the same location and optionally the same orientation as the camera device 104, i.e., the user is holding the camera device 104 and optionally the user's pose is the same as the camera device 104.
In another example, the location and orientation of the camera device 104 can be generated without using the GPS, the LPS, and/or the compass (or any sensors 520 other than the camera 522). For example, the location and pose of a stationary real object is known from the map database 112. The camera device 104 can capture one or more images of the stationary real object. The camera device 104 can generate a POV distance and a POV pose between the camera device 104 and the stationary real object. The camera device 104 can generate, using the known location and known pose of the stationary real object, and the POV distance and POV pose: the real location and real orientation of the camera device 104.
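The following Python sketch illustrates one way this calculation could look in a simplified horizontal-plane case. It is not the described implementation. It assumes coordinates are (east, north), yaw is measured in degrees clockwise from North, the object is centered on the camera's optical axis, and the POV yaw is the object's yaw relative to the camera heading. The function name `localize_camera` is hypothetical.

```python
import math


def localize_camera(obj_xy, obj_yaw_deg, pov_distance, pov_yaw_deg):
    """Estimate camera (east, north) location and yaw from a known object.

    Assumes pov_yaw_deg = object yaw minus camera yaw, so the camera yaw
    is recovered by subtracting the POV yaw from the known object yaw.
    """
    camera_yaw = (obj_yaw_deg - pov_yaw_deg) % 360.0
    heading = math.radians(camera_yaw)
    # Step back from the object along the camera's viewing direction.
    cam_east = obj_xy[0] - pov_distance * math.sin(heading)
    cam_north = obj_xy[1] - pov_distance * math.cos(heading)
    return (cam_east, cam_north), camera_yaw


# Known object at (10, 20) facing South (yaw 180), seen 5 units away with
# its front turned toward the camera (POV yaw 180).
camera_xy, camera_yaw = localize_camera((10.0, 20.0), 180.0, 5.0, 180.0)
```

In this example the camera is placed 5 units south of the object and faces North, which is consistent with the object's front facing the camera.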
In example embodiments, the memory 502 can store modules for execution by the processor 504, including: image 2D object detection module 510, cuboid generator 512, centroid generator 514, pose estimation module 516, positioning module 518, and anchor point generator 524. The modules can include software stored in the memory 502, hardware, or a combination of software and hardware. In some examples, the modules of the camera device 104 include machine learning models, which can include NNs such as CNNs. For example, the image 2D object detection module 510 can include an image 2D object detector model which includes a CNN. The pose estimation module 516 can include a pose estimation model which includes a CNN.
The cuboid generator 512 is used to generate a cuboid of the object in the images. The centroid generator 514 is configured to generate the centroid of the cuboid.
The anchor point generator 524 is used to generate anchor points in 3D space of the feature points, using the location and orientation of the camera device 104. For example the anchor points are generated in camera 3D coordinates, and are then transformed to real world 3D coordinates. The anchor points that overlap with the bounding box in real 3D space are used for the localization method, and the remaining anchor points can be removed.
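A minimal Python sketch of the two steps above follows. It assumes the camera orientation is available as a 3x3 rotation matrix, the camera location is a translation in real world coordinates, and the bounding box is axis-aligned in real 3D space. These are simplifying assumptions, and the function names are hypothetical.

```python
def to_world(p_cam, rotation, camera_location):
    """Transform a point from camera 3D coordinates to real world 3D coordinates."""
    return tuple(
        sum(rotation[r][c] * p_cam[c] for c in range(3)) + camera_location[r]
        for r in range(3)
    )


def inside_box(p, box_min, box_max):
    """True when the point lies within the axis-aligned box (assumption)."""
    return all(box_min[i] <= p[i] <= box_max[i] for i in range(3))


def anchor_points(feature_points_cam, rotation, camera_location, box_min, box_max):
    """Keep only the world-space anchor points that overlap the bounding box."""
    world = [to_world(p, rotation, camera_location) for p in feature_points_cam]
    return [p for p in world if inside_box(p, box_min, box_max)]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
kept = anchor_points(
    [(0.0, 0.0, 2.0), (5.0, 0.0, 2.0)],  # feature points in camera coordinates
    identity,                            # camera orientation (no rotation)
    (1.0, 1.0, 0.0),                     # camera location
    (0.0, 0.0, 0.0),                     # box minimum corner
    (2.0, 2.0, 3.0),                     # box maximum corner
)
```

Here the second feature point falls outside the box after transformation and is removed.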
In some examples, one or more of the modules are executed by other devices, such as the mapping device 108 or a cloud server.
In some examples, the camera 522 and the images captured by the camera 522 are part of a third party mapping service, such as Google Maps™. The images from the third party mapping service are received or parsed using an API or other software applications (and/or can be extracted from 3D views of the third party mapping service). In such examples, the third party mapping service typically also provides the 3D coordinates (at least longitude/latitude) and the particular orientation of the camera 522. The localization method for generating the location and pose of the object in real world coordinates can use photogrammetry from the images of the third party mapping service, along with the location of the object and the orientation of the camera 522, without requiring any of the sensors 520. The third party mapping service can be continuously parsed for the localization method to generate up-to-date 3D mapping information, which includes the location and pose of the object.
The applications and modules of the camera device 104, the mapping device 108, and the display device 106 can be used to perform aspects of the localization method in accordance with example embodiments.
In an example, the positioning module 518 may also include ARCore. ARCore includes a mobile augmented reality library that can be used for camera orientation estimation, which is readily available on most Android™ devices or smartphones. ARCore is a library by Google™, which uses the data from the inertial measurement unit (IMU) sensors (e.g. accelerometer, magnetometer, and gyroscope), along with image feature points for tracking the camera orientation of the camera device 104 utilizing a Simultaneous Localization and Mapping (SLAM) algorithm. ARCore can perform camera orientation estimation in real-time. In that regard, to track the motion of the camera device 104, an Android application (i.e. the positioning module 518) using ARCore can be developed in the Unity3D environment, the Unreal environment, or other interactive 3D environments, for capturing RGB images along with the real world location of the camera device 104. The positioning module 518 can generate or determine the location and the camera orientation of the camera device 104 in a 3D coordinate system in the real world for each image 102. The positioning module 518 can generate or determine the distance from the camera device 104 to an object in the 3D coordinate system in the real world for each image 102.
The LiDAR scanner can also be used as a range detector to detect a distance between the camera device 104 and the object. The sensors 520 can include a GPS or LPS to localize the camera device 104 in 3D coordinates of the real world, therefore generating the camera location. The sensors 520 can include a compass and/or accelerometers to generate an orientation of the camera device 104. The positioning module 518 generates, using the sensor information: i) camera location, ii) camera orientation, and iii) camera distance to object.
As an alternative to using some or all of the sensor information, in an example, the positioning module 518 can also localize the camera device 104 using photogrammetry from an image of a stationary real object. Such a localization method is described in relation to
In the localization method of
In an example, the object label is uniquely assigned to each object in the real world. For example, when the object is initially detected from the camera device 104, the object can be initially assigned a unique object label. If there are a plurality of similar objects, each of those objects can be assigned a unique object label.
In other examples, the object label can be specific to a particular model and be used for multiple objects, for example a row of the same model of chairs can have each chair assigned the same object label, and the swapping of two real chairs does not require any change to the virtual 3D map for the same two virtual chairs. In some examples, the object label may already exist in an object library (e.g. object database 110) and that object label is initially assigned to the object.
In the localization method of
The front identifying information can be used to determine the pose of the object. An example of the front identifying information is an image of the front of the object or a bounding box of the front of the object. Another example of the front identifying information is descriptive text of the front of the object, e.g. “face” or “nose” for a human, “hood” for a car. In other examples, other faces other than the front face are generated for purposes of generating the pose, for example back face, top face, bottom face, left face, or right face, etc. Another example of the front identifying information is the anchor points of the front of the object.
The pose estimation module 516 can include a CNN, for example as described in relation to
In the localization method of
In some examples, the cuboid can be defined for consistency purposes such that the front of the object is a front face of the cuboid perpendicular to horizontal ground, and a bottom of the object is a bottom face of the cuboid parallel to horizontal ground. Other definitions and landmarks can be used to define the cuboid. In other examples, arbitrary faces of the object are used to define the cuboid.
Examples of the cuboid generator 512 are described herein in relation to
In the localization method of
In other examples, the centroid generator 514 includes a centroid generator model that includes a CNN for estimating the centroid using the images 102. For example, the CNN of the centroid generator 514 does not require the cuboid or the cuboid generator 512 and instead generates the centroid of the object using the images 102, the object label, the bounding boxes of the object in the images, the camera location, the camera orientation, and the camera distance to object.
The anchor point generator 524 generates anchor points which are the feature points in the 3D space. The anchor point generator 524 generates, using the feature points, the camera location, the camera orientation, and the distance to the object: the anchor points in 3D space. The anchor point generator 524 also generates, using the cuboid in 3D space: only the anchor points that are within the cuboid (e.g., those that overlap or are contained within the cuboid in 3D space).
The concatenation module 530 is used to concatenate and output any generated data and any pass through data. For example, the concatenation module 530 is used to output data to the mapping device 108 in order for the mapping device 108 to initially generate a 3D model of the object and map the object (location and pose). In the localization method of
In the localization method of
The 3D model generator 116 generates, using the point cloud map of the object: a 3D model of the object. The 3D model generator 116 stores the 3D model of the object to the object database 110. The 3D model can be in various formats, such as vector based models, primitive models, polygonal models, computer-aided design (CAD) models, solid models, wireframe models, etc. In some examples, the 3D model generator 116 stores the point cloud map to the object database 110.
The 3D model generator 116 can store to the object database the front identifying information received from the camera device 104. The 3D model generator 116 can generate, using the front identifying information and the point cloud map of the object, additional front identifying information and store the additional front identifying information to the object database 110. For example, the additional front identifying information can be a particular POV of the 3D model and/or the point cloud map that represents the front of the object. For example, the additional front identifying information can be an image (rendered or actual) of the front of the object. For example, the additional front identifying information can be the anchor points of the front of the object.
In examples, the mapping module 114 stores the remaining information of the object to the object database 110, such as the object label, the anchor points, the cuboid, and the centroid. The cuboid can be a rectangular prism that represents the size and dimension of the real object, and in some examples can be stored independent of real world coordinates in 3D space. The cuboid does not necessarily need to be stored in the object database 110 with the 3D coordinates in 3D space, although such an embodiment is possible. The centroid can be the center of the cuboid in 3D space, and likewise does not necessarily need to be stored in the object database 110 with the 3D coordinates in 3D space, although such an embodiment is possible. As well, the centroid can be stored in the object database 110 or can be calculated on an as-needed basis. The anchor points can be stored relative to coordinates of 3D space or relative to the 3D model (without coordinates of 3D space).
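As an illustration of calculating the centroid on an as-needed basis, the following Python sketch takes the centroid as the mean of the eight corners of the cuboid. Representing the cuboid by its eight corner points is an assumption made for this example, and `cuboid_centroid` is a hypothetical name.

```python
def cuboid_centroid(corners):
    """Return the center of a cuboid given its eight (x, y, z) corners."""
    n = len(corners)
    return tuple(sum(corner[i] for corner in corners) / n for i in range(3))


# A cuboid of size 2 x 4 x 6 with one corner at the origin.
corners = [(x, y, z) for x in (0.0, 2.0) for y in (0.0, 4.0) for z in (0.0, 6.0)]
centroid = cuboid_centroid(corners)
```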
In some examples, the cuboid is adjusted or registered with the image of the 3D model and the point cloud map of the object, for example the front face of the cuboid can be registered with the front face of the object. The bottom face of the cuboid can be registered with the bottom face of the object. The adjustment or registration can be enriched using object libraries or manual adjustment.
Therefore, the object database 110 contains the 3D model of the object (along with other characteristics) which can be subsequently retrieved from the object database 110 for generating virtual 3D maps of the object.
In the example when the object is initially detected from the camera device 104, referring still to
The camera 522 generates one or more images 102 by capturing light (optionally non-visible radiation) from the real world. The sensors 520 generate sensor information from the real world.
The LiDAR can also be used as a range detector to detect a distance between the camera device 104 and the object. The sensors 520 can include a GPS or LPS to localize the camera device 104 in 3D coordinates of the real world, therefore generating the camera location. The sensors 520 can include a magnetometer (e.g. compass), gyroscope, and/or accelerometers to generate an orientation of the camera device 104. The positioning module 518 generates, using the sensor information: i) camera location, ii) camera orientation, and iii) camera distance to object.
As an alternative to using some or all of the sensor information, in an example, the positioning module 518 can also localize the camera device 104 using photogrammetry from an image of a stationary real object. Such a localization method is described in relation to
In the localization method of
In some examples, the image 2D object detection module 510 also uses the camera location from the positioning module 518 and the last known location of objects from the map database 112. Any objects having a centroid location that were last known to be “close” or within a distance threshold to the camera location can be generated by the image 2D object detection module 510 as having a higher object score. For example, when inside an event venue, a chair that was last known to be located within the event venue is more likely to be the same chair (and given a higher object score) than another chair that was last known to be located across the real world.
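One possible way to express this proximity preference is sketched below in Python. The additive `boost` value, the cap at 1.0, and the function name `adjusted_score` are assumptions for illustration only; the source does not specify how the higher object score is computed.

```python
import math


def adjusted_score(base_score, camera_location, last_known_centroid,
                   distance_threshold, boost=0.1):
    """Raise the object score when the last known centroid is near the camera."""
    distance = math.dist(camera_location, last_known_centroid)
    if distance <= distance_threshold:
        return min(1.0, base_score + boost)  # boost is an assumed heuristic
    return base_score


# A chair last seen 5 units away versus one last seen 500 units away.
near = adjusted_score(0.6, (0.0, 0.0, 0.0), (3.0, 4.0, 0.0), 10.0)
far = adjusted_score(0.6, (0.0, 0.0, 0.0), (300.0, 400.0, 0.0), 10.0)
```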
In the localization method of
The front identifying information can be used by the pose estimation module 516 to determine the pose of the object. The pose estimation module 516 can include a CNN, for example as described in relation to
In some examples, the pose estimation module 516 can estimate the pose of the object using only one image.
In the localization method of
In some examples, the cuboid can be defined for consistency purposes such that the front of the object is a front face of the cuboid perpendicular to horizontal ground, and a bottom of the object is a bottom face of the cuboid parallel to horizontal ground. Other definitions and landmarks can be used to define the cuboid.
In other examples, arbitrary faces of the object are used to define the cuboid. In other words, the cuboid generated by the cuboid generator 512 in
In some examples, the cuboid generator 512 can generate the cuboid of the object using only one image.
An example of the cuboid generator 512 is described herein in relation to
In the localization method of
In other examples, the CNN of the centroid generator 514 does not require the cuboid or the cuboid generator 512 and instead generates the centroid of the object using the images, the object label, the bounding boxes of the object in the images, the camera location, the camera orientation, and the camera distance to object.
The anchor point generator 524 generates anchor points which are the feature points in the 3D space. The anchor point generator 524 generates, using the feature points, the camera location, the camera orientation, and the distance to the object: the anchor points in 3D space. The anchor point generator 524 also generates, using the cuboid in 3D space, only the anchor points that are within the cuboid (e.g., those that overlap or are contained within the cuboid in 3D space).
The concatenation module 530 is used to concatenate and output any generated data and any pass through data. For example, the concatenation module 530 is used to output data to the mapping device 108 in order for the mapping device 108 to determine whether there has been a change in the object (location and pose) for 3D mapping purposes. In the localization method of
In some examples, not shown, the image 2D object detection module 510 also generates, from each image 102, a mask of the image 102. A mask is a binary image with pixel values in place of the object in the image. In some examples, the mask can be used as additional data by the pose estimation module 516. As well, the mask can be used as additional data by the cuboid generator 512.
In the localization method of
The mapping module 114 uses the object label to query the map database 112 of the last known location (centroid location or cuboid location), the known anchor points of the object, and the known pose of the object. In response, the map database 112 returns the last known location, the known anchor points, and the known pose to the mapping module 114. The map database 112 also returns the virtual 3D map of the applicable real world coordinates at the last known location, which is typically without the object but includes the background, environment, terrain, and sometimes stationary real objects.
When the mapping module 114 determines that there is a change in the location (centroid or cuboid), the anchor points, or the pose (or the change exceeds a respective distance or angular threshold), the mapping module 114 also updates the map database 112 with the new changed location (centroid or cuboid), the new anchor points (changed anchor points), or new pose (changed pose). The mapping module 114 can use the object label to query the object database 110. In response, the object database 110 returns the 3D model of the object (e.g., that was generated and stored in the object database 110 as in
The cuboid in 3D space and the images from the camera device 104 can also be used by the mapping module 114 as additional information to determine whether there is a change in the location or the pose of the object. For example, one of the faces of the cuboid can be identified (e.g. the front identifying information) as the front of the object, and can be used to generate the pose of the object.
When there is no change in the location (centroid or cuboid) or pose (or within a respective distance or angular threshold), the object in the 3D map does not change. For example, either the 3D map that has not changed is output to the display device 106 and does not need updating/refreshing, or a message is output to the display device 106 that the location and pose of the object has not changed and therefore the 3D map does not need updating/refreshing. A similar example can be implemented when there is no change in the cuboid in 3D space.
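The threshold comparison can be sketched in Python as follows. The default threshold values, the representation of the pose as (yaw, pitch, roll) in degrees, and the function names are assumptions made for this example.

```python
import math


def angle_diff_deg(a, b):
    """Smallest absolute difference between two angles, handling wrap-around."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def has_changed(prev_location, new_location, prev_pose, new_pose,
                distance_threshold=0.05, angular_threshold=2.0):
    """True when the location or any pose angle exceeds its threshold."""
    moved = math.dist(prev_location, new_location) > distance_threshold
    rotated = any(
        angle_diff_deg(p, n) > angular_threshold
        for p, n in zip(prev_pose, new_pose)
    )
    return moved or rotated


# Small jitter within both thresholds: the 3D map does not need refreshing.
unchanged = has_changed((0.0, 0.0, 0.0), (0.01, 0.0, 0.0),
                        (359.0, 0.0, 0.0), (1.0, 0.0, 0.0))
# The object moved one unit: the map database would be updated.
changed = has_changed((0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
                      (0.0, 0.0, 0.0), (0.0, 0.0, 0.0))
```

The wrap-around handling matters for yaw, where 359 degrees and 1 degree differ by only 2 degrees.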
It can be appreciated that the mapping device 108 generating of the 3D map including the object can be performed in near real-time to the capturing of the image(s) 102 by the camera device 104 as in
In an example, the mapping module 114 (which may be in the mapping device 108 and/or the camera device 104) performs tracking of the object. In an example, the tracking is performed by the mapping module 114 by tracking the following features from the object: i) eight corners of the cuboid, ii) the centroid of the cuboid, and iii) the feature points in the 2D bounding box of the object. If these features are found by the localization method to be in the same location (or within a threshold) as a previous location of these features at a previous time, then the location of the object is the same. If these features are found by the localization method to be in a different location (or beyond the threshold) from a previous location of these features at a previous time, then the location of the object has changed and the location of the object in the map database 112 is updated accordingly. In an example, the mapping module 114 does not need to receive a 3D point cloud map for tracking of the object.
It is recognized that estimating the cuboid for every detected object in each image may lead to problems such as jitter and computation ambiguities. In order to reduce occurrences of this issue, the mapping module 114 can be configured to reduce the need to perform the image 2D object detection on every image, and the mapping module 114 does not need to generate a cuboid for every image. Rather, the features of the object are tracked such as: i) eight corners of the cuboid, ii) the centroid of the cuboid, and iii) the feature points in the 2D bounding box.
Such tracking allows the localization method to be more efficient, by initially retrieving the presently known 2D coordinates of the detected object's bounding box. The localization method is used to update the location of the object every frame, or every few set frames, or after a definite time interval, or only when a new object is detected in a future image frame. Tracking the feature points in the 2D bounding box can provide more stability to the tracking results and reduce the susceptibility of the results to unfavorable conditions like viewpoint change, occlusion, noise, etc.
The pose is also updated if the pose has changed, as described herein in relation to examples of the localization method.
In another example, the digital 3D map (including the objects) prevails over the real world. When a real object location differs from the virtual object location, the mapping device 108 sends an instruction to the user, a machine, or the object itself for the object to return to the desired real location in order to match the virtual object location. An example can be applied to the scenario of event venue planning. A chair may have a desired location for the venue setup. The virtual 3D map stores the desired virtual location. The chair may have been inadvertently moved. The localization method is used by the mapping device 108 to determine that the real location differs from the virtual location. In response, the mapping device 108 sends an instruction to the user, a machine, or the chair itself (if motorized) to return to the desired virtual location. The localization method of
The real and virtual pose angle can use a common reference point of the world, such as real North and virtual North for yaw 602, e.g. 0 degrees being North. The roll 606 and the pitch 604 can use the real ground and virtual ground as a common reference point, e.g. 0 degrees being parallel to ground. In an example, positive angular values follow the left hand rule with the thumb pointing along the respective axis. In some examples, at least one but not all of the particular pose angles are detected. For example, in some instances only yaw 602 (compass direction) is detected. In other examples, other coordinate schemes can be used for the pose, such as compass direction angle (yaw 602) and elevation angle (pitch 604).
Referring again to
The pose estimation module 516 includes an instance pose estimator 806, which determines the pose of the object in 3D space 816. The instance pose estimator 806 includes a yaw CNN 808, a pitch CNN 810, and a roll CNN 812. The instance pose estimator 806 receives: i) the image(s) 102, ii) the front identifying information, iii) the object label, and iv) the bounding box in each image. The yaw CNN 808 generates the yaw in the POV pose 814 (perspective of the camera device 104). The pitch CNN 810 generates the pitch 822 in the POV pose 814 (perspective of the camera device 104). The roll CNN 812 generates the roll in the POV pose 814 (perspective of the camera device 104). The 3D space converter 830 uses the POV pose 814, the camera location, the camera orientation, and the camera distance to object to generate: the pose in 3D space 816. The 3D space converter 830 uses math or a NN such as a CNN. More images 102 can be used by the pose estimation module 516 to provide better accuracy for the pose in 3D space 816. In some examples, only one image 102 is used to generate the pose in 3D space 816.
An example of generating one or more of the VPs is described by Yang et al., “Cubeslam: Monocular 3-d object slam” IEEE Transactions on Robotics 35.4 (2019): 925-938, incorporated herein by reference. In an example, the goal is for the 3D cuboid's projected corners to fit the 2D bounding box tightly. A general 3D cuboid can be represented by 9 Degree of Freedom (DoF) parameters: 3 DoF position=t=[tx, ty, tz]; 3 DoF rotation R; and 3 DoF dimension=d=[dx, dy, dz].
In order to fully constrain all 9 parameters, more information is needed along with the 2D bounding box for the detected object. In some examples, provided or predicted object dimensions and orientations are used. In the present example, rather than relying on the predicted dimensions, VPs are used to change and reduce the regression parameters in order to work for general objects. A VP is the intersection of parallel lines after projection onto a perspective image. A 3D cuboid has three orthogonal axes and can form three VPs after projection, depending on the object rotation matrix R with respect to the camera frame and the calibration matrix K:
$\mathrm{VP}_i = K R_{\mathrm{col}(i)}, \quad i \in \{1, 2, 3\}$, where $R_{\mathrm{col}(i)}$ is the $i$th column of $R$.
VP is determined by the object rotation matrix R.
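The relation between the VPs, K and R can be illustrated numerically. The following is a minimal sketch only, assuming a pinhole calibration matrix with hypothetical focal length and principal point values and a pure yaw rotation; the function name `vanishing_points` is an illustrative choice and is not taken from the described embodiments.

```python
import numpy as np

def vanishing_points(K, R):
    """Return the three VPs in pixel coordinates, one per cuboid axis.

    Each VP is the homogeneous point K @ R[:, i] divided by its last
    component. An axis parallel to the image plane has its VP at
    infinity, which is reported here as None.
    """
    vps = []
    for i in range(3):
        h = K @ R[:, i]              # homogeneous image point K * R_col(i)
        if abs(h[2]) < 1e-9:
            vps.append(None)         # vanishing point at infinity
        else:
            vps.append(h[:2] / h[2])
    return vps

# Hypothetical intrinsics (assumed values, not from the source).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# Assumed object rotation: 30 degrees of yaw about the camera y axis.
yaw = np.deg2rad(30.0)
R = np.array([[np.cos(yaw), 0.0, np.sin(yaw)],
              [0.0, 1.0, 0.0],
              [-np.sin(yaw), 0.0, np.cos(yaw)]])

vps = vanishing_points(K, R)
```

In this sketch the cuboid axis aligned with the camera y axis stays parallel to the image plane, so only two of the three VPs are finite.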
Another example embodiment of generating the cuboid for the localization method uses the VP, and selects (filters) those cuboids having an angle difference with a screen normal in 2D space within a given angle threshold. For example, for one detected 2D bounding box, many candidate cuboids (cuboid proposals) are generated. These candidate cuboids are then ranked using a cost function given as:
where image is denoted as I, candidate cuboid is denoted as C, and the object is denoted O. The three kinds of costs considered are:
In an example, line segments are detected using a Fast Line Detector (FLD) function (from Lee, J. et al., Outdoor place recognition in urban environments using straight lines; 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 5550-5557; IEEE, 2014, incorporated by reference herein in its entirety) instead of using a Line Segment Detector (LSD) function as used by Yang et al. (supra). For example, FLD is faster as compared to LSD with no apparent performance degradation (see Luo et al., Accurate line reconstruction for point and line-based stereo visual odometry; IEEE Access 2019, 7, 185108-185120, incorporated by reference herein in its entirety).
In an example, candidate cuboids are ranked by using the cost function in equation (1). In another example, the cuboid generator 512 can use the detected plane normal in 3D space (3D_normal) and a screen normal projected on and relative to screen space, e.g., a screen normal in 2D space (screen_normal, also known as camera_normal) to reduce the number of candidate cuboids before applying the cost function of (1), as the cost function of (1) can be computationally intensive due to being for 3D space. In an example, the detected plane is horizontal ground and the 3D normal is orthogonal to the horizontal ground.
For example, the dist and angle costs of the cost function are applied in the 2D image space (e.g., from the screen point of view). Therefore before applying the two costs, the cuboid generator 512 narrows the number of candidate cuboids using screen_normal. The mapping device 108 evaluates the angle made by the screen normal with the x axis in 2D image space θscreen using the following equation:
where point (x2d,y2d) represents the projected screen normal. The cuboid generator 512 also uses the same equation (equation (2)) to compute the angle of the vanishing point center (the center of the vanishing points) projected in 2D image space (θvp_center) in relation to a screen center. Next, the cuboid generator 512 computes the angle difference between the two angles. For a particular candidate cuboid to be selected for further processing, the angle difference must not exceed (i.e., must be within) a given angle threshold. The candidate cuboids that exceed the angle threshold are not selected. In an example, the angle threshold is on or about 20° (e.g., selected after experimenting with different values). In an example, the angle difference is calculated with respect to the x-axis.
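The 2D filtering described above can be sketched as follows. This is an illustrative sketch only: it assumes that the angle with the x axis is obtained with a two-argument arctangent of (y2d, x2d), which is an assumption rather than a statement of equation (2), and the dictionary keys, function names and sample coordinates are hypothetical.

```python
import math

ANGLE_THRESHOLD_DEG = 20.0  # example threshold value given in the text

def angle_with_x_axis(x2d, y2d):
    """Angle in degrees between the 2D vector (x2d, y2d) and the x axis.

    The atan2 form is an assumption made for this sketch.
    """
    return math.degrees(math.atan2(y2d, x2d))

def angle_difference(a_deg, b_deg):
    """Smallest absolute difference between two angles, in [0, 180]."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)

def filter_by_screen_normal(candidates, screen_normal_2d, screen_center,
                            threshold_deg=ANGLE_THRESHOLD_DEG):
    """Keep candidates whose VP-center angle is close to the screen normal angle."""
    theta_screen = angle_with_x_axis(*screen_normal_2d)
    kept = []
    for cand in candidates:
        cx, cy = cand["vp_center"]
        # Angle of the vanishing point center relative to the screen center.
        theta_vp = angle_with_x_axis(cx - screen_center[0], cy - screen_center[1])
        if angle_difference(theta_screen, theta_vp) <= threshold_deg:
            kept.append(cand)
    return kept

# Hypothetical candidates in a 640x480 image (image y axis points down).
candidates = [
    {"name": "a", "vp_center": (330.0, 40.0)},
    {"name": "b", "vp_center": (620.0, 240.0)},
    {"name": "c", "vp_center": (250.0, 60.0)},
]
kept = filter_by_screen_normal(candidates, (0.0, -1.0), (320.0, 240.0))
kept_names = [c["name"] for c in kept]
```

Only candidates that pass this inexpensive 2D test would proceed to further selection and to the cost function.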
In an example, the cuboid generator 512 further reduces the number of candidate cuboids using the 3D_normal. The 3D_normal is computed from the plane (horizontal ground) detected using ARCore. The direction angles computed for the 3D normal are α, β and γ, which represent the angles formed by the normal with the positive x, y and z axes respectively and are given as:
where P3d(x3d, y3d, z3d) represents the 3D_normal and mag_P3d represents the magnitude of the normal vector. The cuboid generator 512 uses the same equation (equation (3)) to compute the direction angles made by a candidate cuboid with the positive x, y and z axes. A respective centroid of each candidate cuboid (that was selected or filtered in the 2D space) is generated. A respective centroid angle from the camera position to each centroid is generated. Next, an angle difference in 3D space between respective direction angles of the 3D normal (the three Cartesian coordinates) and the centroid angle is computed. Those candidate cuboids having angle difference(s) within the angle threshold are selected for further calculation of the cost function of equation (1). Those candidate cuboids which exceed the angle threshold are not selected. In examples, the angle threshold value used for this case is also 20° (e.g., after experimenting with different values). If for a particular candidate cuboid the angle difference value remains within the angle threshold, that candidate cuboid is selected and is ranked according to the cost function as defined in equation (1). In an example, the angle differences for all three Cartesian coordinates need to be within the angle threshold value in order for the candidate cuboid to be selected for the cost function.
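The 3D filtering described above can be sketched as follows. This is a hedged illustration: it assumes the direction angles take the standard direction-cosine form (the arccosine of each component divided by the vector magnitude), and it assumes the centroid angle is computed from the vector running from the camera position to the cuboid centroid. The function names and sample vectors are hypothetical.

```python
import math

def direction_angles(x3d, y3d, z3d):
    """Angles in degrees between a 3D vector and the positive x, y and z axes."""
    mag = math.sqrt(x3d * x3d + y3d * y3d + z3d * z3d)
    if mag == 0.0:
        raise ValueError("zero-length vector has no direction angles")
    angles = []
    for component in (x3d, y3d, z3d):
        cosine = max(-1.0, min(1.0, component / mag))  # guard against rounding
        angles.append(math.degrees(math.acos(cosine)))
    return tuple(angles)

def within_3d_threshold(normal_3d, camera_to_centroid, threshold_deg=20.0):
    """True when all three direction-angle differences are within the threshold."""
    normal_angles = direction_angles(*normal_3d)
    centroid_angles = direction_angles(*camera_to_centroid)
    return all(abs(n - c) <= threshold_deg
               for n, c in zip(normal_angles, centroid_angles))

# Assumed ground-plane normal pointing along the positive y axis.
ground_normal = (0.0, 1.0, 0.0)
accepted = within_3d_threshold(ground_normal, (0.1, 1.0, 0.1))
rejected = within_3d_threshold(ground_normal, (1.0, 1.0, 0.0))
```

Requiring all three differences to pass mirrors the example in which every Cartesian coordinate must be within the angle threshold.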
In another example, equation (1) is optimized using other optimization methods and/or a NN, in order to select (generate) the optimal cuboid for the object in the image.
An example of this type of cuboid generation for an example of the localization method 1700 is illustrated in detail in
In an example, at step 1702, the mapping device 108 receives an image which includes an object. At step 1704, the mapping device 108 generates a screen normal (also known as camera normal) of the image in 2D space. At step 1706, the mapping device 108 generates, using an image 2D object detection module and the image: i) an object label of the object detected in the image, ii) line segments in the 2D space of respective edges of the object detected in the image, and iii) a bounding box in the 2D space of the object. At step 1708, the mapping device 108 generates, using a cuboid generator, the image, and the bounding box for the image: i) a plurality of candidate cuboids in the 2D space which bound the object, ii) a respective vanishing point for each of the plurality of candidate cuboids, iii) a respective vanishing point angle of the vanishing point. At step 1710, the mapping device 108 selects the candidate cuboids having an angle difference (from a screen center to the respective vanishing point center in the 2D space) which are within an angle threshold (e.g., 20 degrees). At step 1712, the mapping device 108 further selects (second selecting), from the selected candidate cuboids, those selected candidate cuboids having a second angle difference in real 3D space (between the respective centroid angle and the 3D normal) which are within a second angle threshold (which is also 20 degrees in an example). At step 1714, the mapping device 108 generates, for each of the selected candidate cuboids from step 1712, using the cuboid generator, for the image: i) the selected candidate cuboid in the 2D space into the real world coordinates of the real 3D space, and ii) a respective cost function value between the selected candidate cuboid in the real world coordinates of the real 3D space and the object in the real 3D space. 
At step 1716, the mapping device 108 associates with the object: i) the object label, and ii) the selected candidate cuboid in the real world coordinates of the real 3D space having the respective cost function value that is optimal. The mapping device 108 can be configured to store the object label and the selected candidate cuboid into the object database 110 and the map database 112. The object label and the selected candidate cuboid can be subsequently used for the localization methods described herein and for tracking of the object in the real world coordinates of real 3D space.
In some examples, in step 1712 the second angle difference is calculated for each of the Cartesian 3D coordinates, and all of the second angle differences for a candidate cuboid are required to be within the second angle threshold in order for the candidate cuboid to be selected for the next step 1714.
Step 1712 is optional in some examples.
In an example, the localization method 1700 further includes the mapping device 108 or the camera device 104 tracking the selected candidate cuboid having the respective cost function value that is optimal.
In an example, the line segments are generated using a fast line detector function.
In an example, the cost function value is based on distance error, angle alignment error, and/or shape error.
As before, the images 102(1), 102(2), 102(3) can be taken from a variety of orientations and distances between the POV of the camera device 104 to the object 202. In
In some examples, only one image 102 is used to generate the cuboid in 3D space. For example, not shown, using the object label, the pose from the pose estimation module, known cuboid dimensions retrieved from the object database 110 using the object label, the camera location, the camera orientation, and the camera distance to the object, the cuboid generator 512 generates the cuboid in 3D space.
In an example, as shown in
In some examples, the GPS or LPS is not required for the localization method of
The pose estimation module 516 generates, using the image 102, the object label, and the bounding box: i) front identifying information. The front identifying information has been described in detail in relation to
The positioning module 518 uses the object label to retrieve, from the map database 112, the stored cuboid 1006 in 3D space of the stationary real object 1000 and the stored pose in 3D space of the stationary real object 1000. In an example, one face of the stored cuboid 1006 in 3D space is indicated as being the front (front face) of the stationary real object 1000. Note that the stored cuboid 1006 in 3D space inherently has real world rectangular dimensions of the stationary real object 1000. Note that the pose in 3D space is where the front (front identifying information) of the stationary real object 1000 is facing.
The positioning module 518 generates, using the image, the object label, the bounding box in the image, the POV pose 814, the front identifying information, the anchor points in camera 3D coordinates, the stored anchor points in 3D space, the stored cuboid 1006 in 3D space of the stationary real object 1000 and/or the stored pose in 3D space of the stationary real object 1000: i) the camera location in 3D space of real world coordinates, and ii) the camera orientation in 3D space of real world coordinates. In some examples, the translation method of
In some examples, the positioning module 518 uses the object label to retrieve the last known 3D coordinates of the centroid of the stationary real object 1000 from the map database 112. The positioning module 518 also generates or estimates a centroid of the stationary real object 1000 from the image 102. The positioning module 518 further uses the centroid and the known centroid (in addition to the information shown in
In some examples, the positioning module 518 uses the object label to retrieve the anchor points in 3D coordinates of the stationary real object 1000 from the map database 112. The positioning module 518 compares the anchor points in camera 3D coordinates and the known anchor points in order to generate: i) the camera location in 3D space, and ii) the camera orientation in 3D space. In some examples, the anchor points are individually trackable. In some examples, the particular anchor points of the front of the object are tracked and used for generating the camera location in 3D space and the camera orientation in 3D space.
In examples, the positioning module 518 can use rules (math) or can include a positioning model that includes a NN such as a CNN.
In some examples, the sensors 520 are not required in the localization method of
Referring again to
In other examples, LiDAR can be used to determine the distance between the camera device 104 and the object 202. The location and pose of the object 202 can now be determined using the localization method described in relation to
As shown in
The convolutional layer/pooling layer 1120 shown in
The following describes internal operating principles of a convolutional layer by using the layer 1122(1) as an example of a convolutional layer 1122(1). The convolutional layer 1122(1) may include a plurality of convolutional operators. The convolutional operator is also referred to as a kernel. A role of the convolutional operator in image processing is equivalent to a filter that extracts specific information from an input image matrix. In essence, the convolutional operator may be a weight matrix. The weight matrix is usually predefined. In the process of performing a convolution operation on an image, the weight matrix is usually processed one pixel after another (or two pixels after two pixels), depending on the value of a stride in a horizontal direction on the input image, to extract a specific feature from the image. The size of the weight matrix needs to be related to the size of the image. It should be noted that a depth dimension of the weight matrix is the same as a depth dimension of the input image. In the convolution operation process, the weight matrix extends to the entire depth of the input image. Therefore, after convolution is performed on a single weight matrix, convolutional output with a single depth dimension is output. However, the single weight matrix is not used in most cases, but a plurality of weight matrices with same dimensions (row×column) are used, in other words, a plurality of same-model matrices. Outputs of all the weight matrices are stacked to form the depth dimension of the convolutional image. It can be understood that the dimension herein is determined by the foregoing “plurality”. Different weight matrices may be used to extract different features from the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract a specific color of the image, still another weight matrix is used to blur unneeded noises from the image, and so on. 
The plurality of weight matrices have the same size (row×column). Feature graphs obtained after extraction performed by the plurality of weight matrices with the same dimension also have the same size, and the plurality of extracted feature graphs with the same size are combined to form an output of the convolution operation.
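The stacking of per-kernel outputs into the depth dimension can be illustrated with a short sketch. This is a plain NumPy illustration under assumed array layouts (image as height × width × depth, kernels as count × rows × columns × depth); it is not the implementation of the convolutional layer 1122(1), and the function name `conv2d_multi` is hypothetical.

```python
import numpy as np

def conv2d_multi(image, kernels, stride=1):
    """Apply several same-size weight matrices to one image.

    image:   array of shape (H, W, D)
    kernels: array of shape (N, kh, kw, D); each kernel spans the full depth D
    returns: array of shape (H_out, W_out, N); one output channel per kernel
    """
    H, W, D = image.shape
    N, kh, kw, kd = kernels.shape
    if kd != D:
        raise ValueError("kernel depth must equal image depth")
    H_out = (H - kh) // stride + 1
    W_out = (W - kw) // stride + 1
    out = np.zeros((H_out, W_out, N))
    for n in range(N):
        for i in range(H_out):
            for j in range(W_out):
                patch = image[i * stride:i * stride + kh,
                              j * stride:j * stride + kw, :]
                # Each kernel collapses the full input depth to one value.
                out[i, j, n] = np.sum(patch * kernels[n])
    return out

image = np.ones((5, 5, 3))
kernels = np.ones((2, 3, 3, 3))
kernels[1] *= 2.0                     # second kernel has doubled weights
features = conv2d_multi(image, kernels, stride=1)
features_s2 = conv2d_multi(image, kernels, stride=2)
```

The output depth equals the number of kernels, and a larger stride yields a smaller feature map.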
Weight values in the weight matrices need to be obtained through a large amount of training in actual application. The weight matrices formed by the weight values obtained through training may be used to extract information from the input image, so that the CNN 1100 performs accurate prediction.
When the CNN 1100 has a plurality of convolutional layers, an initial convolutional layer (such as 1122(1)) usually extracts a relatively large quantity of common features. The common feature may also be referred to as a low-level feature. As the depth of the CNN 1100 increases, a feature extracted by a deeper convolutional layer (such as 1122(6) or 1122(n)) becomes more complex, for example, a feature with high-level semantics or the like. A feature with higher-level semantics is more applicable to a to-be-resolved problem.
An example of the pooling layer is also described. Because a quantity of training parameters usually needs to be reduced, a pooling layer usually needs to periodically follow a convolutional layer. To be specific, at the layers 1122(1), . . . 1122(n), one pooling layer may follow one convolutional layer, or one or more pooling layers may follow a plurality of convolutional layers. In an image processing process, the purpose of the pooling layer is to reduce the space size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator, to perform sampling on the input image to obtain an image of a relatively small size. The average pooling operator may compute a pixel value in the image within a specific range, to generate an average value as an average pooling result. The maximum pooling operator may obtain, as a maximum pooling result, a pixel with a largest value within the specific range. In addition, just like the size of the weight matrix in the convolutional layer needs to be related to the size of the image, an operator at the pooling layer also needs to be related to the size of the image. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer. Each pixel in the image output by the pooling layer indicates an average value or a maximum value of a subarea corresponding to the image input to the pooling layer.
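The average and maximum pooling operators can be sketched as follows. This is a minimal NumPy illustration assuming non-overlapping square windows on a single-channel image; the function name `pool2d` and the window size are illustrative assumptions.

```python
import numpy as np

def pool2d(image, size=2, mode="max"):
    """Downsample a 2D image using non-overlapping size x size windows."""
    H, W = image.shape
    H_out, W_out = H // size, W // size
    # Group pixels into (H_out, size, W_out, size) blocks; edge remainders are dropped.
    blocks = image[:H_out * size, :W_out * size].reshape(H_out, size, W_out, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "average":
        return blocks.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'average'")

image = np.arange(16, dtype=float).reshape(4, 4)
max_pooled = pool2d(image, size=2, mode="max")
avg_pooled = pool2d(image, size=2, mode="average")
```

Each output pixel summarizes one subarea of the input, so a 4×4 input becomes a 2×2 output in this sketch.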
The fully connected network layer 1130 is now described. After the image is processed by the convolutional layer/pooling layer 1120, the CNN 1100 is still incapable of outputting desired output information. As described above, the convolutional layer/pooling layer 1120 only extracts features, and reduces the parameters brought by the input image. However, to generate final output information (desired category information or other related information), the CNN 1100 needs to generate an output of a quantity of one or a group of desired categories by using the fully connected network layer 1130. Therefore, the fully connected network layer 1130 may include a plurality of hidden layers (such as 1132(1), 1132(2), . . . , 1132(n) in
The output layer 1140 follows the plurality of hidden layers 1132(1), 1132(2), . . . , 1132(n) in the network layer 1130. In other words, the output layer 1140 is a final layer in the entire CNN 1100. The output layer 1140 has a loss function similar to category cross-entropy and is specifically used to calculate a prediction error. Once forward propagation (propagation in a direction from 1110 to 1140 in
It should be noted that the CNN 1100 shown in
In some examples, referring again to
Referring again to
The remainder of the localization method in
As shown in
In some examples, for example when the mapping module 114 and/or the mapping device 108 is in the camera device 104, step 1312 is not required and the mapping module 114 of the camera device 104 performs step 1314 as follows: generating, using the mapping module 114, the object label, the anchor points, and at least one of the cuboid in the real world coordinates of the real 3D space, a centroid of the cuboid, or the bounding box of the object with at least one of the images: a 3D map which includes the object located in the real world coordinates of the real 3D space. For example, the mapping module 114 generates the 3D map by using the 3D model of the object already stored in the object database 110, and inserts the 3D model of the object at the virtual location which is the same as the location of the object.
An example of the localization method can be applied to the scenario of event venue setup, in which real objects, stage, seating, audiovisual equipment, etc., are required to be set up at particular locations. In an example, a digital twin of a chair has a physical counterpart real chair, in which the real chair has the location moved out of a row in the real world. In an example, a predefined rule can set the location of physical chairs in a venue according to the placement of their digital twins in a virtual 3D map. So when a chair is removed from a row of chairs in the venue, the localization method can, automatically or through manual intervention, output instructions to move the chair back to the location of its digital twin as specified in the virtual 3D map. A similar example scenario can be applied to the scenario when the pose of the real chair has been manually changed, the localization method can output instructions to move the chair back to the pose of its digital twin as specified in the virtual 3D map. Another example embodiment is that any change in the real world is prioritized as being desired, and the location of the digital twin is automatically updated to bring it into conformity with the real world object location. A similar example scenario can mirror any changes in the pose of the real chair.
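The two rule variants in the venue example can be sketched as a small reconciliation function. This is a hypothetical illustration only: the `Placement` type, the policy names, and the tolerance values are assumptions introduced for this sketch and do not appear in the described embodiments.

```python
import math
from dataclasses import dataclass

@dataclass
class Placement:
    """Location (metres) and yaw (degrees) of an object; hypothetical type."""
    x: float
    y: float
    z: float
    yaw_deg: float

def reconcile(real, twin, policy, tol_m=0.05, tol_deg=2.0):
    """Return an (action, target placement) pair for a real object and its digital twin.

    policy "twin_is_reference": the real object should be moved back to the twin.
    policy "real_is_reference": the twin is updated to mirror the real object.
    The tolerances are assumed values for this sketch.
    """
    moved = math.dist((real.x, real.y, real.z), (twin.x, twin.y, twin.z)) > tol_m
    yaw_delta = (real.yaw_deg - twin.yaw_deg + 180.0) % 360.0 - 180.0
    turned = abs(yaw_delta) > tol_deg
    if not (moved or turned):
        return ("none", twin)
    if policy == "twin_is_reference":
        return ("move_real_object", twin)
    if policy == "real_is_reference":
        return ("update_twin", real)
    raise ValueError("unknown policy")

twin_chair = Placement(1.0, 0.0, 0.0, 0.0)
real_chair = Placement(1.5, 0.0, 0.0, 0.0)   # chair moved out of its row
instruction = reconcile(real_chair, twin_chair, "twin_is_reference")
mirrored = reconcile(real_chair, twin_chair, "real_is_reference")
```

The same check covers a change in pose alone, since a yaw difference beyond the tolerance triggers the selected policy.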
In the example embodiments, it should be understood that the described localization system 100, the localization method, the camera device 104, the display device 106, and the mapping device 108 may be implemented in other manners. For example, the described camera device 104 is merely an example embodiment. For example, the unit division is merely a logical function division and may be another type of division in other examples. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces. The indirect couplings or communication connections between the units may be implemented in electronic, mechanical, or other forms. In some examples, the mapping device 108 may be provided on the camera device 104 or the display device 106.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual requirements to achieve the objectives of the solutions of the embodiments.
In addition, functional units in the example embodiments may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.
When the functions are implemented in the form of a software functional unit and sold or used as an independent product, the functions may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of example embodiments may be implemented in the form of a software product. The software product is stored in a storage medium, and includes several instructions for instructing a computer device (which may be a personal computer, a server, or a network device) to perform all or some of the steps of the methods described in the example embodiments. The foregoing storage medium includes any medium that can store program code, such as a Universal Serial Bus (USB) flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc. In an example, the software product can be an inference model generated from a machine learning training process.
In the described methods or block diagrams, the boxes may represent events, steps, functions, processes, modules, messages, and/or state-based operations, etc. While some of the example embodiments have been described as occurring in a particular order, some of the steps or processes may be performed in a different order provided that the result of the changed order of any given step will not prevent or impair the occurrence of subsequent steps. Furthermore, some of the messages or steps described may be removed or combined in other embodiments, and some of the messages or steps described herein may be separated into a number of sub-messages or sub-steps in other embodiments. Even further, some or all of the steps may be repeated, as necessary. Elements described as methods or steps similarly apply to systems or subcomponents, and vice-versa. Reference to such words as “sending” or “receiving” could be interchanged depending on the perspective of the particular device.
The described embodiments are considered to be illustrative and not restrictive. Example embodiments described as methods would similarly apply to systems or devices, and vice-versa.
The various example embodiments are merely examples and are in no way meant to limit the scope of the example embodiments. Variations of the innovations described herein will be apparent to persons of ordinary skill in the art, such variations being within the intended scope of the example embodiments. In particular, features from one or more of the example embodiments may be selected to create alternative embodiments comprised of a sub-combination of features which may not be explicitly described. In addition, features from one or more of the described example embodiments may be selected and combined to create alternative example embodiments composed of a combination of features which may not be explicitly described. Features suitable for such combinations and sub-combinations would be readily apparent to persons skilled in the art. The subject matter described herein intends to cover all suitable changes in technology.
This application claims the benefit of priority to U.S. patent application Ser. No. 17/494,585 filed Oct. 5, 2021, the entire contents of which are herein incorporated by reference. This application is also a continuation-in-part of U.S. patent application Ser. No. 17/494,585 filed Oct. 5, 2021, the entire contents of which are herein incorporated by reference.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CA2022/051063 | 7/6/2022 | WO |
Number | Date | Country | |
---|---|---|---|
17494585 | Oct 2021 | US |