The various example embodiments described herein generally relate to image-based localization.
Autonomous navigation of mobile agents (e.g., drones, robots, etc.) generally relies on localizing the agents within their operating environments to a high degree of accuracy (e.g., centimeter-level accuracy) for safe operation. This is particularly true in industrial environments with many potential hazards and objects for the agents to avoid or navigate around. Therefore, there are significant technical challenges with providing accurate and efficient localization in such environments.
Therefore, there is a need for structure-aided visual localization (also referred to herein as SAVLoc) to provide high accuracy localization while also minimizing computational complexity.
According to one example embodiment, an apparatus comprises at least one processor, and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to process an image captured by a camera device to determine one or more pixel coordinates of one or more structural components depicted in the image. The apparatus is also caused to determine an approximate pose of the camera device at a time the image was captured. The apparatus is further caused to determine a correspondence of the one or more structural components to one or more known components based on the approximate pose. The apparatus is further caused to query a database for one or more location coordinates of the one or more known components based on the correspondence. The apparatus is further caused to compute a pose estimation of the camera device based on one or more location coordinates and the one or more pixel coordinates.
According to another example embodiment, a method comprises processing an image captured by a camera device to determine one or more pixel coordinates of one or more structural components depicted in the image. The method also comprises determining an approximate pose of the camera device at a time the image was captured. The method further comprises determining a correspondence of the one or more structural components to one or more known components based on the approximate pose. The method further comprises querying a database for one or more location coordinates of the one or more known components based on the correspondence. The method further comprises computing a pose estimation of the camera device based on one or more location coordinates and the one or more pixel coordinates.
According to another example embodiment, a non-transitory computer-readable storage medium comprises program instructions that, when executed by an apparatus, cause the apparatus to process an image captured by a camera device to determine one or more pixel coordinates of one or more structural components depicted in the image. The apparatus is also caused to determine an approximate pose of the camera device at a time the image was captured. The apparatus is further caused to determine a correspondence of the one or more structural components to one or more known components based on the approximate pose. The apparatus is further caused to query a database for one or more location coordinates of the one or more known components based on the correspondence. The apparatus is further caused to compute a pose estimation of the camera device based on one or more location coordinates and the one or more pixel coordinates.
According to another example embodiment, a computer program comprises instructions which, when executed by an apparatus, cause the apparatus to process an image captured by a camera device to determine one or more pixel coordinates of one or more structural components depicted in the image. The apparatus is also caused to determine an approximate pose of the camera device at a time the image was captured. The apparatus is further caused to determine a correspondence of the one or more structural components to one or more known components based on the approximate pose. The apparatus is further caused to query a database for one or more location coordinates of the one or more known components based on the correspondence. The apparatus is further caused to compute a pose estimation of the camera device based on one or more location coordinates and the one or more pixel coordinates.
According to another example embodiment, an apparatus comprises means for processing an image captured by a camera device to determine one or more pixel coordinates of one or more structural components depicted in the image. The apparatus also comprises means for determining an approximate pose of the camera device at a time the image was captured. The apparatus further comprises means for determining a correspondence of the one or more structural components to one or more known components based on the approximate pose. The apparatus further comprises means for querying a database for one or more location coordinates of the one or more known components based on the correspondence. The apparatus further comprises means for computing a pose estimation of the camera device based on one or more location coordinates and the one or more pixel coordinates.
According to one example embodiment, an apparatus comprises image processing circuitry configured to perform processing an image captured by a camera device to determine one or more pixel coordinates of one or more structural components depicted in the image. The image processing circuitry is also configured to perform determining an approximate pose of the camera device at a time the image was captured. The apparatus further comprises localization circuitry configured to perform determining a correspondence of the one or more structural components to one or more known components based on the approximate pose. The localization circuitry is further configured to perform querying a database for one or more location coordinates of the one or more known components based on the correspondence. The localization circuitry is further configured to perform computing a pose estimation of the camera device based on one or more location coordinates and the one or more pixel coordinates.
According to one example embodiment, a system comprises one or more devices including one or more of a cloud server device, an edge device, an internet of things (IoT) device, a user equipment device, or a combination thereof. The one or more devices are configured to process an image captured by a camera device to determine one or more pixel coordinates of one or more structural components depicted in the image. The one or more devices are also configured to determine an approximate pose of the camera device at a time the image was captured. The one or more devices are further configured to determine a correspondence of the one or more structural components to one or more known components based on the approximate pose. The one or more devices are further configured to query a database for one or more location coordinates of the one or more known components based on the correspondence. The one or more devices are further configured to compute a pose estimation of the camera device based on one or more location coordinates and the one or more pixel coordinates.
According to a further embodiment, a device (e.g., a camera device, a mobile agent, or component thereof) comprises at least one processor; and at least one memory including a computer program code for one or more programs, the at least one memory and the computer program code configured to, with the at least one processor, cause the device to process an image captured by a camera device to determine one or more pixel coordinates of one or more structural components depicted in the image. The device is also caused to determine an approximate pose of the camera device at a time the image was captured. The device is further caused to determine a correspondence of the one or more structural components to one or more known components based on the approximate pose. The device is further caused to query a database for one or more location coordinates of the one or more known components based on the correspondence. The device is further caused to compute a pose estimation of the camera device based on one or more location coordinates and the one or more pixel coordinates.
In addition, for various example embodiments of the invention, the following is applicable: a method comprising facilitating a processing of and/or processing (1) data and/or (2) information and/or (3) at least one signal, the (1) data and/or (2) information and/or (3) at least one signal based, at least in part, on (or derived at least in part from) any one or any combination of methods (or processes) disclosed in this application as relevant to any embodiment of the invention.
For various example embodiments of the invention, the following is also applicable: a method comprising facilitating access to at least one interface configured to allow access to at least one service, the at least one service configured to perform any one or any combination of network or service provider methods (or processes) disclosed in this application.
For various example embodiments of the invention, the following is also applicable: a method comprising facilitating creating and/or facilitating modifying (1) at least one device user interface element and/or (2) at least one device user interface functionality, the (1) at least one device user interface element and/or (2) at least one device user interface functionality based, at least in part, on data and/or information resulting from one or any combination of methods or processes disclosed in this application as relevant to any embodiment of the invention, and/or at least one signal resulting from one or any combination of methods (or processes) disclosed in this application as relevant to any embodiment of the invention.
For various example embodiments of the invention, the following is also applicable: a method comprising creating and/or modifying (1) at least one device user interface element and/or (2) at least one device user interface functionality, the (1) at least one device user interface element and/or (2) at least one device user interface functionality based at least in part on data and/or information resulting from one or any combination of methods (or processes) disclosed in this application as relevant to any embodiment of the invention, and/or at least one signal resulting from one or any combination of methods (or processes) disclosed in this application as relevant to any embodiment of the invention.
In various example embodiments, the methods (or processes) can be accomplished on the service provider side or on the mobile device side or in any shared way between service provider and mobile device with actions being performed on both sides.
For various example embodiments, the following is applicable: An apparatus comprising means for performing a method of the claims.
According to some aspects, there is provided the subject matter of the independent claims. Some further aspects are defined in the dependent claims.
Still other aspects, features, and advantages of the invention are readily apparent from the following detailed description, simply by illustrating a number of particular embodiments and implementations, including the best mode contemplated for carrying out the invention. The invention is also capable of other and different embodiments, and its several details can be modified in various obvious respects, all without departing from the spirit and scope of the invention. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
The example embodiments of the invention are illustrated by way of examples, and not by way of limitation, in the figures of the accompanying drawings:
Examples of a method, apparatus, and computer program for providing structure-aided visual localization (SAVLoc), according to one example embodiment, are disclosed in the following. In the following description, for the purposes of explanation, numerous specific details and examples are set forth to provide a thorough understanding of the embodiments of the invention. It is apparent, however, to one skilled in the art that the embodiments of the invention may be practiced without these specific details or with an equivalent arrangement. In other instances, structures and devices are shown in block diagram form to avoid unnecessarily obscuring the embodiments of the invention.
Reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. In addition, the embodiments described herein are provided by example, and as such, “one embodiment” can also be used synonymously with “one example embodiment.” Further, the terms “a” and “an” herein do not denote a limitation of quantity, but rather denote the presence of at least one of the referenced items. Moreover, various features are described which may be exhibited by some embodiments and not by others. Similarly, various requirements are described which may be requirements for some embodiments but not for other embodiments.
As used herein, “at least one of the following: <a list of two or more elements>,” “at least one of <a list of two or more elements>,” “<a list of two or more elements> or a combination thereof,” and similar wording, where the list of two or more elements are joined by “and” or “or”, mean at least any one of the elements, or at least any two or more of the elements, or at least all the elements.
Localization, in the context of mobile agents such as drones and robots (as well as in general technology and computing), refers to the process of determining the position or location of a device within its environment. More specifically, localization enables autonomous systems to navigate, interact with their surroundings, and execute tasks accurately.
Traditional localization techniques include Fiducial-based Localization (FidLoc), Hierarchical Localization (HLOC), and Ultrawideband Localization (UWB), but each technique presents technical challenges for implementation.
FidLoc is a vision-based solution in which 2D fiducial tags are distributed through an environment. The size and location of the tags are precisely known (with cm-level accuracy) so that given an image of a tag captured by a drone camera, the six degrees of freedom (6DoF) pose can be computed. Centimeter-level accuracy can be achieved but only when a tag is seen by the camera. A denser deployment of tags results in more robust tracking but increases overhead costs. In some situations, the cost of tag deployment can be a significant fraction (˜30%) of the total deployment cost.
HLOC is a vision-based solution which relies on extracting visual features from a query image captured by a mobile camera and matching the features with those in a pre-generated map database. The map database is generated from a video captured in the environment of interest by a mobile camera. 6DoF camera pose with cm-level localization accuracy can be achieved. This technique is well suited for visually rich environments. However, the map database would need to be updated whenever there are significant visual changes to the environment.
UWB is a radio-based solution that achieves sub-meter localization accuracy when there is a line-of-sight (LOS) path between wired infrastructure node(s) and the device. 3D localization can be obtained, but accurate orientation estimation is not available. In cluttered environments, a high density of infrastructure nodes is required to provide LOS paths with multiple nodes. Also, the location of the nodes needs to be determined with cm-level accuracy.
To enable drone navigation (e.g., in indoor industrial environments), the technical problem is to determine a drone's pose accurately (e.g., cm-level location and degree-level orientation), in real time (e.g., approximately 5 Hz). In some embodiments, there are additional technical problems associated with visual localization in visually repetitive environments such as warehouses consisting of identical shelf structures or outdoor ports with identical shipping containers.
As discussed above, the problems with the existing solutions are:
FidLoc: The overhead cost of installing and calibrating fiducial tags is high. In some cases, the FidLoc overhead cost can be up to 30% of the total system cost.
HLOC: Because HLOC is based on unique visual features, it is not suited for repetitive warehouse environments. If features are extracted from images of goods placed on the shelves, then the map would need to be updated whenever the goods are moved. In general, the cost of creating a map in a large environment is high.
UWB: The overhead cost of installing wired infrastructure nodes is high, and it does not provide accurate orientation estimates.
To address these technical challenges, the various example embodiments described herein introduce a capability to provide structure-aided visual localization referred to herein as “SAVLoc”. In one embodiment, SAVLoc is a visual-based localization technique in which an image, taken by a camera (e.g., equipped on an agent or mobile agent), is used to estimate the camera's pose (e.g., 3D location and/or 3D orientation with respect to the world coordinate frame). The various example embodiments, for instance, are used to identify known structural components in an image and to use their world-frame coordinates to determine the camera pose. The camera is fixed with a known spatial relationship to an agent (e.g., a mobile agent such as a drone, forklift, robot), so that the pose of the agent can be derived from the camera's pose. By way of example, possible structures used for SAVLoc include but are not limited to shelves in a warehouse, containers in a port, door frames, window frames, lights (e.g., overhead lights in an office/warehouse), beams, and/or any other structural components (e.g., edges, intersections, openings, etc.) in an environment. The various example embodiments rely on the geometric information about these structures being known, such as their dimensions and the coordinates of their edges and corners.
In one embodiment, the camera's pose can be determined with respect to any number of degrees of freedom such as six degrees of freedom (6DoF) or three degrees of freedom (3DoF). 6DoF refers to the ability of the camera and/or mobile agent to move freely in three-dimensional space. These six degrees of freedom represent all possible independent movements that an object can make and include: (1) translation along an x-axis, (2) translation along a y-axis, (3) translation along a z-axis, (4) rotation around the x-axis (e.g., roll), (5) rotation around the y-axis (e.g., pitch), and (6) rotation around the z-axis (e.g., yaw). For example, a drone with 6DoF capability can translate (move) in any direction and rotate (change orientation) along all three axes, enabling it to navigate complex environments. 3DoF refers to the ability of a mobile agent to move in three-dimensional space with limited rotational freedom (e.g., a mobile agent such as a robot or terrestrial vehicle that is limited to traveling on the ground or surface). In other words, a 3DoF mobile agent is constrained to translation in the ground plane and rotation about the vertical axis (e.g., yaw), rather than the full set of translations and rotations available to a 6DoF agent.
In one embodiment, SAVLoc can be used in autonomous drone navigation as part of an autonomous industrial monitoring system. Example use cases include but are not limited to scanning barcodes on boxes in a warehouse and providing video footage of plants growing in an indoor vertical farm. It is noted that although the various example embodiments described herein discuss SAVLoc in the context of an example use case of a drone camera flying through a warehouse with repetitive shelf structures, it is contemplated that the various embodiments are applicable to localization of any type of mobile agent in any environment (e.g., indoor or outdoor).
As shown in the accompanying figures, an example SAVLoc process 100 operates on an image 101 captured by a camera device.
In one example embodiment, the SAVLoc process 100 consists of three steps:
(1) Segmentation. Pixels for the horizontal and vertical shelf components (e.g., more generally, structural components) are identified using, e.g., a pre-trained neural network 103 or equivalent object recognition mechanism to detect structures and/or their boundaries, which are then highlighted (e.g., as shown in segmented image 105 with reference points of detected components in the image frame) in block 107. In this example, segmented image 105 highlights the horizontal and vertical structural components of the shelves as darker shaded lines. For example, the pixel coordinates of where the detected components intersect the edge of the image are determined (e.g., x=310 and y=480 in pixel coordinates with the origin at the top left corner of the image 105).
In one embodiment, the pre-trained neural network is trained to identify pixels of the image 101 that correspond to one or more objects of interest. Then, a post-processing step can be applied to determine boundaries of the object(s) of interest in the image 101 from the segmented image 105, and a point (e.g., a centroid or other designated point) within the determined boundaries is identified to represent the detected object. In other embodiments, the pre-trained neural network 103 can be trained to identify objects of interest using bounding boxes. In this case, the representative point of the object can be determined based on a centroid (or any other designated point) of the bounding box. In yet another embodiment, the pre-trained neural network 103 can be trained to directly output the boundaries, representative point, centroid, etc. of the detected objects of interest. It is noted that the above examples of image segmentation outputs from the pre-trained neural network 103 are provided by way of illustration and not as limitations. If no shelf structures or other structures detectable by the pre-trained neural network 103 are seen, the process 100 waits for the next image.
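By way of illustration and not limitation, the following sketch shows one possible way to post-process a segmentation mask into a representative pixel coordinate and boundary; the function name, mask format, and use of OpenCV/NumPy are illustrative assumptions rather than requirements of the embodiments.

```python
# Illustrative post-processing of one component's segmentation mask (assumed
# format: an HxW uint8 array that is 255 where the component was segmented).
import cv2
import numpy as np

def component_keypoints(mask: np.ndarray):
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None                                   # no structure seen; wait for the next image
    contour = max(contours, key=cv2.contourArea)      # keep the largest segmented region
    m = cv2.moments(contour)
    if m["m00"] == 0:
        return None                                   # degenerate (zero-area) region
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"] # centroid in pixel coordinates
    return (cx, cy), contour.reshape(-1, 2)           # representative point + boundary pixels
```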
(2) Correspondence. Because of the repetitive nature of some structures (e.g., shelves, shipping containers, etc.) in the environment, there could be ambiguity about which structure (e.g., shelf) is seen. Using an approximate estimate 109 of the current pose (e.g., based on a prior pose and odometry estimates as further discussed below), the correspondence between the detected structural components and the known components of the environment is determined, and the world-frame location coordinates (e.g., 3D coordinates of edges, corners, etc.) of the corresponding known components are retrieved from a structure database.
(3) Pose estimation. From the 3D coordinates and corresponding 2D pixel coordinates, the pose 117 of the camera in the world frame can be computed geometrically in block 119.
The various embodiments of the SAVLoc process described herein have several technical advantages including but not limited to low overhead cost because, in built environments (e.g., warehouses, ports, etc.), the structural information is easily acquired from blueprints and datasheets (e.g., collectively referred to as blueprint data). It can also have low run-time costs by leveraging efficient, state-of-the-art neural networks for image segmentation. The remaining calculations for correspondence and pose estimation are computationally lightweight, thereby using fewer computational resources relative to traditional approaches.
It is contemplated that the approximate pose 109 can be obtained using any means. For example, the approximate pose 109 can be obtained using any other pose estimation modality known in the art. In one embodiment, the approximate pose 109 can be determined using odometry. Odometry, for instance, is a method used, e.g., in robotics and autonomous systems to estimate position and velocity by analyzing data from motion sensors or equivalent. It relies on measuring the changes in position over time to track movement. The accuracy of odometry is based on how accurately the changes in position can be tracked over time. For example, as shown in the accompanying figures, visual-inertial odometry (VIO) 121 can provide the relative pose change since a prior pose, and this relative change can be combined with the prior pose to form the approximate pose 109.
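By way of example only, the approximate pose 109 can be formed by composing the last known absolute pose with the relative change reported by odometry; the 4-by-4 homogeneous-transform convention and function name below are illustrative assumptions.

```python
# Minimal sketch of propagating an approximate pose from a prior pose and a
# relative odometry estimate (assumed conventions, not prescribed by the text).
import numpy as np

def approximate_pose(T_prior_cam_to_world: np.ndarray,
                     T_curr_to_prior: np.ndarray) -> np.ndarray:
    """T_prior_cam_to_world: last known camera pose (camera frame -> world frame),
    e.g., from the previous SAVLoc estimate.
    T_curr_to_prior: relative pose change from VIO/VO, mapping the current
    camera frame into the prior camera frame.
    Returns the approximate current camera pose in the world frame."""
    return T_prior_cam_to_world @ T_curr_to_prior
```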
In one embodiment, as discussed above, the SAVLoc process can be applied in a 3DoF scenario as opposed, for instance, to a 6DoF scenario. In this scenario, the accuracy of VIO is not needed, and visual odometry (VO) can be performed to track relative movement. In this way, VO can eliminate the IMU hardware needed to support VIO, thereby advantageously reducing the computational complexity and hardware cost associated with VIO.
The accuracy of the estimates depends on the use case, where the most aggressive performance targets are cm-level error in 2D location and degree-level error in yaw. The rate of pose estimation also depends on the use case. For example, robots moving at pedestrian speeds could require estimation rates of about 5 Hz. Accordingly, the SAVLoc process can be simplified by using VO to estimate the approximate pose input to the SAVLoc algorithm 100. In one embodiment, SAVLoc can be implemented more simply using ceiling-mounted structures (e.g., lights, fire alarms, sprinklers, and/or the like) that are typically mounted at a known height, which makes mapping and/or the related localization calculations less complex. In other words, VO based on such ceiling-mounted structures can be used in the SAVLoc algorithm 100 when 3DoF pose estimation is sufficient. As shown in the accompanying figures, this VO-based variant replaces VIO 121 with VO 127 as the source of the approximate pose.
With respect to overhead, the SAVLoc algorithm 100 based on VO relies only on readily available structural information (e.g., a map of the ceiling-mounted structures derived from blueprint data), so no additional localization infrastructure needs to be installed or calibrated.
With respect to tracking, compared to the SAVLoc algorithm 100 based on VIO 121, the VO-based variant tracks relative movement from image data alone, without requiring IMU measurements.
In one example embodiment, VIO 121 is the process of estimating an agent (e.g., drone) pose using both camera and inertial measurement unit (IMU) sensor inputs. VIO 121 is typically used to determine the relative pose change between time instances, and it suffers from drift due to biases in the IMU measurements. VIO 121 can be paired with SAVLoc to provide 6DoF pose estimation. In contrast, VO 127 forgoes the use of IMU inputs, and tracks the camera's movement and pose in a given environment based on the changes observed in consecutive or a sequence of images (e.g., based on structures detected by pre-trained neural network 103 over a sequence of images 101). More specifically, VO 127 estimates motion by matching features between images 101 and analyzing the spatial displacements of the features between images 101. The relative translation/movement and rotation between consecutive frames are used to update the camera's pose (e.g., VO pose 129) instead of measuring the relative movement and rotation using an IMU as done in VIO 121.
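By way of illustration and not limitation, the following sketch shows one conventional way to realize such a VO step by matching features between consecutive images; the choice of ORB features and OpenCV's essential-matrix solver is an assumption, and the recovered translation is known only up to scale.

```python
# Illustrative VO sketch: relative rotation and (up-to-scale) translation
# between two consecutive frames from matched ORB features.
import cv2
import numpy as np

orb = cv2.ORB_create(2000)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def relative_pose(img_prev, img_curr, K):
    kp1, des1 = orb.detectAndCompute(img_prev, None)
    kp2, des2 = orb.detectAndCompute(img_curr, None)
    if des1 is None or des2 is None:
        return None
    matches = matcher.match(des1, des2)
    if len(matches) < 5:
        return None                                   # too few matches to estimate motion
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    if E is None:
        return None
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t                                       # rotation and unit-norm translation
```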
In one example embodiment, the processing consists of image segmentation 107, visual odometry 127, and relocalization (e.g., compute precise pose 119) to output a 3DoF pose estimation 117 in a world frame. By way of example, the entire processing for a single image takes about 20 ms using a current server-grade GPU. In contrast, neural network processing for the traditional HLoc approach consists of visual feature extraction and matching and requires about 800 ms per image on the same GPU.
Overall, the SAVLoc algorithm 100 based on optional VO 127 enables accurate localization in very large indoor environments with very low cost. It was shown to achieve sub-meter accuracy in a 20,000 m^2 warehouse with no added localization infrastructure. The overhead for digitizing the blueprint, creating a labeled training set, and training the segmentation network took only about 10 hours.
For this deployment, images captured by a drone camera are communicated in real-time to an on-premises edge server running the SAVLoc algorithm. The algorithm determines the 6DoF camera pose in real-time, and this pose is used by the AIMS autonomy software stack to guide the drone through the environment.
In one embodiment, the various embodiments described herein can be implemented in a client-server architecture or a device-only architecture. For example, in a client-server architecture, a device 201 communicates query images to a server 213 for SAVLoc processing, whereas in a device-only architecture, a device 241 performs the SAVLoc processing locally.
In summary, in a server implementation, a query image 217 and a VIO pose estimate 219, relative to the previous query image pose, could be communicated from the device 201 (e.g., drone) to the server 213. The SAVLoc pose 223 would be computed at the server 213 and communicated back to the device 201 (e.g., drone) to be integrated with the flight controller process (e.g., navigation controller 209). In general, a server 213 could provide localization services for devices 201 (e.g., drones) in multiple environments. Hence during a service initialization phase, the device 201 (e.g., drone) could notify the server 213 which environment it is flying in. The server 213 then loads the appropriate structural information (e.g., database 221) to be used for every query from this device 201 (e.g., drone).
At step 301, the image processing circuitry or other component of the server 213 and/or device 201/241 processes an image captured by a camera device (e.g., equipped on a mobile agent) to determine one or more pixel coordinates of one or more structural components depicted in the image. In one example embodiment, a mobile agent or mobile device associated with the agent includes a camera, an IMU (if VIO is used to determine an approximate current pose for 6DoF pose estimation; no IMU is needed if VO is used to determine an approximate current pose for 3DoF pose estimation), a wireless modem, and compute capability to support VIO/VO. For example, an Android or equivalent device with built-in VIO from ARCore could be used. For the autonomous industrial monitoring use case, a VOXL module with built-in VIO or equivalent can be used for deployment.
In one embodiment, a SAVLoc process 300 can be applied to 2D rectangular structures (or any other designated structural components occurring within a selected environment). The process 300 illustrates various example embodiments for determining the 6DoF camera pose given an image of a repetitive 2D rectangular structure. However, it is contemplated that a similar process (e.g., with VO in place of VIO) can be performed for 3DoF pose estimation. In one example embodiment, the image has been undistorted to a pinhole image model using known camera intrinsics. Although the various embodiments are described using an example use case of a warehouse application where the camera is pointed at a fixed angle (e.g., roughly horizontally) with respect to the ground plane, it is contemplated that any other equivalent application (e.g., indoor or outdoor) is also applicable to the various embodiments.
In one example embodiment, the process 300 uses computer vision/image segmentation to detect structural components and their respective pixel coordinates in an image. For example, neural network architectures such as but not limited to Mask R-CNN and YOLO can be trained to segment structural components. As used herein, the term “structural component” refers to any physical object that occurs within an environment such as objects forming the infrastructure at a given site (e.g., shelves, lights, structural beams, door frames, window frames, containers, etc. that regularly occur within the environment). In one example embodiment, the structural components are repetitive, indicating that the components occur more than once or more than a specified number of times within the environment. Examples of repetitive structural components include but are not limited to shelves in a warehouse, stacked shipping containers in a port facility, hospital beds in a hospital ward, doorways in a hotel, etc. The neural network could be trained on camera-captured images with hand-labeled object classes of structural components (e.g., horizontal and vertical shelf components). Alternatively, the network could be trained on synthetic rendered images whose domains have been transferred to the query image domain which could be specific, for example, to the environment lighting or image characteristics.
In one example embodiment, the segmentation neural network (e.g., Mask R-CNN, YOLO network or equivalent) can be trained for image segmentation using a set of labeled training images. If the number of objects to detect is small, then the labeling and training has low complexity. In one example embodiment, a custom network can be trained for each environment.
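By way of example only, such a custom per-environment network could be trained with an off-the-shelf segmentation framework; the use of the ultralytics YOLO package, the dataset file name, and the class names below are illustrative assumptions.

```python
# Hypothetical per-environment training sketch using the ultralytics package.
# "warehouse_seg.yaml" and its classes (e.g., horizontal_shelf, vertical_shelf)
# stand in for the small hand-labeled dataset described above.
from ultralytics import YOLO

model = YOLO("yolov8n-seg.pt")                        # pre-trained segmentation backbone
model.train(data="warehouse_seg.yaml", epochs=100, imgsz=640)

results = model("query_image.jpg")                    # inference on a query image
masks = results[0].masks                              # per-instance masks of detected components
```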
In one example embodiment, the network can be trained for enhanced segmentation. For example, for a given structural component, the network can be trained to segment different aspects. For example, if a downward facing camera is used for the warehouse drone use case, the network could be trained to segment the top faces of the horizontal structure separately from the front face. In doing so, additional geometric information could be used in pose estimation. In this way, the various embodiments of the pose estimation algorithm described herein can be generalized from a single face of a structure to account for multiple faces.
In one example embodiment, the various embodiments of the SAVLoc process described herein could be applied to structures other than the rectangular, horizontal, vertical, etc. structures discussed in the examples. In the warehouse example, if the camera is farther away from the shelf or if a wider-angle lens is used, the openings of the shelves could be used (e.g., as shown in example 1000 of the accompanying figures).
At step 303, the image processing circuitry or other component of the server 213 and/or device 201/241 determines an approximate pose of the camera device at a time the image was captured. In one example embodiment, the approximate pose is based on a prior pose coupled with a relative pose change up to the time the image was captured. By way of example, the prior pose could come from SAVLoc itself (e.g., when the structure was last seen) or from another localization modality (e.g., any modality other than SAVLoc). The relative pose change could be obtained from odometry from, for example, visual-inertial sensors or wheel measurements. Given the approximate current camera pose, the structure labels can be obtained by projecting all candidate objects into the camera's image plane and determining which candidate is “closest” to the observed object. For example, as shown in example 400 of the accompanying figures, the candidate known components are projected into the image plane using the approximate pose and compared against the observed structural components.
In one embodiment, the approximate pose of the camera device can be obtained by fusing the SAVLoc pose estimates with absolute pose estimates. The SAVLoc algorithm assumes that the initial camera pose in the world frame is known and that the approximate pose required for correspondence can be estimated by combining prior SAVLoc and VIO/VO poses. In practice, the camera could get “lost” if the correspondence algorithm incorrectly identifies a structural component. In this case, it may be necessary to get the algorithm back on track by providing an absolute pose estimate. Absolute pose estimates could be obtained from known unique visual references with known locations. Examples include fiducial markers (sparsely placed) and human-readable barcodes or signs. In one example embodiment, the segmentation network can be trained to identify these references to obtain absolute pose estimates to correct any component identification error.
At step 305, the localization circuitry or other component of the server 213 and/or device 201/241 determines a correspondence of the one or more structural components to one or more known components based on the approximate pose. In one example embodiment, the correspondence label for an identified structural object can be determined from an approximate estimate of the current camera pose. As discussed above, this estimate could be obtained in general from a prior pose coupled with a relative pose change up to the current time. In other words, the approximate pose estimate can be used to determine what known components (e.g., determined from blueprint data) are computed to be visible in the capture image. Those known components that are computed to be visible in the image given the approximate pose are selected as candidate objects or structures. In this way, a correspondence between an observed structure (e.g., detected in the processed query image) and a candidate object projected back into the image plane can be determined. In general, the correspondence step is quite robust to errors in the prior or approximate pose. For example, if the spacing between known components is two meters, the drift of the VIO/VO could be up to a meter without resulting in a mislabeled component.
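By way of illustration only, the correspondence step can be sketched as follows, assuming a pinhole camera matrix K, an approximate world-to-camera transform derived from the approximate pose, and candidate known components keyed by identifier; the pixel rejection threshold is an illustrative assumption.

```python
# Correspondence sketch: project each candidate known component into the image
# plane using the approximate pose and pick the candidate nearest to the
# observed component (conventions and threshold are assumptions).
import numpy as np

def project_point(K, T_world_to_cam, p_world):
    p_cam = (T_world_to_cam @ np.append(p_world, 1.0))[:3]
    if p_cam[2] <= 0:
        return None                                   # behind the camera, not visible
    uv = K @ (p_cam / p_cam[2])
    return uv[:2]

def match_component(K, T_world_to_cam, observed_px, candidates, max_px_error=150.0):
    observed_px = np.asarray(observed_px, dtype=float)
    best_id, best_err = None, np.inf
    for comp_id, p_world in candidates.items():       # e.g., component id -> a corner in the world frame
        uv = project_point(K, T_world_to_cam, np.asarray(p_world, dtype=float))
        if uv is None:
            continue
        err = np.linalg.norm(uv - observed_px)
        if err < best_err:
            best_id, best_err = comp_id, err
    return best_id if best_err < max_px_error else None   # reject if no candidate is close enough
```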
At step 307, the localization circuitry or other component of the server 213 and/or device 201/241 queries a database for one or more location coordinates of the one or more known components based on the correspondence. In other words, once the indices or other identification of the detected structural component is determined, the dimensions and world coordinates can be looked up and used for pose estimation. For example, the candidate object that is projected back into the image plane in step 305 and closest to the observed structure in the image can be selected as the corresponding known object/structure.
In one example embodiment, the database that is being queried stores the known structures and their respective location coordinates (e.g., in a world frame of reference as opposed to an image frame of reference to allow for a common frame of reference across all images). An example of a world frame of reference includes but is not limited to an Earth-Centered, Earth-Fixed (ECEF) frame of reference which is a three-dimensional coordinate system that is fixed with respect to the Earth's center and rotates along with the Earth's rotation. In this frame, the origin is located at the center of the Earth, and the axes are fixed relative to the Earth's surface. It is contemplated that any world frame can be used according to the various embodiments described herein.
In one example embodiment, an application programming interface (API) and/or other type of interface, application, service, etc. can be used to input the structure coordinates and dimensions by ingesting, for instance, computer-aided design (CAD) blueprints, construction documents, or other equivalent digital records (also collectively referred to as blueprint data). The ingestion of the blueprint data comprises identifying known structural components and their respective locations to create corresponding data records in a database of known components. Accordingly, each known component will have at a minimum a data record identifying the component and locations of the component and/or one or more features of the component (e.g., edges, corners, centroid, etc.).
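By way of example only, the ingested blueprint data can be represented as simple records keyed by component identifier; the CSV layout and field names below are illustrative assumptions.

```python
# Illustrative in-memory structure database built from blueprint data.
import csv
from dataclasses import dataclass

@dataclass
class KnownComponent:
    comp_id: str
    comp_class: str            # e.g., "horizontal_shelf", "vertical_shelf", "light"
    corners_world: list        # [(x, y, z), ...] in the world frame, in meters

def load_structure_db(path: str) -> dict:
    db = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):          # assumed columns: id, class, corners
            corners = [tuple(map(float, c.split(";"))) for c in row["corners"].split("|")]
            db[row["id"]] = KnownComponent(row["id"], row["class"], corners)
    return db

def query_location(db: dict, comp_id: str) -> list:
    """Return the world coordinates of a known component, as in step 307."""
    return db[comp_id].corners_world
```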
At step 309, the localization circuitry or other component of the server 213 and/or device 201/241 computes a pose estimation of the camera device based on one or more location coordinates and the one or more pixel coordinates. In one example embodiment, the pose estimation algorithm depends on what structures are observed in the image.
For example, option A for pose estimation can be used when intersecting vertical and horizontal structural components are observed, as illustrated in the accompanying figures.
For example, let (Pvx, Pvy) and (Phx, Phy) be the respective pixel coordinates for the vertical and horizontal components. Given the focal length f of the camera, it is noted that the vector [Phx, Phy, f] projecting from the camera origin is parallel (in the world frame) to the horizontal structure. Similarly, [Pvx, Pvy, f] is parallel to the vertical structure. Letting N(v) denote the normalized vector of v, the 3D orientation matrix R of the rectangle frame with respect to the camera frame can be derived from the three orthonormal vectors N([Phx, Phy, f]), N([Pvx, Pvy, f]), and their cross product. The translation from the camera to the rectangle frame is parallel to the vector N([Rcx, Rcy, f]), where (Rcx, Rcy) are the pixel coordinates of the rectangle's center (at block 603). The distance d between the origins can be computed by minimizing the reprojection error of the corners (at block 605). Letting d be this distance, the translation of the shelf in the camera frame is T = d*N([Rcx, Rcy, f]). From R and T, the relative 6DoF pose of the shelf in the camera frame can be expressed with a 4-by-4 spatial transformation matrix T_[RectToCam] (at block 607). The pose of the rectangle in the world frame T_[RectToWorld] can be derived from the correspondence labels and resulting world coordinates of the structural components (at block 609). Then, the pose of the camera in the world frame is given by T_[CamToWorld] = inv(T_[RectToCam])*T_[RectToWorld] (at block 611), where inv(M) denotes the inverse of the square matrix M (at block 613).
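By way of illustration and not limitation, a minimal numerical sketch of the above geometry is provided below; it assumes pixel coordinates expressed as offsets from the principal point, a rectangle frame with its origin at the rectangle's center, and corner coordinates known from the structure database. The bounded search range for d and the transform conventions are illustrative assumptions.

```python
# Sketch of the intersecting vertical/horizontal ("option A") geometry.
import numpy as np
from scipy.optimize import minimize_scalar

def normalize(v):
    return v / np.linalg.norm(v)

def project(points_cam, f):
    # pinhole projection of camera-frame points to pixel offsets from the principal point
    return f * points_cam[:, :2] / points_cam[:, 2:3]

def option_a_pose(p_h, p_v, rect_center_px, corners_rect, corners_px, f):
    # directions, in the camera frame, of the horizontal and vertical components
    x_axis = normalize(np.array([p_h[0], p_h[1], f]))
    y_axis = normalize(np.array([p_v[0], p_v[1], f]))
    z_axis = normalize(np.cross(x_axis, y_axis))
    y_axis = np.cross(z_axis, x_axis)                 # re-orthogonalize against segmentation noise
    R = np.column_stack([x_axis, y_axis, z_axis])     # rectangle frame -> camera frame rotation

    t_dir = normalize(np.array([rect_center_px[0], rect_center_px[1], f]))

    def reprojection_error(d):
        cam_pts = (R @ corners_rect.T).T + d * t_dir  # rectangle corners in the camera frame
        return np.sum((project(cam_pts, f) - corners_px) ** 2)

    d = minimize_scalar(reprojection_error, bounds=(0.1, 100.0), method="bounded").x

    T_rect_to_cam = np.eye(4)
    T_rect_to_cam[:3, :3] = R
    T_rect_to_cam[:3, 3] = d * t_dir
    return T_rect_to_cam

# Composing with the rectangle's known world pose gives the camera pose; under
# the convention that T_A_to_B maps A-frame coordinates into B-frame coordinates:
#   T_cam_to_world = T_rect_to_world @ np.linalg.inv(T_rect_to_cam)
```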
In one example embodiment, the pose estimation can be performed using the horizontal structural component only. Here, only a horizontal component is observed in the image, as shown in the accompanying figures.
Details of an algorithm for pose estimation based on the horizontal structural component only are illustrated in the accompanying figures.
In one example embodiment, the pose estimation based on vertical structural components only is similar to the horizontal case but with ambiguity in the rotation and translation about the vertical axis addressed with the appropriate estimates from the approximate current pose.
In one example embodiment, the SAVLoc process can also use alternative visual sensing modalities. For example, depth data from stereo cameras or time-of-flight cameras could be used to augment the pose estimation algorithm.
At optional step 311, the localization circuitry or other component of the server 213 and/or device 201/241 localizes the mobile agent based on the pose estimation (e.g., when the camera device is equipped on a mobile agent). As previously discussed, the camera device is fixed with a known spatial relationship to the mobile agent. Therefore, the known spatial relationship (e.g., known mounting location of the camera device to the mobile agent) can be used to derive the pose of the agent from the pose estimation of the camera device. The derived pose of the mobile agent represents the localization of the mobile agent.
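By way of example only, this derivation reduces to composing the estimated camera pose with the fixed camera mounting transform, as sketched below under an assumed homogeneous-transform convention.

```python
# Sketch: derive the mobile agent's pose from the camera pose and the fixed
# agent-to-camera mounting transform obtained from extrinsic calibration.
import numpy as np

def agent_pose(T_cam_to_world: np.ndarray, T_agent_to_cam: np.ndarray) -> np.ndarray:
    """T_agent_to_cam maps agent-frame coordinates into the camera frame (fixed,
    known mounting); returns the agent pose (agent frame -> world frame)."""
    return T_cam_to_world @ T_agent_to_cam
```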
In one example embodiment, the SAVLoc pose estimates from multiple cameras can be fused into a single estimate. Mobile agents often have multiple cameras to enable different applications. For example, the VOXL navigation module used for drones has a front-facing VGA stereo camera pair and a downward-facing VGA camera with a fisheye lens. Images from different cameras can provide independent SAVLoc pose estimates which can be combined based on the error covariance statistics to provide a single final pose estimate.
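By way of illustration only, one common way to combine independent estimates according to their error covariance statistics is inverse-covariance (information) weighting, sketched below; the fusion rule and pose parameterization are illustrative assumptions.

```python
# Sketch: fuse independent pose estimates by inverse-covariance weighting.
import numpy as np

def fuse_estimates(estimates, covariances):
    """estimates: list of pose vectors (NumPy arrays, e.g., [x, y, z, roll, pitch, yaw]);
    covariances: list of matching covariance matrices.
    Note: orientation components should be unwrapped consistently before fusing."""
    info = [np.linalg.inv(C) for C in covariances]    # information matrices
    fused_cov = np.linalg.inv(sum(info))
    fused = fused_cov @ sum(I @ x for I, x in zip(info, estimates))
    return fused, fused_cov
```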
By way of example, the inference time for YOLO segmentation was 120 ms with a CPU and 20 ms with an RTX 2080Ti GPU. The remaining SAVLoc computation requires only a few ms with a CPU. Hence a single GPU could be multiplexed to serve 50 devices at 1 Hz each. In practice, an update rate of 1 Hz has been demonstrated to be sufficiently high to correct VIO drift and navigate a drone in an indoor warehouse.
The above example embodiments are generally discussed with respect to an example of SAVLoc using VIO or equivalent to provide for pose estimation at 6DoF. However, as previously discussed, in cases where 3DoF is sufficient (e.g., ground-based mobile agents with no roll or pitch rotation), simpler VO can be used to eliminate the use of IMUs for VIO measurements.
As with 6DoF SAVLoc, input images are processed using image segmentation. Images are time-stamped with the time of capture and are processed one-by-one by a neural network pre-trained for the task of image segmentation. Namely, the NN identifies pixels associated with desired ceiling structures such as lights and supports. Given the segmentation masks, additional processing would determine the image coordinates of the segmented structures. For example, circular lights would each have an XY image coordinate, and linear lights and structures would have XY coordinates for each endpoint. The image coordinates, along with the confidence (from 0.0 to 1.0) of each segmented structure, make up the output of the image segmentation process.
Image segmentation can be performed using neural networks. For example, neural network architectures such as Mask R-CNN and YOLOv8 can be trained to segment structural components. The network could be trained on camera-captured images with hand-labeled object classes (round lights, light tubes, ceiling structures). Alternatively, the network could be trained on synthetic rendered images whose domains have been transferred to the query image domain which could be specific, for example, to the environment lighting or image characteristics. If the number of object classes is small, then the labeling and training has low complexity.
In one example embodiment, the SAVLoc based on VO uses a ceiling map. Given a world coordinate frame (e.g., with the origin in the corner of the warehouse, and the XY plane as the ground plane and Z pointing vertically up), the 3D coordinates in the world frame of each light structure are assumed known. The X, Y coordinates can be obtained from a blueprint, and the Z coordinate can be measured with a laser range finder. As described in the tracking algorithm below, the other ceiling structures do not need to have their XY coordinates known, only the height (Z coordinate) needs to be known approximately. (If the full XYZ coordinates are known, this would improve the algorithm performance.)
In one example embodiment, the SAVLoc process uses visual odometry. The goal of visual odometry is to determine the relative 3DoF pose change between successive image frames, based on matching segmented features across the frames. This technique considers frame n+1 relative to the previous frame n: as shown in the accompanying figures, the structures segmented in frame n are matched to those segmented in frame n+1, and the spatial displacement of the matched features between the two frames yields the relative translation and yaw change.
Relocalization can then be performed. From those features detected in frame n+1 which have known 3D locations (e.g., lights), the correspondences between the 3D world coordinates and 2D image coordinates can be used to estimate the 3DoF pose of the camera in the world frame using a special case of the perspective-n-point algorithm.
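By way of example only, the relocalization step can be sketched with OpenCV's general PnP solver standing in for the special case mentioned above; reducing the result to an (x, y, yaw) pose assumes an upward-facing camera with negligible roll and pitch, as discussed herein.

```python
# Relocalization sketch: camera pose from 3D light locations and their 2D image
# detections, reduced to a 3DoF (x, y, yaw) estimate.
import cv2
import numpy as np

def relocalize(world_pts, image_pts, K, dist=None):
    world_pts = np.asarray(world_pts, dtype=np.float64)    # N x 3 light positions (world frame)
    image_pts = np.asarray(image_pts, dtype=np.float64)    # N x 2 detections (pixels)
    ok, rvec, tvec = cv2.solvePnP(world_pts, image_pts, K, dist)   # needs >= 4 correspondences
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)                 # world -> camera rotation
    R_cw = R.T                                 # camera -> world rotation
    cam_pos = (-R_cw @ tvec).ravel()           # camera position in the world frame
    yaw = np.arctan2(R_cw[1, 0], R_cw[0, 0])   # heading about the vertical axis
    return cam_pos[0], cam_pos[1], yaw
```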
In one example embodiment, the relocalization algorithm can be further refined if the lights can be individually distinguished. For example, the general relocalization algorithm described above assumes that the lights are identical. However, if they can be distinguished, then the relocalization algorithm could be made more robust by loosening the required accuracy of the prior velocity. Any suitable way of distinguishing the lights from one another can be used for this purpose.
In one example embodiment, the SAVLoc pose estimates based on VO can be fused with absolute pose estimates. In practice, the camera could get “lost” if the VO tracking fails due, for example, to missing images. In this case, it may be necessary to get the algorithm back on track by providing an absolute pose estimate. Absolute pose estimates could be obtained from known unique visual references with known locations. Examples include fiducial markers (sparsely placed) and human-readable barcodes or signs. The segmentation network would need to be trained to identify these references.
In one example embodiment, VIO can be used as an alternative to VO. For example, a more expensive visual-inertial odometry system could be implemented on the device to more accurately track the agent compared to visual odometry.
In one example embodiment, SAVLoc based on VO can be used for anomaly detection. In many use cases (e.g., bed tracking in hospitals), the mobile agent should stay within prescribed areas or paths. If the agent detects that it has deviated from these areas, for example, if the detected light pattern does not correspond to the expected pattern, then an alert can be sent to the operator. An unexpected light pattern could also indicate other anomalous conditions, such as a camera installed incorrectly. In this case too, an alert could be sent.
SAVLoc based on VO advantageously provides for privacy preservation. For example, the system is privacy-preserving if the camera is facing the ceiling so that no people and/or other privacy-sensitive information are in its field of view.
In one example embodiment, SAVLoc based on VO can also be used for map generation. It has been assumed that a lighting map of the environment is available, for example from a blueprint of the electrical system. In practice, one could generate the lighting map using VO and using some prior knowledge of the ceiling structure. Alternatively, a more general visual localization system such as HLOC could be used to build a point cloud map of the ceiling. Then, the light features could be extracted from this map.
In one example embodiment, the upward-facing camera can be tilted at any other designated angle. In some scenarios, it is assumed that the camera is pointed straight up at the ceiling (so there is no roll or pitch rotation, only yaw). In general, if the camera is fixed to the agent, the roll and pitch angles can be measured using a one-time extrinsic calibration process. Then the VO and relocalization algorithms can be generalized to account for non-zero roll and pitch angles. An alternative way to measure the tilt is with respect to the gravity vector which can be estimated using a low-cost 3D accelerometer.
The various embodiments described herein advantageously provide for real-time processing. As previously discussed, the system can be implemented in both server-based and agent-based versions. For the server-based system, in one example implementation, images were captured by a camera connected to a Raspberry Pi 4 device and communicated over WiFi to a Lambda GPU laptop for processing. The total processing time per image was about 20 ms, suggesting that SAVLoc pose estimates based on VO could be obtained at about 50 Hz. For the agent-based system, all processing was performed by a Jetson Nano. The processing time per image was about 170 ms, suggesting that pose estimates could be obtained at about 5 Hz. This rate is sufficient for tracking ground-based agents moving at about a few meters per second (walking rate). For comparison, the processing time for HLOC is about 800 ms per image. The time is much higher because of the complexity required for 6DoF pose estimation from a single image. It is noted that discussions of specific hardware and performance in relation to the various embodiments are provided by way of illustration and not as limitations.
Returning to the example system architecture described above, additional details of the devices 201 and 241 and the server 213 are provided below.
In one example, the devices 201 and 241 and/or server 213 include one or more device sensors (e.g., a front facing camera, a rear facing camera, digital image sensors, LiDAR (light detection and ranging) sensor, global positioning system (GPS) sensors, sound sensors, radars, infrared (IR) light sensors, microphones, height or elevation sensors, accelerometers, tilt sensors, moisture/humidity sensors, pressure sensors, temperature sensor, barometer, NFC sensors, wireless network sensors, etc.) and clients (e.g., mapping applications, navigation applications, image processing applications, augmented reality applications, image/video application, modeling application, communication applications, etc.). In one example, GPS sensors can enable the devices 201 and 241 to obtain geographic coordinates from one or more satellites for determining current or live location and time. Further, a user location within an area may be determined by a triangulation system such as A-GPS (Assisted-GPS), Cell of Origin, or other location extrapolation technologies when cellular or network signals are available. Further, the devices 201 and 241 can include one or more flash devices, e.g., a black light infrared flash.
In one example embodiment, the server 213 and/or devices 201 and/or 241 of the system can perform functions related to providing structure/light-aided visual localization as discussed with respect to the various embodiments described herein. In one instance, these localization functions can be implemented in a standalone server computer or a component of another device with connectivity to the communications network 211. For example, the component can be part of an edge computing network where remote computing devices are installed within proximity of a geographic area of interest, one or more assets/objects/individuals to be monitored, or a combination thereof.
In one instance, the server 213 and/or devices 201 and/or 241 can include one or more neural networks or other machine learning algorithms/systems to process image data, such as images/frames of an input (e.g., a video stream or multiple static/still images, or serial or satellite imagery) (e.g., using an image segmentation algorithm) to extract structural object features. In one instance, the neural network is a convolutional neural network (CNN) which consists of multiple layers of collections of one or more neurons (which are configured to process a portion of the input data).
In one example, the server 213 and/or devices 201 and/or 241 have communication connectivity to one or more services platforms (e.g., services platform 225) and/or one or more software applications that provide one or more services 227 that can use the output (e.g., the pose estimates) of the system. By way of example, the communication connectivity can be an internal connection within the apparatuses and/or occur over the communications network 211. By way of example, the one or more services 227 may also include mapping services, navigation services, notification services, social networking services, content (e.g., audio, video, images, etc.) provisioning services, application services, storage services, augmented reality (AR) services, location-based services, information-based services (e.g., weather, news, etc.), payment services, market place services, data analytics services, etc. or any combination thereof.
In one example, one or more devices 201 and 241 may be configured with one or more various sensors for acquiring and/or generating sensor data for real-time use. For example, the sensors can capture one or more images of a geographic area and/or any other sensor data (e.g., LiDAR point clouds, infrared scans, radar scans, etc.) that can be used for real-time object tracking or localization analytics according to the embodiments described herein.
In one example, the components of the system may communicate over one or more communications networks 211 that includes one or more networks such as a data network, a wireless network, a telephony network, or any combination thereof. It is contemplated that the communication network 211 may be any local area network (LAN), metropolitan area network (MAN), wide area network (WAN), a public data network (e.g., the Internet), short range wireless communication network, or any other suitable packet-switched network, such as a commercially owned, proprietary packet-switched network, e.g., a proprietary cable or fiber-optic network, and the like, or any combination thereof. In addition, the communication network 211 may be, for example, a cellular telecom network and may employ various technologies including enhanced data rates for global evolution (EDGE), general packet radio service (GPRS), global system for mobile communications (GSM), Internet protocol multimedia subsystem (IMS), universal mobile telecommunications system (UMTS), worldwide interoperability for microwave access (WiMAX), Long Term Evolution (LTE) networks, 5G/3GPP (fifth-generation technology standard for broadband cellular networks/3rd Generation Partnership Project) or any further generation, code division multiple access (CDMA), wideband code division multiple access (WCDMA), wireless fidelity (Wi-Fi), wireless LAN (WLAN), Bluetooth®, UWB (Ultra-wideband), Internet Protocol (IP) data casting, satellite, mobile ad-hoc network (MANET), and the like, or any combination thereof.
In one example, the system or any of its components may be a platform with multiple interconnected components (e.g., a distributed framework). The system and/or any of its components may include multiple servers, intelligent networking devices, computing devices, components, and corresponding software for providing structure/light-aided visual localization. In addition, it is noted that the system or any of its components may be a separate entity, a part of the one or more services, a part of a services platform, or included within other devices, or divided between any other components.
By way of example, the components of the system can communicate with each other and other components external to the system using well known, new or still developing protocols. In this context, a protocol includes a set of rules defining how the network nodes, e.g. the components of the system, within the communications network interact with each other based on information sent over the communication links. The protocols are effective at different layers of operation within each node, from generating and receiving physical signals of various types, to selecting a link for transferring those signals, to the format of information indicated by those signals, to identifying which software application executing on a computer system sends or receives the information. The conceptually different layers of protocols for exchanging information over a network are described in the Open Systems Interconnection (OSI) Reference Model.
Communications between the network nodes are typically effected by exchanging discrete packets of data. Each packet typically comprises (1) header information associated with a particular protocol, and (2) payload information that follows the header information and contains information that may be processed independently of that particular protocol. In some protocols, the packet includes (3) trailer information following the payload and indicating the end of the payload information. The header includes information such as the source of the packet, its destination, the length of the payload, and other properties used by the protocol. Often, the data in the payload for the particular protocol includes a header and payload for a different protocol associated with a different, higher layer of the OSI Reference Model. The header for a particular protocol typically indicates a type for the next protocol contained in its payload. The higher layer protocol is said to be encapsulated in the lower layer protocol. The headers included in a packet traversing multiple heterogeneous networks, such as the Internet, typically include a physical (layer 1) header, a data-link (layer 2) header, an internetwork (layer 3) header and a transport (layer 4) header, and various application (layer 5, layer 6 and layer 7) headers as defined by the OSI Reference Model.
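Purely as an illustrative sketch of the encapsulation described above, and not as any actual wire format used by the embodiments, the following code shows how each layer prepends a header that records the type and length of the next, higher-layer protocol carried in its payload; the one-byte protocol codes and three-byte header layout are invented for this example.

    # Toy protocol encapsulation: a header naming the next protocol and the
    # payload length is prepended at each layer (layout invented for illustration).
    import struct

    def encapsulate(next_proto: int, payload: bytes) -> bytes:
        # 1-byte "next protocol" code + 2-byte payload length, then the payload.
        return struct.pack("!BH", next_proto, len(payload)) + payload

    def decapsulate(packet: bytes) -> tuple[int, bytes]:
        next_proto, length = struct.unpack("!BH", packet[:3])
        return next_proto, packet[3:3 + length]

    # Application data carried in a transport segment, carried in an
    # internetwork packet, carried in a data-link frame (OSI layers 4, 3, 2).
    app_data = b"pose estimate"
    segment = encapsulate(0x07, app_data)   # layer 4 header around application data
    packet = encapsulate(0x04, segment)     # layer 3 header around the segment
    frame = encapsulate(0x03, packet)       # layer 2 header around the packet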
The processes described herein for providing structure/light-aided visual localization may be advantageously implemented via software, hardware (e.g., a general purpose processor, a Digital Signal Processing (DSP) chip, an Application Specific Integrated Circuit (ASIC), Field Programmable Gate Arrays (FPGAs), etc.), firmware, or a combination thereof. Such exemplary hardware for performing the described functions is detailed below.
Additionally, as used herein, the term ‘circuitry’ may refer to (a) hardware-only circuit implementations (for example, implementations in analog circuitry and/or digital circuitry); (b) combinations of circuits and computer program product(s) comprising software and/or firmware instructions stored on one or more computer readable memories that work together to cause an apparatus to perform one or more functions described herein; and (c) circuits, such as, for example, a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation even if the software or firmware is not physically present. This definition of ‘circuitry’ applies to all uses of this term herein, including in any claims. As a further example, as used herein, the term ‘circuitry’ also includes an implementation comprising one or more processors and/or portion(s) thereof and accompanying software and/or firmware. As another example, the term ‘circuitry’ as used herein also includes, for example, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular device, other network device, and/or other computing device.
A bus 1410 includes one or more parallel conductors of information so that information is transferred quickly among devices coupled to the bus 1410. One or more processors 1402 for processing information are coupled with the bus 1410.
A processor 1402 performs a set of operations on information as specified by computer program code related to providing structure/light-aided visual localization. The computer program code is a set of instructions or statements providing instructions for the operation of the processor and/or the computer system to perform specified functions. The code, for example, may be written in a computer programming language that is compiled into a native instruction set of the processor. The code may also be written directly using the native instruction set (e.g., machine language). The set of operations includes bringing information in from the bus 1410 and placing information on the bus 1410. The set of operations also typically includes comparing two or more units of information, shifting positions of units of information, and combining two or more units of information, such as by addition or multiplication or logical operations like OR, exclusive OR (XOR), and AND. Each operation of the set of operations that can be performed by the processor is represented to the processor by information called instructions, such as an operation code of one or more digits. A sequence of operations to be executed by the processor 1402, such as a sequence of operation codes, constitutes processor instructions, also called computer system instructions or, simply, computer instructions. Processors may be implemented as mechanical, electrical, magnetic, optical, chemical or quantum components, among others, alone or in combination.
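As a toy illustration only, the following sketch shows how a numeric operation code can select a comparison, shift, or combining operation on units of information; the operation codes and dispatch table are invented for this example and do not correspond to any actual instruction set of processor 1402.

    # Hypothetical operation codes mapped to the operation classes described above.
    OPS = {
        0x01: lambda a, b: int(a == b),   # compare two units of information
        0x02: lambda a, b: a << b,        # shift positions of units of information
        0x03: lambda a, b: a + b,         # combine by addition
        0x04: lambda a, b: a ^ b,         # combine by exclusive OR (XOR)
        0x05: lambda a, b: a & b,         # combine by AND
    }

    def execute(op_code: int, a: int, b: int) -> int:
        """Dispatch one operation selected by its operation code."""
        return OPS[op_code](a, b)

    # A short sequence of operation codes, i.e., a toy instruction stream.
    program = [(0x03, 2, 3), (0x02, 1, 4), (0x04, 0b1100, 0b1010)]
    results = [execute(op, a, b) for op, a, b in program]   # [5, 16, 6]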
Computer system 1400 also includes a memory 1404 coupled to bus 1410. The memory 1404, such as a random access memory (RAM) or other dynamic storage device, stores information including processor instructions for providing structure/light-aided visual localization. Dynamic memory allows information stored therein to be changed by the computer system 1400. RAM allows a unit of information stored at a location called a memory address to be stored and retrieved independently of information at neighboring addresses. The memory 1404 is also used by the processor 1402 to store temporary values during execution of processor instructions. The computer system 1400 also includes a read only memory (ROM) 1406 or other static storage device coupled to the bus 1410 for storing static information, including instructions, that is not changed by the computer system 1400. Some memory is composed of volatile storage that loses the information stored thereon when power is lost. Also coupled to bus 1410 is a non-volatile (persistent) storage device 1408, such as a magnetic disk, optical disk or flash card, for storing information, including instructions, that persists even when the computer system 1400 is turned off or otherwise loses power.
Information, including instructions for providing structure/light-aided visual localization, is provided to the bus 1410 for use by the processor from an external input device 1412, such as a keyboard containing alphanumeric keys operated by a human user, or a sensor. A sensor detects conditions in its vicinity and transforms those detections into physical expression compatible with the measurable phenomenon used to represent information in computer system 1400. Other external devices coupled to bus 1410, used primarily for interacting with humans, include a display device 1414, such as a cathode ray tube (CRT), a liquid crystal display (LCD), a plasma screen, or a printer for presenting text or images, and a pointing device 1416, such as a mouse, a trackball, cursor direction keys, or a motion sensor, for controlling a position of a small cursor image presented on the display 1414 and issuing commands associated with graphical elements presented on the display 1414. In some embodiments, for example, in embodiments in which the computer system 1400 performs all functions automatically without human input, one or more of external input device 1412, display device 1414 and pointing device 1416 is omitted.
In the illustrated embodiment, special purpose hardware, such as an application specific integrated circuit (ASIC) 1420, is coupled to bus 1410. The special purpose hardware is configured to perform operations not performed by processor 1402 quickly enough for special purposes. Examples of application specific ICs include graphics accelerator cards for generating images for display 1414, cryptographic boards for encrypting and decrypting messages sent over a network, speech recognition hardware, and interfaces to special external devices, such as robotic arms and medical scanning equipment that repeatedly perform some complex sequence of operations that are more efficiently implemented in hardware.
Computer system 1400 also includes one or more instances of a communications interface 1470 coupled to bus 1410. Communication interface 1470 provides a one-way or two-way communication coupling to a variety of external devices that operate with their own processors, such as printers, scanners and external disks. In general, the coupling is with a network link 1478 that is connected to a local network 1480 to which a variety of external devices with their own processors are connected. For example, communication interface 1470 may be a parallel port or a serial port or a universal serial bus (USB) port on a personal computer. In some embodiments, communications interface 1470 is an integrated services digital network (ISDN) card or a digital subscriber line (DSL) card or a telephone modem that provides an information communication connection to a corresponding type of telephone line. In some embodiments, a communication interface 1470 is a cable modem that converts signals on bus 1410 into signals for a communication connection over a coaxial cable or into optical signals for a communication connection over a fiber optic cable. As another example, communications interface 1470 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN, such as Ethernet. Wireless links may also be implemented. For wireless links, the communications interface 1470 sends or receives or both sends and receives electrical, acoustic or electromagnetic signals, including infrared and optical signals, that carry information streams, such as digital data. For example, in wireless handheld devices, such as mobile telephones like cell phones, the communications interface 1470 includes a radio band electromagnetic transmitter and receiver called a radio transceiver. In certain embodiments, the communications interface 1470 enables connection to the communication network 211 for providing structure/light-aided visual localization.
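As a hedged sketch only, the following code shows one way a process could push a localization result over a network connection such as the one provided by communications interface 1470, here using a plain TCP socket; the host, port, and JSON message format are assumptions made for this illustration rather than any interface defined by the embodiments.

    # Illustrative only: serialize a pose estimate and send it to a receiving service.
    import json
    import socket

    def send_pose(host: str, port: int, pose: dict) -> None:
        payload = json.dumps(pose).encode("utf-8")
        with socket.create_connection((host, port), timeout=5.0) as sock:
            sock.sendall(payload)

    # Example usage (assumes a listener at the given address):
    # send_pose("192.0.2.10", 5000, {"x": 1.2, "y": 0.4, "z": 3.0, "yaw": 0.05})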
The term computer-readable medium is used herein to refer to any medium that participates in providing information to processor 1402, including instructions for execution. Such a medium may take many forms, including, but not limited to, non-volatile media, volatile media and transmission media. Non-volatile media include, for example, optical or magnetic disks, such as storage device 1408. Volatile media include, for example, dynamic memory 1404.
Transmission media include, for example, coaxial cables, copper wire, fiber optic cables, and carrier waves that travel through space without wires or cables, such as acoustic waves and electromagnetic waves, including radio, optical and infrared waves. Signals include man-made transient variations in amplitude, frequency, phase, polarization or other physical properties transmitted through the transmission media. Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, CDRW, DVD, any other optical medium, punch cards, paper tape, optical mark sheets, any other physical medium with patterns of holes or other optically recognizable indicia, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave, or any other medium from which a computer can read.
Network link 1478 typically provides information communication using transmission media through one or more networks to other devices that use or process the information. For example, network link 1478 may provide a connection through local network 1480 to a host computer 1482 or to equipment 1484 operated by an Internet Service Provider (ISP). ISP equipment 1484 in turn provides data communication services through the public, world-wide packet-switching communication network of networks now commonly referred to as the Internet 1490.
A computer called a server host 1492 connected to the Internet hosts a process that provides a service in response to information received over the Internet. For example, server host 1492 hosts a process that provides information representing video data for presentation at display 1414. It is contemplated that the components of system can be deployed in various configurations within other computer systems, e.g., host 1482 and server 1492.
In one embodiment, the chip set 1500 includes a communication mechanism such as a bus 1501 for passing information among the components of the chip set 1500. A processor 1503 has connectivity to the bus 1501 to execute instructions and process information stored in, for example, a memory 1505. The processor 1503 may include one or more processing cores with each core configured to perform independently. A multi-core processor enables multiprocessing within a single physical package. A multi-core processor may have, for example, two, four, eight, or a greater number of processing cores. Alternatively or in addition, the processor 1503 may include one or more microprocessors configured in tandem via the bus 1501 to enable independent execution of instructions, pipelining, and multithreading. The processor 1503 may also be accompanied by one or more specialized components to perform certain processing functions and tasks, such as one or more digital signal processors (DSP) 1507, or one or more application-specific integrated circuits (ASIC) 1509. A DSP 1507 typically is configured to process real-world signals (e.g., sound) in real time independently of the processor 1503. Similarly, an ASIC 1509 can be configured to perform specialized functions not easily performed by a general-purpose processor. Other specialized components to aid in performing the inventive functions described herein include one or more field programmable gate arrays (FPGA) (not shown), one or more controllers (not shown), or one or more other special-purpose computer chips.
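For illustration only, the following sketch shows one way software could exploit multiple independently executing cores, as described for processor 1503, by distributing per-frame work across worker processes; the function process_frame is a hypothetical stand-in for per-frame computation, and the worker count is an assumption.

    # Illustrative use of multiple cores via parallel worker processes.
    from multiprocessing import Pool

    def process_frame(frame_id: int) -> int:
        # Stand-in for per-frame work such as detecting structural components.
        return frame_id * frame_id

    if __name__ == "__main__":
        with Pool(processes=4) as pool:   # e.g., one worker per available core
            results = pool.map(process_frame, range(8))
        print(results)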
The processor 1503 and accompanying components have connectivity to the memory 1505 via the bus 1501. The memory 1505 includes both dynamic memory (e.g., RAM, magnetic disk, writable optical disk, etc.) and static memory (e.g., ROM, CD-ROM, etc.) for storing executable instructions that when executed perform the inventive steps described herein to provide structure/light-aided visual localization. The memory 1505 also stores the data associated with or generated by the execution of the inventive steps.
While the invention has been described in connection with a number of embodiments and implementations, the invention is not so limited but covers various obvious modifications and equivalent arrangements, which fall within the purview of the appended claims. Although features of the invention are expressed in certain combinations among the claims, it is contemplated that these features can be arranged in any combination and order.