The subject matter of this application relates generally to methods and apparatuses, including computer program products, for sparse simultaneous localization and mapping (SLAM) with unified tracking in computer vision applications.
Generally, traditional methods for sparse simultaneous localization and mapping (SLAM) focus on tracking the pose of a scene from the perspective of a camera or sensor that is capturing images of the scene, as well as reconstructing the scene sparsely with low accuracy. Such methods are described in G. Klein et al., “Parallel tracking and mapping for small AR workspaces,” ISMAR '07 Proceedings of the 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, pp. 1-10 (2007) and R. Mur-Artal et al., “ORB-SLAM: a versatile and accurate monocular SLAM system,” IEEE Transactions on Robotics (2015). Traditional methods for dense simultaneous localization and mapping (SLAM) focus on tracking the pose of sensors, as well as reconstructing the object or scene densely with high accuracy. Such methods are described in R. Newcombe et al., “KinectFusion: Real-time dense surface mapping and tracking,” Mixed and Augmented Reality (ISMAR), 2011 10th IEEE International Symposium (2011), and T. Whelan et al., “Real-time Large Scale Dense RGB-D SLAM with Volumetric Fusion,” International Journal of Robotics Research Special Issue on Robot Vision (2014).
Typically, such traditional dense SLAM methods are useful when analyzing an object with many shape features and few color features but do not perform as well when analyzing an object with few shape features and many color features. Also, dense SLAM methods typically require a significant amount of processing power to analyze images captured by a camera or sensor and track the pose of objects within.
Therefore, what is needed is an approach that incorporates sparse SLAM to focus on enhancing the object reconstruction capability on certain complex objects, such as symmetrical objects, and improving the speed and reliability of 3D scene reconstruction using 3D sensors and computing devices executing vision processing software.
The sparse SLAM technique described herein provides certain advantages over other, preexisting techniques:
The sparse SLAM technique can apply a machine learning procedure to train key frames in a mapping database, in order to make global tracking and loop closure more efficient and reliable. Also, the sparse SLAM technique can train on features in key frames, so that more descriptive features can be acquired by projecting high-dimensional untrained features into a lower-dimensional space using the trained feature model.
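As a minimal sketch of this idea (the application does not name the trained feature model, so principal component analysis via scikit-learn is assumed here as a stand-in), descriptors collected from key frames can be used to fit a projection that maps untrained high-dimensional descriptors into a lower-dimensional space:

```python
# Illustrative only: PCA (scikit-learn) stands in for the trained feature model,
# which the application does not name.
import numpy as np
from sklearn.decomposition import PCA

def train_feature_model(keyframe_descriptors, n_components=32):
    """Fit a projection model on descriptors gathered from existing key frames."""
    stacked = np.vstack(keyframe_descriptors)            # (N, D) high-dimensional features
    return PCA(n_components=n_components).fit(stacked)

def project_features(model, descriptors):
    """Project new, untrained descriptors into the learned low-dimensional space."""
    return model.transform(descriptors)                  # (M, n_components)
```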
Via its aggressive feature detection and key frame insertion processing, the 3D-sensor-based sparse SLAM technique described herein can be used as 3D reconstruction software to model objects that have few shape features but have many color features, such as a printed symmetrical object.
Because depth maps from 3D sensors are generally already accurate, the sparse SLAM technique can directly reconstruct a 3D mesh using the depth maps from the camera and poses generated by the sparse SLAM technique. In some embodiments, post-processing—e.g., bundle adjustment, structure from motion, TSDF modeling, or Poisson reconstruction—is used to enhance the final result.
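A minimal sketch of this direct reconstruction step follows; it assumes pinhole intrinsics (fx, fy, cx, cy) and a 4x4 camera-to-world pose from the sparse SLAM technique, and back-projects each valid depth pixel into the global coordinate system as input for meshing:

```python
import numpy as np

def depth_to_world_points(depth, pose, fx, fy, cx, cy):
    """Back-project a depth frame into world coordinates using the sparse-SLAM pose.
    `pose` is assumed to be a 4x4 camera-to-world transform; fx, fy, cx, cy are
    the sensor intrinsics (values depend on the actual 3D sensor)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float32)
    valid = z > 0                                   # keep pixels with valid depth only
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts_cam = np.stack([x[valid], y[valid], z[valid], np.ones(valid.sum())], axis=0)
    pts_world = pose @ pts_cam                      # transform into the global frame
    return pts_world[:3].T                          # (N, 3) points ready for meshing
```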
Also, when synchronized with a dense SLAM technique, the sparse SLAM technique described herein provides high-speed tracking capabilities (e.g., more than 100 frames per second) against an accurate reconstructed 3D mesh obtained from dense SLAM, supporting complex computer vision applications such as augmented reality (AR).
For example, when sparse SLAM is synchronized with dense SLAM:
1) The object or scene poses obtained from a tracking module executing on a processor of a computing device that is coupled to the sensor capturing the images of the object can be used for iterative closest point (ICP) registration in dense SLAM to improve reliability (a minimal sketch of this use follows this list).
2) The poses of key frames from a mapping module executing on the processor of the computing device are synchronized with the poses for Truncated Signed Distance Function (TSDF) in dense SLAM in order to align the mapping database of sparse SLAM with the final mesh of dense SLAM, thereby enabling high-speed object or scene tracking (of sparse SLAM) using the accurate 3D mesh (of dense SLAM).
3) The loop closure process in sparse SLAM helps dense SLAM to correct loops with few shape features but many color features.
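For item 1) above, a minimal sketch is shown below; it assumes the Open3D library (not mentioned in the application) as a stand-in for the dense-SLAM registration step, and shows how the pose tracked by sparse SLAM can seed ICP so the dense alignment starts near the correct solution:

```python
# Hypothetical sketch: Open3D's ICP stands in for the dense-SLAM registration step.
import open3d as o3d

def register_with_sparse_pose(source_pts, target_pts, sparse_pose, max_dist=0.02):
    """source_pts and target_pts are Nx3 numpy arrays; sparse_pose is the 4x4
    object/scene pose from the sparse-SLAM tracking module, used as the initial
    guess so ICP starts near the correct alignment. max_dist is illustrative."""
    src = o3d.geometry.PointCloud()
    src.points = o3d.utility.Vector3dVector(source_pts)
    tgt = o3d.geometry.PointCloud()
    tgt.points = o3d.utility.Vector3dVector(target_pts)
    result = o3d.pipelines.registration.registration_icp(
        src, tgt, max_dist, sparse_pose,
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation
```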
It should be appreciated that the techniques herein can be configured such that sparse SLAM is temporarily disabled and dense SLAM by itself is used to analyze and process objects with many shape features but few color features.
The invention, in one aspect, features a system for tracking a pose of one or more objects represented in a scene. The system comprises a sensor that captures a plurality of scans of one or more objects in a scene, each scan comprising a color and depth frame. The system comprises a database that stores one or more key frames of the one or more objects in the scene, each key frame comprising a plurality of map points associated with the one or more objects. The system comprises a computing device that a) receives a first one of the plurality of scans from the sensor; b) determines two-dimensional (2D) feature points of the one or more objects using the color and depth frame of the received scan; c) retrieves a key frame from the database; d) matches one or more of the 2D feature points with one or more of the map points in the key frame; e) generates a current pose of the one or more objects in the color and depth frame using the matched 2D feature points; f) inserts the color and depth frame into the database as a new key frame, including the matched 2D feature points as map points for the new key frame; and g) repeats steps a)-f) on each of the remaining scans, using the inserted new key frame for matching in step d), where the computing device tracks the pose of the one or more objects in the scene across the plurality of scans.
The invention, in another aspect, features a computerized method of tracking a pose of one or more objects represented in a scene. A sensor a) captures a plurality of scans of one or more objects in a scene, each scan comprising a color and depth frame. A computing device b) receives a first one of the plurality of scans from the sensor. The computing device c) determines two-dimensional (2D) feature points of the one or more objects using the color and depth frame of the received scan. The computing device d) retrieves a key frame from a database that stores one or more key frames of the one or more objects in the scene, each key frame comprising a plurality of map points associated with the one or more objects. The computing device e) matches one or more of the 2D feature points with one or more of the map points in the key frame. The computing device f) generates a current pose of the one or more objects in the color and depth frame using the matched 2D feature points. The computing device g) inserts the color and depth frame into the database as a new key frame, including the matched 2D feature points as map points for the new key frame. The computing device h) repeats steps b)-g) on each of the remaining scans, using the inserted new key frame for matching in step e), where the computing device tracks the pose of the one or more objects in the scene across the plurality of scans.
Any of the above aspects can include one or more of the following features. In some embodiments, the computing device generates a 3D model of the one or more objects in the scene using the tracked pose information. In some embodiments, the step of inserting the color and depth frame into the database as a new key frame comprises converting the color and depth frame into a new key frame and converting the 2D feature points of the color and depth frame into map points of the new key frame; fusing one or more map points of the new key frame that have valid depth information with similar map points of one or more neighbor key frames; estimating a 3D position of one or more map points of the new key frame that do not have valid depth information; refining the pose of the new key frame and the one or more neighbor key frames fused with the new key frame; and storing the new key frame and associated map points into the database.
In some embodiments, converting the color and depth frame into a new key frame and converting the 2D feature points of the color and depth frame into map points of the new key frame comprises converting a 3D position of the one or more map points of the new key frame from a local coordinate system to a global coordinate system using the pose of the new key frame. In some embodiments, the computing device correlates the new key frame with the one or more neighbor key frames based upon a number of map points shared between the new key frame and the one or more neighbor key frames. In some embodiments, the step of fusing one or more map points of the new key frame that have valid depth information with similar map points of one or more neighbor key frames comprises: projecting each map point from the one or more neighbor key frames to the new key frame; identifying a map point with similar 2D features that is closest to a position of the projected map point; and fusing the projected map point from the one or more neighbor key frames to the identified map point in the new key frame.
In some embodiments, the step of estimating a 3D position of one or more map points of the new key frame that do not have valid depth information comprises: matching a map point of the new key frame that does not have valid depth information with a map point in each of two neighbor key frames; and determining a 3D position of the map point of the new key frame using linear triangulation with the 3D positions of the map points in the two neighbor key frames. In some embodiments, the step of refining the pose of the new key frame and the one or more neighbor key frames fused with the new key frame is performed using local bundle adjustment. In some embodiments, the computing device deletes redundant key frames and associated map points from the database.
In some embodiments, the computing device determines a similarity between the new key frame and one or more key frames stored in the database, estimates a 3D rigid transformation between the new key frame and the one or more key frames stored in the database, selects a key frame from the one or more key frames stored in the database based upon the 3D rigid transformation, and merges the new key frame with the selected key frame to minimize drifting error. In some embodiments, the step of determining a similarity between the new key frame and one or more key frames stored in the database comprises determining a number of matched features between the new key frame and one or more key frames stored in the database. In some embodiments, the step of estimating a 3D rigid transformation between the new key frame and the one or more key frames stored in the database comprises: selecting one or more pairs of matching features between the new key frame and the one or more key frames stored in the database; determining a rotation and translation of each of the one or more pairs; and selecting a pair of the one or more pairs with a maximum inlier ratio using the rotation and translation. In some embodiments, the step of merging the new key frame with the selected key frame to minimize drifting error comprises: merging one or more feature points in the new key frame with one or more feature points in the selected key frame; and connecting the new key frame to the selected key frame using the merged feature points.
Other aspects and advantages of the invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, illustrating the principles of the invention by way of example only.
The system 200 includes a sensor 203 coupled to a computing device 204. The computing device 204 includes an image processing module 206. In some embodiments, the computing device can also be coupled to a data storage module 208, e.g., used for storing certain 3D models, color images, and other data as described herein.
The sensor 203 is positioned to capture images (e.g., color images) of a scene 201 which includes one or more physical objects (e.g., objects 202a-202b). Exemplary sensors that can be used in the system 200 include, but are not limited to, 3D scanners, digital cameras, and other types of devices that are capable of capturing depth information of the pixels along with the images of a real-world object and/or scene to collect data on its position, location, and appearance. In some embodiments, the sensor 203 is embedded into the computing device 204, such as a camera in a smartphone, for example.
The computing device 204 receives images (also called scans) of the scene 201 from the sensor 203 and processes the images to generate 3D models of objects (e.g., objects 202a-202b) represented in the scene 201. The computing device 204 can take on many forms, including both mobile and non-mobile forms. Exemplary computing devices include, but are not limited to, a laptop computer, a desktop computer, a tablet computer, a smart phone, augmented reality (AR)/virtual reality (VR) devices (e.g., glasses, headset apparatuses, and so forth), an internet appliance, or the like. It should be appreciated that other computing devices (e.g., an embedded system) can be used without departing from the scope of the invention. The computing device 204 includes network-interface components to connect to a communications network. In some embodiments, the network-interface components include components to connect to a wireless network, such as a Wi-Fi or cellular network, in order to access a wider network, such as the Internet.
The computing device 204 includes an image processing module 206 configured to receive images captured by the sensor 203 and analyze the images in a variety of ways, including detecting the position and location of objects represented in the images and generating 3D models of objects in the images.
The image processing module 206 is a hardware and/or software module that resides on the computing device 204 to perform functions associated with analyzing images captured by the sensor 203, including the generation of 3D models based upon objects in the images. In some embodiments, the functionality of the image processing module 206 is distributed among a plurality of computing devices. In some embodiments, the image processing module 206 operates in conjunction with other modules that are either also located on the computing device 204 or on other computing devices coupled to the computing device 204. An exemplary image processing module is the Starry Night plug-in for the Unity 3D engine or other similar libraries, available from VanGogh Imaging, Inc. of McLean, Va. It should be appreciated that any number of computing devices, arranged in a variety of architectures, resources, and configurations (e.g., cluster computing, virtual computing, cloud computing) can be used without departing from the scope of the invention.
The data storage module 208 (e.g., a database) is coupled to the computing device 204, and operates to store data used by the image processing module 206 during its image analysis functions. The data storage module 208 can be integrated with the computing device 204 or be located on a separate computing device.
As described herein, the sparse SLAM technique comprises three processing modules that are executed by the image processing module 206:
1) Tracking—the tracking module matches the input from the sensor (i.e., color and depth frames) to the key frames and map points contained in the mapping database to obtain the sensor pose in real time. The key frames are a subset of the overall input sensor frames that are transformed to a global coordinate system. The map points are two-dimensional (2D) feature points, also containing three-dimensional (3D) information, in the key frames.
2) Mapping—the mapping module builds the mapping database, which, as described above, includes the key frames and map points, based upon the input received from the sensor and the sensor pose as processed by the tracking module.
3) Loop Closing—the loop closing module corrects drifting errors in the data of the mapping database that accumulate during tracking of the object.
After the module 206 detects and calculates the 2D feature points, the module 206 determines the viewing directions, or normals, of the 2D feature points. If the 2D feature points have corresponding valid depth values in the depth frame, the module 206 also determines their 3D positions in the sensor coordinate system.
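A minimal sketch of this step is shown below; it assumes ORB as the feature detector and known pinhole intrinsics, neither of which is specified by the application:

```python
import cv2
import numpy as np

def detect_features(color, depth, fx, fy, cx, cy):
    """Detect 2D feature points; for each, compute its viewing direction and,
    where the depth frame has a valid value, its 3D position in the sensor
    coordinate system. ORB and the intrinsics (fx, fy, cx, cy) are assumptions."""
    orb = cv2.ORB_create(nfeatures=2000)
    gray = cv2.cvtColor(color, cv2.COLOR_BGR2GRAY)
    keypoints, descriptors = orb.detectAndCompute(gray, None)
    if descriptors is None:
        return []
    features = []
    for kp, desc in zip(keypoints, descriptors):
        u, v = kp.pt
        ray = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
        direction = ray / np.linalg.norm(ray)             # viewing direction of the feature
        z = float(depth[int(v), int(u)])
        xyz = ray * z if z > 0 else None                  # 3D position only if depth is valid
        features.append({"uv": (u, v), "desc": desc, "dir": direction, "xyz": xyz})
    return features
```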
Turning back to
The module 206 matches 2D features from the sensor frame to map points in certain key frames. The module 206 selects key frames from the mapping database using the following exemplary methods: 1) key frames that are around the sensor position in the global coordinate system; and 2) key frames having the largest number of matching pairs between map points in the key frame and 2D feature points in the previous sensor frame. It should be appreciated that other techniques to select key frames from the mapping database can be used.
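A minimal sketch of such key frame selection follows, assuming hypothetical key frame objects with `id` and `position` attributes and illustrative thresholds:

```python
import numpy as np

def select_keyframes(keyframes, sensor_position, prev_match_counts, radius=1.0, top_k=5):
    """Pick candidate key frames for matching using the two criteria above:
    key frames near the current sensor position in the global coordinate system,
    and key frames sharing the most map-point matches with the previous sensor
    frame. `radius` and `top_k` are illustrative values."""
    nearby = [kf for kf in keyframes
              if np.linalg.norm(kf.position - sensor_position) < radius]
    best_matched = sorted(keyframes,
                          key=lambda kf: prev_match_counts.get(kf.id, 0),
                          reverse=True)[:top_k]
    selected = {kf.id: kf for kf in nearby + best_matched}   # de-duplicate candidates
    return list(selected.values())
```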
The module 206 matches map points to 2D feature points by, e.g., using 3D+2D searching. For example, the module 206 transforms color feature points in the current frame using the 3D pose of the prior sensor frame to estimate the global positions of the color feature points. Then, for each map point, the module 206 searches the 3D space surrounding the transformed color feature points and selects the most similar transformed feature point from the sensor frame.
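The following sketch illustrates this 3D+2D search under assumed data structures (map points as objects with `position` and binary `descriptor` attributes) and an illustrative search radius; it is not the application's exact procedure:

```python
import numpy as np

def match_map_points(map_points, frame_features, prev_pose, search_radius=0.05):
    """3D+2D search sketch: lift the current frame's feature points into global
    coordinates using the previous sensor pose, then for each map point consider
    only features inside a small 3D window and keep the most similar descriptor."""
    lifted = []
    for f in frame_features:
        if f["xyz"] is None:
            continue
        p = prev_pose @ np.append(f["xyz"], 1.0)          # estimate global position
        lifted.append((p[:3], f))

    matches = []
    for mp in map_points:
        best, best_dist = None, np.inf
        for p_global, f in lifted:
            if np.linalg.norm(p_global - mp.position) > search_radius:
                continue                                  # outside the 3D search window
            # Hamming distance between binary descriptors
            d = np.count_nonzero(np.unpackbits(np.bitwise_xor(mp.descriptor, f["desc"])))
            if d < best_dist:
                best, best_dist = f, d
        if best is not None:
            matches.append((mp, best))
    return matches
```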
Turning back to
Next, the image processing module 206 decides (308) whether to insert the current sensor frame as a new key frame in the mapping database 208. For example, when the current sensor frame does not have enough feature points that match the map points in the key frames, the module 206 inserts the current sensor frame into the mapping database 208 as a new key frame in order to guarantee tracking reliability of subsequent sensor frames.
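A minimal sketch of such an insertion decision, with illustrative thresholds (the application does not specify the exact criteria):

```python
def should_insert_keyframe(num_matched, num_features, min_matches=50, min_ratio=0.2):
    """Insert the current sensor frame as a new key frame when too few of its
    feature points matched existing map points, so that subsequent frames still
    have nearby key frames to track against. Thresholds are illustrative."""
    if num_features == 0:
        return False
    return num_matched < min_matches or (num_matched / num_features) < min_ratio
```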
Once the key frame insertion decision has been made, the image processing module 206 generates the pose of the sensor 203 and the key frame insertion decision as output. The module 206 then updates the mapping database with the new key frame and corresponding map points in the frame—if the decision was made to insert the current sensor frame as a new key frame. Otherwise, the module 206 skips the mapping database update and executes the tracking module processing of
The image processing module 206 then fuses (904) similar map points between the newly-inserted key frame and its neighbor key frames. The fusion is achieved by similar 3D+2D searching with tighter thresholds, such as searching window size and feature matching threshold. The module 206 projects every map point in neighboring key frames from the global coordinate system to the newly-inserted key frame and vice versa. Then, for each projected map point, the module 206 searches for the map point with similar 2D features that is closest to the projected position in the newly-inserted key frame. Fusing similar map points naturally increases the connectivity between the newly-inserted key frame and its neighbor key frames. This benefits both tracking reliability and mapping, because more map points and key frames are involved in tracking and in local bundle adjustment during mapping.
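A sketch of this fusion step is shown below, assuming hypothetical key frame and map point structures (a `project` method, `map_points` lists, and a `merge` operation) and illustrative values for the tighter thresholds:

```python
import numpy as np

def fuse_map_points(new_kf, neighbor_kfs, window_px=3.0, max_desc_dist=30):
    """Project each map point of the neighbor key frames into the newly-inserted
    key frame and merge it with the closest map point having similar 2D features.
    The pixel window and descriptor threshold are illustrative 'tighter thresholds'."""
    for nkf in neighbor_kfs:
        for mp in nkf.map_points:
            u, v = new_kf.project(mp.position_world)       # global -> pixel coordinates
            best, best_d = None, np.inf
            for cand in new_kf.map_points:
                if abs(cand.u - u) > window_px or abs(cand.v - v) > window_px:
                    continue                               # outside the search window
                d = np.count_nonzero(np.unpackbits(np.bitwise_xor(cand.descriptor, mp.descriptor)))
                if d < best_d:
                    best, best_d = cand, d
            if best is not None and best_d < max_desc_dist:
                best.merge(mp)       # share observations between the two key frames
```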
In order to handle scenes without enough depth information, the image processing module 206 also estimates (906) 3D positions for feature points that do not have valid depth information. Estimation is achieved by matching feature points without valid depth values across two key frames subject to an epipolar constraint and feature distance constraints. The module 206 can then calculate the 3D position by linear triangulation to minimize the 2D re-projection error, as described by Richard Hartley and Andrew Zisserman, "Multiple View Geometry in Computer Vision", Cambridge University Press, 2003 (which is incorporated herein by reference). To achieve a good accuracy level, 3D positions are estimated only for pairs of feature points with enough parallax. The estimated 3D position accuracy of each map point improves as more key frames are matched to the map point and more key frames are involved in the next step—local key frame and map point refinement.
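A minimal sketch of this triangulation, using OpenCV's linear triangulation and an illustrative parallax threshold; the projection matrices P1 and P2 are assumed to be K[R|t] for the two key frames:

```python
import numpy as np
import cv2

def triangulate_point(P1, P2, uv1, uv2, min_parallax_deg=1.0):
    """Estimate a 3D position for a feature without valid depth by linear
    triangulation between two key frames. The parallax check rejects pairs whose
    viewing rays are nearly parallel; the 1-degree threshold is illustrative."""
    pt1 = np.array(uv1, dtype=np.float64).reshape(2, 1)
    pt2 = np.array(uv2, dtype=np.float64).reshape(2, 1)
    X_h = cv2.triangulatePoints(P1, P2, pt1, pt2)         # homogeneous 4x1 result
    X = (X_h[:3] / X_h[3]).ravel()

    # parallax between the two viewing rays, measured from the camera centers
    c1 = -np.linalg.inv(P1[:, :3]) @ P1[:, 3]
    c2 = -np.linalg.inv(P2[:, :3]) @ P2[:, 3]
    r1, r2 = X - c1, X - c2
    cos_a = r1 @ r2 / (np.linalg.norm(r1) * np.linalg.norm(r2))
    parallax = np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))
    return X if parallax > min_parallax_deg else None
```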
The image processing module 206 then refines (908) the poses of the newly-inserted key frame and correlated key frames, and 3D positions of the related map points. The refinement is achieved by local bundle adjustment, which optimizes the poses of the key frames and 3D position of the map points by, e.g., minimizing the re-projection error of map points relative to key frames.
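A minimal sketch of such a refinement, using SciPy's least-squares solver as a stand-in for a dedicated bundle adjustment library; the axis-angle pose parameterization and observation format are assumptions:

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def reprojection_residuals(params, n_kf, n_pts, K, observations):
    """Residuals for local bundle adjustment: params packs key-frame poses
    (axis-angle + translation, 6 values each) followed by map-point positions
    (3 values each); observations is a list of (kf_index, pt_index, u, v)."""
    poses = params[:6 * n_kf].reshape(n_kf, 6)
    points = params[6 * n_kf:].reshape(n_pts, 3)
    res = []
    for kf_i, pt_i, u, v in observations:
        R = Rotation.from_rotvec(poses[kf_i, :3]).as_matrix()
        t = poses[kf_i, 3:]
        p_cam = R @ points[pt_i] + t                      # world -> camera coordinates
        proj = K @ p_cam
        res.extend([proj[0] / proj[2] - u, proj[1] / proj[2] - v])
    return np.asarray(res)

# Usage sketch: x0 stacks the initial key-frame poses and map-point positions;
# least_squares refines both by minimizing the re-projection error.
# refined = least_squares(reprojection_residuals, x0,
#                         args=(n_kf, n_pts, K, observations)).x
```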
Turning back to
In conjunction with the mapping module processing for inserting a new key frame into the mapping database 208, the image processing module 206 also performs loop closing processing to minimize drifting error in the key frames.
Turning back to
Next, to close the loop (1406), the module 206 merges the latest inserted key frame with the matched key frame by merging the matched feature points and map points, and connects the key frames on one side of the loop to key frames on the other side of the loop. The drifting error accumulated during the loop can be corrected through global bundle adjustment. Similar to local bundle adjustment, which optimizes poses and map points of the key frames by minimizing re-projection error, global bundle adjustment uses the same concepts, but all of the key frames and map points in the loop are involved in the process.
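A sketch of the rigid-transformation estimation used to relate the two sides of the loop (selecting the hypothesis with the maximum inlier ratio, as described above) is shown below; the closed-form Kabsch/Umeyama alignment and the RANSAC-style sampling are assumed implementations with illustrative thresholds:

```python
import numpy as np

def estimate_rigid_transform(src, dst):
    """Closed-form (Kabsch/Umeyama) rotation and translation aligning matched
    3D feature points src -> dst (both Nx3 numpy arrays)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t

def ransac_loop_transform(src, dst, iters=200, inlier_thresh=0.05):
    """RANSAC-style sketch for loop closing: fit R, t on random 3-point subsets
    of the matched features and keep the hypothesis with the highest inlier
    ratio. Iteration count and inlier threshold are illustrative."""
    best, best_ratio = None, 0.0
    rng = np.random.default_rng(0)
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)
        R, t = estimate_rigid_transform(src[idx], dst[idx])
        err = np.linalg.norm((src @ R.T + t) - dst, axis=1)
        ratio = float(np.mean(err < inlier_thresh))
        if ratio > best_ratio:
            best, best_ratio = (R, t), ratio
    return best, best_ratio
```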
It should be appreciated that the methods, systems, and techniques described herein are applicable to a wide variety of useful commercial and/or technical applications. Such applications can include:
The above-described techniques can be implemented in digital and/or analog electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. The implementation can be as a computer program product, i.e., a computer program tangibly embodied in a machine-readable storage device, for execution by, or to control the operation of, a data processing apparatus, e.g., a programmable processor, a computer, and/or multiple computers. A computer program can be written in any form of computer or programming language, including source code, compiled code, interpreted code and/or machine code, and the computer program can be deployed in any form, including as a stand-alone program or as a subroutine, element, or other unit suitable for use in a computing environment. A computer program can be deployed to be executed on one computer or on multiple computers at one or more sites.
Method steps can be performed by one or more processors executing a computer program to perform functions by operating on input data and/or generating output data. Method steps can also be performed by, and an apparatus can be implemented as, special purpose logic circuitry, e.g., a FPGA (field programmable gate array), a FPAA (field-programmable analog array), a CPLD (complex programmable logic device), a PSoC (Programmable System-on-Chip), ASIP (application-specific instruction-set processor), or an ASIC (application-specific integrated circuit), or the like. Subroutines can refer to portions of the stored computer program and/or the processor, and/or the special circuitry that implement one or more functions.
Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital or analog computer. Generally, a processor receives instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and/or data. Memory devices, such as a cache, can be used to temporarily store data. Memory devices can also be used for long-term data storage. Generally, a computer also includes, or is operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. A computer can also be operatively coupled to a communications network in order to receive instructions and/or data from the network and/or to transfer instructions and/or data to the network. Computer-readable storage mediums suitable for embodying computer program instructions and data include all forms of volatile and non-volatile memory, including by way of example semiconductor memory devices, e.g., DRAM, SRAM, EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and optical disks, e.g., CD, DVD, HD-DVD, and Blu-ray disks. The processor and the memory can be supplemented by and/or incorporated in special purpose logic circuitry.
To provide for interaction with a user, the above described techniques can be implemented on a computer in communication with a display device, e.g., a CRT (cathode ray tube), plasma, or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse, a trackball, a touchpad, or a motion sensor, by which the user can provide input to the computer (e.g., interact with a user interface element). Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, and/or tactile input.
The above described techniques can be implemented in a distributed computing system that includes a back-end component. The back-end component can, for example, be a data server, a middleware component, and/or an application server. The above described techniques can be implemented in a distributed computing system that includes a front-end component. The front-end component can, for example, be a client computer having a graphical user interface, a Web browser through which a user can interact with an example implementation, and/or other graphical user interfaces for a transmitting device. The above described techniques can be implemented in a distributed computing system that includes any combination of such back-end, middleware, or front-end components.
The components of the computing system can be interconnected by transmission medium, which can include any form or medium of digital or analog data communication (e.g., a communication network). Transmission medium can include one or more packet-based networks and/or one or more circuit-based networks in any configuration. Packet-based networks can include, for example, the Internet, a carrier internet protocol (IP) network (e.g., local area network (LAN), wide area network (WAN), campus area network (CAN), metropolitan area network (MAN), home area network (HAN)), a private IP network, an IP private branch exchange (IPBX), a wireless network (e.g., radio access network (RAN), Bluetooth, Wi-Fi, WiMAX, general packet radio service (GPRS) network, HiperLAN), and/or other packet-based networks. Circuit-based networks can include, for example, the public switched telephone network (PSTN), a legacy private branch exchange (PBX), a wireless network (e.g., RAN, code-division multiple access (CDMA) network, time division multiple access (TDMA) network, global system for mobile communications (GSM) network), and/or other circuit-based networks.
Information transfer over transmission medium can be based on one or more communication protocols. Communication protocols can include, for example, Ethernet protocol, Internet Protocol (IP), Voice over IP (VOIP), a Peer-to-Peer (P2P) protocol, Hypertext Transfer Protocol (HTTP), Session Initiation Protocol (SIP), H.323, Media Gateway Control Protocol (MGCP), Signaling System #7 (SS7), a Global System for Mobile Communications (GSM) protocol, a Push-to-Talk (PTT) protocol, a PTT over Cellular (POC) protocol, Universal Mobile Telecommunications System (UMTS), 3GPP Long Term Evolution (LTE) and/or other communication protocols.
Devices of the computing system can include, for example, a computer, a computer with a browser device, a telephone, an IP phone, a mobile device (e.g., cellular phone, personal digital assistant (PDA) device, smart phone, tablet, laptop computer, electronic mail device), and/or other communication devices. The browser device includes, for example, a computer (e.g., desktop computer and/or laptop computer) with a World Wide Web browser (e.g., Chrome™ from Google, Inc., Microsoft® Internet Explorer® available from Microsoft Corporation, and/or Mozilla® Firefox available from Mozilla Corporation). Mobile computing devices include, for example, a Blackberry® from Research in Motion, an iPhone® from Apple Corporation, and/or an Android™-based device. IP phones include, for example, a Cisco® Unified IP Phone 7985G and/or a Cisco® Unified Wireless Phone 7920 available from Cisco Systems, Inc.
Comprise, include, and/or plural forms of each are open ended and include the listed parts and can include additional parts that are not listed. And/or is open ended and includes one or more of the listed parts and combinations of the listed parts.
One skilled in the art will realize the technology may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The foregoing embodiments are therefore to be considered in all respects illustrative rather than limiting of the technology described herein.
This application claims priority to U.S. Provisional Patent Application No. 62/357,916, filed on Jul. 1, 2016, the entirety of which is incorporated herein by reference.