The invention relates generally to situational awareness and visualization systems. More specifically, the invention relates to a system and method for providing enhanced situation awareness and immersive visualization of environments.
In order to operate effectively in remote, unknown environments, it is highly beneficial for a user to be provided with a visual and sensory environment that virtually immerses the user in a remote location. The immersion should give the user a near-physical feel for the layout, structure and threat level of buildings and other structures. Furthermore, the virtual immersion should convey the level of crowds, the typical patterns of activity and the typical sounds in different parts of the environment as the user virtually drives through an extended urban area. Such an environment provides a rich context for route visualization in many different applications. Key application areas for such a technology include an online active navigation tool for driving directions with intuitive feedback on geo-indexed routes on a map or on video, online situation awareness of large areas using multiple sensors for security purposes, offline authoring of directions and route planning, and offline training of military and other personnel on an unknown environment and its cultural significance.
Furthermore, no current state-of-the-art tools exist for creating a geo-specific, navigable video map from continuously captured data. Tools developed in the 1990s, such as QuickTimeVR, rely on highly constrained capture of image snapshots at key, pre-defined and calibrated locations in a 2D environment. The QTVR browser then simply steps through a series of 360-degree snapshots as the user moves along a 2D map.
At present, the state-of-the-art situational awareness and visualization systems are based primarily on creating synthetic environments that mimic a real environment. This is typically achieved by creating geo-typical or geo-specific models of the environment. The user can then navigate through the environment using interactive 3D navigation interfaces. There are some major limitations with the 3D approach. First, it is generally very hard to create high-fidelity geo-specific models of urban sites that capture all the details that a user is likely to encounter at ground level. Second, 3D models are typically static and do not allow a user to get a sense of the dynamic action such as movements of people, vehicles and other events in the real environment. Third, it is extremely hard to update static 3D models given that urban sites undergo continuous changes both in the fixed infrastructure as well as in the dynamic entities. Fourth, it is extremely hard to capture the physical and cultural ambience of an environment even with a geo-specific model since the ambience changes over different times of day and over longer periods of time.
Thus, there is a need to provide a novel platform for enhanced situational awareness of a real-time remote natural environment, preferably without the need for creating a 3D model of the environment.
The present invention provides a system and method for providing an immersive visualization of an environment. The method comprises receiving in real-time a continuous plurality of captured video streams of the environment via a video camera mounted on a moving platform, synchronizing captured audio with said video streams/frames, and associating GPS data with said captured video streams to provide metadata of the environment, wherein the metadata comprises a map with the vehicle location and orientation of each video stream. The method further comprises automatically processing the video streams with said associated GPS data to create an annotated hyper-video map, wherein the map provides a seamlessly navigable and indexable high-fidelity visualization of the environment.
Referring to
Referring to
Capture Device
As shown in
The above-mentioned camera 202 is also general enough to handle not just 360-degree video but also numerous other camera configurations that can gather video-centric information from moving platforms. For instance, a single-camera platform or multiple stereo camera pairs can be integrated into the route visualizer as well. In addition to video and audio, other sensors can also be integrated into the system. For instance, a lidar scanner (1D or 2D) can be integrated into the system to provide additional 3D mapping and improved mensuration capabilities.
Optionally, an inertial measuring unit, i.e. IMU (not shown), providing inertial measurements of the location of the moving platform 304 can also be mounted and integrated into the capture device 102. As known in the art, the IMU provides the altitude, location and motion of the moving platform. Alternatively, a sensor such as a 2D lidar scanner can also preferably be integrated into the capture device 102. The 2D lidar scanner can be utilized to obtain lidar data of the images. This data can be used in conjunction with the video, or independently, to obtain consistent poses of the camera sensor 202 across time.
Although the moving platform 204 as shown in
The captured video and audio data, along with the associated geo-spatial/GPS data retrieved from the capture device 102, is stored in the database 104 and is further processed using the vision aided navigation processing tool 106 as described hereinbelow.
Vision Aided Navigation Processing Tool
A. Video Inertial Navigation System (INS): Using standard video algorithms and software, one can automatically detect features in video frames and track the features over time to compute the camera and platform motion precisely. This information, derived at the frame rate of the video, can be combined with GPS information using known algorithms and/or software to precisely determine the location and orientation of the moving platform. This is especially useful in urban settings, which may have no, or at best spotty, GPS coverage. Also provided is a method to perform frame-accurate localization based on short-term and long-term landmark-based matching. This will compensate for translational drift errors that can accumulate, as will be described in greater detail below.
Preferably, the inertial measurements can be combined with the video and the associated GPS data to provide precise localization of the moving platform. This capability will enable the system to register video frames with precise world coordinates. In addition, transfer of annotations between video frames and a database may preferably be enabled. Thus, the problem of precisely localizing the captured videos with respect to the world coordinate system is solved by preferably integrating GPS measurements with inertial localization based on 3D motion from known video algorithms and software. This method therefore provides a robust environment in which the system can operate when only some of the sensor information is used. For example, one may not want to compute poses from the video during the online process; the visual interaction and feedback can still be provided based on just the GPS and inertial measurement information. Similarly, a user may enter areas of low GPS coverage, in which case the video INS can compensate for the missing location information.
As discussed above, a lidar scanner can alternatively be integrated as part of the capture device 102. The lidar frames can be registered to each other or to an accumulated reference point cloud to obtain relative poses between the frames. This can be further improved using landmark-based registration of features that are temporally further apart. Bundle adjustment of multiple frames can also improve the pose estimates. The system can extract robust features from the video that act as a collection of landmarks to remember. These can be used for correlation whenever the same location is revisited, either during the same trip or over multiple trips, and thereby to improve the pose information previously computed. These corrections can be further propagated across multiple frames of the video through a robust bundle adjustment step. The relative poses obtained can be combined with GPS and IMU data to obtain an absolute location of the sensor rig. In a preferred embodiment, both lidar and video can provide an improved estimation of the poses using both sensors simultaneously.
B. 3D Motion Computation and 3D Video Stabilization and Smoothing: 3D motion of the camera can be computed using the known techniques disclosed by David Nister and James R. Bergen (hereinafter "Nister et al."), Real-time video-based pose and 3D estimation for ground vehicle applications, ARL CTAC Symposium, May 2003, and Bergen, J. R., Anandan, P., Hanna, K. J., Hingorani, R. (hereinafter "Bergen et al."), Hierarchical Model-Based Motion Estimation, ECCV 1992, pp. 237-252. 3D pose estimates are computed for every frame in the video for this application. These estimates are essential for providing a high-fidelity immersive experience. Feature points in images of the environment are detected and tracked over multiple frames to establish point correspondences over time. Subsequently, a 3D camera attitude and position estimation module employs algebraic 3D motion constraints between multiple frames to rapidly hypothesize and test numerous pose hypotheses. In order to achieve robust performance in real time, the feature tracking and the hypothesis generation and testing steps are algorithmically and computationally highly optimized. A novel preemptive RANSAC (Random Sample Consensus) technique is implemented that can rapidly hypothesize pose estimates that compete in a preemptive scoring scheme designed to quickly find a motion hypothesis that enjoys large support among all the feature correspondences, providing the required robustness against outliers in the data (e.g. independently moving objects in the scene). A real-time refinement step based on an optimal objective function is used to determine the optimal pose estimates from a small number of promising hypotheses. This technique is disclosed by the combination of the above-mentioned articles by Nister et al. and by Bergen et al. with M. Fischler and R. Bolles, Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography, Commun. Assoc. Comp. Mach., 24:381-395, 1981.
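By way of a non-limiting illustration of the preemptive scoring idea only (not the specific implementation of Nister et al.), the following Python sketch generates pose hypotheses from minimal samples and prunes them block by block; the hypothesis generator (e.g. a five-point pose solver) and the per-correspondence scoring function are assumed to be supplied by the caller, and the hypothesis count and block size are illustrative values.

import random

def preemptive_ransac(correspondences, generate_hypothesis, score_fn,
                      num_hypotheses=500, block_size=100):
    """Return the pose hypothesis that survives preemptive scoring.

    correspondences    : list of feature correspondences across frames
    generate_hypothesis: callable(sample) -> candidate pose (assumed supplied)
    score_fn           : callable(pose, correspondence) -> support score (assumed supplied)
    """
    # 1. Hypothesize poses from random minimal samples.
    hypotheses = []
    for _ in range(num_hypotheses):
        sample = random.sample(correspondences, k=5)   # minimal set for a 5-point solver
        hypotheses.append([generate_hypothesis(sample), 0.0])

    # 2. Score on blocks of correspondences; after each block keep only the
    #    better-scoring half (an exponential preemption schedule).
    random.shuffle(correspondences)
    for start in range(0, len(correspondences), block_size):
        block = correspondences[start:start + block_size]
        for h in hypotheses:
            h[1] += sum(score_fn(h[0], c) for c in block)
        hypotheses.sort(key=lambda h: h[1], reverse=True)
        hypotheses = hypotheses[:max(1, len(hypotheses) // 2)]
        if len(hypotheses) == 1:
            break

    # The surviving hypothesis would then be passed to the refinement step.
    return hypotheses[0][0]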
Additionally, vehicle-borne video obtained from the camera rig can be unstable owing to jitter, jerks and sudden jumps in the captured video caused by the physical motion of the vehicle. The computed 3D pose estimates are used to smooth the 3D trajectory and remove high-frequency jitter, thus providing a video stabilization and smoothing technology to alleviate these effects. Based on the 3D poses, the trajectory of the platform can be smoothed. Additionally, a dominant plane seen in the video (such as the ground plane) can be used as a reference to stabilize the sequence. Based on the derived stabilization parameters, a new, very smooth video sequence can be synthesized. The video synthesis can use either 3D or 2D image processing methods to derive the new frames. The computed 3D poses provide a geo-spatial reference for where the moving platform was and for the travel direction. These 3D poses are further stored in the hyper-video database 104.
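As a simple illustration of trajectory smoothing (one of several possible schemes, and not necessarily the one used in practice), the following sketch low-pass filters the per-frame camera positions with a Gaussian kernel to suppress high-frequency jitter; the kernel width is an assumed parameter.

import numpy as np

def smooth_trajectory(positions, sigma=5.0):
    """Gaussian-smooth an (N, 3) array of per-frame camera positions."""
    positions = np.asarray(positions, dtype=float)
    radius = int(3 * sigma)
    taps = np.exp(-0.5 * (np.arange(-radius, radius + 1) / sigma) ** 2)
    taps /= taps.sum()
    # Pad by edge replication so the ends of the route are not pulled inward.
    padded = np.pad(positions, ((radius, radius), (0, 0)), mode="edge")
    smoothed = np.stack(
        [np.convolve(padded[:, axis], taps, mode="valid") for axis in range(3)],
        axis=1)
    return smoothed   # jitter-free trajectory used to re-render the stabilized video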
Alternatively, a multi-camera device may be employed to provide improved robustness in exploiting features across the scene, improved landmark matching of the features and improved precision over a wide field of view. This provides very strong constraints in estimating the 3D motion of the sensor. In both the known standard monocular and stereo visual odometry algorithms, the best pose for a camera at the end of the preemptive RANSAC routine is passed to a pose refinement step. This is generalized in the multi-camera system, and the refinement is distributed across cameras in the following way. For each camera, the best cumulative-scoring hypothesis is refined not only in the camera from which it originated but also in all the other cameras after it is transferred accordingly. Then, the cumulative scores of these refined hypotheses in each camera are computed and the best cumulative-scoring refined hypothesis is determined. This pose is stored in the camera in which it originated (it is transferred if the best pose comes from a different camera than the original). This process is repeated for all the cameras in the system. At the end, each camera will have a refined pose obtained in this way. As a result, advantage is taken of the fact that a given camera pose may be polished better in another camera and therefore have a better global score. As the final step, the pose of the camera with the best cumulative score is selected and applied to the whole system.
In a monocular multi-camera system, there may still be a scale ambiguity in the final pose of the camera rig. By recording GPS information with the video, scale can be inferred for the system. Alternatively, an additional camera can be introduced to form a stereo pair and recover scale.
C. Landmark Matching: Even with the multi-camera system described above, the aggregation of frame-by-frame estimates can eventually accumulate significant error. With dead reckoning alone, two sightings of the same location may be mapped to different locations in a map. However, by recognizing landmarks corresponding to the common location and identifying that location as the same, an independent constraint on the global location of the landmark is obtained. This global constraint-based optimization, combined with locally estimated and constrained locations, leads to a globally consistent location map as the same locale is visited repeatedly.
Thus, the approach will be able to locate a landmark purely by matching its associated multi-modal information with a landmark database constructed so as to facilitate efficient search. This approach is fully described by Y. Shan, B. Matei, H. S. Sawhney, R. Kumar, D. Huber, M. Hebert, "Linear Model Hashing and Batch RANSAC for Rapid and Accurate Object Recognition", IEEE International Conference on Computer Vision and Pattern Recognition, 2004. Landmarks are employed both for short-range motion correction and for long-range localization. Short-range motion correction uses landmarks to establish feature correspondences over a longer time span and distance than the frame-to-frame motion estimation. With an increased baseline over a larger time gap, motion estimates are more accurate. Long-range landmark matching establishes correspondences between newly visible features at a given time instant and their previously stored appearance and 3D representations. This enables high-accuracy absolute localization and avoids drift in frame-to-frame location estimates.
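The following simplified sketch illustrates the landmark lookup step only; it uses a brute-force nearest-neighbour search with a ratio test rather than the linear model hashing of Shan et al., and the descriptor arrays and landmark identifiers are assumed to be produced elsewhere.

import numpy as np

def match_landmarks(query_desc, db_desc, db_ids, ratio=0.8):
    """query_desc: (Q, D) descriptors from the current frame
    db_desc   : (M, D) descriptors stored in the landmark database (M >= 2)
    db_ids    : (M,) landmark identifiers (geo-referenced elsewhere)
    Returns a list of (query_index, landmark_id) matches."""
    db_desc = np.asarray(db_desc, dtype=float)
    matches = []
    for qi, q in enumerate(np.asarray(query_desc, dtype=float)):
        dists = np.linalg.norm(db_desc - q, axis=1)
        best, second = np.argsort(dists)[:2]
        if dists[best] < ratio * dists[second]:   # keep only unambiguous matches
            matches.append((qi, db_ids[best]))
    return matches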
Moreover, vehicle position information provided by the video INS and GPS may preferably be fused in an EKF (Extended Kalman Filter) framework together with measurements obtained through landmark matching to further improve the pose estimates. GPS acts as a mechanism for resetting drift errors accumulated in the pose estimation. In the absence of GPS (due to temporary drops), landmark-matching measurements help reduce the accumulation of drift and correct the pose estimates.
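A deliberately simplified, linear instance of this fusion is sketched below: relative displacements from the video INS drive the prediction step, while GPS fixes or landmark-matching measurements drive the update step. A full implementation would carry 3D pose and attitude in the state; the 2D position state and noise values here are purely illustrative.

import numpy as np

class PoseFusionFilter:
    def __init__(self, x0, p0=10.0):
        self.x = np.asarray(x0, dtype=float)   # [easting, northing]
        self.P = np.eye(2) * p0                # state covariance

    def predict(self, delta_xy, q=0.5):
        """Dead-reckoning step from frame-to-frame visual odometry."""
        self.x = self.x + np.asarray(delta_xy, dtype=float)
        self.P = self.P + np.eye(2) * q        # drift grows without absolute fixes

    def update(self, z_xy, r=4.0):
        """Absolute fix from GPS or a recognized landmark resets the drift."""
        z = np.asarray(z_xy, dtype=float)
        S = self.P + np.eye(2) * r             # innovation covariance (H = I)
        K = self.P @ np.linalg.inv(S)          # Kalman gain
        self.x = self.x + K @ (z - self.x)
        self.P = (np.eye(2) - K) @ self.P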
Hyper-Video Map and Route Visualization Processing Tool
Since the goal is to enable the user to virtually "drive/walk" on city streets while taking arbitrary routes along roads, the stored video map cannot simply be the linearly captured video placed on a DVD. Thus, the hyper-video map and route visualization tool processes the video and the associated GPS and 3D motion estimates together with a street map of the environment to generate a hyper-video map. Generally, the 3D pose computed as described above provides metadata for a route map comprising a geo-spatial reference to where the moving platform was and the travel direction. This gives the user the capability to mouse over the route map and spatially hyper-index into any part of the video instantly. Regions around each of these points are also hyper-indexed to provide rapid navigational links between different parts of the video. For example, when the user navigates to an intersection using the hyper-indexed visualization engine, he can pick a direction in which he wants to turn. The corresponding hyper-link will index into the part of the video that contains the selected subset of the route. A detailed description of the processing and route visualization is provided herein below.
A. Spatially Indexable Hyper-Video Map: The hyper-video and route visualization tool retrieves from the database N video sequences, synchronized with time stamps, and metadata comprising a map with the vehicle location (UTM) and orientation for each video frame in the input sequences. The metadata is scanned to identify the places where the vehicle path intersects itself, and a graph corresponding to the trajectory followed by the vehicle is generated. Each node in the graph corresponds to a road segment, and edges link nodes if the corresponding road segments intersect. For each node, a corresponding clip from the input video sequences is extracted, and a pointer to the video clip is stored in the node. Preferably, a map or overhead photo of the area may optionally be retrieved from the database, so the road structure covered by the vehicle can be overlaid on it for display and verification. This results in a spatially indexable video map that can be used in several ways.
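The following sketch illustrates one way the metadata scan and graph construction could be organized; the distance threshold, the temporal gap and the quadratic self-intersection test are illustrative assumptions rather than features of the tool, and a complete implementation would also cross-link non-adjacent segments that meet at the same physical intersection.

import math

def build_route_graph(metadata, radius=5.0, min_gap=100):
    """metadata: list of (frame_index, easting, northing), one per video frame.
    Returns (nodes, edges): each node holds the frame range of a road segment
    (a pointer into the source video) and edges join consecutive segments."""
    # 1. Find self-intersections: the path re-enters a place it visited earlier.
    cut_frames = set()
    for i, (fi, xi, yi) in enumerate(metadata):
        for j in range(i + min_gap, len(metadata)):   # skip frames close in time
            fj, xj, yj = metadata[j]
            if math.hypot(xi - xj, yi - yj) < radius:
                cut_frames.update((fi, fj))

    # 2. Cut the trajectory into road segments (graph nodes) at the intersections.
    bounds = sorted(set([metadata[0][0]] + list(cut_frames) + [metadata[-1][0]]))
    nodes = [{"id": k, "clip": (bounds[k], bounds[k + 1])}
             for k in range(len(bounds) - 1)]

    # 3. Consecutive segments share an intersection, so link them.
    edges = [(nodes[k]["id"], nodes[k + 1]["id"]) for k in range(len(nodes) - 1)]
    return nodes, edges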
B. Route Visualization GUI: A GUI interface is provided to each user to experience the environment through multiple trips/missions merged into a single hyper-video database 104. The hyper-indexed visualization engine acts as a functional layer to the GUI front-end to rapidly extract information from the database, which is then rendered on the map. The user is able to view the route as it evolves on a map and simultaneously view the video and audio while navigating through the route using the hyper-indexed visualization engine. Geo-coded information available in the database is overlaid on the map and the video to provide an intuitive training experience. Such information may include geo-coded textual information, vector graphics, 3D models or video/audio clips. The hyper-video and route visualization tool 108 integrates with the hyper-video database 104, which brings standardized geo-coded symbolic information into the browser. The user will be able to immerse into the environment, preferably by wearing head-mounted goggles and stereo headphones.
Audio Processing Tool
The sound captured by the audio 204, preferably comprising a spherical microphone array, may be corrupted by the noise of the vehicle 202 upon which it is mounted. The noise of the vehicle is removed using adaptive noise cancellation (ANC), whereby a reference measurement of the noise alone is subtracted from each of the microphone signals. The noise reference is obtained either from a separate microphone nearer the vehicle, or from a beam pointed downwards towards the vehicle. In either case, frequency-domain least mean squares (FDLMS) is the preferred ANC algorithm, with good performance and low computational complexity.
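An illustrative overlap-save FDLMS canceller is sketched below for a single microphone channel; the filter length and step size are assumptions, and the primary and noise-reference channels are assumed to be time-aligned one-dimensional arrays at the same sample rate.

import numpy as np

def fdlms_cancel(primary, noise_ref, n_taps=256, mu=0.05, eps=1e-6):
    """Subtract an adaptively filtered copy of the noise reference from the
    primary microphone signal; returns the error signal, i.e. the cleaned audio."""
    N = n_taps
    W = np.zeros(2 * N, dtype=complex)   # filter weights in the frequency domain
    out = np.zeros(len(primary))
    for start in range(N, len(primary) - N + 1, N):
        x_block = noise_ref[start - N:start + N]       # last 2N reference samples
        X = np.fft.fft(x_block)
        y = np.real(np.fft.ifft(X * W))[N:]            # overlap-save: keep last N outputs
        e = primary[start:start + N] - y               # residual = de-noised block
        out[start:start + N] = e
        E = np.fft.fft(np.concatenate([np.zeros(N), e]))
        # Power-normalized gradient with the usual gradient constraint.
        grad = np.real(np.fft.ifft(np.conj(X) * E / (np.abs(X) ** 2 + eps)))[:N]
        W += mu * np.fft.fft(np.concatenate([grad, np.zeros(N)]))
    return out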
The goal of audio-based rendering is to capture a 3D audio scene in a way that allows later virtual rendering of the binaural sounds a user would hear for any arbitrary look direction. To accomplish this, a spherical microphone array is preferably utilized for sound capture, and solid-cone beam forming convolved with head related transfer functions (HRTF) is used to render the binaural stereo.
Given a monaural sound source in free space, HRTF is the stereo transfer function from the source to an individual's two inner ears, taking into account diffraction and reflection of sound as it interacts with both the environment and the user's head and ears. Knowing the HRTF allows processing any monaural sound source into binaural stereo that simulates what a user would hear if a source were at a given direction and distance.
In a preferred embodiment, a 2.5 cm diameter spherical array with six microphones is used as the audio 204. During capture, the raw signals are recorded. During rendering, the 3D space is divided into eight fixed solid cones using frequency-invariant beam forming based on the spherical harmonic basis functions. The microphone signals are then projected into each of the fixed cones. The output of each beam former is then convolved with an HRTF defined by the look direction and cone center, and the results are summed over all cones.
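The sketch below is a greatly simplified stand-in for this rendering pipeline: the spherical-harmonic beamformer is replaced by per-cone cardioid weighting of the six microphone signals, and the HRTFs are assumed to be available as head-related impulse responses selected per cone after rotation by the listener's look direction. All inputs are illustrative assumptions.

import numpy as np

def render_binaural(mic_signals, mic_dirs, cone_dirs, hrir_lookup, look_dir):
    """mic_signals: (M, T) raw microphone recordings
    mic_dirs    : (M, 3) unit vectors for the microphone directions on the sphere
    cone_dirs   : (C, 3) unit vectors, centres of the fixed solid cones
    hrir_lookup : callable(direction) -> (hrir_left, hrir_right) 1-D arrays (assumed)
    look_dir    : 3x3 rotation matrix for the listener's head orientation."""
    T = mic_signals.shape[1]
    left = np.zeros(T)
    right = np.zeros(T)
    for cone in cone_dirs:
        # Crude fixed beam toward this cone: cardioid weights over the microphones.
        w = 0.5 * (1.0 + mic_dirs @ cone)
        beam = (w[:, None] * mic_signals).sum(axis=0) / w.sum()
        # The cone direction seen from the current head orientation selects the HRIR.
        hl, hr = hrir_lookup(look_dir @ cone)
        left += np.convolve(beam, hl)[:T]
        right += np.convolve(beam, hr)[:T]
    return np.stack([left, right])   # binaural stereo for this look direction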
In another embodiment of the present invention, an algorithm and software may preferably be provided by the audio processing tool 110 to develop an interface for inserting audio information into a scene for visualization. For example, inserting a small audio snippet of someone talking about a threat, in a language not familiar to the user, into a data collect done in a remote environment may test the user's ability to comprehend some key phrases in the context of the situation for military training. As another example, audio commentary about a tourist destination will enhance a traveler's experience and understanding of the areas he is viewing at that time.
Furthermore, in another embodiment of the present invention, key feature points on each video frame will be tracked and the 3D locations of these points will be computed. The known standard algorithms and/or software described by Hartley and Zisserman, "Multiple View Geometry in Computer Vision", Cambridge University Press, 2000, provide a means of making 3D measurements within the video map as the user navigates through the environment. This requires processing of the video and GPS data to derive 3D motion and 3D coordinates in order to measure distances between locations in the environment. The user can manually identify points of interest on the video and obtain the 3D location of each point and the distance from the vehicle to that point. This requires the point to be identified in at least two spatially separated video frames. The separation of the frames dictates the accuracy of the geo-location for a given point. A rapid tool is provided for identifying a point across two frames: when the user selects a point of interest, the system draws the corresponding epipolar line on any other selected frame to enable the user to rapidly identify the corresponding point in that frame.
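The two operations involved, drawing the epipolar line for a clicked point and triangulating the 3D point once its correspondence is identified, can be sketched as follows, assuming the fundamental matrix F and the 3x4 projection matrices P1 and P2 have already been recovered by the pose-estimation stage; all inputs are illustrative.

import numpy as np

def epipolar_line(F, pt1):
    """Line l = F x1 in the second image on which the corresponding point must lie.
    Returned as (a, b, c) with a*u + b*v + c = 0, normalized for drawing."""
    l = F @ np.array([pt1[0], pt1[1], 1.0])
    return l / np.linalg.norm(l[:2])

def triangulate(P1, P2, pt1, pt2):
    """Linear (DLT) triangulation of the 3D point seen at pt1 in frame A and
    pt2 in frame B; returns its 3D coordinates."""
    A = np.stack([pt1[0] * P1[2] - P1[0],
                  pt1[1] * P1[2] - P1[1],
                  pt2[0] * P2[2] - P2[0],
                  pt2[1] * P2[2] - P2[1]])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]   # range to the platform follows from the camera centre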
In order to estimate 3D structure along the road (store fronts, lamp posts, parked cars, etc.) or the 3D location of distant landmarks, it is necessary to track distinctive features across multiple frames in the input video sequence and triangulate the corresponding 3D location of the point in the scene.
Alternatively, user-selected features or points can be tracked automatically. In this case the user clicks on the point of interest in one image, and the system tracks the feature in consecutive frames. An adaptive template matching technique could be used to track the point. The adaptive version will help match across changes in viewpoint. For stable range estimates, it is important that the selected camera baseline (i.e. distance from A to B) be sufficiently large (the baseline should be at least 1/50 of the range).
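A bare-bones version of such a tracker is sketched below: normalized cross-correlation of a template around the clicked point, with the template blended toward the best match in each new frame so that it adapts to gradual viewpoint change. The window sizes and blending factor are assumptions, and the clicked point is assumed to stay away from the image borders.

import numpy as np

def track_point(frames, start_xy, tsize=15, search=30, blend=0.15):
    """frames: list of 2-D grayscale images; start_xy: integer (x, y) clicked by the user.
    Yields the tracked (x, y) in each subsequent frame."""
    x, y = start_xy
    tpl = frames[0][y - tsize:y + tsize + 1, x - tsize:x + tsize + 1].astype(float)
    for img in frames[1:]:
        best, best_xy = -2.0, (x, y)
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                patch = img[y + dy - tsize:y + dy + tsize + 1,
                            x + dx - tsize:x + dx + tsize + 1].astype(float)
                if patch.shape != tpl.shape:
                    continue                       # candidate ran off the image border
                a, b = patch - patch.mean(), tpl - tpl.mean()
                ncc = (a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
                if ncc > best:
                    best, best_xy = ncc, (x + dx, y + dy)
        x, y = best_xy
        # Adapt the template slowly toward the new appearance of the point.
        new_tpl = img[y - tsize:y + tsize + 1, x - tsize:x + tsize + 1].astype(float)
        if new_tpl.shape == tpl.shape:
            tpl = (1 - blend) * tpl + blend * new_tpl
        yield (x, y)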
Optionally, if stereo data or lidar data is available, measurements made using the 3D location information provided by the sensor can be used directly with its pose to estimate location. Multiple frames can also be used to improve the estimated results. Lidar provides accurate distance measurements to points in the environment. These, combined with the poses, allow an accumulated point cloud to be built in a single 3D coordinate system. The 3D measurements can then be extracted by going back to the accumulated point cloud.
In a preferred embodiment of the present invention, object recognition cores can preferably be integrated into the route visualization system to annotate the hyper-video map with automatic detection and classification of common objects seen in the spatially indexed hyper-video map. A few key classes such as people, vehicles and buildings are identified and inserted into the system so the user can view these entities during visualization. This capability can further be extended to a wider array of classes and subclasses. The user will have the flexibility of viewing video annotated with these object labels. An example of automated people detection and localization is shown in
One preferred approach to object detection and classification employs a comprehensive collection of shape, motion and appearance constraints. Algorithms developed to robustly detect independent object motions after computing the 3D motion of the camera are disclosed in Tao, H., Sawhney, H. S., Kumar, R., "Object Tracking with Bayesian Estimation of Dynamic Layer Representations", IEEE Transactions on Pattern Analysis and Machine Intelligence, (24), No. 1, January 2002, pp. 75-89; Guo, Y., Hsu, S., Shan, Y., Sawhney, H. S., Kumar, R., "Vehicle Fingerprinting for Reacquisition and Tracking in Videos", IEEE Proceedings of CVPR 2005 (II: 761-768); and Zhao, T., Nevatia, R., "Tracking Multiple Humans in Complex Situations", IEEE Transactions on Pattern Analysis and Machine Intelligence, (26), No. 9, September 2004. In the embodiment of the present invention, the independent object motions will either violate the epipolar constraint that relates the motion of image features over two frames under rigid motion, or the independently moving objects will violate the structure-is-constant constraint over three or more frames. The first step is to recover the camera motion using the visual odometry described above. Next, the image motion due to the camera rotation (which is independent of the 3D structure in front of the camera) is eliminated and the residual optical flow is computed. After recovering the epipolar geometry using 3D camera motion estimation, and estimating the parallax flow that is related to 3D shape, violations of the two constraints are detected and labeled as independent motions.
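The epipolar-constraint test can be illustrated with the following sketch, which flags correspondences whose Sampson error with respect to the estimated fundamental matrix is too large to be explained by the rigid scene; the pixel threshold is an assumption and the structure-constancy test over three or more frames is not shown.

import numpy as np

def flag_independent_motion(F, pts_a, pts_b, thresh_px=2.0):
    """F: 3x3 fundamental matrix from the estimated camera motion.
    pts_a, pts_b: (N, 2) matched pixel coordinates in two frames.
    Returns a boolean mask that is True for likely independently moving points."""
    ones = np.ones((len(pts_a), 1))
    xa = np.hstack([pts_a, ones])            # homogeneous coordinates, image A
    xb = np.hstack([pts_b, ones])            # homogeneous coordinates, image B
    Fxa = xa @ F.T                           # epipolar lines in image B
    Ftxb = xb @ F                            # epipolar lines in image A
    num = np.square(np.sum(xb * Fxa, axis=1))
    den = Fxa[:, 0]**2 + Fxa[:, 1]**2 + Ftxb[:, 0]**2 + Ftxb[:, 1]**2
    sampson = num / np.maximum(den, 1e-12)   # squared Sampson distance in pixels^2
    return sampson > thresh_px**2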
In addition to the motion and shape constraints discussed above, static image constraints such as 2D shape and appearance can preferably be employed, as disclosed in Feng Han and Song-Chun Zhu, Bottom-up/Top-Down Image Parsing by Attribute Graph Grammar, ICCV 2005, Vol. 2, 17-20 Oct. 2005, pp. 1778-1785. This approach to object classification differs from previous approaches that use manual clustering of training data into multiple views and poses. In this approach, a Nested-Adaboost is proposed to automatically cluster the training samples into different views/poses, and thus train a multiple-view multiple-pose classifier without any manual labor. An example output for people and vehicle classification and localization is shown in
The moving platform will move through the environment, capturing image or video data and additionally recording GPS or inertial sensor data. The system should then be able to suggest names or labels for objects automatically, indicating that some object or individual has been seen before, or suggesting annotations. This functionality requires building models from all the annotations produced during the journey through the environment. Some models will link image structures with spatial annotations (e.g., GPS or INS); such models allow the identification of fixed landmarks. Other models will link image structures with transcribed speech annotations; such models make it possible to recognize these structures in new images. See
In order to link image structures to annotations, it is critical to determine which image structure should be linked to which annotation. In particular, if one has a working model of each object, then one can determine which image structure is linked to which annotation; similarly, if one knows which image structure is linked to which annotation, one can build an improved working model of each object. This process is readily formalized using the EM algorithm described above.
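A toy formalization of this linking step is sketched below: each annotation label is modeled by a mean feature vector over image structures, and EM alternates between softly assigning each label to the structures in its image (E-step) and refitting the label models (M-step). The data layout, variance and iteration count are illustrative assumptions only.

import numpy as np

def em_link(images, labels, dim, n_iter=20, var=1.0):
    """images: list of (features, names), where features is an (S, dim) array of
    image structures in one image and names lists the annotation labels attached
    to that image.  labels: all distinct label strings.
    Returns {label: mean feature vector} as the learned working models."""
    rng = np.random.default_rng(0)
    mu = {w: rng.normal(size=dim) for w in labels}        # initial label models
    for _ in range(n_iter):
        acc = {w: np.zeros(dim) for w in labels}
        wsum = {w: 0.0 for w in labels}
        for feats, names in images:
            feats = np.asarray(feats, dtype=float)
            for w in names:
                # E-step: which structure in this image does label w refer to?
                d2 = np.sum((feats - mu[w]) ** 2, axis=1)
                resp = np.exp(-0.5 * d2 / var)
                resp /= resp.sum() + 1e-12
                # Accumulate sufficient statistics for the M-step.
                acc[w] += resp @ feats
                wsum[w] += resp.sum()
        for w in labels:                                  # M-step: refit label models
            if wsum[w] > 0:
                mu[w] = acc[w] / wsum[w]
    return mu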
In a further embodiment of the present invention, an algorithm and software are provided for storyboarding and annotating video information collected from the environment. The storyboard provides a quick summarization of the events/trip laid out on the map that quickly and visually describes the whole trip in a single picture. The storyboard will be registered with respect to a map of the environment. Furthermore, any annotations stored in a database for buildings and other landmarks will be inherited by the storyboard. The user will also be able to create hot-spots of video and other events for others to view and interact with. For example, a marine patrol will preferably move over a wide area during the course of its mission. It is useful to characterize such a mission through some key locations or times of interest. The user interface will present this information as a comprehensive storyboard overlaid on a map. Such a storyboard provides a convenient summary of the mission and acts as a spatio-temporal menu into the mission. Spatio-temporal information correlates items to their spatial (location/geo-location) and temporal (time of occurrence) context in a single unified frame of reference.
In a preferred embodiment, comparison of routes is a valuable function provided to the user. Two or more routes can be displayed simultaneously on the map for comparison. The user will be able to set deviations of a path with respect to a reference route and have them highlighted on the map. As the user moves the cursor over the routes, co-located video feeds are displayed for comparison. Additionally, the video can be pre-processed to identify gross changes to the environment, and these can be highlighted in the video and on the map. This can be a great asset in improvised explosive device detection, where changes to the terrain or newly parked vehicles can be detected and highlighted for threat assessment.
In a preferred embodiment, the structure of the environment can be extracted and processed to build 3D models or facades along the route. In one aspect, with monocular video, structure from motion can be computed to obtain 3D information about the environment. In another aspect, with stereo cameras, the computed stereo depth can be used to estimate 3D structure. In a further aspect, with lidar images, 3D structure can be obtained from the accumulated point clouds. This can be incorporated into the route visualization to provide 3D rendering of the route and of objects of interest.
In an additional preferred embodiment of the present invention, the system will provide a novel way of storing, indexing and browsing video and map data, and this will require the development of novel playback tools that are characteristically different from the traditional linear/non-linear playback of video data or navigation of 3D models. The playback tool is simply able to take a storage device, for example a DVD, and allow the user simplified navigation through the environment. In addition to the play/indexing modes described in the spatially indexable video map creation section above, the video can contain embedded hyperlinks added at the map creation stage. The user can click on these links to change the vehicle trajectory (e.g. take a turn at an intersection). A natural extension of the playback tool is to add an orientation sensor to the helmet with the heads-up display through which the user sees the video. By monitoring the head orientation, the corresponding field of view (out of the 360 degrees) can be rendered, giving the user a more natural "look around" capability.
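The "look around" capability can be illustrated for a panoramic frame stored in equirectangular form: the helmet's yaw reading selects the horizontal window of the 360-degree frame that is rendered. The frame layout and field of view below are assumptions for illustration only.

import numpy as np

def viewport(frame, yaw_deg, fov_deg=90.0):
    """frame: (H, W, 3) equirectangular panorama covering 360 degrees horizontally.
    Returns the horizontal slice centred on the head's yaw direction."""
    h, w = frame.shape[:2]
    centre = int(((yaw_deg % 360.0) / 360.0) * w)
    half = int((fov_deg / 360.0) * w / 2)
    cols = np.arange(centre - half, centre + half) % w   # wrap around the seam
    return frame[:, cols]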
In an even further embodiment of the present invention, the system 100 as defined above can preferably be provided for live, real-time use, i.e. in a live operational environment. In a live system, on-line computation of the pose (location and view) information can be used to map out one's route on a map and on the live video available. In the live environment, the user will be able to overlay geo-coded information such as landmarks, road signs and audio commentary on the video, and will also receive navigation support at locations where GPS coverage is not available or is spotty. For example, if the user enters a tunnel, an underpass or an area of heavy tree coverage, the system can still provide accurate location information for navigation. Also, in a live environment, for example in a military application, the user will be informed of potential threats based on online geo-coded information received and on the object classification/recognition components.
The live system can also desirably be extended to provide a distributed situation awareness system. The live system will provide a shared map and video-based storyboarding of multiple live sensor systems moving in the same environment, even though they may be distributed over an extended area. Each moving platform embedded with a sensor rig such as the camera will act as an agent in the distributed environment. Route/location information from each platform, along with relevant video clips, will be transmitted to a central location or to the other platforms, preferably via wireless channels. The route visualization GUI will provide a storyboard across all the sensor rigs and allow an interactive user to hyper-index into any location of interest and drill down for further information. This extends to providing a remote interface such that the information can be stored on a server that is accessed through a remote interface by another user. This also enables rapid updating of the information as additional embedded platforms are processed. This sets up a collaborative information-sharing network across multiple users/platforms active at the same time. The information each unit has is shared with others through a centralized server or through a network of local servers embedded with each unit. This allows each unit to be aware of where the other units are and to benefit from the imagery seen by the other users.
Although various embodiments that incorporate the teachings of the present invention have been shown and described in detail herein, those skilled in the art can readily devise many other varied embodiments that still incorporate these teachings without departing from the spirit and the scope of the invention.
This application claims the benefit of U.S. Provisional Patent Application No. 60/720,553 filed Sep. 29, 2005, the entire disclosure of which is incorporated herein by reference.