The present disclosure relates to improvements for constructing floor plans for interactive video control. More particularly, it relates to methods and systems for providing an end-to-end pipeline providing real-time floor plan construction from multiple sparse camera views.
In some instances, an event is streamed or recorded on multiple cameras covering the event from different views. For example, there might be one camera viewing from the right, one from the left, and one or more from the front. Typically, the video provider controls which camera is used for the video stream/recording.
Most viewers today watch video content on a device that allows user interaction with the video—computers, smartphones, tablet devices, etc. These allow the user control over how they experience the video through user interface controls.
Some user interface controls utilize icons presented on the screen with the video that can be selected (by touch screen, mouse click, or the like) to control the video.
The methods and systems herein present an effective way to construct a 2D floor plan visualization of multiple sparse camera views which can be used, for example, with a user interface such that a user can select in real-time which camera's view to use for a video stream/playback in an interactive manner.
An embodiment of the present invention is a method to construct a 2D floor plan visualization based on the views of a plurality of sparse cameras viewing one or more objects of interest, the method comprising: performing computer object detection on the one or more objects of interest for video from each of the plurality of sparse cameras; performing feature extraction from the computer object detection; performing feature aggregation on features of the feature extraction creating latent vectors; transforming the latent vectors to view features; estimating camera poses for each of the plurality of sparse cameras based on the view features; transforming estimated camera poses into world coordinates; and constructing the 2D floor plan visualization based on the world coordinates.
A further embodiment of the present invention is a system, comprising: an object detection module configured to perform computer object detection from video data from a plurality of sparse cameras; a latent feature extraction module configured to convert data from the object detection module into latent vectors for views of each of the plurality of sparse cameras; a latent vector aggregation module configured to aggregate the latent vectors into aggregated latent vectors; a transformer module configured to convert the aggregated latent vectors into view features; a camera pose estimation module configured to estimate camera poses for each of the plurality of sparse cameras from the view features; and a single-frame floor plan construction module configured to construct the 2D floor plan visualization from the estimated camera poses.
In embodiments herein, systems and methods for an end-to-end pipeline to generate floor plans from multiple videos from sparse camera views are disclosed. The generated floor plan can be used to navigate view selection and facilitate traversal of multiple perspectives of a scene. One practical application of this technology is with a user interface, enabling users (e.g., viewers) to access a floor plan showing the various camera locations. The users can then actively engage with the content by playing synchronized multi-perspective audio/video, selecting via a user interface which perspective to view at any given time. This interactive experience allows users to take advantage of the different views under their control, providing an immersive and captivating experience.
The term “sparse camera views” as used herein refers to two or more cameras where at least two of the cameras present different views (direction, elevation, and/or height) of an object of interest.
The term “object of interest” or “objects of interest” as used herein refers to the object or collection of objects, including a person or people, or even a location in a scene, that is the shared subject of the video views. For example, a video of a person giving a speech would have the person giving the speech as the “object of interest”, perhaps also including a lectern that the person is standing behind. The use of the term “object” as used herein can also refer collectively to all objects of interest in the scene. The object of interest corresponds to the image processing “region of interest”.
The term “scene” as used herein refers to the general physical space that the object of interest is in when being viewed by the cameras.
The term “floor plan” as used herein refers to a 2D visualization (map) of camera view positions relative to an object of interest.
In embodiments of the floor plan, the floor plan is generated by a system that capitalizes on latent feature aggregation and camera pose estimation. This is an improvement over previous systems that use retrieval-based camera pose estimation [1], which usually requires accurate feature extraction and precise matching to perform well, or techniques like simultaneous localization and mapping (SLAM) [2] or visual odometry [3], which are dependent on visual data and feature-based motion estimation. These previous methods have been shown to have particular limitations when applied to sparse camera views, which the presently disclosed method overcomes.
The disclosed system provides an end-to-end pipeline for floor plan construction from video from multiple sparse camera views, using novel algorithms (particularly for modules 2, 4, and 5) that enhance camera pose estimation stability by considering both spatial and temporal consistency.
Object detection, a salient subfield of computer vision, is the technology that allows machines to identify and locate objects within an image or a sequence of images. It differs from image classification, which only assigns a label to an entire image, by additionally providing the spatial location and extent of one or many objects within the image. Common applications of object detection include self-driving cars, surveillance, image retrieval systems, and medical imaging.
The major challenge in object detection is the variation in size, shape, color, texture, and orientation of objects, which demands algorithms with robust discriminatory and generalization power. Traditional methods such as the Viola-Jones algorithm and the Scale-Invariant Feature Transform (SIFT) dealt with these challenges to a degree, but with the advent of deep learning, Convolutional Neural Networks (CNNs) have significantly improved performance in object detection tasks.
Any deep-learning method of object detection can be utilized for the embodiments described herein. As an example, the YOLO (You Only Look Once) methods are described.
Among the deep-learning based methods, YOLO [6] has gained immense popularity due to its speed and efficiency. Unlike the two-stage detectors such as R-CNN and its variants, YOLO employs a single-stage detector strategy, making it significantly faster and suitable for real-time applications.
YOLO is an end-to-end system that concurrently predicts class probabilities and bounding box coordinates for an image. The core idea is to divide the input image into an S×S grid and for each grid cell, predict multiple bounding boxes and class probabilities. The bounding box prediction includes coordinates (x, y) for the center of the box, along with its width and height. Class probabilities indicate the likelihood of the object belonging to a particular class. A key characteristic of YOLO is that it views the detection problem holistically, thereby dramatically increasing the speed of detection.
Since the inception of YOLO, subsequent versions have been developed, each addressing limitations of its predecessor. YOLOv2 [7], also known as YOLO9000, can detect over 9000 object categories by joint training on classification and detection tasks, leveraging hierarchical clustering for designing anchor boxes, and using a multi-scale training method. YOLOv3 incorporated three different scales for prediction by utilizing feature maps from multiple layers of the network, thereby enhancing the detection of smaller objects. YOLOv4 was proposed to make efficient use of the computational resources while still maintaining top-tier performance.
Currently, YOLOv8 [8], developed by Ultralytics™, represents a state-of-the-art model that builds upon the successes of its preceding versions within the YOLO framework. Another alternative for bounding box retrieval lies in YOLOX [9].
YOLOv8 takes an RGB image as input. Its output is a set of ROIs, each consisting of a classID, which identifies the object category, and the location of the bounding box.
The output format of YOLOv8 is one line per detected object, with the values separated by spaces rather than commas: the classID followed by the bounding box's centerX, centerY, width, and height. The YOLOv8 output is saved in a .txt file, which may contain multiple such lines representing the detected objects. Utilizing the classID and the bounding box location, the system can identify the position of the bounding box for a specific class and obtain the targeted bounding box. This targeted bounding box can then be utilized as input for the latent feature aggregation module.
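As a minimal sketch of how such an output file might be consumed (the file name, the normalized-coordinate convention, and the target classID below are illustrative assumptions, not part of the disclosure):

```python
# Sketch of reading a YOLO-format detection file and extracting the bounding
# boxes for one target class. File name, normalization convention, and target
# class ID are illustrative assumptions.
def load_target_boxes(txt_path: str, target_class_id: int,
                      image_width: int, image_height: int):
    """Return pixel-space (x_min, y_min, x_max, y_max) boxes for one class."""
    boxes = []
    with open(txt_path) as f:
        for line in f:
            fields = line.split()                     # space-separated, no commas
            if len(fields) < 5:
                continue
            class_id = int(fields[0])
            cx, cy, w, h = map(float, fields[1:5])    # center/size of the box
            if class_id != target_class_id:
                continue
            # Convert center/width/height (assumed normalized) to pixel corners.
            x_min = (cx - w / 2) * image_width
            y_min = (cy - h / 2) * image_height
            x_max = (cx + w / 2) * image_width
            y_max = (cy + h / 2) * image_height
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes

# Example usage (hypothetical file and class ID):
# person_boxes = load_target_boxes("view_3_frame_0001.txt", target_class_id=0,
#                                  image_width=1920, image_height=1080)
```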
In some embodiments, the system creates bounding boxes for all objects of interest in the scene. In some embodiments, the system aggregates bounding boxes for multiple objects of interest, in some embodiments aggregating all objects into a single bounding box.
For each viewpoint, there are multiple Regions of Interest (ROIs) that will be selected (identified by the preceding object detection module). To further illustrate,
For latent feature aggregation, there are at least six different methods to perform feature aggregation: addition (add), averaging (avg), multiplication (mul), max pooling (max), min pooling (min), and Principal Component Analysis (PCA). Assume that li is the ith feature vector in latent space, retrieved from the ith ROI among n different ROIs. Note that the operations in latent space are performed element-wise.
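A minimal sketch of these element-wise aggregation options over a stack of ROI latent vectors; the latent dimension and the particular reading of PCA-based aggregation are assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# Sketch of element-wise latent feature aggregation over n ROI latent vectors.
# `latents` has shape (n_rois, latent_dim); the latent dimension is illustrative.
def aggregate_latents(latents: np.ndarray, method: str = "avg") -> np.ndarray:
    if method == "add":
        return latents.sum(axis=0)
    if method == "avg":
        return latents.mean(axis=0)
    if method == "mul":
        return latents.prod(axis=0)      # can explode, since latents are unbounded
    if method == "max":
        return latents.max(axis=0)
    if method == "min":
        return latents.min(axis=0)
    if method == "pca":
        # One plausible reading of PCA aggregation (an assumption): keep the first
        # principal direction of the ROI vectors, scaled by its standard deviation.
        pca = PCA(n_components=1)
        pca.fit(latents)
        return pca.components_[0] * np.sqrt(pca.explained_variance_[0])
    raise ValueError(f"unknown aggregation method: {method}")
```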
Both addition and averaging methods yield equivalent performance. On the other hand, multiplication, max pooling, and min pooling might not produce satisfactory results. The reason behind this failure lies in the fact that the latent vector is not bounded within the range of [0, 1] or [−1, 1]. Consequently, applying operations such as max pooling, min pooling, and multiplication could lead to significant changes in the latent values, resulting in suboptimal outputs.
The position encoding result is also appended to the latent vector. In this example, denote lagg(k) (computed with one of the above methods) as the aggregated latent features, with the position encoding result, for view k.
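A sketch of appending a positional encoding to the aggregated latent vector for view k; the sinusoidal form and its length are assumptions, as the disclosure does not prescribe a specific encoding:

```python
import numpy as np

def sinusoidal_position_encoding(view_index: int, enc_dim: int = 16) -> np.ndarray:
    """Illustrative sinusoidal encoding of the view index (assumed form/length)."""
    i = np.arange(enc_dim // 2)
    freqs = 1.0 / (10000.0 ** (2 * i / enc_dim))
    angles = view_index * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def with_position_encoding(l_agg: np.ndarray, view_index: int) -> np.ndarray:
    """Append the view-position encoding to the aggregated latent vector lagg(k)."""
    return np.concatenate([l_agg, sinusoidal_position_encoding(view_index)])
```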
In computer vision, a 6D camera pose refers to the combination of the camera's position and orientation in three-dimensional space. It is called “6D” because it uses six degrees of freedom (DOF) to fully describe the camera's pose: three for translational (linear) movement along the X, Y, and Z axes, and three for rotational (angular) movement (often referred to as pitch, yaw, and roll).
Translation: This corresponds to the location of the camera in 3D space, relative to a reference point. In other words, it's the camera's position along the X, Y, and Z axes.
Rotation: This describes the camera's orientation, or the direction in which it is pointing. The rotation can be represented using different conventions, such as Euler angles (pitch, yaw, roll), a rotation matrix, or a quaternion.
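For illustration, a 6D pose can be held as a translation vector plus a rotation in any of these conventions; the sketch below uses SciPy's Rotation class to convert between them (the angle order, units, and example values are assumptions):

```python
import numpy as np
from scipy.spatial.transform import Rotation

# Illustrative 6D pose: translation along X, Y, Z plus a rotation given as
# Euler angles (pitch, yaw, roll in degrees -- order and units are assumptions).
translation = np.array([1.5, 0.0, 3.2])
rotation = Rotation.from_euler("xyz", [10.0, 45.0, 0.0], degrees=True)

rotation_matrix = rotation.as_matrix()   # 3x3 rotation matrix representation
quaternion = rotation.as_quat()          # (x, y, z, w) quaternion representation

# The same pose can be packed as a single 4x4 homogeneous transform.
pose = np.eye(4)
pose[:3, :3] = rotation_matrix
pose[:3, 3] = translation
```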
6D camera pose estimation is a challenging problem due to various factors such as occlusions, lighting conditions, scene complexity, and the need for real-time performance in many applications.
Any 6D pose estimation method can be used for the embodiments herein. For example, the RelPose methods are described.
The RelPose [10] framework offers a data-driven approach to infer camera viewpoints from multiple images of arbitrary objects, a task crucial for both traditional geometric pipelines like SfM (Structure from Motion) and SLAM (Simultaneous Localization and Mapping), as well as modern neural approaches like NeRF (Neural Radiance Fields). RelPose diverges from correspondence-driven methods that struggle with sparse views, instead using an energy-based formulation to represent distributions over relative camera rotations, allowing for the representation of multiple camera modes due to object symmetries or views. From these relative predictions, RelPose estimates a consistent set of camera rotations from multiple images. This system outperforms existing methods on sparse image sets and can be employed as a vital step toward in-the-wild reconstruction from multi-view datasets. Furthermore, it can infer accurate poses for novel instances of various classes and even those from unseen categories.
RelPose++, an enhancement of the RelPose framework, tackles the challenge of estimating 6D camera poses from a sparse set of 2-8 images. This task is useful in neural reconstruction algorithms, particularly when objects have visual symmetries and texture-less surfaces. Key improvements include using attentional transformer layers for processing multiple images jointly to resolve possible ambiguities and expanding the network to report camera translations by defining a distinct coordinate system. The system significantly outperforms previous methods in 6D pose prediction for both familiar and unfamiliar object categories and enables pose estimation and 3D reconstruction for in-the-wild objects. This RelPose++ framework can be used to estimate the rotation and translation of the object.
RelPose++ takes a sparse set of input views, ranging from 2 to 8 views, and generates 6D poses (rotations and translations). In embodiments herein, the feature aggregation module requires novel adjustments to the original RelPose++ architecture: instead of taking images of the views, the system skips that step and utilizes the trained RelPose++ transformer to take the aggregated features from the views directly, while still outputting the 6D poses as in the original architecture.
With all N views' aggregated latent vectors, employ a transformer encoder (e.g., RelPose++, including 8 layers of multi-headed self-attention blocks, similar to the encoder utilized in the Vision Transformer model) to estimate the 6D camera pose for each camera. This fusion of the feature extractor and transformer can be referred to as a scene encoder εϕ. When provided with N aggregated latent vectors {lagg(N)} from N views, the scene encoder generates multi-view conditioned features {fi} that correspond to each respective view (“view features”). This is equivalent to the equation presented in RelPose++, with the difference (modification) being that instead of inputting N input images (views), the input is lagg(i) for all i∈{1, . . . , N}.
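A minimal PyTorch sketch of such a scene encoder, taking the N aggregated latent vectors (with position encoding) and returning one view feature per camera; the layer count, dimensions, and module structure here are assumptions, and the disclosed system reuses the trained RelPose++ transformer rather than training a module like this from scratch:

```python
import torch
import torch.nn as nn

class SceneEncoder(nn.Module):
    """Sketch of a scene encoder: N aggregated latent vectors -> N view features.
    Dimensions and layer count are illustrative assumptions."""
    def __init__(self, latent_dim: int = 2048, feature_dim: int = 512,
                 num_layers: int = 8, num_heads: int = 8):
        super().__init__()
        self.input_proj = nn.Linear(latent_dim, feature_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feature_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

    def forward(self, l_agg: torch.Tensor) -> torch.Tensor:
        # l_agg: (batch, N_views, latent_dim) aggregated latent vectors
        x = self.input_proj(l_agg)
        return self.encoder(x)   # (batch, N_views, feature_dim) view features {f_i}

# Example: 4 views, batch of 1.
# f = SceneEncoder()(torch.randn(1, 4, 2048))
```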
The process of estimating rotations and translations is analogous to RelPose++, where these parameters are estimated independently. The complete architecture, including the transformer module (“Transformer”) that takes concatenated bounding box data (indexed) as input and produces view features, and the rotation and translation estimation modules (“Rotation Estimation” and “Translation Estimation”, respectively) that take the view features and produce the rotation/translation “scores” (see below), is illustrated in
For Rotation Estimation (405), the approach is rooted in an energy-based model. It starts by approximating the log-likelihood of the pairwise relative rotations derived from view features fi, fj. This is done using a Multi-Layer Perceptron (MLP) function gθ(fi,fj,Ri→j) which serves as a negative energy or “score” for pairwise relative rotation Ri→j for views i and j.
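A sketch of such a pairwise scoring head; flattening the candidate rotation into 9 values and the hidden sizes are assumptions for illustration:

```python
import torch
import torch.nn as nn

class PairwiseRotationScore(nn.Module):
    """Sketch of g_theta(f_i, f_j, R_i->j): an MLP mapping two view features and a
    candidate relative rotation (flattened 3x3 matrix) to a scalar score
    (negative energy). Hidden sizes are illustrative assumptions."""
    def __init__(self, feature_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feature_dim + 9, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, f_i, f_j, R_ij):
        # f_i, f_j: (batch, feature_dim); R_ij: (batch, 3, 3) candidate relative rotation
        x = torch.cat([f_i, f_j, R_ij.flatten(start_dim=1)], dim=-1)
        return self.mlp(x).squeeze(-1)   # unnormalized log-likelihood ("score")
```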
Having inferred the distributions of these pairwise rotations, the process of determining global rotations is framed as an optimization problem. The objective is to identify the mode of the distribution. This is achieved through a two-step approach: 1. Greedy initialization, 2. Block coordinate ascent. The goal is to recover a set of global rotations {Ri} (i=view number of N total views) that will maximize the cumulative score of relative rotations Ri→j. This can be mathematically expressed as:
{R̂1, . . . , R̂N} = argmax over {R1, . . . , RN} of Σi≠j gθ(fi, fj, Ri→j)

for N views, where Ri→j denotes the relative rotation between views i and j implied by Ri and Rj.
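A highly simplified sketch of this two-step search over a discrete candidate set of rotations; the candidate sampling, the relative-rotation convention, and the number of sweeps are assumptions, and the actual optimization operates over continuous rotations:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def estimate_global_rotations(score_fn, view_features, candidates, num_sweeps=5):
    """Sketch of greedy initialization + block coordinate ascent.
    score_fn(f_i, f_j, R_ij) returns the pairwise score g_theta; `candidates` is a
    discrete set of candidate 3x3 rotation matrices (an assumption -- the real
    method searches rotation space far more finely)."""
    n = len(view_features)
    # Greedy initialization: fix view 0 to identity, then pick each remaining
    # view's rotation to maximize its pairwise scores against placed views.
    R = [np.eye(3)] + [None] * (n - 1)
    for i in range(1, n):
        def total_score(cand, i=i):
            # Relative rotation taken here as cand @ R[j].T (a convention assumption).
            return sum(score_fn(view_features[j], view_features[i], cand @ R[j].T)
                       for j in range(i))
        R[i] = max(candidates, key=total_score)
    # Block coordinate ascent: revisit one view at a time, holding the rest fixed.
    for _ in range(num_sweeps):
        for i in range(1, n):
            def total_score(cand, i=i):
                return sum(score_fn(view_features[j], view_features[i], cand @ R[j].T)
                           for j in range(n) if j != i)
            R[i] = max(candidates, key=total_score)
    return R

# Candidate rotations could, for example, be random rotation samples:
# candidates = [Rotation.random().as_matrix() for _ in range(512)]
```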
For Translation (410), train a translation prediction module that infers the per-camera translation {ti} given the multi-view features fi for all i∈{1, . . . , N} [4].
In this module, the rotation and translation are taken as input parameters from the 6D camera pose estimator (see above). With the estimated rotation and translation, one can utilize an application programming interface (API) such as the PyTorch3D API [14] (specifically the “pytorch3d.renderer.cameras” module, which provides the “get_camera_center” implementation for that example API). This yields the estimated world coordinates of each camera from its given rotation and translation.
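A minimal sketch using PyTorch3D to recover camera centers from the estimated rotations and translations; the batch shapes and the choice of camera class are assumptions:

```python
import torch
from pytorch3d.renderer.cameras import PerspectiveCameras

# R: (N, 3, 3) estimated rotations, T: (N, 3) estimated translations from the
# 6D pose estimator; shapes and the PerspectiveCameras choice are assumptions.
def camera_centers_world(R: torch.Tensor, T: torch.Tensor) -> torch.Tensor:
    cameras = PerspectiveCameras(R=R, T=T)
    return cameras.get_camera_center()   # (N, 3) camera positions in world coordinates
```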
Once the world coordinates are obtained for each camera pose estimation, there are two methods for creating the floor plan.
The first is a straightforward approach: project the x, y, and z coordinates onto the x-y plane to derive the floor plan. As in RelPose++, the object of interest is considered the center (0, 0, 0) of the world coordinate system. Consequently, the camera pose is determined relative to the object of interest.
The other approach is to find the plane that is closest to the cameras' (x, y, z) coordinates and the stage (0, 0, 0). Principal Component Analysis (PCA) can be used to find the plane that best fits a set of points in three-dimensional space. Essentially, PCA is used to find the plane that minimizes the squared perpendicular distance to the points [5].
Next, the points are projected perpendicularly onto the plane obtained through Principal Component Analysis (PCA). The process is illustrated in
Subsequently, one can employ a rotation to align this plane parallel to the x-y plane, allowing a more intuitive visualization of the projection as a floor plan.
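A sketch of this PCA-based variant: fit a best-fit plane to the camera centers (and the stage origin), project the points onto it, and express them in the plane's coordinates, which is equivalent to rotating the plane parallel to the x-y plane. NumPy's SVD is used here in place of a specific PCA library call:

```python
import numpy as np

def floor_plan_from_centers(camera_centers: np.ndarray) -> np.ndarray:
    """Sketch: project camera centers (N, 3) onto the PCA best-fit plane through
    the points and the stage origin, returning (N, 2) floor plan coordinates."""
    pts = np.vstack([camera_centers, np.zeros(3)])   # include the stage at (0, 0, 0)
    centroid = pts.mean(axis=0)
    centered = pts - centroid
    # PCA via SVD: the last right-singular vector is the plane normal (direction
    # of least variance, i.e. minimal squared perpendicular distance).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:2]                                   # two in-plane principal directions
    # Projecting onto the two principal directions both flattens the points onto
    # the plane and rotates that plane into 2D floor plan coordinates.
    return (camera_centers - centroid) @ basis.T
```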
Depending on the characteristics of the videos, either or both of the implementations can be utilized. This methodology shows improved performance compared to other methods.
In some embodiments, the 6D world coordinates can be encoded into metadata that is sent in a bitstream with the video image data. The end user can decode the metadata back into the world coordinate data, allowing the user to reconstruct the camera poses (e.g., reconstruct/visualize the camera location and shooting angle (i.e., 6DoF) using their video player).
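One simple way such metadata could be serialized alongside the bitstream; the field names, quaternion convention, and JSON container are purely illustrative assumptions, as the disclosure does not prescribe a format:

```python
import json

# Illustrative per-camera 6D pose payload (rotation as a quaternion plus
# translation). Field names and the JSON container are assumptions.
camera_pose_metadata = {
    "cameras": [
        {"camera_id": 0,
         "rotation_quaternion_xyzw": [0.0, 0.707, 0.0, 0.707],
         "translation_xyz": [1.5, 0.0, 3.2]},
        {"camera_id": 1,
         "rotation_quaternion_xyzw": [0.0, 0.0, 0.0, 1.0],
         "translation_xyz": [-2.0, 0.1, 2.8]},
    ]
}
metadata_bytes = json.dumps(camera_pose_metadata).encode("utf-8")
# The decoder side would parse metadata_bytes back into poses to reconstruct
# each camera's location and shooting angle in the video player.
```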
Although the latent feature aggregation module provides an estimation of the camera pose, the final camera pose of each camera should ideally reflect the actual stationary position of the camera (assuming it is not moving). Moreover, in real-world scenarios, there are often outliers among the camera positions estimated across frames that need to be considered when determining the final camera position. To address this issue, use one or more of three types of averages for each view's camera point coordinate: (1) regular average (avg), (2) interquartile range average (IQR avg), and (3) weighted average (wavg). The latter two methods compensate for and eliminate the impact of outliers, ensuring a more accurate and robust estimation of the final camera positions.
The formulas for the three averages are as follows:
where ct is the camera floor plan coordinate at the tth frame (of F frames, i.e., temporal averaging); each x and y coordinate in ct is assumed to be sorted in ascending or descending order; F is the total number of frames being averaged over; and the 0.3F, 0.6F, and 0.7F values are integers (rounded down if needed).
If the video has limited outliers, the average, IQR average, and weighted average will be relatively close to each other. In some embodiments, all three averages are used, which stabilizes the system even if one or more cameras change their view (one or more of zoom, pan, tilt, dolly, etc. during filming).
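A sketch of the three temporal averaging strategies over per-frame floor plan coordinates; the exact trimming bounds and weighting scheme below are assumptions chosen to be consistent with the 0.3F and 0.7F values described above, not the disclosure's precise formulas:

```python
import numpy as np

def regular_average(coords: np.ndarray) -> np.ndarray:
    """coords: (F, 2) per-frame floor plan coordinates; plain temporal mean."""
    return coords.mean(axis=0)

def iqr_average(coords: np.ndarray) -> np.ndarray:
    """Trimmed mean over the sorted middle of the frames (assumed 0.3F..0.7F range),
    which discards outlier frames at both ends."""
    f = len(coords)
    lo, hi = int(0.3 * f), max(int(0.7 * f), int(0.3 * f) + 1)
    sorted_coords = np.sort(coords, axis=0)   # sort each coordinate independently
    return sorted_coords[lo:hi].mean(axis=0)

def weighted_average(coords: np.ndarray) -> np.ndarray:
    """Weighted mean that down-weights frames far from the median estimate
    (this particular weighting scheme is an illustrative assumption)."""
    median = np.median(coords, axis=0)
    dist = np.linalg.norm(coords - median, axis=1)
    weights = 1.0 / (1.0 + dist)
    return (coords * weights[:, None]).sum(axis=0) / weights.sum()
```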
Depending on the characteristics of the videos, different x-y plane projections and PCA projections can be employed with different temporal consistency filters. In some embodiments, this module incorporates both methods.
In some embodiments, the module includes a temporal consistency filter, which reads a frame from a video file every second and applies a temporal median filter to each frame, excluding the first and last ones. In some embodiments, for the selected videos, one can assume that the camera remains stationary without any magnification or rotation, in which case temporal consistency filtering is unnecessary. In some embodiments, the three methods of averaging presented earlier serve as the consistency filter.
In some embodiments, the cameras' movement (rotations and translations) over time can be included in the pose estimation, in which case the camera poses will be time varying and the changes in poses will be shown at the end user's side. To obtain this time-dependent information, apply the temporal filtering along the time domain. A sliding-window-based solution (temporal window) can be used. The sliding window method computes the camera pose for each time interval (selectable) of the video stream, rather than for the entire video.
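A sketch of a sliding-window temporal median over per-frame pose estimates; the window length, stride, and choice of median as the filter are assumptions standing in for the selectable interval described above:

```python
import numpy as np

def sliding_window_median(poses: np.ndarray, window: int = 30, stride: int = 30):
    """poses: (F, D) per-frame pose parameters (e.g., camera centers). Returns one
    median pose per window; window length and stride are selectable assumptions."""
    if len(poses) < window:
        return np.median(poses, axis=0, keepdims=True)
    filtered = []
    for start in range(0, len(poses) - window + 1, stride):
        filtered.append(np.median(poses[start:start + window], axis=0))
    return np.stack(filtered)   # one (possibly time-varying) pose per interval
```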
A number of embodiments of the disclosure have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the present disclosure. Accordingly, other embodiments are within the scope of the following claims.
The examples set forth above are provided to those of ordinary skill in the art as a complete disclosure and description of how to make and use the embodiments of the disclosure, and are not intended to limit the scope of what the inventor/inventors regard as their disclosure.
Modifications of the above-described modes for carrying out the methods and systems herein disclosed that are obvious to persons of skill in the art are intended to be within the scope of the following claims. All patents and publications mentioned in the specification are indicative of the levels of skill of those skilled in the art to which the disclosure pertains. All references cited in this disclosure are incorporated by reference to the same extent as if each reference had been incorporated by reference in its entirety individually.
It is to be understood that the disclosure is not limited to particular methods or systems, which can, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting. As used in this specification and the appended claims, the singular forms “a,” “an,” and “the” include plural referents unless the content clearly dictates otherwise. The term “plurality” includes two or more referents unless the content clearly dictates otherwise. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the disclosure pertains.
This application claims priority to U.S. Provisional Patent Application No. 63/621,372, filed on Jan. 16, 2024, which is incorporated herein by reference in its entirety.