The increasing availability of drones equipped with cameras has inspired a new style of cinematography based on capturing images of scenes that were previously difficult to access. While professionals have traditionally captured high-quality images by using precise camera trajectories with well controlled extrinsic parameters, a camera on a drone is always in motion even when the drone is hovering. This is due to the aerodynamic nature of drones, which makes continuous movement fluctuations inevitable. If only one drone is involved, it is still possible to estimate camera pose (a 6D combination of position and orientation) by simultaneous localization and mapping (SLAM), a technique which is well known in the field of robotics. However, it is often desirable to employ multiple cameras at different viewing spots simultaneously, allowing for complex editing and full 3D scene reconstruction. Conventional SLAM approaches work well for single-drone, single-camera situations but are not suited for the estimation of all the poses involved in multiple-drone or multiple-camera situations.
Other challenges in multi-drone cinematography include the complexity of integrating the video streams captured by the multiple drones, and the need to control the flight paths of all the drones such that a desired formation (or swarm pattern), and any desired changes in that formation over time, can be achieved. In current practice for professional cinematography involving drones, human operators must operate two separate controllers for each drone: one controlling flight parameters, and one controlling camera pose. This has many negative implications: for the drones in terms of their size, weight and cost; for the reliability of the system as a whole; and for the quality of the output scene reconstructions.
There is, therefore, a need for improved systems and methods for integrating images captured by cameras on multiple, moving drones, and for accurately controlling those drones (and possibly the cameras independently of the drones), so that the visual content necessary to reconstruct the scene of interest can be efficiently captured and processed. Ideally, the visual content integration would be performed automatically at an off-drone location, and the control, also performed at an off-drone location though not necessarily the same one, would involve automatic feedback mechanisms that achieve high precision in drone positioning and adapt to aerodynamic noise due to factors such as wind. It may also sometimes be beneficial to minimize the number of human operators required for system operation.
Embodiments generally relate to methods and systems for imaging a scene in 3D, based on images captured by multiple drones.
In one embodiment, a system comprises a plurality of drones, a fly controller and a camera controller, wherein the system is fully operational with as few as one human operator. Each drone moves along a corresponding flight path over the scene, and each drone has a drone camera capturing, at a corresponding first pose and a corresponding first time, a corresponding first image of the scene. The fly controller controls the flight path of each drone, in part by using estimates of the first pose of each drone camera provided by a camera controller, to create and maintain a desired pattern of drones with desired camera poses over the scene. The camera controller receives, from the plurality of drones, a corresponding plurality of captured images of the scene, processes the received images to generate a 3D representation of the scene as a system output, and provides the estimates of the first pose of each drone camera to the fly controller.
In another embodiment, a method of imaging a scene comprises: deploying a plurality of drones, each drone moving along a corresponding flight path over the scene, and each drone having a camera capturing, at a corresponding first pose and a corresponding first time, a corresponding first image of the scene; using a fly controller to control the flight path of each drone, in part by using estimates of the pose of each camera provided by a camera controller, to create and maintain a desired pattern of drones with desired camera poses over the scene; and using the camera controller to receive, from the plurality of drones, a corresponding plurality of captured images of the scene, and to process the received images to generate a 3D representation of the scene as a system output, and to provide the estimates of the pose of each camera to the fly controller. No more than one human operator is needed for full operation of the method.
In another embodiment, an apparatus comprises one or more processors; and logic encoded in one or more non-transitory media for execution by the one or more processors. When executed, the logic is operable to image a scene by: deploying a plurality of drones, each drone moving along a corresponding flight path over the scene, and each drone having a camera capturing, at a corresponding first pose and a corresponding first time, a corresponding first image of the scene; using a fly controller to control the flight path of each drone, in part by using estimates of the pose of each camera provided by a camera controller, to create and maintain a desired pattern of drones with desired camera poses over the scene; and using the camera controller to receive, from the plurality of drones, a corresponding plurality of captured images of the scene, and to process the received images to generate a 3D representation of the scene as a system output, and to provide the estimates of the pose of each camera to the fly controller. No more than one human operator is needed for full operation of the apparatus to image the scene.
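As a purely illustrative aid, the following minimal sketch indicates the division of responsibilities and the exchange of pose estimates for captured images described in the embodiments above. It is written in Python, and the class and method names (DroneImage, PoseEstimate, CameraController, FlyController) are hypothetical placeholders, not taken from this disclosure.

```python
# Minimal, hypothetical sketch of the data exchange between components;
# names and method signatures are illustrative assumptions only.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class DroneImage:
    drone_id: int
    timestamp: float
    rgb: np.ndarray            # H x W x 3 image from the drone camera

@dataclass
class PoseEstimate:
    drone_id: int
    rotation: np.ndarray       # 3 x 3 rotation, global coordinates
    translation: np.ndarray    # 3-vector, global coordinates

class CameraController:
    def ingest(self, images: List[DroneImage]) -> List[PoseEstimate]:
        """Process the received images, refine the 3D representation of the
        scene, and return updated pose estimates for the fly controller."""
        raise NotImplementedError

class FlyController:
    def command(self, poses: List[PoseEstimate]) -> None:
        """Use the pose estimates to correct each drone's flight path so the
        desired formation and camera poses are created and maintained."""
        raise NotImplementedError
```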
A further understanding of the nature and the advantages of particular embodiments disclosed herein may be realized by reference to the remaining portions of the specification and the attached drawings.
Each drone agent 142 is “matched up” with one and only one drone, receiving images from a drone camera 115 within or attached to that drone 105.
Each drone agent then collaborates with at least one other drone agent to compute a coordinate transformation specific to its own drone camera, so that the estimated camera pose can be expressed in a global coordinate system shared by all of the drones. The computation may be carried out using a novel robust coordinate aligning algorithm, discussed in more detail below.
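By way of illustration only, the following sketch shows one conventional way to express a camera's local coordinates in a shared global frame, given 3D points observed in both frames. It is the classical Umeyama (Procrustes) alignment, offered as an assumed stand-in rather than as the robust coordinate aligning algorithm of the present invention.

```python
# Classical similarity alignment between two coordinate frames; a generic
# stand-in, not the specification's own coordinate aligning algorithm.
import numpy as np

def align_frames(pts_local: np.ndarray, pts_global: np.ndarray):
    """Return scale s, rotation R, translation t such that
    pts_global ~ s * (R @ p) + t for corresponding points p. Arrays are N x 3."""
    mu_l, mu_g = pts_local.mean(axis=0), pts_global.mean(axis=0)
    X, Y = pts_local - mu_l, pts_global - mu_g
    cov = Y.T @ X / len(X)                     # 3 x 3 cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    D = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:              # enforce a proper rotation
        D[2, 2] = -1.0
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / X.var(axis=0).sum()
    t = mu_g - s * (R @ mu_l)
    return s, R, t
```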
Each drone agent also generates a dense depth map of the scene 120 as viewed by the corresponding drone camera for each pose from which the corresponding image was captured. The depth map is calculated and expressed in the global coordinate system. In some cases, the map is generated by the drone agent processing a pair of images received from the same drone camera at slightly different times and poses, with their fields of view overlapping sufficiently to serve as a stereo pair. Well known techniques may be used by the drone agent to process such pairs and generate the corresponding depth maps.
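One such well known technique, assumed here purely for illustration (the specification does not mandate any particular method), is semi-global block matching on a rectified image pair, as available in OpenCV:

```python
# Illustrative stereo depth from two rectified views taken by the same
# drone camera at nearby poses; parameters are arbitrary example values.
import cv2
import numpy as np

def depth_from_pair(img_a: np.ndarray, img_b: np.ndarray,
                    focal_px: float, baseline_m: float) -> np.ndarray:
    """Return a dense depth map in meters from two rectified grayscale views."""
    matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128,
                                    blockSize=5)
    disparity = matcher.compute(img_a, img_b).astype(np.float32) / 16.0
    disparity[disparity <= 0] = np.nan            # mask out invalid matches
    return focal_px * baseline_m / disparity      # depth = f * B / d
```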
Each drone agent sends its own estimate of drone camera pose and the corresponding depth map, both in global coordinates, to global optimizer 144, along with data intrinsically characterizing the corresponding drone. On receiving all these data and an RGB image from each of the drone agents, global optimizer 144 processes these data collectively, generating a 3D point cloud representation that may be extended, corrected, and refined over time as more images and data are received. If a keypoint of an image is already present in the 3D point cloud, and a match is confirmed, the keypoint is said to be “registered”. The main purposes of the processing are to validate 3D point cloud image data across the plurality of images, and to adjust the estimated pose and depth map for each drone camera correspondingly. In this way, a joint optimization may be achieved of the “structure” of the imaged scene reconstruction, and the “motion” or positioning in space and time of the drone cameras.
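A hypothetical sketch of the “registered” keypoint test described above might look as follows; the geometric and descriptor thresholds, and the data layout, are illustrative assumptions rather than values taken from the specification.

```python
# Assumed illustration of keypoint registration against the 3D point cloud:
# a keypoint counts as registered if a cloud point is both geometrically
# close and similar in descriptor space.
import numpy as np
from scipy.spatial import cKDTree

def is_registered(point_xyz, descriptor, cloud_xyz, cloud_desc,
                  max_dist=0.05, max_desc_dist=0.7):
    """Check whether a candidate keypoint (global 3D position plus feature
    descriptor) matches a point already present in the 3D point cloud."""
    dist, idx = cKDTree(cloud_xyz).query(point_xyz)
    if dist > max_dist:
        return False                              # no geometrically close point
    return np.linalg.norm(descriptor - cloud_desc[idx]) < max_desc_dist
```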
The global optimization depends in part on the use of any one of various state-of-the-art SLAM or Structure from Motion (SfM) optimizers now available that generate 3D point cloud reconstructions from a plurality of images captured at different poses, for example the graph-based optimizer BundleFusion.
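Such optimizers jointly minimize a reprojection error over camera poses and 3D structure; the following generic bundle adjustment residual (a sketch, not BundleFusion's actual implementation) illustrates the quantity being minimized.

```python
# Generic reprojection residual used in bundle adjustment; a sketch of the
# error term such optimizers minimize over poses and 3D points.
import numpy as np

def reprojection_residual(R, t, K, point_3d, observed_px):
    """Pixel error between the projection of a global 3D point into a camera
    with pose (R, t) and intrinsics K, and the keypoint actually observed."""
    p_cam = R @ point_3d + t          # transform from global to camera frame
    uvw = K @ p_cam                   # project with the intrinsic matrix
    return uvw[:2] / uvw[2] - observed_px
```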
In the present invention, such an optimizer is embedded in a process-level iterative optimizer, sending updated (improved) camera pose estimates and depth maps to the fly controller after each cycle, which the fly controller can use to make adjustments to flight path and pose as and when necessary. Subsequent images sent by the drones to the drone agents are then processed by the drone agents as described above, involving each drone agent collaborating with at least one other, to yield further improved depth maps and drone camera pose estimates that are in turn sent on to the global optimizer, to be used in the next iterative cycle, and so on. Thus the accuracy of camera pose estimates and depth maps improves, cycle by cycle, in turn improving the control of the drones' flight paths and the quality of the 3D point cloud reconstruction. When this reconstruction is deemed to meet a predetermined threshold of quality, the iterative cycle may cease, and the reconstruction at that point is provided as the ultimate system output. Many applications for that output may readily be envisaged, including, for example, 3D scene reconstruction for cinematography, or view-change experiences.
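The process-level cycle might be sketched as follows; the object names and methods (process_latest_images, update, apply_pose_corrections, quality) are hypothetical placeholders for the interactions just described, and the quality threshold is an arbitrary example.

```python
# Hypothetical outline of the process-level iterative cycle described above.
def optimization_cycle(drone_agents, global_optimizer, fly_controller,
                       quality_threshold=0.95):
    """Run the outer refinement cycle until the reconstruction is good enough."""
    while True:
        # Each agent turns its newest images into a pose estimate and depth map
        # expressed in global coordinates (collaborating with other agents).
        local_results = [agent.process_latest_images() for agent in drone_agents]
        cloud, poses, depths = global_optimizer.update(local_results)
        fly_controller.apply_pose_corrections(poses)   # adjust flight paths
        if global_optimizer.quality(cloud) >= quality_threshold:
            return cloud                               # ultimate system output
```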
Further details of how drone agents 142 shown in system 100 operate in various embodiments will now be discussed.
The problem of how to control the positioning and motion of multiple drone cameras is addressed in the present invention by a combination of SLAM and MultiView Triangulation (MVT).
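The triangulation half of that combination can be illustrated with the standard direct linear transform (DLT), which recovers a 3D point from its pixel observations in several cameras whose projection matrices are known (here, supplied by SLAM); this is a generic sketch, not the specification's exact formulation.

```python
# Generic multiview triangulation by the direct linear transform (DLT).
import numpy as np

def triangulate(proj_mats, pixels):
    """proj_mats: list of 3x4 projection matrices P_i = K_i [R_i | t_i];
    pixels: matching (u, v) observations of one scene point. Returns 3D point."""
    rows = []
    for P, (u, v) in zip(proj_mats, pixels):
        rows.append(u * P[2] - P[0])    # each view adds two linear
        rows.append(v * P[2] - P[1])    # constraints on the homogeneous point
    _, _, Vt = np.linalg.svd(np.stack(rows))
    X = Vt[-1]
    return X[:3] / X[3]                 # dehomogenize
```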
Mathematical details of the steps involved in the various calculations necessary to determine the transforms between two cameras are presented in the attached drawings.
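As a non-authoritative illustration of the kind of calculation involved, one conventional approach (assumed here, not quoted from those details) matches keypoints between the two cameras' images, estimates the essential matrix, and recovers the relative pose, for example with OpenCV; the recovered translation is defined only up to scale.

```python
# Illustrative two-camera relative pose from overlapping images; a standard
# pipeline offered as an example, not the specification's own derivation.
import cv2
import numpy as np

def relative_pose(img_a, img_b, K):
    """Estimate rotation R and unit-scale translation t of camera B w.r.t. A
    from two overlapping grayscale images sharing intrinsics K."""
    orb = cv2.ORB_create(2000)
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des_a, des_b)
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in matches])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in matches])
    E, mask = cv2.findEssentialMat(pts_a, pts_b, K, method=cv2.RANSAC)
    _, R, t, _ = cv2.recoverPose(E, pts_a, pts_b, K, mask=mask)
    return R, t
```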
For simplicity, one of the drone agents may be considered the “master” drone agent, representing a “master” drone camera, whose coordinates may be considered to be the global coordinates, to which all the other drone camera images are aligned using the techniques described above.
(1) Control is rooted in the global optimizer's 3D map, which serves as the latest and most accurate visual reference for camera positioning. (2) The fly controller uses the 3D map information to generate commands to each drone that compensate for positioning errors made apparent in the map. (3) Upon the arrival of an image from the drone, the drone agent computes the “measured” pose in the neighborhood of the expected pose, which avoids unlikely solutions. (4) For drone swarm formation, the feedback mechanism always adjusts each drone's pose according to visual measurements, so that formation distortion due to drift is limited.
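Points (2) through (4) amount to visual feedback control; a hypothetical proportional correction, with illustrative gains that are not specified anywhere in this disclosure, might look like the following.

```python
# Hypothetical proportional feedback step from measured pose (via the 3D map)
# toward the pose the formation calls for; gains are illustrative only.
import numpy as np

def correction_command(measured_pos, desired_pos, measured_yaw, desired_yaw,
                       k_pos=0.8, k_yaw=0.5):
    """Return (velocity setpoint, yaw-rate setpoint) steering a drone back
    toward its assigned slot in the formation."""
    velocity = k_pos * (np.asarray(desired_pos) - np.asarray(measured_pos))
    yaw_err = (desired_yaw - measured_yaw + np.pi) % (2 * np.pi) - np.pi
    return velocity, k_yaw * yaw_err    # yaw error wrapped to [-pi, pi)
```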
Embodiments described herein provide various benefits in systems and methods for the capture and integration of visual content using a plurality of camera-equipped drones. In particular, embodiments enable automatic spatial alignment or coordination of drone trajectories and camera poses based purely on the visual content of the images those cameras capture, and the computation of consistent 3D point clouds, depth maps, and camera poses among all drones, as facilitated by the proposed iterative global optimizer. Successful operation does not rely on the presence of depth sensors (although they may be a useful adjunct), as the proposed SLAM-MVT mechanisms in the camera controller can generate scale-consistent RGB-D image data simply using the visual content of successively captured images from multiple drones (even many more than two). Such data are invaluable in modern high-quality 3D scene reconstruction.
The novel local-to-global coordinate transform method described above is based on matching multiple pairs of images such that a multi-to-one global match is made, which provides robustness. In contrast with prior art systems, the image processing performed by the drone agents to calculate their corresponding camera poses and depth maps does not depend on the availability of a global 3D map. Each drone agent can generate a dense depth map by itself given a pair of RGB images and their corresponding camera poses, and then transform the depth map and camera poses into global coordinates before delivering the results to the global optimizer. Therefore, the operation of the global optimizer of the present invention is simpler, dealing with the camera poses and depth maps in a unified coordinate system.
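The multi-to-one idea can be illustrated, purely as an assumed stand-in for the specification's own algorithm, by estimating a candidate local-to-global transform from each of several image pairs and keeping the candidate on which most pairs agree:

```python
# Simple consensus vote among candidate transforms from different image pairs;
# tolerances are arbitrary example values.
import numpy as np

def consensus_transform(candidate_Rs, candidate_ts, angle_tol=0.05, dist_tol=0.1):
    """Among candidate (rotation, translation) pairs estimated from different
    image pairs, return the one agreeing with the most other candidates."""
    best_i, best_votes = 0, -1
    for i, (Ri, ti) in enumerate(zip(candidate_Rs, candidate_ts)):
        votes = 0
        for Rj, tj in zip(candidate_Rs, candidate_ts):
            angle = np.arccos(np.clip((np.trace(Ri.T @ Rj) - 1) / 2, -1.0, 1.0))
            if angle < angle_tol and np.linalg.norm(ti - tj) < dist_tol:
                votes += 1
        if votes > best_votes:
            best_i, best_votes = i, votes
    return candidate_Rs[best_i], candidate_ts[best_i]
```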
It should be noted that two loops of data transfer are involved. The outer loop operates between the fly controller and the camera controller to provide global positioning accuracy while the inner loop (which is made up of multiple sub-loops) operates between drone agents and the global optimizer within the camera controller to provide structure and motion accuracy.
Although the invention has been described with respect to particular embodiments thereof, these particular embodiments are merely illustrative, and not restrictive. Applications include professional 3D scene capture, digital content asset generation, a real-time review tool for studio capture, and drone swarm formation and control. Moreover, since the present invention can handle multiple drones performing complicated 3D motion trajectories, it can also be applied to cases of lower-dimensional trajectories, such as scans by a team of robots.
Any suitable programming language can be used to implement the routines of particular embodiments including C, C++, Java, assembly language, etc. Different programming techniques can be employed such as procedural or object oriented. The routines can execute on a single processing device or multiple processors. Although the steps, operations, or computations may be presented in a specific order, this order may be changed in different particular embodiments. In some particular embodiments, multiple steps shown as sequential in this specification can be performed at the same time.
Particular embodiments may be implemented in a computer-readable storage medium for use by or in connection with the instruction execution system, apparatus, system, or device. Particular embodiments can be implemented in the form of control logic in software or hardware or a combination of both. The control logic, when executed by one or more processors, may be operable to perform that which is described in particular embodiments.
Particular embodiments may be implemented by using a programmed general-purpose digital computer, by using application-specific integrated circuits, programmable logic devices, or field-programmable gate arrays; optical, chemical, biological, quantum or nanoengineered systems, components and mechanisms may also be used. In general, the functions of particular embodiments can be achieved by any means as is known in the art. Distributed, networked systems, components, and/or circuits can be used. Communication, or transfer, of data may be wired, wireless, or by any other means.
It will also be appreciated that one or more of the elements depicted in the drawings/figures can also be implemented in a more separated or integrated manner, or even removed or rendered as inoperable in certain cases, as is useful in accordance with a particular application. It is also within the spirit and scope to implement a program or code that can be stored in a machine-readable medium to permit a computer to perform any of the methods described above.
A “processor” includes any suitable hardware and/or software system, mechanism or component that processes data, signals or other information. A processor can include a system with a general-purpose central processing unit, multiple processing units, dedicated circuitry for achieving functionality, or other systems. Processing need not be limited to a geographic location, or have temporal limitations. For example, a processor can perform its functions in “real time,” “offline,” in a “batch mode,” etc. Portions of processing can be performed at different times and at different locations, by different (or the same) processing systems. Examples of processing systems can include servers, clients, end user devices, routers, switches, networked storage, etc. A computer may be any processor in communication with a memory. The memory may be any suitable processor-readable storage medium, such as random-access memory (RAM), read-only memory (ROM), magnetic or optical disk, or other non-transitory media suitable for storing instructions for execution by the processor.
As used in the description herein and throughout the claims that follow, “a”, “an”, and “the” includes plural references unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise.
Thus, while particular embodiments have been described herein, latitudes of modification, various changes, and substitutions are intended in the foregoing disclosures, and it will be appreciated that in some instances some features of particular embodiments will be employed without a corresponding use of other features without departing from the scope and spirit as set forth. Therefore, many modifications may be made to adapt a particular situation or material to the essential scope and spirit.