This disclosure relates generally to digital photography. More particularly, but not by way of limitation, this disclosure relates to a technique for calibrating a pair of image capture devices using a face or other standard object as a calibration standard. As used herein, the term “camera” means any device that has at least two image capture units, each capable of capturing an image of a scene at substantially the same time (e.g., concurrently). This definition explicitly includes stand-alone digital cameras and image capture systems embedded in other devices such as mobile telephones, portable computer systems including tablet computer systems, and personal entertainment devices.
Two cameras may be used to generate stereoscopic images, depth maps and disparity maps. Two-camera systems may also be used to improve or assist image registration and fusion operations. To accomplish this however, the cameras need to be calibrated to one another. That is, the pose of the first camera with respect to the second camera needs to be known. Referring to
In one embodiment the disclosed concepts provide a method to calibrate two image capture units or cameras based on a non-standard, and initially unknown, calibration object. The method includes obtaining a first image of an object (e.g., a human face) from a first image capture unit and, concurrently, a second image of the object from a second image capture unit; identifying first and second sets of landmark points based on the first and second images respectively; determining first and second poses based on the first and second sets of landmark points respectively (e.g., using a POSIT operation and, possibly, an initial structure of the object such as, for example, an “average” face); determining a first estimated structure of the object based on the first and second poses; determining a first projection error based on the first pose, the second pose and the first estimated structure; calibrating, when the first projection error is less than a first threshold value, the first and second image capture units based on the first estimated structure; and determining, when the first projection error is more than a second threshold value: (•) revised first and second poses based on the first and second sets of landmark points and the first estimated structure, (•) a second estimated structure based on the revised first and second poses and the first estimated structure, and (•) a new first projection error based on the revised first and second poses and the second estimated structure.
In one embodiment the two image capture units may be incorporated in a single electronic device. In some embodiments, the first and second threshold values may be the same while in other embodiments they may be different. In still other embodiments, determining a revised first pose, a revised second pose, a second estimated structure, and a new first projection error may be repeated until the new first projection error is less than the first threshold value. The disclosed methods may be embodied in computer executable programs or instructions. Such computer programs or instructions may be stored in any media that is readable and executable by a computer system.
This disclosure pertains to systems, methods, and computer readable media to improve the operation of digital cameras having two or more image capture units. Techniques are disclosed for calibrating two cameras (image capture units) using a non-standard, and initially unknown, target (calibration) object. The target object may be any of a number of object types such as a specific three dimensional (3D) shape, a specific type of animal (e.g., dogs) or, in one particular embodiment, a human face. In general, any object type that may be expressed with a reasonably low-dimensional (possibly linear) parametrized model may be used as the target object. One result of the disclosed operation is a refined characterization of the target object's structure. This structure, along with the cameras' intrinsics and extrinsics, may be used as input to a non-linear bundle adjustment operation resulting in camera calibration. One of ordinary skill in the art will understand that, for a bundle adjuster to converge, a good initial estimate of the calibration rig's structure must be known. In the prior art this would be the planar calibration target 130. In accordance with this disclosure this could now be any target object, images of which may be captured in an unconstrained environment, characterized by a parametrized model that is, through an iterative approach, refined to the desired level of accuracy. While the disclosed subject matter is not so limited, this disclosure will focus on the iterative refinement of a model directed to a single human face captured by a pair of cameras.
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the disclosed concepts. As part of this description, some of this disclosure's drawings represent structures and devices in block diagram form in order to avoid obscuring the novel aspects of the disclosed concepts. In the interest of clarity, not all features of an actual implementation are described. Moreover, the language used in this disclosure has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter, resort to the claims being necessary to determine such inventive subject matter. Reference in this disclosure to “one embodiment” or to “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosed subject matter, and multiple references to “one embodiment” or “an embodiment” should not be understood as necessarily all referring to the same embodiment.
It will be appreciated that in the development of any actual implementation (as in any software and/or hardware development project), numerous decisions must be made to achieve the developers' specific goals (e.g., compliance with system- and business-related constraints), and that these goals may vary from one implementation to another. It will also be appreciated that such development efforts might be complex and time-consuming, but would nevertheless be a routine undertaking for those of ordinary skill in the design and use of image capture systems having the benefit of this disclosure.
As noted above, camera calibration usually requires a reference object (e.g., a rigid 3D checkerboard pattern) that must be seen from both cameras and from multiple positions. Because of these requirements, camera calibration is usually only performed at time of manufacture or during special “calibration sessions” when the camera may be placed into—and moved through—a controlled environment (see FIG. 1). The issue of post-sale calibration of multi-camera devices may become significant as the number of such devices entering the consumer market continues to increase. When this happens, the ability to accurately generate and/or use 3D information available from such devices may depend, in part, on the accuracy of their calibration—where such calibration may change over time as the devices are subject to, for example, thermal stresses and ballistic motions.
To simplify the presentation of a specific embodiment, the following assumptions and notations are herein adopted.
Single View Face Reconstruction and Pose Estimation.
To begin, assume a single image capture device with a known focal length and whose principal point coincides with a captured image's center. Knitting together two observations allows one to estimate, from a single image, the 3D landmark coordinates (e.g., the facial structure) as well as the face pose (i.e., the camera extrinsic parameters): [1] If the 3D coordinates of the landmark points and their projections onto the camera image plane are known, it is possible to estimate the camera pose; and [2] If the camera pose is known, it is possible to estimate the 3D landmark coordinates. These two ideas can be combined in an iterative fashion, with convergence achieved when the projection of the estimated face shape onto the camera image plane via the estimated intrinsic and extrinsic parameters is close enough to the position of the detected facial landmarks. Analytically, this difference or residual may be determined as:
$$J(x_1,\ldots,x_n,\hat{R},\hat{T},\hat{\varphi}) = \sum_{i=1}^{n}\left\lVert x_i - \Pi\left[K\left(\hat{R}\,P_i(S\hat{\varphi}+\mu)+\hat{T}\right)\right]\right\rVert \qquad \text{EQ. 6}$$
where the “hat” symbol (^) indicates an estimated value.
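By way of illustration only, the residual of EQ. 6 might be evaluated as in the following Python/NumPy sketch. It assumes that Π denotes the usual perspective division, that the shape vector Sφ+μ stores the landmarks as consecutive (X, Y, Z) triples so that P_i selects the i-th triple, and that the function name `residual` and the argument layout are hypothetical conveniences rather than part of the disclosure.

```python
import numpy as np

def residual(x, K, R, T, S, mu, phi):
    """Sum over landmarks of ||x_i - Pi[K (R P_i (S phi + mu) + T)]||, as in EQ. 6.

    x   : (n, 2) detected landmark positions in the image
    K   : (3, 3) intrinsic matrix
    R, T: (3, 3) rotation and (3,) translation (the estimated pose)
    S   : (3n, p) shape basis, mu: (3n,) mean shape, phi: (p,) shape parameters
    Assumption: Pi is perspective division and landmarks are stored as XYZ triples.
    """
    X = (S @ phi + mu).reshape(-1, 3)    # row i plays the role of P_i (S phi + mu)
    q = (X @ R.T + T) @ K.T              # K (R X_i + T) for every landmark
    pix = q[:, :2] / q[:, 2:3]           # perspective division
    return float(np.linalg.norm(x - pix, axis=1).sum())
```

With landmarks generated by exactly this projection model the function returns a value near zero, and displacing one detected landmark by (3, 4) pixels raises it by 5.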
Referring to
$$x = \Pi[K(RX+T)] \approx \Pi[K(RP(S\varphi+\mu)+T)], \qquad \text{EQ. 7}$$
where P is the projection matrix that extracts the proper portion of the shape vector corresponding to the point x. Rearranging EQ. 7 yields:
$$x \approx \Pi[(KRPS)\varphi + K(RP\mu+T)] = \Pi[A\varphi+b]. \qquad \text{EQ. 8}$$
Expanding the projection operation Π yields:
where a1, a2 and a3 are the rows of the matrix $A \in \mathbb{R}^{3\times p}$. By repeating the process for each of the facial landmarks and stacking the resulting equations one on top of the other, an over-determined linear system may be obtained:
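For orientation, one way the expansion can be written is sketched below. It assumes that Π is the usual perspective division, and the symbols u, v (the two components of x) and b1, b2, b3 (the components of b) are introduced here only for illustration. Cross-multiplying by the common denominator gives two equations per landmark that are linear in φ:

```latex
\begin{aligned}
u &\approx \frac{a_1\varphi + b_1}{a_3\varphi + b_3}, &
v &\approx \frac{a_2\varphi + b_2}{a_3\varphi + b_3},\\[4pt]
(a_1 - u\,a_3)\,\varphi &\approx u\,b_3 - b_1, &
(a_2 - v\,a_3)\,\varphi &\approx v\,b_3 - b_2 .
\end{aligned}
```

Under these assumptions each landmark contributes two rows to the stacked system and two entries to its right-hand side.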
$$H(R)\,\varphi \approx z(R,T), \qquad \text{EQ. 10}$$
where $H \in \mathbb{R}^{2n\times p}$ and $z \in \mathbb{R}^{2n}$, and the arguments within the parentheses explicitly show the dependence on the camera extrinsics. As long as 2n>p, the value of φ (which encodes the imaged face's structure or shape) can be determined in a least-squares sense (block 225). In one embodiment, φ may be determined (in a least-squares sense) using the MATLAB® programming environment as illustrated in Table 1. (MATLAB is a registered trademark of The MathWorks, Inc.)
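As an illustrative alternative to the MATLAB listing referenced above, the least-squares step might be sketched in Python with NumPy as follows. This is not the listing of Table 1; it assumes perspective division for Π, a shape vector stored as consecutive (X, Y, Z) triples, and a hypothetical function name `solve_shape`.

```python
import numpy as np

def solve_shape(x, K, R, T, S, mu):
    """Least-squares estimate of phi from H phi ~ z (the form of EQ. 10).

    x: (n, 2) landmarks; K: (3, 3) intrinsics; R: (3, 3); T: (3,);
    S: (3n, p) shape basis; mu: (3n,) mean shape.
    Assumption: Pi is perspective division, so each landmark yields two
    linear equations after multiplying through by the denominator.
    """
    rows, rhs = [], []
    for i in range(x.shape[0]):
        S_i = S[3 * i:3 * i + 3]             # P_i S: basis rows for landmark i
        mu_i = mu[3 * i:3 * i + 3]           # P_i mu
        A = K @ R @ S_i                      # 3 x p, as in EQ. 8
        b = K @ (R @ mu_i + T)               # 3-vector, as in EQ. 8
        u, v = x[i]
        rows.append(A[0] - u * A[2])
        rhs.append(u * b[2] - b[0])
        rows.append(A[1] - v * A[2])
        rhs.append(v * b[2] - b[1])
    H, z = np.array(rows), np.array(rhs)     # H is 2n x p, z has length 2n
    phi, *_ = np.linalg.lstsq(H, z, rcond=None)
    return phi
```

When 2n>p and the landmarks are noise-free, the returned vector matches the shape parameters used to generate them; with noisy detections it is the minimizer of the linearized error rather than of EQ. 6 itself.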
The residual given by EQ. 6 may now be determined (block 230) and compared against a specified threshold (block 235). As noted above, this threshold (Tj) may be established a priori. In another embodiment, however, this threshold may be determined dynamically. By way of example, this threshold may be met if the projection error does not change by more than a given amount over a given number of iterations. If the determined error (aka residual) is less than the specified threshold (the “YES” prong of block 235), the target object's (e.g., face's) pose and structure may be passed to a non-linear bundle adjuster to complete the cameras' calibration (block 240). If the residual is not less than the specified threshold (the “NO” prong of block 235), another iteration of operation 200 may begin at block 220. In one embodiment, a first threshold may be used to determine whether the projection error meets the YES prong of block 235 and a second threshold may be used to determine whether it meets the NO prong of block 235. This permits, for example, a large distinction between the YES and NO actions of block 235. In another embodiment the same threshold value may be used. In still another embodiment, the YES prong of block 235 may need to be met for a specified minimum number of consecutive evaluations. The presentation here, based on a single image, can provide initial values for a face's pose and structure that, in combination with non-linear optimization techniques (e.g., a bundle adjuster), may be used to determine a final pose and face structure.
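Two of the stopping rules described above, the fixed threshold that must be met for a minimum number of consecutive evaluations and the dynamic rule based on how little the residual changes, might be sketched as follows. The function names `accepted` and `plateaued` and their parameters are hypothetical, and the two-threshold variant is not modeled.

```python
def accepted(residuals, threshold, min_consecutive=1):
    """True when the last `min_consecutive` residuals are all below `threshold`."""
    if len(residuals) < min_consecutive:
        return False
    return all(r < threshold for r in residuals[-min_consecutive:])

def plateaued(residuals, window, max_change):
    """True when the last `window` residuals vary by no more than `max_change`."""
    if len(residuals) < window:
        return False
    recent = residuals[-window:]
    return max(recent) - min(recent) <= max_change
```

In such a sketch, `residuals` would hold the value of EQ. 6 from each pass, oldest first, and either predicate returning true would correspond to taking the YES prong.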
Camera Pair Calibration from Face Images.
Referring to
$$K_1,R_1,T_1,K_2,R_2,T_2,\varphi = \arg\min \sum_{i=1}^{2} J(x_1^i,\ldots,x_n^i,R_i,T_i,\varphi). \qquad \text{EQ. 11}$$
When EQ. 11 is satisfied, the camera pair may be said to be calibrated.
Bundle adjustment amounts to jointly refining the set of initial camera parameters and landmark positions to find the set of parameters that most accurately predicts the locations of the observed landmarks in the two image planes. It is assumed that both cameras 310 and 315 can see all of the landmark points of the target object (e.g., face 305). In another embodiment, this constraint may be relaxed, especially when more than two cameras are used. In this section, the methods of the prior section are tailored and extended to provide a good initialization to the bundle adjuster that minimizes EQ. 11.
Once the quantities in EQ. 11 have been estimated, the transformation between the camera coordinate systems such that X2=
where
When only a single image of a face is captured by the two or more image capture units (or, more precisely, when a set of corresponding landmarks is detected), estimation of the camera extrinsics remains decoupled and therefore the POSIT operation may be used for each of the two cameras. This will return an initial estimate for R1, T1, and R2, T2. If multiple faces are captured, the intrinsic parameters may be constrained by the poses of the image capture units with respect to one another, and therefore it would make sense to resort to an approach that explicitly takes advantage of this constraint.
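Given per-camera estimates R1, T1 and R2, T2 such as those described above, the pose of the second camera relative to the first follows from standard rigid-body composition. The sketch below assumes the convention X_i = R_i X + T_i for a point X in the target object's coordinate frame, consistent with the form of EQ. 7; the names `relative_pose`, `R12` and `T12` are hypothetical, and whether this matches the expression used elsewhere in this disclosure is not confirmed.

```python
import numpy as np

def relative_pose(R1, T1, R2, T2):
    """Return (R12, T12) such that X2 = R12 @ X1 + T12.

    Assumes X1 = R1 @ X + T1 and X2 = R2 @ X + T2 for the same object point X,
    with R1 and R2 proper rotation matrices (inverse equals transpose).
    """
    R12 = R2 @ R1.T          # rotate from camera-1 axes to camera-2 axes
    T12 = T2 - R12 @ T1      # offset that makes the two mappings agree
    return R12, T12
```

Substituting X = R1ᵀ(X1 − T1) into the second mapping gives the two expressions used in the function.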
In the single face case, because the same face is seen by both of the cameras, the structure parameter φ is shared across both views. Hence, the coupling may be expressed by modifying EQ. 10 so that:
By using EQ. 13 (instead of EQ. 10) during acts in accordance with block 225, calibration operation 200 may be used to calibrate two cameras based on images captured “in the wild” (e.g., in a post-sale environment) using a human face (or other target object that has a model as discussed above) of any arbitrary individual with any arbitrary expression and without any special calibration rig.
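One plausible form of the coupled system, assumed here for illustration, stacks the two per-view systems of the EQ. 10 type on top of one another and solves for the single shared φ. In the Python/NumPy sketch below, `H1`, `z1`, `H2` and `z2` are taken to be already built for cameras 1 and 2, and the function name `solve_shared_shape` is hypothetical.

```python
import numpy as np

def solve_shared_shape(H1, z1, H2, z2):
    """Solve [H1; H2] phi ~ [z1; z2] in the least-squares sense.

    H1, H2: (2n, p) per-view matrices; z1, z2: (2n,) per-view right-hand sides.
    Assumption: both views observe the same n landmarks and share one phi.
    """
    H = np.vstack([H1, H2])            # 4n x p stacked system
    z = np.concatenate([z1, z2])       # length 4n
    phi, *_ = np.linalg.lstsq(H, z, rcond=None)
    return phi
```

Stacking doubles the number of equations without adding unknowns, which is one way to see why a shared structure parameter ties the two views together.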
Referring to
Initial shape vectors L1 and L2 may be applied to three-dimensional (3D) pose estimation (POSIT) modules or circuits 435 and 440 respectively. As shown, POSIT operations 435 and 440 may also each receive input that characterizes or models the target object. In the illustrative embodiment shown in
Referring to
Lens assemblies 505A and 505B may each include a single lens or multiple lenses, filters, and a physical housing unit (e.g., a barrel). One function of a lens assembly is to focus light from scene 515 onto the corresponding image sensor. Image sensors 510A and 510B may, for example, be CCD (charge-coupled device) or CMOS (complementary metal-oxide semiconductor) image sensors. In one embodiment, image sensors 510A and 510B may be physically distinct sensors. In another embodiment, image sensors 510A and 510B may be different portions of a single sensor element. In still another embodiment, image sensors 510A and 510B may be the same sensor element with the two images described above being captured in rapid succession through lens elements 505A and 505B. IPP 520 may perform a number of different tasks including, but not limited to, black level removal, de-noising, lens shading correction, white balance adjustment, demosaic operations, and the application of local or global tone curves or maps. IPP 520 may comprise a custom designed integrated circuit, a programmable gate-array, a central processing unit, a graphical processing unit, memory, or a combination of these elements (including more than one of any given element). Some functions provided by IPP 520 may be implemented at least in part via software (including firmware). Display element 525 may be used to display text and graphic output as well as to receive user input via user interface 530. For example, display element 525 may be a touch-sensitive display screen. User interface 530 can also take a variety of other forms such as a button, keypad, dial, click wheel, or keyboard. Processor 535 may be a system-on-chip (SOC) such as those found in mobile devices and include one or more dedicated graphics processing units (GPUs).
Processor 535 may be based on reduced instruction-set computer (RISC) or complex instruction-set computer (CISC) architectures or any other suitable architecture and may include one or more processing cores. Graphics hardware 540 may be special purpose computational hardware for processing graphics and/or assisting processor 535 in performing computational tasks. In one embodiment, graphics hardware 540 may include one or more programmable GPUs each of which may have one or more cores. Audio circuit 545 may include one or more microphones, one or more speakers and one or more audio codecs. Image processing circuit 550 may aid in the capture of still and video images from image sensors 510A and 510B and include at least one video codec. Image processing circuit 550 may work in concert with IPP 520, processor 535 and/or graphics hardware 540. Images, once captured, may be stored in memory 555 and/or storage 560. Memory 555 may include one or more different types of media used by IPP 520, processor 535, graphics hardware 540, audio circuit 545, and image processing circuit 550 to perform device functions. For example, memory 555 may include memory cache, read-only memory (ROM), and/or random access memory (RAM). Storage 560 may store media (e.g., audio, image and video files), computer program instructions or software, preference information, device profile information, and any other suitable data. Storage 560 may include one or more non-transitory storage media including, for example, magnetic disks (fixed, floppy, and removable) and tape, optical media such as CD-ROMs and digital video disks (DVDs), and semiconductor memory devices such as Electrically Programmable Read-Only Memory (EPROM), and Electrically Erasable Programmable Read-Only Memory (EEPROM). Device sensors 565 may include, for example, a proximity sensor/ambient light sensor, an accelerometer and/or gyroscopes. Communication interface 570 may be used to connect device 500 to one or more networks.
Illustrative networks include, but are not limited to, a local network such as a USB network, an organization's local area network, and a wide area network such as the Internet. Communication interface 570 may use any suitable technology (e.g., wired or wireless) and protocol (e.g., Transmission Control Protocol (TCP), Internet Protocol (IP), User Datagram Protocol (UDP), Internet Control Message Protocol (ICMP), Hypertext Transfer Protocol (HTTP), Post Office Protocol (POP), File Transfer Protocol (FTP), and Internet Message Access Protocol (IMAP)). Communication link 575 may be a continuous or discontinuous communication path and may be implemented, for example, as a bus, a switched interconnect, or a combination of these technologies.
It is to be understood that the above description is intended to be illustrative, and not restrictive. The material has been presented to enable any person skilled in the art to make and use the disclosed subject matter as claimed and is provided in the context of particular embodiments, variations of which will be readily apparent to those skilled in the art (e.g., some of the disclosed embodiments may be used in combination with each other). For example,