The subject matter disclosed herein relates to generating three-dimensional images. In particular, the subject matter disclosed herein relates to methods, systems, and computer-readable storage media for generating stereoscopic content via depth map creation.
Stereoscopic, or three-dimensional, imagery is based on the principle of human vision. Two separate detectors detect the same object or objects in a scene from slightly different angles and project them onto two planes. The resulting images are transferred to a processor which combines them and gives the perception of the third dimension, i.e. depth, to a scene.
Many techniques of viewing stereoscopic images have been developed and include the use of colored or polarizing filters to separate the two images, temporal selection by successive transmission of images using a shutter arrangement, or physical separation of the images in the viewer and projecting them separately to each eye. In addition, display devices have been developed recently that are well-suited for displaying stereoscopic images. For example, such display devices include digital cameras, personal computers, digital picture frames, high-definition televisions (HDTVs), and the like.
The use of digital image capture devices, such as digital still cameras, digital camcorders (or video cameras), and phones with built-in cameras, for use in capturing digital images has become widespread and popular. Because images captured using these devices are in a digital format, the images can be easily distributed and edited. For example, the digital images can be easily distributed over networks, such as the Internet. In addition, the digital images can be edited by use of suitable software on the image capture device or a personal computer.
Digital images captured using conventional image capture devices are two-dimensional. It is desirable to provide methods and systems for using conventional devices for generating three-dimensional images. In addition, it is desirable to provide methods and systems for aiding users of image capture devices to select appropriate image capture positions for capturing two-dimensional images for use in generating three-dimensional images. Further, it is desirable to provide methods and systems for altering the depth perceived in three-dimensional images.
Methods, systems, and computer program products for generating stereoscopic content via depth map creation are disclosed herein. According to one aspect, a method includes receiving a plurality of images of a scene captured at different focal planes. The method can also include identifying a plurality of portions of the scene in each captured image. Further, the method can include determining an in-focus depth of each portion based on the captured images for generating a depth map for the scene. Further, the method can include generating the other image of a stereoscopic image pair based on the depth map and the captured image in which the intended subject is found to be in focus, with that captured image serving as one image of the pair.
According to another aspect, a method for generating a stereoscopic image pair by altering a depth map can include receiving an image of a scene. The method can also include receiving a depth map associated with at least one captured image of the scene. The depth map can define depths for each of a plurality of portions of at least one captured image. Further, the method can include receiving user input for changing, in the depth map, the depth of at least one portion of at least one captured image. The method can also include generating a stereoscopic image pair of the scene based on the received image of the scene and the changed depth map.
According to an aspect, a system for generating a three-dimensional image of a scene is disclosed. The system may include at least one computer processor and memory configured to: receive a plurality of images of a scene captured at different focal planes; identify a plurality of portions of the scene in each captured image; determine an in-focus depth of each portion based on the captured images for generating a depth map for the scene; identify the captured image where the intended subject is found to be in focus as being one of the images of a stereoscopic image pair; and generate the other image of the stereoscopic image pair based on the identified captured image and the depth map.
According to another aspect, the computer processor and memory are configured to: scan a plurality of focal planes ranging from zero to infinity; and capture a plurality of images, each at a different focal plane.
According to another aspect, the system includes an image capture device for capturing the plurality of images.
According to another aspect, the image capture device comprises at least one of a digital still camera, a video camera, a mobile phone, and a smart phone.
According to another aspect, the computer processor and memory are configured to: filter the portions of the scene for generating a filtered image; apply thresholded edge detection to the filtered image; and determine whether each filtered portion is in focus based on the applied thresholded edge detection.
According to another aspect, the computer processor and memory are configured to: identify at least one object in each captured image; and generate a depth map for the at least one object.
According to another aspect, the at least one object is a target subject. The computer processor and memory are configured to determine one of the captured images having the highest contrast based on the target subject.
According to another aspect, the computer processor and memory are configured to generate the other image of the stereoscopic pair based on translation and perspective projection.
According to another aspect, the computer processor and memory are configured to generate a three-dimensional image of the scene using the stereoscopic image pairs.
According to another aspect, the computer processor and memory are configured to implement one or more of registration, rectification, color correction, matching edges of the pair of images, transformation, depth adjustment, motion detection, and removal of moving objects.
According to another aspect, the computer processor and memory are configured to display the three-dimensional image on a suitable three-dimensional image display.
According to another aspect, the computer processor and memory are configured to display the three-dimensional image on one of a digital still camera, a computer, a video camera, a digital picture frame, a set-top box, and a high-definition television.
The foregoing summary, as well as the following detailed description of preferred embodiments, is better understood when read in conjunction with the appended drawings. For the purposes of illustration, there is shown in the drawings exemplary embodiments; however, the invention is not limited to the specific methods and instrumentalities disclosed. In the drawings:
The subject matter of the present invention is described with specificity to meet statutory requirements. However, the description itself is not intended to limit the scope of this patent. Rather, the inventors have contemplated that the claimed subject matter might also be embodied in other ways, to include different steps or elements similar to the ones described in this document, in conjunction with other present or future technologies. Moreover, although the term “step” may be used herein to connote different aspects of methods employed, the term should not be interpreted as implying any particular order among or between various steps herein disclosed unless and except when the order of individual steps is explicitly described.
The present invention includes various embodiments for the creation and/or alteration of a depth map for an image using a digital still camera or other suitable device as described herein. Using the depth map for the image, a stereoscopic image pair and its associated depth map may be rendered. These processes may be implemented by a device such as a digital camera or any other suitable image processing device.
Referring to
The memory 104 and the CPU 106 may be operable together to implement an image generator function 114 for generating three-dimensional images of a scene using a depth map in accordance with embodiments of the present invention. The image generator function 114 may generate a three-dimensional image of a scene using two or more images of the scene captured by the device 100.
The method includes identifying 202 a plurality of portions of the scene in each captured image. For example, objects in each captured image can be identified and segmented to concentrate focus analysis on specific objects in the scene. A focus map, as described in more detail herein, may be generated and used for approximating the depth of image segments. Using the focus map, an in-focus depth of each portion may be determined 204 based on the captured images for generating a depth map for the scene.
The method uses the image where the intended subject is found to be in focus by the camera (as per normal camera focus operation) as the first image of the stereoscopic pair. The other image of the stereoscopic image pair is then generated 206 based on the first image and the depth map.
A method in accordance with embodiments of the present invention for generating a stereoscopic image pair of a scene using a depth map may be applied during image capture and may utilize camera, focus, and optics information for estimating the depth of each pixel in the image scene. The technique utilizes the concept of depth of field (or similarly, the circle of confusion) and relies upon fast capture and evaluation of a plurality of images while adjusting the lens focus from near field to infinity, before refocusing to capture the intended focused image.
The method of
The method of
If object segmentation is performed, each N×M block may be further subdivided into n×m sized sub-blocks corresponding to portions of a given segmented object (step 306). In each sub-block, the images for which the pixels are deemed by the procedure above to be “in-focus” may be analyzed for those pixels to identify in which of the candidate images the local contrast is at its highest level (step 308). This process can continue hierarchically for smaller sub-blocks as needed. The nearest focus distance at which a given pixel is deemed “in focus,” the farthest distance at which it is “in focus,” and the distance at which it is optimally “in focus,” as indicated by the highest local contrast for that pixel, may be recorded in a “focus map.”
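A minimal sketch of how such a focus map might be assembled from a focal stack is given below. The Laplacian sharpness measure, the 8-pixel smoothing window, the threshold value, and the function names are illustrative assumptions of the sketch; the description above does not prescribe them.

```python
import numpy as np
from scipy.ndimage import laplace, uniform_filter

def build_focus_map(stack, focus_distances, sharp_thresh=10.0):
    """Approximate a per-pixel focus map from a focal stack.

    stack: list of 2D grayscale arrays, one per focus distance.
    focus_distances: lens focus distance used for each image in the stack.
    Returns, per pixel: the shortest in-focus distance ds, the longest
    in-focus distance dl, and the highest-contrast distance dc.
    (Laplacian measure, smoothing window, and threshold are assumed choices.)
    """
    # Per-image sharpness: magnitude of the Laplacian, smoothed over a local block.
    sharpness = np.stack([uniform_filter(np.abs(laplace(img.astype(float))), size=8)
                          for img in stack])
    in_focus = sharpness > sharp_thresh                     # per-image "in focus" decision
    dists = np.asarray(focus_distances, dtype=float).reshape(-1, 1, 1)

    ds = np.where(in_focus, dists, np.inf).min(axis=0)      # nearest in-focus distance
    dl = np.where(in_focus, dists, -np.inf).max(axis=0)     # farthest in-focus distance
    dc = np.asarray(focus_distances, dtype=float)[sharpness.argmax(axis=0)]  # best contrast
    return ds, dl, dc
```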
Given the focus map for the pixels in an image, an approximate depth for those pixels can be calculated. For a given combination of image (camera) format circle of confusion, c, f-stop (aperture), N, and focal length, F, the hyperfocal distance (the nearest distance at which the depth of field extends to infinity) of the combination can be approximated as follows:
H = F²/(N·c) + F
In turn, the near field depth of field (Dn) for an image for a given focus distance, d, can be approximated as follows:
Dn ≈ H·d/(H + d)
(for moderate to large d), and the far field DOF (Df) as follows:
Df ≈ H·d/(H - d)
for d<H. For d>=H, the far end depth of field becomes infinite, and only the near end depth of field value is informative.
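As a numerical illustration of these relations, the short sketch below computes H, Dn, and Df for one arbitrary lens setting (a 35 mm lens at f/2.8 with c = 0.03 mm, focused at 2 m); the numbers are examples only, not values from this description.

```python
def hyperfocal(F, N, c):
    """Hyperfocal distance for focal length F, f-stop N, circle of confusion c (all in mm)."""
    return F * F / (N * c) + F

def near_dof(H, d):
    """Near-end depth of field limit for focus distance d (same units as H)."""
    return H * d / (H + d)

def far_dof(H, d):
    """Far-end depth of field limit; infinite once d reaches the hyperfocal distance."""
    return float("inf") if d >= H else H * d / (H - d)

# Illustrative numbers only: 35 mm lens at f/2.8, c = 0.03 mm, focused at 2 m (2000 mm).
H = hyperfocal(35.0, 2.8, 0.03)                   # ~14618 mm
print(near_dof(H, 2000.0), far_dof(H, 2000.0))    # ~1759 mm and ~2317 mm
```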
Using the values in the focus map, these relationships can be combined to build a depth map for the captured image (step 310). For example, for a given pixel (P) the focus map contains the value for the shortest focus distance at which the pixel is in focus, ds(P), the longest distance, dl(P), and the optimum contrast distance, dc(P). Using these values, one can approximate that the closest possible distance for the pixel is given by the following equation:
Dns(P) = H·ds(P)/(H + ds(P))
And the furthest distance (again, remembering that for a given focus distance, di, if di>=H, the associated value of Df will be infinite) is given by the following equation:
Dfl(P) = H·dl(P)/(H - dl(P))
and the optimum distance is between the equation,
Dnc(P) = H·dc(P)/(H + dc(P)),
and the equation,
Dfc(P) = H·dc(P)/(H - dc(P)).
Further, it is known that for the focus distances ds(P) and dl(P),
Dfs(P) = H·ds(P)/(H - ds(P))
and
Dnl(P) = H·dl(P)/(H + dl(P)).
Given these values, a depth for each pixel, Dp, can be approximated as follows:
If Dnl(P) < Dfc(P):
    If Dfs(P) > Dnc(P): Dp = [max(Dns(P), Dnl(P), Dnc(P)) + min(Dfs(P), Dfl(P), Dfc(P))]/2
    Else: Dp = [max(Dns(P), Dnl(P), Dnc(P)) + min(Dfl(P), Dfc(P))]/2
Else:
    If Dfs(P) > Dnc(P): Dp = [max(Dns(P), Dnc(P)) + min(Dfs(P), Dfl(P), Dfc(P))]/2
    Else: Dp = [max(Dns(P), Dnl(P), Dnc(P)) + min(Dfl(P), Dfc(P))]/2
if any of the Df(P) values are non-infinite. In the case that all Df(P) values are infinite, Dp is instead approximated as
Dp = [max(Dns(P), Dnc(P)) + min(Dnl(P), Dnc(P))]/2.
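The piecewise approximation above can be transcribed directly; the sketch below does so with illustrative function and argument names (Dns through Dfc correspond to the six limits defined above).

```python
import math

def pixel_depth(Dns, Dnl, Dnc, Dfs, Dfl, Dfc):
    """Approximate pixel depth Dp from the near (Dn*) and far (Df*) DOF limits
    at the shortest (s), longest (l), and optimum-contrast (c) in-focus distances."""
    if all(math.isinf(v) for v in (Dfs, Dfl, Dfc)):
        # All far limits infinite: use the near-limit form given above.
        return (max(Dns, Dnc) + min(Dnl, Dnc)) / 2.0

    if Dnl < Dfc:
        if Dfs > Dnc:
            return (max(Dns, Dnl, Dnc) + min(Dfs, Dfl, Dfc)) / 2.0
        return (max(Dns, Dnl, Dnc) + min(Dfl, Dfc)) / 2.0
    if Dfs > Dnc:
        return (max(Dns, Dnc) + min(Dfs, Dfl, Dfc)) / 2.0
    return (max(Dns, Dnl, Dnc) + min(Dfl, Dfc)) / 2.0
```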
The method of
A method in accordance with embodiments of the present invention for altering a depth map for generating a stereoscopic image pair may be applicable either pre- or post-capture. Touchscreen technology, which has become increasingly common, may be used in this method; with it, applications such as touchscreen user-directed focus for digital cameras (encompassing both digital still camera and cellphone camera units) have emerged. Using this technology, a touchscreen interface may be used for specifying the depth of objects in a two-dimensional image capture. Either pre- or post-capture, the image field may be displayed in the live view LCD window, which also functions as a touchscreen interface. A user may touch and highlight the window area whose depth he or she wishes to change, and subsequently use a right/left (or similar) brushing gesture to indicate an increased or decreased (respectively) depth of the object(s) at the point of the touchscreen highlight. Alternatively, depth can be specified by a user through any suitable input device or component, such as, for example, a keyboard, a mouse, or the like.
Embodiments of the present invention are applicable pre-capture, while composing the picture, or alternatively post-capture, to create or enhance the depth of objects in an eventual stereoscopic image, optionally in conjunction with the depth map creation technique described above. Used together with that technique, the depth-alteration technique enables selective artistic enhancements by the user; used stand-alone, it can be the means of creating a relative depth map for the picture, allowing the user to create a depth effect only for the objects he or she feels are important.
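A minimal sketch of the kind of depth-map edit such a brushing gesture might drive is shown below; the mask/scale interface and the function name are assumptions of the sketch, not part of the description above.

```python
import numpy as np

def adjust_region_depth(depth_map, mask, gesture_scale):
    """Scale the relative depth of a user-highlighted region.

    depth_map: 2D float array of per-pixel depths (relative or absolute).
    mask: boolean array marking the highlighted window area.
    gesture_scale: >1.0 pushes the region farther back, <1.0 brings it closer,
                   standing in for the right/left brushing gesture.
    """
    adjusted = depth_map.copy()
    adjusted[mask] *= gesture_scale
    return adjusted
```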
Once an image view and depth map are available using the techniques above, rendering of the stereoscopic image pair may occur.
For any stereoscopic image, there is an overlapping field of view from the left and right eyes that defines the stereoscopic image. At the point of convergence of the eyes, the disparity of an object between the two views will be zero, i.e. no parallax. This defines the “screen point” when viewing the stereoscopic pair. Objects in front of the screen and behind the screen will have increasing amounts of parallax disparity as the distance from the screen increases (negative parallax for objects in front of the screen, positive parallax for objects behind the screen).
The central point of the overlapping field of view on the screen plane (zero parallax depth) of the two eyes in stereoscopic viewing defines a circle that passes through each eye, with a radius, R, equal to the distance to the convergence point. Moreover, the angle, θ, between the vectors from the central convergence point to each of the two eyes can be measured. Examples for varying convergence points are described herein below.
Medium distance convergence gives a relatively small angular change, while close convergence gives a relatively large angular change.
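This follows from the chord geometry of the convergence circle: an interocular baseline b on a circle of radius R subtends θ = 2·arcsin(b/(2R)). The short sketch below works two illustrative cases; the 6.5 cm baseline is an assumed typical interocular distance, not a value given above.

```python
import math

def convergence_angle(baseline_cm, R_cm):
    """Angle (degrees) between the vectors from the convergence point to each eye,
    for eyes forming a chord of baseline_cm on a circle of radius R_cm."""
    return math.degrees(2.0 * math.asin(baseline_cm / (2.0 * R_cm)))

print(convergence_angle(6.5, 200.0))   # medium-distance convergence: ~1.9 degrees
print(convergence_angle(6.5, 50.0))    # close convergence: ~7.5 degrees
```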
The convergence point is chosen as the center pixel of the image on the screen plane. It should be noted that this may be an imaginary point, as the center pixel of the image may not be at a depth that is on the screen plane, and hence, the depth of that center pixel can be approximated. This value (Dfocus) is approximated to be 10-30% behind the near end depth of field distance for the final captured image, and is approximated by the equation:
where Dfocus is the focus distance of the lens for the final capture of the image, “Screen” is a value between 1.1 and 1.3, representing the placement of the screen plane behind the near end depth of field, and “scale” represents any scaled adjustment of that depth by the user utilizing the touchscreen interface.
The angle, θ, is dependent upon the estimated distance of focus and the modeled stereo baseline of the image pair to be created. Hence, θ may be estimated as follows:
for Dfocus calculated in centimeters. Typically, θ would be modeled as at most 2 degrees.
In addition to the rotational element in the Z plane, there can also be an X axis translational shift between views. Since no toe-in should occur for the image captures, as would be the case for operation of the eyes, there can be horizontal (X axis) displacement at the screen plane for the two images at the time of capture. For example,
for the width of the image sensor, W, and the focal length, F.
Depth Dp has been approximated for each pixel in the image, and is available from the depth map. It should be noted that the calculations that follow for a given pixel depth, Dp, may be imperfect, since each pixel is not centrally located between the two eye views; however, the approximation is sufficient for the goal of producing a stereoscopic effect. Hence, knowing V and the depth, Dp, of a given pixel, the approximate width of the field of view (WoV) may be represented as follows:
Hence, if the stereo baseline is estimated, the translational offset in pixels, S, for displacement on the X axis to the left (assuming without loss of generality, right image generated from left) is given by the following equation:
for PW, the image width in pixels. Since W, F, and PW are camera-specific quantities, the only quantity that must be specified is the modeled convergence angle, θ, which, as noted, is typically 1-2 degrees.
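The exact expressions for WoV and S are not reproduced above. As a rough sketch only, one consistent way to realize them is to take the field-of-view width at depth Dp as W·Dp/F, model the stereo baseline from the convergence circle as 2·Dfocus·tan(θ/2), and express that baseline in pixels; all three relations in the code below are assumptions of the sketch rather than the disclosed equations.

```python
import math

def pixel_offset(Dp, D_focus, theta_deg, W, F, PW):
    """Rough per-pixel X-axis offset S, in pixels, for the generated view.

    Assumed relations (not the disclosed equations):
      WoV = W * Dp / F                     # field-of-view width at depth Dp
      B   = 2 * D_focus * tan(theta / 2)   # modeled stereo baseline
      S   = B / WoV * PW                   # baseline expressed in pixels
    W and F are in mm; Dp and D_focus share the same length unit.
    """
    theta = math.radians(theta_deg)
    WoV = W * Dp / F
    B = 2.0 * D_focus * math.tan(theta / 2.0)
    return B / WoV * PW
```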
For each pixel, p, in the image, knowing the (xp, yp) coordinates, the pixel depth Dp, the pixel X-axis displacement S, and the angle θ, a perspective projective transform can be defined to generate a right eye image from the single "left eye" image. The perspective projective transform has a translation component (defined by S), a rotation in the x/y plane (zero in this case), a rotation in the y/z plane (again zero in this case), and a rotation in the x/z plane, defined by the angle θ. For example, the transform may be defined as follows:
where (Dxp, Dyp, Dzp) are 3D coordinate points resulting from the transform that can be projected onto a two dimensional image plane, which may be defined as follows:
where Ex, Ey, and Ez are the coordinates of the viewer relative to the screen, and can be estimated for a given target display device. Ex and Ey can be assumed to be, but are not limited to, 0. The pixels defined by (xp′, yp′) make up the right image view for the new stereoscopic image pair.
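Since the transform and projection matrices are not reproduced above, the sketch below uses a standard X-translation, an x/z-plane rotation by θ, and a similar-triangles projection toward a viewer at (Ex, Ey, Ez); it is one plausible realization of the steps described, not the disclosed formulation, and the sign of S is an assumption.

```python
import math

def right_eye_coordinates(xp, yp, Dp, S, theta_deg, Ez, Ex=0.0, Ey=0.0):
    """Map one left-eye pixel (xp, yp) with depth Dp to right-eye image coordinates.

    Sketch only: translate by S along X, rotate by theta in the x/z plane,
    then project with similar triangles toward a viewer at (Ex, Ey, Ez).
    """
    theta = math.radians(theta_deg)
    x, y, z = xp - S, yp, Dp              # X-axis translation; depth used as Z

    # Rotation in the x/z plane (about the Y axis) by theta.
    Dxp = x * math.cos(theta) + z * math.sin(theta)
    Dyp = y
    Dzp = -x * math.sin(theta) + z * math.cos(theta)

    # Perspective projection onto the two-dimensional image plane.
    xp_new = Ex + Ez * Dxp / Dzp
    yp_new = Ey + Ez * Dyp / Dzp
    return xp_new, yp_new, Dzp            # Dzp retained for the occlusion test below
```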
Following the calculation of (xp′, yp′) for each pixel, some pixels may map to the same coordinates. The choice of which pixel is in view is made using the Dzp values of the two pixels, taken after the initial transform but prior to the projection onto two-dimensional image space, with the lowest value displayed. An example of the pixel manipulations that occur in the course of the transform is shown in
Similarly, there may be points in the image for which no pixel maps. This can be addressed with pixel fill-in and/or cropping. A simple exemplary pixel fill-in process that may be utilized in the present invention assumes a linear gradient between points on each horizontal row in the image. For points on the same row, n, without defined pixel values between two defined points (xi, yn) and (xj, yn), the fill-in process first determines the distance, which may be defined as follows:
d=j−i−1,
and then proceeds to determine an interpolated gradient between the two pixel positions to fill in the missing values. For simplicity of implementation, the interpolation may be performed on a power of two, meaning that the interpolation will produce 1, 2, 4, 8, 16, etc. pixels as needed between the two defined pixels. Pixel regions that are not a power of two are mapped to the closest power of two, and either pixel repetition or truncation of the sequence is applied to fit. As an example, if j=14 and i=6, then d=7, and the intermediate pixel gradient is calculated as follows:
Since only 7 values are needed, p8 would go unused in this case, such that the following assignments can be made:
This process may repeat for each line in the image following the perspective projective transformation. The resultant image may be combined with the initial image capture to create a stereo image pair that may be rendered for 3D viewing via stereo registration and display. Other, more complex and potentially more accurate pixel fill in processes may be utilized.
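A minimal sketch of such a fill-in pass for one horizontal gap is given below; the rounding rule for choosing the power-of-two count and the repetition/truncation details are assumptions where the description leaves them open.

```python
import numpy as np

def fill_row_gap(row, i, j):
    """Fill undefined pixels between defined columns i and j on one image row (1D float array).

    A linear gradient is interpolated at the closest power-of-two count, then truncated
    or pixel-repeated to fit d = j - i - 1 (tie-breaking toward the lower power of two
    is an assumption of this sketch).
    """
    d = j - i - 1
    if d <= 0:
        return row
    lower = 1 << max(0, d.bit_length() - 1)           # largest power of two <= d
    upper = lower * 2
    n = lower if (d - lower) <= (upper - d) else upper

    # n interpolated values strictly between row[i] and row[j].
    gradient = np.linspace(row[i], row[j], n + 2)[1:-1]
    if n >= d:
        row[i + 1:j] = gradient[:d]                   # truncate the sequence to fit
    else:
        reps = int(np.ceil(d / n))
        row[i + 1:j] = np.repeat(gradient, reps)[:d]  # repeat pixels to fit
    return row
```

For the worked example above (i=6, j=14, d=7), n is 8 and the eighth interpolated value goes unused, matching the description.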
Embodiments in accordance with the present invention may be implemented by a digital still camera, a video camera, a mobile phone, a smart phone, and the like. In order to provide additional context for various aspects of the disclosed invention,
Generally, however, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular data types. The operating environment 900 is only one example of a suitable operating environment and is not intended to suggest any limitation as to the scope of use or functionality of the subject matter disclosed herein. Other well known computer systems, environments, and/or configurations that may be suitable for use with the invention include but are not limited to, personal computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include the above systems or devices, and the like.
With reference to
The system bus 908 can be any of several types of bus structure(s) including the memory bus or memory controller, a peripheral bus or external bus, and/or a local bus using any of a variety of available bus architectures including, but not limited to, 8-bit bus, Industry Standard Architecture (ISA), Micro-Channel Architecture (MCA), Extended ISA (EISA), Intelligent Drive Electronics (IDE), VESA Local Bus (VLB), Peripheral Component Interconnect (PCI), Universal Serial Bus (USB), Advanced Graphics Port (AGP), Personal Computer Memory Card International Association bus (PCMCIA), and Small Computer Systems Interface (SCSI).
The system memory 906 includes volatile memory 910 and nonvolatile memory 912. The basic input/output system (BIOS), containing the basic routines to transfer information between elements within the computer 902, such as during start-up, is stored in nonvolatile memory 912. By way of illustration, and not limitation, nonvolatile memory 912 can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable ROM (EEPROM), or flash memory. Volatile memory 910 includes random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in many forms such as synchronous RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM).
Computer 902 also includes removable/nonremovable, volatile/nonvolatile computer storage media.
It is to be appreciated that
A user enters commands or information into the computer 902 through input device(s) 926. Input devices 926 include, but are not limited to, a pointing device such as a mouse, trackball, stylus, touch pad, keyboard, microphone, joystick, game pad, satellite dish, scanner, TV tuner card, digital camera, digital video camera, web camera, and the like. These and other input devices connect to the processing unit 904 through the system bus 908 via interface port(s) 928. Interface port(s) 928 include, for example, a serial port, a parallel port, a game port, and a universal serial bus (USB). Output device(s) 930 use some of the same type of ports as input device(s) 926. Thus, for example, a USB port may be used to provide input to computer 902 and to output information from computer 902 to an output device 930. Output adapter 932 is provided to illustrate that there are some output devices 930 like monitors, speakers, and printers among other output devices 930 that require special adapters. The output adapters 932 include, by way of illustration and not limitation, video and sound cards that provide a means of connection between the output device 930 and the system bus 908. It should be noted that other devices and/or systems of devices provide both input and output capabilities such as remote computer(s) 934.
Computer 902 can operate in a networked environment using logical connections to one or more remote computers, such as remote computer(s) 934. The remote computer(s) 934 can be a personal computer, a server, a router, a network PC, a workstation, a microprocessor-based appliance, a peer device, or another common network node, and typically includes many or all of the elements described relative to computer 902. For purposes of brevity, only a memory storage device 936 is illustrated with remote computer(s) 934. Remote computer(s) 934 is logically connected to computer 902 through a network interface 938 and then physically connected via communication connection 940. Network interface 938 encompasses communication networks such as local-area networks (LAN) and wide-area networks (WAN). LAN technologies include Fiber Distributed Data Interface (FDDI), Copper Distributed Data Interface (CDDI), Ethernet/IEEE 802.3, Token Ring/IEEE 802.5, and the like. WAN technologies include, but are not limited to, point-to-point links, circuit switching networks like Integrated Services Digital Networks (ISDN) and variations thereon, packet switching networks, and Digital Subscriber Lines (DSL).
Communication connection(s) 940 refers to the hardware/software employed to connect the network interface 938 to the bus 908. While communication connection 940 is shown for illustrative clarity inside computer 902, it can also be external to computer 902. The hardware/software necessary for connection to the network interface 938 includes, for exemplary purposes only, internal and external technologies such as, modems including regular telephone grade modems, cable modems and DSL modems, ISDN adapters, and Ethernet cards.
The various techniques described herein may be implemented with hardware or software or, where appropriate, with a combination of both. Thus, the methods and apparatus of the disclosed embodiments, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention. In the case of program code execution on programmable computers, the computer will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device and at least one output device. One or more programs are preferably implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language, and combined with hardware implementations.
The described methods and apparatus may also be embodied in the form of program code that is transmitted over some transmission medium, such as over electrical wiring or cabling, through fiber optics, or via any other form of transmission, wherein, when the program code is received and loaded into and executed by a machine, such as an EPROM, a gate array, a programmable logic device (PLD), a client computer, a video recorder or the like, the machine becomes an apparatus for practicing the invention. When implemented on a general-purpose processor, the program code combines with the processor to provide a unique apparatus that operates to perform the processing of the present invention.
While the embodiments have been described in connection with the preferred embodiments of the various figures, it is to be understood that other similar embodiments may be used or modifications and additions may be made to the described embodiment for performing the same function without deviating therefrom. Therefore, the disclosed embodiments should not be limited to any single embodiment, but rather should be construed in breadth and scope in accordance with the appended claims.
This application claims the benefit of U.S. provisional patent application No. 61/230,138, filed Jul. 31, 2009, the disclosure of which is incorporated herein by reference in its entirety. The disclosures of the following U.S. provisional patent applications, commonly owned and simultaneously filed Jul. 31, 2009, are all incorporated by reference in their entirety: U.S. provisional patent application No. 61/230,131; and U.S. provisional patent application No. 61/230,133.