Three dimensional (3-D) information about a scene can be useful for many purposes, such as gesture detection, 3-D video conferencing, and gaming, among others. 3-D information can be derived from stereo images of the scene. However, current techniques for deriving this information tend to work well in some scenarios but not so well in other scenarios.
The described implementations relate to stereo image matching to determine depth of a scene as captured by images. More specifically, the described implementations can involve a two-stage approach where the first stage can compute depth at highly accurate but sparse feature locations. The second stage can compute a dense depth map using the first stage as initialization. This can improve the accuracy and robustness of the dense depth map. For example, one implementation can utilize a first technique to determine 3-D locations of a set of points in a scene. This implementation can initialize a second technique with the 3-D locations of the set of points. Further, the second technique can be propagated to determine 3-D locations of other points in the scene.
The above listed examples are intended to provide a quick reference to aid the reader and are not intended to define the scope of the concepts described herein.
The accompanying drawings illustrate implementations of the concepts conveyed in the present document. Features of the illustrated implementations can be more readily understood by reference to the following description taken in conjunction with the accompanying drawings. Like reference numbers in the various drawings are used wherever feasible to indicate like elements. Further, the left-most numeral of each reference number conveys the Figure and associated discussion where the reference number is first introduced.
The description relates to stereo matching to determine depth of a scene as captured by images. Stereo matching of a pair of left and right input images can find correspondences between pixels in the left image and pixels in the right image. Depth maps can be generated based upon the stereo matching. Briefly, the present implementations can utilize a first technique to accurately determine depths of seed points relative to a scene. The seed points can be utilized to initialize a second technique that can determine depths for the remainder of the scene. Stated another way, the seed points can be utilized to guide selection of potential minimum and maximum depths for a bounded region of the scene that includes individual seed points. This initialization can enhance the accuracy of the depth results produced by the second technique.
As can be evidenced from
The IR projector 104 can serve to project features onto the scene that can be detected by the IR cameras 106 and 108. Any type of feature 202 can be utilized that serves this purpose. In some cases, the features are projected at random locations in the scene and/or at a random density. Examples of such features can include dots, geometric shapes, texture, etc. Dots are utilized in the described implementations, but any feature can be utilized that is readily detectable in the resulting IR images 116 and 118. In summary, features can be added to the scene rather than relying on the scene containing features that lend themselves to accurate location. Further, the added features are outside the visible spectrum and thus do not degrade image 120 of the scene captured by visible light camera 110. Other technologies could also satisfy these criteria. For instance, UV light or other non-visible frequencies of light could be used.
The IR cameras 106 and 108, and visible light camera 110 may be genlocked, or synchronized. The genlocking of the IR cameras and/or visible light camera can ensure that the cameras are temporally coherent so that the captured stereo images directly correlate to each other. Other implementations can employ different numbers of IR projectors, IR cameras, and/or visible light cameras than the illustrated configuration.
The visible light camera 110 can be utilized to capture a color image of the scene by acquiring three different color signals, i.e., red, green, and blue, among other configurations. The output of the visible light camera 110 can provide a useful supplement to a depth map for many applications and use case scenarios, some of which are described below relative to
The images 116 and 118 captured by the IR cameras 106 and 108 include the features 202. The images 116 and 118 can be received by sparse component 112 as indicated at 204. Sparse component 112 can process the images 116 and 118 to identify the depths of the features in the images from the two IR cameras. Thus, one function of the sparse component can be to accurately determine the depth of the features 202. In some cases, the sparse component can employ a sparse location-based matching technique or algorithm to find the features and identify their depth. The sparse component 112 can communicate the corresponding images and/or the feature depths to the dense component 114 as indicated at 206.
In summary, the present concepts can provide accurate stereo matching of a few features of the images. This can be termed ‘sparse’ in that the features tend to occupy a relatively small portion of the locations of the scene. These accurately known feature locations can be leveraged to initialize nearest neighbor field stereo matching of the images.
From one perspective, some of the present implementations can precisely identify a relatively small number of locations or regions in a scene. These precisely identified regions can then be utilized to initialize identification of the remainder of the scene.
In an alternative configuration, the time of flight emitter 402 can be replaced with the IR projector 104 (
Devices 502, 504, and 506 can include several elements which are defined below. For example, these devices can include a processor 516 and/or storage 518. The devices can further include one or more IR projectors 104, IR cameras 106, visible light cameras 110, sparse components 112, and/or dense components 114. The function of these elements is described in detail above relative to
Device 502 is configured with a forward facing (e.g., toward the user) IR projector 104, a pair of IR cameras 106, and visible light camera 110. This configuration can lend itself to 3-D video conferencing and gesture recognition (such as to control the device or for gaming purposes). In this case, corresponding IR images containing features projected by the IR projector 104 can be captured by the pair of IR cameras 106. The corresponding images can be processed by the sparse component 112 which can provide initialization information for the dense component. Ultimately, the dense component can generate a robust depth map from the corresponding images.
This depth mapping process can be performed for single pictures (e.g., still frames) and/or for video. In the case of video, the sparse component and the dense component can operate on every video frame or upon select video frames. For instance, the sparse component and the dense component may only operate on I-frames or on frames that are temporally spaced, such as one frame every half-second. Thus, device 502 can function as a still shot ‘camera’ device and/or as a video camera type device and some or all of the images can be 3-D mapped.
Device 504 includes a first set 520 of IR projectors 104, IR cameras 106, and visible light cameras 110 similar to device 502. The first set can perform functionality similar to that described above relative to device 502. Device 504 also includes a second set 522 that includes an IR projector 104 and a pair of IR cameras 106. This second set can be aligned to capture user ‘typing motions’ on surface 524 (e.g., a surface upon which the device is positioned). Thus, the second set can enable a virtual keyboard scenario.
Device 506 can be a free standing device that includes an IR projector 104, a pair of IR cameras 106, and/or visible light cameras 110. The device may be manifest as a set-top box or entertainment console that can capture user gestures. In such a scenario, the device can include a processor and storage. Alternatively, the device may be configured to enable monitor 510 that is not a touch screen to function as a ‘touchless touchscreen’ that detects user gestures toward the monitor without having to actually touch the monitor. In such a configuration, the device 506 may utilize processing and storage capabilities of the computing device 508 to augment its own capabilities, or in place of having its own.
In still other configurations, any of devices 502-506 can send image data to Cloud 514 for remote processing by the Cloud's sparse component 112 and/or dense component 114. The Cloud can return the processed information, such as a depth map to the sending device and/or to another device with which the device is communicating, such as in a 3-D virtual conference.
The term “computer” or “computing device” as used herein can mean any type of device that has some amount of processing capability and/or storage capability. Processing capability can be provided by one or more processors (such as processor 516) that can execute data in the form of computer-readable instructions to provide a functionality. Data, such as computer-readable instructions, can be stored on storage, such as storage 518 that can be internal or external to the computer. The storage can include any one or more of volatile or non-volatile memory, hard drives, flash storage devices, and/or optical storage devices (e.g., CDs, DVDs etc.), among others. As used herein, the term “computer-readable media” can include signals. In contrast, the term “computer-readable storage media” excludes signals. Computer-readable storage media includes “computer-readable storage devices.” Examples of computer-readable storage devices include volatile storage media, such as RAM, and non-volatile storage media, such as hard drives, optical discs, and flash memory, among others.
In the illustrated implementation, devices 502 and 504 are configured with a general purpose processor 516 and storage 518. In some configurations, a computer can include a system on a chip (SOC) type design. In such a case, functionality provided by the computer can be integrated on a single SOC or multiple coupled SOCs. One or more processors can be configured to coordinate with shared resources, such as memory, storage, etc., and/or one or more dedicated resources, such as hardware blocks configured to perform certain specific functionality. Thus, the term “processor” as used herein can also refer to central processing units (CPUs), graphics processing units (GPUs), controllers, microcontrollers, processor cores, or other types of processing devices suitable for implementation both in conventional computing architectures as well as SOC designs.
In some configurations, the sparse component 112 and/or the dense component 114 can be installed as hardware, firmware or software during manufacture of the computer or by an intermediary that prepares the computer for sale to the end user. In other instances, the end user may install the sparse component 112 and/or the dense component 114, such as in the form of a downloadable application.
Examples of computing devices can include traditional computing devices, such as personal computers, desktop computers, notebook computers, cell phones, smart phones, personal digital assistants, pad type computers, cameras, or any of a myriad of ever-evolving or yet to be developed types of computing devices. Further, aspects of system 500 can be manifest on a single computing device or distributed over multiple computing devices.
A second technique can be initialized with the 3-D locations of the set of points at block 604. The second technique can be manifest as a nearest neighbor field (NNF) stereo matching technique. An example of an NNF stereo matching technique is PatchMatch™, which is described in more detail below relative to the “Third Method Example”. Briefly, PatchMatch can be thought of as an approximate dense nearest neighbor algorithm, i.e., for each patch of one image, an (x, y) offset vector can map the patch to a similarly colored patch of a second image.
The second technique can be propagated to determine 3-D locations of other points of the scene at 606. The other points can be most or all of the remaining points of the scene. For example, in relation to
To summarize, the present implementations can accurately identify three-dimensional (3-D) locations of a few features or regions in the scene using a first technique. The identified three-dimensional locations can be utilized to initialize another technique that can accurately determine 3-D locations of a remainder of the scene.
Features can be detected within the first and second stereo images at block 704. Feature detection algorithms can be utilized to determine which pixels captured features. Some algorithms can even operate at a sub-pixel level and determine which portions of pixels captured features.
A disparity map can be computed of corresponding pixels that captured the features in the first and second stereo images at block 706.
Depths of the features can be calculated at block 708. One example is described below. Briefly, when the cameras are calibrated there can be a one-to-one relationship between disparity and depth.
An intensity-based algorithm can be initialized utilizing the feature depths at block 710. An example of an intensity-based algorithm is described below relative to the “Third Method Example”.
Good matching values can be distinguished at block 712. In one case, matching values can be compared to a threshold value. Those matching values that satisfy the threshold can be termed ‘good’ matching values, while those that do not satisfy the threshold can be termed ‘bad’ matching values.
Unlikely disparities can be removed at block 714. The removal can be thought of as a filtration process where unlikely or ‘bad’ matches are removed from further consideration.
The following implementation can operate on a pair of images (e.g., left and right images) to find correspondences between the images. The image pair may be captured either using two IR cameras and/or two visible-light cameras, among other configurations. Some implementations can operate on the assumption that the image pair has been rectified, such that for a pixel at location (x,y) in the left image, the corresponding pixel in the right image lies on the same row, i.e. at location (x+d,y). The technique can estimate disparity “d” for each pixel.
There is a one-to-one relationship between disparity and depth when the cameras are calibrated. Thus, an estimated disparity for each pixel can allow a depth map to be readily computed. This description only estimates a disparity map for the left image. However, it is equally possible to estimate a disparity map for the right image. The disparity d may be an integer, for a direct correspondence between individual pixels, or it may be a floating-point number for increased accuracy.
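For illustration, a minimal sketch (Python/NumPy) of the conversion from disparity to depth follows. It assumes a rectified, calibrated pair with the focal length expressed in pixels and the baseline in the desired depth units; the parameter names are illustrative and not taken from the described implementations.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline):
    """Depth map from a disparity map for a rectified, calibrated pair.

    Uses the standard one-to-one relation Z = f * B / d, where f is the focal
    length in pixels and B is the baseline between the two camera centers.
    The magnitude of the disparity is used so either sign convention works;
    zero disparities (points at infinity) come back as NaN (unknown).
    """
    d = np.abs(np.asarray(disparity, dtype=np.float64))
    depth = np.full(d.shape, np.nan)
    valid = d > 0
    depth[valid] = focal_px * baseline / d[valid]
    return depth
```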
For purposes of discussion the left image is referred to as IL. The sparse location-based matching technique can estimate a disparity map D for this image. The right image is referred to as IR. A disparity D(x,y) can mean that the pixel IL(x, y) in the left image corresponds to the point in the right image IR(x+D(x, y),y).
An example intensity-based algorithm is manifest as the PatchMatch Stereo algorithm. An intensity-based algorithm can be thought of as being dense in that it can provide a depth map for most or all of the pixels in a pair of images. The PatchMatch Stereo algorithm can include three main stages: initialization, propagation, and filtering. In broad terms, the initialization stage can assign an initial disparity value to each pixel in the left image. The propagation stage can attempt to discover which of these initial values are “good”, and propagate that information to neighboring pixels that did not receive a good initialization. The filtering stage can remove unlikely disparities and label those pixels as “unknown”, rather than pollute the output with poor estimates.
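For orientation, the following is a minimal sketch (in Python, assuming rectified grayscale images as 2-D float arrays) of how the three stages might be wired together. The helper names splat_initialize, census_cost, propagate_pass, and filter_small_regions are illustrative stand-ins for the stages sketched later in this description, not functions defined by the implementations themselves.

```python
def patchmatch_stereo_sketch(left, right, triples, d_min, d_max, num_passes=2):
    """Guided initialization, propagation, and filtering, in that order.

    left, right  : rectified grayscale images (2-D float arrays).
    triples      : (x_l, y_l, d) position-disparity triples from the sparse stage.
    d_min, d_max : manual disparity limits used where no sparse point is nearby.
    """
    height, width = left.shape
    D = splat_initialize(triples, height, width, r=5, d_min=d_min, d_max=d_max)
    cost = lambda x, y, d: census_cost(left, right, x, y, d)
    for _ in range(num_passes):
        # Forward passes only in this sketch; the described implementations
        # alternate propagation directions between passes.
        D = propagate_pass(D, cost)
    return filter_small_regions(D)
```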
The PatchMatch Stereo algorithm can begin by assigning each pixel an initial disparity. The initial disparity can be chosen between some manually specified limits dmin and dmax, which correspond to the (potentially) minimum and maximum depths in the scene.
The present implementation can leverage an approximate initial estimate of the 3-D scene, in the form of a sparse set of 3-D points, to provide a good initialization. These 3-D points can be estimated by, among other techniques, projecting a random dot pattern on the scene. The scene can be captured with a pair of infra-red cameras. Dots can be detected in the images from the pair of infra-red cameras and matched between images. These points can be accurate, reliable, and can be computed very fast. However, they are relatively “sparse”, in the sense that they do not appear at many pixel locations in the image. For instance, these points tend to occupy less than half of the total pixels in the images and in some implementations, these points tend to occupy less than 20 percent or even less than 10 percent of the total pixels.
The description above can serve to match the IR dots and compute their 3-D positions. Each point (e.g., dot) can be projected into the two images IL and IR, to obtain a reliable estimate of the disparity of any pixel containing a point. A naive approach could involve simply projecting each 3-D point (X_i, Y_i, Z_i) to its locations (x_i^L, y_i^L) and (x_i^R, y_i^R) in the two images to compute its disparity d_i = x_i^R − x_i^L and set D(x_i^L, y_i^L) = d_i. However, not every pixel will contain a point, and some pixels may contain more than one point. In these cases, the points could either provide no information or conflicting information about a pixel's disparity.
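The following is a minimal sketch of this naive assignment (Python/NumPy). It assumes a rectified pair with focal length focal_px (in pixels), principal point (cx, cy), and the right camera offset from the left by baseline along the x-axis; these parameter names are illustrative rather than taken from the described implementations.

```python
import numpy as np

def naive_sparse_disparity(points_3d, focal_px, cx, cy, baseline, height, width):
    """Naive assignment of sparse disparities into a left-image disparity map.

    points_3d : iterable of (X, Y, Z) points in the left camera's frame, with
                the right camera a distance `baseline` to the right.
    Pixels hit by no point stay NaN; pixels hit by several points simply keep
    the last disparity written, which is exactly the conflicting-information
    problem noted above.
    """
    D = np.full((height, width), np.nan)
    for X, Y, Z in points_3d:
        if Z <= 0:
            continue
        x_l = focal_px * X / Z + cx                 # projection into the left image
        y_l = focal_px * Y / Z + cy
        x_r = focal_px * (X - baseline) / Z + cx    # projection into the right image
        d = x_r - x_l                               # d_i = x_i^R - x_i^L as above
        col, row = int(round(x_l)), int(round(y_l))
        if 0 <= col < width and 0 <= row < height:
            D[row, col] = d
    return D
```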
The present implementation can retain the random initialization approach of the original PatchMatch Stereo algorithm, but guided by a sparse 3-D point cloud. For each pixel (x, y) in the left image, the implementation can look to see if any 3-D points lie in a small square window (e.g., patch) around the pixel, and collect their disparities into a set S_i for that pixel.
This initialization can begin by projecting all the 3-D points to their locations in the images. For each 3-D point (X_i, Y_i, Z_i), the corresponding position-disparity triple (x_i^L, y_i^L, d_i) can be obtained. Various methods can be utilized to perform the pixel initializations. Two method examples are described below. The first method can store the list of position-disparity triples in a spatial data structure that allows fast retrieval of points based on their 2-D location. Initializing a pixel (x, y) can involve querying the data structure for all the points in the square window around (x, y), and forming the set S_i from the query results. The second method can create two images in which to hold the minimum and maximum disparity for each pixel. These values are denoted as Dmin and Dmax. All pixels can be initialized in Dmin to a large positive number, and all pixels in Dmax to a large negative number. The method can iterate over the list of position-disparity triples. For each item (x_i^L, y_i^L, d_i), the method can scan over each pixel (x_j, y_j) in the square window around (x_i^L, y_i^L), setting Dmin(x_j, y_j) = min(Dmin(x_j, y_j), d_i) and Dmax(x_j, y_j) = max(Dmax(x_j, y_j), d_i). This essentially “splats” each point into image space. Then, to initialize a disparity D(x, y), the method can sample a random value between Dmin(x, y) and Dmax(x, y). If no points were projected nearby, then Dmin(x, y) > Dmax(x, y), and sampling can be performed between dmin and dmax.
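A minimal sketch of the second (splatting) method follows, in Python with NumPy. The window half-width r and the particular sentinel values for Dmin and Dmax are assumptions made for illustration.

```python
import numpy as np

def splat_initialize(triples, height, width, r, d_min, d_max, rng=None):
    """Guided random initialization via disparity splatting.

    triples      : iterable of (x_l, y_l, d) position-disparity triples obtained
                   by projecting the sparse 3-D points into the left image.
    r            : half-width of the square splatting window.
    d_min, d_max : manual disparity limits used where no point was splatted.
    """
    rng = rng or np.random.default_rng()
    big = 1e9
    Dmin = np.full((height, width), big)     # large positive number
    Dmax = np.full((height, width), -big)    # large negative number

    # "Splat" every point into all pixels of the window around it.
    for x_l, y_l, d in triples:
        x0, x1 = max(0, int(x_l) - r), min(width, int(x_l) + r + 1)
        y0, y1 = max(0, int(y_l) - r), min(height, int(y_l) + r + 1)
        Dmin[y0:y1, x0:x1] = np.minimum(Dmin[y0:y1, x0:x1], d)
        Dmax[y0:y1, x0:x1] = np.maximum(Dmax[y0:y1, x0:x1], d)

    # Sample each pixel's initial disparity from its local [Dmin, Dmax] range,
    # falling back to the global [d_min, d_max] range where no point landed.
    D = rng.uniform(d_min, d_max, size=(height, width))
    guided = Dmin <= Dmax
    D[guided] = Dmin[guided] + (Dmax[guided] - Dmin[guided]) * rng.uniform(
        0.0, 1.0, size=int(guided.sum()))
    return D
```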
After initializing each pixel with a disparity, the PatchMatch Stereo algorithm can perform a series of propagation steps, which aim to spread “good” disparities from pixels to their neighbors, over-writing “bad” disparities in the process. The general design of a propagation stage is that for each pixel, the method can examine some set of (spatially and temporally) neighboring pixels, and consider whether to take one of their disparities or keep the current disparity. The decision of which disparity to keep is made based on a photo-consistency check, and the choice of which neighbors to look at is a design decision. The propagation is performed in such an order that when the method processes a pixel and examines its neighbors, those neighbors have already been processed.
Concretely, when processing a pixel (x, y), the method can begin by evaluating the photo-consistency cost of the pixel's current disparity D(x, y). The photo-consistency cost function C(x, y, d) for a disparity d at pixel (x, y) can return a small value if IL(x, y) has a similar appearance to IR(x+d, y), and a large value if not. The method can then look at some set of neighbors N, and for each pixel (x_n, y_n) in N, compute C(x, y, D(x_n, y_n)) and set D(x, y) = D(x_n, y_n) if C(x, y, D(x_n, y_n)) < C(x, y, D(x, y)). Note that the method is computing the photo-consistency cost of D(x_n, y_n) at (x, y), which is different from the photo-consistency cost of D(x_n, y_n) at (x_n, y_n). Pseudo-code for the propagation passes performed by some method implementations is given in Listing 2.
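A minimal single-pass sketch of this propagation follows (Listing 2 itself is not reproduced here). The callable photo_cost stands in for C(x, y, d), and the neighborhood is limited to the left and upper neighbors, which is one common choice for a left-to-right, top-to-bottom ordering.

```python
def propagate_pass(D, photo_cost):
    """One forward propagation pass over a disparity map.

    Pixels are visited left-to-right, top-to-bottom, so a pixel's left and
    upper neighbors have already been processed when it is visited. A pixel
    adopts a neighbor's disparity if that disparity scores a lower
    photo-consistency cost at this pixel.

    D          : 2-D array of current disparities, modified in place.
    photo_cost : callable C(x, y, d) returning the matching cost of disparity d
                 at left-image pixel (x, y).
    """
    height, width = D.shape
    for y in range(height):
        for x in range(width):
            best_cost = photo_cost(x, y, D[y, x])
            for xn, yn in ((x - 1, y), (x, y - 1)):      # left and upper neighbors
                if 0 <= xn < width and 0 <= yn < height:
                    d_n = D[yn, xn]
                    c = photo_cost(x, y, d_n)            # neighbor's disparity
                    if c < best_cost:                    # evaluated at (x, y)
                        best_cost = c
                        D[y, x] = d_n
    return D
```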
A disparity ranking technique can be utilized to compare multiple possible disparities for a pixel and decide which is “better” and/or “best”. As in most intensity-based stereo matching, this can be done using a photo-consistency cost, which compares pixels in the left image to pixels in the right image, and awards a low cost when they are similar and a high cost when they are dissimilar. The (potentially) simplest photo-consistency cost can be to take the absolute difference between the colors of the two images at the points being matched, i.e. |IL(x,y)−IR(x+D(x,y),y)|. However, this is not robust, and may not take advantage of local texture, which may help to disambiguate pixels with similar colors.
Instead, another approach involves comparing small image patches centered on the two points. The width w of the patch can be set manually. This particular implementation utilizes a width of 11 pixels, which can provide a good balance of speed and quality. Other implementations can utilize less than 11 pixels or more than 11 pixels. There are many possible cost functions for comparing image patches. Three examples can include sum of squared differences (SSD), normalized cross-correlation (NCC), and Census. These cost functions can perform a single scan over the window, accumulating comparisons of the individual pixels, and then use these values to compute a single cost. One implementation uses Census, which compares each pixel in a patch against the patch's center pixel and scores two patches by how many of these brighter/darker comparison bits disagree.
There are two final details to note regarding the photo-consistency score/patch comparisons relative to at least some implementations. First, not every pixel in the patch may be used. For speed, some implementations can skip every other column of the patch. This can reduce the number of pixel comparisons by half without reducing the quality substantially. In this case, x_j iterates over the values {x−r, x−r+2, . . . , x+r−2, x+r}. Second, disparities D(x, y) need not be integer-valued. In this case, an image value IR(x_j + D(x, y), y_j) is not simply accessed in memory, but is interpolated from neighboring pixels using bilinear interpolation. This sub-pixel disparity increases the accuracy of the final depth estimate.
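The following sketch (Python/NumPy, assuming rectified grayscale images as 2-D float arrays) illustrates a Census-style cost with the column skipping and bilinear interpolation described above. The patch half-width r = 5 (width 11) and the exact comparison rule are assumptions for illustration rather than the implementations' exact cost.

```python
import numpy as np

def census_cost(left, right, x, y, d, r=5):
    """Census-style photo-consistency cost C(x, y, d).

    Each patch pixel is compared against its patch center; the cost is the
    number of disagreements (a Hamming distance) between the left patch at
    (x, y) and the right patch at (x + d, y). Every other column is skipped
    for speed, and the right image is sampled with bilinear interpolation so
    d may be non-integer.
    """
    def sample_right(xf, yi):
        # Bilinear interpolation along x (rows are aligned after rectification).
        x0 = int(np.floor(xf))
        x0 = min(max(x0, 0), right.shape[1] - 2)
        a = min(max(xf - x0, 0.0), 1.0)
        return (1 - a) * right[yi, x0] + a * right[yi, x0 + 1]

    h, w = left.shape
    center_l = left[y, x]
    center_r = sample_right(x + d, y)
    cost = 0
    for yj in range(y - r, y + r + 1):
        if not 0 <= yj < h:
            continue
        for xj in range(x - r, x + r + 1, 2):      # skip every other column
            if not 0 <= xj < w:
                continue
            bit_l = left[yj, xj] < center_l
            bit_r = sample_right(xj + d, yj) < center_r
            cost += int(bit_l != bit_r)            # accumulate census disagreements
    return cost
```

With the guided initialization above, such a cost could be handed to the propagation sketch as photo_cost = lambda x, y, d: census_cost(left, right, x, y, d).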
When processing a video sequence, the disparity for a pixel at one frame may provide a good estimate for the disparity at that pixel in the next frame. Thus at frame t, the propagation stage can begin with a temporal propagation that can consider the disparity from the previous frame Dt-1(x,y) and can take this disparity if it offers a lower photo-consistency cost. In practice, when a single array is used to hold the disparity map for all frames, the temporal propagation can be swapped with the initialization. In this way, all pixels can start with their disparity from the previous frame. The photo-consistency cost of a random disparity can be computed for each pixel. The photo-consistency cost can be utilized if it has a lower cost than the temporally-propagated disparity. Pseudo-code for this is given in Listing 2, under PropagateTemporal.
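As a rough sketch of this variant, the temporal propagation can be folded into the initialization as follows; here the random competitor is drawn from the global disparity limits, although a guided sample from the splatted ranges could be used instead.

```python
import numpy as np

def temporal_initialize(D_prev, photo_cost, d_min, d_max, rng=None):
    """Start every pixel from its disparity in the previous frame, and keep a
    fresh random disparity only where that random guess has a lower
    photo-consistency cost, as described for the video case above.
    """
    rng = rng or np.random.default_rng()
    height, width = D_prev.shape
    D = D_prev.copy()
    for y in range(height):
        for x in range(width):
            d_rand = rng.uniform(d_min, d_max)
            if photo_cost(x, y, d_rand) < photo_cost(x, y, D[y, x]):
                D[y, x] = d_rand
    return D
```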
Following the temporal propagation, the method can perform several passes of spatial propagation. In some variations of the PatchMatch Stereo algorithm, two spatial propagation passes are performed, using two different neighborhoods with two corresponding pixel orderings. The neighborhoods are shown in
Stated another way, in an instance where the images are frames of video, a parallel propagation scheme can be employed on the video frames. In one case, the parallel propagation scheme can entail propagation from left to right and top to bottom in parallel for even video frames followed by temporal propagation and propagation from right to left and bottom to top in parallel for odd video frames. Of course, other configurations are contemplated.
Some implementations of PatchMatch Stereo can run on the graphics processing unit (GPU) and/or the central processing unit (CPU). Briefly, GPUs tend to perform relatively more parallel processing and CPUs tend to perform relatively more sequential processing. In the GPU implementation of PatchMatch Stereo, different neighborhoods/orderings can be used to take advantage of the parallel processing capabilities of the GPU. In this implementation, four neighborhoods are defined, each consisting of a single pixel, as shown in
After the propagation, each pixel will have considered several possible disparities, and retained the one which gave the better/best photo-consistency between left and right images. In general, the more different disparities a pixel considers, the greater its chances of selecting an accurate disparity. Thus, it can be attractive to consider testing additional disparities, for example when testing a disparity d also testing d±0.25, d±0.5, d±1. These additional comparisons can be time-consuming to compute however.
On the GPU, the most expensive part of computing a photo-consistency cost can be accessing the pixel values in the right image. For every additional disparity d′ that is considered at a pixel (x,y), the method can potentially access all the pixels in the window around IR(x+d′,y). This aspect can make processing time linear in the number of disparities considered. However, if additional comparisons are strategically selected that do not incur any additional pixel accesses, they will be very cheap and may improve the quality. One GPU implementation can cache a section of the left image in groupshared memory as all of the threads move across it in parallel. As a result, it can remain expensive for a thread to access additional windows in the right image, but becomes cheap to access additional windows in the left image. Thus, a thread whose main task is to compute C(x,y,d), can also cheaply compute C(x−1,y,d+1), C(x+1,y,d−1) etc. and then “propose” them back to the appropriate threads via an additional block of groupshared memory.
The final stage in the PatchMatch Stereo algorithm can be filtering to remove spurious regions that do not represent real scene content. This is based on a simple region labeling algorithm, followed by a threshold to remove regions below a certain size. A disparity threshold td can be defined. Any two neighboring pixels belong to the same region if their disparities differ by less than td, i.e., pixels (x1, y1) and (x2, y2) belong to the same region if |D(x1, y1)−D(x2, y2)| < td. In some implementations, td = 2. This definition can enable all regions to be extracted; regions smaller than 200 pixels can then be discarded, setting their disparity to “unknown”.
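A minimal sketch of this filtering stage follows (Python/NumPy), assuming 4-connectivity and using NaN to represent an “unknown” disparity; the thresholds td = 2 and 200 pixels come from the description above, while the flood-fill labeling is one straightforward way to realize the region labeling.

```python
import numpy as np
from collections import deque

def filter_small_regions(D, td=2.0, min_size=200):
    """Group 4-connected pixels whose disparities differ by less than td into
    regions, then mark every region smaller than min_size pixels as unknown
    (NaN), as in the filtering stage described above.
    """
    height, width = D.shape
    labels = np.full((height, width), -1, dtype=np.int64)
    next_label = 0
    for sy in range(height):
        for sx in range(width):
            if labels[sy, sx] != -1 or np.isnan(D[sy, sx]):
                continue
            # Flood fill one region starting at (sx, sy).
            region = [(sx, sy)]
            labels[sy, sx] = next_label
            queue = deque(region)
            while queue:
                x, y = queue.popleft()
                for xn, yn in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
                    if (0 <= xn < width and 0 <= yn < height
                            and labels[yn, xn] == -1
                            and not np.isnan(D[yn, xn])
                            and abs(D[y, x] - D[yn, xn]) < td):
                        labels[yn, xn] = next_label
                        region.append((xn, yn))
                        queue.append((xn, yn))
            if len(region) < min_size:
                for x, y in region:
                    D[y, x] = np.nan       # discard small spurious regions
            next_label += 1
    return D
```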
In summary, the described concepts can employ a two-stage stereo technique where the first stage computes depth at highly accurate but sparse feature locations. The second stage computes a dense depth map using the first stage as initialization. This can improve accuracy and robustness of the dense depth map.
The methods described above can be performed by the systems and/or devices described above relative to
Although techniques, methods, devices, systems, etc., pertaining to stereo imaging are described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as exemplary forms of implementing the claimed methods, devices, systems, etc.