1. Field of the Invention
The present invention relates to artificial or computer vision systems, e.g., vehicular vision systems. In particular, this invention relates to a method and apparatus for detecting objects in a manner that facilitates collision avoidance.
2. Description of the Related Art
Collision avoidance systems utilize a sensor system for detecting objects in front of an automobile or other form of vehicle or platform. In general, a platform can be any of a wide range of bases, including a boat, a plane, an elevator, or even a stationary dock or floor. The sensor system may include radar, an infrared sensor, or another detector. In any event, the sensor system generates a rudimentary image of the scene in front of the vehicle. By processing that imagery, objects can be detected. Collision avoidance systems generally use multiple resolution disparity images in conjunction with a single depth image. Because a multiple resolution disparity image may contain points that correspond to different resolution levels, the single depth image may not correspond smoothly with each multiple resolution disparity image.
Therefore, there is a need in the art for a method and apparatus that provides depth images at multiple resolutions.
The present invention describes a method and apparatus for detecting a target in an image. In one embodiment, a plurality of depth images is provided. A plurality of target templates is compared to at least one of the plurality of depth images. A scores image is generated based on the plurality of target templates and the at least one depth image.
So that the manner in which the above recited features of the present invention are attained and can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to the embodiments thereof which are illustrated in the appended drawings.
It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
The present invention discloses, in one embodiment, a method and apparatus for classifying an object in a region of interest based on one or more features of the object. Detection and classification of pedestrians, vehicles, and other objects are important, e.g., for automotive safety devices, since these devices may deploy in a particular fashion only if a target of the particular type (i.e., a pedestrian or a car) is about to be impacted. In particular, measures employed to mitigate the injury to a pedestrian may be very different from those employed to mitigate damage and injury from a vehicle-to-vehicle collision.
The field of view in a practical object detection system 102 may be ±12 meters horizontally in front of the vehicle 100 (e.g., approximately 3 traffic lanes), with a ±3 meter vertical area, and have a view depth of approximately 12-40 meters. (Other fields of view and ranges are possible, depending on camera optics and the particular application.) Therefore, it should be understood that the present invention can be used in a pedestrian detection system or as part of a collision avoidance system.
Still referring to FIG. 2, the processed images from the image preprocessor 206 are coupled to the CPU 210. The CPU 210 may comprise any one of a number of presently available high-speed microcontrollers or microprocessors. CPU 210 is supported by support circuits 208 that are generally well known in the art. These circuits include cache, power supplies, clock circuits, input-output circuitry, and the like. Memory 212 is also coupled to CPU 210. Memory 212 stores certain software routines that are retrieved from a storage medium, e.g., an optical disk, and that are executed by CPU 210 to facilitate operation of the present invention. Memory 212 also stores certain databases 214 of information that are used by the present invention, and image processing software 216 that is used to process the imagery from the sensor array 106. Although the present invention is described in the context of a series of method steps, the method may be performed in hardware, software, or some combination of hardware and software (e.g., an ASIC). Additionally, the methods as disclosed can be stored on a computer-readable medium.
For both hardware and practical reasons, creating disparity images having different resolutions is beneficial when detecting objects. Calibration provides a reference point and direction from which all distances and angles are determined. Each of the disparity images contains the point-wise displacement from the left image to the right image, and each corresponds to a different image resolution. The greater the computed disparity of an imaged object, the closer the object is to the sensor array.
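For illustration only, the disparity-to-depth relationship just described can be sketched as follows, assuming a rectified stereo pair with a known focal length (in pixels) and baseline; the function and parameter names are illustrative and not part of the specification:

    import numpy as np

    def disparity_to_depth(disparity, focal_length_px, baseline_m, min_disparity=0.5):
        """Convert a disparity image (pixels) to a depth image (meters).

        Larger disparities map to smaller depths, i.e., closer objects,
        consistent with the relationship described above.
        """
        disparity = np.asarray(disparity, dtype=np.float64)
        depth = np.full(disparity.shape, np.inf)
        valid = disparity > min_disparity  # guard against division by zero
        depth[valid] = focal_length_px * baseline_m / disparity[valid]
        return depth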
The depth map generator 302 processes the multi-resolution disparity images into a two-dimensional depth image for each of the multi-resolution disparity images. In one embodiment, each depth image is provided using calibration parameters from preprocessor 300. Each depth image (also referred to as a depth map) contains image points or pixels in a two dimensional array, where each point represents a specific distance from the sensor array to a point within scene 104. A depth image at a selected resolution is then processed by the target processor 304 wherein templates (models) of typical objects encountered by the vision system are compared to the information within the depth image. As described below, the template database 306 comprises templates of objects (e.g., automobiles, pedestrians) located at various locations and poses with respect to the sensor array.
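As a minimal sketch of how a depth image might be produced for each resolution level of the multi-resolution disparity images (the names, level count, and downsampling scheme are assumptions, not the specification's implementation):

    import numpy as np

    def build_depth_pyramid(disparity_level0, focal_length_px, baseline_m, levels=3):
        """Return depth images for level 0 (finest) through level levels-1 (coarsest)."""
        pyramid = []
        disparity = np.asarray(disparity_level0, dtype=np.float64)
        focal = float(focal_length_px)
        for _ in range(levels):
            depth = np.full(disparity.shape, np.inf)
            valid = disparity > 0.5  # ignore near-zero disparities
            depth[valid] = focal * baseline_m / disparity[valid]
            pyramid.append(depth)
            # Halving the image also halves disparities and the pixel focal
            # length, so depth values stay metrically consistent across levels.
            disparity = disparity[::2, ::2] / 2.0
            focal /= 2.0
        return pyramid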
An exhaustive search of the template database may be performed to identify the set of templates that most closely explain the present depth image. Secondary sensor 204 may provide additional information regarding the position of the object relative to vehicle 100, the velocity of the object, the size or angular width of the object, etc., such that the target template search process can be limited to templates of objects at about the known position relative to vehicle 100. Thus, the three-dimensional search space may be limited using secondary sensor 204. Target cueing provided by secondary sensor 204 speeds up processing by limiting the search space to the immediate area of the cued location (e.g., the area indicated by secondary sensor 204) and also improves robustness by eliminating false targets that might otherwise have been considered. If the secondary sensor is a radar sensor, the sensor can, for example, provide an estimate of both object position and distance.
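One way the cue-limited search might be sketched: the secondary-sensor estimate restricts the set of hypothesized locations that must be rendered and compared (the function name and margin value are assumptions):

    def limit_search_space(grid_locations, cue_position, margin_m=2.0):
        """Keep only hypothesized (x, z) grid locations near a secondary-sensor cue.

        grid_locations: iterable of (lateral_m, longitudinal_m) hypotheses.
        cue_position:   (lateral_m, longitudinal_m) estimate from, e.g., radar.
        """
        cx, cz = cue_position
        return [(x, z) for (x, z) in grid_locations
                if abs(x - cx) <= margin_m and abs(z - cz) <= margin_m]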
Target processor 304 produces a target list that is then used to identify target size and classification estimates that enable target tracking and the identification of each target's position, classification and velocity within the scene. That information may then be used to avoid collisions with each target or perform pre-crash alterations to the vehicle to mitigate or eliminate damage (e.g., lower or raise the vehicle, deploy air bags, and the like).
In step 415, a plurality of target templates is compared to at least one of the plurality of depth images. The plurality of target templates, e.g., “block” templates, may be three-dimensional renderings of vehicle templates, human templates, or templates of other objects. The block templates are rendered at each hypothesized target location within a two-dimensional multiple-lane grid. Previous systems limited detection of target vehicles to a one-dimensional (i.e., single-lane) region adjacent to and behind a host vehicle. The two-dimensional multiple-lane grid of the present invention is tessellated at ¼ meter by ¼ meter resolution in front of a host, e.g., vehicle 100. In other words, at every point in a ¼ meter grid, a three-dimensional pre-rendered template, e.g., a vehicle template, human template, or other object template, is provided at that location. Each of the pre-rendered templates is then compared to the actual depth image at a particular resolution level. The hypothesized target locations may be determined from the multi-resolution disparity images alone or in conjunction with target cueing information from secondary sensor 204. Multiple resolution depth images are desirable because of the camera and lens distortions that perspective projection introduces for points closer to the camera. These distortions are easier to handle at a coarse resolution. In addition, targets that are further from the camera appear smaller in the camera's images, and thus appear smaller in the multiple resolution depth images, than targets that are closer to the camera. Finer resolution depth images are therefore generally better able to detect targets that are further from the camera.
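For illustration, the hypothesis grid described above might be generated as follows, using the ¼ meter tessellation stated in this paragraph and the field-of-view bounds given earlier (the function name and default values are illustrative):

    import numpy as np

    def hypothesis_grid(lateral_half_width_m=12.0, min_range_m=12.0,
                        max_range_m=40.0, step_m=0.25):
        """Return an (N, 2) array of (lateral, longitudinal) target-location
        hypotheses on a 1/4 meter grid in front of the host."""
        xs = np.arange(-lateral_half_width_m, lateral_half_width_m + step_m, step_m)
        zs = np.arange(min_range_m, max_range_m + step_m, step_m)
        xx, zz = np.meshgrid(xs, zs)
        return np.column_stack([xx.ravel(), zz.ravel()])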
In one embodiment, a level-2 depth image, e.g., a depth image at a coarse resolution, is used for distances less than or equal to 18 meters and a level-1 depth image is used for distances greater than 18 meters, when searching for vehicles. In one embodiment, the cut-off for level-2 and level-1 depth images may be 12 meters instead of 18 meters, when searching for people. In another embodiment, a level-0 depth image may be used to search for people at distances greater than 30 meters.
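The distance-based level selection described in this paragraph could be expressed as the following sketch, using the 18 meter, 12 meter, and 30 meter cut-offs stated above (the function name and class labels are assumptions):

    def select_depth_level(distance_m, target_class):
        """Pick a depth-image resolution level for a hypothesized target distance.

        Level 2 is the coarsest resolution, level 0 the finest.
        """
        if target_class == "vehicle":
            return 2 if distance_m <= 18.0 else 1
        if target_class == "pedestrian":
            if distance_m <= 12.0:
                return 2
            if distance_m <= 30.0:
                return 1
            return 0
        raise ValueError(f"unknown target class: {target_class}")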
In an illustrative example, vehicle detection may be necessary at a distance of 10 meters from host 100. Pre-rendered templates of hypothesized vehicles are provided within a two-dimensional multi-lane grid tessellated at ¼ meter by ¼ meter resolution in front of host 100. The pre-rendered templates are compared to a level-2 depth image since the distance from vehicle 100 is less than 18 meters.
In step 420, a “scores” image based on the plurality of target templates and the at least one depth image is generated. Creating the “scores” image involves searching a template database to match target templates to the depth map. The template database comprises a plurality of pre-rendered templates for targets such as vehicles and pedestrians, e.g., depth models of these objects as they would typically be computed by the stereo depth map generator 302. The depth image is a two-dimensional digital image, where each pixel expresses the depth of a visible point in the scene 104 with respect to a known reference coordinate system. As such, the mapping between pixels and corresponding scene points is known. In one embodiment, the template database is populated with multiple vehicle and pedestrian depth models.
A depth model based search is then employed, wherein the search is defined by a set of possible location-pose pairs for each model class (e.g., vehicle or pedestrian). For each such pair, the hypothesized 3-D model is rendered and compared with the observed scene 104 range image via a similarity metric. This process creates a “scores” image with dimensionality equal to that of the search space, where each axis represents a model state parameter, such as, but not limited to, lateral or longitudinal distance, and each pixel value expresses a relative measure of the likelihood that a target exists in the scene within the specific parameters. Generally, at this point an exhaustive search is performed wherein the template database is accessed and the templates stored therein are matched to the depth map.
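A sketch of the scores-image generation loop the preceding paragraphs describe: each (location, pose) hypothesis is rendered, compared to the observed depth image via a similarity metric, and the result written into the scores image (the rendering and metric callables are placeholders, not the specification's implementation):

    import numpy as np

    def build_scores_image(depth_image, hypotheses, render_template, similarity):
        """Fill a scores image over a grid of (location, pose) hypotheses.

        hypotheses:      nested list; hypotheses[i][j] is one (x, z, pose) tuple.
        render_template: placeholder for the template database's pre-rendered
                         depth models; returns a template patch and its pixel window.
        similarity:      metric comparing the template patch to the observed patch.
        """
        scores = np.zeros((len(hypotheses), len(hypotheses[0])))
        for i, row in enumerate(hypotheses):
            for j, hyp in enumerate(row):
                template_patch, rows, cols = render_template(hyp)
                scores[i, j] = similarity(template_patch, depth_image[rows, cols])
        return scores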
Matching itself can be performed by determining a difference between each of the pixels in the depth image and each similarly positioned pixel in the target template. If the difference at each pixel is less than a predefined amount, the pixel is deemed a match. Individual pixel matching is then used to compute a template match score assigned to corresponding pixels within a scores image, where the value (score) indicates the probability that the pixel reflects the presence of the operative model (e.g., vehicle, pedestrian, or other target).
The match scores may be derived in a number of ways. In one embodiment, the depth differences at each pixel between the template and the depth image are summed across the entire image and normalized by the total number of pixels in the target template. Without loss of generality, these summed depth differences may be inverted or negated to provide a measure of similarity. Spatial and/or temporal filtering of the match score values can be performed to produce new match scores.
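For illustration, the summed-and-normalized embodiment could be sketched as follows, with negation providing the measure of similarity mentioned above (names are illustrative):

    import numpy as np

    def summed_difference_score(template_depth, observed_depth):
        """Sum the per-pixel depth differences, normalize by the number of
        template pixels, and negate so that larger scores mean closer matches."""
        diff = np.abs(template_depth - observed_depth)
        return -diff.sum() / template_depth.size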
In another embodiment, the comparison (difference) at each pixel can be used to determine a yes or no “vote” for that pixel (e.g., vote yes if the depth difference is less than one meter, otherwise vote no). The yes votes can be summed and normalized by the total number of pixels in the template to form a match score for the image.
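A sketch of this voting embodiment, using the one-meter threshold given as an example above:

    import numpy as np

    def vote_score(template_depth, observed_depth, max_diff_m=1.0):
        """Cast a yes/no vote per pixel (depth difference under the threshold),
        then sum the yes votes and normalize by the template size."""
        votes = np.abs(template_depth - observed_depth) < max_diff_m
        return votes.sum() / template_depth.size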
In another embodiment, the top and bottom halves of the target template are compared to similarly positioned pixels in the depth map. If the difference at each pixel in the top half is less than a predefined amount, such as ¼ meter in the case of a pedestrian template and 1 meter in the case of a vehicle template, the pixel is deemed a first match. The number of pixels deemed a first match is summed and then divided by the total number of pixels in the top half of the target template to produce a first match score. The difference between each pixel in the bottom half of the depth image and each similarly positioned pixel in the bottom half of the target template is then determined. If the difference at each pixel is less than a predefined amount, the pixel is deemed a second match. The total number of pixels deemed a second match is then divided by the total number of pixels in the bottom half of the template to produce a second match score. The first match score and the second match score are then multiplied to determine a final match score.
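A sketch of this two-half embodiment; per the text above, the threshold would be ¼ meter for a pedestrian template or 1 meter for a vehicle template:

    import numpy as np

    def two_half_score(template_depth, observed_depth, max_diff_m):
        """Score the top and bottom halves separately, then multiply the scores."""
        rows = template_depth.shape[0] // 2
        top = np.abs(template_depth[:rows] - observed_depth[:rows]) < max_diff_m
        bottom = np.abs(template_depth[rows:] - observed_depth[rows:]) < max_diff_m
        first_match_score = top.sum() / top.size
        second_match_score = bottom.sum() / bottom.size
        return first_match_score * second_match_score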
The scores image is then used to provide target aggregation from match scores. In one embodiment, a mean-shift algorithm is used to detect and localize specific targets from the scores image.
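As a sketch of mean-shift localization over the scores image (a simple weighted-centroid variant; the window size and iteration cap are assumptions, not necessarily the embodiment's exact algorithm):

    import numpy as np

    def mean_shift_peak(scores, start, window=5, iterations=20):
        """Localize a peak in the scores image by iterating the weighted
        centroid of a window around the current estimate."""
        i, j = start
        half = window // 2
        for _ in range(iterations):
            i0, i1 = max(i - half, 0), min(i + half + 1, scores.shape[0])
            j0, j1 = max(j - half, 0), min(j + half + 1, scores.shape[1])
            patch = scores[i0:i1, j0:j1]
            total = patch.sum()
            if total <= 0:
                break
            ii, jj = np.mgrid[i0:i1, j0:j1]
            ni = int(round((ii * patch).sum() / total))  # shift toward the
            nj = int(round((jj * patch).sum() / total))  # score-weighted mean
            if (ni, nj) == (i, j):
                break  # converged
            i, j = ni, nj
        return i, j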
Once specific targets, e.g., vehicles, humans, and/or other objects, are detected and localized, a target list is generated. In one embodiment, radar validation of detected targets may optionally be performed. The detection of a vision target using radar increases confidence in the original target detection. Using radar guards against “false positives”, i.e., false identification of a target.
Target size and classification may be estimated for each detected target. Depth, depth variance, edge, and texture information may be used to determine target height and width, and classify targets into categories (e.g., sedan, sport utility vehicle (SUV), truck, pedestrian, pole, wall, motorcycle).
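Purely as an illustration of size-based classification, a toy sketch follows; all thresholds are assumptions, and the embodiment described above also draws on depth variance, edge, and texture cues that this sketch omits:

    def classify_by_size(height_m, width_m):
        """Toy size-based classifier over a few of the categories named above."""
        if width_m < 1.0:
            return "pedestrian" if height_m >= 1.2 else "pole"
        if height_m >= 2.5:
            return "truck"
        if height_m >= 1.7:
            return "SUV"
        return "sedan"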
Characteristics (e.g., location, classification, height, width) of targets may be tracked using Kalman filters. Targets that do not track well may be rejected. The position, classification, and velocity of tracked targets may be output to other modules, such as another personal computer (PC) or sensor, using appropriate communication formats.
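For illustration, one predict/update cycle of a constant-velocity Kalman filter tracking a single target coordinate might look as follows (the noise parameters and time step are assumptions):

    import numpy as np

    def kalman_step(x, P, z, dt=0.1, q=1.0, r=0.25):
        """One predict/update cycle of a constant-velocity Kalman filter.

        x: state vector [position, velocity]; P: 2x2 state covariance;
        z: measured position for this frame."""
        F = np.array([[1.0, dt], [0.0, 1.0]])      # constant-velocity motion model
        H = np.array([[1.0, 0.0]])                 # only position is observed
        Q = q * np.array([[dt**3 / 3, dt**2 / 2],  # process noise (assumed)
                          [dt**2 / 2, dt]])
        R = np.array([[r]])                        # measurement noise (assumed)
        # Predict
        x = F @ x
        P = F @ P @ F.T + Q
        # Update
        y = z - H @ x                              # innovation
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)             # Kalman gain
        x = x + K @ y
        P = (np.eye(2) - K @ H) @ P
        return x, P

In practice, one such filter per tracked characteristic would run at the frame rate, and targets whose innovations remain persistently large could be the ones rejected as tracking poorly.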
While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.
This application claims the benefit of U.S. provisional patent application No. 60/549,186, filed Mar. 2, 2004, which is herein incorporated by reference.