1. Field of the Disclosure
The present disclosure relates generally to identifying humans in images and, more particularly, to tracking humans in video images.
2. Description of the Related Art
Crowd management is becoming an urgent global concern. A good understanding of the number of people in a public space and their movement through the public space can provide a baseline for automatic security and protection, as well as facilitate the monitoring and design of public spaces for safety, efficiency, and comfort. Video-based imagery systems may be used in combination with the data generated by surveillance systems to detect, count, or track people. However, reliably detecting and tracking people in a crowd scene remains a difficult problem. For example, occlusions of people (by other people or objects) make it difficult to detect the occluded person and to track the person as they pass in and out of the occlusion. Detection of individuals may also be complicated by factors such as the variable appearance of people due to different body poses or different sizes of individuals, variations in the background due to lighting changes or camera angles, or different accessories such as bags or umbrellas carried by people.
Conventional human detection techniques such as the Histogram of Oriented Gradients (HOG) are designed to detect people in static images based on a distribution of intensity gradients or edge directions in a static image. For example, a static image can be divided into cells, each of which contains a group of pixels. Each cell is characterized by a histogram of intensity gradients at each of the pixels in the cell, which may be referred to as a HOG descriptor for the cell. The HOG descriptors may be referred to as “patch descriptors” because they represent a property of the image at each pixel within a cell corresponding to a “patch” of the image. The HOG descriptors for the cells associated with a static image may then be compared to libraries of models to detect humans in the static image. One significant drawback to patch descriptor techniques such as HOG is that they often fail to detect occluded people (i.e., people who are partially or fully obscured by objects or other people) or people wearing colors that do not contrast sufficiently with the background.
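The per-cell histogram computation described above can be sketched as follows. This is a minimal illustration, not the claimed method: it assumes a grayscale image stored as a two-dimensional NumPy array, and the function name, cell size, and bin count are illustrative choices.

```python
import numpy as np

def hog_cell_histograms(image, cell_size=8, bins=9):
    """Compute a histogram of gradient orientations for each cell.

    `image` is a 2-D array of grayscale intensities. Each cell's
    histogram (its HOG descriptor) accumulates unsigned gradient
    orientations in the range 0-180 degrees, weighted by gradient
    magnitude.
    """
    # Intensity gradients at each pixel (central differences).
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0

    h, w = image.shape
    rows, cols = h // cell_size, w // cell_size
    histograms = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            sl = (slice(r * cell_size, (r + 1) * cell_size),
                  slice(c * cell_size, (c + 1) * cell_size))
            # Weight each pixel's orientation bin by its gradient magnitude.
            hist, _ = np.histogram(orientation[sl], bins=bins,
                                   range=(0.0, 180.0),
                                   weights=magnitude[sl])
            histograms[r, c] = hist
    return histograms
```

The resulting grid of histograms would then be compared against a library of human models, as described above.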
The HOG technique may be combined with a Histogram Of Flow (HOF) technique to track people using optical flow (i.e., the pattern of apparent motion of objects, surfaces, or edges caused by relative motion between the camera and the scene) in a sequence of video images. The HOF technique characterizes each cell in each video image by a histogram of gradients in the optical flow measured at each of the pixels in the cell. Thus, the HOF is also a patch descriptor technique. Relative to the HOG technique alone, combining the HOG technique and the HOF technique may improve the counting accuracy for a sequence of video images. However, detecting and tracking moving people using patch descriptors requires generating patch descriptors for all of the cells in each video image, a level of computational complexity that prevents people from being detected or tracked in real time. Furthermore, HOG, HOF, and other conventional techniques only yield reliable measurements when minimal occlusions occur, e.g., at relatively low densities of people.
The present disclosure may be better understood, and its numerous features and advantages made apparent to those skilled in the art by referencing the accompanying drawings. The use of the same reference symbols in different drawings indicates similar or identical items.
Humans can be detected or tracked in sequences of video images by identifying keypoints in portions of the video images that correspond to humans identified by applying a patch descriptor technique, such as the HOG technique, to the video images. A pixel is determined to be a keypoint if at least a threshold percentage of pixels within a radius of the pixel are brighter or darker than the pixel. The portions of the video images may be represented as bounding boxes that encompass a portion of the video image. Sets of keypoints identified within the bounding boxes in pairs of video images are compared to each other and associated with the same human if a matching criterion is satisfied. The characteristics of the keypoints may be represented by descriptors such as a binary descriptor or a vector of integers. For example, two keypoints may match if a statistical measure of the differences between binary descriptors for the two keypoints, such as a Hamming distance, is less than a threshold value. For another example, two keypoints may match if a statistical measure of the difference between vectors of integers that represent the two keypoints is less than a threshold value. Bounding boxes in the pairs of video images are determined to represent the same human if a percentage of matching keypoints in the two bounding boxes exceeds a threshold percentage. In some embodiments, keypoints in different bounding boxes may be filtered based on a motion vector determined based on the locations of the bounding boxes in the video images. Keypoints associated with motion vectors that exceed a threshold magnitude may not be compared. A motion history, including directions and speeds, can then be calculated for each human identified in the video images. The motion history may be used to predict future locations of the humans identified in the video images.
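The keypoint criterion described above (a pixel qualifies if at least a threshold percentage of the pixels within a radius are all brighter, or all darker, than it) can be sketched as follows. This is an illustrative assumption-laden sketch: the function name, the intensity margin, and the default fractions are not drawn from the disclosure.

```python
import numpy as np

def is_keypoint(image, y, x, radius=3, intensity_delta=10, min_fraction=0.75):
    """Return True if at least `min_fraction` of the pixels within
    `radius` of (y, x) are all brighter, or all darker, than the
    candidate pixel by more than `intensity_delta`."""
    center = float(image[y, x])
    neighbors = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            # Only consider pixels inside the circular neighborhood.
            if dy * dy + dx * dx <= radius * radius:
                ny, nx = y + dy, x + dx
                if 0 <= ny < image.shape[0] and 0 <= nx < image.shape[1]:
                    neighbors.append(float(image[ny, nx]))
    if not neighbors:
        return False
    neighbors = np.array(neighbors)
    brighter = np.mean(neighbors > center + intensity_delta)
    darker = np.mean(neighbors < center - intensity_delta)
    return bool(brighter >= min_fraction or darker >= min_fraction)
```

In practice such a test would be evaluated only for pixels inside the bounding boxes produced by the patch descriptor technique, rather than over the whole video image.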
A patch descriptor technique may be applied to the foreground images 301-303 to identify portions of the foreground images 301-303 that include the humans 105-108. Some embodiments may apply a patch descriptor technique such as a histogram-of-gradients (HOG) technique to define bounding boxes 305, 306, 307, 308 (collectively referred to as “the bounding boxes 305-308”) that define the portions of the foreground image 301 that include the corresponding humans 105-108. For example, the bounding boxes 305-308 may be defined by dividing the foreground image 301 into small connected regions, called cells, and compiling a histogram of gradient directions or edge orientations for the pixels within each cell. The combination of these histograms represents a HOG descriptor, which can be compared to public or proprietary libraries of models of HOG descriptors for humans to identify the bounding boxes 305-308.
Patch descriptor techniques such as the HOG technique may effectively identify the bounding boxes 305-308 for the humans 105-108 in the static foreground image 301. However, patch descriptor techniques may fail to detect humans when occlusion occurs or when the color of people's clothes is similar to the background. For example, the patch descriptor technique may identify the bounding boxes 315, 316, 317 for the fully visible humans 105-107 but may fail to identify a bounding box for the occluded human 108 in the foreground image 302. The human 108 is no longer occluded in the foreground image 303 and so the patch descriptor technique identifies the bounding boxes 325, 326, 327, 328 for the humans 105-108 in the foreground image 303. Although the patch descriptor techniques may identify the bounding boxes 305-308, 315-317, and 325-328 in the foreground images 301-303, the patch descriptor techniques only operate on the static foreground images 301-303 separately and do not associate the bounding boxes with humans across the foreground images 301-303. For example, the patch descriptor techniques do not recognize that the same human 105 is in the bounding boxes 305, 315, 325.
Some embodiments of the keypoints 405, 410 may be represented as binary descriptors that describe an intensity pattern in a predetermined area surrounding the keypoints 405, 410. For example, the keypoint 405 may be described using a binary descriptor that includes a string of 512 bits that indicate the relative intensity values for 512 pairs of points in a sampling pattern that samples locations within the predetermined area around the keypoint 405. A bit in the binary descriptor is set to “1” if the intensity value at the first point in the pair is larger than the second point and is set to “0” if the intensity value at the first point is smaller than the second point. In other embodiments, the keypoints 405, 410 may be represented as a vector of integers that describe an intensity pattern in a predetermined area surrounding the keypoints 405, 410.
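The 512-bit binary descriptor described above can be sketched as follows. The sampling pattern here is a fixed pseudo-random set of point pairs, which is an illustrative assumption; the disclosure does not specify how the sampling pattern is generated, only that the same pattern samples locations within a predetermined area around each keypoint.

```python
import numpy as np

def random_sampling_pattern(num_pairs=512, spread=15, seed=0):
    """Fixed pseudo-random sampling pattern, reused for every keypoint.

    Returns `num_pairs` pairs of (dy, dx) offsets relative to a keypoint.
    """
    rng = np.random.default_rng(seed)
    offsets = rng.integers(-spread, spread + 1, size=(num_pairs, 2, 2))
    return [((int(a[0]), int(a[1])), (int(b[0]), int(b[1])))
            for a, b in offsets]

def binary_descriptor(image, y, x, pattern):
    """Build a binary descriptor for the keypoint at (y, x).

    Each bit is set to 1 if the intensity at the first point of the
    pair exceeds the intensity at the second point, and 0 otherwise.
    """
    bits = []
    h, w = image.shape
    for (dy1, dx1), (dy2, dx2) in pattern:
        # Clamp sample locations to the image boundary.
        p1 = image[np.clip(y + dy1, 0, h - 1), np.clip(x + dx1, 0, w - 1)]
        p2 = image[np.clip(y + dy2, 0, h - 1), np.clip(x + dx2, 0, w - 1)]
        bits.append(1 if p1 > p2 else 0)
    return np.array(bits, dtype=np.uint8)
```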
The appearance of the human 105 may not change significantly between the images 301, 302 that include the bounding boxes 305, 315. Consequently, the human 105 may be identified and tracked from its location in the image 301 to its location in the image 302 by comparing the keypoints 405 in the bounding box 305 to the keypoints 410 in the bounding box 315. In some embodiments, the binary descriptors of the keypoints 405, 410 can be compared by determining a measure of the difference between the binary descriptors. For example, a Hamming distance between the binary descriptors may be computed by summing the exclusive-OR values of corresponding pairs of bits in the binary descriptors. A smaller Hamming distance indicates a smaller difference between the binary descriptors and a higher likelihood of a match between the corresponding keypoints 405, 410. The keypoints 405, 410 may therefore be matched or associated with each other if the value of the Hamming distance is less than a threshold. For example, a pair of matching keypoints 405, 410 is indicated by the arrow 415. In some embodiments, a vector of integers representative of the keypoints 405, 410 may be compared to determine whether the keypoints 405, 410 match each other. In some embodiments, a measure of color similarity between the keypoints 405, 410 may be used to determine whether the keypoints 405, 410 match. For example, keypoints 405, 410 may not match if the keypoint 405 is predominantly red and the keypoint 410 is predominantly blue. Binary descriptors, vectors of integers, colors, or other characteristics of the keypoints 405, 410 may also be used in combination with each other to determine whether the keypoints 405, 410 match.
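The Hamming-distance comparison described above (summing the exclusive-OR values of corresponding bits, then matching when the distance falls below a threshold) can be sketched as follows; the function names and the default threshold are illustrative assumptions.

```python
import numpy as np

def hamming_distance(desc_a, desc_b):
    """Sum of exclusive-OR values over corresponding pairs of bits."""
    return int(np.sum(np.bitwise_xor(desc_a, desc_b)))

def match_keypoint(descriptor, candidates, max_distance=100):
    """Return the index of the closest candidate descriptor, or None
    if even the best candidate's Hamming distance is not below the
    threshold."""
    best_index, best_distance = None, max_distance
    for i, cand in enumerate(candidates):
        d = hamming_distance(descriptor, cand)
        if d < best_distance:
            best_index, best_distance = i, d
    return best_index
```

A color-similarity check or an integer-vector comparison, as mentioned above, could be combined with this distance test to accept or reject a candidate match.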
The human 105 may be identified in the bounding boxes 305, 315 if a percentage of the matching keypoints 405, 410 exceeds a threshold. For example, twelve keypoints 405 are identified in the bounding box 305 and nine of these twelve are determined to match the nine keypoints 410 identified in the bounding box 315. Thus, 75% of the keypoints 405 are determined to match keypoints 410 in the bounding box 315, which may exceed a threshold such as a 50% match rate for the keypoints. Conversely, all of the nine keypoints 410 identified in the bounding box 315 matched keypoints 405 identified in the bounding box 305, which is a 100% match rate. Match rates may be defined in either “direction,” e.g., from the bounding box 305 to the bounding box 315 or from the bounding box 315 to the bounding box 305. In some embodiments, a motion history may be generated for the human 105 in response to determining that the human 105 is identified in the bounding boxes 305, 315. The motion history may include the identified locations of the human 105, a direction of motion of the human 105, a speed of the human 105, and the like. The motion history may be determined using averages over a predetermined number of previous video images or other combinations of information generated from one or more previous video images.
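The bidirectional match-rate test described above can be sketched as follows; the example numbers (nine matches between boxes with twelve and nine keypoints, against a 50% threshold) mirror the example above, while the function names are illustrative.

```python
def match_rate(matched_pairs, total_keypoints):
    """Fraction of a bounding box's keypoints that found a match."""
    return matched_pairs / total_keypoints if total_keypoints else 0.0

def same_human(matched_pairs, count_box_a, count_box_b, threshold=0.5):
    """Declare two bounding boxes to contain the same human if the
    match rate, measured in either direction, exceeds the threshold."""
    rate_a = match_rate(matched_pairs, count_box_a)  # e.g., 9/12 = 75%
    rate_b = match_rate(matched_pairs, count_box_b)  # e.g., 9/9 = 100%
    return rate_a > threshold or rate_b > threshold
```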
Furthermore, although
The location of the human 108 (or the bounding box 308) in the video image 301 may be used to define a candidate region to search for an image of the occluded human 108 in the video image 302. For example, the candidate region may be defined by extending the bounding box 308 by a ratio such as 1.2 times the length and height of the bounding box 308. For another example, the candidate region may be defined as a circular region about the location of the human 108 in the video image 301. The circular region may have a radius that corresponds to a speed of the human 108 indicated in the corresponding motion history or to a maximum speed of the human 108. For yet another example, the candidate region may be defined as a region (such as a circle or rectangle) that is displaced from the location of the human 108 in the video image 302 by a distance that is determined based on a speed and direction of the human 108 indicated in the corresponding motion history. If the human 108 is present in the candidate region, as illustrated in
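Two of the candidate-region constructions described above, extending the prior bounding box by a ratio and drawing a circle whose radius reflects the person's speed, can be sketched as follows. The tuple layouts and function names are illustrative assumptions.

```python
def extend_bounding_box(box, ratio=1.2):
    """Candidate search region: the previous bounding box scaled about
    its center by `ratio` in both width and height.

    `box` is (x, y, width, height) with (x, y) the top-left corner.
    """
    x, y, w, h = box
    new_w, new_h = w * ratio, h * ratio
    return (x - (new_w - w) / 2.0, y - (new_h - h) / 2.0, new_w, new_h)

def circular_candidate_region(center, speed, frame_interval):
    """Candidate search region: a circle centered on the person's last
    known location, with a radius equal to the distance the person
    could travel at `speed` during one frame interval."""
    cx, cy = center
    return (cx, cy, speed * frame_interval)
```

A displaced region, the third construction mentioned above, would shift the circle's center along the direction of motion from the motion history before applying the same radius.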
The keypoints 505, 510 may be compared on the basis of a Hamming distance between their binary descriptors. The keypoints 505, 510 may be matched or associated with each other if the value of the Hamming distance is less than a threshold, as discussed herein. For example, a pair of matching keypoints 505, 510 is indicated by the arrow 515. In some embodiments, vectors of integers representative of the keypoints 505, 510 or a measure of color similarity between the keypoints 505, 510 may be used to determine whether the keypoints 505, 510 match, as discussed herein.
The human 108 may be identified in the candidate region if a percentage of the matching keypoints 505, 510 exceeds a threshold. For example, twelve keypoints 505 are identified in the bounding box 308 and seven of these twelve are determined to match the seven keypoints 510 identified in the candidate region. Thus, just over half of the keypoints 505 are determined to match keypoints 510 in the candidate region, which may exceed a threshold such as a 50% match rate for the keypoints. Conversely, all of the seven keypoints 510 identified in the candidate region matched keypoints 505 identified in the bounding box 308, which is a 100% match rate. In some embodiments, a motion history may be generated for the human 108 in response to determining that the human 108 is identified in the bounding box 308 and the candidate region. The motion history may include the identified locations of the human 108, a direction of motion of the human 108, a speed of the human 108, and the like. The motion history may be determined using averages over a predetermined number of previous video images or other combinations of information generated from one or more previous video images. In some embodiments, a new bounding box may be defined for the occluded human 108.
Matching humans are identified (at block 725) in pairs of foreground images by comparing binary descriptors or vectors of integers representative of the keypoints in bounding boxes in the pairs of foreground images. In some embodiments, matching humans may also be identified (at block 725) in pairs of foreground images by comparing binary descriptors or vectors of integers representative of keypoints in bounding boxes to binary descriptors or vectors of integers representative of keypoints in candidate regions that were not identified by the patch descriptor technique, as discussed herein. At block 730, motion history for the identified humans may be generated. For example, locations of the same human in different video images may be used to calculate a distance traversed by the human in the time interval between the video images, which may be used to determine a speed or velocity of the human. The motion history for the identified humans may then be stored, e.g., in a database or other data structure.
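The motion-history computation described above (deriving speed and direction from a human's locations in successive video images, over a bounded window of past observations) can be sketched as follows; the dictionary layout and window size are illustrative assumptions.

```python
import math

def update_motion_history(history, location, timestamp, window=10):
    """Append a location observation and derive speed and direction
    from the two most recent observations.

    `history` is a list of ((x, y), timestamp) entries; at most
    `window` entries are retained. Returns None until two
    observations are available.
    """
    history.append((location, timestamp))
    if len(history) > window:
        history.pop(0)
    if len(history) < 2:
        return None
    (x0, y0), t0 = history[-2]
    (x1, y1), t1 = history[-1]
    dt = t1 - t0
    if dt <= 0:
        return None
    # Distance traversed in the interval between the two video images.
    distance = math.hypot(x1 - x0, y1 - y0)
    speed = distance / dt
    direction = math.degrees(math.atan2(y1 - y0, x1 - x0))
    return {"speed": speed, "direction": direction}
```

The returned speed and direction could be averaged over the retained window, as suggested above, and used to predict the human's location in subsequent video images.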
If more keypoints are available in the second bounding box (as determined at decision block 825), the binary descriptor or vector of integers representative of the additional keypoint may be accessed (at block 810) and compared to the binary descriptor or vector of integers representative of the keypoint in the first bounding box. If no more keypoints are available in the second bounding box (as determined at decision block 825), the method 800 may end by determining (at block 830) that there are no matching keypoints between the first bounding box and the second bounding box. Consequently, the method 800 determines that the images of humans associated with the first bounding box and the second bounding box are of different people.
At decision block 920, the number of matching keypoints is compared to a threshold. The threshold may indicate an absolute number of matching keypoints or a percentage of the total number of keypoints in the bounding box or candidate region that match. If the number of matching keypoints is less than the threshold, the method 900 determines (at block 925) that the human associated with the bounding box in the first image is not present in the candidate region of the second image. If the number of matching keypoints is greater than the threshold, the method 900 determines that the human associated with the bounding box in the first image is present in the candidate region of the second image. At block 930, a new bounding box encompassing the candidate region is defined and associated with the image of the human identified by the keypoints in the candidate region. The new bounding box may be used to identify or track the associated human in other video images in a sequence of video images.
The video processing device 1005 includes one or more processors 1025 that can identify or track images of humans in the video images captured by the camera 1015. Some embodiments of the processors 1025 may identify or track images of humans in the video images by executing instructions stored in the memory 1020. For example, the video processing device 1005 may include a plurality of processors 1025 that operate concurrently or in parallel to identify or track images of humans in the video images according to instructions for implementing the method 700 shown in
In some embodiments, certain aspects of the techniques described above may be implemented by one or more processors of a processing system executing software. The software comprises one or more sets of executable instructions stored or otherwise tangibly embodied on a non-transitory computer readable storage medium. The software can include the instructions and certain data that, when executed by the one or more processors, manipulate the one or more processors to perform one or more aspects of the techniques described above. The non-transitory computer readable storage medium can include, for example, a magnetic or optical disk storage device, solid state storage devices such as Flash memory, a cache, random access memory (RAM) or other non-volatile memory device or devices, and the like. The executable instructions stored on the non-transitory computer readable storage medium may be in source code, assembly language code, object code, or other instruction format that is interpreted or otherwise executable by one or more processors.
A non-transitory computer readable storage medium may include any storage medium, or combination of storage media, accessible by a computer system during use to provide instructions and/or data to the computer system. Such storage media can include, but is not limited to, optical media (e.g., compact disc (CD), digital versatile disc (DVD), Blu-Ray disc), magnetic media (e.g., floppy disc, magnetic tape, or magnetic hard drive), volatile memory (e.g., random access memory (RAM) or cache), non-volatile memory (e.g., read-only memory (ROM) or Flash memory), or microelectromechanical systems (MEMS)-based storage media. The computer readable storage medium may be embedded in the computing system (e.g., system RAM or ROM), fixedly attached to the computing system (e.g., a magnetic hard drive), removably attached to the computing system (e.g., an optical disc or Universal Serial Bus (USB)-based Flash memory), or coupled to the computer system via a wired or wireless network (e.g., network accessible storage (NAS)).
Note that not all of the activities or elements described above in the general description are required, that a portion of a specific activity or device may not be required, and that one or more further activities may be performed, or elements included, in addition to those described. Still further, the order in which activities are listed is not necessarily the order in which they are performed. Also, the concepts have been described with reference to specific embodiments. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the present disclosure as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present disclosure.
Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and any feature(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as a critical, required, or essential feature of any or all the claims. Moreover, the particular embodiments disclosed above are illustrative only, as the disclosed subject matter may be modified and practiced in different but equivalent manners apparent to those skilled in the art having the benefit of the teachings herein. No limitations are intended to the details of construction or design herein shown, other than as described in the claims below. It is therefore evident that the particular embodiments disclosed above may be altered or modified and all such variations are considered within the scope of the disclosed subject matter. Accordingly, the protection sought herein is as set forth in the claims below.