This disclosure relates in general to systems and methods for processing shadows of moving objects represented in compressed video images.
Multimedia technologies, including those for video- and image-related applications, are widely used in various fields, such as security surveillance, medical diagnosis, education, entertainment, and business presentations. For example, the use of high resolution video is becoming increasingly popular in security surveillance applications, so that important security information can be captured in real time at improved resolutions, such as a million pixels or more per image. In security surveillance systems, videos are usually recorded by video cameras, and the recorded raw video data are compressed before the video files are transmitted to or stored in a storage device or a security monitoring center. The video files can then be analyzed by processing devices.
Moving objects are of significant interest in surveillance applications. For example, surveillance videos taken at the entrance of a private building may be analyzed to identify whether an unauthorized person attempts to enter the building. In particular, the surveillance system may identify the trajectory of a moving object. If the trajectory indicates that a person has reached a certain position, an alarm may be triggered or a security guard may be notified. Therefore, detecting moving objects and identifying their trajectories may provide useful information for assuring the security of the monitored site.
However, many lighting conditions cause video cameras to record the shadows of moving objects in video images. To identify accurate moving trajectories, the shadows associated with moving objects need to be removed from the recorded video images. Otherwise, false alarms may be triggered, or miscalculations may result. Traditional image processing methods require that the compressed video data transmitted from the video camera be uncompressed before shadow detection and removal. Uncompressing high resolution video data, however, is usually time-consuming and may require expensive computational resources.
Therefore, it may be desirable to have systems and/or methods that process compressed video images and/or detect a shadow associated with a moving object in the compressed video images.
Consistent with embodiments of the present invention, there is provided a computer-implemented method for processing compressed video images. The method detects a candidate object region from the compressed video images. The candidate object region includes a moving object and a shadow associated with the moving object. For each data block in the candidate object region, the method calculates an amount of encoding data used to encode temporal changes in the respective data block. The method then identifies the shadow in the candidate object region composed of data blocks each having the amount of encoding data below a threshold value.
Consistent with embodiments of the present invention, there is also provided another computer-implemented method for processing compressed video images. The method detects an object image region representing a moving object from the compressed video images. The compressed video images include a shadow associated with the moving object. The method then determines a hypothetical moving object based on the detected object image region. The method further creates an environmental model of the environment in which the compressed video images are obtained, and determines a hypothetical shadow for the hypothetical moving object based on the environmental model.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are intended to provide further explanation of the invention as claimed.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate disclosed embodiments described below.
Reference will now be made in detail to the exemplary embodiments of the disclosure, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.
Consistent with some embodiments, surveillance system 100 may include a video processing and monitoring system 101, a plurality of surveillance cameras 102, and a communication interface 103. For example, surveillance cameras 102 may be distributed throughout the monitored site, and video processing and monitoring system 101 may be located on the site or remote from the site. Video processing and monitoring system 101 and surveillance cameras 102 may communicate via communication interface 103. Communication interface 103 may be a wired or wireless communication network. In some embodiments, communication interface 103 may have a bandwidth sufficient to transmit video images from surveillance cameras 102 to video processing and monitoring system 101 in real time.
Surveillance cameras 102 may be video cameras, such as analog closed-circuit television (CCTV) cameras or internet protocol (IP) cameras, configured to capture video images of one or more surveillance regions. For example, a video camera may be installed above the entrance of a bank branch or next to an ATM. In some embodiments, surveillance cameras 102 may be connected to a recording device, such as a central network video recorder (not shown), configured to record the video images. In some other embodiments, surveillance cameras 102 may have built-in recording functionality and can thus record directly to digital storage media, such as flash drives, hard disk drives, or network-attached storage.
The video data acquired by surveillance cameras 102 may be compressed before it is transmitted to video processing and monitoring system 101. Consistent with the present disclosure, video compression refers to reducing the quantity of data used to represent digital video images. Therefore, given a predetermined bandwidth on communication interface 103, compressed video data can be transmitted faster than the original, uncompressed video data. Accordingly, the video images can be displayed on video processing and monitoring system 101 in real time.
Video compression may be implemented as a combination of spatial image compression and temporal motion compensation. Various video compression methods may be used to compress the video data, such as discrete cosine transform (DCT), discrete wavelet transform (DWT), fractal compression, matching pursuit, etc. In particular, several video compression standards have been developed based on the DCT, including H.120, H.261, MPEG-1, H.262/MPEG-2, H.263, MPEG-4, and H.264/MPEG-4 AVC. H.264 is currently one of the most commonly used formats for the recording, compression, and distribution of high definition video. Thus, the present disclosure discusses embodiments of the invention in connection with video data compressed under the H.264 standard. However, it is contemplated that the invention can be applied to video data compressed with any other compression standard or method.
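By way of illustration only, the following Python sketch computes the two-dimensional DCT of an 8×8 pixel block, the transform family underlying the DCT-based standards listed above (H.264 itself uses an integer approximation of the DCT). The block contents and function names are hypothetical and are not part of the disclosed method.

    # Illustrative 2-D DCT of an 8x8 block, as used (in integer-
    # approximated form) by DCT-based codecs such as H.264.
    import numpy as np
    from scipy.fftpack import dct

    def dct2(block):
        """Apply a type-II 2-D DCT with orthonormal scaling."""
        return dct(dct(block.T, norm='ortho').T, norm='ortho')

    block = np.arange(64, dtype=float).reshape(8, 8)  # stand-in pixel data
    coeffs = dct2(block)
    dc = coeffs[0, 0]         # DC coefficient: block-average intensity
    ac = coeffs.ravel()[1:]   # AC coefficients: spatial detail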
As shown in
Memory module 120 can include, among other things, a random access memory (“RAM”) and a read-only memory (“ROM”). The computer program instructions can be accessed and read from the ROM, or any other suitable memory location, and loaded into the RAM for execution by processor 110. For example, memory module 120 may store one or more software applications. Software applications stored in memory module 120 may comprise an operating system 121 for common computer systems as well as for software-controlled devices. Further, memory module 120 may store an entire software application or only a part of a software application that is executable by processor 110. In some embodiments, memory module 120 may store video processing software 122 that may be executed by processor 110. For example, video processing software 122 may be executed to remove shadows from the compressed video images.
It is also contemplated that video processing software 122 or portions of it may be stored on a removable computer readable medium, such as a hard drive, computer disk, CD-ROM, DVD-ROM, CD-RW or DVD±RW, USB flash drive, memory stick, or any other suitable medium, and may run on any suitable component of video processing and monitoring system 101. For example, portions of applications to perform video processing may reside on a removable computer readable medium and be read and acted upon by processor 110 using routines that have been copied to memory module 120.
In some embodiments, memory module 120 may also store master data, user data, application data and/or program code. For example, memory module 120 may store a database 123 having therein various compressed video data transmitted from surveillance cameras 102.
In some embodiments, input device 130 and display device 140 may be coupled to processor 110 through appropriate interfacing circuitry. In some embodiments, input device 130 may be a hardware keyboard, a keypad, or a touch screen, through which an authorized user, such as a security guard, may input information to video processing and monitoring system 101. Display device 140 may include one or more display screens that display video images or any related information to the user.
Communication device 150 may provide communication connections such that video processing and monitoring system 101 may exchange data with external devices, such as surveillance cameras 102. Consistent with some embodiments, communication device 150 may include a network interface (not shown) configured to receive compressed video data via communication interface 103.
One or more components of surveillance system 100 may be used to implement a process related to video processing. For example,
In some embodiments, the video stream may include video data coded in the form of macroblocks. A macroblock is usually composed of two or more blocks of pixels. The size of a block may depend on the codec and is usually a multiple of 4. For example, in modern codecs such as H.263 and H.264, the overarching macroblock size may be fixed at 16×16 pixels, but it can be broken down into smaller blocks or partitions that are 4, 8, or 16 pixels on each side (e.g., 16×8, 8×16, 8×8, 8×4, 4×8, or 4×4 pixels in H.264).
Color and luminance information may be encoded in the macroblocks. For example, a macroblock may contain four Y (luminance) blocks, one Cb (blue color difference) block, and one Cr (red color difference) block. In an example of an 8×8 macroblock, the luminance may be encoded at an 8×8 pixel size and the difference-red and difference-blue information each at a size of 2×2. In some embodiments, the macroblock may further include header information describing the encoding. For example, it may include an ADDR unit indicating the address of the block in the video image, a TYPE unit identifying the type of the macroblock (e.g., intra-frame, inter-frame, or bi-directional inter-frame), a QUANT unit indicating the quantization value used to vary quantization, a VECTOR unit storing a motion vector, and a CBP unit storing a coded block pattern, i.e., a bit mask indicating which blocks in the macroblock carry encoded prediction error.
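For purposes of illustration only, the macroblock fields described above might be represented by a data structure such as the following Python sketch. The field names mirror the header units listed above; the structure is a simplified stand-in rather than an actual H.264 bitstream layout.

    # Simplified stand-in for a decoded macroblock header and payload;
    # not an actual H.264 bitstream structure.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Macroblock:
        addr: int         # ADDR: address of the block in the video image
        mb_type: str      # TYPE: 'intra', 'inter', or 'bidirectional'
        quant: int        # QUANT: quantization value
        vector: tuple     # VECTOR: motion vector (dx, dy)
        cbp: int          # CBP: coded block pattern bit mask
        y: np.ndarray     # Y (luminance) blocks
        cb: np.ndarray    # Cb (blue color difference) block
        cr: np.ndarray    # Cr (red color difference) block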
The video images may usually show several objects, including static objects and moving objects. Due to the existence of lighting sources, the video images may also show shadows of these objects. In particular, the shapes, sizes, and orientations of the shadows associated with moving objects may vary over time. For example,
In step 202 of process 200, candidate object regions corresponding to one or more moving objects and their respective shadows may be detected in the compressed video images. In some embodiments, candidate object regions may be detected based on the compressed video data without decompressing it into the raw data domain. Image 302 of
In some embodiments, various image segmentation methods may be used to detect the candidate object regions. For example, processor 110 may aggregate temporally adjacent video images and calculate the motion vector for each “block” in the aggregated images. Because the motion vector is indicative of the temporal changes within a block, a block with a larger motion vector may be identified as part of a candidate object region. In addition, or in the alternative, processor 110 may calculate a difference between two temporally adjacent video images based on encoded image features such as luminance, color, and displacement vectors. Based on the calculated difference, processor 110 may further identify whether a block belongs to a candidate object region or to the background. Processor 110 may further “connect” the identified blocks into a continuous region, as in the sketch below. For example, processor 110 may determine the candidate object region as a continuous region that covers the identified blocks. In some embodiments, processor 110 may label the blocks in the candidate object region.
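A minimal sketch of this segmentation approach follows, assuming that per-block motion vectors have already been extracted from the compressed stream; the array shapes, threshold value, and function names are hypothetical.

    # Hypothetical sketch: identify and connect candidate object blocks
    # from per-block motion vectors, without decoding to raw pixels.
    import numpy as np
    from scipy.ndimage import label

    def candidate_regions(motion_vectors, threshold=1.0):
        """motion_vectors: (rows, cols, 2) array of per-block (dx, dy)."""
        magnitude = np.linalg.norm(motion_vectors, axis=-1)
        moving = magnitude > threshold      # blocks with significant motion
        labeled, count = label(moving)      # connect adjacent moving blocks
        return labeled, count               # labeled regions and their number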
In step 203 of process 200, the shadow may be detected in the candidate object region. In some embodiments, the detection may be made based on H.264 macroblocks. For example,
For example, for each macroblock in the candidate object regions, processor 110 may calculate the DC encoding bits (step 403) and AC encoding bits (step 404) used to encode the corresponding video data.
In some embodiments, in steps 403 and 404, processor 110 may calculate the amount of encoding data (e.g., amount of information carried by the DC and AC encoding bits) used to encode temporal change information of a macroblock. Accordingly, in step 405, processor 110 may identify an estimated shadow region, from the candidate object region, that is composed of those macroblocks that have smaller amounts of encoding data. For example, processor 110 may compare the amount of encoding data of each macroblock with a predetermined threshold, and if the threshold is exceeded, the macroblock is labeled as part of moving object 501. Otherwise, the macroblock is labeled as part of shadow 502.
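The comparison in step 405 might be sketched as follows, where encoding_bits is assumed to hold the per-macroblock amount of encoding data (DC plus AC bits) computed in steps 403 and 404; the names and array shapes are illustrative only.

    # Hedged sketch of step 405: macroblocks whose temporal-change
    # encoding data falls below the threshold are labeled as shadow.
    import numpy as np

    def label_shadow_blocks(encoding_bits, threshold):
        """encoding_bits: (rows, cols) array of DC+AC bits per macroblock.
        Returns a boolean mask that is True where shadow is estimated."""
        return encoding_bits <= threshold   # below threshold -> shadow 502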
In some other embodiments, in steps 403 and 404, processor 110 may calculate the values of the encoding data for each macroblock. For example, processor 110 may calculate the values of the DC and AC encoding bits. Since the AC encoding bits of moving object 501 tend to have higher values than the AC encoding bits of shadow 502, in step 405, processor 110 may identify the image region composed of those macroblocks that have smaller-valued AC encoding bits as the estimated shadow location.
Based on the estimation of the shadow location in step 405, processor 110 may determine a boundary between moving object 501 and shadow 502 within the candidate object region (step 406). For example, the candidate object region may be divided by the boundary into two parts: a shadow image region and an object image region.
Processor 110 may further refine the boundary based on the motion entropies of the two image regions. Each macroblock in the compressed video data may be associated with a motion vector, a two-dimensional vector used for inter prediction that provides an offset from the coordinates in a video image to the coordinates in a reference image. Motion vectors associated with macroblocks in a moving object may share a similar or the same movement direction, while motion vectors associated with macroblocks in a moving shadow may show various movement directions. Therefore, the motion entropy of the motion vectors associated with macroblocks of the shadow is usually higher than that of the motion vectors associated with the moving object. Accordingly, the boundary between moving object 501 and shadow 502 may be accurately set when the difference between the motion entropy for the shadow image region and the motion entropy for the object image region is maximized.
In some embodiments, the boundary may be refined using an iterative method. For example, in step 407, processor 110 may calculate a motion entropy for each of the shadow image region and the object image region separated by the boundary determined in step 406. Processor 110 may further determine the difference between the motion entropy for the shadow image region and the motion entropy for the object image region. Processor 110 may then go back to step 406 to slightly adjust the boundary, and execute step 407 again to determine another difference in motion entropies. Steps 406 and 407 may be repeated until the difference in motion entropies is maximized.
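One possible reading of steps 406 and 407 is sketched below, under the simplifying assumption that the boundary is a single horizontal split through the candidate region's motion-vector array; the histogram bin count and the exhaustive search are illustrative choices, not limitations of the method.

    # Illustrative sketch of steps 406-407: choose the split that
    # maximizes the motion-entropy difference between the two regions.
    import numpy as np

    def motion_entropy(vectors, bins=8):
        """Shannon entropy of the motion-vector direction histogram."""
        angles = np.arctan2(vectors[..., 1], vectors[..., 0]).ravel()
        hist, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi))
        p = hist / max(hist.sum(), 1)
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def refine_boundary(region_vectors):
        """region_vectors: (rows, cols, 2) motion vectors of the region."""
        best_row, best_diff = 1, -np.inf
        for row in range(1, region_vectors.shape[0]):
            object_part = region_vectors[:row]    # candidate object side
            shadow_part = region_vectors[row:]    # candidate shadow side
            diff = motion_entropy(shadow_part) - motion_entropy(object_part)
            if diff > best_diff:                  # keep the best split so far
                best_row, best_diff = row, diff
        return best_row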
Based on the encoding bits calculated in steps 403 and 404, the motion entropies calculated in step 407, and the refined boundary determined in step 406, processor 110 may identify the location of shadow 502 using various image segmentation and data fusion methods known in the art, such as the Markov Random Field (MRF) classification method (step 408). Process 400 may then terminate after step 408.
Returning to
In step 601, a hypothetical moving object may be determined based on the object image region detected in step 203. For example, image 303 of
In step 602, an environmental model may be created. In some embodiments, processor 110 may receive input of location information of lighting sources in the real monitored environment. Processor 110 may then create the environmental model that includes the lighting sources and the hypothetical moving objects. In step 603, processor 110 may simulate light projections onto the hypothetical moving objects from the locations of the lighting sources. Accordingly, in step 604, processor 110 may estimate the shadow locations of the hypothetical moving objects, such as hypothetical shadows 710 and 720, as shown in
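By way of illustration, the projection in steps 603 and 604 might be sketched as follows, assuming a point light source, a planar ground at z = 0, and a hypothetical object vertex above the ground; these coordinate conventions are assumptions made for the example.

    # Hypothetical sketch of steps 603-604: cast a ray from a point
    # light source through an object vertex and intersect the ground
    # plane z = 0 to obtain one point of the hypothetical shadow.
    import numpy as np

    def project_to_ground(light, point):
        """light, point: (x, y, z) with the light above the object point."""
        lx, ly, lz = light
        px, py, pz = point
        t = lz / (lz - pz)            # ray parameter where z reaches 0
        return np.array([lx + t * (px - lx), ly + t * (py - ly), 0.0])

    # Example: a light at (0, 0, 10) casts the vertex (2, 1, 3) to
    # roughly (2.86, 1.43, 0) on the ground plane.
    shadow_point = project_to_ground((0.0, 0.0, 10.0), (2.0, 1.0, 3.0))

Repeating this projection for each vertex of a hypothetical object and taking the covered ground area yields an estimated shadow location such as hypothetical shadows 710 and 720.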
Returning to
In step 804, processor 110 may calculate bounding boxes for the shadow locations. In some embodiments, a bounding box may be a rectangular box that covers the extent of an aggregated shadow location. For example, image 306 of
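A minimal sketch of the bounding-box computation in step 804 follows, assuming the aggregated shadow location is given as an array of macroblock coordinates; the coordinate convention is hypothetical.

    # Minimal sketch of step 804: the axis-aligned rectangle that
    # covers an aggregated shadow location.
    import numpy as np

    def bounding_box(shadow_blocks):
        """shadow_blocks: (N, 2) array of (row, col) block coordinates.
        Returns (top, left, bottom, right)."""
        rows, cols = shadow_blocks[:, 0], shadow_blocks[:, 1]
        return rows.min(), cols.min(), rows.max(), cols.max()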
Returning to
It will be apparent to those skilled in the art that various modifications and variations can be made in the disclosed embodiments without departing from the scope or spirit of those disclosed embodiments. Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosed embodiments being indicated by the following claims.