The present invention generally relates to a method and apparatus for processing image data, more particularly but not exclusively for a surveillance application.
Video surveillance cameras are normally used to monitor premises for security purposes. A typical video surveillance system usually involves taking video signals of site activity from one or more video cameras, transmitting the video signals to a remote central monitoring point, and displaying the video signals on video screens for monitoring by security personnel. In some cases where evidentiary support is desired for investigation or where “real-time” human monitoring is impractical, some or all of the video signals will be recorded.
It is common to record the output of each camera on a time-elapse video cassette recorder (VCR). In some applications, a video or infrared motion detector is used so that the VCR does not record anything except when there is motion in the observed area. This reduces the consumption of tape and makes it easier to find footage of interest. However, it does not eliminate the need for the VCR, which is a relatively complex and expensive component that is subject to mechanical failure, frequent tape cassette change, and periodic maintenance, such as cleaning of the video heads.
Another proposed approach is to use an all-digital video imaging system, which converts each video image to a compressed digital form immediately upon capture. The digital data is then saved in a conventional database. Solutions following this approach can be divided into three categories. The first category makes use of digital video recorders, with or without a network interface; this category is relatively expensive and requires a substantial amount of storage space. The second category comprises framegrabber-based hardware solutions, in which a framegrabber PC is used with traditional video cameras attached to it; the disadvantages of this category include lack of flexibility, heavy cabling work, and high cost. Compared to the first two categories, the third category, a network camera based solution, possesses favourable features. In a network camera based surveillance solution, the cabling is simpler, faster and less expensive. The installation is not necessarily permanent since the cameras can easily be moved around a building. The distance from the camera to the monitoring/displaying/storage station can be very long (in principle worldwide). Moreover, network camera based solutions can achieve performance comparable with the first two categories. A network camera developed by Axis is able to transmit high-quality streaming video at 30 (NTSC) or 25 (PAL) images per second, given sufficient bandwidth.
In digital video surveillance systems, as the amount of video data is relatively large, it is necessary to reduce the data amount by coding/compressing the digital video data. If video data is compressed, more video information can be transmitted through a network at high speed. Among the various compression standards, JPEG and Motion JPEG (MJPEG) are the most widely used. The reason is that, although the H.261, H.263, and MPEG compression methods can generate a smaller data stream, some image details will inevitably be dropped which might be crucial in identifying an intruder. Using JPEG or Motion JPEG, the image quality is always guaranteed. U.S. Pat. No. 5,379,122, and the book JPEG: Still Image Compression Standard, New York, N.Y.: Van Nostrand Reinhold, 1993, by W. B. Pennebaker and J. L. Mitchell, give a general overview of data-compression techniques which are consistent with the JPEG device-independent compression standard. MJPEG is a less formal standard used by several manufacturers of digital video equipment. In MJPEG, the moving picture is digitized into a sequence of still image frames, and each image frame in an image sequence is compressed using the JPEG standard. Therefore, a description of JPEG suffices to describe the operation of MJPEG. In JPEG compression, each image frame of an original image sequence which is to be transmitted from one hardware device to another, or which is to be retained in an electronic memory, is first divided into a two-dimensional array of typically square blocks of pixels, and then encoded by a JPEG encoder (an apparatus or a computer program) into compressed data. To display JPEG compressed data, a JPEG decoder (normally a computer program) is used to decompress the compressed data and reconstruct an approximation of the original image sequence therefrom.
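As a simple illustration of this per-frame operation only, the sketch below compresses and reconstructs individual frames with JPEG; it assumes the Pillow imaging library and frames supplied as NumPy arrays, and is not part of the claimed apparatus:

```python
# Sketch: MJPEG-style encoding compresses each captured frame independently with JPEG.
# Assumes Pillow (PIL) and NumPy are available; the frame source is hypothetical.
import io
import numpy as np
from PIL import Image

def encode_frame_as_jpeg(frame: np.ndarray, quality: int = 75) -> bytes:
    """Compress one RGB frame (H x W x 3, uint8) to a JPEG byte string."""
    buf = io.BytesIO()
    Image.fromarray(frame).save(buf, format="JPEG", quality=quality)
    return buf.getvalue()

def decode_jpeg_frame(data: bytes) -> np.ndarray:
    """Reconstruct an approximation of the original frame from JPEG bytes."""
    return np.asarray(Image.open(io.BytesIO(data)))

# An MJPEG stream is then simply the sequence of per-frame JPEG byte strings.
```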
Although JPEG/MJPEG compression preserves the image quality, it results in a relatively large compressed data size. It takes about 3 seconds to transmit a 704×576 color image at a reasonable compression level through an ISDN 2B link, as the rough calculation below illustrates. Such a transmission speed is not acceptable in surveillance applications. Observing the camera setting in surveillance applications, one readily finds that the camera position is always fixed. That is, the images captured by a surveillance camera will always consist of two distinct regions: a background region and a foreground region. The background region consists of the static objects in the scene while the foreground region consists of objects that move and change as time progresses. Ideally, background regions should be compressed and sent to the receiver only once. By concentrating bit allocation on pixels in the foreground region, more efficient video encoding can be achieved.
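(By way of a rough, illustrative calculation: assuming a compressed image of about 50 kbyte, as in the storage example given later in this description, 50 kbyte × 8 = 400 kbit, and 400 kbit ÷ 128 kbit/s ≈ 3.1 seconds over an ISDN 2B link.)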
Means for segmenting a video signal into different layers and merging two or more video signals to provide a single composite video signal are known in the art. An example of such video separation and merging is the presentation of weather forecasts on television, where a weather forecaster in the foreground is first segmented from the original background and then superimposed on a weather-map background. Such prior-art means normally use a color-key merging technology in which the required foreground scene is recorded against a colored background (usually blue or green). If a blue pixel is detected in the foreground scene (assuming blue is the color key), then a video switch will select the background scene at that point; if a blue pixel is not detected, the switch will select the foreground scene at that point. Examples of such video separation and merging techniques include U.S. Pat. Nos. 4,409,611 and 5,923,791, and an article by Nakamura et al. in SMPTE Journal, Vol. 90, Feb. 1981, p. 107. The key feature of this type of method is the pre-set background color. This is feasible in media production applications but is not possible in a surveillance application, where the background cannot be chosen.
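For illustration only, such a per-pixel color-key switch can be sketched as follows; this is a minimal sketch assuming RGB frames as NumPy arrays, and the key color and tolerance are arbitrary choices:

```python
# Sketch of color-key (chroma-key) merging: a per-pixel switch between two sources.
import numpy as np

def chroma_key(foreground: np.ndarray, background: np.ndarray,
               key=(0, 0, 255), tol=40) -> np.ndarray:
    """Where the foreground pixel matches the key color (blue here), show the background."""
    is_key = np.all(np.abs(foreground.astype(int) - key) < tol, axis=-1)
    return np.where(is_key[..., None], background, foreground)
```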
To perform foreground/background segmentation in a general environment, some image/video encoders have been proposed. U.S. Pat. No. 5,915,044 describes a method of encoding uncompressed video images using foreground/background segmentation. The method consists of two steps: a pixel-level analysis and a block-level analysis. During the pixel-level analysis, the interframe differences corresponding to each original image are thresholded to generate an initial pixel-level mask, and a first morphological filter is applied to the initial pixel-level mask to generate a filtered pixel-level mask. During the block-level analysis, the filtered pixel-level mask is thresholded to generate an initial block-level mask, and a second morphological filter is preferably applied to the initial block-level mask to generate a filtered block-level mask. Each element of the filtered block-level mask indicates whether the corresponding block of the original image is part of the foreground or the background.
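A minimal sketch of this two-level analysis is given below, assuming grayscale frames supplied as NumPy arrays; the thresholds, block size, and structuring elements are illustrative choices, not values taken from that patent:

```python
# Sketch of pixel-level and block-level change masks in the style of U.S. Pat. No. 5,915,044.
# Assumes NumPy and SciPy; thresholds and block size are illustrative only.
import numpy as np
from scipy import ndimage

def change_masks(prev: np.ndarray, curr: np.ndarray,
                 pix_thresh: float = 20.0, blk: int = 8, blk_thresh: float = 0.25):
    # Pixel level: threshold the interframe differences, then filter noise morphologically.
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    pixel_mask = diff > pix_thresh
    pixel_mask = ndimage.binary_opening(pixel_mask, structure=np.ones((3, 3)))

    # Block level: a block is marked "changed" if enough of its pixels changed.
    h, w = pixel_mask.shape
    blocks = pixel_mask[:h - h % blk, :w - w % blk].reshape(h // blk, blk, w // blk, blk)
    block_mask = blocks.mean(axis=(1, 3)) > blk_thresh
    block_mask = ndimage.binary_opening(block_mask, structure=np.ones((2, 2)))
    return pixel_mask, block_mask  # True = changed (foreground), False = background
```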
Patent EP0833519 introduces an enhancement to the standard JPEG image data compression technique which includes a step of recording the length of each string of bits corresponding to each block of pixels in the original image at the time of compression. The list of lengths of each string of bits in the compressed image data is retained as an “encoding cost map” or ECM. The ECM, which is considerably smaller than the compressed image data, is transmitted or retained in memory separately from the compressed image data along with some other accompanying information and is used as a “key” for editing or segmentation of the compressed image data. The ECM, in combination with a map of the DC components of the compressed image, is also used for substituting background portions of the image with blocks of pure white data, in order to compress certain types of images even further. This patent is meant for digital printing. It uses the bit length and DC coefficient of each block of pixels to analyse and segment the image into regions with different characteristics, for example text, halftone, and contone regions. The ‘background’ in that patent denotes regions with less detail, which is entirely different from the background definition in surveillance applications, namely the portions of the scene that do not significantly change from frame to frame. The method of that patent therefore cannot be used for foreground/background separation in surveillance applications.
Besides patents, some research work, especially MPEG-4 related work, has also been published in this area. The paper “Check Image Compression using a layered coding method”, J. Huang et al., Journal of Electronic Imaging, Vol. 7, No. 3, pp. 426-442, July 1998, introduces a method to segment and encode a check image into different layers.
All of these known approaches have been generally adequate for their intended purposes, but they are not satisfactory in surveillance network camera applications.
Various patents describing network cameras or network camera related surveillance systems have been proposed in the prior art. U.S. Pat. No. 5,926,209 discloses a video camera apparatus with a compression system responsive to video camera adjustment. Patent JP7015646 provides a network camera which can freely select the angle of view and the shooting direction of a subject. Patent EP0986259 describes a network surveillance video camera system containing monitor camera units, a data storing unit, a control server, and a monitor display coupled by a network. Japanese patent application provisional publication No. 9-16685 discloses a remote monitor system using an ISDN data link. Japanese patent application provisional publication No. 7-288806 discloses that a traffic amount is measured and the resolution is determined in accordance with the traffic amount. U.S. Pat. No. 5,745,167 discloses a video monitor system including a transmitting medium, video cameras, monitors, a VTR, and a control portion. Although some of these network cameras use image analysis techniques to perform motion detection, none of them is capable of background/foreground separation, encoding, and transmission.
It is an object of the invention to provide an image processing method and apparatus suitable for a surveillance application which alleviates at least one disadvantage of the prior art noted above and/or provides the public with a useful choice.
According to the invention in a first aspect, there is provided a method of processing image data comprising the steps of: taking a compressed version of an image; determining from the compressed version whether a change in the image compared to previously obtained image data has occurred; and identifying the changed portion of the compressed image.
An image processor arranged to perform the method of the first aspect is also provided.
According to the invention in a second aspect, there is provided a method of processing compressed data derived from an original image, the data being organized as a set of blocks, each block comprising a string of bits corresponding to an area of the original image, Discrete Cosine Transform (DCT) coefficients for each block being derived by decoding each string of bits, the differences between the DCT coefficients of the current frame and the DCT coefficients of a previous frame or a background frame being thresholded for each frame to produce an initial mask indicating changed blocks, segmentation and morphological techniques being applied to the initial mask to filter out noise and find regions of movement, and, if no moving region is found, the current frame being regarded as a background frame, otherwise the blocks in the moving regions being identified as foreground blocks and extracted to form a foreground frame.
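By way of illustration only, a much-simplified sketch of such compressed-domain processing is given below; it assumes the DCT coefficients have already been entropy-decoded into per-block arrays, and the thresholds, morphological filtering, and region-size test are arbitrary simplifications rather than the claimed method:

```python
# Sketch of compressed-domain foreground/background classification of one frame.
# `curr_dct` and `bg_dct` have shape (block_rows, block_cols, 64): the entropy-decoded
# DCT coefficients of each block. Threshold and filter choices are illustrative only.
import numpy as np
from scipy import ndimage

def classify_frame(curr_dct: np.ndarray, bg_dct: np.ndarray,
                   diff_thresh: float = 50.0, min_region: int = 4):
    # Threshold the per-block DCT coefficient differences to obtain an initial mask.
    block_diff = np.abs(curr_dct - bg_dct).sum(axis=-1)
    initial_mask = block_diff > diff_thresh

    # Morphological filtering removes isolated noisy blocks; connected components
    # then give candidate moving regions.
    filtered = ndimage.binary_opening(initial_mask, structure=np.ones((2, 2)))
    labels, n_regions = ndimage.label(filtered)
    if n_regions == 0:
        return "background", None              # no motion: frame becomes the background

    sizes = ndimage.sum(filtered, labels, index=range(1, n_regions + 1))
    moving = [i + 1 for i, s in enumerate(sizes) if s >= min_region]
    if not moving:
        return "background", None

    fg_mask = np.isin(labels, moving)          # blocks belonging to moving regions
    return "foreground", fg_mask               # foreground frame keeps only these blocks
```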
According to the invention in a third aspect, there is provided network camera apparatus comprising an image acquisition unit arranged to capture an image and convert the image into digital format; an image compression unit arranged to decrease the data size; an image processing unit arranged to analyze the compressed data of each image, detect motion from the compressed data, and identify background and foreground regions for each image; a data storage unit arranged to store the image data processed by the image processing unit; a traffic detection unit arranged to detect network traffic and set the frame rates of the image data to be transmitted; and a communication unit arranged to communicate with the network to transmit the image data.
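Purely as a structural illustration, the functional units of such apparatus might be represented as follows; the field names and interfaces are assumptions made only for this sketch:

```python
# Structural sketch of the network camera apparatus units (illustrative only).
from dataclasses import dataclass

@dataclass
class NetworkCamera:
    image_acquisition: object   # captures an image and converts it to digital form
    image_compression: object   # decreases the data size (e.g. JPEG compression)
    image_processing: object    # detects motion, separates background and foreground
    data_storage: object        # stores the processed image data
    traffic_detection: object   # measures network traffic, sets transmit frame rates
    communication: object       # transmits image data over the network
```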
According to the invention in a fourth aspect, there is provided a method of transmitting image data where the data has been split into foreground data and background data wherein the foreground and background data are transmitted at different bit rates.
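An illustrative transmission loop following this idea is sketched below; the send function, the frame source, and the background period are assumptions chosen only for illustration:

```python
# Sketch: transmit foreground data frequently and background data only occasionally.
import time

def transmit(frames, send, bg_period_s: float = 60.0):
    """frames yields (kind, data) pairs, where kind is 'background' or 'foreground'."""
    last_bg = float("-inf")
    for kind, data in frames:
        now = time.monotonic()
        if kind == "foreground":
            send(data)                          # foreground frames: sent at the full rate
        elif now - last_bg >= bg_period_s:
            send(data)                          # background frames: sent only occasionally
            last_bg = now
```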
According to the invention in a fifth aspect, there is provided a method of forming a changed image from previous image data and current image data identifying a change in a portion of the previous image, comprising replacing a corresponding portion of the previous image data with the current image data to form the changed image.
In the described embodiment a video encoding scheme for a network surveillance camera is provided that addresses the bit rate and foreground/background segmentation problems of the prior art. All the important image details can be kept during the encoding and transmission processes while the compressed data size is kept low. The proposed video encoding scheme identifies all the stationary objects in the scene (such as doors, walls, windows, tables, chairs, and computers) as background regions and all the moving objects (people, animals, etc.) as foreground regions. After separating the image frames into foreground regions and background regions, the video encoding scheme sends background data at a low frequency and foreground data at a high frequency. If the number of images captured by a network camera each second is 25, the total number of frames captured over 30 minutes will be 30×60×25=45000. If each image has a size of 50 kbyte (after JPEG compression), the total size will be 2.25 Gbyte. In an indoor room environment, however, the room may be empty most of the time. Assume that, out of the 30 minutes, the time people are moving in the room is 10 minutes and the area occupied by the moving people is one eighth of the whole image area. By using the proposed foreground/background separation and transmission scheme, the total data can then be compressed to a much smaller size of about 93.8 Mbyte, as worked through below. Thus, the network camera of the described embodiment of the present invention is able to produce a much smaller image stream of the same quality when compared with a traditional network camera. In the example given above, the size of the image data generated by a network camera of the described embodiment is only one twenty-fourth of that of a traditional network camera. By separating foreground moving objects from the background, the described embodiment has another advantage over the traditional network camera: high-level information such as the size, color, classification, or moving direction of foreground objects can be easily extracted and used in video indexing or intelligent camera applications.
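The storage figures quoted above can be reproduced by the following rough calculation, using the stated assumptions of 10 minutes of motion and a one-eighth foreground area:

```python
# Rough reproduction of the storage estimate above (numbers taken from the text).
fps, minutes, frame_kb = 25, 30, 50

total_frames = minutes * 60 * fps             # 45,000 frames in 30 minutes
plain_mjpeg_kb = total_frames * frame_kb      # 2,250,000 kbyte, i.e. about 2.25 Gbyte

motion_minutes, fg_fraction = 10, 1 / 8
fg_frames = motion_minutes * 60 * fps         # 15,000 foreground frames
fg_kb = fg_frames * frame_kb * fg_fraction    # 93,750 kbyte, i.e. about 93.8 Mbyte

print(plain_mjpeg_kb / fg_kb)                 # roughly a 24x reduction
```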
Embodiments of the invention will now be described, by way of example, with reference to the accompanying drawings, in which:
In JPEG compression, each block yields 64 DCT coefficients, each of which has a real value. Given that high-frequency DCT coefficients occur less often and have less visual impact on the image, it makes sense to use only 1 or 2 bits to represent the high-frequency DCT coefficients and 8 bits to represent the low-frequency DCT coefficients with precision. This results in compression with almost no perceptible difference to humans. This step of reducing the number of bits representing the DCT coefficients is called quantization. For each JPEG compressed image, there is a quantization table that determines how many bits represent each DCT coefficient. Each DCT coefficient is divided by a quantization coefficient (a constant in the quantization table) and rounded to the nearest integer. The quantization step can be used to vary the amount of compression: if only a couple of bits are used to represent each coefficient, there will be high compression at the cost of a fuzzy image; conversely, all the bits could be used (while still compressing) for a near-exact replica of the original image. The reduced and quantized DCT coefficients are next coded using the Huffman coding method.
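For illustration, the quantization step described above amounts to an element-wise division and rounding, as sketched below; the quantization values used in practice come from the image's quantization table, which is not reproduced here:

```python
# Sketch of JPEG quantization of one block of DCT coefficients.
import numpy as np

def quantize(dct_block: np.ndarray, quant_table: np.ndarray) -> np.ndarray:
    """Divide each DCT coefficient by its quantization constant and round to an integer."""
    return np.rint(dct_block / quant_table).astype(np.int32)

def dequantize(q_block: np.ndarray, quant_table: np.ndarray) -> np.ndarray:
    """Approximate reconstruction used by the decoder."""
    return q_block * quant_table

# Larger quantization constants mean fewer bits per coefficient and higher compression
# (a fuzzier image); smaller constants preserve more detail.
```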
As shown in
Compared with the approaches shown in
For displaying the image sequence, it is necessary to determine the type of each image frame. The header of each image frame is arranged to contain data enabling a decision to be made at 240 whether the image frame is a background frame or a foreground frame, for example by adding one bit of data to the image frame header having the value 1 for a background frame and 0 for a foreground frame. If an image frame is a background frame, it is used at 260 to replace the background image data stored in a background buffer 250 of the receiver. Using a standard JPEG decoder, the background image frame can be decoded and displayed directly at 270, 280. If an image frame is a foreground frame, foreground/background composition 255 is needed to display the image correctly. The foreground/background composition takes the background image data from the background buffer 250 of the receiver, uses the foreground block data in the foreground frame to replace the corresponding blocks of the background image, and forms a complete foreground JPEG image for display at 290, 280. As the foreground/background composition only involves replacing background blocks with foreground blocks, the computational complexity at the receiver side is minimized.
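A minimal sketch of this receiver-side composition is given below, assuming decoded images as NumPy arrays and a hypothetical mapping from foreground block coordinates to decoded pixel blocks; the block size of 8 is illustrative:

```python
# Sketch of receiver-side foreground/background composition.
# `background` is the decoded image held in the receiver's background buffer;
# `fg_blocks` maps (block_row, block_col) -> decoded pixel block for the foreground frame.
import numpy as np

BLOCK = 8  # illustrative block size

def compose(background: np.ndarray, fg_blocks: dict) -> np.ndarray:
    composed = background.copy()
    for (br, bc), pixels in fg_blocks.items():
        r, c = br * BLOCK, bc * BLOCK
        composed[r:r + BLOCK, c:c + BLOCK] = pixels   # replace the background block
    return composed                                   # complete frame ready for display
```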
The embodiments described above are intended to be illustrative, and not limiting of the invention, the scope of which is to be determined from the appended claims. In particular, the image processing method disclosed is not solely applicable to surveillance applications and may be used in other applications where only some image data is expected to change from one time to the next. Furthermore, although the described method uses JPEG compressed images, it is not limited to these, and other compressed image formats may be employed, depending upon the application, provided the semantics of the uncompressed image can be derived from the compressed data to allow a decision to be made on whether a portion of the data has changed. The camera shown need not be a network camera.
This application is a continuation of pending U.S. patent application Ser. No. 10/483,992, filed Jan. 23, 2004, which is a National Stage Application of PCT/SG01/00158, filed Jul. 25, 2001, the disclosures of which are expressly incorporated herein by reference in their entireties.
Related Application Data: Parent application Ser. No. 10/483,992 (US); child application Ser. No. 11/039,883, filed January 2005 (US).