The invention relates to a method of processing digital coded video data available in the form of a video stream consisting of consecutive frames divided into slices, said frames including at least I-frames, coded without any reference to other frames, P-frames, temporally disposed between said I-frames and predicted from at least a previous I- or P-frame, and B-frames, temporally disposed between an I-frame and a P-frame, or between two P-frames, and bidirectionally predicted from at least these two frames between which they are disposed.
Content analysis techniques are based on algorithms such as multimedia processing (image and audio processing), pattern recognition and artificial intelligence that aim at automatically creating annotations of video material. These annotations vary from low-level signal-related properties, such as color and texture, to higher-level information, such as the presence and location of faces. The results of the content analysis thus performed are used for many content-based applications such as commercial detection, scene-based chaptering, video previews and video summaries.
Both the established standards (e.g. MPEG-2, H.263) and the emerging standards (e.g. H.264/AVC, shortly described for instance in: "Emerging H.264 standard: Overview" and in "TMS320C64x Digital Media Platform Implementation", white paper, at: http://www.ubvideo.com/public) inherently use the concept of block-based motion-compensated coding. Accordingly, video is represented as a hierarchy of syntax elements describing picture attributes (e.g. size and rate), spatio-temporal interrelationships, and the decoding procedure for building the 2D data blocks that will ultimately compose an approximation of the original signal. The first step in obtaining such a representation is the conversion of the RGB data matrix of a picture into a YUV matrix (the RGB color space representation is the one most used for image acquisition and rendering), so that the luminance (Y) and the two chrominance components (U, V) can be coded separately. Usually, the U and V frames are first down-sampled by a factor of 2 in the horizontal and vertical directions, to obtain the so-called 4:2:0 format and thereby halve the amount of data to be coded (this is justified by the lower susceptibility of the human eye to color changes compared to changes in the luminance). Each of the frames is further divided into a plurality of non-overlapping blocks, of size 16×16 pixels for the luminance and 8×8 pixels for the downsized chrominance. The combination of a 16×16 luminance block and the two corresponding 8×8 chrominance blocks is designated as a macroblock (or MB), the basic encoding unit. These conventions are common to all standards, and the differences between the various encoding standards (MPEG-2, H.263 and H.264/AVC) mainly concern the options, techniques and procedures for partitioning an MB into smaller blocks, for coding the sub-blocks, and for organizing the bitstream.
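By way of illustration only, the following sketch (in Python; the BT.601 conversion coefficients are assumed here merely as an example, the exact coefficients being defined by the applicable standard) shows how a picture may be converted from RGB to the 4:2:0 YUV format described above:

    import numpy as np

    def rgb_to_yuv420(rgb):
        """Convert an RGB picture (H x W x 3 array) to Y, U, V planes in 4:2:0 format."""
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        # Luminance and chrominance components (BT.601 coefficients, assumed).
        y = 0.299 * r + 0.587 * g + 0.114 * b
        u = 0.492 * (b - y)
        v = 0.877 * (r - y)
        # Down-sample U and V by a factor of 2 horizontally and vertically (4:2:0),
        # halving the total amount of data to be coded.
        u420 = 0.25 * (u[0::2, 0::2] + u[1::2, 0::2] + u[0::2, 1::2] + u[1::2, 1::2])
        v420 = 0.25 * (v[0::2, 0::2] + v[1::2, 0::2] + v[0::2, 1::2] + v[1::2, 1::2])
        return y, u420, v420

In this format, a 16×16 block of the Y plane and the two corresponding 8×8 blocks of the down-sampled U and V planes together form one macroblock.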
Without going into the details of all coding techniques, it can be pointed out that all standards use two basic types of coding: intra and inter (motion-compensated). In the intra mode, pixels of an image block are coded by themselves, without any reference to other pixels, or possibly based (only in H.264) on prediction from previously coded and reconstructed pixels in the same picture. The inter mode inherently uses temporal prediction, whereby an image block in a certain picture is predicted by its "best match" in a previously coded and reconstructed reference picture. Then, the pixel-wise difference (or prediction error) between the actual block and its estimate, and the relative displacement of the estimate (or motion vector) with respect to the coordinates of the actual block, are coded separately.
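The inter mode may be illustrated by the following simplified sketch of an exhaustive block-matching search (a hypothetical motion estimator given only for clarity; actual encoders use faster search strategies), which returns the motion vector and the prediction error for a 16×16 luminance block:

    import numpy as np

    def best_match(block, ref, bx, by, search=16):
        """Find the best match of a 16x16 block located at (bx, by) in the current
        picture, within a search window of a previously coded and reconstructed
        reference picture; returns the motion vector and the prediction error."""
        h, w = ref.shape
        best_sad, best_mv = float("inf"), (0, 0)
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                x, y = bx + dx, by + dy
                if x < 0 or y < 0 or x + 16 > w or y + 16 > h:
                    continue
                # Sum of absolute differences as the matching criterion.
                sad = np.abs(block - ref[y:y + 16, x:x + 16]).sum()
                if sad < best_sad:
                    best_sad, best_mv = sad, (dx, dy)
        dx, dy = best_mv
        error = block - ref[by + dy:by + dy + 16, bx + dx:bx + dx + 16]
        return best_mv, error  # motion vector and prediction error, coded separately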
Depending on the coding type, three basic types of pictures (or frames) are defined: I-pictures, allowing only intra coding, P-pictures, allowing also inter coding based on forward prediction, and B-pictures, further allowing inter coding based on backward or bi-directional prediction.
Hence, the coded video sequence is defined with a hierarchy of layers.
H.264/AVC is the newest joint video coding standard of ITU-T and ISO/IEC MPEG, which has recently been officially approved as ITU-T Recommendation H.264 and ISO/IEC International Standard 14496-10 (MPEG-4 Part 10) Advanced Video Coding (AVC). The main goals of the H.264/AVC standardization have been to significantly improve compression efficiency (by halving the number of bits needed to achieve a given video fidelity) and network adaptation. Presently, H.264/AVC is broadly recognized for achieving these goals, and it is currently being considered, by forums such as DVB, the DVD Forum and 3GPP, for adoption in several application domains (next-generation wireless communication, videophony, HDTV storage and broadcast, VOD, etc.). On the Internet, there is a growing number of sites offering information about H.264/AVC, among which an official database of the ITU-T/MPEG JVT (Joint Video Team) ("Official H.264 documents and software of the JVT", at: ftp://ftp.imtc-files.org/jvt-experts/) provides free access to documents reflecting the development and status of H.264/AVC, including the draft updates.
The aforementioned flexibility of H.264 to adapt to a variety of networks and to provide robustness to data errors/losses is enabled by several design aspects, among which the following ones are most relevant for the invention described some paragraphs later:
(a) NAL units (NAL=Network Abstraction Layer): a NAL unit (NALU) is the basic logical data unit in H.264/AVC, effectively composed of an integer number of bytes including video and non-video data. The first byte of each NAL unit is a header byte that indicates the type of data in the NAL unit, and the remaining bytes contain the payload data of the type indicated by the header. The NAL unit structure definition specifies a generic format for use in both packet-oriented (e.g. RTP) and bitstream-oriented (e.g. H.320 and MPEG-2|H.222) transport systems, and a series of NALUs generated by an encoder is referred to as a NALU stream.
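A minimal sketch of how the NALUs of a byte-stream-oriented H.264/AVC stream may be located, and their header byte interpreted, is given below (the bit layout of the header byte and the nal_unit_type values are those defined by the standard; the rest of the routine is an illustrative assumption):

    def iter_nal_units(stream: bytes):
        """Split an H.264/AVC byte stream into NAL units at start codes (0x000001)."""
        i = stream.find(b"\x00\x00\x01")
        while i != -1:
            start = i + 3
            nxt = stream.find(b"\x00\x00\x01", start)
            payload = stream[start:nxt if nxt != -1 else len(stream)]
            if payload:  # skip empty spans (e.g. caused by 4-byte start codes)
                header = payload[0]  # first byte: the NALU header
                yield {
                    "nal_ref_idc": (header >> 5) & 0x03,  # importance as a reference
                    "nal_unit_type": header & 0x1F,       # e.g. 7 = sequence parameter
                                                          # set, 8 = picture parameter
                                                          # set, 1 and 5 = coded slices
                    "payload": payload,
                }
            i = nxt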
(b) Parameter sets: a parameter set contains information that is expected to change rarely and that applies to a large number of NAL units. Hence, the parameter set can be separated from the other data, for more flexible and robust handling (in the previous standards, the header information is repeated more frequently in the stream, and the loss of a few key bits of such information could have a severe negative impact on the decoding process). There are two types of parameter sets: the sequence parameter sets, which apply to a series of consecutive coded pictures called a sequence, and the picture parameter sets, which apply to the decoding of one or more pictures within a sequence.
(c) Flexible macroblock ordering (FMO): FMO refers to a new ability to partition a picture into regions called slice groups, with each slice becoming an independently decodable subset of a slice group. Each slice group is a set of macroblocks defined by a macroblock to slice group map, which is specified by the content of the picture parameter set (see above) and some information from the slice headers. Using FMO, a picture can be split into many macroblock scanning patterns, such as those shown in the accompanying drawings.
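The principle of such a macroblock to slice group map may be sketched as follows, here for a hypothetical "foreground and left-over" pattern in which a rectangular region of interest is placed in its own slice group (the picture dimensions and the rectangle are given merely as an example):

    def fmo_map_foreground(pic_w_mbs, pic_h_mbs, roi):
        """Build a macroblock to slice group map for one rectangular foreground
        region; roi = (left, top, right, bottom), in macroblock units. Macroblocks
        inside the rectangle go to slice group 0, all remaining ones to group 1."""
        left, top, right, bottom = roi
        return [0 if (left <= x <= right and top <= y <= bottom) else 1
                for y in range(pic_h_mbs) for x in range(pic_w_mbs)]

    # For a QCIF picture (11 x 9 macroblocks) with a face in macroblocks 3..7 x 2..5:
    # mb_map = fmo_map_foreground(11, 9, (3, 2, 7, 5))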
Recent advances in computing, communications and digital data storage have led to a tremendous growth of large digital archives in both the professional and the consumer environment. Because these archives are characterized by a steadily increasing capacity and content variety, finding efficient ways to quickly retrieve stored information of interest is of crucial importance. Searching manually through terabytes of unorganized stored data is however tedious and time-consuming, and there is consequently a growing need to transfer information search and retrieval tasks to automated systems.
Search and retrieval in large archives of unstructured video content is usually performed after the content has been indexed using content analysis techniques, based on algorithms such as those indicated above. Detecting the presence and location of particular objects (e.g. faces, superimposed text) and tracking them among video frames is an important task for automatic annotation and indexing of content. Without any a priori knowledge of the possible location of objects, object detection algorithms need to scan the entire frames, which entails a considerable consumption of computational resources.
It is an object of the invention to propose a method allowing the use of regions of interest (ROI) coding in H.264/AVC video to be detected with better computational efficiency, by looking at the stream syntax.
To this end, the invention relates to a processing method such as defined in the introductory paragraph of the description and which comprises the steps of:
Content analysis algorithms (e.g. face detection, object detection, etc.) including this technical solution can focus on the regions of interest rather than blindly scanning the whole picture. Alternatively, the content analysis algorithms could be applied to different regions in parallel, which would increase the computational efficiency.
The present invention will now be described, by way of example, with reference to the accompanying drawings in which:
Considering the described ability of FMO to flexibly slice a picture, it is expected that FMO will be widely exploited for the ROI type of coding. This type of coding refers to the unequal coding of video or picture segments, depending on the content (for example, in videoconferencing applications, picture regions capturing the face of a speaker can be coded with better quality compared to the background). FMO could be applied here in such a way that a separate slice in each picture would be assigned to the region encompassing the face, and a smaller quantization step could further be chosen in such a slice, to enhance the picture quality.
Based on this consideration, it is proposed to analyze the FMO usage in the stream, as a means to indicate that ROI coding may have been applied in a certain part of the stream. To enhance the ROI indication, and eventually enable detection of ROI boundaries, the FMO information is combined with the information extracted from the slice headers and possibly other data in the stream characterizing a slice. This additional information may relate to physical attributes of a slice, such as its size and relative position in the picture, or to coding decisions, such as the default quantization scale for the macroblocks contained in the slice (e.g. the "GQUANT" parameter).
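A sketch of how such combined slice information may be collected and used as a first clue is given below (the record fields and the threshold of six quantization steps are assumptions made only for illustration, not values imposed by the invention):

    from dataclasses import dataclass

    @dataclass
    class SliceStats:
        """Slice coding parameters extracted from a slice header (illustrative)."""
        first_mb: int     # relative position: address of the first macroblock
        num_mbs: int      # size of the slice, in macroblocks
        slice_group: int  # slice group, per the macroblock to slice group map
        qp: int           # default quantization scale for the slice

    def roi_clue(slices):
        """True if one slice of a picture is coded markedly finer than the others."""
        qps = [s.qp for s in slices]
        return min(qps) <= max(qps) - 6  # assumed threshold of 6 quantization steps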
An application that can largely benefit from the proposed detection of ROI coding is content analysis. For example, a typical goal of content analysis in many applications is face recognition, which is usually preceded by separately performed face detection. The method described here may in particular be exploited in the latter, in such a way that the face detection algorithm would be targeted at the few most important slices, rather than being applied blindly across the whole picture. Alternatively, the algorithms could be applied to different slices in parallel, which would increase the computational efficiency. ROI coding may also be used in applications other than videoconferencing. For example, in movie scenes, parts of the content are often in focus while other parts are out of focus, which often corresponds to the separation of the foreground and background in a scene. Hence, it is conceivable that these parts may be separated and unequally coded during the authoring process. Detecting such ROI coding by means of the present method can be helpful in enabling a more selective use of content analysis algorithms.
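Targeting the analysis at the few most important slices, rather than at the whole picture, may be sketched as follows (detect_faces stands for any face detection algorithm and is a placeholder, not part of the invention):

    def analyze_rois(picture, roi_slices, detect_faces):
        """Run a content analysis algorithm only on the detected ROI slices."""
        # roi_slices: list of (x, y, w, h) pixel rectangles covering the ROI slices.
        # The regions could equally be dispatched to parallel workers (e.g. with
        # concurrent.futures) to further increase the computational efficiency.
        results = []
        for (x, y, w, h) in roi_slices:
            region = picture[y:y + h, x:x + w]
            results.extend(detect_faces(region))  # far less area than the full picture
        return results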
A processing device for the implementation of the method according to the invention is shown in the accompanying drawings.
The output signals of said unit 425 constitute statistical information related to FMO. Said information is received by a ROI detection and identification circuit 43, which combines this FMO information with information extracted from the entropy decoding circuit 421 and related to some structural attributes of the slices of the pictures (such as their size and their relative positions in the pictures, the default quantization scale for the macroblocks within a certain slice, the macroblock to slice group map characterizing FMO, etc., said attributes being called slice coding parameters). It can be noted that the FMO information is conveyed by a parameter set which, depending on the application and transport protocol, may be either multiplexed in the H.264/AVC stream or transported separately through a reliable channel RCH, as illustrated by dotted lines in the accompanying drawings.
As said above, the principle of the invention is to analyze, through a series of consecutive pictures, the statistics of the syntax elements related to FMO and the slice layer information (and possibly other data in the stream characterizing a slice), said analysis being for instance based on comparisons with predetermined thresholds. For example, the presence of FMO will be inspected, and the amount by which the number, the relative position and the size of the slices may change along a number of consecutive pictures will be analyzed; this analysis, in view of the detection and identification of the use of ROIs in the coded stream, is done in the ROI detection and identification circuit 43. In the case of the H.264 standard, the central idea of the invention is to detect potential ROIs by detecting the use of FMO along a series of consecutive H.264-coded pictures, and to employ a statistical analysis of the amount by which the number, relative position and size of such flexible slices may change from picture to picture. All the relevant information can be extracted by parsing the relevant syntax elements from the H.264 bitstream. An example is given hereafter.
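By way of example, such an analysis may be sketched as follows, re-using the SliceStats records introduced above (the window length and the thresholds are assumed values, given only to make the illustration concrete, and only the first slice of each picture is tracked in this simplified sketch):

    def detect_roi_coding(pictures, window=30, max_count_var=1, max_drift_mbs=2):
        """Detect likely ROI coding from per-picture slice statistics.
        pictures: one list of SliceStats per picture (see the sketch above); a
        stable number of slices whose position and size change little over
        `window` consecutive pictures is taken as an indication of ROI coding."""
        if len(pictures) < window:
            return False
        recent = pictures[-window:]
        counts = [len(p) for p in recent]
        if max(counts) - min(counts) > max_count_var:
            return False  # the number of slices fluctuates too much
        # Track the first slice of each picture as a representative candidate ROI.
        starts = [p[0].first_mb for p in recent]
        sizes = [p[0].num_mbs for p in recent]
        return (max(starts) - min(starts) <= max_drift_mbs
                and max(sizes) - min(sizes) <= max_drift_mbs)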
Looking in more detail at how the relevant information is evaluated to arrive at the final decision, different strategies are feasible.
An example of a possible embodiment of a dedicated ROI analyzer will now be described.
The decision logic in any one of the analyzers 61(1) to 61(N) may, in the example illustrated here, check the consistency of the slice statistics along a series of consecutive pictures.
Once a consistency of the statistics has been established, it is a good indication of ROI coding in that part of the content: the slices are assumed to coincide with ROIs, and this information is passed on to enhance the content analysis performed in a content analysis circuit 44. The circuit 44 therefore receives the output of the circuit 43 (control signals sent by means of the connection (1)), the decoded video stream DVS delivered by the motion compensation circuit 424 of the decoder 42, and the decoded audio stream DAS delivered by the audio decoder 52, and, on the basis of said information, identifies the genre of a certain content (such as news, music clips, sport, etc.). The output of the content analysis circuit 44 is constituted of metadata, i.e. of description data of the different levels of information contained in the decoded stream, which are stored in a file 45, e.g. in the form of the commonly used CPI (Characteristic Point Information) table. These metadata are then available for applications such as video summarization and automatic chaptering (it can be recalled, however, that the invention is especially useful in the case of videoconferencing, where it is a common approach to detect and track the face of a speaker such that picture regions corresponding to the face can be coded with better quality, or more robustly, compared to regions corresponding to the background).
In an improved embodiment, the output of the content analysis circuit 44 can be transmitted back (by means of the connection (2)) to the ROI detection and identification circuit 43, thus providing an additional clue about e.g. the likelihood of ROI coding in that content.
Foreign application priority data: No. 04300758.2, filed Nov. 2004, EP (regional).
PCT filing data: PCT/IB2005/053534, filed 10/28/2005, WO, kind 00, 371(c) date 4/30/2007.