The present invention relates to visual communication systems, and in particular, the invention relates to a method and device for providing temporal up-conversion in video telephony systems for enhanced quality of visual images.
In general terms, video quality is a key characteristic for the global acceptance of video telephony applications. It is critical that video telephony systems convey the situation at the other side to end users as accurately as possible, in order to enhance the users' situational awareness and thereby the perceived quality of the video call.
Although video conferencing systems have gained considerable attention since they were first introduced many years ago, they have not become extremely popular, and a wide breakthrough of these systems has not yet taken place. This has generally been due to the insufficient availability of communication bandwidth, leading to unacceptably poor quality of video and audio transmissions, such as low-resolution, blocky images and long delays.
However, recent technological innovations capable of providing sufficient communication bandwidth are becoming more widely available to an increasing number of end users. Further, the availability of powerful computing systems such as PCs, mobile devices, and the like, with integrated displays, cameras, microphones and speakers, is rapidly growing. For these reasons, one may expect a breakthrough and higher quality expectations in the use of consumer video conferencing systems, as the audiovisual quality of video conferencing solutions becomes one of the most important distinguishing factors in this demanding market.
Generally speaking, many conventional algorithms and techniques for improving video conferencing images have been proposed and implemented. For example, various efficient video encoding techniques have been applied to improve video encoding efficiency. In particular, such proposals (see, e.g., S. Daly, et al., “Face-Based Visually-Optimized Image Sequence Coding,” 0-8186-8821-1/98, pages 443-447, IEEE) aim at improving video encoding efficiency based on the selection of a region of interest (ROI) and a region of no interest (RONI). Specifically, the proposed encoding is performed in such a way that most bits are assigned to the ROI and fewer bits are assigned to the RONI. Consequently, the overall bit-rate remains constant, but after decoding, the quality of the image in the ROI is higher than the quality of the image in the RONI. Other proposals, such as U.S. 2004/0070666 A1 to Bober et al., primarily suggest smart zooming techniques applied before video encoding, so that a person in a camera's field of view is zoomed in on by digital means and irrelevant background image portions are not transmitted. In other words, this method transmits an image by coding only the selected regions of interest of each captured image.
However, the conventional techniques described above are often unsatisfactory, for a number of reasons. No further processing or analysis is performed on the captured images to counter the adverse effects on image quality in the transmission of video communication systems. Further, improved coding schemes, although they might give acceptable results, cannot be applied across the board to all coding schemes, and such techniques require that particular video encoding and decoding techniques be implemented in the first place. Also, none of these techniques appropriately addresses the problems of low situational awareness and the poor perceived quality of a video teleconferencing call.
Accordingly, it is an object of the present invention to provide a new and improved method and device that efficiently deals with image quality enhancement, addresses the above-mentioned problems, and is cost-efficient and simple to implement.
To this end, the invention relates to a method of processing video images that comprises the steps of detecting at least one person in an image of a video application, estimating the motion associated with the detected person in the image, segmenting the image into at least one region of interest and at least one region of no interest, where the region of interest includes the detected person in the image, and applying a temporal frame processing to a video signal including the image by using a higher frame rate in the region of interest than that applied in the region of no interest.
One or more of the following features may also be included.
In one aspect of the invention, the temporal frame processing includes a temporal frame up-conversion processing applied to the region of interest. In another aspect, the temporal frame processing includes a temporal frame down-conversion processing applied to the region of no interest.
In yet another aspect, the method also includes combining output information from the temporal frame up-conversion processing step with output information from the temporal frame down-conversion processing step to generate an enhanced output image. Further, the visual image quality enhancement steps can be performed either at a transmitting end or a receiving end of the video signal associated with the image.
Moreover, the step of detecting the person identified in the image of the video application may include detecting lip activity in the image, as well as detecting audio speech activity in the image. Also, the step of applying a temporal frame up-conversion processing to the region of interest may only be carried out when lip activity and/or audio speech activity has been detected.
In other aspects, the method also includes segmenting the image into at least a first region of interest and a second region of interest, selecting the first region of interest to apply the temporal frame up-conversion processing by increasing the frame rate, and leaving a frame rate of the second region of interest untouched.
The invention also relates to a device configured to process video images, where the device includes a detecting module configured to detect at least one person in an image of a video application; a motion estimation module configured to estimate a motion associated with the detected person in the image; a segmenting module configured to segment the image into at least one region of interest and at least one region of no interest, where the region of interest includes the detected person in the image; and at least one processing module configured to apply a temporal frame processing to a video signal including the image by using a higher frame rate in the region of interest than that applied in the region of no interest.
Other features of the method and device are further recited in the dependent claims.
Embodiments may have one or more of the following advantages.
The invention advantageously enhances the visual perception of video conferencing systems for relevant image portions and increases the level of situational awareness by making the visual images associated with the participants or persons who are speaking clearer relative to the remaining part of the image.
Further, the invention can be applied at the transmit end, which results in higher video compression efficiency because relatively more bits are assigned to the enhanced region of interest (ROI) and relatively fewer bits are assigned to the region of no interest (RONI), resulting in an improved transmission process of important and relevant video data, such as facial expressions and the like, for the same bit-rate.
Additionally, the method and device of the present invention can be applied independently of any coding scheme used in video telephony implementations; the invention requires neither video encoding nor decoding. Also, the method can be applied at the camera side in video telephony for an improved camera signal, or at the display side for an improved display signal. Therefore, the invention can be applied both at the transmit and receive ends.
As yet another advantage, the identification process for the detection of a face can be made more robust and fail-proof by combining various face detection techniques or modalities, such as a lip activity detector and/or an audio localization algorithm. Also, as another advantage, computations can be saved because the motion compensated interpolation is applied only in the ROI.
Therefore, with the implementation of the present invention, video quality is greatly enhanced, making for better acceptance of video-telephony applications by increasing the persons' situational awareness and thereby the perceived quality of the video call. Specifically, the present invention is able to transmit higher quality facial expressions for enhanced intelligibility of the images and for conveying different types of facial emotions and expressions. Increasing this type of situational awareness in current-day group video conferencing applications is tantamount to increased usage and reliability, especially when participants on a conference call, for example, are not familiar with the other participants.
These and other aspects of the invention will become apparent from and elucidated with reference to the embodiments described in the following description, drawings and from the claims.
This invention deals with the perceptual enhancement of people in an image in a video telephony system as well as the enhancement of the situational awareness of a video teleconferencing session, for example.
In order to implement the invention, an image segmentation technique needs to be applied for the selection of a ROI containing the participant of the conference call. Therefore, a face tracking module 14 can be used to find, in an image, information 20 regarding face location and size. Various face detection algorithms are well known in the art. For example, to find the face of a person in an image, a skin color detection algorithm, or a combination of skin color detection with elliptical object boundary searching, can be used. Alternatively, additional methods that identify a face by searching for critical features in the image may be used. Therefore, many available robust methods to find and apply efficient object classifiers may be integrated in the present invention.
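By way of illustration only, the following is a minimal sketch of such a skin-color-based face finder. The Cb/Cr threshold values are commonly cited literature defaults rather than values taken from this disclosure, and the function names are hypothetical; a real detector would additionally fit an elliptical boundary to the skin region, which is omitted here.

```python
import numpy as np

def skin_mask(frame_ycbcr: np.ndarray) -> np.ndarray:
    """Classify pixels as skin using fixed Cb/Cr ranges.

    frame_ycbcr: H x W x 3 uint8 array in (Y, Cb, Cr) order.
    The ranges below are common defaults, not values from this text."""
    cb = frame_ycbcr[..., 1].astype(np.int32)
    cr = frame_ycbcr[..., 2].astype(np.int32)
    return (cb >= 77) & (cb <= 127) & (cr >= 133) & (cr <= 173)

def face_bounding_box(mask: np.ndarray):
    """Approximate the face location/size (information 20) by the
    bounding box of all skin pixels; returns (x, y, w, h) or None."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return (int(xs.min()), int(ys.min()),
            int(xs.max() - xs.min() + 1), int(ys.max() - ys.min() + 1))
```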
Subsequent to identifying the face of a participant in the image, a motion estimation module 16 is used to calculate motion vector fields 18. Thereafter, using the information 20 regarding face location and size, a ROI/RONI segmentation 22 is performed around the participant, for example using a simple head-and-shoulders model. Alternatively, a ROI may be tracked using motion detection (not motion estimation) on a block-by-block basis. In other words, an object is formed by grouping blocks in which motion has been detected, with the ROI being the object with the most moving blocks. Additionally, methods using motion detection save computational complexity for image processing technologies.
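As an illustration of this alternative, the sketch below flags moving 8×8 blocks by frame differencing and derives a ROI from them. The difference threshold is an arbitrary illustrative value, and the sketch collapses the described object grouping into a single bounding box for brevity.

```python
import numpy as np

def moving_block_mask(prev: np.ndarray, curr: np.ndarray,
                      block: int = 8, thresh: float = 12.0) -> np.ndarray:
    """Motion detection (not estimation): flag blocks whose mean absolute
    luminance difference between two frames exceeds a threshold."""
    h = (curr.shape[0] // block) * block
    w = (curr.shape[1] // block) * block
    diff = np.abs(curr[:h, :w].astype(np.float32)
                  - prev[:h, :w].astype(np.float32))
    per_block = diff.reshape(h // block, block,
                             w // block, block).mean(axis=(1, 3))
    return per_block > thresh

def roi_from_motion(block_mask: np.ndarray, block: int = 8):
    """Derive a pixel-domain ROI box (x, y, w, h) from the moving blocks;
    the disclosure keeps the grouped object with the most moving blocks,
    which this sketch approximates with one overall bounding box."""
    ys, xs = np.nonzero(block_mask)
    if xs.size == 0:
        return None
    return (int(xs.min()) * block, int(ys.min()) * block,
            int(xs.max() - xs.min() + 1) * block,
            int(ys.max() - ys.min() + 1) * block)
```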
Next, a ROI/RONI processing takes place. For a ROI segment 24, the pixels within the ROI segment 24 are visually emphasized by a temporal frame rate up-conversion module 26 for visual enhancement. This is combined, for a RONI segment 28, with a temporal frame down-conversion module 30 applied to the remaining image portions, which are to be de-emphasized. Then, the ROI and RONI processed outputs are combined in a recombining module 32 to form the “output” signal 12 (Vout). Using the ROI/RONI processing, the ROI segment 24 is visually improved and brought to a more prominent foreground against the less relevant RONI segment 28.
If a face has been detected in the step 44, then a ROI/RONI segmentation step 50 is performed, which results in a generating step 52 for the ROI segment and a generating step 54 for the RONI. The ROI segment then undergoes a motion-compensated frame up-convert step 56 using the motion vectors generated by the step 48. Similarly, the RONI segment undergoes a frame down-convert step 58. Subsequently, the processed ROI and RONI segments are combined in a combining step 60 to produce an output signal in a step 62. Additionally, if no face has been detected in the face detection step 44, then, in a step 64 (test “conversion down?”), if the image is to be subjected to down-conversion processing, a down-conversion step 66 is performed. On the other hand, if the image is to be left untouched, it simply follows on to the step 62 (direct connection), without the step 66, to generate an unprocessed output signal.
When audio is available because a person is talking, a speech activity detector can be used; for example, a speech activity detector based on the detection of non-stationary events in the audio signal, combined with a pitch detector. At the transmit end, that is, in the audio-in step 81, the “audio in” signal is the microphone input. At the receive end, the “audio in” signal is the received and decoded audio. For increased certainty of audio activity detection, a combined audio/video speech activity detection is performed by a logical AND on the individual detector outputs.
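As a simple sketch of this combination, assuming a crude RMS-based non-stationarity test in place of a full detector (the pitch detector mentioned above is omitted, and the margin value is an illustrative assumption):

```python
import numpy as np

def audio_activity(frame: np.ndarray, noise_rms: float,
                   margin: float = 2.0) -> bool:
    """Flag speech when the short-term RMS of the "audio in" frame
    exceeds the running noise floor by a margin (a stand-in for the
    non-stationary-event detector plus pitch detector described above)."""
    rms = float(np.sqrt(np.mean(frame.astype(np.float64) ** 2)))
    return rms > margin * noise_rms

def combined_speech_activity(audio_flag: bool, lip_flag: bool) -> bool:
    """Combined audio/video decision: logical AND of the individual
    detector outputs, trading recall for fewer false positives."""
    return audio_flag and lip_flag
```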
In the event that two ROIs are detected, a ROI selection module 23 performs the selection of the ROIs that must be processed for image quality enhancement, based on the results of the audio algorithm module 13, which outputs the locations (x, y coordinates) of the sound source or sources (the connection 21 gives the (x, y) locations of the sound sources) together with the speech activity flag 17, and on the results of the lip activity detection module 15, namely the lip activity flag 19. In other words, with multi-microphone conferencing systems, multiple audio inputs are available at the receive end. Then, applying lip activity algorithms in conjunction with audio algorithms, the direction and location (x, y coordinates) from which speech or audio is coming can be determined. This information can be used to target the intended ROI, namely the participant who is currently speaking in the image.
This way, when two or more ROIs are detected by the face tracking module 14, the ROI selection module 23 selects the ROI associated with the person who is speaking, so that this person who is speaking can be given the most visual emphasis, with the remaining persons or participants of the teleconferencing session receiving slight emphasis against the RONI background.
Thereafter, the separate ROI and RONI segments undergo image processing: frame rate up-conversion of the ROI by the ROI up-convert module 26 and frame rate down-conversion of the RONI by the RONI down-convert module 30, using the information output by the motion estimation module 16. Moreover, the ROI segment can include the total number of persons detected by the face tracking module 14. Assuming that persons further away from the speaker are not participating in the video teleconferencing call, the ROI can include only the detected faces or persons that are close enough, as judged by inspection of the detected face size, i.e., those whose face size is larger than a certain percentage of the image size. Alternatively, the ROI segment can include only the person who is speaking, or the person who spoke last when no one else has spoken since.
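The selection logic of module 23 might look as follows. This is a sketch under stated assumptions: the face-size fraction is an illustrative value (the text above only says “a certain percentage”), and the nearest-face-to-sound-source rule is one plausible reading of how the (x, y) sound locations, the speech activity flag 17 and the lip activity flag 19 would be combined.

```python
def select_speaking_roi(faces, lip_flags, sound_xy, speech_flag,
                        image_area, min_face_frac=0.02):
    """Choose which detected faces become the up-converted ROI.

    faces:       list of (x, y, w, h) boxes from the face tracker.
    lip_flags:   per-face lip activity bits (flag 19).
    sound_xy:    (x, y) sound-source location (connection 21), or None.
    speech_flag: the audio speech activity flag 17.
    min_face_frac is an illustrative stand-in for the face-size rule."""
    # Discard faces too small to be participants.
    candidates = [i for i, (x, y, w, h) in enumerate(faces)
                  if w * h >= min_face_frac * image_area]
    if not speech_flag or sound_xy is None or not candidates:
        return candidates          # no confirmed speaker: keep all faces

    def dist_sq(i):
        x, y, w, h = faces[i]
        return ((x + w / 2 - sound_xy[0]) ** 2
                + (y + h / 2 - sound_xy[1]) ** 2)

    # Prefer faces with detected lip activity; fall back to all candidates.
    speaking = [i for i in candidates if lip_flags[i]] or candidates
    return [min(speaking, key=dist_sq)]  # face nearest the sound source
```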
Next, the lip detection step 71 is carried out subsequent to the face detection step 44 and the ROI/RONI segmentation step 50.
Subsequently, the ROI/RONI selection step 102 generates a selected ROI segment (104) that undergoes the frame up-convert step 56. The ROI/RONI selection step 102 also generates other ROI segments (106), on which, in the step 64, if the decision to subject the image to down-conversion is positive, a down-conversion step 66 is performed. On the other hand, if the image is to be left untouched, it simply follows on to the step 60, where it is combined with the temporally up-converted ROI image generated by the step 56 and the RONI image generated by the steps 54 and 66, to eventually arrive at the “video-out” signal in the step 62.
The image 110 can be subdivided into blocks of 8×8 luminance values. For motion estimation, a 3D recursive search method may be used, for example. The result is a two-dimensional motion vector for each of the 8×8 blocks. This motion vector may be denoted by $\vec{D}(\vec{X}, n)$, with the two-dimensional vector $\vec{X}$ containing the spatial x- and y-coordinates of the 8×8 block, and $n$ the time index. The motion vector field is valid at a certain time instance between two original input frames. In order to make the motion vector field valid at another time instance between two original input frames, one may perform motion vector retiming.
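A worked form of the retiming and interpolation step, under the common convention (an assumption, not taken verbatim from this text) that $\vec{D}(\vec{X}, n)$ is estimated between frames $n-1$ and $n$:

```latex
% For an interpolation instant n-1+\alpha with 0 < \alpha < 1, the
% retimed fetch positions for a pixel at \vec{x} are
\[
  \vec{x}_{\mathrm{prev}} = \vec{x} - \alpha\,\vec{D}(\vec{X}, n),
  \qquad
  \vec{x}_{\mathrm{next}} = \vec{x} + (1-\alpha)\,\vec{D}(\vec{X}, n),
\]
% and the motion-compensated interpolated pixel averages the two fetches:
\[
  F_{n-1+\alpha}(\vec{x}) \approx
  \tfrac{1}{2}\bigl(F_{n-1}(\vec{x}_{\mathrm{prev}})
                    + F_{n}(\vec{x}_{\mathrm{next}})\bigr).
\]
```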
As mentioned previously, the robustness of the face tracking mechanism can be improved by combining it with information from a video lip activity detector, which is usable both at the transmit and receive ends, and/or with an audio source tracker, which requires multiple microphone channels and is implemented at the transmit end. Using a combination of these techniques, non-faces that are mistakenly found by the face tracking mechanism can be appropriately rejected.
The ROI/RONI frame rate conversion utilizes a motion estimation process based on the motion vectors of the original image.
For the interpolated picture 134, the pixels belonging to the RONI region 140 are simply copied from the previous original input picture 132A, while the pixels in the ROI are interpolated with motion compensation.
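A sketch of this per-region interpolation is given below. It assumes one motion vector per 8×8 block and a mid-point interpolation instant, and it ignores the sub-pixel accuracy and occlusion handling a production up-converter would need; all names are illustrative.

```python
import numpy as np

def interpolate_frame(prev: np.ndarray, nxt: np.ndarray,
                      vectors: np.ndarray, roi_mask: np.ndarray,
                      alpha: float = 0.5, block: int = 8) -> np.ndarray:
    """Build the interpolated picture: RONI pixels are copied from the
    previous original frame, ROI pixels are motion-compensated averages
    of the two neighbouring originals.

    prev, nxt: H x W uint8 luminance frames (the two original pictures).
    vectors:   (H//block) x (W//block) x 2 block motion vectors D(X, n).
    roi_mask:  (H//block) x (W//block) boolean ROI map."""
    h, w = prev.shape
    out = prev.copy()                     # RONI: copy from previous frame
    for by in range(h // block):
        for bx in range(w // block):
            if not roi_mask[by, bx]:
                continue
            dx, dy = vectors[by, bx]
            for y in range(by * block, (by + 1) * block):
                for x in range(bx * block, (bx + 1) * block):
                    # Retimed fetch positions, clamped to the frame.
                    xp = min(max(int(round(x - alpha * dx)), 0), w - 1)
                    yp = min(max(int(round(y - alpha * dy)), 0), h - 1)
                    xn = min(max(int(round(x + (1 - alpha) * dx)), 0), w - 1)
                    yn = min(max(int(round(y + (1 - alpha) * dy)), 0), h - 1)
                    out[y, x] = (int(prev[yp, xp]) + int(nxt[yn, xn])) // 2
    return out
```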
Additionally, when the background of an image or picture is stationary, the transition boundaries between the ROI and RONI regions are not visible in the resulting output image, because the background pixels within the ROI region are interpolated with the zero motion vector. However, when the background moves, which is oftentimes the case with digital cameras (e.g., unstable hand movements), the boundaries between the ROI and the RONI regions become visible, because the background pixels are calculated with motion compensation within the ROI region while the background pixels are copied from a previous input frame in the RONI region.
While there has been illustrated and described what are presently considered to be the preferred embodiments of the present invention, it will be understood by those of ordinary skill in the art that various other modifications may be made, and equivalents may be substituted, without departing from the true scope of the present invention.
In particular, although the foregoing description related mostly to video teleconferencing, the image quality enhancement method described can be applied to any type of video application, such as those implemented on mobile telephony devices and platforms, home office platforms such as PCs, and the like.
Additionally, many advanced video processing modifications may be made to adapt a particular situation to the teachings of the present invention without departing from the central inventive concept described herein. Furthermore, an embodiment of the present invention may not include all of the features described above. Therefore, it is intended that the present invention not be limited to the particular embodiments disclosed, but that the invention include all embodiments falling within the scope of the appended claims and their equivalents.
Priority Application: 05300594.8, filed Jul 2005, EP (regional).
Filing Document: PCT/IB06/52296, filed 7/7/2006, WO, 371(c) date 1/8/2008.