The present invention relates to a method and devices for easily picking up specific scenes, or for picking up in real time scenes in which specific motions exist, from a large volume of video data by defining specific quantities that characterize the motions in the video frames to be displayed, in such video systems as storage devices for recording television broadcast programs and video images, and in systems for monitoring video scenes.
The method and devices of the present invention can be applied to detect irregular scenes in remote monitoring systems that monitor video images of traffic and/or security in malls, i.e., monitors for illegal parking, illegal driving, traffic violence, and criminal offenses; to detect designated scenes on the video monitors of video editors for broadcasting program services, digital libraries, and production lines; to retrieve desired information in directory services utilizing multimedia technology, electronic commerce systems, and television shopping; and to detect desired scenes in television program recorders and set-top boxes.
Multimedia telecasting has brought forth a new era in which a huge volume of video data is broadcast on television, and a variety of video contents are distributed to every home via the Internet, which has become popular.
In the home appliance industry, inexpensive video recorders which can store a large volume of video contents have become practical owing to advances in optical technology, e.g., DVDs, and in magnetic recording technology. Although a large volume of video contents (motion images) can easily be stored in HDD recorders and home servers, database systems of a new type are expected to be put into practical use so that everyone can restore designated specific scenes at any time and anywhere.
A patent document and non-patent documents 1 and 2 as the prior art disclose that each video frame on a video stream (a series of motion images) is dissected (or divided) into a plurality of blocks, and specific scenes are restored in accordance with the motion vector magnitudes found in each block. In accordance with the technologies disclosed in the prior art, whether the detected scenes resemble the designated ones or not can be decided by statistically analyzing the information on the motion in the video stream, acquiring as characteristic parameters the changes and their specific parameters in the motion quantities on the video stream, and comparing the specific parameters between the reference images and the target images to be retrieved.
The principle of operation of the specific scene restoration means as disclosed in both the patent document and the non-patent document 1 is as follows:
When specific scenes are detected from a series of target scenes to be retrieved, the detection rate (recognized as the precision of retrieving scenes) is defined in the disclosed technical materials as the percentage of the detected specific scenes to the total number of target scenes. The detection rate for detecting resembling scenes comprises the recall rate and the precision rate in accordance with non-patent document 1.
For instance, the recall rate and precision rate for the pitching scenes of baseball games are respectively defined as:
Recall rate = (Number of pitching scenes correctly decided) / (Actual number of pitching scenes)
Precision rate = (Number of pitching scenes correctly decided) / (Number of pitching scenes decided in the retrieval)
In accordance with the technology level at that time disclosed in non-patent document 1, the maximum recall rate for the pitching scenes of a baseball game was 92.86% and the maximum precision rate was 74.59%; these detection rates were unsatisfactory. Said technologies are considered suitable for generally restoring designated scenes, but not for use in video databases where high detection rates are needed. The high erroneous detection rates of said specific scene restoration means and devices might be due to the reasons which will be described hereafter.
In accordance with the technologies disclosed heretofore,
On the other hand, non-patent document 2 provides character recognition means utilizing multi-dimensional information, but does not provide specific scene restoration means having a detection rate sufficient to easily detect and pick up specific scenes from a large volume of video data, or to detect in real time such scenes as those in which specific motions exist.
In non-patent document 2, a threshold for discriminating another data set, to which other data of incidence belong, each datum having a certain value of the Mahalanobis distance, can be seen. However, none of these documents defines a method of setting the threshold uniquely. The threshold is empirically set in accordance with the frequency distribution of incidence of the data in the data set being compared with the reference scene.
The objectives of the present invention are to provide specific scene restoration systems having detection rates sufficient to detect specific scenes satisfactorily, in order to easily pick up designated specific scenes from a large volume of video data, or in order to detect in real time scenes in which specific motions exist.
The above objectives may be attained by a method of restoring specific scenes in which specific motion quantities are defined by employing the motion vector distributions over the dissected block areas, i.e., a method and devices for restoring, from the population of video contents, the specific video contents which contain the designated specific scene (hereafter called the “reference scene”) that the customer wishes to watch; the method comprises the following steps:
The Mahalanobis distance is defined as the squared distance measured from the center of gravity (average), divided by the standard deviation, wherein the distance is given in terms of probability.
The multi-dimensional Mahalanobis distance is a measure of the distances among samples of frames distributed over a multidimensional space, which are correlated with each other through the correlation coefficients of a correlation coefficient matrix, and it can be used for precisely deciding whether a number of distributed samples of frames belong to a single group whose attribute resembles the reference scene. Thus, in units of said distance, we can decide whether a plurality of distributed samples belong to a specific group of samples or not.
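Purely as an illustrative, non-limiting sketch of this measure (the synthetic reference data, sample values, and variable names below are hypothetical, and the standard covariance-based form is used rather than the block-wise form described later), the squared Mahalanobis distance of a sample from a reference group may be computed as follows:

```python
import numpy as np

# Reference group: rows are samples (e.g., frames), columns are features
# (e.g., per-block motion quantities).  Synthetic values for illustration only.
ref = np.array([[1.0, 2.0, 1.5],
                [1.2, 1.8, 1.4],
                [0.9, 2.1, 1.6],
                [1.1, 2.0, 1.5]])

mean = ref.mean(axis=0)            # center of gravity (average)
cov = np.cov(ref, rowvar=False)    # covariance matrix of the group
cov_inv = np.linalg.pinv(cov)      # (pseudo-)inverse for numerical stability

def mahalanobis_sq(x):
    """Squared Mahalanobis distance of sample x from the reference group."""
    d = x - mean
    return float(d @ cov_inv @ d)

print(mahalanobis_sq(np.array([1.05, 1.95, 1.5])))  # small: resembles the group
print(mahalanobis_sq(np.array([5.0, 0.0, 3.0])))    # large: does not resemble it
```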
High-precision, high-speed scene detection means can thus be realized, wherein the specific scene can precisely be restored on demand from a large volume of video program contents.
Since the video monitoring system has the capability to detect scene changes, it can detect irregular scenes with ease without any special video channel switching means, thereby making the monitoring of video contents easier.
Control prepares the specific parameters (reference parameters) derived from the scene to be restored (called the reference scene), on the basis of the flow (S1 through S6) in the left-hand side of the flowchart of
Next, a Mahalanobis distance D2 is calculated for the scene which might contain the target scene on the video frames taken out of the population of video contents, in order to decide whether the scene taken out of said video contents resembles the reference scene or not, in accordance with the flow (X1 through X5) in the right-hand side of the flowchart. During the calculation steps X1 through X5, specific parameters (a) through (e) of said reference scene are employed.
Following the preprocessing steps mentioned above, control moves to the “compare” step (X6) shown at the bottom of the flowchart, and decides whether D2 is equal to or smaller than Dt2 or not. On condition that D2≦Dt2 holds, control recognizes during the decision step that the series of contiguous frames on which the decision has been made belongs to frames which resemble those of the reference scene, and this target scene is decided to be restored.
For obtaining the respective parameters mentioned above, control inputs S contiguous frames of the reference scene to the system and dissects the respective frames into N (=k×k) blocks. Control performs the processing for one target frame taken out of the video contents, on which the decision is to be made, at a time. Each target frame is dissected into N blocks in the same manner as for the reference scene. N is an integer in the range 100>N>4, and desirably 36>N>9. These limits are chosen to properly reduce the processing time for calculating the motion quantities of the respective target frames.
The motion quantity of each block is given by expression (1) on the basis of the motion vectors in each block as:

m = |V1| + |V2| + . . . + |Vn|  (1)
where m is the motion quantity and Vi is a motion vector. The upper bound n of subscript i is the number of units for calculating motion vectors in each block. For instance, if a frame is dissected into 9=3×3 blocks, and if each block consists of 10×15 unit cells, each consisting of 16×16 pixels, for calculating motion vectors, n is given as 150, assuming that a frame consists of 720×480 pixels.
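As a non-limiting sketch of expression (1) under the assumption that the motion vector magnitudes of the macro blocks of a frame are already available as a two-dimensional array (the array contents and function name below are hypothetical), the per-block motion quantities for the 720×480-pixel, 3×3-block example above may be accumulated as follows:

```python
import numpy as np

# Assumed input (illustrative): a 720x480 frame yields 45x30 macro blocks of
# 16x16 pixels; mv_mag[r, c] holds the motion vector magnitude of macro block (r, c).
mb_rows, mb_cols = 30, 45                    # 480/16 rows, 720/16 columns
mv_mag = np.random.rand(mb_rows, mb_cols)    # placeholder magnitudes

K = 3                                        # frame dissected into K x K = 9 blocks

def motion_quantities(mag, k=K):
    """Expression (1): m = sum of the motion vector magnitudes in each block."""
    rows, cols = mag.shape
    br, bc = rows // k, cols // k            # macro blocks per block (here 10 x 15 = 150)
    m = np.empty(k * k)
    for i in range(k):
        for j in range(k):
            block = mag[i * br:(i + 1) * br, j * bc:(j + 1) * bc]
            m[i * k + j] = block.sum()
    return m                                  # vector of N = k*k motion quantities

m = motion_quantities(mv_mag)
print(m.shape)   # (9,)
```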
The Mahalanobis distance D2 is calculated in the following manner.
We obtain the correlation coefficient matrix R for the motion quantities between the respective blocks on a frame, in terms of the correlation coefficients given by expression (2):

where rnm and rmn are the elements of the correlation coefficient matrix R for the respective motion quantities, Mns and Mms are the normalized motion quantities, and S is the number of frames.
For instance, in case of a 3×3 matrix:
We obtain Mahalanobis distance D2 of the motion quantities of the respective blocks on each frame, in accordance with S5 of
D2 = (V R−1 Vt)/N  (3)

where V is the row vector of the normalized motion quantities of the frame, Vt is its transposed vector, R−1 is the inverse matrix of the correlation coefficient matrix R, and N is the number of blocks.
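A minimal sketch of expressions (2) and (3), assuming that the per-block motion quantities of the S reference frames are held in an S×N array and using a library correlation routine as an equivalent of expression (2) (the array contents, counts, and names below are hypothetical), is given below:

```python
import numpy as np

# Assumed input (illustrative): motion quantities of S reference frames,
# one row per frame, one column per block (S x N).
S, N = 60, 9
ref_m = np.random.rand(S, N)

mp = ref_m.mean(axis=0)                    # reference averages per block
msd = ref_m.std(axis=0)                    # reference standard deviations per block
ref_norm = (ref_m - mp) / msd              # normalized motion quantities M_ns

R = np.corrcoef(ref_norm, rowvar=False)    # correlation coefficient matrix R (N x N), expression (2)
R_inv = np.linalg.pinv(R)                  # inverse matrix R^-1 (pseudo-inverse for safety)

def mahalanobis_d2(m_frame):
    """Expression (3): D2 = (V R^-1 Vt) / N for one frame's motion quantities."""
    v = (m_frame - mp) / msd               # normalized row vector V
    return float(v @ R_inv @ v) / N

Ds2 = np.array([mahalanobis_d2(row) for row in ref_m])   # Ds2 of the reference frames
print(Ds2.mean(), Ds2.std())
```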
On the other hand, a threshold for discriminating another data set, to which other data of incidence belong, each datum having a certain value of the Mahalanobis distance, can be seen in non-patent document 2. However, none of these documents defines a method of setting the threshold uniquely. The threshold is empirically set in accordance with the frequency distribution of incidence of the data in the data set being compared with the reference scene.
In accordance with the method of the present invention, the threshold to discriminate whether the data set under consideration is that of the reference scenes or that of the non-reference scenes is set, taking into consideration the detection rates (the recall rate and the precision rate) of the scenes to be picked up, so that the threshold lies where said pair of data sets come nearest to each other on the Mahalanobis distance scale. Since this method of setting the threshold provides an objective decision criterion specified on the basis of the normalized statistical frequency distribution of incidence of the data, the threshold is valid for all video contents and is in principle independent of the decision criteria for particular video contents.
We calculate the Mahalanobis distance Ds2 for each of the frames containing the reference scene in order to make a decision on the likelihood between the target scene, on which the decision is to be made, and the reference scene; and we calculate the threshold Dt2 for use in making the decision on said likelihood in terms of the average and the standard deviation of Ds2, which have been calculated for the S contiguous frames.
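Reusing the hypothetical Ds2 array of the earlier sketch, the threshold Dt2 may then be derived as follows (the Ds2 values shown are placeholders only):

```python
import numpy as np

# Ds2: squared Mahalanobis distances of the S contiguous reference frames
# (see the previous sketch); placeholder values for illustration.
Ds2 = np.array([0.7, 0.9, 1.1, 0.95, 1.0, 0.85])

Dt2 = Ds2.mean() + Ds2.std()      # threshold: average of Ds2 plus its standard deviation
print(Dt2)

def resembles_reference(d2, dt2=Dt2):
    """Decision for a single frame: True if D2 <= Dt2."""
    return d2 <= dt2
```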
The frequency distribution of incidence of the Mahalanobis distance D2 exhibits the highest frequency when D2 is at its average, with frequencies decreasing around the average of D2 (average-2).
The frequency distribution of incidence of the Mahalanobis distance D2 for each frame of a non-pitching scene, on which a decision is to be made, is defined by the distribution of the Mahalanobis distance measured from the reference scene, and the values of D2 in the frequency distribution for the non-pitching scenes occupy a range in which these values are generally larger than those of the reference scene. Deviations in the frequency distributions of incidence of the Mahalanobis distance D2 are determined by the characteristics of the frames of the non-pitching scenes, on each of which a decision is to be made.
The recall rate and precision rate for the pitching scenes of a baseball game are respectively defined as:
Recall rate=(Number of pitching scenes correctly detected on the decision)/(Number of actual pitching scenes).
Precision rate=(Number of pitching scenes correctly detected on the decision)/(Number of scenes detected as the pitching scenes on the decision in the retrieval).
We assume that the standard deviations, each denoted as ‘u’, of the pair of frequency distributions of Ds2 for the pitching scenes and of D2 for the non-pitching scenes are the same in value, with different averages. These averages are denoted as Ds2 (average-1, for the pitching scenes) and D2 (average-2, for the non-pitching scenes). We then assume that Ds2 (average-1)<D2 (average-2).
We assume that the threshold Dt2, which is defined by Ds2 (average-1)+Ds2 (standard deviation) for the pitching scenes, is the same in value as the threshold Dt2 which is defined by D2 (average-2)−D2 (standard deviation) for the non-pitching scenes.
In
Under these conditions, the recall rate is given by the hatched area A on the frequency distributions. The precision rate is given by A/(A+C), where C is the meshed area. A is given as 0.841 since u=1, and A/(A+C) is given as 0.841/1.00=0.841. When the pair of frequency distributions have the same value for u=1, the recall rate and the precision rate are the same, namely 0.841. We can understand that the point u=1 is the optimum point, at which the decision on the pitching scenes and the non-pitching scenes can be made with recall and precision rates each greater than 80%.
The threshold Dt2 is defined by the sum of the average of Ds2 and u times (0<u<3) the standard deviation of Ds2; therefore, if ‘u’ is changed to a value other than unity, taking account of the tradeoff between the recall and precision rates, these rates can be set at optimum values in accordance with the characteristics of the frames in which non-pitching scenes can appear.
If u=2.0, the recall rate is 0.9 and the precision rate is 90/(90+50)=0.64. This implies that the recall rate increases while the precision rate decreases.
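As an illustration only, the following sketch reproduces the idealized tradeoff described above, assuming equal-variance normal frequency distributions whose averages are two standard deviations apart and equal numbers of pitching and non-pitching frames; the figures quoted in the text for values of u other than unity were obtained from observed distributions and need not coincide with this model:

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rates(u, separation=2.0):
    """Recall and precision for threshold Dt2 = mean(Ds2) + u * std(Ds2),
    assuming equal-variance normal distributions whose averages are
    'separation' standard deviations apart and equal scene counts."""
    recall = phi(u)                       # area A of the pitching-scene distribution
    false_alarm = phi(u - separation)     # area C of the non-pitching-scene distribution
    precision = recall / (recall + false_alarm)
    return recall, precision

for u in (0.5, 1.0, 1.5, 2.0):
    r, p = rates(u)
    print(f"u={u:.1f}  recall={r:.3f}  precision={p:.3f}")
# At u = 1.0 both rates are about 0.841, the balanced point noted above.
```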
A method for restoring the specific scene of images will be described hereafter as a second embodiment of the present invention, which will be referred to in Claim 2 of the present invention.
Control obtains the Mahalanobis distance D2 for the contiguous target frames, on which the decision is to be made, which have been input from the population of video contents; compares D2 with the threshold Dt2 obtained from the average and standard deviation of Ds2 for the reference scene; and decides that the target frames taken out of the population of video contents belong to the frames of the reference scene on condition that D2≦Dt2 holds for a predetermined number or more of said contiguous target frames.
Means for detecting scene changes will be cited as a variation of the second embodiment of the present invention, which will be referred to as Claim 3 of the present invention.
Control obtains the Mahalanobis distance D2 for the contiguous target frames, on which the decision is to be made, which have been input to the system from the population of video contents; compares D2 with the threshold Dt2 obtained from the average and standard deviation of Ds2 for the reference scene; and decides that said target scene taken out of the population of video contents indicates a scene change on condition that D2≦Dt2 holds for a predetermined number or more of said contiguous target frames and thereafter the expression D2≦Dt2 becomes invalid.
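A sketch of the two decisions described above, namely the frame-run decision for restoring a scene and the detection of the subsequent scene change, might look as follows (the function name, threshold value, and test data are hypothetical):

```python
def detect_scenes(d2_per_frame, dt2, min_run=7):
    """Return (start, end) frame-index pairs of runs in which D2 <= Dt2 holds for
    at least 'min_run' contiguous frames (frames decided to resemble the reference
    scene).  The end of each run marks a scene change, i.e. the point where
    D2 <= Dt2 first becomes invalid again."""
    scenes = []
    run_start = None
    for i, d2 in enumerate(d2_per_frame):
        if d2 <= dt2:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_run:
                scenes.append((run_start, i - 1))   # scene change detected at frame i
            run_start = None
    if run_start is not None and len(d2_per_frame) - run_start >= min_run:
        scenes.append((run_start, len(d2_per_frame) - 1))
    return scenes

# Illustrative use with made-up distances and threshold:
print(detect_scenes([0.5] * 10 + [3.0] * 5 + [0.6] * 8, dt2=1.24, min_run=7))
```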
A device for restoring the specific scene of images will be described as a third embodiment of the present invention, which will be referred to in Claim 4 of the present invention.
The device restores, from the population of video contents, the specific video contents which contain the designated specific scene that the customer wishes to watch. In order to decide on the likelihood of the target scene to the reference scene, said device consists of: a video signal preprocessing unit 12 which performs the preprocessing of the video frames (the target frames on which the decision is to be made) of the target scene taken out of the population of video contents stored in video device 11, and dissects each of said video frames into N=k×k blocks, where N is an integer characterized by 100>N>4, and desirably 36>N>9; a motion vector calculation unit 13 which calculates the motion vectors in each block; a motion quantity calculation unit 14 which calculates the motion quantities m on the basis of the sum of the motion vector magnitudes in each block; a distance calculation unit 15 which calculates the distances of the distributed motion quantities from the reference parameters; a Mahalanobis distance D2 calculation unit 16 which calculates the Mahalanobis distance D2 for the target frame on which the decision is to be made; a comparison unit 17; and a specific parameter holding unit 20 which calculates and holds the specific parameters (reference parameters) defined by the average mp and standard deviation msd of the motion quantities for the reference scene, the inverse matrix R−1 of the correlation coefficient matrix R for the motion quantities in each block, and the threshold Dt2 defined by Ds2 (average)+Ds2 (standard deviation), i.e., the threshold Dt2 defined by the average of Ds2 plus the standard deviation of Ds2. The device is characterized by the comparison unit 17, which compares the Mahalanobis distance D2 with the threshold Dt2 and decides that the target frame belongs to a scene resembling the reference scene on condition that the expression D2≦Dt2 is valid.
The video signal preprocessing unit 12 inputs video signals from such a video device as a television set or a DVD recorder, dissects a frame of the video signals into 9=3×3 blocks, and obtains the motion vector magnitudes in each block. The means to obtain the motion vector magnitudes are, in the present embodiment, the same as those employed in an MPEG2 image compression device. We calculate the distance of motion traversed by the moving object, which will be defined as the motion vector, in units of blocks (each called a “macro block”, abbreviated as “MB” in this specification), each consisting of 16×16 pixels as a cell. The motion vector magnitude is defined by the minimum scalar value obtained by the calculation of expression (4) over the coordinates (a, b) within an MB. In the case that a frame consisting of 720×480 pixels is dissected into 9=3×3 blocks, there are 150 MBs in each block.
where X indicates the value (e.g., brightness) of a pixel. Subscripts i and a indicate positions on the ordinate within an MB, and j and b indicate positions on the abscissa within an MB. Character k indicates the frame number. Expression (4) calculates, for all a- and b-values, the differences between the values of the pixels at ordinate i and abscissa j within the MB of frame number k and those of the pixels at ordinate i±a and abscissa j±b within the MB of frame number k−1; it then calculates the sum of the absolute values of these differences over the respective ordinates and abscissas, resulting in the motion vector quantities (motion vector magnitudes).
We calculate the sum of the motion vector magnitudes, each of which has been obtained for the respective MB, in each block employing expression (1); we then define the sum of the motion vector magnitudes in each block as the motion quantity.
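An illustrative sketch of expression (4), assuming two consecutive grayscale frames held as arrays and a small search range (the search range, function name, and frame data below are placeholders), is given here; the motion vector magnitude is taken, as stated above, to be the minimum sum of absolute pixel differences:

```python
import numpy as np

MB = 16  # macro block size in pixels

def mv_magnitude(prev, curr, top, left, search=7):
    """Expression (4): minimum sum of absolute pixel differences between the
    macro block at (top, left) in the current frame k and the macro block
    displaced by (a, b) in the previous frame k-1, over the search range."""
    block = curr[top:top + MB, left:left + MB].astype(np.int32)
    h, w = prev.shape
    best = None
    for a in range(-search, search + 1):
        for b in range(-search, search + 1):
            r, c = top + a, left + b
            if r < 0 or c < 0 or r + MB > h or c + MB > w:
                continue
            cand = prev[r:r + MB, c:c + MB].astype(np.int32)
            sad = int(np.abs(block - cand).sum())
            if best is None or sad < best:
                best = sad
    return best

# Illustrative use with random 720x480 frames:
prev = np.random.randint(0, 256, (480, 720), dtype=np.uint8)
curr = np.random.randint(0, 256, (480, 720), dtype=np.uint8)
print(mv_magnitude(prev, curr, top=160, left=320))
```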
We dissect a frame into 9=3×3 blocks as shown in
Next, we obtain, for said normalized data, the elements r of the correlation coefficient matrix R of motion quantities among the respective blocks within a frame.
We then calculate a normalized matrix V, the transposed matrix Vt of V, and the correlation coefficient matrix R of motion quantities among the respective blocks within a frame, thereby obtaining the inverse matrix R−1 of R and the Mahalanobis distance Ds2 of the motion quantities among the blocks in each frame.
The threshold, defined by the average of the Mahalanobis distance Ds2 for the reference scene plus its standard deviation, which is denoted as Ds2 (average)+Ds2 (standard deviation), is given as 0.95+0.29=1.24.
A fifth embodiment of restoring the specific scenes in accordance with the present invention will be described referring to a total of 800 frames, on which the decision is to be made, consisting of 20 pitching scenes and 20 non-pitching scenes (a total of 40 scenes) of a baseball game.
We dissected a frame into 9=3×3 blocks, and calculated Mahalanobis distance D2 for each frame in accordance with the motion quantity in each block.
The specific parameters for the reference scene are prepared in accordance with
The recall and precision rates for the respective frames being retrieved are as follows:
Decision 1 (the case of D2≦Dt2), made in accordance with the Mahalanobis distance D2, appeared contiguously for the pitching scenes but not for the non-pitching scenes.
When the number of frames contiguously decided as decision 1 (implying a pitching scene) is set to 7 or more in accordance with the decision criteria, we obtain a recall rate for the scenes of 20/20=100% and a precision rate for the scenes of 20/22=90%. The means to improve the decision rate are cited in Claim 2 of the present invention.
In this case, control need not detect a scene change, which has been set forth as a preliminary condition for the means to restore the specific scenes in the specific scene restoration devices cited in both patent document 1 and non-patent document 1.
How to detect the scene changes in the specific scenes, referring to Claim 3 of the present invention, will be described in the case of pitching scenes.
Foreign application priority data: Number 2004-114997, Date Apr. 2004, Country JP, Kind national.