This application claims the priority of Korean Patent Application No. 2004-11320, filed on Feb. 20, 2004, in the Korean Intellectual Property Office, the disclosure of which is incorporated herein by reference.
1. Field of the Invention
The present invention relates to moving image processing, and more particularly, to a method and an apparatus for detecting an anchorperson shot of a moving image.
2. Description of Related Art
In a conventional method of detecting an anchorperson shot in a broadcasting signal used in a field such as news or in a moving image such as a movie, the anchorperson shot is detected using a template of the anchorperson shot. In this method, format information about the anchorperson shot is assumed and recognized in advance, and the anchorperson shot is extracted using the recognized format information or using a template generated from the color of an anchorperson's face or clothes. However, since a specified template of the anchorperson is used, the performance of detecting an anchorperson shot may be greatly degraded by a change in the format of the anchorperson shot. Furthermore, in a conventional method of detecting the anchorperson shot using the color of the anchorperson's face or clothes, when the color of the anchorperson's face or clothes is similar to that of a background or when the illumination changes, the performance of detecting an anchorperson shot is degraded. In addition, in a conventional method of obtaining anchorperson shot information from a first anchorperson shot, detecting an anchorperson shot is affected by the degree to which the number of anchorpersons or the format of the anchorperson shot changes. That is, when the first anchorperson shot is wrongly detected, the performance of detecting subsequent anchorperson shots is degraded.
Meanwhile, in another conventional method of detecting an anchorperson shot, the anchorperson shot is detected by clustering characteristics such as a similar color distribution in the anchorperson shot or the time at which the anchorperson shot occurs. In this method, a report shot having a color distribution similar to that of the anchorperson shot may be wrongly detected as an anchorperson shot, and a single anchorperson shot that occurs unexpectedly cannot be detected.
An aspect of the present invention provides a method of detecting an anchorperson shot using audio signals separated from a moving image, that is, using anchorperson's speech information.
An aspect of the present invention also provides an apparatus for detecting an anchorperson shot using audio signals separated from a moving image, that is, using anchorperson's speech information.
According to an aspect of the present invention, there is provided a method of detecting an anchorperson shot, including: separating a moving image into audio signals and video signals; deciding boundaries between shots of the moving image using the video signals; and extracting, from the audio signals using the boundaries, shots having a length larger than a first threshold value and containing a silent section having a length larger than a second threshold value, and deciding that the extracted shots are anchorperson speech shots.
According to another aspect of the present invention, there is provided an apparatus for detecting an anchorperson shot, the apparatus comprising: a signal separating unit separating a moving image into audio signals and video signals; a boundary deciding unit deciding boundaries between shots of the moving image using the video signals; and an anchorperson speech shot extracting unit extracting, from the audio signals using the boundaries, shots having a length larger than a first threshold value and containing a silent section having a length larger than a second threshold value, and outputting the extracted shots as anchorperson speech shots.
According to an aspect of the present invention, there is provided a method of detecting anchorperson shots, including: generating an anchorperson image model; detecting anchorperson candidate shots using the generated anchorperson image model; and verifying whether an anchorperson candidate shot is an actual anchorperson shot that contains an anchorperson image, using a separate speech model and an anchorperson speech model.
According to an aspect of the present invention, there is provided an apparatus for detecting an anchorperson shot, comprising: an image model generating unit generating an anchorperson image model; an anchorperson candidate shot detecting unit detecting anchorperson candidate shots by comparing the anchorperson image model generated by the image model generating unit with a key frame of each divided shot; and an anchorperson shot verifying unit verifying whether the anchorperson candidate shot is an actual anchorperson shot that contains an anchorperson image, using a separate speech model.
Additional and/or other aspects and advantages of the present invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
These and/or other aspects and advantages of the present invention will become apparent and more readily appreciated from the following detailed description, taken in conjunction with the accompanying drawings of which:
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the like elements throughout. The embodiments are described below in order to explain the present invention by referring to the figures.
In operation 10, the moving image is separated into audio signals and video signals. Hereinafter, it is assumed that the moving image includes audio signals as well as video signals. In this case, the moving image may be data compressed according to the MPEG standard. If the moving image is compressed by MPEG-1, the frequency of the audio signals separated from the moving image may be 48 kHz or 44.1 kHz, for example, which corresponds to the sound quality of a compact disc (CD). In order to perform operation 10, raw pulse code modulation (PCM) data may be extracted from the moving image, and the extracted raw PCM data may be used as the separated audio signals. After operation 10, in operation 12, boundaries between shots are decided using the video signals. To this end, a portion in which there is a relatively large change in the moving image is sensed, and the sensed portion is decided as a boundary between shots. Changes in at least one of the brightness, color quantity, and motion of the moving image may be sensed, and a portion in which there is a rapid change in the sensed results may be decided as the boundary between the shots.
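As a non-limiting illustration that is not part of the original disclosure, the following Python sketch shows one simple way such a boundary decision could be realized, by thresholding the gray-level histogram difference between consecutive frames. The function names, the histogram measure, and the threshold value are assumptions made for illustration only and do not restrict how operation 12 is actually performed.

```python
import numpy as np

def normalized_histogram(frame, bins=32):
    # Gray-level histogram of one frame, normalized to sum to 1.
    counts, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return counts / max(counts.sum(), 1)

def detect_shot_boundaries(frames, change_threshold=0.5):
    # A boundary is declared wherever the change between consecutive frames
    # is "rapid", i.e. the histogram difference exceeds the threshold.
    boundaries = []
    for i in range(1, len(frames)):
        diff = np.abs(normalized_histogram(frames[i - 1]) -
                      normalized_histogram(frames[i])).sum()
        if diff > change_threshold:
            boundaries.append(i)
    return boundaries
```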
Returning to
If the moving image is compressed by the MPEG-1 standard, the frequency of the separated audio signal is 48 kHz and the separated audio signal is down-sampled at the frequency of 8 kHz, the audio signal shown in
After operation 14, in operation 16, shots having a length larger than a first threshold value TH1 and containing a silent section having a length larger than a second threshold value TH2 are extracted from the down-sampled audio signals using the boundaries obtained in operation 12, and the extracted shots are decided as anchorperson speech shots. The anchorperson speech shot means a shot containing an anchorperson's speech, but is not limited thereto and may be a shot containing a reporter's speech or the sound of an object significant to a user. In general, the length of the anchorperson shot is considerably long, for example, more than 10 seconds, and there are some silent sections at the end of the anchorperson shot, that is, at the boundary between the anchorperson shot and the report shot when the anchorperson shot and the report shot occur consecutively. In operation 16, the anchorperson speech shot is decided based on these characteristics. That is, for a shot to be an anchorperson speech shot, the length of the shot should be larger than the first threshold value TH1, and a silent section having a length larger than the second threshold value TH2 should exist at the end of the shot, that is, at the boundary with the next shot.
The method of detecting the anchorperson shot of
First, in operation 30, the length of each of the shots is obtained using the boundaries obtained in operation 12. The boundary between the shots represents a portion between the end of a shot and the beginning of a new shot, and thus the boundaries may be used in obtaining the length of the shots.
After operation 30, in operation 32, shots having the length larger than the first threshold value TH1 are selected from the shots.
After operation 32, in operation 34, the length of a silent section of each of the selected shots is obtained. The silent section is a section in which there is no significant sound.
First, in operation 50, an energy of each of frames Frame 1, Frame 2, Frame 3, . . . , Frame i, . . . , and Frame N included in each of the shots selected in operation 32 is obtained. Here, the energy of each of the frames included in each of the shots selected in operation 32 may be given by Equation 1.
Here, Ei is an energy of an i-th frame among frames included in a shot, fd is the down frequency at which the audio signals are down-sampled, tf is the length 70 of the i-th frame, and pcm is a pulse code modulation (PCM) value of each sample included in the i-th frame and is an integer. When fd is 8 kHz and tf is 25 ms, fd·tf is 200. That is, there are 200 samples in the i-th frame.
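As a non-limiting illustration that is not part of the original disclosure, a minimal Python sketch of this framing and energy computation follows. Equation 1 is not reproduced above; the sketch assumes the common definition of frame energy as the sum of squared PCM sample values over the fd·tf samples of the frame, which is equivalent for thresholding purposes.

```python
import numpy as np

FD = 8000                          # down-sampling frequency fd (8 kHz)
TF = 0.025                         # frame length tf (25 ms)
SAMPLES_PER_FRAME = int(FD * TF)   # fd*tf = 200 samples per frame

def split_into_frames(pcm):
    # Non-overlapping frames of fd*tf samples; the last frame may be short.
    return [pcm[i:i + SAMPLES_PER_FRAME]
            for i in range(0, len(pcm), SAMPLES_PER_FRAME)]

def frame_energy(pcm_frame):
    # Energy of the i-th frame, assumed here as the sum of squared PCM values.
    samples = np.asarray(pcm_frame, dtype=np.float64)
    return float(np.sum(samples ** 2))
```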
After operation 50, in operation 52, a silent threshold value is obtained using energies of frames included in the shots selected in operation 32 of
In operation 80, each of the energies obtained in operation 50 in the frames included in each of the shots selected in operation 32 is rounded and expressed as an integer. After operation 80, in operation 82, the distribution of frames with respect to energies is obtained using the energies expressed as the integers. For example, an energy of each of the frames included in each of the shots selected in operation 32 is shown as the distribution of frames with respect to energies, as shown in
After operation 82, in operation 84, a reference energy is decided as a silent threshold value in the distribution of the frames with respect to energies, and operation 54 is performed. The reference energy is selected so that the number of frames having energies equal to or less than the reference energy is approximately equal to a specified percentage Y% of the total number X of frames included in the shots selected in operation 32, that is, XY/100. For example, when the distribution of frames with respect to energies is shown in
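As a non-limiting illustration that is not part of the original disclosure, the following Python sketch selects such a reference energy. The percentage Y is left open in this excerpt, so the default value used here is only an assumption for illustration.

```python
import numpy as np

def silent_threshold(frame_energies, percent_y=20.0):
    # Energies are rounded and expressed as integers (operation 80), and the
    # reference energy is chosen so that roughly X*Y/100 of the X frames have
    # an energy equal to or less than it (operation 84).
    energies = np.sort(np.round(np.asarray(frame_energies)).astype(np.int64))
    target = int(len(energies) * percent_y / 100.0)
    target = min(max(target, 1), len(energies))
    return int(energies[target - 1])
```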
After operation 52, in operation 54, a silent section of each of the shots selected in operation 32 is decided using a silent threshold value. For example, as shown in
After operation 54, in operation 56, the number of silent frames is counted in each of the shots selected in operation 32, the counted results are decided as the length of a silent section, and operation 36 is performed. The silent frame is a frame included in the silent section and having an energy equal to or less than the silent threshold value. For example, as shown in
The end frame of each of the shots selected in operation 32 may not be counted, because the end frame of each of the selected shots may contain fewer than fd·tf samples.
In addition, when the number of frames that belong to the silent section is counted, that is, when it is determined whether a frame belongs to the silent section, if frames having an energy larger than the silent threshold value occur consecutively, the counting operation may be stopped. For example, when it is checked for each of the shots selected in operation 32 whether the frames are silent frames, even though an L-th frame is not a silent frame, when the (L−1)-th frame is a silent frame, the L-th frame is regarded as a silent frame. However, when neither the (L−M)-th frame nor the (L−M−1)-th frame is a silent frame, the counting operation is stopped.
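As a non-limiting illustration that is not part of the original disclosure, the following Python sketch counts the silent-section length under the rules just described. The assumption that counting proceeds backward from the end of the shot reflects the fact that the silent section of interest lies at the shot boundary; this direction and the tolerance of exactly one non-silent frame are illustrative choices.

```python
def silent_section_length(frame_energies, threshold):
    # Counts silent frames at the tail of a shot (operations 54 and 56),
    # skipping the possibly incomplete end frame. A single non-silent frame
    # following a silent one is still counted as silent, but two consecutive
    # non-silent frames stop the counting operation.
    count = 0
    tolerated = False
    for energy in reversed(frame_energies[:-1]):
        if energy <= threshold:
            count += 1
            tolerated = False
        elif not tolerated:
            count += 1            # lone non-silent frame regarded as silent
            tolerated = True
        else:
            break                 # two consecutive non-silent frames
    return count
```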
Referring to
After operation 36, in operation 38, only the PQ/100 shots having a relatively large length, corresponding to a specified percentage Q%, are selected from the P (where P is a positive integer) extracted shots and are decided as anchorperson speech shots, and operation 18 is performed. For example, when P is 200 and Q is 80, the 40 shortest shots among the 200 shots extracted in operation 36 are discarded, and only the 160 longest shots are selected and decided as anchorperson speech shots.
Operation 16A of
Only anchorperson speech shots shown in
Meanwhile, after operation 16, in operation 18, anchorpersons' speech shots that contain anchorpersons' speeches are separated from the anchorperson speech shots. The anchorpersons may be of the same gender or of different genders. That is, the anchorpersons' speech shots may contain only anchormen's speech or only anchorwomen's speech, or both anchormen's and anchorwomen's speech.
After operation 16, in operation 130, the silent frame and the consonant frame are removed from each of the anchorperson speech shots.
In operation 150, in order to remove the silent frame from each of anchorperson speech shots, an energy of each of the frames included in each of anchorperson speech shots is obtained.
After operation 150, in operation 152, the silent threshold value is obtained using energies of the frames included in each of the anchorperson speech shots. After operation 152, in operation 154, the silent section of each of the anchorperson speech shots is decided using the silent threshold value. After operation 154, in operation 156, the silent frame included in the decided silent section is removed from each of the anchorperson speech shots.
Operations 150, 152, and 154 of
Alternatively, without the need of separately obtaining the silent frame of the anchorperson speech shots decided in operation 16 in operations 150 through 154 of
First, in operation 170, the ZCR according to each frame included in each of the anchorperson speech shots is obtained. The ZCR may be given by Equation 2.
Here, # is the number of sign changes of the decoded pulse code modulation (PCM) sample values, and tf is the length of the frame in which the ZCR is obtained. In this case, the ZCR increases as the frequency of an audio signal increases. In addition, the ZCR is used in classifying a consonant part and a vowel part of the anchorperson's speech, because the fundamental frequency of speech mainly exists in the vowel part of speech, whereas consonants contain relatively more high-frequency energy and thus have a higher ZCR.
After operation 170, in operation 172, the consonant frame is decided using the ZCR of each of the frames included in each of the anchorperson speech shots.
After operation 170, in operation 190, the average value of ZCRs of frames included in each of anchorperson speech shots is obtained. After operation 190, in operation 192, in each of the anchorperson speech shots, a frame having a ZCR larger than a specified multiple of the average value of the ZCRs is decided as the consonant frame, and operation 174 is performed. The specified multiple may be set to ‘2’.
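As a non-limiting illustration that is not part of the original disclosure, the following Python sketch computes a per-frame ZCR and marks consonant frames using the 2×-average rule of operations 190 and 192. The exact normalization of Equation 2 is assumed here; it does not affect the comparison against a multiple of the average.

```python
import numpy as np

def zero_crossing_rate(pcm_frame, tf=0.025):
    # Number of sign changes of the PCM sample values in the frame,
    # normalized by the frame length tf (normalization assumed).
    samples = np.asarray(pcm_frame, dtype=np.float64)
    sign_changes = int(np.sum(np.signbit(samples[:-1]) != np.signbit(samples[1:])))
    return sign_changes / tf

def consonant_frame_indices(frames, multiple=2.0):
    # Frames whose ZCR exceeds the specified multiple (e.g. 2) of the shot's
    # average ZCR are decided as consonant frames (operations 190 and 192).
    zcrs = np.array([zero_crossing_rate(f) for f in frames])
    return np.flatnonzero(zcrs > multiple * zcrs.mean())
```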
After operation 172, in operation 174, the decided consonant frame is removed from each of the anchorperson speech shots.
Operation 130A of
Alternatively, after operation 130A of
Alternatively, before operation 130A of
Meanwhile, according to an embodiment of the present invention, after operation 130, in operation 132, mel-frequency cepstral coefficients (MFCCs) according to each coefficient of each of the frames included in each of the anchorperson speech shots from which the silent frame and the consonant frame are removed are obtained, and anchorpersons' speech shots are detected using the MFCCs. The MFCCs were introduced by Davis, S. B. and Mermelstein, P. in an article entitled “Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences,” IEEE Trans. Acoustics, Speech and Signal Processing, vol. 28, pp. 357-366, 1980.
In operation 210, with respect to each of the anchorperson speech shots from which a silent frame and a consonant frame are removed, average values of MFCCs according to each coefficient of the frames included in each window are obtained while a window having a specified length moves at specified time intervals. The MFCCs are feature values widely used in speech recognition and generally include 13 coefficients in each frame. In the present invention, 12 MFCCs, excluding the zeroth coefficient, are used.
In this case, each window may include a plurality of frames, and each frame has an MFCC according to each coefficient. Thus, the average values of MFCCs according to each coefficient of each window are obtained by averaging the MFCCs according to each coefficient over the plurality of frames of the window.
After operation 210, in operation 212, a difference between the average values of MFCCs is obtained between adjacent windows. After operation 212, in operation 214, with respect to each of the anchorperson speech shots from which the silent frame and the consonant frame are removed, if the difference between the average values of MFCCs between adjacent windows is larger than a third threshold value TH3, the anchorperson speech shot is decided as an anchorpersons' speech shot.
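As a non-limiting illustration that is not part of the original disclosure, the following Python sketch applies this window-averaging and adjacent-window comparison to a precomputed MFCC matrix. The use of the Euclidean distance between per-coefficient averages, the window and hop lengths, and the threshold TH3 are assumptions for illustration.

```python
import numpy as np

def is_anchorpersons_speech_shot(mfcc_frames, window_len, hop_len, th3):
    # mfcc_frames: (num_frames, 13) MFCC matrix of one anchorperson speech shot
    # (silent and consonant frames already removed); coefficients 1-12 are used.
    coeffs = np.asarray(mfcc_frames)[:, 1:13]
    starts = range(0, len(coeffs) - window_len + 1, hop_len)
    averages = [coeffs[s:s + window_len].mean(axis=0) for s in starts]
    for prev, curr in zip(averages, averages[1:]):
        # distance between per-coefficient averages of adjacent windows
        if np.linalg.norm(curr - prev) > th3:
            return True
    return False
```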
For example, referring to
According to another embodiment of the present invention, after operation 130, in operation 132, an MFCC according to each coefficient and power spectral densities (PSDs) in a specified frequency bandwidth are obtained in each of the frames included in each of the anchorperson speech shots from which the silent frame and the consonant frame are removed, and the anchorpersons' speech shots are detected using the MFCCs according to each coefficient and the PSDs. The specified frequency bandwidth is a frequency bandwidth in which there is a large difference between the average spectra of men's speech and women's speech and may be set to 100-150 Hz, for example. The difference between the spectra of men's speech and women's speech was introduced by Irii, H., Itoh, K., and Kitawaki, N. in an article entitled “Multi-lingual Speech Database for Speech Quality Measurements and its Statistic Characteristics,” Trans. Committee on Speech Research, Acoust. Soc. Jap., pp. S87-69, 1987, and by Saito, S., Kato, K., and Teranishi, N. in an article entitled “Statistical Properties of Fundamental Frequencies of Japanese Speech Voices,” J. Acoust. Soc. Jap., 14, 2, pp. 111-116, 1958.
In operation 230, an average value of MFCCs according to each coefficient of the frames included in each window and an average decibel value of PSDs in the specified frequency bandwidth are obtained in each of the anchorperson speech shots from which a silent frame and a consonant frame are removed, while a window having a specified length moves at specified time intervals. The average decibel value of PSDs in the specified frequency bandwidth of each window is obtained by calculating the spectrum in the specified frequency bandwidth of each of the frames included in the window, averaging the calculated spectra, and converting the average spectrum into a decibel value.
For example, as shown in
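As a non-limiting illustration that is not part of the original disclosure, the following Python sketch computes such an average decibel value of PSDs in the 100-150 Hz band for one window of frames. The periodogram-based PSD estimate and the small constant added before the logarithm are assumptions; any PSD estimator could be substituted.

```python
import numpy as np

def band_psd_db(pcm_frames, fs=8000, band=(100.0, 150.0)):
    # pcm_frames: (num_frames, samples_per_frame) array for one window.
    # Periodogram of each frame, restricted to the 100-150 Hz band, averaged
    # over frames and band, then converted to decibels (operation 230).
    frames = np.asarray(pcm_frames, dtype=np.float64)
    spectra = np.abs(np.fft.rfft(frames, axis=1)) ** 2 / frames.shape[1]
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    mean_power = spectra[:, in_band].mean()
    return 10.0 * np.log10(mean_power + 1e-12)
```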
After operation 230, in operation 232, a difference Δ1 between average values of MFCCs between adjacent windows WD1 and WD2 and a difference Δ2 between average decibel values of PSDs between the adjacent windows WD1 and WD2 are obtained.
After operation 232, in operation 234, a weighted sum of the differences Δ1 and Δ2 is obtained in each of the anchorperson speech shots from which the silent frame and the consonant frame are removed. The weighted sum WS1 may be given by Equation 3.
WS1=W1Δ1+(1−W1)Δ2 (3)
Here, WS1 is the weighted sum, and W1 is a first weight value.
After operation 234, in operation 236, an anchorperson speech shot having a weighted sum WS1 larger than a fourth threshold value TH4 is decided as an anchorpersons' speech shot, and operation 20 is performed.
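As a non-limiting illustration that is not part of the original disclosure, the following Python sketch evaluates Equation 3 and the comparison with TH4. The default values W1 = 0.5 and TH4 = 4 are the example settings given later in this description.

```python
def exceeds_ws1_threshold(delta_mfcc, delta_psd_db, w1=0.5, th4=4.0):
    # Equation 3: WS1 = W1*Δ1 + (1-W1)*Δ2, compared with TH4 (operation 236).
    ws1 = w1 * delta_mfcc + (1.0 - w1) * delta_psd_db
    return ws1 > th4
```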
In operation 132A of
Meanwhile, after operation 20, in operation 20, the anchorpersons' speech shots are clustered, the anchorperson's speech shots, that is, the anchorperson speech shots excluding the anchorpersons' speech shots, are grouped, and the grouped results are decided as similar groups.
In operation 250, an average value of MFCCs according to each coefficient is obtained in each of anchorperson's speech shots.
After operation 250, in operation 252, when an MFCC distance calculated using the average values of MFCCs according to each coefficient of two anchorperson's speech shots Sj and Sj+1 is the closest among the anchorperson speech shots and is smaller than a fifth threshold value TH5, the two anchorperson's speech shots Sj and Sj+1 are decided as similar candidate shots Sj′ and Sj+1′. The coefficients of the average values of MFCCs according to each coefficient used in operation 252 may be the third to twelfth coefficients, and j represents an index of an anchorperson's speech shot and is initialized in operation 250. In this case, the MFCC distance WMFCC may be given by Equation 4.
WMFCC=√((a1−b1)²+(a2−b2)²+ . . . +(ak−bk)²) (4)
Here, a1, a2, . . . , and ak are average values of MFCCs according to each coefficient of the anchorperson's speech shot Sj, b1, b2, . . . , and bk are average values of MFCCs according to each coefficient of the anchorperson's speech shot Sj+1, and k is a total number of coefficients in the average values of MFCCs according to each coefficient obtained from the anchorperson's speech shot Sj or Sj+1.
After operation 252, in operation 254, a difference between average decibel values of PSDs in a specified frequency bandwidth of the similar candidate shots Sj′ and Sj+1′ is obtained.
After operation 254, in operation 256, when the difference between the average decibel values of PSDs obtained in operation 254 is smaller than a sixth threshold value TH6, the similar candidate shots Sj′ and Sj+1′ are grouped and are decided as a similar group. In this case, when the difference between the average decibel values of PSDs is larger than the sixth threshold value TH6, a flag may be allocated to the similar candidate shots whose average values of MFCCs are similar, so that operations 252, 254, and 256 are prevented from being performed again on the similar candidate shots to which the flag is allocated.
After operation 256, in operation 258, it is determined whether all of the anchorperson's speech shots are grouped. If it is determined that not all of the anchorperson's speech shots are grouped, operation 252 is performed, and operations 252, 254, and 256 are performed on the anchorperson's speech shots Sj+1 and Sj+2 whose two average values of MFCCs are the closest. However, if it is determined that all of the anchorperson's speech shots are grouped, operation 20A of
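As a non-limiting illustration that is not part of the original disclosure, the following Python sketch implements Equation 4 and a greedy version of the grouping loop of operations 252 through 258. The dictionary layout of the shot descriptors, the greedy pairing order, and the representation of groups as index lists are assumptions made for illustration.

```python
import numpy as np

def mfcc_distance(avg_a, avg_b):
    # Equation 4: Euclidean distance between per-coefficient MFCC averages
    # (the third to twelfth coefficients) of two anchorperson's speech shots.
    return float(np.linalg.norm(np.asarray(avg_a) - np.asarray(avg_b)))

def group_similar_shots(shots, th5, th6):
    # shots: list of dicts with keys 'mfcc' (per-coefficient MFCC averages)
    # and 'psd_db' (average decibel value of PSDs in the specified band).
    groups, flagged = [], set()
    remaining = list(range(len(shots)))
    while len(remaining) > 1:
        pairs = [(mfcc_distance(shots[i]['mfcc'], shots[j]['mfcc']), i, j)
                 for i in remaining for j in remaining
                 if i < j and (i, j) not in flagged]
        if not pairs:
            break
        dist, i, j = min(pairs)            # closest pair by MFCC distance
        if dist >= th5:
            break
        if abs(shots[i]['psd_db'] - shots[j]['psd_db']) < th6:
            groups.append([i, j])          # decided as a similar group
            remaining.remove(i)
            remaining.remove(j)
        else:
            flagged.add((i, j))            # flag so the pair is not revisited
    return groups
```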
For example, by grouping the anchorperson speech shots of
Meanwhile, after operation 20, in operation 22, a representative value of each of the similar groups is obtained as an anchorperson speech model. The representative value is the average value of MFCCs according to each coefficient of shots that belong to the similar groups and the average decibel value of PSDs in the specified frequency bandwidth of the shots that belong to the similar groups.
After operation 22, in operation 24, a separate speech model is generated using information about initial frames among frames of each of the shots included in each of the similar groups. The initial frames may be frames corresponding to an initial 4 seconds in each shot included in each of the similar groups. For example, information about the initial frames may be averaged, and the averaged results may be decided as the separate speech model.
In operation 270, an anchorperson image model is generated.
After operation 270, in operation 272, anchorperson candidate shots are detected using the generated anchorperson image model. For example, a moving image may be divided into a plurality of shots, and the anchorperson candidate shots may be detected by obtaining a color difference between a key frame of each of the plurality of divided shots and the anchorperson image model and by comparing the color differences. In order to obtain the color difference, each of the plurality of shots included in the moving image is divided into R×R (where R is a positive integer equal to or greater than 1) sub-blocks, and the anchorperson image model is divided into R×R sub-blocks. In this case, a color of a sub-block of an object shot is compared with a color of a sub-block of the anchorperson image model placed in the same position as that of the sub-block, and the compared results are decided as the color difference between the sub-blocks. If the color difference between the key frame of a shot and the anchorperson image model is smaller than a color difference threshold value, the shot is decided as an anchorperson candidate shot.
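As a non-limiting illustration that is not part of the original disclosure, the following Python sketch compares the R×R sub-blocks of a key frame with the co-located sub-blocks of the anchorperson image model. The mean-color comparison, R = 4, and the mean-absolute-difference measure are assumptions for illustration; in particular, the Grey-world color normalization described next is omitted from this sketch.

```python
import numpy as np

def block_color_difference(key_frame, image_model, r=4):
    # key_frame, image_model: (H, W, 3) color arrays. Both are divided into
    # R x R sub-blocks; mean colors of co-located sub-blocks are compared and
    # the mean absolute difference over all sub-blocks is returned.
    def block_means(img):
        img = np.asarray(img, dtype=np.float64)
        h, w = img.shape[0] // r, img.shape[1] // r
        return np.array([[img[i * h:(i + 1) * h, j * w:(j + 1) * w].mean(axis=(0, 1))
                          for j in range(r)] for i in range(r)])
    return float(np.abs(block_means(key_frame) - block_means(image_model)).mean())

def is_candidate_shot(key_frame, image_model, color_threshold):
    # A shot whose key-frame color difference from the anchorperson image model
    # is smaller than the threshold is decided as an anchorperson candidate shot.
    return block_color_difference(key_frame, image_model) < color_threshold
```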
The color difference is a normalized value based on a Grey world theory and may be decided to be robust with respect to some illumination changes. The Grey world theory was introduced by E. H. Land and J. J. McCann in an article entitled “Lightness and Retinex Theory,” Journal of the Optical Society of America, vol. 61, pp. 1-11, 1971.
After operation 272, in operation 274, it is verified, using the separate speech model and the anchorperson speech model, whether the anchorperson candidate shot is an actual anchorperson shot that contains an anchorperson image. For example, the separate speech model is used to verify whether an anchorperson candidate shot having a very small length, less than 6 seconds, is an actual anchorperson shot. Thus, the separate speech model is not used when verifying whether an anchorperson candidate shot having a large length is an actual anchorperson shot. In this case, the method of
In operation 292, a representative value of each of the anchorperson candidate shots is obtained using the time when the anchorperson candidate shot is generated. The representative value of the anchorperson candidate shot is the average value of MFCCs according to each coefficient of the frames that belong to the shot and the average decibel value of PSDs in the specified frequency bandwidth of the frames that belong to the shot. In addition, the time when the anchorperson candidate shot is generated is obtained in operation 272 and is the time when the anchorperson candidate shot starts and ends.
After operation 292, in operation 294, a difference DIFF between the representative value of each of the anchorperson candidate shots and the anchorperson speech model is obtained. The difference DIFF may be given by Equation 5.
DIFF=W2Δ3+(1−W2)Δ4 (5)
Here, W2 is a second weight value, Δ3 is a difference between the average values of MFCCs according to each coefficient of the anchorperson candidate shot and the anchorperson speech model, and Δ4 is a difference between the average decibel values of PSDs of the anchorperson candidate shot and the anchorperson speech model.
After operation 294, in operation 296, a weighted sum WS2 of the color difference information ΔCOLOR and the difference DIFF, which can be expressed by Equation 5, is obtained in each of the anchorperson candidate shots. The color difference information ΔCOLOR is information about the color difference between the anchorperson candidate shot and the anchorperson image model detected in operation 272, and the weighted sum WS2 obtained in operation 296 may be given by Equation 6.
WS2=W3ΔCOLOR+(1−W3)DIFF (6)
Here, W3 is a third weight value. In this case, the weighted sum WS2 reflects the color difference information ΔCOLOR, which is video information of the moving image, and the difference DIFF, which is audio information, and thus is referred to as multi-modal information.
After operation 296, in operation 298, when the weighted sum WS2 is not larger than a seventh threshold value TH7, the anchorperson candidate shot is decided as an actual anchorperson shot. However, when the weighted sum WS2 is larger than the seventh threshold value TH7, it is decided that the anchorperson candidate shot is not an actual anchorperson shot.
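As a non-limiting illustration that is not part of the original disclosure, the following Python sketch combines Equations 5 and 6 and applies the TH7 decision of operation 298. The defaults W3 = 0.5 and TH7 = 0.51 are the example settings given later in this description; the value of W2 is not given there, so the default used here is an assumption.

```python
def verify_candidate_shot(delta_mfcc, delta_psd_db, delta_color,
                          w2=0.5, w3=0.5, th7=0.51):
    # Equation 5: DIFF = W2*Δ3 + (1-W2)*Δ4
    # Equation 6: WS2  = W3*ΔCOLOR + (1-W3)*DIFF
    # The candidate is an actual anchorperson shot when WS2 is not larger than TH7.
    diff = w2 * delta_mfcc + (1.0 - w2) * delta_psd_db
    ws2 = w3 * delta_color + (1.0 - w3) * diff
    return ws2 <= th7
```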
According to an embodiment of the present invention, in operation 270 of
According to another embodiment of the present invention, in operation 270, the anchorperson image model may be generated using the anchorperson speech shots or the similar groups obtained in operation 16 or 20 of
If the anchorperson image model is generated using the anchorperson speech shots obtained in operation 16 of
Alternatively, if the anchorperson image model is generated using the similar groups obtained in operation 20 of
Meanwhile, the method of
In this case, according to an embodiment of the present invention, when the anchorperson image model is generated using the anchorperson speech shots obtained in operation 16 of
According to another embodiment of the present invention, when the anchorperson image model is generated using the similar groups obtained in operation 20 of
Hereinafter, an apparatus for detecting an anchorperson shot according to the present invention will be described.
The apparatus of
In order to perform operation 10, the signal separating unit 400 separates a moving image inputted through an input terminal IN1 into audio signals and video signals, outputs the separated audio signals to the down-sampling unit 404, and outputs the separated video signals to the boundary deciding unit 402.
In order to perform operation 12, the boundary deciding unit 402 decides boundaries between shots using the separated video signals inputted by the signal separating unit 400 and outputs the boundaries between the shots to the anchorperson speech shot extracting unit 406.
In order to perform operation 14, the down-sampling unit 404 down-samples the separated audio signals inputted by the signal separating unit 400 and outputs the down-sampled results to the anchorperson speech shot extracting unit 406.
In order to perform operation 16, the anchorperson speech shot extracting unit 406 extracts, as anchorperson speech shots, shots having a length larger than a first threshold value TH1 and containing a silent section having a length larger than a second threshold value TH2 from the down-sampled audio signals using the boundaries inputted by the boundary deciding unit 402, and outputs the extracted anchorperson speech shots to the shot separating unit 408 through an output terminal OUT2.
As described above, when the method of
Meanwhile, in order to perform operation 18, the shot separating unit 408 separates anchorpersons' speech shots from the anchorperson speech shots inputted by the anchorperson speech shot extracting unit 406 and outputs the separated results to the shot grouping unit 410.
In order to perform operation 20, the shot grouping unit 410 groups anchorpersons' speech shots and anchorperson's speech shots from the anchorperson speech shots, decides the grouped results as similar groups, and outputs the decided results to the representative value generating unit 412 through an output terminal OUT3.
In order to perform operation 22, the representative value generating unit 412 obtains a representative value of each of the similar groups inputted by the shot grouping unit 410 and outputs the obtained results to the separate speech model generating unit 414 as an anchorperson speech model.
In order to perform operation 24, the separate speech model generating unit 414 generates a separate speech model using information about initial frames among frames of each of the shots included in each of the similar groups and outputs the generated separate speech model through an output terminal OUT1.
As described above, when the method of
The apparatus of
The image model generating unit 440 generates an anchorperson image model and outputs the generated image model to the anchorperson candidate shot detecting unit 442. In this case, the image model generating unit 440 inputs the anchorperson speech shot outputted from the anchorperson speech shot extracting unit 406 of
In order to perform operation 272, the anchorperson candidate shot detecting unit 442 detects the anchorperson candidate shots by comparing the anchorperson image model generated by the image model generating unit 440 with a key frame of each of divided shots inputted through an input terminal IN3 and outputs the detected anchorperson candidate shots to the anchorperson shot verifying unit 444.
In order to perform operation 274, the anchorperson shot verifying unit 444 verifies whether the anchorperson candidate shot inputted by the anchorperson candidate shot detecting unit 442 is an actual anchorperson shot that contains an anchorperson image, using the separate speech model and the anchorperson speech model inputted by the separate speech model generating unit 414 and the representative value generating unit 412 through an input terminal IN4 and outputs the verified results through an output terminal OUT4.
The above-described first weight value W1 may be set to 0.5, the third weight value W3 may be set to 0.5, the first threshold value TH1 may be set to 6, the second threshold value TH2 may be set to 0.85, the fourth threshold value TH4 may be set to 4, and the seventh threshold value TH7 may be set to 0.51. In this case, the results of using the method and apparatus for detecting an anchorperson shot according to the present invention and the results of using a conventional method of detecting an anchorperson shot were compared with each other on 720 minutes of news moving images produced by several broadcasting stations, as shown in Table 1. The conventional method was introduced by Xinbo Gao, Jie Li, and Bing Yang in an article entitled “A Graph-Theoretical Clustering based Anchorperson Shot Detection for News Video Indexing,” ICCIMA, 2003.
As shown in Table 1, the method and apparatus for detecting an anchorperson shot according to the present invention have advantages over the conventional method of detecting an anchorperson shot.
By classifying the anchorperson shots detected by the method and apparatus according to the present invention according to the stories of the news, a user can view shots like a news storyboard on the Internet. As a result, the user can briefly view a corresponding moving image report by selecting articles of interest. That is, using the method and apparatus for detecting an anchorperson shot according to the present invention, the user can automatically record desired contents of the moving image at a desired time and can select and view, from the recorded shots, the shot in which the user has the most interest.
At present, in an environment in which the conventional TV viewing culture is changing because video contents overflow via broadcasting, the Internet, and several other media, and because the personal video recorder (PVR), the electronic program guide (EPG), and large-capacity hard drives have emerged, the method and apparatus for detecting an anchorperson shot according to the present invention can provide a simplified storyboard or highlights for a moving image which has a regular pattern, such as sports or news, and which takes a long time to view even after recording.
In the method and apparatus for detecting an anchorperson shot according to the above-described embodiments of the present invention, an anchorperson image model can be generated from a moving image, such as news, that has anchorperson shots, without a pre-specified anchorperson image model. Even when the color of the anchorperson's clothes or face is similar to a background color, the anchorperson shot can be robustly detected, the anchorperson shot can be detected without relying on a first anchorperson shot, and the possibility that a report shot similar to the anchorperson shot is wrongly detected as an anchorperson shot is removed. That is, the anchorperson shot can be detected accurately, so that the news can be divided into stories, the types of anchorperson shots can be grouped according to voices or genders, and the contents of the moving image can be indexed in a home audio/video storage device or an authoring device for providing contents, and thus only an anchorperson shot that contains a desired anchorperson's comment is extracted and searched for or summarized.
Although a few embodiments of the present invention have been shown and described, the present invention is not limited to the described embodiments. Instead, it would be appreciated by those skilled in the art that changes may be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.