(1) Field of the Invention
The present invention relates to a sound identification apparatus which identifies an inputted sound and outputs both the type of the inputted sound and the temporal interval occupied by each type of sound.
(2) Description of the Related Art
Conventionally, sound identification apparatuses have been widely used as means for extracting information regarding the source, emitting device, and so on of a certain sound by extracting acoustic characteristics of the sound. Such apparatuses are used, for example, for detecting the sound of ambulances, sirens, and so on occurring outside of a vehicle and notifying the vehicle's occupants of such sounds, or for discovering defective devices by analyzing the sound a factory-manufactured product emits during operation and detecting abnormalities in that sound. However, recent years have seen a demand for a technique for identifying the type, category, and so on of sounds from mixed ambient sounds, in which various sounds are mixed together or emitted alternately, without limiting the sound to be identified to a specific sound.
Patent Reference 1 (Japanese Laid-Open Patent Application No. 2004-271736; paragraphs 0025 to 0035) can be given as an example of a technique for identifying the type, category, and so on of an emitted sound. The information detection device described in Patent Reference 1 divides inputted sound data into blocks based on predetermined units of time and classifies each block as sound “S” or music “M”.
However, with Patent Reference 1, in the case of calculating the identification frequency Ps(t) and the like at each time t, the same predetermined unit of time Len, or in other words, a predetermined unit of time Len which has a fixed value, is used, which gives rise to the following problems.
The first problem is that interval detection becomes inaccurate in the case where sudden sounds occur in rapid succession. When sudden sounds occur in rapid succession, the judgment of the sound type of the blocks becomes inaccurate, and differences between the actual sound type and the sound type judged for each block occur at a high rate. When such differences occur at a high rate, the identification frequency Ps and the like in the predetermined unit of time Len become inaccurate, which in turn causes the detection of the final sound or sound interval to become inaccurate as well.
The second problem is that the recognition rate of the sound to be identified (the target sound) is dependent on the length of the predetermined unit of time Len due to the relationship between the target sound and background sounds. In other words, in the case where the target sound is identified using the predetermined unit of time Len, which is a fixed value, there is a problem in that the recognition rate for the target sound drops due to background sounds. This problem shall be discussed in detail later.
Having been conceived in light of the aforementioned problems, an object of the present invention is to provide a sound identification apparatus which reduces the chance of a drop in the identification rate, even when sudden sounds occur, and furthermore, even when a combination of the target sound and background sounds changes.
The sound identification apparatus according to the present invention is a sound identification apparatus that identifies the sound type of an inputted audio signal, and includes: a sound feature extraction unit which divides the inputted audio signal into a plurality of frames and extracts a sound feature per frame; a frame likelihood calculation unit which calculates a frame likelihood of the sound feature in each frame, for each of a plurality of sound models; a confidence measure judgment unit which judges a confidence measure based on the sound feature or a value derived from the sound feature, the confidence measure being an indicator of whether or not to cumulate the frame likelihoods; a cumulative likelihood output unit time determination unit which determines a cumulative likelihood output unit time so that the cumulative likelihood output unit time is shorter in the case where the confidence measure is higher than a predetermined value and longer in the case where the confidence measure is lower than the predetermined value; a cumulative likelihood calculation unit which calculates a cumulative likelihood in which the frame likelihoods of the frames included in the cumulative likelihood output unit time are cumulated, for each of the plurality of sound models; a sound type candidate judgment unit which determines, for each cumulative likelihood output unit time, a sound type corresponding to the sound model that has a maximum cumulative likelihood; a sound type frequency calculation unit which calculates a frequency at which the sound type determined by the sound type candidate judgment unit appears in a predetermined identification time unit; and a sound type interval determination unit which determines the sound type of the inputted audio signal and the temporal interval of the sound type, based on the frequency of the sound type calculated by the sound type frequency calculation unit.
For example, the confidence measure judgment unit judges the confidence measure based on the frame likelihood of the sound feature in each frame for each sound model, calculated by the frame likelihood calculation unit.
Through such a configuration, the cumulative likelihood output unit time is determined based on a predetermined confidence measure, such as, for example, a frame confidence measure that is based on a frame likelihood. For this reason, it is possible, by making the cumulative likelihood output unit time shorter in the case where the confidence measure is high and longer in the case where the confidence measure is low, to make the number of frames used for judging the sound type variable. Accordingly, it is possible to reduce the influence of short-duration sudden abnormal sounds with low confidence measures. In this manner, the cumulative likelihood output unit time is caused to change based on the confidence measure, and thus it is possible to provide a sound identification apparatus in which the chance of a drop in the identification rate is reduced even when a combination of background sounds and the sound to be identified changes.
Preferably, the frame likelihood for frames having a confidence measure lower than a predetermined threshold is not cumulated.
Through this configuration, frames with a low confidence measure are ignored. For this reason, it is possible to accurately identify the sound type.
Note that the confidence measure judgment unit may judge the confidence measure based on the cumulative likelihood calculated by the cumulative likelihood calculation unit.
In addition, the confidence measure judgment unit may judge the confidence measure based on the cumulative likelihood per sound model calculated by the cumulative likelihood calculation unit.
Furthermore, the confidence measure judgment unit may judge the confidence measure based on the sound feature extracted by the sound feature extraction unit.
It should be noted that the present invention can be realized not only as a sound identification apparatus that includes the abovementioned characteristic units, but may also be realized as a sound identification method which implements the characteristic units included in the sound identification apparatus as steps, a program which causes a computer to execute the characteristic steps included in the sound identification method, and so on. Furthermore, it goes without saying that such a program may be distributed via a storage medium such as a Compact Disc Read Only Memory (CD-ROM) or a communications network such as the Internet.
According to the sound identification apparatus of the present invention, it is possible to make the cumulative likelihood output unit time variable based on the confidence measure of a frame or the like. Therefore, it is possible to provide a sound identification apparatus which reduces the chance of a drop in the identification rate, even when sudden sounds occur, and furthermore, even when a combination of the target sound and background sounds changes.
The disclosure of Japanese Patent Application No. 2005-243325, filed on Aug. 24, 2005, including specification, drawings and claims is incorporated herein by reference in its entirety.
These and other objects, advantages and features of the invention will become apparent from the following description thereof taken in conjunction with the accompanying drawings that illustrate a specific embodiment of the invention.
Hereafter, embodiments of the present invention shall be described with reference to the drawings.
Before describing the embodiments of the present invention, experimental findings made by the inventor shall be discussed first. Experimental sound identification was performed on mixed sounds with changed combinations of a target sound and background sounds using frequency information of a most-likely model, in the same manner as the procedure described in Patent Reference 1. In the learning of a statistical learning model (hereafter, referred to simply as a “model” where appropriate), a synthetic sound in which the target sound was mixed at 15 dB relative to the background sounds was used. In addition, in the experimental sound identification, a synthetic sound in which the target sound was mixed at 5 dB relative to the background sounds was used.
Here, the results shall be examined in detail. When ambient sounds N1 through N17 are assumed to be the background sounds, and in the case where the sound to be identified is a sound M001, music M4, or the like, it can be seen that Tk=1 produces the best identification results. In other words, it can be seen that the procedure using the cumulative likelihood in which Tk=100 is not effective. On the other hand, in the case where the same ambient sound (with the exception of N13) is used as the background sound, and the sound to be identified is the ambient sound N13, Tk=100 shows the best results. In this manner, a trend in which the optimum Tk value differs depending on the type of the background sound can be seen in cases where the background sound is music or speech as well.
In other words, it can be seen that the cumulative likelihood output unit time Tk values in which the identification rate is the best change due to combinations of background sounds and target sounds. Conversely, when the cumulative likelihood output unit time Tk is a fixed value, as in Patent Reference 1, drops in the identification rate can be seen.
The present invention is based upon these findings.
According to the present invention, a model of a sound to be identified, which has been learned beforehand, is used in sound identification, the sound identification using frequency information based on the cumulative likelihood results of plural frames. Speech and music are given as sounds to be identified; the sounds of train stations, automobiles running, and railroad crossings are given as ambient sounds. The various sounds are assumed to have been modeled in advance based on their sound features.
The sound identification apparatus includes: a frame sound feature extraction unit 101; a frame likelihood calculation unit 102; a cumulative likelihood calculation unit 103; a sound type candidate judgment unit 104; a sound type interval determination unit 105; a sound type frequency calculation unit 106; a frame confidence measure judgment unit 107; and a cumulative likelihood output unit time determination unit 108.
The frame sound feature extraction unit 101 is a processing unit which converts an inputted sound into a sound feature, such as Mel-Frequency Cepstrum Coefficients (MFCC) or the like, per frame of, for example, 10 millisecond lengths. While 10 milliseconds is given here as the frame time length which serves as the unit of calculation of the sound feature, 5 milliseconds to 250 milliseconds may be used as the frame time length depending on the characteristics of the target sound to be identified. When the frame time length is 5 milliseconds, it is possible to capture the frequency characteristics of an extremely short sound, and changes therein; accordingly, 5 milliseconds is best used for capturing and identifying sounds with fast changes, such as, for example, beat sounds, sudden bursts of sound, and so on. On the other hand, when the frame time length is 250 milliseconds, it is possible to capture the frequency characteristics of quasi-steady continuous sounds very well; accordingly, with 250 milliseconds, the frequency characteristics of sounds with slow changes or which do not change much, such as, for example, the sound of a motor, can be captured, and thus 250 milliseconds is best used for identifying such sounds.
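For illustration only (the patent contains no source code), per-frame MFCC extraction of this kind might be sketched in Python as follows; the use of the librosa library, the 13-coefficient setting, and the window size are assumptions, not part of the disclosure:

```python
import librosa
import numpy as np

def extract_frame_features(wav_path: str, frame_ms: float = 10.0) -> np.ndarray:
    """Divide an audio signal into frames and extract MFCCs per frame
    (a sketch of the role of the frame sound feature extraction unit 101)."""
    y, sr = librosa.load(wav_path, sr=None)
    hop = int(sr * frame_ms / 1000.0)    # 10 ms hop, e.g. 160 samples at 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                hop_length=hop, n_fft=4 * hop)
    return mfcc.T                         # shape: (num_frames, 13)
```

Choosing `frame_ms` between 5 and 250 trades off, as described above, sensitivity to fast-changing sounds against stable capture of quasi-steady sounds.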
The frame likelihood calculation unit 102 is a processing unit which calculates a frame likelihood, which is a likelihood for each frame, between a model and the sound feature extracted by the frame sound feature extraction unit 101.
The cumulative likelihood calculation unit 103 is a processing unit which calculates a cumulative likelihood in which a predetermined number of frame likelihoods have been cumulated.
The sound type candidate judgment unit 104 is a processing unit which judges candidates for different sound types based on cumulative likelihoods. The sound type frequency calculation unit 106 is a processing unit which calculates a frequency in the identification unit time T per sound type candidate. The sound type interval determination unit 105 is a processing unit which determines a sound identification and the interval thereof in the identification unit time T, based on frequency information per sound type candidate.
The frame confidence measure judgment unit 107 outputs a frame confidence measure based on the frame likelihood by verifying the frame likelihood calculated by the frame likelihood calculation unit 102. The cumulative likelihood output unit time determination unit 108 determines and outputs a cumulative likelihood output unit time Tk, which is a unit time in which the cumulative likelihood is converted to frequency information, based on the frame confidence measure outputted by the frame confidence measure judgment unit 107. Accordingly, the cumulative likelihood calculation unit 103 is configured so as to calculate a cumulative likelihood, in which the frame likelihoods have been accumulated, in the case where the confidence measure is judged to be high enough, based on the output from the cumulative likelihood output unit time determination unit 108.
To be more specific, the frame likelihood calculation unit 102 calculates, based on formula (1), a frame likelihood P between an identification target sound characteristic model Mi learned in advance through a Gaussian Mixture Model (denoted as “GMM” hereafter) and an input sound feature X. The GMM is described in, for example, “S. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, ‘The HTK Book (for HTK Version 2.2), 7.1 The HMM Parameter.’ (1999-1)”.
X(t): the input sound feature vector in frame t;
Mi: the sound characteristic model for identification target sound i (μim is a mean vector; Σim is a covariance matrix; λim is a branch probability (mixture weight) of the mixed distribution; m is a superscript expressing the component number of the mixed distribution; N is the number of mixture components; n is the dimension of the feature vector X);
P(X(t)|Mi): the likelihood of the sound characteristic model Mi for the identification target sound i, given the input sound feature X(t) in frame t
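Formula (1) itself is not reproduced in this text. Given the definitions above, the standard Gaussian mixture likelihood it denotes would be (a reconstruction, not a verbatim copy of the patent's formula):

$$
P\bigl(X(t)\mid M_i\bigr)=\sum_{m=1}^{N}\lambda_i^{m}\,\frac{1}{(2\pi)^{n/2}\,\bigl|\Sigma_i^{m}\bigr|^{1/2}}\exp\!\Bigl(-\tfrac{1}{2}\bigl(X(t)-\mu_i^{m}\bigr)^{\top}\bigl(\Sigma_i^{m}\bigr)^{-1}\bigl(X(t)-\mu_i^{m}\bigr)\Bigr)
$$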
In addition, the cumulative likelihood calculation unit 103 calculates, as a cumulative value of the likelihood P(X(t)|Mi) for each learned model Mi, a cumulative likelihood Li in a predetermined unit time, as shown in formula (2); a model I that indicates the maximum cumulative likelihood is selected and outputted as the closest identified sound type in this unit interval.
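Formula (2) is likewise absent from this text. A cumulative likelihood consistent with the description, cumulating log frame likelihoods over the unit time Tk and selecting the maximizing model I, would be (again a reconstruction):

$$
L_i=\sum_{t\in T_k}\log P\bigl(X(t)\mid M_i\bigr),\qquad I=\arg\max_{i}\,L_i
$$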
Furthermore, the sound type candidate judgment unit 104 uses, as the sound type candidate, the model in which the cumulative likelihood for each learned model i outputted from the cumulative likelihood calculation unit 103 is maximum, per cumulative likelihood output unit time Tk; this is shown in the second part of formula (3). The sound type frequency calculation unit 106 and the sound type interval determination unit 105 output the sound identification results by outputting the model which has the maximum frequency in the identification unit time T based on the frequency information; this is shown in the first part of formula (3).
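Formula (3) is also not reproduced here. Writing L_i(k) for the cumulative likelihood in the k-th cumulative likelihood output unit time, a form consistent with the description would have, as its second part, the selection of the sound type candidate per unit time, and, as its first part, the selection of the most frequent candidate in the identification unit time T (a reconstruction):

$$
C(k)=\arg\max_{i}\,L_i(k),\qquad \hat{\imath}=\arg\max_{i}\,\bigl|\{\,k : C(k)=i,\ T_k\subset T\,\}\bigr|
$$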
Next, the specific processes of each block that makes up the first embodiment of the present invention shall be described using a flowchart.
The frame likelihood calculation unit 102 finds, for an input sound feature X(t) in a frame t, each frame likelihood Pi(t) of the sound characteristic model Mi for the identification target sound (Step S1001). The cumulative likelihood calculation unit 103 calculates the cumulative likelihood of each model by accumulating, over the cumulative likelihood output unit time Tk, the frame likelihood of each model for the input sound feature X(t) obtained in Step S1001 (Step S1007), and the sound type candidate judgment unit 104 outputs, as the sound identification candidate for that time, the model in which the likelihood is maximum (Step S1008). The sound type frequency calculation unit 106 calculates the frequency information of the sound identification candidate found in Step S1008 in the interval of the identification unit time T (Step S1009). Finally, the sound type interval determination unit 105 selects, based on the obtained frequency information, the sound identification candidate for which the frequency is maximum, and outputs that candidate as the identification result for the present identification unit time T (Step S1006).
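As an editorial sketch of steps S1001 and S1006 through S1009 (not the patent's own implementation), this flow might be written as follows, with scikit-learn GaussianMixture models standing in for the learned models Mi; all names and parameter values are illustrative assumptions:

```python
from collections import Counter
import numpy as np
from sklearn.mixture import GaussianMixture

def identify(features: np.ndarray, models: dict[str, GaussianMixture],
             Tk: int = 100, T: int = 1000) -> str:
    """Steps S1001, S1007-S1009, and S1006 in simplified form.

    features: (num_frames, dim) array of frame sound features.
    models:   sound type label -> trained GMM.
    """
    # Step S1001: frame log-likelihood of every model for every frame.
    frame_ll = {label: m.score_samples(features) for label, m in models.items()}

    candidates = []
    for start in range(0, len(features), Tk):
        # Step S1007: cumulate the frame likelihoods over Tk per model.
        cum = {label: ll[start:start + Tk].sum() for label, ll in frame_ll.items()}
        # Step S1008: the model with the maximum cumulative likelihood.
        candidates.append(max(cum, key=cum.get))

    # Step S1009: frequency of each candidate in the identification unit time T.
    freq = Counter(candidates[: T // Tk])
    # Step S1006: output the most frequent candidate as the result.
    return freq.most_common(1)[0][0]
```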
By setting the cumulative likelihood output unit time Tk of step S1007 to the same value as the identification unit time T, this method can also function as a cumulative likelihood method in which a single most frequent result is outputted for each identification unit time. In addition, if the cumulative likelihood output unit time Tk is regarded as a single frame, this method can also function as a method for selecting a most-likely model with the frame likelihood as a standard of reference.
The frame confidence measure judgment unit 107 first resets the frame confidence measure based on the frame likelihood to a maximum value (in the diagram, 1) (Step S1101). In the case where any of the three conditional expressions in steps S1012, S1014, and S1015 is fulfilled, the frame confidence measure judgment unit 107 judges the confidence measure by setting it to an abnormal value, or in other words, to a minimum value (in the diagram, 0) (Step S1013).
The frame confidence measure judgment unit 107 judges whether or not the frame likelihood Pi(t) for each model Mi of the input sound feature X(t) calculated in Step S1001 is greater than an abnormal threshold value TH_over_P, or is less than an abnormal threshold value TH_under_P (Step S1012). In either of these cases, the frame likelihood is considered to have no reliability whatsoever. Such a situation can be thought to arise when the input sound feature falls outside an assumed range, when a model has failed to learn properly, or the like.
Moreover, the frame confidence measure judgment unit 107 judges whether or not the change between the frame likelihood Pi(t) and the previous frame likelihood Pi(t−1) is small (Step S1014). Sounds in an actual environment are always fluctuating, and thus if sound input is performed properly, changes in likelihood in response to the changes in sound are expected. Accordingly, in the case where the change in likelihood is so small that even a change of frame produces no change, it can be thought that the input sound itself or the input of the sound feature has been cut off.
Furthermore, the frame confidence measure judgment unit 107 judges whether or not the difference between the frame likelihood value for the model in which the calculated frame likelihood Pi(t) is maximum and the frame likelihood value for the model in which Pi(t) is minimum is lower than a threshold value (Step S1015). In the case where the difference between the maximum and minimum values of the frame likelihood is greater than the threshold, it is thought that a superior model, which is close to the input sound feature, is present; whereas in the case where the difference is extremely small, no model is superior. Accordingly, this difference is used as the confidence measure. In the case where the difference between the maximum and minimum values of the frame likelihood is less than the threshold value (Y in Step S1015), the frame confidence measure judgment unit 107 assumes the present frame to be of an abnormal value, and sets the frame confidence measure to 0 (Step S1013). On the other hand, in the case where the comparison result is greater than or equal to the threshold value (N in Step S1015), it is assumed that a superior model is present, and thus the frame confidence measure can be set to 1.
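A minimal sketch of these three checks follows; the patent names TH_over_P and TH_under_P but fixes no values, so all threshold values here are illustrative assumptions:

```python
import numpy as np

# Illustrative threshold values; the patent does not specify them.
TH_OVER_P = 0.0      # abnormal upper bound on a frame log-likelihood
TH_UNDER_P = -200.0  # abnormal lower bound
TH_DELTA = 1e-6      # minimum permitted change between consecutive frames
TH_SPREAD = 1.0      # minimum max-min spread indicating a superior model

def frame_confidence(p_t: np.ndarray, p_prev: np.ndarray) -> int:
    """Return 1 (reliable) or 0 (abnormal) for the current frame.

    p_t, p_prev: frame likelihoods P_i(t) and P_i(t-1) over all models M_i.
    """
    # Step S1012: likelihood outside the assumed range -> no reliability.
    if np.any(p_t > TH_OVER_P) or np.any(p_t < TH_UNDER_P):
        return 0
    # Step S1014: almost no change between frames -> input may be cut off.
    if np.all(np.abs(p_t - p_prev) < TH_DELTA):
        return 0
    # Step S1015: no model stands out above the others.
    if p_t.max() - p_t.min() < TH_SPREAD:
        return 0
    return 1
```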
In this manner, it is possible to calculate the frame confidence measure based on the frame likelihood, determine the cumulative likelihood output unit time Tk using the information regarding a frame with a high frame confidence measure, and calculate the frequency information.
In the case where frame confidence measures R(t) close to 1 frequently appear (Y in Step S1024), the cumulative likelihood output unit time determination unit 108 causes the cumulative likelihood output unit time Tk to decrease (Step S1025). Through this, in the case where the frame confidence measure R(t) is low, the number of frames is increased and the cumulative likelihood found, whereas when the frame confidence measure R(t) is high, the number of frames is decreased and the cumulative likelihood found; because the frequency information can be obtained based on these results, it is possible to automatically obtain identification results of the same accuracy as conventional methods in a relatively short identification unit time.
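One possible realization of this adaptation is sketched below; the halving/doubling policy and the ratio test are assumptions, since the patent gives only the direction of change:

```python
def update_Tk(Tk: int, recent_confidences: list[int],
              Tk_min: int = 10, Tk_max: int = 200) -> int:
    """Shorten Tk when high-confidence frames dominate, lengthen it otherwise.

    recent_confidences: frame confidence measures R(t) over a recent window.
    The halving/doubling policy below is illustrative, not from the patent.
    """
    ratio = sum(recent_confidences) / max(len(recent_confidences), 1)
    if ratio > 0.8:                      # R(t) close to 1 appears frequently
        return max(Tk_min, Tk // 2)      # Step S1025: decrease Tk
    if ratio < 0.2:                      # unreliable frames dominate
        return min(Tk_max, Tk * 2)       # lengthen the number of frames
    return Tk
```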
With the conventional method, or in other words, in conditions where the cumulative likelihood output unit time Tk is fixed, the frequency information of the model with the maximum likelihood, from among the likelihoods obtained from each single frame, is calculated. The conventional method is a method which does not use the confidence measure, and thus the frequency information of the outputted most-likely model is reflected as-is. The information outputted as the sound identification results is determined via the frequency information per interval. In the example in this diagram, the frequency results indicate 2 frames of sound type M (music) and 4 frames of sound type S (sound) in the identification unit time T; from this, the most frequent model in the identification unit time T is the sound type S (sound), and thus a result in which the identification is mistaken is obtained.
On the other hand, under the conditions in which the frequency information is calculated using the likelihood confidence measure, as according to the present invention, the confidence measure per frame takes a value of either 1 or 0, as indicated by the steps in the diagram, and the frequency information is outputted while the unit time for calculating the cumulative likelihood changes based on this confidence measure. For example, a frame likelihood judged to be unreliable is not directly converted into frequency information, but rather is accumulated into the cumulative likelihood until a frame judged to be reliable is reached. In this example, there is an interval in which the confidence measure is 0, and as a result, the sound type M (music) becomes the most frequent model in the identification unit time T; it can thus be seen that the correct sound type has been identified. Therefore, as an effect of the present invention, it can be expected that identification results can be improved through absorbing unstable frequency information, by not directly using frame likelihoods judged to be unreliable.
According to such a configuration, when converting the cumulative likelihood information to frequency information, by converting the frequency information based on the likelihood confidence measure, the length of the cumulative likelihood calculation unit time can be appropriately set even in cases where sudden sounds occur frequently and sound types frequently switch (the cumulative likelihood calculation unit time can be set to be short in the case where the confidence measure is higher than a predetermined value, and longer in the case where the confidence measure is lower than the predetermined value). For this reason, it can be thought that a drop in the identification rate of a sound can be suppressed. Furthermore, it is possible to identify a sound based on a more appropriate cumulative likelihood calculation unit time, and thus a drop in the identification rate of a sound can be suppressed, even in the case where background noise and the target sound have changed.
Next, a second configuration of a sound identification apparatus according to the first embodiment of the present invention shall be described. This configuration differs from the first in that the likelihood confidence measure is applied when the sound type candidates calculated from the cumulative likelihoods are converted to frequency information.
According to such a configuration, when converting the sound type candidate calculated from the cumulative likelihood information to frequency information, by converting to frequency information based on the likelihood confidence measure, it is possible to reduce the influence of sudden abnormal sounds over a short amount of time; therefore, it is possible to suppress a drop in the identification rate by using a more appropriate cumulative likelihood calculation unit time, even when there is background noise present or the target sound changes.
Here, the frame confidence measure judgment unit 107 sets the confidence measure to take on an intermediate value between 0 and 1, rather than setting the confidence measure at either 0 or 1. Specifically, as in Step S1016, the frame confidence measure judgment unit 107 can add, as a further standard for the confidence measure, a measure for judging how superior the frame likelihood of the model with the maximum value is. Accordingly, the frame confidence measure judgment unit 107 may use a ratio between the maximum and minimum values of the frame likelihood as the confidence measure.
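Since frame likelihoods are typically handled in the log domain, the difference between the maximum and minimum log-likelihoods corresponds to the log of this ratio. A sketch of such a graded confidence measure, with an assumed squashing constant, might be:

```python
import numpy as np

def graded_confidence(p_t: np.ndarray) -> float:
    """Map the spread between the best and worst model log-likelihoods to [0, 1].

    p_t: frame log-likelihoods over all models; their difference is the log of
    the likelihood ratio mentioned in the text. The squashing constant 10.0
    is an illustrative assumption.
    """
    spread = float(p_t.max() - p_t.min())
    return float(1.0 - np.exp(-spread / 10.0))  # 0: no superior model; ->1: clear winner
```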
The sound type frequency calculation unit 106 finds the frequency information by accumulating, over the interval of the identification unit time T, the sound type candidates outputted in accordance with the processing described above.
Note that the sound type interval determination unit 105 may select the model that has the maximum frequency information only in an interval in which frequency information with a high confidence measure is concentrated, and may then determine the sound type and interval thereof. In this manner, information in intervals with low frame confidence measures is not used, and the accuracy of identification can be improved.
Here, the effects of the present invention shall be described using specific examples of identification results. With the conventional method, or in other words, in conditions where the cumulative likelihood output unit time Tk is fixed, the frequency information of the model with the maximum likelihood, from the likelihoods obtained from each single frame, is calculated. Therefore, in the same manner as the results described earlier, a result in which the identification is mistaken can be obtained.
On the other hand, under conditions in which the frequency information is calculated using the likelihood confidence measure, as in the present invention, it is possible to find the frequency information based on three levels of reliability while having the cumulative likelihood interval be of variable length, down to a frame whose likelihood is reliable enough to be converted to frequency information on its own. Accordingly, it is possible to obtain identification results without directly using the frequency information of an unstable interval. In addition, in the case of a frame in which the reliability is low and the frequency information is accordingly not used, such as the last frame in the identification target interval T in the diagram, it is possible to deliberately ignore the cumulative likelihood. In this manner, it can be expected that identification can be performed with even greater accuracy by having the confidence measure take a multiple-stepped form.
It should be noted that in the above example, descriptions are given in which a single identification judgment result is outputted in the identification unit time T; however, plural identification judgment results may be outputted with an interval of high reliability or an interval of low reliability being used as a base point. With such a configuration, the identification results for the identification unit time T are not outputted at a fixed timing; rather, it is possible to appropriately output information of an interval with high reliability at a changeable timing. Therefore, even if, for example, the identification unit time T is set to be longer, results can be quickly obtained in intervals in which the identification results are probable due to the confidence measure. It is possible to quickly obtain results for a highly-reliable interval even in the case where the identification unit time T is set to be shorter as well.
Note that while descriptions have been given in which MFCC is assumed as the sound feature used by the frame sound feature extraction unit 101 and GMM as the learning model, the present invention is not limited to these; a Discrete Fourier Transform (DFT), Discrete Cosine Transform (DCT), Modified Discrete Cosine Transform (MDCT), or the like, which express the feature as a frequency-domain feature, may be used as well. In addition, a Hidden Markov Model (HMM), which takes state transitions into consideration, may be used as a model learning method.
In addition, a model learning method may be applied after using a statistical method such as principal component analysis (PCA) to analyze or extract components, such as the independent components of the sound feature.
In this variation, the sound type candidates whose cumulative likelihoods fall within a predetermined value of the most-likely cumulative likelihood are calculated.
Through such a configuration, it is possible to detect changes in the input sound from changes in the abovementioned candidates, and thus to infer that changes have occurred in the makeup of the mixed sounds that include the identification target sound and the background noise. This can be thought of as useful in the case where the identification target sound continues to occur while the background noise changes, or where a sound similar to the target sound repeatedly appears and disappears in the background.
Note that a change in the sound type candidates calculated in the above manner, or in other words, in the combination of identifiers within a predetermined value from the most-likely cumulative likelihood, may be detected, and the change point or the amount by which the number of candidates has increased or decreased may be used as the frame confidence measure and converted to the frequency information.
Likewise, a change in the combination of identifiers within a predetermined value from the lowest cumulative likelihood may be detected, and the change point or the amount by which the number of candidates has increased or decreased may be used as the frame confidence measure and converted to the frequency information.
It should be noted that a model within a range from the most-likely cumulative likelihood to the predetermined likelihood is a model in which the probability of the model as the sound type of the interval in which the cumulative likelihood has been calculated is extremely high. Accordingly, assuming that only the model judged in Step S1053 to have a likelihood within the predetermined range is a reliable model, the confidence measure may be created per model and used in conversion to frequency information. In addition, a model within a range from the lowest cumulative likelihood to the predetermined value is a model in which the probability of the model as the sound type of the interval in which the cumulative likelihood has been calculated is extremely low. Accordingly, assuming that only the model judged in Step S1058 to have a likelihood within the predetermined range is an unreliable model, the confidence measure may be created per model and used in conversion to frequency information.
Note that in the abovementioned configuration, descriptions have been given regarding a method for using the frame confidence measure based on the cumulative likelihood and converting the frame confidence measure into the frequency information; however, the frame confidence measure based on the frame likelihood may be compared with the frame confidence measure based on the cumulative likelihood, an interval in which the two match may be selected, and the frame confidence measure based on the cumulative likelihood may be weighted.
With such a configuration, it is possible to maintain a short frame-unit response time while using the frame confidence measure based on the cumulative likelihood. Therefore, it is possible to detect an interval in which the frame confidence measure based on the frame likelihood is transitioning, even in the case where the frame confidence measure based on the cumulative likelihood continues and the same sound type candidates are outputted. Therefore, it is also possible to detect a degradation in likelihood over a short period of time due to suddenly occurring sounds or the like.
In addition, in the first embodiment or the second embodiment, descriptions have been given regarding a method in which a frame confidence measure calculated based on the likelihood or the cumulative likelihood is used in converting the frequency information; however, the frequency information or identification results may further be outputted using a sound type candidate confidence measure in which a confidence measure is provided per sound model.
In this configuration, a sound type candidate confidence measure is provided for each sound model and is used when converting to the frequency information.
By using such a configuration, it is possible to provide a confidence measure per model using the sound type candidate confidence measure, and therefore it is possible to output frequency information in which the model has been weighted. In addition, in the case where a predetermined number of pieces of the frequency information is above a predetermined threshold value, or the frequency information is above the predetermined threshold value for a certain period of time, it is possible to output the identification results with less delay in the sound identification interval even when the identification unit time T has passed, by determining the sound type and outputting it with the interval information.
Next, a method shall be described for outputting sound identification results in which mistaken identifications are suppressed, the mistaken identifications arising because there is almost no frequency difference between sound types in the frequency information obtained in the interval of the identification unit time T, or in other words, because a superior sound type is not present.
As mentioned above, in the case where a sound in which music (M) and sound (S) alternately appear is the input sound, and the frame confidence measure is high, sound type candidates are outputted even before the identification unit time T is reached. However, in the case where background noise or other noise (N) that resembles the music (M) is present, or many models that resemble the alternately-appearing sound (S) or music (M) are present, and a single model cannot be isolated, the frame reliability drops, as opposed to the case described above. Furthermore, if each cumulative likelihood interval Tk continues for a length of time within the identification unit time T that cannot be ignored, the frequency count obtained in the identification unit time T drops. As a result, there are cases in which the difference between the frequencies of music (M) and sound (S) in the identification unit time T decreases. In such cases, there is a problem in that no model is clearly superior as the model with the maximum frequency information in the identification unit time T, and a sound type candidate which differs from the actual sound type is outputted.
Accordingly, in a variation on the present embodiment, the appearance frequency of each sound type in the cumulative likelihood output unit time Tk within the identification unit time T is used, and the sound type frequency calculation unit 106 calculates a frequency confidence measure from that appearance frequency.
An example of this operation shall be described next.
First, the identification unit time is, as a rule, a predetermined value T (100 frames, in this example); however, in the case where the frame reliability at the time when the sound type frequency calculation unit 106 outputs the cumulative likelihood is above the predetermined value for a predetermined number of consecutive frames, the cumulative likelihood is outputted even if the identification unit time does not reach the predetermined value T, and therefore the identification unit time is shorter than the predetermined value in the identification unit intervals T3 and T4 shown in the diagram.
Next, the appearance frequency per model is shown. Here, “M” indicates music, “S” indicates sound, “N” indicates noise, and “X” indicates silence. The appearance frequency in the first identification time interval T0 is 36 for M, 35 for S, 5 for N, and 2 for X. Therefore, in this case, the most frequent model is M.
Next, descriptions shall be given of the case in which the appearance frequency is used, using the frequency of each model per identification unit time outputted by the sound type frequency calculation unit 106.
A specific procedure that uses such judgment criteria shall be described. In the case where the frequency confidence measure R(t) is greater than or equal to 0.5, the most frequent model in the identification unit interval is used as-is, and in the case where the frequency confidence measure R(t) is lower than 0.5, the frequency per model in a plurality of identification unit intervals is re-calculated and the most frequent model determined.
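A sketch of this procedure follows, under the assumptions that the frequency confidence measure R is the share of the most frequent model within an interval and that the re-calculation pools the neighbouring intervals (the patent fixes neither detail):

```python
from collections import Counter

def decide_with_frequency_confidence(intervals: list[list[str]],
                                     threshold: float = 0.5) -> list[str]:
    """Per identification unit interval, keep the most frequent model if its
    frequency confidence is >= threshold; otherwise re-count over this and
    the neighbouring intervals. Pooling adjacent intervals is an assumption.

    intervals: sound type candidates ('M', 'S', 'N', 'X', ...) per interval.
    """
    results = []
    for k, cands in enumerate(intervals):
        freq = Counter(cands)
        best, count = freq.most_common(1)[0]
        if count / len(cands) >= threshold:
            results.append(best)            # R(t) >= 0.5: use the model as-is
        else:
            pooled = Counter()              # R(t) < 0.5: re-calculate over
            for j in (k - 1, k, k + 1):     # a plurality of intervals
                if 0 <= j < len(intervals):
                    pooled.update(intervals[j])
            results.append(pooled.most_common(1)[0][0])
    return results
```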
In such a manner, by using the frequency per model in plural identification unit intervals for areas in which the frequency confidence measure is low, accurate sound identification can be outputted even if the frequency confidence measure of the most frequent model in the identification unit interval drops due to the influence of noise and the like.
In another variation, the frame confidence measure judgment unit 107 judges the frame confidence measure based on the sound feature extracted by the frame sound feature extraction unit 101.
By using such a configuration, information of intervals in which the frame confidence measure is low may be outputted together. Also, by using such a configuration, it is possible to detect the occurrence of sudden sounds by finding how much the confidence measure has changed, even when, for example, the same sounds are continuing.
The frame confidence measure judgment unit 107 judges whether or not the power of the sound feature is below a predetermined signal power (Step S1041). In the case where the power of the sound feature is below the predetermined signal power (Y in Step S1041), the frame confidence measure based on the sound feature is assumed to have no reliability and is thus set to 0. In all other cases (N in Step S1041), the frame confidence measure judgment unit 107 sets the frame confidence measure to 1 (Step S1011).
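Step S1041 can be sketched directly; the power threshold value is an illustrative assumption:

```python
import numpy as np

TH_POWER = 1e-4  # illustrative threshold on mean signal power

def feature_confidence(frame_samples: np.ndarray) -> int:
    """Step S1041: a frame whose signal power falls below the threshold is
    judged to have no reliability (confidence 0); otherwise 1 (Step S1011)."""
    power = float(np.mean(frame_samples ** 2))
    return 0 if power < TH_POWER else 1
```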
By using such a configuration, it is possible to judge the type of the sound using the confidence measure at the sound input stage prior to the judgment of the sound type.
Although only some exemplary embodiments of this invention have been described in detail above, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of this invention. Accordingly, all such modifications are intended to be included within the scope of this invention.
The sound identification apparatus according to the present invention has a function for judging a sound type using frequency information converted from a likelihood based on a confidence measure. Accordingly, it is possible to extract intervals of a sound from a specific category out of audio and video recorded in a real environment by learning scenes of specific categories using characteristic sounds, and possible to continuously extract exciting scenes from content by extracting cheering sounds and using them as identification targets. In addition, it is possible to attach other related information using the detected sound type and interval information as tags, and to utilize the invention in a tag detection device or the like for audio/visual (AV) content.
Furthermore, the present invention is useful as a sound editing apparatus or the like which detects sound intervals from a recorded source in which various unsynchronized sounds occur and plays back only those intervals.
In addition, it is possible to extract intervals in which sound changes even when the same sound type is detected, such as when sudden sounds occur over a short period of time, by outputting intervals in which the confidence measure has changed.
Furthermore, the confidence measure of the frame likelihood and so on may be outputted and used, rather than just the sound identification results and their intervals. For example, in the case where an area where the confidence measure is low is detected when editing a sound, a beep sound or the like may be provided as a notification to aid search and editing. In such a manner, it is expected that search operations will be more effective in the case where sounds that are difficult to model due to their short length, such as the sounds of doors and pistols, are searched for.
Furthermore, intervals in which the outputted confidence measures, cumulative likelihoods, and frequency information alternately occur may be diagrammed and presented to the user. Through this, it is possible for the user to easily see intervals in which the confidence measure is low, and it can be expected that editing operations or the like will be more efficient.
By equipping a device with the sound identification apparatus according to the present invention, it is possible to apply the present invention in a recording apparatus or the like which can compress recorded audio by selecting only necessary sounds and recording them.
Foreign application priority data: No. 2005-243325, filed Aug. 2005, JP (national).
This is a continuation of PCT application No. PCT/JP2006/315463, filed Aug. 4, 2006, and designating the United States of America.
Related application data: parent application PCT/JP06/15463, filed Aug. 2006 (US); child application No. 11783376, filed Apr. 2007 (US).