1. Field
The present disclosure relates to a sound monitoring method, and more particularly, to a sound detection method of classifying various kinds of mixed sounds in an actual environment, determining whether or not a user is exposed to a dangerous situation, and recognizing a hazard situation.
2. Background
Generally, closed circuit television (CCTV) refers to a system which transfers video information to a particular user for a particular purpose, and is configured so that an arbitrary person other than the particular user cannot connect to the system in a wired or wireless manner and receive a video. CCTVs are mainly used in various surveillance systems for places congested with people, such as large discount stores, banks, apartments, schools, hotels, public offices, subway stations, etc., or places that require constant monitoring, such as unmanned base stations, unmanned substations, police stations, etc., and play a major role in acquiring clues from various crime scenes.
The market for CCTV cameras and Internet protocol (IP) cameras used as security cameras has grown drastically since 2010, and the Korean security camera market grew to about 420 billion Korean won in 2013. This growth shows that security systems for preventing various crimes are attracting attention.
However, in spite of the rapid proliferation of security cameras such as CCTVs, blind spots of security cameras still remain, and the crime rate has not decreased. When one camera is used to monitor several directions, even if a guard continuously changes the position of the camera, it may be impossible to continuously monitor the surveillance area due to carelessness of the guard or a lack of guards, and the surveillance system may not fully achieve its role.
Also, when a plurality of security cameras are installed to minimize blind spots, the number of screens to be monitored increases, and a larger number of security workers are required to monitor the screens. Although blind spots are reduced and a probability that a crime scene will be recorded increases, a probability that the crime will be handled in real time is reduced and the cost of equipment increases. Therefore, this is not an efficient method for crime prevention.
Consequently, to rapidly cope with a dangerous situation such as a crime, it is necessary to rapidly determine whether or not a dangerous situation has actually occurred for a user by detecting and classifying not only video images shown through a surveillance camera but also acoustic events accompanying the video images.
To classify a sound according to the related art, a system is utilized for identifying three types of sounds (explosions, gunshots, and screams) through two operations: detecting a particular event sound, such as a gunshot or a scream, using a Gaussian mixture model (GMM) classifier, and identifying the sounds of events using a hidden Markov model (HMM) classifier based on Mel-frequency cepstral coefficient (MFCC) features. However, such methods have problems in that the accuracy of sound detection is not ensured at a low signal-to-noise ratio (SNR), and it is difficult for the HMM classifier to distinguish between ambient noise and event sounds.
The present disclosure is directed to providing a sound detection method of detecting sounds coming from the surroundings and identifying a sound of a dangerous situation, such as a crime, to rapidly recognize the occurrence of a crime.
The present disclosure is directed to implementing a system capable of detecting a sound, determining whether or not a particular situation has occurred in real time, and rapidly handling the situation.
According to an aspect of the present disclosure, there is provided a method of detecting a sound for recognizing a hazard situation in an environment with mixed background noise, the method including acquiring a sound signal from a microphone; separating abnormal sounds from the input sound signal based on non-negative matrix factorization (NMF); extracting Mel-frequency cepstral coefficient (MFCC) parameters according to the separated abnormal sounds; calculating hidden Markov model (HMM) likelihoods according to the separated abnormal sounds; and comparing the HMM likelihoods of the separated abnormal sounds with a reference value to determine whether or not an abnormal sound has occurred.
The separating of the abnormal sounds based on NMF may include decomposing the input sound into a linear combination of several vectors using a background noise base and a plurality of abnormal sound bases and determining degrees of similarity with a pre-trained abnormal sound signal. The background noise base and the plurality of abnormal sound bases may be obtained through NMF training in an offline environment using corresponding signals.
The extracting of the MFCC parameters according to the separated abnormal sounds may include converting the separated abnormal sounds into 39-dimensional feature vectors, and the feature vectors may consist of the MFCC parameters including logarithmic energy and the delta and acceleration factors thereof.
The method may further include, after the extracting of the MFCC parameters according to the separated abnormal sounds, detecting a highest likelihood of each separated abnormal sound using an HMM of the background noise and an HMM of the separated abnormal sound.
A likelihood of the HMM of the background noise may be calculated as a probability that feature values of the abnormal sound will be detected in the HMM of the background noise, and a likelihood of the HMM of the abnormal sound may be calculated as a probability that feature values of the abnormal sound will be detected in the HMM of the abnormal sound.
39-dimensional feature vectors may be obtained by training the HMM of the abnormal sound and the HMM of the background noise, and an expectation-maximization (EM) algorithm may be used in training of an HMM parameter.
The method may further include calculating an HMM likelihood of the abnormal sound and an HMM likelihood of the background noise, and determining whether the abnormal sound exists in a particular frame through an HMM likelihood ratio of the background noise to the abnormal sound.
The method may further include comparing the HMM likelihood ratio of the background noise to the abnormal sound with a preset reference value, and determining that the sound signal includes the abnormal sound when the likelihood ratio is larger than the preset reference value.
The method may further include setting a probability that each frame will include the abnormal sound to 1 when the likelihood ratio is larger than the preset reference value, setting the probability to 0 otherwise, and determining that the abnormal sound is included in the sound signal to recognize a dangerous situation when a sum of set probabilities is larger than 0.
Embodiments will be described in detail with reference to the following drawings in which like reference numerals refer to like elements, and wherein:
Hereinafter, embodiments will be described in detail with reference to the accompanying drawings. The embodiments may, however, be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein; rather, alternate embodiments falling within the spirit and scope can be seen as included in the present disclosure.
The present disclosure proposes a method of simultaneously performing sound source separation and acoustic event detection to improve the accuracy in detecting a surrounding acoustic event at a low signal-to-noise ratio (SNR). According to an embodiment of the present disclosure, event sounds are separated from ambient noise through non-negative matrix factorization (NMF), and a probability-based test is performed for each separated sound using a hidden Markov model (HMM) to determine whether an acoustic event has occurred.
The embodiment may include an operation of acquiring a sound from a microphone (S10), an operation of separating abnormal sounds from the input sound acquired in operation S10 based on NMF (S20), an operation of extracting Mel-frequency cepstral coefficient (MFCC) parameters according to the abnormal sounds separated in operation S20 (S30), an operation of calculating likelihoods based on HMMs according to the abnormal sounds separated in operation S20 (S40), an operation of comparing the likelihoods of the separated abnormal sounds calculated in operation S40 with a reference value (S50), an operation of determining that an abnormal sound has occurred when a likelihood of a separated abnormal sound is equal to or larger than the reference value (S60), and an operation of determining that no abnormal sound has occurred when a likelihood of a separated abnormal sound is smaller than the reference value (S70).
It is assumed that the input sound signal $y_i(n)$ of the i-th frame is a signal in which L abnormal sound signals $s_i^l(n)$ (l = 1, ..., L) and a background noise signal $d_i(n)$ are mixed. The input sound signal may therefore be expressed as $y_i(n) = d_i(n) + \sum_{l=1}^{L} s_i^l(n)$.
Subsequently, the operation of separating abnormal sounds from the input sound signal based on an NMF algorithm (S20) is performed. The NMF algorithm performs a process of generating a predictive frame of a current frame using a predictive algorithm for a previous frame of a previously input sound signal.
The input sound signal converted to have an amplitude of $|Y_i(k)|$ may be split into signals having a spectrum size corresponding to the L abnormal sounds using an NMF technique, and the signals may be expressed as $|S_i^l(k)|$ (l = 1, ..., L).
The NMF technique is a technique of decomposing and expressing one matrix in the form of a product of two matrices. Generally, there are several techniques of decomposing a matrix, and various factorization techniques have been researched under different constraint conditions. The NMF technique differs from other techniques in that factorization is performed so that all elements of the decomposed two matrices satisfy a non-negative condition. In other words, when one matrix is decomposed and expressed as a product of two matrices, the decomposition is performed according to the NMF technique so that each element of the two matrices has a value of 0 or a positive value larger than 0.
To decompose one matrix into a product of two matrices is to express one vector as a linear combination of several vectors. In terms of signal space, this is to construct a subspace based on the several vectors of the linear combination and project one of the vectors to the subspace. In this projection process, there is an inevitable projection error, which serves as an index for defining a distance between the vector and the subspace. Therefore, when an input signal is expressed as a linear combination of basis vectors, that is, the input signal is projected in one subspace, it is possible to determine degrees of similarity between the input signal and the particular basis vectors from a size of the projection error.
An operation of separating an acoustic event from ambient noise using the above-described NMF technique will be described below.
The spectrum amplitudes of M consecutive frames of the input sound signal are converted into a K×M dimensional time-frequency matrix, which may be expressed as follows: $\mathbf{Y}_i = [\,|Y_{i-M+1}(k)| \;\cdots\; |Y_{i-m}(k)| \;\cdots\; |Y_i(k)|\,]$.
Therefore, the input sound signal is assumed to be the sum of a background noise signal $\mathbf{D}_i$ and a plurality of abnormal sound signals $\mathbf{S}_i^l$, and is expressed as $\mathbf{Y}_i \approx \mathbf{D}_i + \sum_{l=1}^{L} \mathbf{S}_i^l$, where $\mathbf{D}_i$ and $\mathbf{S}_i^l$ are the time-frequency matrices of $d_i(n)$ and $s_i^l(n)$.
Subsequently, NMF classification may be performed using a background noise base $B_{\hat{D}}$ and a plurality (L) of abnormal sound bases $B_{\hat{S}^l}$ (l = 1 to L). In this embodiment, the background noise base $B_{\hat{D}}$ and the abnormal sound bases $B_{\hat{S}^l}$ may be obtained through offline NMF training with corresponding signals. In other words, a spectrum amplitude of background noise in the i-th frame and a spectrum amplitude of an l-th abnormal sound in the i-th frame may be calculated using the relationship between $\hat{D}_i = B_{\hat{D}} a_{\hat{D}}$
(Here, h is an iteration coefficient, and multiplication and division may be performed between base-specific factors.) Equation 1 is derived from a condition that a Kullback-Leibler divergence is minimized, and the Kullback-Leibler divergence may be expressed as Equation 2 below.
Equation 1 is repeated until a solution of Equation 2 does not become smaller than a predetermined value. A condition for repeating Equation 1 is given by Equation 3 below.
In Equation 3, θ may be set as a very small threshold value of about 0.0001.
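By way of illustration only, the iterative estimation described above may be sketched as follows. The sketch assumes the widely used multiplicative update that minimizes the Kullback-Leibler divergence while the pre-trained bases are held fixed; the stopping rule (relative change of the divergence not exceeding θ), the toy matrices, and all function names are assumptions introduced for this example and are not the exact form of Equations 1 to 3.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def kl_divergence(Y, Y_hat, eps=1e-12):
    """Generalized Kullback-Leibler divergence D(Y || Y_hat) over all entries."""
    total = 0.0
    for y_row, h_row in zip(Y, Y_hat):
        for y, h in zip(y_row, h_row):
            total += y * math.log((y + eps) / (h + eps)) - y + h
    return total

def estimate_activations(Y, B, theta=1e-4, max_iter=500, eps=1e-12):
    """Estimate non-negative activations A with fixed bases B so that Y ~ B A."""
    K, M, R = len(Y), len(Y[0]), len(B[0])
    A = [[1.0] * M for _ in range(R)]                 # non-negative start
    Bt = [list(col) for col in zip(*B)]               # B transposed
    col_sums = [sum(col) for col in Bt]               # B^T 1
    prev = kl_divergence(Y, matmul(B, A))
    for _ in range(max_iter):
        Y_hat = matmul(B, A)
        ratio = [[Y[k][m] / (Y_hat[k][m] + eps) for m in range(M)] for k in range(K)]
        num = matmul(Bt, ratio)                       # B^T (Y / (B A))
        # Element-wise multiplication and division keep every entry non-negative.
        A = [[A[r][m] * num[r][m] / (col_sums[r] + eps) for m in range(M)]
             for r in range(R)]
        cur = kl_divergence(Y, matmul(B, A))
        if abs(prev - cur) <= theta * max(prev, eps): # assumed stopping rule
            break
        prev = cur
    return A

# Toy example: column 0 of B plays the role of a background noise base and
# column 1 the role of one abnormal sound base (values are hypothetical).
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [[2.0, 0.5], [1.0, 3.0], [3.0, 3.5]]
A = estimate_activations(Y, B)
# Separated abnormal-sound spectrum: abnormal base column times its activation row.
S_hat = [[B[k][1] * A[1][m] for m in range(len(Y[0]))] for k in range(len(Y))]
```

In this sketch the bases of the background noise and of every abnormal sound would be concatenated column-wise into `B`, so that each separated spectrum is recovered from its own columns and the matching activation rows.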
Ŝ={BŜl . . . BŜl . . . BŜl], āŜ
Here, r and R are base rankings of the abnormal sound base BŜl and the background noise base B{circumflex over (D)} respectively, dimensions of
After Ŝil=BŜl(aŜ
In operation S30, $|\hat{S}_{i-m}^l(k)|$ is converted into a 39-dimensional feature vector $c_{i-m}^l$, which consists of 12 MFCCs and a logarithmic energy together with the delta and acceleration factors thereof. As a result, the M consecutive feature vectors may be expressed as $C_i^l = [\,(c_{i-M+1}^l)^T \;\cdots\; (c_{i-m}^l)^T \;\cdots\; (c_i^l)^T\,]^T$.
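By way of illustration only, the assembly of a 39-dimensional feature vector may be sketched as follows, assuming 13 static coefficients per frame (12 MFCCs and a logarithmic energy) that are already computed. The delta formula (a centered difference with the edge frames repeated) and the function names are assumptions of this example; the disclosure does not specify how the delta and acceleration factors are derived.

```python
def deltas(frames):
    """Centered differences between neighbouring frames; edge frames are repeated."""
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out

def stack_features(static):
    """static: list of 13-dim frames -> list of 39-dim frames (static, delta, acceleration)."""
    d = deltas(static)      # delta factors
    dd = deltas(d)          # acceleration factors (delta of delta)
    return [s + x + y for s, x, y in zip(static, d, dd)]

# Hypothetical static features: five frames forming a linear ramp.
static = [[float(t)] * 13 for t in range(5)]
feats = stack_features(static)
```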
Subsequently, the operation of calculating HMM likelihoods according to the separated abnormal sounds (S40) is performed. In operation S40, the highest likelihood is detected through likelihoods of the l-th abnormal sound and background noise, and may be calculated using the HMM of the l-th abnormal sound and a signal Cil from which an MFCC has been extracted.
In this embodiment, training of HMMs is performed in eight stages, and 16 mixed Gaussian probability density functions (pdfs) are modeled. To train λs
In the HMM training, 39-dimensional feature vectors are obtained as feature parameters from the training audio list, and an expectation-maximization (EM) algorithm may be additionally used to train HMM parameters.
Subsequently, the operation of comparing the likelihoods of the separated abnormal sounds with a reference value (S50) may be performed.
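After training the l-th abnormal sound HMM $\lambda_{S_l}$ and the background noise HMM $\lambda_D$, the likelihoods may be calculated as in Equation 4 below.
$L_i^{S_l} = P(C_i^l \mid \lambda_{S_l}), \quad L_i^{D} = P(C_i^l \mid \lambda_D)$ (Equation 4)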
As shown in Equation 4, the likelihood of the background noise HMM may be calculated as a probability that feature values of an abnormal sound will be detected in the background noise HMM, and the likelihood of the abnormal sound HMM may be calculated as a probability that feature values of an abnormal sound will be detected in the abnormal sound HMM.
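By way of illustration only, the evaluation of such likelihoods may be sketched with the forward algorithm in the log domain. For brevity, the sketch assumes one diagonal-covariance Gaussian per state instead of a Gaussian mixture, and all model parameters, names, and feature values are hypothetical; trained parameters would be obtained separately, for example with the EM algorithm.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))

def hmm_log_likelihood(C, log_pi, log_A, means, variances):
    """Forward algorithm: returns log P(C | model) for a feature sequence C."""
    n = len(log_pi)
    alpha = [log_pi[j] + log_gauss(C[0], means[j], variances[j]) for j in range(n)]
    for x in C[1:]:
        alpha = [logsumexp([alpha[i] + log_A[i][j] for i in range(n)])
                 + log_gauss(x, means[j], variances[j]) for j in range(n)]
    return logsumexp(alpha)

# Hypothetical two-state models with one-dimensional features.
h = math.log(0.5)
abnormal = dict(log_pi=[h, h], log_A=[[h, h], [h, h]],
                means=[[3.0], [4.0]], variances=[[1.0], [1.0]])
background = dict(log_pi=[h, h], log_A=[[h, h], [h, h]],
                  means=[[0.0], [0.5]], variances=[[1.0], [1.0]])
C = [[3.1], [3.8], [2.9]]                      # features of a separated sound
L_S = hmm_log_likelihood(C, **abnormal)        # log likelihood, abnormal sound HMM
L_D = hmm_log_likelihood(C, **background)      # log likelihood, background noise HMM
```

Working in the log domain avoids numerical underflow when the likelihoods of many frames are multiplied together.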
Next, the operation of comparing the likelihoods using a likelihood Lis
Here, when a reference value thrl is a preset threshold value and a ratio of the likelihood LiD of the background noise HMM to the likelihood Lis
A detected value $\mathrm{Event}_l(i)$ of 1 indicates that the i-th frame includes the l-th abnormal sound. When it is determined that the i-th frame includes the abnormal sound through the comparison between the likelihood and the reference value as described above, it is possible to detect that the abnormal sound exists in an input signal corresponding to the current frame and a dangerous situation has occurred.
Therefore, according to the embodiment of the present disclosure, it is determined whether at least one abnormal sound has occurred in the i-th frame to determine whether a dangerous situation has occurred. This may correspond to a case of $\sum_{l=1}^{L} \mathrm{Event}_l(i) > 0$. In other words, when the sum of detected values is larger than 0, it is possible to recognize a dangerous situation by determining that an abnormal sound is included in an input sound signal.
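By way of illustration only, the per-frame decision may be sketched as follows. The sketch takes the likelihood ratios as already computed in the manner described above and only applies the comparison with the reference values; the function names and the numerical values are hypothetical.

```python
def detect_events(ratios, thresholds):
    """ratios[l][i]: likelihood ratio of abnormal sound l at frame i.
    Returns events[l][i], set to 1 when the ratio exceeds the reference value, else 0."""
    return [[1 if r > thr else 0 for r in row]
            for row, thr in zip(ratios, thresholds)]

def dangerous_frames(events):
    """A frame is flagged when the sum of its event values over all sounds is larger than 0."""
    return [sum(column) > 0 for column in zip(*events)]

# Hypothetical ratios for L = 2 abnormal sounds over three frames.
ratios = [[0.2, 1.5, 0.1],
          [0.3, 0.4, 0.2]]
thresholds = [1.0, 1.0]
events = detect_events(ratios, thresholds)
flags = dangerous_frames(events)
```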
To compare the embodiment with the related art, two or more abnormal sounds including a scream and a gunshot were taken into consideration. Since the two or more abnormal sounds (L=2) were used, it was possible to acquire two abnormal sound bases BŜl and abnormal sound HMMs λS
For the test, the scream and the gunshot were mixed with audio clips recorded on congested public streets. The average SNR was varied from −5 dB to 15 dB at intervals of 5 dB by changing the average power of the abnormal sound. A scream region A and a gunshot region B did not overlap, and the test data for each SNR consisted of 10 screams and gunshots.
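By way of illustration only, mixing an abnormal sound with background noise at a target average SNR may be sketched as follows. The sketch assumes that the SNR is defined as ten times the base-10 logarithm of the ratio of the average power of the abnormal sound to that of the noise; the function names and sample values are hypothetical.

```python
import math

def mean_power(x):
    """Average power of a sampled signal."""
    return sum(v * v for v in x) / len(x)

def mix_at_snr(event, noise, snr_db):
    """Scale `event` so that 10*log10(P_event / P_noise) equals snr_db, then add the noise."""
    gain = math.sqrt(mean_power(noise) * 10 ** (snr_db / 10.0) / mean_power(event))
    mixed = [gain * e + n for e, n in zip(event, noise)]
    return mixed, gain

# Hypothetical short signals mixed at each SNR used in the test.
event = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixes = {snr: mix_at_snr(event, noise, snr) for snr in (-5, 0, 5, 10, 15)}
```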
Table 1 shows false alarm ratios and missed-detection ratios for a comparison between the embodiment and the existing method.
Referring to Table 1, the average F-measure of the method of detecting a sound according to the embodiment is 90.51%, which is remarkably higher than that of the existing method using an HMM. Compared to the existing method, F-measure values were remarkably increased in the low-SNR section of −5 dB to 5 dB, and thus the accuracy of abnormal sound detection was improved.
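By way of illustration only, an F-measure may be computed from detection counts as sketched below. The sketch assumes the common definition as the harmonic mean of precision and recall, which is not stated in the disclosure; the function name and the counts are hypothetical and are not taken from Table 1.

```python
def f_measure(true_pos, false_alarms, missed):
    """Harmonic mean of precision and recall computed from detection counts."""
    detected = true_pos + false_alarms
    actual = true_pos + missed
    precision = true_pos / detected if detected else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 9 correct detections, 1 false alarm, 1 missed detection.
score = f_measure(9, 1, 1)
```

A false alarm lowers the precision and a missed detection lowers the recall, so a single F-measure value reflects both kinds of error.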
In other words, the embodiment detects all of the abnormal sounds existing in the test sound, whereas the existing method (CONV-HMM) of detecting a sound fails to detect some of the abnormal sounds.
According to the embodiment, an abnormal sound is determined in a situation with background noise, and an NMF-based sound separation is performed. Also, a method of detecting an abnormal sound by comparing ratios of the likelihood of a noise HMM to the likelihoods of several abnormal sound HMMs with a reference value is used, so that the accuracy of sound detection may be improved even in an environment with a low SNR. Therefore, it is possible to determine whether or not a dangerous situation has occurred with high reliability.
According to the embodiment of the present disclosure, since a sound monitoring system compares each sound to be detected with ambient noise on a one-to-one basis and classifies the sounds, it is possible to stably detect the sounds even in an actual environment with multiple noises.
According to the embodiment of the present disclosure, since voice data is recognized through an HMM based on the NMF technique, it is possible to detect a particular sound targeted by a user in an input signal with high accuracy and reliability.
According to the embodiment of the present disclosure, it is possible to improve the reliability of detecting a particular sound in an actual environment with a plurality of noises, and the embodiment of the present disclosure may be applied to various sound monitoring systems for rapidly detecting a dangerous situation. Consequently, high industrial applicability can be expected.
Any reference in this specification to “one embodiment,” “an embodiment,” “example embodiment,” etc., means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with any embodiment, it is submitted that it is within the purview of one skilled in the art to apply such a feature, structure, or characteristic in connection with other ones of the embodiments.
Although embodiments have been described with reference to a number of illustrative embodiments thereof, it should be understood that numerous other modifications and embodiments can be devised by those skilled in the art that fall within the spirit and scope of the principles of this disclosure. More particularly, various variations and modifications are possible in the component parts and/or arrangements of the subject combination arrangement within the scope of the disclosure, the drawings and the appended claims. In addition to variations and modifications in the component parts and/or arrangements, alternative uses will also be apparent to those skilled in the art.
The application claims the benefit of U.S. Provisional Application Ser. No. 62/239,989, filed Oct. 12, 2015, which is hereby incorporated by reference in its entirety.
Number | Date | Country
---|---|---
62239989 | Oct 2015 | US