The present invention relates to a sound processing method, sound processing apparatus, and program.
There is a demand for estimating, from sound, an abnormality or the detailed state of an apparatus in use in a factory, a home, a common commercial facility, or the like. To detect such a state from sound, a technology is known that detects a particular sound in a general environment, in which various sounds are usually mixed together. A noise cancellation technique is also known that identifies and reduces (eliminates) ambient noise included in an input signal. There are also known methods of identifying a particular sound by comparing an input signal, from which ambient noise has been eliminated using the noise cancellation technique, with a previously learned signal pattern (for example, see Patent Document 1). There is also known a method of identifying an input signal having large sound pressure variations in the time domain as a sudden sound (hereafter referred to as an “impulse sound”). There are also known methods of identifying, as an impulse sound, an input signal whose ratio between the sound pressure energy of a low-frequency range and the sound pressure energy of a high-frequency range in the frequency domain is equal to or greater than a predetermined threshold (for example, see Patent Document 2).
Among technologies that mainly recognize human speech are the methods described in Patent Documents 3 and 4. Patent Documents 3 and 4 include recognizing a sound by previously storing a sound model and making a comparison between sound feature values extracted from a sound signal and the sound model. The sound feature values are mel-frequency cepstral coefficients (MFCC) and are typically the n-th-order cepstral coefficients obtained by eliminating the zero-th-order component, that is, the direct-current component, as described in Patent Document 4.
However, an impulse sound, such as a sound that occurs when a ceiling light or a home appliance is switched on or a sound that occurs when a door is closed, shows approximately flat frequency characteristics in a certain range, and its features are therefore difficult to grasp. For this reason, even if the technologies described in the above Patent Documents are used, it is disadvantageously difficult to determine what is producing the impulse sound and under what circumstances, and thus to identify the sound source.
Accordingly, an object of the present invention is to provide a sound processing method, sound processing apparatus, and program that are able to resolve the difficulty in recognizing an impulse sound.
A sound processing method according to an aspect of the present invention includes performing a Fourier transform and then a cepstral analysis of a sound signal and extracting, as feature values of the sound signal, values including frequency components obtained by the Fourier transform of the sound signal and a value based on a result obtained by the cepstral analysis of the sound signal.
A sound processing apparatus according to another aspect of the present invention includes a feature value extractor configured to perform a Fourier transform and then a cepstral analysis of a sound signal and to extract, as feature values of the sound signal, values including frequency components obtained by the Fourier transform of the sound signal and a value based on a result obtained by the cepstral analysis of the sound signal.
A program according to yet another aspect of the present invention is a program for implementing, in an information processing apparatus, a feature value extractor configured to perform a Fourier transform and then a cepstral analysis of a sound signal and to extract, as feature values of the sound signal, values including frequency components obtained by the Fourier transform of the sound signal and a value based on a result obtained by the cepstral analysis of the sound signal.
According to the present invention thus configured, an impulse sound can be recognized easily.
A first example embodiment of the present invention will be described with reference to
The present invention consists of sound processing systems as shown in
First, referring to
The sound processing system also includes a signal processor 1 that receives and processes the sound data, which is digital data. The signal processor 1 consists of one or more information processing apparatuses each including an arithmetic logic unit and a storage unit. The signal processor 1 includes a noise cancellation unit 4, a feature value extractor 20, and a learning unit 8. These elements are implemented when the arithmetic logic unit executes a program. The storage unit(s) of the signal processor 1 include a model storage unit 9. The respective elements will be described in detail below.
The noise cancellation unit 4 analyzes the sound data and eliminates noise (stationary noise: the sound of an air-conditioner indoors, the sound of wind outdoors, etc.) included in the sound data. The noise cancellation unit 4 then transmits the noise-eliminated sound data to the feature value extractor 20.
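The description does not specify the algorithm used by the noise cancellation unit 4. As one possible illustration only, the following Python sketch applies spectral subtraction, a common technique for reducing stationary noise. The function name `spectral_subtract`, the frame length, the use of non-overlapping frames, and the availability of a separate noise-only recording are all assumptions made for this example and are not taken from the embodiment.

```python
import numpy as np


def spectral_subtract(signal, noise_ref, frame_len=256):
    """Reduce stationary noise by subtracting an average noise magnitude spectrum.

    Hypothetical helper for illustration: `noise_ref` is assumed to be a
    recording that contains only the stationary noise (e.g. an air-conditioner).
    Non-overlapping rectangular frames are used to keep the sketch short.
    """
    signal = np.asarray(signal, dtype=float)
    noise_ref = np.asarray(noise_ref, dtype=float)

    # Average magnitude spectrum of the noise-only reference.
    n_noise = len(noise_ref) // frame_len
    noise_frames = noise_ref[: n_noise * frame_len].reshape(n_noise, frame_len)
    noise_mag = np.abs(np.fft.rfft(noise_frames, axis=1)).mean(axis=0)

    # Split the input into frames and transform each frame.
    n_sig = len(signal) // frame_len
    frames = signal[: n_sig * frame_len].reshape(n_sig, frame_len)
    spec = np.fft.rfft(frames, axis=1)

    # Subtract the noise magnitude, clamp at zero, and keep the original phase.
    mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
    cleaned = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=frame_len, axis=1)
    return cleaned.reshape(-1)
```

In this sketch, frames that contain only the stationary noise are reduced to near silence, while a frame containing an impulse keeps most of its energy because the impulse spreads across all frequency bins.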
The feature value extractor 20 includes mathematical functional blocks for extracting features of the numerical sound data. The mathematical functional blocks extract the features of the sound data by converting numerical values of the sound data in accordance with the functions thereof. Specifically, as shown in
The FFT unit 5 includes, in feature values of the sound data, frequency components of the sound data obtained by performing a fast Fourier transform of the sound data. The MFCC unit 6 includes, in feature values of the sound data, the zero-th-order component of a result obtained by performing a mel-frequency cepstral coefficient analysis of the sound data. The differentiator 7 calculates the differential component of the result obtained by the mel-frequency cepstral coefficient analysis of the sound data by the MFCC unit 6 and includes the differential component in feature values of the sound data. Thus, the feature value extractor 20 extracts, as the feature values of the sound data, values including the frequency components obtained by the fast Fourier transform of the sound data, the zero-th-order component of the result obtained by the mel-frequency cepstral coefficient analysis of the sound data, and the differential component obtained by differentiating the result of the mel-frequency cepstral coefficient analysis of the sound data. That is, with respect to the sound data, the feature value extractor 20 extracts sound pressure variations in the time domain using the zero-th-order component of MFCC, extracts time variations not dependent on the volume using the differential component of MFCC, and extracts the frequency components of the impulse by FFT, and uses the sound pressure variations and the like as the feature values of the sound data. For example, the feature value extractor 20 expresses the values extracted from the mathematical functional blocks as a set of numerical sequences in a time-series manner and uses the values as feature values.
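The combination of the three functional blocks can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function names, the sample rate, the frame length, the hop size, the number of mel filters, the number of cepstral coefficients, the Hamming window, and the use of a simple first-order frame difference as the differential component are all choices made for this example.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):       # rising slope
            fb[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):      # falling slope
            fb[i - 1, k] = (right - k) / (right - center)
    return fb


def extract_features(signal, sample_rate=8000, frame_len=256, hop=128,
                     n_filters=20, n_ceps=13):
    """Return one feature row per frame: [FFT magnitudes, MFCC c0, delta MFCC].

    Assumes len(signal) >= frame_len. All parameter values are illustrative.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # FFT unit 5: frequency components of each frame.
    spectrum = np.abs(np.fft.rfft(frames, axis=1))

    # MFCC unit 6: log mel energies followed by a DCT-II; column 0 is c0.
    fb = mel_filterbank(n_filters, frame_len, sample_rate)
    log_energy = np.log(spectrum ** 2 @ fb.T + 1e-10)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                    np.arange(n_filters) + 0.5) / n_filters)
    mfcc = log_energy @ basis.T
    c0 = mfcc[:, :1]

    # Differentiator 7: frame-to-frame difference (first row is zero).
    delta = np.diff(mfcc, axis=0, prepend=mfcc[:1])

    return np.hstack([spectrum, c0, delta])
```

In this sketch, multiplying the input by a constant gain scales the FFT magnitudes and shifts the zero-th-order coefficient by a constant, while the differential component is essentially unchanged, which is consistent with the statement that the differential component captures time variations not dependent on the volume.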
The feature values of the sound data used in the present invention need not necessarily include the above values. For example, the feature values of the sound data may be values including frequency components obtained by a Fourier transform of the sound data and a value based on a result obtained by a cepstral analysis of the sound data, or values including the frequency components obtained by the Fourier transform of the sound data and the zero-th-order component of the result obtained by the cepstral analysis of the sound data. A cepstral analysis performed to detect a feature value of the sound data need not necessarily be a mel-frequency cepstral analysis.
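As an illustration of a cepstral analysis that is not a mel-frequency cepstral analysis, the following Python sketch computes a plain real cepstrum (the inverse Fourier transform of the logarithm of the magnitude spectrum) and combines its zero-th-order component with the frequency components obtained by the Fourier transform. The function names and the small constant added before taking the logarithm are assumptions made for this example.

```python
import numpy as np


def real_cepstrum(frame):
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    frame = np.asarray(frame, dtype=float)
    spectrum = np.abs(np.fft.rfft(frame))
    # The small constant avoids log(0); its value is an arbitrary choice.
    return np.fft.irfft(np.log(spectrum + 1e-10), n=len(frame))


def fft_plus_cepstrum_features(frame):
    """Feature vector: FFT magnitudes followed by the zero-th-order cepstral value."""
    frame = np.asarray(frame, dtype=float)
    magnitudes = np.abs(np.fft.rfft(frame))
    c0 = real_cepstrum(frame)[0]
    return np.concatenate([magnitudes, [c0]])
```

For an ideal single-sample impulse, the magnitude spectrum in this sketch is flat, and doubling the amplitude of the impulse changes the zero-th-order cepstral value by the natural logarithm of 2.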
The learning unit 8 generates a model by machine-learning the feature values of the sound data extracted by the feature value extractor 20, which are learning data. For example, the learning unit 8 receives input of teacher data (particular information) indicating the sound source (the sound source itself or the state of the sound source) of the sound data along with the feature values of the sound data and generates a model by learning the relationship between the sound data and teacher data. The learning unit 8 then stores the generated model in the model storage unit 9. Note that the learning unit 8 need not necessarily use the above method to learn from the feature values of the sound data and may use any method. For example, the learning unit 8 may learn previously classified sound data such that the sound data can be identified based on the feature values thereof.
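Because the learning method is left open, the following Python sketch uses a nearest-centroid model purely as a stand-in: for each sound-source label in the teacher data, it stores the mean of the feature vectors carrying that label. The function name `train_centroid_model`, the label strings, and the choice of model are assumptions made for this example and are not the method of the embodiment.

```python
import numpy as np


def train_centroid_model(feature_vectors, labels):
    """Build a model mapping each label to the mean feature vector of its samples.

    feature_vectors: array-like of shape (n_samples, n_features)
    labels: sequence of n_samples labels (teacher data naming the sound source)
    """
    feature_vectors = np.asarray(feature_vectors, dtype=float)
    model = {}
    for label in sorted(set(labels)):
        rows = [i for i, current in enumerate(labels) if current == label]
        model[label] = feature_vectors[rows].mean(axis=0)
    return model


# Toy learning data with two-dimensional feature vectors (illustrative values).
example_model = train_centroid_model(
    [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [12.0, 10.0]],
    ["door_close", "door_close", "light_switch", "light_switch"],
)
```

The resulting dictionary plays the role of the generated model that would be stored in the model storage unit 9.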
Next, referring to
First, the model storage unit 9 stores the model generated by learning the feature values of the sound data as learning data in the learning phase as described above. The microphone 2 acquires a sound signal to be detected whose sound source has not been identified, such as environmental sound, and the A/D converter 3 converts this analog sound signal into digital sound data.
The signal processor 1 receives the sound data to be detected, eliminates noise at the noise cancellation unit 4, and extracts feature values of the sound data at the feature value extractor 20. At this time, the feature value extractor 20 extracts the feature values of the sound data to be detected at the three mathematical functional blocks, that is, the FFT unit 5, MFCC unit 6, and differentiator 7 in a manner similar to that in which the feature values are extracted in the learning phase. Specifically, the feature value extractor 20 extracts, as the feature values of the sound data, values including frequency components obtained by a fast Fourier transform of the sound data, the zero-th-order component of a result obtained by a mel-frequency cepstral coefficient analysis of the sound data, and the differential component obtained by differentiating the result obtained by the mel-frequency cepstral coefficient analysis of the sound data. Note that the feature values of the sound data extracted in the detection phase need not necessarily include the above values and may include values similar to those extracted in the learning phase.
The determination unit 10 makes a comparison between the feature values extracted from the sound data by the feature value extractor 20 and the model stored in the model storage unit 9 and identifies the sound source of the sound data to be detected. For example, the determination unit 10 inputs the feature values extracted from the sound data to the model and identifies a sound source corresponding to a label representing an output value thereof, as the sound source of the sound data to be detected.
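The comparison performed by the determination unit 10 can be illustrated with the following Python sketch. It assumes, for illustration only, that the stored model is a dictionary mapping each sound-source label to a representative feature vector and that the comparison is a Euclidean nearest-neighbor search; the function name `identify_source` and the label strings are hypothetical.

```python
import numpy as np


def identify_source(model, feature_vector):
    """Return the label whose stored vector is closest to the input features.

    model: dict mapping label -> representative feature vector (assumed format)
    feature_vector: features extracted from the sound data to be detected
    """
    feature_vector = np.asarray(feature_vector, dtype=float)
    distances = {
        label: float(np.linalg.norm(feature_vector - np.asarray(center, dtype=float)))
        for label, center in model.items()
    }
    return min(distances, key=distances.get)


# Toy model with two-dimensional feature vectors (illustrative values).
toy_model = {"door_close": [0.0, 1.0], "light_switch": [11.0, 10.0]}
detected = identify_source(toy_model, [1.0, 1.0])
```

The returned label corresponds to the output value of the model from which the sound source of the sound data to be detected is identified.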
Next, an operation of the sound processing system thus configured will be described. First, referring to the flowchart of
First, the sound processing system collects, from the microphone 2, a sound signal consisting of an impulse sound to be learned, whose sound source has been identified (step S1). Note that the sound signal to be learned need not be one collected by the microphone and may be a recorded sound signal. The sound processing system then converts the collected sound signal into digital sound data, which is signal-processable, numerical data, at the A/D converter 3 (step S2).
The sound processing system then inputs the sound data to the signal processor 1 and eliminates noise (stationary noise: the sound of an air-conditioner indoors, the sound of wind outdoors, etc.) included in the sound data at the noise cancellation unit 4 (step S3). The sound processing system then extracts the feature values of the sound data at the feature value extractor 20, that is, the FFT unit 5, MFCC unit 6, and differentiator 7 (step S4). In the present embodiment, the sound processing system extracts, as the feature values of the sound data, values including frequency components obtained by a fast Fourier transform of the sound data, the zero-th-order component of a result obtained by a mel-frequency cepstral coefficient analysis of the sound data, and the differential component obtained by differentiating the result obtained by the mel-frequency cepstral coefficient analysis of the sound data.
The sound processing system then generates a model by machine-learning the feature values of the sound data as learning data at the learning unit 8 (step S5). For example, the learning unit 8 receives input of teacher data indicating the sound source of the sound data along with the feature values of the sound data and generates a model by learning the relationship between the sound data and teacher data. The sound processing system then stores the model generated from the learning data in the model storage unit 9 (step S6).
Next, referring to the flowchart of
First, the sound processing system newly collects and detects a sound signal, such as environmental sound, from the microphone 2 (step S11). Note that the sound signal need not be one collected by the microphone and may be a recorded sound signal. The sound processing system then converts the collected sound signal into digital sound data, which is signal-processable, numerical data, at the A/D converter 3 (step S12).
The sound processing system then inputs the sound data to the signal processor 1 and eliminates noise (stationary noise: the sound of an air-conditioner indoors, the sound of wind outdoors, etc.) included in the sound data at the noise cancellation unit 4 (step S13). The sound processing system then extracts feature values of the sound data at the feature value extractor 20, that is, the FFT unit 5, MFCC unit 6, and differentiator 7 (step S14). In the present embodiment, the sound processing system extracts, as the feature values of the sound data, values including frequency components obtained by a fast Fourier transform of the sound data, the zero-th-order component of a result obtained by a mel-frequency cepstral coefficient analysis of the sound data, and the differential component obtained by differentiating the result obtained by the mel-frequency cepstral coefficient analysis of the sound data. These steps are approximately the same as those in the learning phase.
The sound processing system then, at the determination unit 10, makes a comparison between the feature values extracted from the sound data and the model stored in the model storage unit 9 (step S15) and identifies the sound source of the sound data to be detected (step S16). For example, the determination unit 10 inputs the feature values extracted from the sound data to the model and identifies a sound source corresponding to a label, which is the output value thereof, as the sound source of the sound data to be detected.
As described above, with respect to the sound data, the present invention extracts sound pressure variations in the time domain using the zero-th-order component of MFCC, extracts time variations not dependent on the volume using the differential component of MFCC, and extracts the frequency components of the impulse by FFT, and uses the sound pressure variations and the like as feature values of the sound data. By learning the sound data having these feature values, the present invention is able to identify the type of the impulse sound that is included in environmental sound or the like and whose sound source is unknown.
Next, a second embodiment of the present invention will be described with reference to
First, referring to
When the CPU 101 acquires and executes the programs 104, a feature value extractor 121 shown in
The hardware configuration of the information processing apparatus serving as the sound processing apparatus 100 shown in
The sound processing apparatus 100 performs the sound processing method shown in the flowchart of
As shown in
As described above, the present invention extracts, as the feature values of the sound signal, the values including the frequency components obtained by the Fourier transform of the sound signal and the value based on the result obtained by the cepstral analysis of the sound signal. Thus, the present invention is able to properly extract the features of the impulse sound based on the values. As a result, the impulse sound is easily recognized.
Some or all of the embodiments can be described as in Supplementary Notes below. While the configurations of the sound processing method, sound processing apparatus, and program according to the present invention are outlined below, the present invention is not limited thereto.
A sound processing method comprising:
performing a Fourier transform and then a cepstral analysis of a sound signal; and
extracting, as feature values of the sound signal, values including frequency components obtained by the Fourier transform of the sound signal and a value based on a result obtained by the cepstral analysis of the sound signal.
The sound processing method according to Supplementary Note 1, wherein the extracting comprises extracting, as the feature values of the sound signal, values including the frequency components obtained by the Fourier transform of the sound signal and the zero-th-order component of the result obtained by the cepstral analysis of the sound signal.
The sound processing method according to Supplementary Note 2, wherein the extracting comprises extracting, as the feature values of the sound signal, values including the frequency components obtained by the Fourier transform of the sound signal, the zero-th-order component of the result obtained by the cepstral analysis of the sound signal, and a differential component of the result obtained by the cepstral analysis of the sound signal.
The sound processing method according to any of Supplementary Notes 1 to 3, wherein the cepstral analysis is a mel-frequency cepstral coefficient analysis.
The sound processing method according to any of Supplementary Notes 1 to 4, wherein a model is generated by learning the sound signal based on the feature values extracted from the sound signal and identification information identifying the sound signal.
The sound processing method according to Supplementary Note 5, wherein the feature values are extracted from the newly detected sound signal, and the identification information corresponding to the feature values extracted from the new sound signal is identified using the model.
The sound processing method according to any of Supplementary Notes 1 to 4, wherein the feature values are extracted from the newly detected sound signal, and the sound signal is identified based on the feature values.
A sound processing apparatus comprising a feature value extractor configured to perform a Fourier transform and then a cepstral analysis of a sound signal and to extract, as feature values of the sound signal, values including frequency components obtained by the Fourier transform of the sound signal and a value based on a result obtained by the cepstral analysis of the sound signal.
The sound processing apparatus according to Supplementary Note 8, wherein the feature value extractor extracts, as the feature values of the sound signal, values including the frequency components obtained by the Fourier transform of the sound signal and the zero-th-order component of the result obtained by the cepstral analysis of the sound signal.
The sound processing apparatus according to Supplementary Note 8.1, wherein the feature value extractor extracts, as the feature values of the sound signal, values including the frequency components obtained by the Fourier transform of the sound signal, the zero-th-order component of the result obtained by the cepstral analysis of the sound signal, and a differential component of the result obtained by the cepstral analysis of the sound signal.
The sound processing apparatus according to Supplementary Note 8.2, wherein the cepstral analysis is a mel-frequency cepstral coefficient analysis.
The sound processing apparatus according to any of Supplementary Notes 8 to 8.3, comprising a learning unit configured to generate a model by learning the sound signal based on the feature values extracted from the sound signal and identification information identifying the sound signal.
The sound processing apparatus according to Supplementary Note 9, wherein the feature value extractor extracts the feature values from the newly detected sound signal, the sound processing apparatus comprising an identification unit configured to identify the identification information corresponding to the feature values extracted from the new sound signal using the model.
The sound processing apparatus according to Supplementary Note 8 or 9, wherein the feature value extractor extracts the feature values from the newly detected sound signal, the sound processing apparatus comprising an identification unit configured to identify the sound signal based on the feature values extracted from the newly detected sound signal.
A program for implementing, in an information processing apparatus, a feature value extractor configured to perform a Fourier transform and then a cepstral analysis of a sound signal and to extract, as feature values of the sound signal, values including frequency components obtained by the Fourier transform of the sound signal and a value based on a result obtained by the cepstral analysis of the sound signal.
The program according to Supplementary Note 10, wherein the program further implements, in the information processing apparatus, a learning unit configured to generate a model by learning the sound signal based on the feature values extracted from the sound signal and identification information identifying the sound signal.
The program according to Supplementary Note 10.1, wherein
the feature value extractor extracts the feature values from the newly detected sound signal, and
the program further implements, in the information processing apparatus, an identification unit configured to identify the identification information corresponding to the feature values extracted from the new sound signal using the model.
The program according to Supplementary Note 10 or 10.1, wherein
the feature value extractor extracts the feature values from the newly detected sound signal, and
the program further implements, in the information processing apparatus, an identification unit configured to identify the sound signal based on the feature values extracted from the newly detected sound signal.
The above programs may be stored in various types of non-transitory computer-readable media and provided to a computer. The non-transitory computer-readable media include various types of tangible storage media. The non-transitory computer-readable media include, for example, a magnetic recording medium (for example, a flexible disk, a magnetic tape, a hard disk drive), a magneto-optical recording medium (for example, a magneto-optical disk), a CD-ROM (Read Only Memory), a CD-R, a CD-R/W, and a semiconductor memory (for example, a mask ROM, a PROM (Programmable ROM), an EPROM (Erasable PROM), a flash ROM, a RAM (Random Access Memory)). The programs may also be provided to a computer by using various types of transitory computer-readable media. The transitory computer-readable media include, for example, an electric signal, an optical signal, and an electromagnetic wave. The transitory computer-readable media can provide the programs to a computer via a wired communication channel such as an electric wire or an optical fiber, or via a wireless communication channel.
While the present invention has been described with reference to the example embodiments and so on, the present invention is not limited to the example embodiments described above. The configurations and details of the present invention can be changed in various manners that can be understood by one skilled in the art within the scope of the present invention.
The present invention is based upon and claims the benefit of priority from Japanese Patent Application 2019-042431 filed on Mar. 8, 2019 in Japan, the disclosure of which is incorporated herein in its entirety by reference.
| Number | Date | Country | Kind |
|---|---|---|---|
| 2019-042431 | Mar 2019 | JP | national |

| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/JP2019/049599 | 12/18/2019 | WO | 00 |