This application is related to and claims priority under 35 U.S.C. §119(a) to Japanese Patent Application No. 2007-169117, filed on Jun. 27, 2007, which is incorporated herein by reference.
1. Technical Field
The present invention relates to an acoustic recognition apparatus for recognizing a specific acoustic signal and in particular, to an acoustic recognition apparatus, an acoustic recognition method, and an acoustic recognition program for recognizing an acoustic signal using an intensity distribution of a frequency.
2. Description of the Related Art
A monitoring camera has conventionally been used as a device for confirming the state of a specific place or thing. The monitoring camera is effective in detecting an abnormality such as an intrusion by a criminal. However, a simple image monitoring system requires a person in charge of monitoring to watch a monitor continuously, so the person can fail to detect an abnormality, particularly when the monitoring workload increases. With that in mind, devices have been provided in recent years that use image recognition technology to detect and report both the motion of a person and the state of a thing. Such a device is used in applications such as detecting someone moving around in a place that persons should not enter, or finding a defective product on a production line of a factory. Unfortunately, the range covered by such image monitoring is limited to the angular field of view of the camera. In addition, an abnormality may not be found simply by watching. Consequently, image recognition alone is not sufficient, and complementary methods are required.
In view of this, a method of detecting an abnormality by detecting a specific sound using acoustic recognition technology has been considered. For example, Japanese Patent Laid-Open No. 2005-196539 discusses an apparatus which detects a shutter sound in order to prevent unauthorized filming (e.g., sneak shots and digital shoplifting). The apparatus includes at least one sound collecting microphone that is responsive to the sound of photography in a prohibited area. When a visitor takes a picture in the photography prohibited area, the sound collecting microphone collects the sound. The apparatus compares the collected sound with shutter sound sample data stored in a database to identify whether or not the sound is a shutter sound. If the collected sound is a shutter sound, the apparatus issues a warning sound.
Japanese Patent Laid-Open No. 10-97288 discusses a technique which analyzes an input sound signal to obtain a spectrum feature parameter and recognizes the sound type based on that parameter. The apparatus is provided with a power ratio calculation part and a ratio information/time constant conversion part. The power ratio calculation part obtains the ratio information between the power of the spectrum feature parameter and the power of the estimated noise spectrum. The ratio information/time constant conversion part outputs a time constant for updating the estimated noise spectrum according to the ratio information. Further, the apparatus is provided with a noise spectrum forming part and a noise removing part. The noise spectrum forming part forms a new estimated noise spectrum based on the time constant, the spectrum feature parameter, and the previous estimated noise spectrum. The noise removing part removes a noise component by subtracting the noise spectrum from the spectrum feature parameter. Still further, the apparatus includes a pattern recognition part, which determines the sound type by matching a reference parameter pattern with the spectrum feature parameter from which the noise component has been removed.
According to an aspect of an embodiment, an acoustic recognition apparatus determines whether or not a pre-stored target acoustic signal of a target sound subject to detection is contained in an entered input acoustic signal. The acoustic recognition apparatus includes an acoustic signal analysis part which divides the input acoustic signal into a plurality of frames separated by a unit time including at least one cycle of the target acoustic signal, obtains a frequency spectrum for each frame, and creates, based on the frequency spectra, an input frequency intensity distribution composed of the plurality of frames. A target sound storage part divides the target acoustic signal into a plurality of frames, analyzes the frames for each characteristic frequency having a feature of the target acoustic signal, and stores the frames, including the characteristic frequencies, as a target frequency intensity distribution. A characteristic frequency extraction part extracts only the components of the characteristic frequencies of the target acoustic signal stored by the target sound storage part from the input frequency intensity distribution created by the acoustic signal analysis part, and creates a characteristic frequency intensity distribution. A calculation part continuously compares the target frequency intensity distribution stored by the target sound storage part with the characteristic frequency intensity distribution created by the characteristic frequency extraction part by shifting the frames, and calculates a difference between the two distributions. A determination part determines whether or not the target acoustic signal is contained in the input acoustic signal based on the difference calculated by the calculation part.
Hereinafter, embodiments will be described. The embodiments can be implemented in many different forms and should not be interpreted as being restricted to the description herein. It should be noted that the same reference numeral denotes the same element throughout the present embodiments.
The description of the present embodiments focuses on an apparatus, but, as should be apparent to those skilled in the art, the present embodiments can also be implemented as a system, a method, or a program causing a computer to operate. In addition, the present embodiments can be implemented as hardware, software, or a combination of hardware and software. The program can be recorded on any computer-readable medium such as a hard disk, a CD-ROM, a DVD-ROM, an optical storage device, or a magnetic storage device. Further, the program can be provided to another computer via a network.
The acoustic recognition apparatus 100 in accordance with the first embodiment is provided with an A/D converter 110, a DSP (Digital Signal Processor) 120, and a memory 130.
The A/D converter 110 reads an analog input signal entered from a microphone and converts it into a digital signal.
The DSP 120, to which the converted digital signal is input, performs an acoustic recognition process according to an acoustic recognition program.
It should be noted that the acoustic recognition apparatus 100 can also include a device that displays the execution result on a screen or emits a warning sound from a speaker for the user to confirm.
The memory 130 stores the acoustic recognition program as well as the features of the target sound.
The acoustic recognition apparatus 100 includes an acoustic signal analysis processing part 210, a characteristic frequency extraction processing part 220, a calculation processing part 230, a determination processing part 240, an output processing part 250, and a target sound storage part 260.
The acoustic signal analysis processing part 210 divides an acoustic signal entered from a microphone 280 into frames separated by a predetermined unit time (e.g., 20 msec). Further, the acoustic signal analysis processing part 210 performs a frequency analysis for each divided frame to obtain a frequency spectrum. The acoustic recognition apparatus 100 can obtain an intensity distribution of a frequency by storing the spectrum data for a plurality of frames. In other words, the acoustic recognition apparatus 100 performs a process of creating an input frequency intensity distribution showing an intensity of an input sound composed of a plurality of frames based on the obtained frequency spectrum.
It should be noted that a user can set any time length for one frame, but it is desirable that the time length of one frame contain at least one cycle of the target sound subject to detection. Doing so allows the acoustic recognition apparatus 100 to detect with high accuracy whether the target sound is contained in the input sound. The user can also set any number of frames, but preferably about 50 to 100 frames (one to two seconds when one frame is 20 msec) should be used, which likewise allows the acoustic recognition apparatus 100 to detect with high accuracy.
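As an illustrative sketch of the framing and frequency analysis described above (not the claimed implementation; the sampling rate, frame length, and function names are assumptions):

```python
import numpy as np

def input_frequency_intensity_distribution(signal, fs=16000, frame_ms=20, num_frames=100):
    """Divide `signal` into unit-time frames and FFT each frame.

    Returns a (num_frames, frame_len // 2) array corresponding to the
    input frequency intensity distribution described above. Parameter
    values are illustrative assumptions, not fixed by the embodiment.
    """
    frame_len = int(fs * frame_ms / 1000)              # 20 msec -> 320 samples at 16 kHz
    frames = []
    for i in range(num_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        if len(frame) < frame_len:
            break                                      # ran out of samples
        spectrum = np.abs(np.fft.rfft(frame))[:frame_len // 2]
        frames.append(spectrum)                        # optionally np.log(spectrum + 1e-12)
    return np.array(frames)
```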
The target sound storage part 260 stores information about the target sound subject to detection. More specifically, for example, a characteristic frequency indicating a feature of the target sound, a magnitude of a component of the characteristic frequency and other information are stored as a target frequency intensity distribution for each frame.
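For illustration, the stored records might take a shape like the following; the field names and layout are hypothetical, not the apparatus's actual storage format:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TargetFrame:
    """One frame of the target frequency intensity distribution: only the
    characteristic frequencies and the magnitudes of their components are
    kept, as described above."""
    characteristic_freqs_hz: List[float]
    magnitudes: List[float]

# The target sound storage part 260 then amounts to a sequence of frames:
# target_distribution: List[TargetFrame]
```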
The characteristic frequency extraction processing part 220 performs a process of creating a characteristic frequency intensity distribution by extracting only the characteristic frequency component of the target sound stored by the target sound storage part 260 from the input frequency intensity distribution created by the acoustic signal analysis processing part 210. In so doing, the component of a frequency region not related to the target sound subject to detection is deleted from the input frequency intensity distribution.
It should be noted that when the characteristic frequency extraction processing part 220 extracts the characteristic frequency component of the target sound, it may extract only the value of that frequency. Preferably, however, it should extract components within a range of approximately 50% to 200% of the characteristic frequency at the maximum. In doing so, a small error may be introduced, but the component of the characteristic frequency can be reliably extracted.
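A minimal sketch of this extraction, assuming the 50% to 200% margin is applied as a band around each stored characteristic frequency (function and parameter names are assumptions):

```python
import numpy as np

def extract_characteristic_components(input_dist, char_freqs_hz, freq_resolution_hz,
                                      low_ratio=0.5, high_ratio=2.0):
    """Zero out every frequency bin of the input frequency intensity
    distribution that lies outside the 50%-200% bands around the stored
    characteristic frequencies, keeping only the target-related components."""
    num_bins = input_dist.shape[1]
    bin_freqs = np.arange(num_bins) * freq_resolution_hz
    mask = np.zeros(num_bins, dtype=bool)
    for f in char_freqs_hz:
        mask |= (bin_freqs >= low_ratio * f) & (bin_freqs <= high_ratio * f)
    return np.where(mask, input_dist, 0.0)             # delete unrelated frequency regions
```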
The calculation processing part 230 performs a process of calculating the difference between the target frequency intensity distribution stored by the target sound storage part 260 and the characteristic frequency intensity distribution created by the characteristic frequency extraction processing part 220. More specifically, the difference is calculated by subtracting the characteristic frequency intensity distribution from the target frequency intensity distribution. The process is continuously performed for each unit time by shifting the input sound by one frame.
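A sketch of this frame-shifted comparison, assuming the subtraction is clipped at zero for each frame and frequency (consistent with the formula discussion later) and assuming simple 2-D array layouts:

```python
import numpy as np

def sliding_difference(target_dist, char_dist_stream):
    """Shift the input one frame at a time and total the components that
    remain after subtracting the characteristic frequency intensity
    distribution from the target frequency intensity distribution."""
    T = target_dist.shape[0]                           # frames in the target sound
    N = char_dist_stream.shape[0]                      # input frames analyzed so far
    diffs = []
    for t in range(N - T + 1):
        window = char_dist_stream[t:t + T]             # one-frame shift per step
        p_sub = np.maximum(target_dist - window, 0.0)  # clipped subtraction
        diffs.append(p_sub.sum())                      # total remaining value
    return np.array(diffs)
```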
The determination processing part 240 performs a process of determining, from the graph of the result calculated by the calculation processing part 230, whether the target sound is contained in the input sound.
The output processing part 250 performs a process of displaying the result determined by the determination processing part 240 on a screen or outputting it audibly.
First, an acoustic signal is entered from a microphone 280 (Operation S301). The acoustic signal analysis processing part 210 divides the entered acoustic signal into frames separated by a unit time (Operation S302). A frequency analysis is performed for each divided frame to obtain a frequency spectrum (Operation S303).
It should be noted that an FFT (Fast Fourier Transform) or a wavelet transform can be used to obtain the frequency spectrum. Alternatively, the logarithm of the spectrum obtained by the transform may be used as the frequency spectrum.
On the basis of the frequency spectrum of each frame obtained in Operation S303, an input frequency intensity distribution composed of a plurality of frames is created (Operation S304).
Here, the above process will be described in detail.
Here, the process of detecting a presence or absence of a target sound will be described.
Here, the above process will be described in detail.
The extraction is performed according to an expression in which "a" and "b" are positive constant coefficients.
As a result of the extraction, a characteristic frequency intensity distribution is created. When the characteristic frequency is extracted, most of the components of non-target sounds are deleted, while the components of the target sound are preserved.
Then, the calculation processing part 230 performs a process of calculating the difference by comparing the created characteristic frequency intensity distribution with the target frequency intensity distribution. More specifically, the calculation processing part 230 subtracts the characteristic frequency intensity distribution from the target frequency intensity distribution and takes the total value of the remaining components as the difference. Assuming that the target frequency intensity distribution is "Ptarget(t, f)" and the result of subtracting the characteristic frequency intensity distribution from the target frequency intensity distribution is "Psub(t, f)", the following expression is obtained.
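A plausible form of this expression, assuming the subtraction is performed per frame and per frequency and is clipped at zero (an assumption consistent with the non-negativity remark below):

Psub(t, f)=max(Ptarget(t, f)−Pchar(t, f), 0)

where "Pchar(t, f)" denotes the characteristic frequency intensity distribution.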
The above formula assumes that even if the magnitude of the frequency component corresponding to the target sound in the input sound is greater than that of the target sound stored in the target sound storage part 260, the subtracted result does not become negative.
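Under the same assumptions, the total value after subtraction can plausibly be written as:

Powsub(t)=Σ_{shift=0}^{T−1} Σ_{f=f1}^{f2} Psub(t−shift, f)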
Here, "T" indicates the length of the time period subject to analysis, and "shift" indicates the time delay (in number of frames). More specifically, the total value of the frequency components after subtraction at time "t" is the sum of "Psub(t, f)" over the past "T" frames, including the frame at that time. It is preferable to set the target time period to a few seconds; for example, if one frame is 20 msec and the target time period is set to two seconds, T=100 (frames). It should be noted that "f1" and "f2" indicate the start and the end of the frequency range subject to detection, respectively. This depends on the target sound subject to detection, but in general it is desirable to set the range from 100 Hz to 8000 Hz.
Meanwhile, information about the target sound subject to detection is stored in advance in the target sound storage part 260. However, not all frequencies of the target sound need to be stored; it is sufficient to store information about a number of frequency components that represents the features of the target sound.
In the above method, the total value of the frequency components after subtraction is calculated as the difference between the target frequency intensity distribution and the characteristic frequency intensity distribution, but other methods may be used. For example, in addition to the total value of the frequency components, the area of the remaining frequency region may be used.
The band division processing part 1210 is a processing part configured such that only a specific frequency band of the input sound is subject to detection and the other frequency bands are excluded. The processing speed can be increased and the processing efficiency enhanced by decreasing the number of detections.
First, a sound is entered from the microphone 280 and is converted into an acoustic signal (Operation S1301). The band division processing part 1210 extracts only the frequency band subject to detection from the acoustic signal and deletes the other frequency regions (Operation S1302).
Here, the process of the band division processing part 1210 will be described in detail.
It should be noted that a general FIR filter or QMF (Quadrature Mirror Filter) may be used as the frequency band division filter.
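As an illustrative stand-in for such a filter (a plain FIR band-pass via SciPy; the cutoffs are assumptions taken from the 100 Hz to 8000 Hz range mentioned earlier):

```python
import numpy as np
from scipy.signal import firwin, lfilter

def band_limit(signal, fs=16000, low_hz=100.0, high_hz=8000.0, numtaps=101):
    """Pass only the frequency band subject to detection and suppress the
    rest, as the band division processing part 1210 does."""
    high_hz = min(high_hz, 0.99 * fs / 2)              # keep the edge below Nyquist
    taps = firwin(numtaps, [low_hz, high_hz], pass_zero=False, fs=fs)
    return lfilter(taps, [1.0], signal)
```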
The differentiation processing part 241 differentiates the result that the calculation processing part 230 calculated as the difference between the target frequency intensity distribution and the characteristic frequency intensity distribution. The first-order differential may be used for differentiation, but the second-order differential can enhance the determination accuracy.
The processes from Operation S1601 to Operation S1608 are the same as those in the first embodiment.
Here, the process of the differentiation processing part 241 will be described in detail.
ΔPowsub(t)=Powsub(t)−Powsub(t−1) [Formula 4]
ΔΔPowsub(t)=ΔPowsub(t)−ΔPowsub(t−1) [Formula 5]
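A short sketch of Formulas 4 and 5, assuming simple backward differences over the sequence of Powsub(t) values:

```python
import numpy as np

def second_order_differential(pow_sub):
    """Compute the first- and second-order differences of Powsub(t);
    the second-order result is what the determination processing part
    can threshold for higher accuracy."""
    d1 = np.diff(pow_sub)        # Formula 4: ΔPowsub(t) = Powsub(t) - Powsub(t-1)
    d2 = np.diff(d1)             # Formula 5: ΔΔPowsub(t) = ΔPowsub(t) - ΔPowsub(t-1)
    return d2
```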
The acoustic recognition apparatus 100 is provided with an acoustic detection processing part 1810, an acoustic signal analysis processing part 210, a local peak determination processing part 1820, a maximum peak determination processing part 1830, a local peak selection processing part 1840, a database storage processing part 1850, and a target sound storage part 260. According to the fourth embodiment, the user of the acoustic recognition apparatus 100 can store in the target sound storage part 260 any target sound subject to detection, depending on the environment, and can also create the target sound storage part 260.
The acoustic detection processing part 1810 performs a process of detecting the rising edge of a sound. When the user turns on a storage switch 1805 and the target sound to be stored occurs, an acoustic storage process starts and detects the rising edge of the entered acoustic signal. There are various methods of detecting the rising edge of a sound; for example, it is possible to measure the magnitude of the input acoustic signal for each unit time and compare that magnitude with a threshold.
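One simple realization of such a detector, measuring per-frame power and watching for a threshold crossing (frame length and threshold value are assumptions):

```python
import numpy as np

def detect_rising_edge(signal, fs=16000, frame_ms=20, threshold=0.01):
    """Return the index of the first frame whose mean power crosses
    `threshold` from below, i.e. the rising edge of a sound."""
    x = np.asarray(signal, dtype=float)
    frame_len = int(fs * frame_ms / 1000)
    n = len(x) // frame_len
    power = np.array([np.mean(x[i*frame_len:(i+1)*frame_len] ** 2) for i in range(n)])
    for i in range(1, n):
        if power[i - 1] < threshold <= power[i]:
            return i
    return None                                        # no rising edge found
```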
The acoustic signal analysis processing part 210 performs the same process as in the first embodiment. It should be noted that in the fourth embodiment, only the processes up to obtaining the frequency spectrum are performed; the process of creating the distribution is not.
The local peak determination processing part 1820 determines local peaks from the frequency spectrum obtained by the acoustic signal analysis processing part 210. The frequency spectrum is searched sequentially, starting from the lowest frequency. A frequency whose component is larger than those of the adjacent frequencies is determined to be a local peak (details will be described later).
The maximum peak determination processing part 1830 determines the largest frequency component of all the frequency components in the frequency spectrum as the maximum peak. The process may be configured to take the maximum value over all the frequency components in the frequency spectrum, or to take the largest of the local peaks determined by the local peak determination processing part 1820 as the maximum peak.
The local peak selection processing part 1840 selects the characteristic frequencies to be stored as the characteristic frequencies of the target sound in the target sound storage part 260. Here, a local peak whose difference from the maximum peak is within a predetermined first threshold and whose magnitude is equal to or greater than a predetermined second threshold is selected as a characteristic frequency.
The database storage processing part 1850 performs a process of storing a local peak selected by the local peak selection processing part 1840 as a characteristic frequency in the target sound storage part 260.
It should be noted that the acoustic detection processing part 1810 may be configured to be included in the acoustic recognition apparatus 100.
First, an acoustic signal is entered from the microphone 280 (Operation S1901). When the user turns on the storage switch and the target sound occurs, the acoustic detection processing part 1810 detects the entered acoustic signal (Operation S1902). The acoustic detection processing part 1810 determines whether there is a rising edge of a sound (Operation S1903). If the rising edge of a sound cannot be detected, the process returns to Operation S1902, in which the acoustic detection is performed again. If the rising edge of a sound is detected, the input acoustic signal is divided into frames (Operation S1904), and then frequency analysis is performed for each frame (Operation S1905). As a result of the frequency analysis, a frequency spectrum is created (Operation S1906). On the basis of the frequency spectrum, a local peak is determined (Operation S1907).
It should be noted that here, in the same way as in the first embodiment, the frequency spectrum may be obtained by taking the logarithm of the spectrum.
Here, the local peak determination process will be described in detail.
Spe(f)>Spe(f−1) and Spe(f)>Spe(f+1) [Formula 6]
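A sketch combining the local peak determination (Formula 6), the maximum peak determination, and the selection by the two thresholds described above; `th1` and `th2` correspond to the first and second thresholds, and the list-based representation is an assumption:

```python
def select_characteristic_frequencies(spectrum, th1, th2):
    """From one frame's frequency spectrum Spe(f), pick the local peaks
    that lie within th1 of the maximum peak and are at least th2."""
    peaks = [f for f in range(1, len(spectrum) - 1)
             if spectrum[f] > spectrum[f - 1] and spectrum[f] > spectrum[f + 1]]
    if not peaks:
        return []
    max_peak = max(spectrum[f] for f in peaks)         # maximum peak determination
    return [f for f in peaks
            if max_peak - spectrum[f] <= th1 and spectrum[f] >= th2]
```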
Here, the local peak selection process will be described in detail.
Here, the information stored in the target sound storage part 260 will be described.
As described above, according to the present embodiment, only the information about the characteristic frequencies having features of the target sound is stored in the target sound storage part 260, and no other information is stored therein. Accordingly, it is possible to keep the storage used by the target sound storage part 260 to a minimum while detecting the target sound with high accuracy.
The acoustic detection processing part 1810 performs a process of detecting a rising edge of a sound in the same way as in the fourth embodiment.
The termination processing part 2300 determines whether the magnitude of the acoustic signal detected by the acoustic detection processing part 1810 is greater than a predetermined threshold and, if it is less than the threshold, terminates the subsequent processing.
It should be noted that the acoustic detection processing part 1810 and the termination processing part 2300 may be configured to be included in the acoustic recognition apparatus 100.
First, a sound is entered from the microphone 280 and is converted into an acoustic signal (Operation S2401). The acoustic detection processing part 1810 detects the converted acoustic signal (Operation S2402). The level of the acoustic signal is compared with the predetermined threshold (Operation S2403). If the level of the acoustic signal is less than the predetermined threshold, termination processing is performed to end the process (Operation S2404). If the level of the input sound is equal to or greater than the predetermined threshold, the same processes as Operations S302 to S310 in the first embodiment are performed as Operations S2405 to S2413 to determine the presence or absence of the target sound.
In doing so, when it is apparent that the target sound cannot be detected, the subsequent processing can be skipped, thereby increasing efficiency as well as reducing power consumption.
It should be noted that an arbitrary value can be set as the predetermined threshold. If it is set to the second threshold "th2" of the fourth embodiment, no characteristic frequency of magnitude "th2" or less is stored in the target sound storage part 260. Accordingly, an input sound that cannot be detected can reliably be ignored, thereby increasing the efficiency of processing.
The acoustic recognition apparatus 100 in accordance with the present embodiment is provided with a CPU (Central Processing Unit) 2601, a main memory 2602, a mother board chip set 2603, a video card 2604, an HDD (Hard Disk Drive) 2611, a bridge circuit 2612, an optical drive 2621, a keyboard 2622, and a mouse 2623.
The main memory 2602 is connected to the CPU 2601 through a CPU bus and the mother board chip set 2603. The video card 2604 is connected to the CPU 2601 through an AGP (Accelerated Graphics Port) and the mother board chip set 2603. The HDD 2611 is connected to the CPU 2601 through a PCI (Peripheral Component Interconnect) bus and the mother board chip set 2603.
The optical drive 2621 is connected to the CPU 2601 through a low-speed bus, the bridge circuit 2612 between the low-speed bus and the PCI bus, the PCI bus, and the mother board chip set 2603. The keyboard 2622 and the mouse 2623 are connected to the CPU 2601 through the same connection configuration. The optical drive 2621 reads (or reads and writes) data by emitting a laser beam onto an optical disk. Examples of the optical drive include a CD-ROM drive and a DVD drive.
The acoustic recognition apparatus 100 can be built by copying an acoustic recognition program onto the HDD 2611, that is, by performing a so-called installation that configures the program so that it can be loaded into the main memory 2602 (this installation is just one example). When the user instructs the OS (Operating System) controlling the computer to activate the acoustic recognition apparatus 100, the acoustic recognition program is loaded into the main memory 2602 and activated.
It should be noted that the acoustic recognition program may be configured to be provided from a recording medium such as a CD-ROM or may be configured to be provided from another computer connected to a network through the network interface 2614.
As described above, even a hardware configuration in which the acoustic recognition apparatus 100 is implemented on a personal computer can perform the processes of the above specific embodiments.
In addition, the above specific embodiments can be applied, for example, to determine whether an abnormal sound is produced in a machine. Alternatively, the above embodiments can be used for access security for checking entrance and exit by recognizing a sound.
In the foregoing description, the present invention has been described with reference to the specific embodiments, but the scope of the present invention is not limited to the description of the embodiments and various modifications or improvements can be made to each particular embodiment. An embodiment to which those modifications or improvements are made is also included in the scope of the present invention. This is apparent from the appended claims.
Number | Date | Country | Kind
---|---|---|---
2007-169117 | Jun. 27, 2007 | JP | national