1. Field of the Invention
The present invention relates to a sound detection apparatus that captures images from an image capturing unit together with inputting sounds from a sound input unit, and detects a specific sound from the input sounds using the captured images, and to a control method thereof.
2. Description of the Related Art
Conventionally, there are voice recognition apparatuses that use image information in order to raise the accuracy of voice recognition by reducing the influence of noise and the like. In Japanese Patent Laid-Open No. 59-147398, lip movement is detected, and using the interval during which lip movement is detected as a voice interval, voice recognition is performed during that period. In Japanese Patent No. 03798530, the products of the similarities and probabilities of corresponding syllable candidates are calculated by performing image recognition of lip patterns, and added to the products of the similarities and probabilities of the syllable candidates derived through voice recognition to derive the most probable syllable candidate.
There are also image capturing apparatuses used in video surveillance that determine anomalies using the volume and type of sound.
In the case of determining the type of sound and detecting anomalies in video surveillance or the like, the accuracy of the detection becomes an issue. Generally, the number of false negatives increases when trying to reduce false detections, and false detections increase when trying to perform detection without any false negatives.
Even when image information is used in order to reduce false detection, the surveillance target is a place where there could be a plurality of objects, and thus there needs to be a correspondence other than that between syllables and lip shapes, such as a correspondence between position information of objects and types of sounds related thereto, for example.
The present invention provides a sound detection apparatus that accurately detects sounds, and a control method thereof.
A sound detection apparatus according to the present invention for achieving the above object is provided with the following configuration. That is, a sound detection apparatus that captures images from an image capturing unit together with inputting sounds from a sound input unit, and detects a specific sound from input sounds using captured images, includes a sound detection unit that detects a specific sound from sounds input by the sound input unit using thresholds for detecting sounds, an image recording unit that records images captured by the image capturing unit, a moving object detection unit that calculates a difference between an image recorded by the image recording unit and a current image captured by the image capturing unit and detects a location of a moving object from the current image, and a position/sound correspondence information management unit that manages a correspondence between information indicating a specific position in images captured by the image capturing unit and information indicating sounds that could occur at the specific position. The sound detection unit, in the case where a moving object is detected by the moving object detection unit, changes the threshold for detecting a sound managed by the position/sound correspondence information management unit, and detects the specific sound from sounds input by the sound input unit, using the changed threshold, with reference to the correspondence managed by the position/sound correspondence information management unit.
The present invention enables a sound detection apparatus that accurately detects sounds, a control method thereof, and a program to be provided.
Further features of the present invention will become apparent from the following description of exemplary embodiments (with reference to the attached drawings).
Hereinafter, an embodiment of the present invention is described in detail using the drawings.
Reference numeral 101 denotes a sound input unit that captures sounds/voices from a microphone. Reference numeral 102 denotes an image input unit that captures images (still images or moving images) from a camera serving as an image capturing unit. Reference numeral 103 denotes a moving object detection unit that calculates the difference between a past image and a current image, and detects a location (image) where a difference exists in the current image as a location (image) where a moving object exists. Reference numeral 104 denotes an image recording unit that records past images, sounds/voices and the like to recording media (hard disk, memory, etc.). Reference numeral 105 denotes an image processing unit that performs image encoding. Reference numeral 106 denotes a sound detection unit that detects specific sounds. Specifically, sounds to be detected are selected in advance, and an acoustic model is prepared for each type of sound. The similarities between an input sound and the acoustic models are then compared, and the sound of the acoustic model having the highest score is presented as a detection result. Reference numeral 107 denotes a position/sound correspondence information management unit that manages position/sound correspondence information describing the positions of moving objects and the sounds that could occur at those positions.
Note that the sound detection apparatus in
Moving object detection involves executing processing for setting a moving object detection flag at the timing at which a moving object is detected, and clearing the moving object detection flag when a predetermined time has elapsed after the moving object is no longer detected. Sound detection involves executing processing for lowering a threshold for detecting a sound corresponding to the position at which the moving object is detected, when the moving object detection flag has been set.
First, the moving object detection processing is described in detail.
At step S201 of
Here,
At step S205, the moving object detection unit 103 determines whether there is a difference. If it is determined that there is a difference (YES at step S205), that is, if it is determined that there is a moving object, the moving object detection unit 103, at step S206, sets the moving object detection flag to 1. At step S207, the moving object detection unit 103 records the detection time. At step S208, the moving object detection unit 103 records the detection position. At step S209, the moving object detection unit 103 determines whether to end the moving object detection. In the case of ending the moving object detection (YES at step S209), the moving object detection unit 103 ends the processing. On the other hand, in the case of not ending the moving object detection (NO at step S209), the moving object detection unit 103 returns to step S202 and repeats the processing.
If it is determined in step S205 that there is not a difference (NO at step S205), the moving object detection unit 103, at step S210, determines whether a predetermined time has elapsed since the moving object detection time recorded at step S207, that is, since a moving object was last detected. If it is determined that the predetermined time has elapsed (YES at step S210), the moving object detection unit 103, at step S211, sets the moving object detection flag to 0. The moving object detection unit 103 then proceeds to step S209.
On the other hand, if, in step S210, it is determined that the predetermined time has not elapsed (NO at step S210), the moving object detection unit 103 proceeds to step S209 without performing any processing. This processing is for keeping the moving object detection flag set for a predetermined time even after a moving object is no longer detected. The interval during which the moving object detection flag denoted by reference numeral 702 in
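The flag handling of steps S202 to S211 can be sketched, for illustration only, as follows. This is a minimal sketch under assumed simplifications: images are plain nested lists of pixel values, the difference test is a per-pixel absolute difference, and the class name `MovingObjectDetector` and its parameters are hypothetical, not part of the embodiment.

```python
import time

class MovingObjectDetector:
    """Minimal sketch of the moving object detection loop (steps S202-S211).

    A moving object is assumed present wherever the difference between the
    past image and the current image exceeds `diff_threshold`. The detection
    flag stays set for `hold_seconds` after motion is last seen.
    """

    def __init__(self, diff_threshold=30, hold_seconds=5.0):
        self.diff_threshold = diff_threshold
        self.hold_seconds = hold_seconds
        self.flag = 0                 # moving object detection flag
        self.last_detection_time = None
        self.last_position = None

    def process_frame(self, past_image, current_image, now=None):
        now = time.time() if now is None else now
        # Steps S203-S205: per-pixel absolute difference with the past image.
        positions = [
            (x, y)
            for y, (past_row, cur_row) in enumerate(zip(past_image, current_image))
            for x, (p, c) in enumerate(zip(past_row, cur_row))
            if abs(c - p) > self.diff_threshold
        ]
        if positions:                          # difference exists
            self.flag = 1                      # step S206
            self.last_detection_time = now     # step S207
            self.last_position = positions[0]  # step S208 (first location)
        elif (self.flag == 1 and self.last_detection_time is not None
              and now - self.last_detection_time > self.hold_seconds):
            self.flag = 0                      # steps S210/S211: timed clear
        return self.flag
```

Note how the `elif` branch implements the behavior described above: the flag is cleared only after the predetermined time has elapsed with no difference, so the flag remains set for a while after the moving object disappears.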
Next, the sound detection processing is described in detail.
At step S301 in
Here, the sound recognition processing is performed by preparing a plurality of models of specific sounds and background sounds, and computing the similarity with feature amounts in the sound interval as a likelihood. The likelihood column in
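The candidate-ranking step can be sketched as below. Real acoustic models would be statistical (e.g. GMM or HMM) and the likelihood a probabilistic score; in this hypothetical sketch each model is reduced to a reference feature vector and the "likelihood" to a negated squared distance, purely to show the compare-and-rank structure.

```python
# Toy sketch of the sound recognition step: score an input feature vector
# against per-sound "acoustic models" and rank the candidates by likelihood.

def score_candidates(features, acoustic_models):
    """Return (label, score) pairs sorted best-first.

    `acoustic_models` maps a sound label (e.g. "bang", "background sound")
    to a reference feature vector of the same length as `features`.
    """
    def similarity(ref):
        # Higher is better: negated sum of squared differences.
        return -sum((f - r) ** 2 for f, r in zip(features, ref))

    scored = [(label, similarity(ref)) for label, ref in acoustic_models.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The first element of the returned list corresponds to the acoustic model with the highest score, i.e. the sound presented as the detection result.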
At step S303, the sound detection unit 106 determines whether the moving object detection flag is 1. If it is determined that the moving object detection flag is 1 (YES at step S303), the sound detection unit 106 proceeds to step S304. At step S304, the sound detection unit 106 retrieves a position with reference to a position/sound correspondence information management table (
In step S305, if it is determined that there is position/sound correspondence information (YES at step S305), the sound detection unit 106, in step S306, lowers the threshold for detecting a sound, with regard to only the sounds of the position/sound correspondence information among the sound recognition result candidates. At step S307, the sound detection unit 106 determines a sound recognition result candidate having a larger score than the threshold as a sound detection result.
On the other hand, if, at step S303, it is determined that the moving object detection flag is 0 (NO at step S303), or if, at step S305, it is determined that there is no position/sound correspondence information corresponding to the moving object detection position (NO at step S305), the sound detection unit 106 proceeds to step S307. In these cases, the sound detection unit 106, at step S307, rather than lowering the thresholds for detecting sounds, determines the sound detection result with the thresholds unchanged similarly to a conventional technique.
After determining the sound detection result at step S307, the sound detection unit 106, in step S308, determines whether to end the sound detection processing. In the case of not ending the sound detection processing (NO at step S308), the sound detection unit 106 returns to step S301 and repeats the processing. On the other hand, in the case of ending the sound detection processing (YES at step S308), the sound detection unit 106 ends the processing.
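The threshold decision of steps S303 to S307 may be sketched as follows. The function name and the table representation (a mapping from a position to a set of sound labels) are assumptions of this sketch; the embodiment's actual table layout is described with reference to the drawings.

```python
def detect_sounds(candidates, base_threshold, flag, detected_position,
                  position_sound_table, lowered_threshold):
    """Sketch of steps S303-S307: keep candidates whose score beats a threshold.

    `candidates` is a list of (label, score) pairs from sound recognition.
    `position_sound_table` maps a position to the set of sound labels that
    could occur there; when the moving object detection flag is set and the
    detected position has an entry, only those labels get the lowered
    threshold. All other labels keep the unchanged base threshold.
    """
    lowered_labels = set()
    if flag == 1:  # step S303: a moving object was detected
        # Steps S304/S305: look up sounds associated with the detected position.
        lowered_labels = set(position_sound_table.get(detected_position, ()))
    results = []
    for label, score in candidates:  # step S307: threshold decision
        threshold = lowered_threshold if label in lowered_labels else base_threshold
        if score > threshold:
            results.append(label)
    return results
```

When the flag is 0, or the position has no entry, every label is compared against the unchanged base threshold, matching the conventional-technique path of step S307.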
Hereinafter, specific examples of moving object detection processing and sound detection processing are described.
Note that although only the correspondence between positions and sounds (sound labels) is described in the position/sound correspondence information managed in the above-mentioned position/sound correspondence information management table, a configuration may be adopted in which the correspondence with reset thresholds is additionally described, and the thresholds are changed per sound label.
Also, although position/sound correspondence information consisting of preset positions and sounds (sound labels) corresponding thereto is used in the above example, the present invention is not limited thereto. For example, a configuration may be adopted in which object/sound correspondence information consisting of types of objects and types of sounds (sounds that the objects could possibly generate) corresponding thereto is initially created by recognizing objects within an image and positions thereof, and position/sound correspondence information is automatically created using this object/sound correspondence information.
Hereinafter, position/sound correspondence information creation processing for creating position/sound correspondence information from object/sound correspondence information is described. This processing is executed through the cooperation of the moving object detection unit 103, the sound detection unit 106, and the position/sound correspondence information management unit 107, for example.
At step S1001, the position/sound correspondence information management unit 107 sets an image for recognizing objects. At step S1002, the position/sound correspondence information management unit 107 clears the position/sound correspondence information in the position/sound correspondence information management table.
At step S1003, the moving object detection unit 103, as an object recognition unit, recognizes objects in the image. At step S1004, it is determined whether an object has been recognized. If it is determined that there are no recognized objects (NO at step S1004), the processing is ended. On the other hand, if it is determined that there is a recognized object (YES at step S1004), the processing proceeds to step S1005.
At step S1005, the position/sound correspondence information management unit 107 retrieves object/sound correspondence information with reference to an object/sound correspondence information control table for managing objects and sound information corresponding thereto. At step S1006, the position/sound correspondence information management unit 107 determines whether there is a corresponding sound.
If it is determined that there is a corresponding sound (YES at step S1006), the position/sound correspondence information management unit 107, at step S1007, adds the sound corresponding to the detection position of the object as a single record of the position/sound correspondence information management table. In the case where a door is detected as an object at a position 601 in
On the other hand, in step S1006, if it is determined that there are no corresponding sounds (NO at step S1006), the processing proceeds to step S1008.
At step S1008, the position/sound correspondence information management unit 107 updates the areas of the image for recognizing objects. The processing then returns to step S1003, and object recognition is repeated on the next processing target. In other words, object detection processing is repeated, focusing on an area of the image in which an object has not been detected.
The above processing enables position/sound correspondence information such as shown in
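The creation loop of steps S1003 to S1008 reduces to the following sketch. Here the object/sound correspondence table is assumed to be a simple mapping from object type to sound labels, and recognized objects are assumed to arrive as (type, position) pairs; both representations are illustrative only.

```python
def build_position_sound_table(recognized_objects, object_sound_table):
    """Sketch of steps S1002-S1008: derive position/sound correspondence
    information from object recognition results and an object/sound table.

    `recognized_objects` is a list of (object_type, position) pairs found in
    the image; `object_sound_table` maps an object type to the sounds it
    could generate (e.g. a door to "slam").
    """
    position_sound_table = {}  # step S1002: start from a cleared table
    for object_type, position in recognized_objects:
        # Steps S1005/S1006: look up the sounds corresponding to this object.
        sounds = object_sound_table.get(object_type)
        if sounds:
            # Step S1007: add one record per recognized object position.
            position_sound_table.setdefault(position, set()).update(sounds)
    return position_sound_table
```

Objects with no entry in the object/sound correspondence table (the NO branch of step S1006) simply contribute no record, mirroring the flow above.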
Note that although the thresholds for detecting sounds corresponding to the position in which a moving object is detected are lowered in the above example, a configuration may be adopted in which the thresholds are raised. In that case, if a moving object is not detected, the thresholds for detecting all sounds are raised, and if a moving object is detected, the thresholds for detecting all sounds other than sounds corresponding to that position are raised. In this way, the thresholds for detecting sounds are changed (raised/lowered) according to the application, purpose or the like.
Also, although moving object detection processing and sound detection processing are performed independently in the above example, sounds in an interval (timeslot) from immediately before (predetermined time before) a moving object is detected until the present time may be extracted after performing moving object detection, and sound detection processing may be performed retroactively on only those sounds. In this case, a sound recording unit that records sounds input by the sound input unit 101 will be installed in the sound detection apparatus.
In the case of such a configuration, the moving object detection processing will be as shown in the flowchart of
If, at step S210, it is determined that the predetermined time has elapsed since the moving object detection time was last recorded (YES at step S210), the processing proceeds to step S401. At step S401, the moving object detection unit 103 determines whether the moving object detection flag is 1, that is, whether a moving object was detected before.
If it is determined that the moving object detection flag is 1 (YES at step S401), the processing proceeds to step S402. At step S402, the moving object detection unit 103 acquires a detection target interval to serve as the processing target of the sound detection processing. Specifically, the moving object detection unit 103 acquires a sound interval from the imaging time of a past image immediately before the moving object was detected until the predetermined time has elapsed after the moving object is no longer detected. For example, in
Next, at step S403, the sound detection unit 106 performs sound detection processing. This processing is substantially the same as the flowchart of
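The interval acquisition of step S402, and the extraction of the corresponding sounds from the sound recording unit, can be sketched as follows. The buffer layout (samples starting at time zero at a fixed sample rate) is an assumption of this sketch.

```python
def detection_target_interval(past_image_time, last_detection_time, hold_seconds):
    """Sketch of step S402: the interval to process retroactively runs from
    the imaging time of the past image immediately before the moving object
    was detected until the predetermined hold time has elapsed after the
    moving object was last detected."""
    return (past_image_time, last_detection_time + hold_seconds)

def extract_interval(recorded_samples, sample_rate, start_time, end_time):
    """Pull the samples in [start_time, end_time) out of the sound recording
    unit's buffer (assumed to start at time 0)."""
    start = int(start_time * sample_rate)
    end = int(end_time * sample_rate)
    return recorded_samples[start:end]
```

Sound detection processing is then run only on the extracted samples, rather than on the live input stream.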
Also, although processing in the case where a moving object is detected at one location is shown in the above example, similar processing can also be performed in the case where moving objects are simultaneously detected at a plurality of positions.
When a sound interval 713 is detected and sound recognition result candidates are created at the timing of an end position 714, the detection position in the moving object detection interval 709 is the position 602. Thus, the thresholds for detecting the three sounds “smash”, “shatter” and “squeak” will be lowered, based on the position/sound correspondence information in
Also, when a sound interval 715 is detected and sound recognition result candidates are created at the timing of an end position 716, the detection positions in the moving object detection intervals 709 and 710 that overlap the sound intervals are the two positions 602 and 601. Thus, the thresholds for detecting the four sounds “smash”, “shatter”, “squeak” and “slam” will be lowered, based on the position/sound correspondence information in
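The multi-position case amounts to taking the union of the sound labels of every moving object detection interval that overlaps the sound interval, which can be sketched as below. The interval and position representations are illustrative assumptions.

```python
def lowered_labels_for_intervals(sound_interval, detection_intervals,
                                 position_sound_table):
    """Sketch of the multi-position case: collect the sound labels of every
    moving object detection interval that overlaps the sound interval.

    `detection_intervals` is a list of (start, end, position) tuples;
    each overlapping interval contributes the sounds registered for its
    position in the position/sound correspondence table.
    """
    s_start, s_end = sound_interval
    labels = set()
    for d_start, d_end, position in detection_intervals:
        if d_start < s_end and s_start < d_end:  # intervals overlap
            labels |= position_sound_table.get(position, set())
    return labels
```

With one overlapping interval the result matches the single-position case; with two, the thresholds for the combined label set would be lowered, as in the example above.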
Note that although the image capturing unit for capturing images is an image capturing apparatus (fixed camera) that captures only one point in the above example, the image capturing unit may be an image capturing apparatus having a pan/tilt/zoom function. In that case, an image is captured in capturable directions while panning, tilting and zooming, and a past image is created. The captured image is calibrated so as to enable comparison. An image is then captured in capturable directions while panning, tilting and zooming after a predetermined time, and the difference from the past image is calculated with the captured image as the current image. A configuration may be adopted in which a sound interval from the point in time when the past image was captured until the point in time when the current image was captured is extracted and sound detection processing is performed, after a moving object has been detected when there is a difference.
Also, the image capturing apparatus may be an omni-directional camera capable of omni-directional imaging. In this case, an omni-directional image is converted into a panorama image, and positions are specified in arbitrary frame units.
Also, although the thresholds for detecting sounds are lowered or raised individually in the above example, a configuration may be adopted in which the thresholds are fixed and the scores are weighted. For example, a configuration may be adopted in which the score of a sound corresponding to the moving object detection position is doubled to achieve substantively the same effect as lowering the threshold.
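This fixed-threshold variant can be sketched as a simple score-weighting pass over the recognition candidates; the function name and default weight are assumptions of this sketch.

```python
def weight_scores(candidates, flag, detected_position, position_sound_table,
                  weight=2.0):
    """Sketch of the fixed-threshold variant: instead of lowering thresholds,
    multiply the score of sounds corresponding to the moving object detection
    position (doubling by default), which has substantively the same effect
    relative to a fixed threshold."""
    boosted = set()
    if flag == 1:
        boosted = set(position_sound_table.get(detected_position, ()))
    return [(label, score * weight if label in boosted else score)
            for label, score in candidates]
```

The weighted scores are then compared against the fixed thresholds exactly as in the unweighted case.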
Also, although threshold processing is performed after computing likelihoods in the sound recognition processing, a configuration may be adopted in which the parameters of a decoder are changed during the sound recognition processing to facilitate detection of sounds corresponding to the moving object detection position.
Also, although the above example is limited to processing up until a sound is detected, a sound output unit may be assigned to the image capturing apparatus, and after detection of a sound, a warning sound notifying that fact may be output. Furthermore, a display unit may be assigned, and after detection of a sound, an image notifying that fact may be output to the display unit.
Also, a configuration may be adopted in which a communication function is assigned to the image capturing apparatus, and after detection of a sound, that fact is notified to the communication destination.
Also, a configuration may be adopted in which a recording unit that records images while indexing the sound detection times and an image playback unit are assigned to the image capturing apparatus to enable cue playback of scenes in which specific sounds are detected.
Also, although sound detection is performed after changing the thresholds of sounds in accordance with the position at which a moving object is detected after performing sound recognition in the above example, the present invention is not limited thereto. For example, a configuration may be adopted in which an acoustic model is selected in accordance with the labels of sounds corresponding to the position in which the moving object was detected to narrow down the types of sounds that are targeted for sound recognition.
In
Moving Object Detection Area is sorted into the case where a moving object is not detected (moving object not detected), the case where a moving object is detected and could be in any position (moving object detected), and the case where a moving object could be detected at a designated position (area designation). In other words, Moving Object Detection Area is sorted into any of information indicating “moving object not detected”, information indicating “moving object detected”, and information indicating coordinates serving as an area designation.
The sounds “ding-dong”, “ring”, “gush” and “background sound” are the sound labels of the acoustic model selected in the case where a moving object is not detected within a captured image. The sounds “eek”, “bang” and “background sound” are the sound labels of the acoustic model selected in the case where a moving object is detected and may be at any position. The sound “slam” is the sound label in the case where a moving object is detected at the position 601 in
Note that the label “background sound” is the sound label of a background sound model that is used commonly in any of the cases. A background sound model is an acoustic model that is made by compiling sounds that a user wants to exclude from the detection result, and in the case where the score of a background sound model ranks first, there will be no sound detection result. The method of creating a background sound model is discussed later.
The differences from the flowchart of the sound detection processing in
Next, if it is determined in step S305 after performing step S304 that there is position/sound correspondence information (YES at step S305), the processing proceeds to step S1202, and the acoustic model selection unit 1101 adds the acoustic model corresponding to sound labels thereof. If a moving object is detected at the position 601 in
Next, at step S302, the sound detection unit 106 performs sound recognition processing and creates sound recognition result candidates, using the selected acoustic models. At step S307, the sound detection unit 106 determines the sound detection result.
Step S308 is executed after determining the sound detection result at step S307 in the flowchart of
If, in step S305, it is determined that there is no position/sound correspondence information corresponding to the moving object detection position (NO at step S305), sound recognition result candidates are created at step S302 without adding an acoustic model. In this case, sound recognition is performed with only the acoustic model of the sounds “eek”, “bang” and “background sound” for when a moving object is detected at any position.
If, in step S303, it is determined that the moving object detection flag is 0 (NO at step S303), the processing proceeds to step S1203, and the acoustic model selection unit 1101 selects the moving-object-not-detected acoustic model. In the example of
In this way, the processing shown in
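The model selection of steps S1201 to S1203 can be sketched as follows. The table representation (label sets keyed by area category) and the always-included shared "background sound" label follow the description above; the function name is hypothetical.

```python
def select_acoustic_models(flag, detected_position, model_table,
                           position_sound_table):
    """Sketch of acoustic model selection (steps S1201-S1203).

    `model_table` maps the area categories "not_detected" and "detected"
    to sound label sets; `position_sound_table` supplies the per-position
    (area designation) labels. The shared "background sound" label is
    always included.
    """
    if flag == 0:
        # Step S1203: moving-object-not-detected acoustic model.
        labels = set(model_table["not_detected"])
    else:
        # Step S1201: moving-object-detected acoustic model ...
        labels = set(model_table["detected"])
        # Step S1202: ... plus the model for the detected position, if any.
        labels |= set(position_sound_table.get(detected_position, ()))
    labels.add("background sound")
    return labels
```

Sound recognition at step S302 is then performed using only the acoustic models whose labels were selected, narrowing the candidate set per the detection state.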
Also, although the types of sounds to serve as sound recognition targets are assumed in advance and acoustic models that can be used are prepared beforehand in the above example, the present invention is not limited thereto. For example, a configuration may be adopted in which background sounds in the usage environment of the sound detection apparatus are recorded in association with moving object detection positions, and background sound models associated with the moving object detection positions are created from the background sounds.
In
Reference numeral 1601 denotes a background sound model creation unit that, at the time of learning (recording) background sounds, sorts and records background sound data as moving-object-not-detected background sound data 1602, moving-object-detected background sound data 1603 or corresponding area-specific background sound data 1604 in accordance with the state of moving object detection. In other words, the background sound model creation unit 1601 also functions as a background sound recording unit. When learning of background sounds has ended, the background sound model creation unit 1601 creates a moving-object-not-detected background sound model 1605, a moving-object-detected background sound model 1606 and a corresponding area-specific background sound model 1607 from the respective background sounds. Note that the corresponding area-specific background sound model 1607 is created for each specific area of position/sound correspondence information registered in the position/sound correspondence information control table.
At step S1701, it is determined whether learning of background sounds has ended. While learning is ongoing, that is, in the case where learning of background sounds has not ended (NO at step S1701), the processing proceeds to step S1702, and background sound data continues to be recorded. On the other hand, if learning of background sounds has ended (YES at step S1701), the processing proceeds to step S1709, and the processing is ended after creating a series of background sound models.
At step S1702, the sound input unit 101 inputs sounds for a predetermined time. Next, at step S1703, the background sound model creation unit 1601 determines whether the moving object detection flag is 1. If it is determined that the moving object detection flag is 0 (NO at step S1703), the processing proceeds to step S1708, and the input sounds are added to the moving-object-not-detected background sound data 1602. The example in
On the other hand, if, in step S1703, it is determined that the moving object detection flag is 1 (YES at step S1703), the processing proceeds to step S1704, and the input sounds are added to the moving-object-detected background sound data 1603. The examples in
Next, at step S1705, the position/sound correspondence information management unit 107 searches the position/sound correspondence information management table. At step S1706, the position/sound correspondence information management unit 107 determines whether there is position/sound correspondence information corresponding to the moving object detection position. If it is determined that there is position/sound correspondence information (YES at step S1706), the processing proceeds to step S1707, and the background sound model creation unit 1601 adds the sounds corresponding to that area to the corresponding area-specific background sound data 1604. The example in
On the other hand, if, at step S1701, background sound learning has ended (YES at step S1701), the processing proceeds to step S1709, and the background sound model creation unit 1601 creates a moving-object-not-detected background sound model. Next, at step S1710, the background sound model creation unit 1601 creates a moving-object-detected background sound model. Next, at step S1711, the background sound model creation unit 1601 creates a corresponding area-specific background sound model. Finally, at step S1712, the position/sound correspondence information management unit 107 records the association of these background sound models and positions.
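The routing of recorded background sound chunks (steps S1702 to S1708) can be sketched as below; the bucket names mirror the data 1602 to 1604 above, while the function signature and `data` layout are assumptions of this sketch.

```python
def record_background_sound(sound, flag, detected_position,
                            position_sound_table, data):
    """Sketch of steps S1702-S1708: route each recorded background sound
    chunk into the dataset matching the moving object detection state.

    `data` holds three buckets: "not_detected" (moving-object-not-detected
    background sound data 1602), "detected" (1603), and per-area data (1604)
    keyed by position.
    """
    if flag == 0:
        data["not_detected"].append(sound)          # step S1708
    else:
        data["detected"].append(sound)              # step S1704
        # Steps S1705-S1707: also file it under the corresponding area.
        if detected_position in position_sound_table:
            data.setdefault("area", {}).setdefault(
                detected_position, []).append(sound)
    return data
```

When learning ends, a background sound model is created from each bucket (steps S1709 to S1711), and the models are associated with positions (step S1712).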
At step S1801, sounds compiled for learning are input. At step S1802, feature amounts are extracted from the input sounds. At step S1803, a model is learned. At step S1804, the model is output.
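These four steps can be sketched as below. A real background sound model would be a statistical model such as a GMM learned from the feature amounts; in this deliberately simplified sketch the "model" is merely the mean feature vector, and `extract_features` is a caller-supplied stand-in for the feature extraction of step S1802.

```python
def learn_background_model(sound_clips, extract_features):
    """Sketch of steps S1801-S1804: extract feature amounts from the compiled
    sounds (S1802) and learn a model (S1803). Here the "model" is just the
    mean feature vector of the clips, returned as the output of S1804."""
    features = [extract_features(clip) for clip in sound_clips]  # step S1802
    n = len(features)
    dims = len(features[0])
    # Step S1803: "learn" by averaging each feature dimension.
    return [sum(f[d] for f in features) / n for d in range(dims)]
```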
Acoustic models to serve as sound detection targets are created beforehand as specific sounds from sound data collected in advance. Although normal background sound models are often created by collecting noises assumed in advance, there are also background sound models that are recreated by collecting noises on site.
In the present embodiment, sounds (noises) that should not be detected can be effectively selected by sorting background sounds according to the state of moving object detection, and switching background sound models according to the state of moving object detection.
Since sound detection processing in the case of using these background sound models further adds only processing for selecting a background sound model at the time of processing for selecting/adding acoustic models at steps S1201, S1202 and S1203 of
Note that in the above example the moving-object-detected background sound model also includes the sounds for the case where an area is designated. Although the sound in
Although the position/sound correspondence information control table is automatically created by recognizing objects from the image capturing screen in the above exemplary processing for creating position/sound correspondence information, a configuration may be adopted in which a user creates position/sound correspondence information manually.
When a user starts creation of position/sound correspondence information, management information of the position/sound correspondence information registered in the position/sound correspondence information management unit 107 is displayed as a list at step S2201.
Next, at step S2202, the user performs an operation input. When the user selects an item “▾” of the sound label “smash” under “Moving Object Detection Area” in
At step S2203, it is determined whether the operation is an area type selection, that is, a selection of the item “▾” under “Moving Object Detection Area”. If an area type selection is not made (NO at step S2203), the processing proceeds to step S2210. On the other hand, if an area type selection is made (YES at step S2203), the processing proceeds to step S2204, and it is determined whether “Moving object not detected” was selected. If “Moving object not detected” was selected (YES at step S2204), the processing proceeds to step S2209, and the area designation of the sound label (in this case, “smash”) is set to “Moving object not detected”.
On the other hand, if, in step S2204, “Moving object not detected” was not selected (NO at step S2204), the processing proceeds to step S2205, and it is determined whether “Area designation . . . ” was selected. If “Area designation . . . ” was not selected (NO at step S2205), the processing proceeds to step S2208, and the area designation of the sound label is set as “Moving object detected”.
On the other hand, if “Area designation . . . ” is selected (YES at step S2205), the processing proceeds to step S2206, where an image capturing screen is presented to the user, who is prompted to designate a target area with a drag operation, and the designated area is input.
This processing is repeated until the user performs an operation input that is determined to instruct the end of association at step S2210. In other words, if there is no operation input by the user that is determined to instruct the end of association (NO at step S2210), the processing returns to step S2202, and if there is an operation input by the user that is determined to instruct the end of association (YES at step S2210), the processing is ended.
As described above, according to the present embodiment, images are captured from an image capturing unit together with sounds being input by a sound input unit, and a specific sound is detected from input sounds using captured images. In particular, using the association between a specific position in an image and sounds, a threshold for detecting a sound that could occur at that position is lowered when a moving object is detected, allowing the sound to be detected. In other words, in cases other than when a moving object is detected, false detection of sounds in a scene in which there is no movement can be reduced by keeping the thresholds high and making it unlikely that unwanted sounds will be detected. This also enables false detection of sounds other than sounds that readily occur at a specific position to be reduced in a scene in which there is movement.
Alternatively, by performing detection after raising the thresholds of all sounds in the case where a moving object is not detected, and performing detection after raising the thresholds for detecting all sounds other than sounds that could occur at that position in the case where a moving object is detected, false detection of sounds in a scene in which there is no movement can be reduced. This also enables false detection of sounds other than sounds that readily occur at a specific position to be reduced in a scene in which there is movement.
Alternatively, detection can be facilitated by changing the acoustic model used in sound recognition in a case where a moving object is detected or where a moving object is not detected, and, moreover, by lowering the thresholds of sounds that could occur at the position at which a moving object is detected.
Alternatively, by learning background sound models to be used in sound recognition, and changing the background sound model that is applied, in a case where a moving object is detected or where a moving object is not detected, the possibility of a sound other than a specific sound assumed in advance being falsely recognized as the specific sound can be reduced.
Aspects of the present invention can also be realized by a computer of a system or apparatus (or devices such as a CPU or MPU) that reads out and executes a program recorded on a memory device to perform the functions of the above-described embodiment(s), and by a method, the steps of which are performed by a computer of a system or apparatus by, for example, reading out and executing a program recorded on a memory device to perform the functions of the above-described embodiment(s). For this purpose, the program is provided to the computer for example via a network or from a recording medium of various types serving as the memory device (e.g., computer-readable medium).
While the present invention has been described with reference to exemplary embodiments, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. The scope of the following claims is to be accorded the broadest interpretation so as to encompass all such variations and equivalent structures and functions.
This application claims the benefit of Japanese Patent Application Nos. 2011-119710, filed May 27, 2011, and 2012-101677, filed Apr. 26, 2012, which are hereby incorporated by reference herein in their entirety.