The present disclosure relates to a sound collection device, a sound collection method, and a program for collecting a target sound.
JP 2012-216998 A discloses a signal processing device that performs noise reduction processing on sound collection signals obtained from a plurality of microphones. This signal processing device detects a speaker based on imaged data of a camera, and specifies a relative direction of the speaker with respect to a plurality of speakers. Moreover, this signal processing device specifies a direction of a noise source from a noise level included in an amplitude spectrum of a sound collection signal. The signal processing device performs noise reduction processing when the relative direction of the speaker and the direction of the noise source match. This effectively reduces a disturbance signal.
The present disclosure provides a sound collection device, a sound collection method, and a program that improve the accuracy of collecting a target sound.
According to one aspect of the present disclosure, there is provided a sound collection device that collects a sound while suppressing noise, the sound collection device including: a storage that stores first data indicating a feature amount of an image of an object that indicates a noise source or a target sound source; and a control circuit that specifies a direction of the noise source by performing a first collation of collating image data generated by a camera with the first data, and performs signal processing on an acoustic signal outputted from a microphone array so as to suppress a sound arriving from the specified direction of the noise source.
These general and specific aspects may be implemented by systems, methods, and computer programs, and combinations thereof.
According to the sound collection device, the sound collection method, and the program of the present disclosure, the direction in which the sound is suppressed is determined by collating the image data obtained from the camera with the feature amount of the image of the object that indicates the noise source or the target sound source. Therefore, the noise can be accurately suppressed. This improves the accuracy of collecting the target sound.
(Findings that Form the Basis of Present Disclosure)
The signal processing device of JP 2012-216998 A specifies the direction of the noise source from the noise level included in the amplitude spectrum of the sound collection signal. However, it is difficult to accurately specify the direction of the noise source only by the noise level. A sound collection device of the present disclosure collates at least any one of image data acquired from a camera and an acoustic signal acquired from a microphone array with data indicating a feature amount of a noise source or a target sound source to specify a direction of the noise source. As a result, the direction of the noise source can be accurately specified, and the noise arriving from the specified direction can be suppressed by signal processing. By accurately suppressing the noise, the accuracy of collecting the target sound is improved.
Hereinafter, embodiments will be described with reference to the drawings. In the present embodiment, an example in which a human voice is collected as a target sound will be described.
1. Configuration of Sound Collection Device
The camera 10 includes an image sensor such as a CCD image sensor, a CMOS image sensor, or an NMOS image sensor. The camera 10 generates and outputs image data which is an image signal.
The microphone array 20 includes a plurality of microphones. The microphone array 20 receives a sound wave, converts it into an acoustic signal which is an electric signal, and outputs the acoustic signal.
The control circuit 30 estimates a target sound source direction and a noise source direction based on the image data obtained from the camera 10 and the acoustic signal obtained from the microphone array 20. The target sound source direction is a direction in which a target sound source that emits a target sound is present. The noise source direction is a direction in which a noise source that emits noise is present. The control circuit 30 fetches the target sound from the acoustic signal output from the microphone array 20 by performing signal processing so as to emphasize the sound arriving from the target sound source direction and suppress the sound arriving from the noise source direction. The control circuit 30 can be implemented by a semiconductor element or the like. The control circuit 30 can be configured by, for example, a microcomputer, CPU, MPU, DSP, FPGA, or ASIC.
The storage 40 stores noise source data indicating a feature amount of the noise source. The image data obtained from the camera 10 and the acoustic signal obtained from the microphone array 20 may be stored in the storage 40. The storage 40 can be implemented by, for example, a hard disk (HDD), SSD, RAM, DRAM, a ferroelectric memory, a flash memory, a magnetic disk, or a combination thereof.
The input/output interface circuit 50 includes a circuit that communicates with an external device according to a predetermined communication standard. The predetermined communication standard includes, for example, LAN, Wi-Fi®, Bluetooth®, USB, and HDMI®.
The bus 60 is a signal line that electrically connects the camera 10, the microphone array 20, the control circuit 30, the storage 40, and the input/output interface circuit 50.
When the control circuit 30 acquires image data from the camera 10 or fetches it from the storage 40, the control circuit 30 corresponds to an input device for the image data. When the control circuit 30 acquires the acoustic signal from the microphone array 20 or fetches it from the storage 40, the control circuit 30 corresponds to an input device of the acoustic signal.
The control circuit 30 performs, as its function, a target sound source direction estimation operation 31, a noise source direction estimation operation 32, and a beam forming operation 33.
The target sound source direction estimation operation 31 estimates the target sound source direction. The target sound source direction estimation operation 31 includes a target object detection operation 31a, a sound source detection operation 31b, and a target sound source direction determination operation 31c.
The target object detection operation 31a detects a target object from image data v generated by the camera 10. The target object is an object that is a target sound source. The target object detection operation 31a detects, for example, a human face as a target object. Specifically, the target object detection operation 31a calculates a probability P(θt, φt|v) that a target object is included in each image in a plurality of determination regions r(θt, φt) in the image data v, wherein the image data v corresponds to one frame of a video or one still image. The determination regions r(θt, φt) will be described later.
The sound source detection operation 31b detects a sound source from an acoustic signal s obtained from the microphone array 20. Specifically, the sound source detection operation 31b calculates a probability P(θt, φt|s) that the sound source is present in a direction specified by a horizontal angle θt and a vertical angle φt with respect to the sound collection device 1.
The target sound source direction determination operation 31c determines the target sound source direction based on the probability P(θt, φt|v) that the image is the target object and the probability P(θt, φt|s) of the presence of the sound source. The target sound source direction is indicated by, for example, the horizontal angle θt and the vertical angle φt with respect to the sound collection device 1.
The noise source direction estimation operation 32 estimates the noise source direction. The noise source direction estimation operation 32 includes a non-target object detection operation 32a, a noise detection operation 32b, and a noise source direction determination operation 32c.
The non-target object detection operation 32a detects a non-target object from the image data v generated by the camera 10. Specifically, the non-target object detection operation 32a determines whether or not a non-target object is included in each image in a plurality of determination regions r(θn, φn) in the image data v, wherein the image data v corresponds to one frame of a video or one still image. The non-target object is an object that is a noise source. For example, when the sound collection device 1 is used in a conference room, the non-target objects are a door of the conference room, a projector in the conference room, and the like. For example, when the sound collection device 1 is used outdoors, the non-target object is a moving object that emits a sound, such as an ambulance.
The noise detection operation 32b detects noise from the acoustic signal s output by the microphone array 20. In the present specification, noise is also referred to as a non-target sound. Specifically, the noise detection operation 32b determines whether or not the sound arriving from the direction specified by a horizontal angle θn and a vertical angle φn is noise. The noise is, for example, a sound of opening and closing a door, a sound of a fan of a projector, and a siren sound of an ambulance.
The noise source direction determination operation 32c determines the noise source direction based on the determination result of the non-target object detection operation 32a and the determination result of the noise detection operation 32b. For example, when the non-target object detection operation 32a detects a non-target object and the noise detection operation 32b detects noise, the noise source direction is determined based on the detected position or direction. The noise source direction is indicated by, for example, the horizontal angle θn and the vertical angle φn with respect to the sound collection device 1.
The beam forming operation 33 fetches the target sound from the acoustic signal s by performing signal processing on the acoustic signal s output by the microphone array 20 so as to emphasize the sound arriving from the target sound source direction and suppress the sound arriving from the noise source direction. As a result, a clear voice with reduced noise can be collected.
The storage 40 stores noise source data 41 indicating the feature amount of the noise source. The noise source data 41 may include one noise source or a plurality of noise sources. For example, the noise source data 41 may include cars, doors, and projectors as noise sources. The noise source data 41 includes non-target object data 41a and noise data 41b which is non-target sound data.
The non-target object data 41a includes an image feature amount of the non-target object that is a noise source. The non-target object data 41a is, for example, a database including the image feature amount of the non-target object. The image feature amount is, for example, at least one of a wavelet feature amount, a Haar-like feature amount, a HOG (Histograms of Oriented Gradients) feature amount, an EOH (Edge of Oriented Histograms) feature amount, an Edgelet feature amount, a Joint Haar-like feature amount, a Joint HOG feature amount, a sparse feature amount, a Shapelet feature amount, and a co-occurrence probability feature amount. The non-target object detection operation 32a detects the non-target object by collating the feature amount fetched from the image data v with the non-target object data 41a, for example.
The noise data 41b includes an acoustic feature amount of noise output by the noise source. The noise data 41b is, for example, a database including the acoustic feature amount of noise. The acoustic feature amount is, for example, at least one of MFCC (Mel-Frequency Cepstral Coefficient) and i-vector. The noise detection operation 32b detects noise, for example, by collating a feature amount fetched from the acoustic signal s with the noise data 41b.
2. Operation of Sound Collection Device
2.1 Overview of Signal Processing
2.2 Overall Operation of Sound Collection Device
The noise source direction estimation operation 32 estimates the noise source direction (S1). The target sound source direction estimation operation 31 estimates the target sound source direction (S2). The beam forming operation 33 performs beam forming processing based on the estimated noise source direction and the target sound source direction (S3). Specifically, the beam forming operation 33 performs signal processing on the acoustic signal output from the microphone array 20, so as to suppress the sound arriving from the noise source direction and emphasize the sound arriving from the target sound source direction. The order of the estimation of the noise source direction shown in Step S1 and the estimation of the target sound source direction shown in Step S2 may be reversed.
2.3 Estimation of Noise Source Direction
The estimation of the noise source direction will be described with reference to
The non-target object detection operation 32a detects the non-target object from the image data v generated by the camera 10 (S11). Specifically, the non-target object detection operation 32a determines whether or not the image in the determination region r(θn, φn) in the image data v is the non-target object. The noise detection operation 32b detects noise from the acoustic signal s output from the microphone array 20 (S12). Specifically, the noise detection operation 32b determines, from the acoustic signal s, whether or not the sound arriving from the direction of the horizontal angle θn and the vertical angle φn is noise. The noise source direction determination operation 32c determines a noise source direction (θn, φn) based on the detection result of the non-target object and the noise (S13).
The non-target object detection operation 32a collates the fetched image feature amount with the non-target object data 41a to calculate a similarity P(θn, φn|v) with the non-target object (S113). The similarity P(θn, φn|v) is the probability that the image in the determination region r(θn, φn) is a non-target object, that is, the accuracy indicating likeness of a non-target object. The method of detecting a non-target object is freely selectable. For example, the non-target object detection operation 32a calculates the similarity by template matching between the fetched image feature amount and the non-target object data 41a.
The non-target object detection operation 32a determines whether or not the similarity is equal to or more than a predetermined value (S114). If the similarity is equal to or more than the predetermined value, it is determined that the image in the determination region r(θn, φn) is a non-target object (S115). If the similarity is lower than the predetermined value, it is determined that the image in the determination region r(θn, φn) is not a non-target object (S116).
The non-target object detection operation 32a determines whether or not the determinations in all the determination regions r(θn, φn) in the image data v have been completed (S117). If there is a determination region r(θn, φn) for which determination has not been made, the process returns to Step S112. When the determinations for all the determination regions r(θn, φn) are completed, the process shown in
The noise detection operation 32b collates the fetched acoustic feature amount with the noise data 41b to calculate a similarity P(θn, φn|s) with noise (S123). The similarity P(θn, φn|s) is the probability that the sound arriving from the direction of the determination region r(θn, φn) is noise, that is, the accuracy indicating likeness of noise. The method of detecting noise is freely selectable. For example, the noise detection operation 32b calculates the similarity by template matching between the fetched acoustic feature amount and the noise data 41b.
The noise detection operation 32b determines whether or not the similarity is equal to or more than a predetermined value (S124). If the similarity is equal to or more than the predetermined value, it is determined that the sound arriving from the direction of the determination region r(θn, φn) is noise (S125). If the similarity is lower than the predetermined value, it is determined that the sound arriving from the direction of the determination region r(θn, φn) is not noise (S126).
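The threshold decisions in Steps S114 to S116 and S124 to S126, together with the rule that a direction is treated as a noise source direction when both a non-target object and noise are detected there, can be sketched as follows. This is a minimal illustration only: the function names, the dictionary layout keyed by (θn, φn), and the threshold value 0.5 are hypothetical and are not specified in the present disclosure.

```python
def detect(similarity, threshold):
    """Return True for each region whose similarity is equal to or more than the threshold."""
    return {region: p >= threshold for region, p in similarity.items()}


def noise_source_directions(image_sim, sound_sim, image_thr=0.5, sound_thr=0.5):
    """List the regions (theta_n, phi_n) judged to contain a noise source.

    image_sim: {(theta_n, phi_n): P(theta_n, phi_n | v)}
    sound_sim: {(theta_n, phi_n): P(theta_n, phi_n | s)}
    The thresholds are hypothetical "predetermined values".
    """
    is_object = detect(image_sim, image_thr)   # S114 to S116
    is_noise = detect(sound_sim, sound_thr)    # S124 to S126
    # A region is a noise source direction only when both detections agree.
    return [r for r in image_sim if is_object[r] and is_noise.get(r, False)]
```

For example, with image similarities of 0.9, 0.2, and 0.8 and acoustic similarities of 0.7, 0.9, and 0.3 for three regions, only the first region satisfies both conditions.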
The noise detection operation 32b determines whether or not the determinations in all the determination regions r(θn, φn) have been completed (S127). If there is a determination region r(θn, φn) for which determination has not been made, the process returns to Step S121. When the determinations for all the determination regions r(θn, φn) are completed, the process shown in
The noise detection operation 32b delays the output of the microphone 20i by a delay amount corresponding to the distance d sin θ, and then an adder 321 adds the acoustic signals output from the microphones 20i and 20j. At the input of the adder 321, the phases of the signals arriving from the θ direction match, and hence, at the output of the adder 321, the signals arriving from the θ direction are emphasized. On the other hand, signals arriving from directions other than θ do not have the same phase as each other, and thus are not emphasized as much as the signals arriving from θ. Therefore, for example, by using the output of the adder 321, directivity is formed in the θ direction.
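The delay-and-add operation described above can be illustrated with a short sketch for two microphones. The function names, the integer-sample delay, and the sound velocity of 343 m/s are assumptions made for illustration; an actual implementation may use fractional delays or frequency-domain processing.

```python
import math


def steering_delay(d, theta_rad, fs, c=343.0):
    """Delay in whole samples for a path difference of d * sin(theta).

    d: microphone spacing in meters, fs: sampling frequency in Hz,
    c: assumed sound velocity in m/s.
    """
    return round(d * math.sin(theta_rad) * fs / c)


def delay_and_sum(sig_i, sig_j, delay_samples):
    """Delay the output of microphone i, then add the output of microphone j."""
    delayed = [0.0] * delay_samples + list(sig_i[:len(sig_i) - delay_samples])
    return [a + b for a, b in zip(delayed, sig_j)]
```

When the delay matches the arrival time difference of a sound, the two signals are added in phase and the sum has twice the amplitude of a single microphone output. For other delays the sum is smaller, which is how directivity is formed.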
In the example of
The noise source direction determination operation 32c determines whether or not the determinations in all the determination regions r(θn, φn) have been completed (S134). If there is a determination region r(θn, φn) for which determination has not been made, the process returns to Step S131. When the determinations for all the determination regions r(θn, φn) are completed, the process shown in
2.4 Estimation of Target Sound Source Direction
The estimation of the target sound source direction will be described with reference to
The target object detection operation 31a detects the target object based on the image data v generated by the camera 10 (S21). Specifically, the target object detection operation 31a calculates the probability P(θt, φt|v) that the image in the determination region r(θt, φt) is the target object in the image data v. The method of detecting a target object is freely selectable. As an example, the detection of the target object is performed by determining whether or not each determination region r(θt, φt) matches the feature of a face that is a target object (see “Rapid Object Detection using a Boosted Cascade of Simple Features” ACCEPTED CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION 2001).
The sound source detection operation 31b detects the sound source based on the acoustic signal s output from the microphone array 20 (S22). Specifically, the sound source detection operation 31b calculates the probability P(θt, φt|s) that the sound source is present in the direction specified by the horizontal angle θt and the vertical angle φt. The method of detecting a sound source is freely selectable. For example, the sound source can be detected using a CSP (Cross-Power Spectrum Phase Analysis) method or a MUSIC (Multiple Signal Classification) method.
The target sound source direction determination operation 31c determines a target sound source direction (θt, φt) based on the probability P(θt, φt|v) that the image is the target object, calculated from the image data v, and the probability P(θt, φt|s) that the sound source is present, calculated from the acoustic signal s (S23).
An example of the face specification method in Step S21 will be described.
The size of the region r(θt, φt) at the time of detecting a face may be constant or variable. For example, the size of the region r(θt, φt) at the time of detecting a face may change for each image data v for one frame of a video or one still image.
When the target object detection operation 31a has determined whether or not the region r(θt, φt) is a face for all the regions r(θt, φt) in the image data v, the target object detection operation 31a calculates the probability P(θt, φt|v) that the image at the position specified by the horizontal angle θt and the vertical angle φt in the image data v is a face by the following Expression (1).
The CSP method, which is an example of the method of detecting a sound source in Step S22, will be described.
The sound source detection operation 31b calculates a probability P(θt|s) that the sound source is present at the horizontal angle θt by the following expression (2) using the CSP coefficient.
P(θt|s)=CSP(τ) (2)
Here, the CSP coefficient can be obtained by Expression (3) below (see IEICE Transactions D-II Vol.J83-D-II No.8 pp.1713-1721, “Localization of Multiple Sound Sources Based on CSP Analysis with a Microphone Array”). In Expression (3), n represents time, Si(n) represents an acoustic signal received by the microphone 20i, and Sj(n) represents an acoustic signal received by the microphone 20j. In Expression (3), DFT represents a discrete Fourier transform. Further, * indicates a conjugate complex number.
The time difference τ can be expressed by Expression (4) below using a sound velocity c, the distance d between the microphones 20i and 20j, and a sampling frequency Fs.
Therefore, by converting the CSP coefficient of Expression (2) from the time axis to the direction axis as shown in Expression (5) below, the probability P(θt|s) that the sound source is present at the horizontal angle θt can be calculated.
A probability P(φt|s) that the sound source is present at the vertical angle φt can be calculated from the CSP coefficient and the time difference τ, similarly to the probability P(θt|s) at the horizontal angle θt. Further, the probability P(θt, φt|s) can be calculated based on the probability P(θt|s) and the probability P(φt|s).
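As an illustration of the CSP coefficient, the following sketch computes it for two microphone signals using the commonly used definition: the inverse DFT of the cross-power spectrum normalized by its magnitude. The function names and the naive DFT, chosen so that the sketch needs no external library, are assumptions. The conversion of the resulting lag to an angle follows Expressions (4) and (5) and is not included in the sketch.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
            for m in range(n)]


def idft(spec):
    """Naive inverse discrete Fourier transform."""
    n = len(spec)
    return [sum(spec[m] * cmath.exp(2j * math.pi * m * k / n) for m in range(n)) / n
            for k in range(n)]


def csp(s_i, s_j, eps=1e-12):
    """CSP coefficients: inverse DFT of the magnitude-normalized cross spectrum."""
    cross = [a * b.conjugate() for a, b in zip(dft(s_i), dft(s_j))]
    whitened = [c / (abs(c) + eps) for c in cross]  # keep the phase only
    return [v.real for v in idft(whitened)]


def peak_lag(coeffs):
    """Signed lag (in samples) of the largest CSP coefficient.

    With this sign convention, a negative lag means s_j is a delayed copy of s_i.
    """
    n = len(coeffs)
    idx = max(range(n), key=lambda t: coeffs[t])
    return idx if idx <= n // 2 else idx - n
```

The lag of the peak corresponds to the time difference τ between the two microphones, and the height of the peak indicates how strongly a single source dominates at that lag.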
P(θt, φt)=WvP(θt, φt|v)+WsP(θt, φt|s) (6)
Then, the target sound source direction determination operation 31c determines the horizontal angle θt and the vertical angle φt at which the probability P(θt, φt) is the maximum as the target sound source direction by Expression (7) below (S232).
(θ̂t, φ̂t)=argmax(P(θt, φt)) (7)
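Expressions (6) and (7) amount to a weighted sum of the two probabilities followed by a search for the maximum. A minimal sketch is shown below; the function names, the dictionary layout keyed by (θt, φt), and the default weights of 0.5 are hypothetical.

```python
def fuse(p_image, p_sound, w_v, w_s):
    """Expression (6): weighted sum of P(theta_t, phi_t | v) and P(theta_t, phi_t | s)."""
    return {d: w_v * p_image[d] + w_s * p_sound[d] for d in p_image}


def target_direction(p_image, p_sound, w_v=0.5, w_s=0.5):
    """Expression (7): the direction at which the fused probability is the maximum."""
    fused = fuse(p_image, p_sound, w_v, w_s)
    return max(fused, key=fused.get)
```

With equal weights, a direction that is strongly supported by the acoustic signal can be selected even if the image probability is moderate. Raising Wv shifts the decision toward the direction favored by the image data.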
The weight Wv for the probability P(θt, φt|v) of the target object shown in Expression (6) may be determined based on an image accuracy CMv indicating a certainty that the target object is included in the image data v, for example. Specifically, for example, the target sound source direction determination operation 31c sets the image accuracy CMv based on the image data v. For example, the target sound source direction determination operation 31c compares an average brightness Yave of the image data v with a recommended brightness (Ymin_base to Ymax_base). The recommended brightness has a range from the minimum recommended brightness (Ymin_base) to the maximum recommended brightness (Ymax_base). Information indicating the recommended brightness is stored in the storage 40 in advance. If the average brightness Yave is lower than the minimum recommended brightness, the target sound source direction determination operation 31c sets the image accuracy CMv to “CMv=Yave/Ymin_base”. If the average brightness Yave is higher than the maximum recommended brightness, the target sound source direction determination operation 31c sets the image accuracy CMv to “CMv=Ymax_base/Yave”. If the average brightness Yave is within the range of the recommended brightness, the target sound source direction determination operation 31c sets the image accuracy CMv to “CMv=1”. If the average brightness Yave is lower than the minimum recommended brightness Ymin_base or higher than the maximum recommended brightness Ymax_base, a face that is a target object may be erroneously detected. Therefore, when the average brightness Yave is within the range of the recommended brightness, the image accuracy CMv is set to the maximum value “1”, and the image accuracy CMv is lowered as the average brightness Yave is higher or lower than the recommended brightness. 
The target sound source direction determination operation 31c determines the weight Wv according to the image accuracy CMv by, for example, a monotonically increasing function.
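The setting of the image accuracy CMv from the average brightness can be written as a short function. The function names and the identity mapping used for the weight Wv are assumptions for illustration; any monotonically increasing function could be substituted.

```python
def image_accuracy(y_ave, y_min_base, y_max_base):
    """Image accuracy CMv from the average brightness Yave."""
    if y_ave < y_min_base:
        return y_ave / y_min_base      # too dark
    if y_ave > y_max_base:
        return y_max_base / y_ave      # too bright
    return 1.0                         # within the recommended brightness


def image_weight(cm_v):
    """Weight Wv: identity mapping, one hypothetical monotonically increasing function."""
    return cm_v
```

For a recommended brightness range of 80 to 200 (hypothetical values), an average brightness of 40 or 400 gives CMv=0.5, and an average brightness of 120 gives CMv=1.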
The weight Ws with respect to the probability P(θt, φt|s) of the sound source shown in Expression (6) may be determined based on, for example, an acoustic accuracy CMs indicating a certainty that a voice is included in the acoustic signal s. Specifically, the target sound source direction determination operation 31c calculates the acoustic accuracy CMs using a human voice GMM (Gaussian Mixture Model) and a non-voice GMM. The voice GMM and the non-voice GMM are generated by learning in advance. Information indicating the voice GMM and the non-voice GMM is stored in the storage 40. The target sound source direction determination operation 31c first calculates a likelihood Lv based on the voice GMM in the acoustic signal s. Next, the target sound source direction determination operation 31c calculates the likelihood Ln based on the non-voice GMM in the acoustic signal s. Then, the target sound source direction determination operation 31c sets the acoustic accuracy CMs to “CMs=Lv/Ln”. The target sound source direction determination operation 31c determines the weight Ws according to the acoustic accuracy CMs by, for example, a monotonically increasing function.
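The likelihood ratio CMs=Lv/Ln can be illustrated with one-dimensional GMMs evaluated on scalar feature values. The function names, the scalar features, the model parameters used in the example, and the mapping CMs/(1+CMs) for the weight Ws are all assumptions; a practical system would typically use multi-dimensional features such as MFCC.

```python
import math


def gmm_likelihood(x, components):
    """Likelihood of a scalar x under a 1-D GMM given as (weight, mean, variance) tuples."""
    return sum(w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
               for w, m, v in components)


def acoustic_accuracy(frames, voice_gmm, nonvoice_gmm):
    """CMs = Lv / Ln, accumulated over frames in the log domain for stability."""
    log_lv = sum(math.log(gmm_likelihood(x, voice_gmm)) for x in frames)
    log_ln = sum(math.log(gmm_likelihood(x, nonvoice_gmm)) for x in frames)
    return math.exp(log_lv - log_ln)


def sound_weight(cm_s):
    """Weight Ws: a hypothetical monotonically increasing mapping into (0, 1)."""
    return cm_s / (1.0 + cm_s)
```

A value of CMs greater than 1 indicates that the voice GMM explains the signal better than the non-voice GMM, which results in a larger weight Ws.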
2.5 Beam Forming Processing
The beam forming processing (S3) by a beam forming operation 33 after the noise source direction (θn, φn) and the target sound source direction (θt, φt) are determined will be described. The method of beam forming processing is freely selectable. As an example, the beam forming operation 33 uses a generalized sidelobe canceller (GSC) (see Technical Report of IEICE, No.DSP2001-108, ICD2001-113, IE2001-92, pp. 61-68, October, 2001. “Adaptive Target Tracking Algorithm for Two-Channel Microphone Array Using Generalized Sidelobe Cancellers”).
The beam forming operation 33 includes an operation of delay elements 33a and 33b, a beam steering operation 33c, a null steering operation 33d, and an operation of a subtractor 33e.
The delay element 33a corrects an arrival time difference for a target sound based on a delay amount ZDt according to the target sound source direction (θt, φt). Specifically, the delay element 33a corrects an arrival time difference between an input signal u2(n) input to the microphone 20j and an input signal u1(n) input to the microphone 20i.
The beam steering operation 33c generates an output signal d(n) based on the sum of the input signal u1(n) and the corrected input signal u2(n). At the input of the beam steering operation 33c, the phases of signal components arriving from the target sound source direction (θt, φt) match, and hence the signal components arriving from the target sound source direction (θt, φt) in the output signal d(n) are emphasized.
The delay element 33b corrects the arrival time difference regarding noise based on a delay amount ZDn according to the noise source direction (θn, φn). Specifically, the delay element 33b corrects the arrival time difference between the input signal u2(n) input to the microphone 20j and the input signal u1(n) input to the microphone 20i.
The null steering operation 33d includes an adaptive filter (ADF) 33f. The null steering operation 33d sets the sum of the input signal u1(n) and the corrected input signal u2(n) as an input signal x(n) of the adaptive filter 33f, and multiplies the input signal x(n) by the coefficient of the adaptive filter 33f to generate an output signal y(n). The coefficient of the adaptive filter 33f is updated so that the mean square error between the output signal d(n) of the beam steering operation 33c and the output signal y(n) of the null steering operation 33d, that is, the mean square of the output signal e(n) of the subtractor 33e, is minimized.
The subtractor 33e subtracts the output signal y(n) of the null steering operation 33d from the output signal d(n) of the beam steering operation 33c to generate the output signal e(n). At the input of the null steering operation 33d, the phases of the signal components arriving from the noise source direction (θn, φn) match, and hence the signal components arriving from the noise source direction (θn, φn) in the output signal e(n) output by the subtractor 33e are suppressed.
The beam forming operation 33 outputs the output signal e(n) of the subtractor 33e. The output signal e(n) of the beam forming operation 33 is a signal in which the target sound is emphasized and the noise is suppressed.
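The structure described above, namely a beam steering branch, a null steering branch with an adaptive filter, and a subtractor, can be sketched for two microphones as follows. The normalized LMS update, the number of taps, the step size, and the integer-sample delays are assumptions chosen for illustration; the present disclosure does not specify the update rule of the adaptive filter 33f.

```python
import random


def delay(signal, k):
    """Delay a signal by k >= 0 samples, padding the start with zeros."""
    return [0.0] * k + list(signal[:len(signal) - k])


def beamform(u1, u2, zd_t, zd_n, taps=4, mu=0.5, eps=1e-8):
    """Return (e, d): the subtractor output e(n) and the beam steering output d(n)."""
    d = [a + b for a, b in zip(u1, delay(u2, zd_t))]  # beam steering 33c
    x = [a + b for a, b in zip(u1, delay(u2, zd_n))]  # input x(n) of the adaptive filter 33f
    w = [0.0] * taps                                  # adaptive filter coefficients
    e = []
    for n in range(len(d)):
        window = [x[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        y = sum(wk * xk for wk, xk in zip(w, window))  # null steering output y(n)
        err = d[n] - y                                 # subtractor 33e
        norm = sum(xk * xk for xk in window) + eps
        # Normalized LMS step (assumed rule) that reduces the mean square of e(n).
        w = [wk + mu * err * xk / norm for wk, xk in zip(w, window)]
        e.append(err)
    return e, d


# Noise-only demonstration: the same noise reaches both microphones at once (ZDn = 0),
# while the beam is steered with a hypothetical delay of ZDt = 2 samples.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(2000)]
e, d = beamform(noise, noise, zd_t=2, zd_n=0)
```

In this noise-only demonstration, the adaptive filter learns to reproduce the noise component contained in d(n), so the energy of e(n) becomes very small after convergence. The behavior when a target sound is present at the same time depends on the signals and on the step size, and is not covered by this sketch.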
The present embodiment shows an example of executing the processing of emphasizing the target sound and suppressing the noise by using the beam steering operation 33c and the null steering operation 33d. However, the processing is not limited to this, and any processing may be employed as long as the target sound is emphasized and the noise is suppressed.
3. Effects and Supplements
The sound collection device 1 according to the present embodiment includes the input device, the storage 40, and the control circuit 30. The input device in the sound collection device 1 including the camera 10 and the microphone array 20 is the control circuit 30. The input device inputs (receives) the acoustic signal output from the microphone array 20 and the image data generated by the camera 10. The storage 40 stores the non-target object data 41a indicating the image feature amount of the non-target object that is the noise source and the noise data 41b indicating the acoustic feature amount of the noise output from the noise source. The control circuit 30 performs the first collation (S113) for collating the image data with the non-target object data 41a, and the second collation (S123) for collating the acoustic signal with the noise data 41b, thereby specifying the direction of the noise source (S133). The control circuit 30 performs the signal processing on the acoustic signal so as to suppress the sound arriving from the specified direction of the noise source (S3).
In this way, since the image data obtained from the camera 10 is collated with the non-target object data 41a, and the acoustic signal obtained from the microphone array 20 is collated with the noise data 41b, the direction of the noise source can be accurately specified. As a result, the noise can be accurately suppressed, so that the accuracy of collecting the target sound is improved.
The present embodiment differs from the first embodiment in determining whether or not there is a noise source in the direction of the determination region r(θn, φn). In the first embodiment, the non-target object detection operation 32a compares the similarity P(θn, φn|v) with the predetermined value to determine whether or not the image in the determination region r(θn, φn) is a non-target object. The noise detection operation 32b compares the similarity P(θn, φn|s) with the predetermined value to determine whether or not the sound arriving from the direction of the determination region r(θn, φn) is noise. The noise source direction determination operation 32c determines that there is a noise source in the direction of the determination region r(θn, φn) when the image is a non-target object and the sound is noise.
In the present embodiment, the non-target object detection operation 32a outputs the similarity P(θn, φn|v) with the non-target object. That is, Steps S114 to S116 shown in
In
P(θn, φn|v)+P(θn, φn|s) (8)
P(θn, φn|v)Wv×P(θn, φn|s)Ws (9)
P(θn, φn|v)Wv+P(θn, φn|s)Ws (10)
The noise source direction determination operation 32c determines whether or not the determinations in all the determination regions r(θn, φn) have been completed (S1304). If there is a determination region r(θn, φn) for which determination has not been made, the process returns to Step S1301. When the determinations for all the determination regions r(θn, φn) are completed, the process shown in
According to the present embodiment, as in the first embodiment, the noise source direction can be accurately specified.
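As a rough illustration of this determination, the following Python sketch combines the image similarity and the acoustic similarity for each determination region and compares the combined value with a threshold. The function names, the weights, the threshold, and the sample similarities are hypothetical and are not taken from the disclosure. The weighted product of expression (9) is read here as raising each similarity to the power of its weight, which is an interpretation; the plain product is included because it is listed in item (12) of the summary.

```python
def combine(p_v, p_s, method="weighted_sum", w_v=0.5, w_s=0.5):
    """Combine image similarity p_v and acoustic similarity p_s (both in [0, 1])."""
    if method == "product":
        return p_v * p_s
    if method == "sum":                 # expression (8)
        return p_v + p_s
    if method == "weighted_product":    # expression (9), weights read as exponents
        return (p_v ** w_v) * (p_s ** w_s)
    if method == "weighted_sum":        # expression (10)
        return w_v * p_v + w_s * p_s
    raise ValueError(f"unknown method: {method}")


def noise_source_directions(similarities, threshold, **kwargs):
    """Return the (theta, phi) directions whose combined value reaches the threshold.

    similarities maps a direction (theta, phi) to a pair (p_v, p_s).
    """
    return [
        direction
        for direction, (p_v, p_s) in similarities.items()
        if combine(p_v, p_s, **kwargs) >= threshold
    ]


# Hypothetical similarities for three determination regions (angles in degrees).
example = {
    (-30, 0): (0.9, 0.8),
    (0, 0): (0.2, 0.1),
    (30, 0): (0.7, 0.3),
}
directions = noise_source_directions(example, threshold=0.6, method="weighted_sum")
```

With these sample values, only the region whose image and sound both resemble the noise source reaches the threshold. A region with a high image similarity but a low acoustic similarity does not, which is the behavior a weighted sum with equal weights would be expected to give.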
The present embodiment differs from the first embodiment in data to be collated. In the first embodiment, the storage 40 stores the noise source data 41 indicating the feature amount of the noise source, and the noise source direction estimation operation 32 estimates the noise source direction using the noise source data 41. In the present embodiment, the storage 40 stores target sound source data indicating the feature amount of the target sound source, and the noise source direction estimation operation 32 estimates the noise source direction using the target sound source data.
According to the present embodiment, as in the first embodiment, the noise source direction can be accurately specified.
In the present embodiment, the target sound source data 42 may be used to specify the target sound source direction. For example, the target object detection operation 31a may detect a target object by collating the image data v with the target object data 42a. The sound source detection operation 31b may detect the target sound by collating the acoustic signal s with the target sound data 42b. In this case, the target sound source direction estimation operation 31 and the noise source direction estimation operation 32 may be integrated into one.
As described above, the first to third embodiments have been described as an example of the technology disclosed in the present application. However, the technology in the present disclosure is not limited to this, and is applicable to embodiments in which changes, replacements, additions, omissions, and the like are appropriately made. Further, each component described in the embodiments can be combined to make a new embodiment. Therefore, other embodiments are described below.
In the first embodiment, in Step S132 in
In the first embodiment, in Step S132 of
The non-target object detection operation 32a may specify the noise source direction based on the detection of the non-target object, and the noise detection operation 32b may specify the noise source direction based on the detection of the noise. In this case, the noise source direction determination operation 32c may determine whether or not to suppress the noise by the beam forming operation 33 based on whether or not the noise source direction specified by the non-target object detection operation 32a and the noise source direction specified by the noise detection operation 32b match. Alternatively, the noise source direction determination operation 32c may suppress the noise by the beam forming operation 33 when the noise source direction can be specified by either one of the non-target object detection operation 32a and the noise detection operation 32b.
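The two policies described above can be sketched as follows. This is a minimal illustration only: the function names, the policy labels, and the angular tolerance used to decide that two directions match are assumptions, and angle wrap-around is not handled.

```python
def directions_match(dir_a, dir_b, tolerance_deg=10.0):
    """True when two (theta, phi) directions differ by at most tolerance_deg per angle."""
    return (abs(dir_a[0] - dir_b[0]) <= tolerance_deg
            and abs(dir_a[1] - dir_b[1]) <= tolerance_deg)


def direction_to_suppress(image_dir, sound_dir, policy="both"):
    """Pick the noise source direction to suppress, or None for no suppression.

    image_dir and sound_dir are (theta, phi) tuples in degrees, or None when
    the corresponding detection found no noise source.
    """
    if policy == "both":
        # Suppress only when the image-based and sound-based directions agree.
        if (image_dir is not None and sound_dir is not None
                and directions_match(image_dir, sound_dir)):
            return image_dir  # arbitrary choice: report the image-based direction
        return None
    if policy == "either":
        # Suppress as soon as one of the two detections yields a direction.
        return image_dir if image_dir is not None else sound_dir
    raise ValueError(f"unknown policy: {policy}")
```

Requiring both detections to agree reduces false suppression at the cost of missing noise sources that only one modality can detect, while accepting either detection makes the opposite trade.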
In the above embodiment, the sound collection device 1 includes both the non-target object detection operation 32a and the noise detection operation 32b, but may include only one of them. That is, the noise source direction may be specified only from the image data, or the noise source direction may be specified only from the acoustic signal. In this case, the noise source direction determination operation 32c may be omitted.
In the above embodiment, the collation by the template matching has been described. Instead of this, collation by machine learning may be performed. For example, the non-target object detection operation 32a may use PCA (Principal Component Analysis), neural network, linear discriminant analysis (LDA), support vector machine (SVM), AdaBoost, Real AdaBoost, or the like. In this case, the non-target object data 41a may be a model obtained by learning the image feature amount of the non-target object. Similarly, the target object data 42a may be a model obtained by learning the image feature amount of the target object. The non-target object detection operation 32a may perform all or part of the processing corresponding to Steps S111 to S117 in
A sound source separation technique may be used in the determination of the target sound or the noise. For example, the target sound source direction determination operation 31c may separate the acoustic signal into a voice and a non-voice by the sound source separation technique, and make determination of the target sound or the noise based on the power ratio between the voice and the non-voice. For example, blind sound source separation (BSS) may be used as the sound source separation technique.
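The power-ratio determination can be sketched as below, assuming the sound source separation has already produced a voice component and a non-voice component for the same frame. The function names and the 0 dB threshold are hypothetical, and both inputs are assumed to be non-empty sequences of samples.

```python
import math


def mean_power(samples):
    """Mean squared amplitude of a non-empty sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def classify_by_power_ratio(voice, non_voice, threshold_db=0.0):
    """Label a frame 'target' when the voice-to-non-voice power ratio reaches threshold_db.

    Returns the label together with the ratio in decibels.
    """
    eps = 1e-12  # avoids division by zero and log of zero for silent inputs
    ratio_db = 10.0 * math.log10((mean_power(voice) + eps) / (mean_power(non_voice) + eps))
    label = "target" if ratio_db >= threshold_db else "noise"
    return label, ratio_db
```

For example, a voice component with ten times the amplitude of the non-voice component gives a ratio of about 20 dB and is labeled as the target sound under this assumed threshold.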
In the above embodiment, an example in which the beam forming operation 33 includes the adaptive filter 33f has been described, but the beam forming operation 33 may have the configuration indicated by the noise detection operation 32b in
In the above embodiment, the example in which the microphone array 20 includes the two microphones 20i and 20j has been described, but the microphone array 20 may include two or more microphones.
The noise source direction is not limited to one direction and may be a plurality of directions. The emphasis in the target sound direction and the suppression in the noise source direction are not limited to the above embodiment, and can be performed by any method.
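As one simple example of suppression in a noise source direction, the following sketch forms a blind spot with a fixed delay-and-subtract operation on two microphone signals. This is not the adaptive filter of the embodiment but a simpler technique substituted for illustration. The far-field plane-wave model, the speed of sound, the rounding of the delay to whole samples, and all names are assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def delay_in_samples(theta_deg, mic_spacing_m, sample_rate_hz):
    """Inter-microphone delay, in whole samples, of a plane wave from theta_deg.

    theta_deg is measured from the broadside of the two-microphone array;
    a positive angle means the wave reaches the first microphone earlier.
    """
    tau = mic_spacing_m * math.sin(math.radians(theta_deg)) / SPEED_OF_SOUND
    return round(tau * sample_rate_hz)


def suppress_direction(x_first, x_second, delay):
    """Delay-and-subtract: cancel a source that reaches x_second `delay` samples late.

    Only non-negative delays are handled in this sketch.
    """
    if delay < 0:
        raise ValueError("swap the microphone signals for negative delays")
    # Delay the first microphone signal so the noise lines up in both channels.
    delayed_first = [0.0] * delay + list(x_first[:len(x_first) - delay])
    # Subtracting the aligned channels cancels sound from the chosen direction.
    return [b - a for a, b in zip(delayed_first, x_second)]
```

Sound arriving from other directions has a different inter-microphone delay and is therefore not cancelled, although its frequency response is altered. A practical device would typically need fractional delays or an adaptive filter to track the noise source more precisely.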
In the above embodiment, the case where the horizontal angle θn and the vertical angle φn are determined as the noise source direction has been described, but when the noise source direction can be specified by at least any one of the horizontal angle θn and the vertical angle φn, at least any one of the horizontal angle θn and the vertical angle φn may be determined. Similarly for the target sound source direction, at least any one of the horizontal angle θt and the vertical angle φt may be determined.
The sound collection device 1 does not need to include one or both of the camera 10 and the microphone array 20. In this case, the sound collection device 1 is electrically connected to the external camera 10 or the external microphone array 20. For example, the sound collection device 1 may be an electronic device such as a smartphone including the camera 10, and electrically and mechanically connected to an external device including the microphone array 20. When the input/output interface circuit 50 inputs (receives) image data from the camera 10 externally attached to the sound collection device 1, the input/output interface circuit 50 corresponds to an input device for image data. When the input/output interface circuit 50 inputs (receives) an acoustic signal from the microphone array 20 externally attached to the sound collection device 1, the input/output interface circuit 50 corresponds to an input device for the acoustic signal.
In the above embodiment, an example of detecting a human face has been described, but in the case of collecting a human voice, the target object is not limited to a human face and may be any part that can be recognized as a person. For example, the target object may be a human body or a lip.
In the above embodiment, the human voice is collected as the target sound, but the target sound is not limited to the human voice. For example, the target sound may be a car sound or an animal bark.
(Summary of Embodiments)
(1) According to the present disclosure, there is provided a sound collection device that collects a sound while suppressing noise, the sound collection device including: a storage that stores first data indicating a feature amount of an image of an object that indicates a noise source or a target sound source; and a control circuit that specifies a direction of the noise source by performing a first collation of collating image data generated by a camera with the first data, and performs signal processing on an acoustic signal outputted from a microphone array so as to suppress a sound arriving from the specified direction of the noise source.
Since the direction of the noise source is specified by collating the image data with the first data indicating the feature amount of the image of the object that indicates the noise source or the target sound source, the direction of the noise source can be accurately specified. Since the noise arriving from the direction of the noise source that is accurately specified is suppressed, the accuracy of collecting the target sound is improved.
(2) In the sound collection device of the item (1), the storage may store second data indicating a feature amount of a sound output from the object, and the control circuit may specify the direction of the noise source by performing the first collation and a second collation of collating the acoustic signal with the second data.
Further, since the direction of the noise source is specified by collating the acoustic signal with the second data indicating the feature amount of the sound output from the object, the direction of the noise source can be accurately specified. Since the noise arriving from the direction of the noise source that is accurately specified is suppressed, the accuracy of collecting the target sound is improved.
(3) In the sound collection device of the item (1), the first data may indicate the feature amount of the image of the object that is the noise source, and the control circuit may perform the first collation, and when an object similar to the object is detected from the image data, the control circuit may specify a direction of the detected object as the direction of the noise source.
Thereby, a blind spot can be formed in advance before the noise source outputs the noise. Therefore, for example, a sudden sound generated from the noise source can be suppressed when collecting the target sound.
(4) In the sound collection device of the item (1), the first data may indicate the feature amount of the image of the object that is the target sound source, and the control circuit may perform the first collation, and when an object not similar to the object is detected from the image data, the control circuit may specify a direction of the detected object as the direction of the noise source.
Thereby, a blind spot can be formed in advance before the noise source outputs the noise.
(5) In the sound collection device of the item (3) or (4), the control circuit may divide the image data into a plurality of determination regions in the first collation, collate an image in each determination region with the first data, and specify the direction of the noise source based on a position of the determination region including the detected object in the image data.
(6) In the sound collection device of the item (2), the second data may indicate a feature amount of noise output from the noise source, and the control circuit may perform the second collation, and when a sound similar to the noise is detected from the acoustic signal, the control circuit may specify a direction in which the detected sound arrives as the direction of the noise source.
By collating with the feature amount of the noise, the direction of the noise source can be accurately specified.
(7) In the sound collection device of the item (2), the second data may indicate a feature amount of a target sound output from the target sound source, and the control circuit may perform the second collation, and when a sound not similar to the target sound is detected from the acoustic signal, the control circuit may specify a direction in which the detected sound arrives as the direction of the noise source.
(8) In the sound collection device of the item (6) or (7), the control circuit may collect the acoustic signal with directivity directed to each of a plurality of determination directions in the second collation, and collate the collected acoustic signal with the second data to specify a determination direction in which the sound is detected as the direction of the noise source.
(9) In the sound collection device of the item (2), when the control circuit has specified the direction of the noise source in any one of the first collation and the second collation, the control circuit may suppress the sound arriving from the direction of the noise source.
(10) In the sound collection device of the item (2), when the control circuit has specified the direction of the noise source in both of the first collation and the second collation, the control circuit may suppress the sound arriving from the direction of the noise source.
(11) In the sound collection device of the item (2), a first accuracy that the noise source is present may be calculated by the first collation, and a second accuracy that the noise source is present may be calculated by the second collation, and when a calculation value calculated based on the first accuracy and the second accuracy is equal to or more than a predetermined threshold value, the control circuit may suppress the sound arriving from the direction of the noise source.
(12) In the sound collection device of the item (11), the calculation value may be any one of a product of the first accuracy and the second accuracy, a sum of the first accuracy and the second accuracy, a weighted product of the first accuracy and the second accuracy, and a weighted sum of the first accuracy and the second accuracy.
(13) In the sound collection device according to any one of the items (1) to (12), the control circuit may determine a target sound source direction in which the target sound source is present based on the image data and the acoustic signal, and perform signal processing on the acoustic signal so as to emphasize a sound arriving from the target sound source direction.
(14) The sound collection device of the item (1) may include at least one of the camera and the microphone array.
(15) In the sound collection device of the item (1), the image data may be generated by an external camera, and the acoustic signal may be outputted from an external microphone array.
(16) The sound collection device of the item (1) may further include at least one of a first input device to receive the image data generated by an external camera; and a second input device to receive the acoustic signal outputted from an external microphone array.
(17) According to the present disclosure, there is provided a sound collection method of collecting a sound while suppressing noise by a control circuit, the sound collection method including: receiving image data generated by a camera; receiving an acoustic signal output from a microphone array; acquiring first data indicating a feature amount of an image of an object indicating a noise source or a target sound source; and specifying a direction of the noise source by performing a first collation of collating the image data with the first data, and performing signal processing on the acoustic signal so as to suppress a sound arriving from the specified direction of the noise source.
(18) According to the present disclosure, there is provided a non-transitory computer-readable storage medium storing a computer program to be executed by a control circuit of a sound collection device, the computer program causing the control circuit to execute: receiving image data generated by a camera; receiving an acoustic signal output from a microphone array; acquiring first data indicating a feature amount of an image of an object indicating a noise source or a target sound source; and specifying a direction of the noise source by performing a first collation of collating the image data with the first data, and performing signal processing on the acoustic signal so as to suppress a sound arriving from the specified direction of the noise source.
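The region-based collation of the item (5) relies on converting the position of a determination region in the image into a direction. A minimal sketch of that conversion is given below. It assumes non-overlapping determination regions, a direction of (0, 0) at the image centre, and an angle that varies linearly with pixel position across the camera's angle of view. The function names and the angle-of-view parameters are hypothetical.

```python
def region_centers(width, height, region_w, region_h):
    """Yield the centre pixel (x, y) of each determination region in raster order."""
    for top in range(0, height - region_h + 1, region_h):
        for left in range(0, width - region_w + 1, region_w):
            yield (left + region_w / 2.0, top + region_h / 2.0)


def pixel_to_direction(x, y, width, height, h_fov_deg, v_fov_deg):
    """Map a pixel position to (theta, phi) in degrees, with (0, 0) at the image centre.

    Assumes the angle varies linearly with pixel position, which is only an
    approximation of a real lens.
    """
    theta = (x / width - 0.5) * h_fov_deg
    phi = (0.5 - y / height) * v_fov_deg  # image y grows downward
    return theta, phi
```

Each region centre can be passed to `pixel_to_direction` to obtain the direction that is reported as the noise source direction when the image in that region is judged to be the detected object.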
The sound collection device and the sound collection method according to all claims of the present disclosure are implemented by cooperation with hardware resources, for example, a processor, a memory, and a program.
The sound collection device of the present disclosure is useful, for example, as a device that collects a voice of a person who is talking.
Number | Date | Country | Kind |
---|---|---|---|
2018-112160 | Jun 2018 | JP | national |
This is a continuation application of International Application No. PCT/JP2019/011503, with an international filing date of Mar. 19, 2019, which claims priority of Japanese Patent Application No. 2018-112160 filed on Jun. 12, 2018, the content of each of which is incorporated herein by reference.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/JP2019/011503 | Mar 2019 | US |
Child | 17116192 | | US |