The present disclosure relates to audio and video conferencing systems and methods for controlling a microphone array beam direction.
Generally, when a human voice is collected far from a microphone, undesired noise and reverberation components are relatively large compared to the voice, and the sound quality of the collected voice is markedly reduced. It is therefore desirable to suppress the noise and reverberation components and clearly collect only the voice.
In conventional sound collecting devices, a human voice is collected by detecting the direction of arrival of a sound acquired by a microphone and adjusting the beam forming focus direction. However, in conventional sound collecting devices, the beam forming focus direction is adjusted not only toward a human voice but also toward noise. Because of this, there is a risk that unnecessary noise is collected and that the human voice is collected only in fragments.
An object of a number of embodiments according to the present invention is to provide a sound collecting device, a sound emitting/collecting device, a signal processing method, and a medium that collect only the sound of a human voice by analyzing an input signal.
The sound collecting device is provided with a plurality of microphones, a beam forming unit that forms directivity by processing a collected sound signal of the plurality of microphones, a first echo canceller disposed on the front of the beam forming unit, and a second echo canceller disposed on the back of the beam forming unit.
The signal processing unit 15 operates on the sound acquired by the microphone array as described in detail below. Furthermore, the signal processing unit 15 processes the emitted sound signal input from the I/F 19. The speaker 70L or the speaker 70R emits the signal that has undergone signal processing in the signal processing unit 15. Note that the functions of the signal processing unit 15 can also be realized in a general information processing device, such as a personal computer. In this case, the information processing device realizes the functions of the signal processing unit 15 by reading and executing a program 151 stored in the memory 150, or a program stored on a recording medium such as a flash memory.
The first echo canceller 31 is installed on the back of the microphone 11, the first echo canceller 32 is installed on the back of the microphone 12, and the first echo canceller 33 is installed on the back of the microphone 13. The first echo cancellers carry out linear echo cancellation on the collected sound signal of each microphone, removing the echo that reaches each microphone from the speaker 70L or the speaker 70R. The echo canceling carried out by the first echo cancellers is made up of an FIR filter process and a subtraction process: the emitted sound signal (X) of the speaker 70L or the speaker 70R, which is input to the signal processing unit 15 from the interface (I/F) 19, is used to estimate an echo component (Y) with the FIR filter, and each estimated echo component is subtracted from the collected sound signal (D) input to the first echo canceller from its microphone, resulting in an echo-removed sound signal (E).
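For illustration, a minimal sketch of this FIR-plus-subtraction structure follows; the NLMS rule used to adapt the FIR coefficients is an assumption, since the disclosure specifies only an FIR filter process and a subtraction process, not how the filter is obtained.

```python
import numpy as np

def first_echo_canceller(x, d, num_taps=256, mu=0.5, eps=1e-8):
    """Sketch of a first echo canceller: estimate the echo component Y
    from the emitted sound signal X with an FIR filter, and subtract it
    from the collected sound signal D to obtain the echo-removed signal E.
    The NLMS coefficient update is an assumption for illustration."""
    x = np.asarray(x, dtype=float)
    d = np.asarray(d, dtype=float)
    w = np.zeros(num_taps)                 # FIR filter coefficients
    e = np.copy(d)                         # echo-removed sound signal E
    for n in range(num_taps, len(d)):
        x_n = x[n - num_taps:n][::-1]      # recent emitted samples, newest first
        y = w @ x_n                        # estimated echo component Y
        e[n] = d[n] - y                    # subtraction process
        w += mu * e[n] * x_n / (x_n @ x_n + eps)   # NLMS adaptation
    return e
```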
Continuing to refer to the accompanying figure, the DOA 60 receives sound information from, in this case, two of the echo cancellers, AEC 31 and 33, and operates to detect the direction of arrival of a voice. The DOA 60 detects the direction of arrival (θ) of the collected sound signal of the microphone 11 and the microphone 13 after the voice flag is input from the VAD 50, described below. Because the DOA 60 detects the direction of arrival (θ) only after the voice flag has been input, the value of the direction of arrival (θ) does not change even if noise other than that of a human voice occurs. The direction of arrival (θ) detected in the DOA 60 is input to the BF 20. The DOA 60 will be described in detail below.
The BF 20 carries out a beam forming process based on the input direction of arrival (θ) of sound. This beam forming process allows sound in the direction of arrival (θ) to be focused on. Therefore, because noise arriving from a direction other than the direction of arrival (θ) can be minimized, it is possible to selectively collect voice in the direction of arrival (θ). The BF 20 will be described in more detail later.
A second echo canceller 40, illustrated in the accompanying figure, is disposed on the back of the BF 20 and removes a residual echo component from the signal that has undergone beam forming.
Functional elements comprising the second echo canceller 40 are shown and described in more detail with reference to the accompanying figure. The estimated spectrum |R| of the remaining acoustic echo component is calculated as:
|R| = |BY| / ERLE^0.5, with ERLE = power(BD) / power(BE) (Equation 1),

where BD is the microphone signal after the BF, BE is the output of the first echo cancellers (AEC1) after the BF, and BY is the acoustic echo estimate after the BF.
The estimated spectrum of the remaining acoustic echo component |R| is removed from the input signal (BF microphone signal) by damping the spectrum amplitude by multiplication, and the degree of input signal damping is determined by the value of |R|. The larger the value of the calculated residual echo spectrum, the more damping is applied to the input signal (this relationship can be determined empirically). In this manner, the signal processing unit 15 of the present embodiment also removes a remaining echo component that could not be removed by the subtraction process.
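A minimal per-frame sketch of Equation 1 and the amplitude multiplication follows, assuming magnitude spectra as inputs; the exact mapping from |R| to the damping gain is an assumption, since the disclosure leaves that relationship to empirical tuning.

```python
import numpy as np

def second_echo_canceller(BD, BE, BY, floor=0.1, eps=1e-12):
    """Sketch of the second echo canceller: estimate the residual echo
    spectrum |R| per Equation 1 and damp the beamformed signal spectrum
    by multiplication, with more damping for larger |R|.
    BD: microphone magnitude spectrum after BF,
    BE: AEC1-output magnitude spectrum after BF,
    BY: acoustic echo estimate magnitude spectrum after BF.
    The spectral-subtraction-style gain with a floor is an assumption."""
    erle = np.sum(BD ** 2) / (np.sum(BE ** 2) + eps)   # ERLE = power(BD)/power(BE)
    R = BY / np.sqrt(erle)                             # Equation 1: |R| = |BY|/ERLE^0.5
    gain = np.clip(1.0 - R / (BE + eps), floor, 1.0)   # larger |R| -> more damping
    return BE * gain                                   # damped amplitude spectrum
```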
The frequency spectrum amplitude multiplication process is not carried out prior to beam forming because the phase information of the collected sound signal would be lost, making the beam forming process by the BF 20 difficult. It is also not carried out prior to beam forming in order to preserve the voice features described below (the harmonic power spectrum, power spectrum change rate, power spectrum flatness, formant intensity, harmonic intensity, power, first-order and second-order differences of power, cepstrum coefficient, and first-order and second-order differences of the cepstrum coefficient), so that voice activity detection by the VAD 50 remains possible. The signal processing unit 15 of the present embodiment therefore removes the echo component using the subtraction process, carries out the beam forming process in the BF 20, the voice determination in the VAD 50, and the detection of the direction of arrival in the DOA 60, and then carries out the frequency spectrum amplitude multiplication process on the signal that has undergone beam forming.
Next, the functions of the VAD 50 will be described in detail with reference to the accompanying figure. The VAD 50 determines whether the collected sound signal is a human voice by inputting various voice features into a neural network 57.
The zero-crossing rate 41 is the number of times the audio signal changes from a positive value to a negative value, or vice versa, in a given audio frame. The harmonic power spectrum 42 indicates the power of each harmonic component of the audio signal. The power spectrum change rate 43 indicates the rate of change of power with respect to the frequency component of the audio signal. The power spectrum flatness 44 indicates the degree of the swell of the frequency component of the audio signal. The formant intensity 45 indicates the intensity of the formant component included in the audio signal. The harmonic intensity 46 indicates the intensity of the frequency component of each harmonic included in the audio signal. The power 47 is the power of the audio signal. The first-order difference of power 48 is the difference from the previous power 47. The second-order difference of power 49 is the difference from the previous first-order difference of power 48. The cepstrum coefficient 51 is the logarithm of the discrete cosine transformed amplitude of the audio signal. The first-order difference 52 of the cepstrum coefficient is the difference from the previous cepstrum coefficient 51. The second-order difference 53 of the cepstrum coefficient is the difference from the previous first-order difference 52 of the cepstrum coefficient.
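For concreteness, a few of these features might be computed per frame as in the sketch below; the exact formulas (e.g., the geometric-to-arithmetic mean ratio used for flatness) are standard signal-processing definitions and are assumptions where the text gives only a verbal description.

```python
import numpy as np

def frame_features(frame, eps=1e-12):
    """Per-frame sketch of a subset of the voice features above."""
    frame = np.asarray(frame, dtype=float)
    # zero-crossing rate 41: sign changes per sample within the frame
    zcr = np.count_nonzero(np.diff(np.sign(frame))) / len(frame)
    spec = np.abs(np.fft.rfft(frame)) ** 2             # power spectrum
    # power spectrum flatness 44: geometric mean / arithmetic mean
    flatness = np.exp(np.mean(np.log(spec + eps))) / (np.mean(spec) + eps)
    power = np.mean(frame ** 2)                        # power 47
    return zcr, flatness, power
```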
It should be noted that when finding the cepstrum coefficient 51, the high frequency component of the audio signal can be emphasized by using a pre-emphasis filter. The audio signal may then be further processed by a mel filter bank and a discrete cosine transform to give the final coefficients. Finally, it should be understood that the voice features are not limited to the parameters described above, and any parameter that can discriminate a human voice from other sounds may be used.
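The cepstrum computation just described might look like the following sketch; the mel filter bank recipe and the constants (0.97 pre-emphasis coefficient, 26 filters, 13 coefficients) are common choices and are assumptions, not taken from the disclosure.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filter_bank(n_filters, n_fft, sr):
    """Triangular mel filter bank (standard recipe, an assumption)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return fb

def cepstrum_coefficients(frame, sr=16000, n_filters=26, n_coeffs=13):
    """Sketch: pre-emphasis, magnitude spectrum, mel filter bank
    compression, logarithm, then discrete cosine transform."""
    frame = np.asarray(frame, dtype=float)
    emphasized = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
    spectrum = np.abs(np.fft.rfft(emphasized))
    mel_energies = mel_filter_bank(n_filters, len(frame), sr) @ spectrum
    return dct(np.log(mel_energies + 1e-12), norm="ortho")[:n_coeffs]
```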
The neural network 57 is a method for deriving results from human judgment examples: each neuron coefficient is set so that, for a given input value, the output approaches the judgment result a person would derive. More specifically, the neural network 57 is a mathematical model made up of a known number of nodes and layers used to determine whether the current audio frame is a human voice or not. The value at each node is computed by multiplying the values of the nodes in the previous layer with weights and adding a bias. These weights and biases are obtained beforehand for every layer of the neural network by training it with a set of known examples of speech and noise files.
The neural network 57 outputs a predetermined value based on an input value by receiving the values of the various voice features (zero-crossing rate 41, harmonic power spectrum 42, power spectrum change rate 43, power spectrum flatness 44, formant intensity 45, harmonic intensity 46, power 47, first-order difference of power 48, second-order difference of power 49, cepstrum coefficient 51, first-order difference 52 of the cepstrum coefficient, and second-order difference 53 of the cepstrum coefficient) at its input neurons. In its final two neurons, the neural network 57 outputs a first parameter value, indicating a human voice, and a second parameter value, indicating a sound that is not a human voice. The neural network 57 determines that the signal is a human voice when the difference between the first parameter value and the second parameter value exceeds a predetermined threshold value. In this way, the neural network 57 can determine whether the voice signal is a human voice based on human judgment examples.
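A minimal forward-pass sketch of this decision rule follows; the hidden-layer ReLU activation and the threshold value are assumptions, since the disclosure specifies only weighted sums plus a bias at each node and a comparison of the final two outputs.

```python
import numpy as np

def vad_decision(features, weights, biases, threshold=0.5):
    """Sketch of the neural network 57: propagate the voice feature
    vector through the layers, read the first parameter value (human
    voice) and second parameter value (not a human voice) from the
    final two neurons, and compare their difference to a threshold.
    weights/biases are assumed pre-trained on speech and noise files."""
    a = np.asarray(features, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.maximum(0.0, W @ a + b)             # weights, bias, ReLU
    first, second = weights[-1] @ a + biases[-1]   # final two output neurons
    return (first - second) > threshold            # voice flag
```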
Next, the functions of the DOA 60 will be described in detail with reference to the accompanying figure.
The DOA 60 detects the time difference between the input signals of the microphone 11 and the microphone 13 based on the peak position of their cross-correlation function. The sound displacement (L2) is calculated as the product of this time difference and the sound speed. Here, L2 = L1 * sin θ, where L1 is the spacing between the microphones. Because L1 is a fixed value, the direction of arrival (θ) can be detected as θ = arcsin(L2 / L1).
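This geometry can be sketched as follows, where the sampling rate, the microphone spacing L1, and the sound speed of 343 m/s are assumptions for illustration:

```python
import numpy as np

def detect_direction_of_arrival(sig11, sig13, sr=16000, L1=0.1, c=343.0):
    """Sketch of the DOA 60: the peak of the cross-correlation between
    the microphone 11 and microphone 13 signals gives their time
    difference; the sound displacement L2 is that time difference times
    the sound speed, and theta = arcsin(L2 / L1)."""
    corr = np.correlate(sig11, sig13, mode="full")
    lag = np.argmax(corr) - (len(sig13) - 1)        # peak position in samples
    L2 = (lag / sr) * c                             # time difference * sound speed
    theta = np.arcsin(np.clip(L2 / L1, -1.0, 1.0))  # clamp to arcsin domain
    return np.degrees(theta)
```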
Next, the functions of the BF 20 will be described in detail with reference to the accompanying figure. The BF 20 is provided with FIR filters that process the collected sound signal of each microphone.
When the direction of arrival (θ) of the voice is input from the DOA 60, a beam coefficient renewing unit 25 renews the coefficients of the FIR filters. For example, the beam coefficient renewing unit 25 renews the coefficients using an appropriate adaptive algorithm based on the input voice signal so that the output signal power is minimized, under the constraining condition that the gain at the focus angle given by the renewed direction of arrival (θ) is 1.0. Because noise arriving from directions other than the direction of arrival (θ) is thereby minimized, it is possible to selectively collect voice in the direction of arrival (θ).
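One concrete instance of such a constrained minimization is Frost's linearly constrained LMS algorithm; the per-frequency-bin narrowband form sketched below is an assumption, since the disclosure describes time-domain FIR filters and leaves the particular algorithm open.

```python
import numpy as np

def frost_step(w, x, steer, mu=0.01):
    """One Frost (linearly constrained LMS) update for a single
    frequency bin: minimize output power subject to steer^H w = 1,
    i.e., gain 1.0 at the focus angle given by the direction of
    arrival. w: current complex weights (M,), x: microphone snapshot
    (M,), steer: steering vector toward theta."""
    norm = np.vdot(steer, steer).real
    P = np.eye(len(steer)) - np.outer(steer, steer.conj()) / norm  # constraint null-space projector
    f = steer / norm                            # quiescent weights satisfying the constraint
    y = np.vdot(w, x)                           # beamformer output for this snapshot
    w_new = P @ (w - mu * np.conj(y) * x) + f   # power-gradient step, then re-constrain
    return w_new, y
```

The projection step keeps every update on the constraint surface, so the unity gain toward the focus angle is preserved exactly while the output power, and hence the noise from other directions, is driven down.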
The BF 20 repeats processes such as those described above and outputs a voice signal corresponding to the direction of arrival (θ). In this way, the signal processing unit 15 can always collect sound at high sensitivity, with the direction in which a human voice is present as the direction of arrival (θ). Because a human voice can be tracked in this manner, the signal processing unit 15 can suppress the deterioration in sound quality of the human voice due to noise.
The operation of the sound emitting/collecting device 10 will be described below with reference to the accompanying flowchart.
Continuing to refer to the flowchart, the BF 20 forms directivity based on the direction of arrival (θ) detected by the DOA 60, and the second echo canceller 40 then carries out the frequency spectrum amplitude multiplication process on the signal that has undergone beam forming.
Note that in the present embodiment, an example was given of the sound emitting/collecting device 10 having the functions of both emitting and collecting sound, but the present invention is not limited to this. For example, it may be a sound collecting device having only the function of collecting sound.
The foregoing description, for purposes of explanation, used specific nomenclature to provide a thorough understanding of the invention. However, it will be apparent to one skilled in the art that the specific details are not required in order to practice the invention. Thus, the foregoing descriptions of specific embodiments of the invention are presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed; obviously, many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, thereby enabling others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the following claims and their equivalents define the scope of the invention.
Related U.S. Application Data: Provisional Application No. 62/518,315, filed June 2017 (US).