The present disclosure relates to a sound collecting device, a sound collecting method, and a sound collecting program.
Japanese Unexamined Patent Application Publication No. 2007-251354 (Patent Literature 1) and Japanese Unexamined Patent Application Publication No. 2000-261534 (Patent Literature 2) describe a sound collecting device capable of obtaining clear voice in a noisy environment by providing a microphone for generating a voice signal based on air vibration and a vibration sensor for generating a vibration signal corresponding to a voice signal based on bone vibration. The former microphone may be referred to as an air conduction microphone, and the latter vibration sensor may be referred to as a bone conduction microphone.
The sound collecting device according to Patent Literature 1 includes a filtering unit for converting a vibration signal generated by the vibration sensor into a voice signal, and outputs a voice signal based on the vibration signal generated by the vibration sensor even under quiet conditions. The sound collecting device described in Patent Literature 1 is configured to update a filter coefficient of the filtering unit so that an error signal, which is a difference between a voice signal output from the filtering unit and a voice signal generated by a microphone, becomes small.
The sound collecting device described in Patent Literature 2 mixes a voice signal generated by a microphone and a vibration signal generated by a vibration sensor at a predetermined mixing ratio. The sound collecting device described in Patent Literature 2 is configured to increase the ratio of the voice signal generated by the microphone under quiet conditions and increase the ratio of the vibration signal generated by the vibration sensor under noisy conditions.
The sound collecting device preferably outputs the voice signal generated by the microphone under quiet conditions, since there is a difference in the quality of the voice signal between the voice signal generated by the microphone and the voice signal based on the vibration signal generated by the vibration sensor. In Patent Literature 1, it is intended to improve the quality of the voice signal based on the vibration signal by updating the filter coefficient of the filtering unit so that the error signal becomes small. However, for example, in a noisy environment, the voice signal generated by the microphone includes environmental noise, and it may not be possible to improve the quality of the voice signal based on the vibration signal. Therefore, an improvement is required.
A first aspect of one or more embodiments provides a sound collecting device including: a microphone configured to generate a first voice signal based on air vibration; a vibration sensor configured to generate a vibration signal based on vibration transmitted to a human body by speech; an adaptive filter configured to set the first voice signal as a target signal, and to generate a converted voice signal by multiplying the vibration signal by a coefficient to bring the vibration signal closer to the target signal; a subtractor configured to generate a residual signal that is a difference between the target signal and the converted voice signal; and an adaptive controller configured to control the adaptive filter to update the coefficient to be multiplied by the vibration signal so that the residual signal becomes small, wherein when it is determined to be a voice section where voice is present, the adaptive controller is configured to generate and supply to the adaptive filter an adaptive filter control signal that controls the adaptive filter to update the coefficient so that the residual signal becomes small at a first speed; and when it is determined to be a non-voice section where the voice is not present, the adaptive controller is configured to generate and supply to the adaptive filter an adaptive filter control signal that controls the adaptive filter to update the coefficient so that the residual signal becomes small at a second speed slower than the first speed, or not to update the coefficient.
A second aspect of one or more embodiments provides a sound collecting method including: generating a voice signal by a microphone based on air vibration; generating a vibration signal by a vibration sensor based on vibration transmitted to a human body through speech; generating a converted voice signal by an adaptive filter, with the voice signal as a target signal, by multiplying the vibration signal by a coefficient to bring the vibration signal closer to the target signal; generating a residual signal, which is a difference between the target signal and the converted voice signal, by a subtractor, and controlling the adaptive filter by an adaptive controller to update a coefficient to be multiplied by the vibration signal so that the residual signal becomes small, wherein the adaptive controller generates and supplies to the adaptive filter an adaptive filter control signal that controls the adaptive filter to update the coefficient so that when it is determined to be a voice section where voice is present, the residual signal becomes small at the first speed; and generates and supplies to the adaptive filter an adaptive filter control signal that controls the adaptive filter to update the coefficient so that when it is determined to be a non-voice section where the voice is not present, the residual signal becomes small at a second speed slower than the first speed, or not to update the coefficient.
A third aspect of one or more embodiments provides a sound collecting program product stored in a non-transitory storage medium causing a computer to execute the steps of: a step of generating a voice signal by a microphone based on air vibration; a step of generating a vibration signal by a vibration sensor based on vibration transmitted to a human body by speech; a step of setting the voice signal as a target signal and generating a converted voice signal by an adaptive filter, by multiplying the vibration signal by a coefficient to bring the vibration signal closer to the target signal; a step of generating a residual signal, which is a difference between the target signal and the converted voice signal, by a subtractor; and a step of controlling the adaptive filter by an adaptive controller to update a coefficient to be multiplied by the vibration signal so that the residual signal becomes small, wherein the step of controlling the adaptive filter by the adaptive controller to update the coefficient includes: a step of generating and supplying to the adaptive filter an adaptive filter control signal that controls the adaptive filter to update the coefficient when it is determined to be a voice section where voice is present so that the residual signal becomes small at the first speed; and a step of generating and supplying to the adaptive filter an adaptive filter control signal that controls the adaptive filter to update the coefficient when it is determined to be a non-voice section where the voice is not present so that the residual signal becomes small at a second speed slower than the first speed, or not to update the coefficient.
Hereinafter, a sound collecting device, a sound collecting method, and a sound collecting program according to a first embodiment will be described with reference to the accompanying drawings.
The vibration sensor 3 generates a vibration signal based on the vibration transmitted to the human body. The vibration sensor 3 is arranged to contact the surface of the human body. Examples of the vibration sensor 3 include a vibration receiver embedded in the human body, a microphone arranged in direct contact with the human body, a camera for acquiring the vibration transmitted to the surface of the human body as an image, and a rangefinder for acquiring the vibration transmitted to the surface of the human body as position information. The A/D converter 4 performs an A/D conversion of the analog vibration signal supplied from the vibration sensor 3, and supplies a digital vibration signal to the adaptive controller 5, the adaptive filter 6, and the environmental noise analyzer 8.
Returning to
The subtractor 7 supplies the difference between the converted voice signal output from the adaptive filter 6 and the voice signal output from the A/D converter 2 as a residual signal to the adaptive controller 5 and the adaptive filter 6.
The adaptive controller 5 includes voice section detection units 51 and 52, a sound pressure level acquisition unit 53, a sound pressure level ratio calculation unit 55, a residual relative level acquisition unit 54, a correlation degree calculation unit 56, and an adaptive filter learning speed setting unit 57. The voice section detection units 51 and 52 detect voice sections of a voice signal and a vibration signal, respectively, by a technique called Voice Activity Detection (VAD). The voice section detection units 51 and 52 detect voice sections according to whether at least the sound pressure level exceeds a predetermined level.
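As a non-limiting illustration, the level-based part of the voice section detection described above can be sketched as follows. The frame length, the use of a mean-square level, and the function names are assumptions made for this sketch; the disclosure only requires that the sound pressure level be compared with a predetermined level.

```python
def frame_levels(samples, frame_len):
    """Mean-square level of each full frame of `samples` (assumed level measure)."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        levels.append(sum(x * x for x in frame) / frame_len)
    return levels


def detect_voice_sections(samples, frame_len, level_threshold):
    """Return one flag per frame: True where the level exceeds the threshold."""
    return [level > level_threshold for level in frame_levels(samples, frame_len)]


if __name__ == "__main__":
    quiet = [0.01, -0.01, 0.01, -0.01]
    loud = [0.5, -0.5, 0.5, -0.5]
    print(detect_voice_sections(quiet + loud + quiet, 4, 0.01))
```

The same function would be applied separately to the voice signal and to the vibration signal to obtain the two detection signals.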
In order to improve the detection accuracy of voice sections, the voice section detection units 51 and 52 may detect voice sections by adopting the technology described in Japanese Patent No. 5874344 (Patent Literature 3) or Japanese Patent No. 5948918 (Patent Literature 4) and detecting features of a human voice by analyzing frequencies. Each of the voice section detection units 51 and 52 supplies a detection signal for discriminating a voice section and a non-voice section of a voice signal and a vibration signal to the adaptive filter learning speed setting unit 57.
A sound pressure level acquisition unit 53 acquires sound pressure levels of a voice signal and a vibration signal. A sound pressure level ratio calculation unit 55 calculates a sound pressure level ratio which is a ratio between a sound pressure level of a voice signal and a sound pressure level of a vibration signal, and supplies it to an adaptive filter learning speed setting unit 57. The sound pressure levels of the voice signal and the vibration signal may be expressed as an average amplitude value of the sound pressure per unit time, or may be expressed as a sum of squares of the sound pressure per unit time. The ratio of the sound pressure level in the speaking section and the ratio of the sound pressure level in the non-speaking section differ depending on the environmental noise level. Therefore, the sound pressure level ratio calculated by the sound pressure level ratio calculation unit 55 indicates the environmental noise level.
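The two level measures mentioned above, and the ratio formed from them, can be sketched as follows. The function names, the small constant `eps` that guards against division by zero, and the choice to place the voice signal in the numerator are assumptions for illustration only.

```python
def level_mean_abs(frame):
    """Average amplitude value of the frame (one unit time)."""
    return sum(abs(x) for x in frame) / len(frame)


def level_sum_squares(frame):
    """Sum of squares of the frame (one unit time)."""
    return sum(x * x for x in frame)


def level_ratio(voice_frame, vibration_frame, level=level_mean_abs, eps=1e-12):
    """Ratio of the voice-signal level to the vibration-signal level.

    The direction of the ratio (voice over vibration) is an assumption.
    """
    return level(voice_frame) / (level(vibration_frame) + eps)


if __name__ == "__main__":
    print(level_ratio([2.0, -2.0], [1.0, -1.0]))
    print(level_ratio([2.0, -2.0], [1.0, -1.0], level=level_sum_squares))
```

Under this assumed direction, environmental noise that reaches the microphone but not the vibration sensor raises the ratio.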
The residual signal output from the subtractor 7 and the vibration signal output from the A/D converter 4 are input to the residual relative level acquisition unit 54. In the voice section, air vibration due to speech or the like is input to the microphone 1 and vibration due to speech or the like is transmitted to the vibration sensor 3, so that the residual signal is at a low level. In the non-voice section or when there is environmental noise in the voice section, the residual signal is at a relatively high level. A residual relative level acquisition unit 54 normalizes the level of the residual signal output from the subtractor 7 by the level of the vibration signal to acquire the residual relative level.
The larger the vibration signal, the larger the level of the residual signal tends to be. Therefore, by normalizing the level of the residual signal by the level of the vibration signal, the residual relative level, which is the level of the residual signal that is not affected by the magnitude of the vibration signal, can be obtained.
The correlation degree calculation unit 56 compares the residual relative level with a predetermined threshold (second threshold) to calculate the correlation degree. The correlation degree calculation unit 56 determines that the correlation between the voice signal and the vibration signal is high when the residual relative level is below the threshold, and outputs a correlation degree having a value indicating that the correlation is high. The correlation degree calculation unit 56 determines that the correlation between the voice signal and the vibration signal is low when the residual relative level exceeds the threshold, and outputs a correlation degree having a value indicating that the correlation is low.
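A minimal sketch of the normalization and the threshold comparison is given below. The mean-square level measure, the constant `eps`, the string values "high" and "low", and the treatment of a level exactly equal to the threshold as "high" are assumptions that the description above does not specify.

```python
def mean_square(frame):
    return sum(x * x for x in frame) / len(frame)


def residual_relative_level(residual_frame, vibration_frame, eps=1e-12):
    """Residual level normalized by the vibration-signal level."""
    return mean_square(residual_frame) / (mean_square(vibration_frame) + eps)


def correlation_degree(relative_level, second_threshold):
    """'high' when the residual relative level does not exceed the threshold."""
    return "high" if relative_level <= second_threshold else "low"


if __name__ == "__main__":
    level = residual_relative_level([0.1, -0.1], [1.0, -1.0])
    print(level, correlation_degree(level, 0.05))
```

Because of the normalization, scaling the residual and the vibration signal by the same factor leaves the residual relative level unchanged.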
The adaptive filter learning speed setting unit 57 discriminates between the voice section and the non-voice section based at least on the detection signals from the voice section detection units 51 and 52, and generates an adaptive filter control signal.
For better operation of the adaptive filter 6, the adaptive filter learning speed setting unit 57 may generate an adaptive filter control signal based on the detection signals by the voice section detection units 51 and 52 and the environmental noise level generated by the sound pressure level ratio calculation unit 55. For better operation of the adaptive filter 6, the adaptive filter learning speed setting unit 57 may generate an adaptive filter control signal based on the detection signals by the voice section detection units 51 and 52 and the determination result by the correlation degree calculation unit 56.
The adaptive filter learning speed setting unit 57 may determine that the current section is a voice section (on) if either the detection signal from the voice section detection unit 51 or the detection signal from the voice section detection unit 52 indicates a voice section. Conversely, the adaptive filter learning speed setting unit 57 may determine that the current section is not a voice section (off) if either detection signal indicates a non-voice section.
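The two ways of combining the detection signals can be expressed as a logical OR and a logical AND, respectively. The function names below are hypothetical and serve only to contrast the two policies.

```python
def voice_on_if_either(mic_detects_voice, vibration_detects_voice):
    """On when either detection signal indicates a voice section."""
    return mic_detects_voice or vibration_detects_voice


def voice_off_if_either(mic_detects_voice, vibration_detects_voice):
    """Off when either detection signal indicates a non-voice section,
    i.e. on only when both indicate a voice section."""
    return mic_detects_voice and vibration_detects_voice


if __name__ == "__main__":
    print(voice_on_if_either(True, False), voice_off_if_either(True, False))
```

The first policy is more permissive and the second more conservative about when the adaptive filter is allowed to learn quickly.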
As shown in
In pattern #3, the adaptive filter learning speed setting unit 57 sets the learning speed as saving when the voice section detection is off and the environmental noise level is high exceeding a predetermined threshold. In pattern #4, the adaptive filter learning speed setting unit 57 sets the learning speed as saving when the voice section detection is on and the environmental noise level is high. “The learning speed is active” means that the adaptive operation in the adaptive filter 6 is actively promoted, and “the learning speed is saving” means that the adaptive operation in the adaptive filter 6 is suppressed or stopped.
Specifically, actively promoting the adaptive operation in the adaptive filter 6 means controlling the adaptive filter 6 to update the coefficient (described below), by which the vibration signal is multiplied, within a short time at the first speed. Suppressing the adaptive operation in the adaptive filter 6 means controlling the adaptive filter 6 to update the coefficient over a long time at the second speed slower than the first speed. Stopping the adaptive operation in the adaptive filter 6 means controlling the adaptive filter 6 not to update the coefficient (to maintain the coefficient).
As shown in
In pattern #7, the adaptive filter learning speed setting unit 57 sets the learning speed as saving when the voice section detection is off and the correlation degree is low. In pattern #8, the adaptive filter learning speed setting unit 57 sets the learning speed as saving when the voice section detection is on and the correlation degree is low.
As shown in
As shown in
If the learning speed is set as active, the adaptive filter 6 updates the coefficient at the first speed. If the learning speed is set as saving, the adaptive filter 6 updates the coefficient at the second speed slower than the first speed or does not update the coefficient.
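One possible way to derive the learning speed from the three inputs is sketched below. Patterns #3, #4, #7 and #8 (high environmental noise or low correlation gives "saving") follow the description; the behavior for the remaining combinations (active only when voice is detected) and the numeric step sizes are assumptions for illustration, as is the rule that either condition alone forces "saving".

```python
ACTIVE = "active"
SAVING = "saving"


def learning_speed(voice_on, noise_high=False, correlation_low=False):
    """Return ACTIVE or SAVING for one decision interval."""
    if noise_high or correlation_low:
        return SAVING  # patterns #3, #4, #7, #8
    # Assumed for the remaining patterns: learn only while voice is present.
    return ACTIVE if voice_on else SAVING


def step_size(speed, first_speed=0.5, second_speed=0.0):
    """Map the learning speed to an update step size.

    second_speed=0.0 corresponds to not updating the coefficient;
    a small positive value corresponds to updating over a long time.
    """
    return first_speed if speed == ACTIVE else second_speed


if __name__ == "__main__":
    for voice_on in (False, True):
        for noise_high in (False, True):
            print(voice_on, noise_high, learning_speed(voice_on, noise_high))
```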
The adaptive filter learning speed setting unit 57 may generate an adaptive filter control signal based on the voice section detection, the environmental noise level, and the correlation degree. In this case, either of the environmental noise level and the correlation degree may be given priority to set as active or saving. Moreover, the environmental noise level and the correlation degree may be made into points, and the adaptive filter learning speed setting unit 57 may determine whether to be a voice section by combining the points of the environmental noise level and the points of the correlation degree, and set as active or saving.
The adders 641 to 64n add the outputs of the multipliers 630 and 631, the outputs of the adder 641 and the multiplier 632, the outputs of the adder 642 and the multiplier 633, . . . , and the outputs of the adder 64(n−1) (not shown) and the multiplier 63n, respectively. Thus, the adder 64n outputs a converted voice signal obtained by correcting the vibration signal output from the A/D converter 4 to be closer to the voice signal output from the A/D converter 2.
The subtractor 7 outputs a residual signal which is a difference between the converted voice signal output from the adder 64n and the voice signal output from the A/D converter 2. The adaptive coefficient update unit 61 updates the coefficients by which the multipliers 630 to 63n multiply the input samples so that the residual signal becomes small.
At this time, when the adaptive filter control signal is high, indicating “active”, the adaptive coefficient update unit 61 updates the coefficients to be supplied to the multipliers 630 to 63n in a short time so that the residual signal becomes small. When the adaptive filter control signal is low, indicating “saving”, the adaptive coefficient update unit 61 updates the coefficients to be supplied to the multipliers 630 to 63n over a long time in a direction where the residual signal becomes small, or does not update the coefficients.
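The multiply-and-add structure and the speed-controlled coefficient update can be sketched as follows. The description does not name a specific update rule; the normalized LMS (NLMS) rule used here, the per-sample step size standing in for the adaptive filter control signal, and all function names are assumptions for illustration.

```python
def fir_output(coeffs, taps):
    """Sum of tap samples multiplied by their coefficients."""
    return sum(c * x for c, x in zip(coeffs, taps))


def nlms_update(coeffs, taps, residual, step, eps=1e-8):
    """One NLMS step; step=0.0 leaves the coefficients unchanged."""
    power = sum(x * x for x in taps) + eps
    return [c + step * residual * x / power for c, x in zip(coeffs, taps)]


def run_adaptive_filter(vibration, target, n_taps, steps):
    """Filter `vibration` toward `target`, using one step size per sample."""
    coeffs = [0.0] * n_taps
    taps = [0.0] * n_taps
    converted, residuals = [], []
    for x, d, mu in zip(vibration, target, steps):
        taps = [x] + taps[:-1]       # shift the newest sample into the delay line
        y = fir_output(coeffs, taps)  # converted voice signal sample
        e = d - y                     # residual signal sample
        coeffs = nlms_update(coeffs, taps, e, mu)
        converted.append(y)
        residuals.append(e)
    return coeffs, converted, residuals


if __name__ == "__main__":
    x = [((n * 37) % 17) / 8.0 - 1.0 for n in range(500)]
    d = [0.5 * x[n] + (0.25 * x[n - 1] if n > 0 else 0.0) for n in range(500)]
    coeffs, _, residuals = run_adaptive_filter(x, d, 2, [0.5] * 500)
    print(coeffs, residuals[-1])
```

A large step corresponds to "active", a small step to slow updating under "saving", and a zero step to holding the coefficients.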
As explained with reference to
As explained with reference to
Therefore, if the adaptive filter control signal is low, the adaptive coefficient update unit 61 may leave the coefficient unchanged or, if it does update the coefficient, may update it gradually over a long time rather than immediately. The adaptive filter 6 obtains a coefficient that brings the vibration signal closer to the voice signal by learning before the environmental noise level becomes high or the correlation degree becomes low, and outputs a converted voice signal having good voice quality. Therefore, the adaptive filter 6 can continue to output a converted voice signal having good voice quality even if the coefficient is not updated during the short time in which the environmental noise level is high or the correlation degree is low.
The sound pressure level acquisition units 81 and 82 and the sound pressure level ratio calculation unit 83 have substantially the same configuration as the sound pressure level acquisition unit 53 and the sound pressure level ratio calculation unit 55 in the adaptive controller 5 shown in
The environmental noise analyzer 8 is provided to select the voice signal output from the A/D converter 2 by the selector 9 when the environmental noise does not affect the voice such as speech in the voice section, and to select the converted voice signal output from the adaptive filter 6 by the selector 9 when the environmental noise does affect the voice.
The sound pressure level ratio output from the sound pressure level ratio calculation unit 83 and the adaptive filter control signal supplied from the adaptive controller 5 are input to the selector control signal setting unit 84. The adaptive filter control signal is input to the selector control signal setting unit 84 in order to generate a selector control signal for selecting the voice signal output from the A/D converter 2 and the converted voice signal output from the adaptive filter 6 based on the environmental noise level in the non-voice section. Since the environmental noise level in the voice section is affected by the voice, it may not indicate the true environmental noise level.
The selector control signal setting unit 84 generates a selector control signal for selecting the voice signal when the environmental noise level in the non-voice section is less than or equal to a predetermined threshold (third threshold), generates a selector control signal for selecting the converted voice signal when the environmental noise level in the non-voice section exceeds the threshold, and supplies the selector control signal to the selector 9. The third threshold used by the selector control signal setting unit 84 may be the same as or different from the first threshold used by the adaptive filter learning speed setting unit 57.
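The selection logic can be sketched as follows, with the environmental noise level refreshed only in non-voice sections and held during voice sections. The per-frame representation, the initial noise level of 0.0, and the string values are assumptions for this sketch.

```python
def selector_control_signals(frames, third_threshold):
    """frames: iterable of (sound_pressure_level_ratio, is_voice_section).

    Returns "voice" or "converted" for each frame.
    """
    noise_level = 0.0  # assumed initial value before any non-voice frame is seen
    choices = []
    for ratio, is_voice_section in frames:
        if not is_voice_section:
            # Only non-voice frames are trusted as the environmental noise level.
            noise_level = ratio
        choices.append("voice" if noise_level <= third_threshold else "converted")
    return choices


if __name__ == "__main__":
    frames = [(0.5, False), (5.0, True), (3.0, False), (9.0, True), (0.2, False)]
    print(selector_control_signals(frames, 1.0))
```

In the example, the high ratio observed during the first voice section does not trigger a switch, because it may reflect the voice itself rather than the environment.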
In this case, the environmental noise analyzer 8 supplies a selector control signal for selecting the voice signal to the selector 9 before time t1, and the selector 9 selects and outputs the voice signal. After time t1, the environmental noise analyzer 8 supplies a selector control signal for selecting the converted voice signal to the selector 9. Instead of immediately switching the voice signal to the converted voice signal, the selector 9 switches the voice signal to the converted voice signal at time t2 while gradually decreasing the sound pressure level of the voice signal and gradually increasing the sound pressure level of the converted voice signal over the time from t1 to t2.
After the time t3, the environmental noise analyzer 8 supplies a selector control signal for selecting a voice signal to the selector 9. Similarly, the selector 9 switches to the voice signal at the time t4 while gradually decreasing the sound pressure level of the converted voice signal and gradually increasing the sound pressure level of the voice signal over the time t3 to t4.
When switching between the voice signal and the converted voice signal, the selector 9 mixes the voice signal and the converted voice signal while gradually decreasing the sound pressure level of one and gradually increasing the sound pressure level of the other, so that the voice signal and the converted voice signal can be switched without discomfort.
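The gradual switching can be sketched as a cross-fade. The linear ramp and the function names are assumptions; the description only requires that one sound pressure level decrease gradually while the other increases.

```python
def crossfade_gain(t, t_start, t_end):
    """Gain of the incoming signal: 0.0 before t_start, 1.0 after t_end."""
    if t <= t_start:
        return 0.0
    if t >= t_end:
        return 1.0
    return (t - t_start) / (t_end - t_start)


def crossfade(outgoing_sample, incoming_sample, t, t_start, t_end):
    """Mix the two samples with complementary gains."""
    g = crossfade_gain(t, t_start, t_end)
    return (1.0 - g) * outgoing_sample + g * incoming_sample


if __name__ == "__main__":
    # Switching from the voice signal (1.0) to the converted voice signal (3.0)
    # between t1 = 1 and t2 = 3.
    print([crossfade(1.0, 3.0, t, 1, 3) for t in range(5)])
```

For the switch back from t3 to t4, the same function would be called with the roles of the two signals exchanged.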
Instead of switching between the voice signal and the converted voice signal as shown in
The environmental noise analyzer 8 may be omitted when the selector 9 mixes the voice signal and the converted voice signal in accordance with the correlation degree calculated by the correlation degree calculation unit 56. The correlation degree calculation unit 56 may calculate a correlation of three or more levels, and the selector 9 may mix the voice signal and the converted voice signal by varying the weights of the two in a plurality of ways. The correlation degree may be calculated by the correlation degree calculation unit 56 in two stages, or any number of stages.
Referring to
As described above, the sound collecting device 100 does not always update the coefficient to be multiplied by the vibration signal in the adaptive filter 6 so that the residual signal becomes small in a short time, but updates it over a long time or does not update it when the quality of the converted voice signal may be deteriorated. Therefore, the sound collecting device 100 can improve the quality of the voice signal (converted voice signal) based on the vibration signal generated by the vibration sensor 3 compared with the sound collecting device described in Patent Literature 1.
Furthermore, the sound collecting device 100 uses the selector 9 to select and output either the voice signal output from the A/D converter 2 or the converted voice signal output from the adaptive filter 6. Therefore, with the sound collecting device 100, the voice signal generated by the microphone 1 and the voice signal based on the vibration signal generated by the vibration sensor 3 can be selected as appropriate depending on the environment.
Hereinafter, the sound collecting device, the sound collecting method, and the sound collecting program according to a second embodiment will be described with reference to the accompanying drawings.
In
The digital voice signal (second voice signal), which is the voice (the partner voice) transmitted from the communication partner and received via a server and a line 11, is supplied to the echo canceller 20 and the D/A converter 15. The second voice signal may be referred to as the partner voice signal. The D/A converter 15 performs D/A conversion of the input digital voice signal and supplies the analog voice signal to the speaker 16. The speaker 16 reproduces the input voice signal and outputs the partner voice. At this time, since the microphone 1 collects the partner voice output from the speaker 16, the voice emitted by the communication partner may be superimposed on the voice emitted by the user as an echo component.
The echo canceller 20 suppresses the echo component superimposed on the voice signal output from the A/D converter 2 by using the voice signal received via the line 11. The echo canceller 20 supplies the voice signal whose echo component is suppressed to the adaptive controller 5 and the subtractor 7. Although the echo canceller 20 may not be able to completely cancel the echo component superimposed on the voice signal collected by the microphone 1, the voice signal output from the echo canceller 20 is called the echo-canceled voice signal.
As an example, the echo canceller 20 may have a configuration as shown in
The echo canceller 20 is not limited to a configuration including the adaptive filter 13 as shown in
Returning to
As will be described below, the adaptive filter 6 sets the echo-canceled voice signal output from the echo canceller 20 as a target signal and generates a converted voice signal by correcting the vibration signal to be closer to the target signal, and supplies it to the line 11. The line 11 is the Internet line, for example. The converted voice signal is transmitted to a communication partner via the line 11 and an unillustrated Internet communication server.
In
Since most of the section b3 overlaps with the section c2 and the sound pressure levels of both the partner voice and the user voice are high, the echo component is likely to remain even after the echo canceller 20 cancels the echo. The section b1 overlaps with the section c1, so the echo component may remain even though the sound pressure level of the partner voice is low. The section b2 is located in a non-voice section of the user voice, and it can be expected that the echo component is sufficiently cancelled by the echo canceller 20.
Schematically, the adaptive controller 12 generates an adaptive filter control signal to vary the operation of the adaptive filter 13 depending on whether it is a voice section where the partner voice is present or a non-voice section where the partner voice is not present. Specifically, when the partner voice section information indicates the voice section of the partner voice, the adaptive filter learning speed setting unit 122 generates an adaptive filter control signal for setting the learning speed as active and supplies it to the adaptive filter 13. When the partner voice section information indicates the non-voice section of the partner voice, the adaptive filter learning speed setting unit 122 generates an adaptive filter control signal for setting the learning speed as saving and supplies it to the adaptive filter 13.
“The learning speed is active” means that the adaptive operation in the adaptive filter 13 is actively promoted, and “the learning speed is saving” means that the adaptive operation in the adaptive filter 13 is suppressed or stopped.
Specifically, actively promoting the adaptive operation in the adaptive filter 13 means controlling the adaptive filter 13 to update the later-described coefficient so as to generate a cancellation signal for canceling the echo component within a short time at the first speed. Suppressing the adaptive operation in the adaptive filter 13 means controlling the adaptive filter 13 to update the coefficient over a long time at the second speed slower than the first speed. Stopping the adaptive operation in the adaptive filter 13 means controlling not to update the coefficient (to maintain the coefficient).
The adders 1341 to 134n add the outputs of the multipliers 1330 and 1331, the outputs of the adder 1341 and the multiplier 1332, the outputs of the adder 1342 and the multiplier 1333, . . . , and the outputs of the adder 134(n−1) (not shown) and the multiplier 133n, respectively. Thus, the adder 134n outputs a cancel voice signal for canceling the echo component from the voice signal on which the echo component is superimposed.
The subtractor 14 subtracts the cancel voice signal from the voice signal output from the A/D converter 2, on which the echo component is superimposed, and outputs the echo-canceled voice signal. The adaptive coefficient update unit 131 updates the coefficients by which the multipliers 1330 to 133n multiply the input samples so as to generate the cancel voice signal in which the echo component remains as little as possible.
At this time, the adaptive coefficient update unit 131 updates the coefficients to be supplied to the multipliers 1330 to 133n in a short time when the adaptive filter control signal is high, indicating “active”. The adaptive coefficient update unit 131 updates the coefficients to be supplied to the multipliers 1330 to 133n over a long time or does not update the coefficients when the adaptive filter control signal is low, indicating “saving”.
The voice section detection unit 510 detects the voice section of the vibration signal by a technique called VAD, and supplies the voice section information to the adaptive filter learning speed setting unit 550. The voice section detection unit 510 detects the voice section according to whether at least the sound pressure level exceeds a predetermined level. The voice signal output from the echo canceller 20 and the partner voice signal are input to the residual echo level estimation unit 520. The residual echo level estimation unit 520 estimates the residual echo level remaining in the target signal by calculating a relative sound pressure level ratio per predetermined unit time between the sound pressure level of the partner voice signal and the sound pressure level of the voice signal output from the echo canceller 20. The predetermined unit time is, for example, several milliseconds or tens of milliseconds. The residual echo level estimation unit 520 supplies the residual echo level to the adaptive filter learning speed setting unit 550.
The adaptive filter learning speed setting unit 550 generates an adaptive filter control signal for setting the learning speed as active and supplies it to the adaptive filter 6 when a first condition is satisfied, the first condition being that the voice section information indicates the voice section of the user and the residual echo level is less than or equal to a predetermined threshold. The adaptive filter learning speed setting unit 550 generates an adaptive filter control signal for setting the learning speed as saving and supplies it to the adaptive filter 6 when the first condition is not satisfied.
“The learning speed is active” means that the adaptive operation in the adaptive filter 6 is actively promoted, and “the learning speed is saving” means that the adaptive operation in the adaptive filter 6 is suppressed or stopped.
Specifically, actively promoting the adaptive operation in the adaptive filter 6 means controlling the adaptive filter 6 to update the coefficient (described below), by which the vibration signal is multiplied, within a short time at the third speed. Suppressing the adaptive operation in the adaptive filter 6 means controlling the adaptive filter 6 to update the coefficient over a long time at the fourth speed, which is slower than the third speed. Stopping the adaptive operation in the adaptive filter 6 means controlling the adaptive filter 6 not to update the coefficient (to maintain the coefficient). The third speed may be the same as or different from the first speed, and the fourth speed may be the same as or different from the second speed.
When the voice section information does not indicate the voice section of the user, the learning speed is set as saving since no voice signal is present as a target signal. When the voice section information indicates the voice section of the user but the residual echo level exceeds the threshold, the learning speed is set as saving since the quality of the converted voice signal may be deteriorated by the presence of the residual echo component. A threshold to be compared with the residual echo level that does not deteriorate the quality of the converted voice signal by the adaptive filter 6 may be measured in advance and stored in the storage unit.
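The first condition and the resulting learning speed can be sketched as follows. The function name, the string values, and the numeric threshold in the example are hypothetical.

```python
def learning_speed_with_echo(user_voice_on, residual_echo_level, echo_threshold):
    """'active' only when the user is speaking and the residual echo is small."""
    first_condition = user_voice_on and residual_echo_level <= echo_threshold
    return "active" if first_condition else "saving"


if __name__ == "__main__":
    print(learning_speed_with_echo(True, 0.05, 0.1))   # user speaking, little echo
    print(learning_speed_with_echo(True, 0.30, 0.1))   # user speaking, echo remains
    print(learning_speed_with_echo(False, 0.05, 0.1))  # no target voice present
```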
The voice section information of the vibration signal output from the voice section detection unit 510, the vibration signal, and the voice signal output from the echo canceller 20 are input to the vibration signal level correction unit 530. The vibration signal level correction unit 530 calculates the relative sound pressure level ratio between the vibration signal and the voice signal output from the echo canceller 20 per predetermined unit time in the voice section of the vibration signal. The vibration signal level correction unit 530 outputs a corrected sound pressure level obtained by correcting the sound pressure level of the vibration signal to a sound pressure level corresponding to the sound pressure level of the voice signal based on the relative sound pressure level ratio. The predetermined unit time is several milliseconds or tens of milliseconds, for example.
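The level correction described above might be sketched as follows. The disclosure does not specify how the level or the ratio is computed, so the RMS level measure, the exponential smoothing of the ratio, and all names (`rms`, `VibrationLevelCorrector`, `smoothing`) are assumptions made for illustration.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame of samples (one unit time)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


class VibrationLevelCorrector:
    """Tracks the voice/vibration level ratio during user voice sections."""

    def __init__(self, smoothing=0.9, eps=1e-12):
        self.smoothing = smoothing  # hypothetical smoothing factor
        self.eps = eps
        self.ratio = 1.0            # running voice/vibration level ratio

    def process(self, vib_frame, voice_frame, is_user_voice):
        vib_level = rms(vib_frame)
        # The ratio is updated only in the voice section of the vibration signal.
        if is_user_voice and vib_level > self.eps:
            frame_ratio = rms(voice_frame) / vib_level
            self.ratio = (self.smoothing * self.ratio
                          + (1.0 - self.smoothing) * frame_ratio)
        # Corrected level: the vibration level scaled to the voice-signal level.
        return vib_level * self.ratio
```

Each call to `process` handles one frame, whose length would correspond to the predetermined unit time of several milliseconds to tens of milliseconds.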
The voice signal collected by the microphone 1 may include an echo component or environmental noise. A relatively accurate sound pressure level of the voice signal that is not affected by the echo component or environmental noise can be obtained when the sound pressure level of the vibration signal is corrected to a sound pressure level corresponding to the sound pressure level of the voice signal.
The voice signal output from the echo canceller 20, the partner voice signal, and the voice section information of the vibration signal are input to the residual echo level estimation unit 520 shown in
When the voice section information of the vibration signal indicates the non-voice section of the user and the partner voice section information indicates the voice section of the partner voice signal, the microphone 1 does not collect the voice emitted by the user but only the echo, so that the voice signal output from the echo canceller 20 includes only the echo component.
Then, the residual echo level estimation unit 520 calculates the relative sound pressure level ratio between the partner sound pressure information and the voice signal output from the echo canceller 20 per predetermined unit time when the voice section information of the vibration signal indicates the non-voice section of the user and the partner voice section information indicates the voice section of the partner voice signal. The predetermined unit time here is also about several milliseconds or tens of milliseconds, for example. The relative sound pressure level ratio calculated by the residual echo level estimation unit 520 corresponds to the estimated residual echo level. In this way, the residual echo level estimation unit 520 estimates the residual echo level.
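A minimal sketch of this estimation is given below. The orientation of the ratio (echo canceller output level divided by partner signal level), the RMS level measure, the behavior of keeping the previous estimate when the condition does not hold, and the names `rms` and `estimate_residual_echo` are all assumptions; the disclosure states only that a relative sound pressure level ratio is calculated per unit time.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame of samples (one unit time)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def estimate_residual_echo(canceller_out_frame, partner_frame,
                           is_user_voice, is_partner_voice,
                           previous=None, eps=1e-12):
    """Estimate the residual echo level for one frame.

    The estimate is refreshed only when the user is silent and the partner
    is speaking, because the echo canceller output then contains only echo.
    Otherwise the previous estimate is returned unchanged.
    """
    if is_user_voice or not is_partner_voice:
        return previous
    partner_level = rms(partner_frame)
    if partner_level <= eps:
        return previous
    # Assumed orientation: a larger value means more echo remains.
    return rms(canceller_out_frame) / partner_level
```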
The residual echo level output from the residual echo level estimation unit 520 and the corrected sound pressure level output from the vibration signal level correction unit 530 are input to the level ratio calculation unit 540. The level ratio calculation unit 540 divides the corrected sound pressure level by the residual echo level to calculate a relative sound pressure level ratio between the corrected sound pressure level and the residual echo level. The residual echo level included in the voice signal collected by the microphone 1 is estimated in advance by the residual echo level estimation unit 520. The corrected sound pressure level corresponding to the sound pressure level of the voice signal based on the vibration signal is obtained by the vibration signal level correction unit 530.
Therefore, the relative sound pressure level ratio calculated by the level ratio calculation unit 540 is an accurate sound pressure level ratio even in a state where the microphone 1 collects environmental noise and a state where the voice emitted by the user overlaps with the partner voice. When the relative sound pressure level ratio calculated by the level ratio calculation unit 540 exceeds a predetermined threshold, the voice signal output from the echo canceller 20 includes almost no echo component, and the echo component is canceled by the echo canceller 20. When the relative sound pressure level ratio calculated by the level ratio calculation unit 540 is less than or equal to a predetermined threshold, the voice signal output from the echo canceller 20 includes an echo component, and the echo component is not canceled by the echo canceller 20.
The adaptive filter learning speed setting unit 550 receives the voice section information output from the voice section detection unit 510 and the relative sound pressure level ratio output from the level ratio calculation unit 540. The adaptive filter learning speed setting unit 550 generates an adaptive filter control signal for setting the learning speed as active and supplies it to the adaptive filter 6 when a second condition is satisfied, namely that the voice section information indicates the voice section of the user and that the relative sound pressure level ratio output from the level ratio calculation unit 540 exceeds a threshold. When the second condition is not satisfied, the adaptive filter learning speed setting unit 550 generates an adaptive filter control signal for setting the learning speed as saving and supplies it to the adaptive filter 6.
When the voice section information does not indicate the voice section of the user, the learning speed may be set as saving since no voice signal is present as a target signal. When the relative sound pressure level ratio is less than or equal to the threshold even though the voice section information indicates the voice section of the user, it is preferable to set the learning speed as saving since the quality of the converted voice signal may be deteriorated by the presence of residual echo components.
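The level ratio calculation and the second condition can be illustrated as follows. This is a sketch under stated assumptions: the function names `level_ratio` and `select_learning_mode_by_ratio`, the guard against division by zero, and the example values are hypothetical and do not come from the disclosure.

```python
def level_ratio(corrected_level, residual_echo_level, eps=1e-12):
    """Corrected sound pressure level divided by the residual echo level."""
    # eps is an assumed guard so that a zero echo level does not divide by zero.
    return corrected_level / max(residual_echo_level, eps)


def select_learning_mode_by_ratio(is_user_voice, corrected_level,
                                  residual_echo_level, threshold):
    """Return "active" when the second condition holds, else "saving"."""
    ratio = level_ratio(corrected_level, residual_echo_level)
    # Second condition: user voice section AND the ratio exceeds the threshold.
    if is_user_voice and ratio > threshold:
        return "active"
    return "saving"
```

With a hypothetical threshold of 4.0, a corrected level of 1.0 against a residual echo level of 0.125 gives a ratio of 8.0 and selects active learning, while a residual echo level of 0.25 gives exactly 4.0, which does not exceed the threshold and therefore selects saving.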
In a specific third configuration example of the adaptive controller 5 shown in
The adaptive filter learning speed setting unit 550 generates an adaptive filter control signal for setting the learning speed as active and supplies it to the adaptive filter 6 when the following fourth condition is satisfied: the partner voice section information indicates a voice section of the partner voice signal; the relative sound pressure level ratio output from the level ratio calculation unit 540 exceeds a threshold; and the voice section information indicates a voice section of the user.
The adaptive filter learning speed setting unit 550 generates an adaptive filter control signal for setting the learning speed as saving and supplies it to the adaptive filter 6 when neither the third condition nor the fourth condition is satisfied.
As a more preferable configuration, the adaptive controller 5 shown in
The adaptive filter learning speed setting unit 550 generates an adaptive filter control signal for setting the learning speed as active and supplies it to the adaptive filter 6 when the following fifth condition is satisfied: the voice section information indicates the voice section of the user; and the level ratio calculated by the level ratio calculation unit 540 exceeds a predetermined threshold. The adaptive filter learning speed setting unit 550 generates an adaptive filter control signal for setting the learning speed as saving and supplies it to the adaptive filter 6 when the fifth condition is not satisfied.
In
The adders 641 to 64n add the outputs of the multipliers 630 and 631, the outputs of the adder 641 and the multiplier 632, the outputs of the adder 642 and the multiplier 633, . . . , and the outputs of the adder 64(n−1) (not shown) and the multiplier 63n, respectively. As a result, the adder 64n outputs a converted voice signal obtained by correcting the vibration signal output from the A/D converter 4 to be closer to the voice signal output from the echo canceller 20.
The subtractor 7 outputs a residual signal that is the difference between the converted voice signal output from the adder 64n and the voice signal output from the echo canceller 20. The adaptive coefficient update unit 61 updates the coefficients by which the multipliers 630 to 63n multiply the input samples so that the residual signal becomes small.
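The multiplier and adder chain described above computes a weighted sum of the current and delayed vibration samples. A minimal sketch follows, assuming a tapped delay line whose stored samples feed the multipliers; the class name `TappedDelayLineFilter`, the function name `residual`, and the sign convention of the residual (target minus converted) are assumptions, since the text states only that the residual is a difference.

```python
from collections import deque


class TappedDelayLineFilter:
    """Weighted sum of the newest and delayed vibration samples."""

    def __init__(self, coefficients):
        self.coefficients = list(coefficients)
        n = len(self.coefficients)
        # One stored sample per multiplier; the oldest sample drops out.
        self.taps = deque([0.0] * n, maxlen=n)

    def step(self, vibration_sample):
        # Shift the newest sample in, then multiply each tap by its
        # coefficient and accumulate the products (the adder chain).
        self.taps.appendleft(vibration_sample)
        return sum(c * x for c, x in zip(self.coefficients, self.taps))


def residual(target_sample, converted_sample):
    """Subtractor output; the sign convention here is an assumption."""
    return target_sample - converted_sample
```

With coefficients `[0.5, 0.25]`, feeding the samples 1.0 and then 2.0 yields 0.5 and then 1.25, since the second output combines the newest sample with the delayed one.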
At this time, when the adaptive filter control signal is high, indicating “active”, the adaptive coefficient update unit 61 updates the coefficients to be supplied to the multipliers 630 to 63n in a short time so that the residual signal becomes small. When the adaptive filter control signal is low, indicating “saving”, the adaptive coefficient update unit 61 updates the coefficients to be supplied to the multipliers 630 to 63n over a long time in a direction where the residual signal becomes small, or does not update the coefficients.
When the adaptive filter control signal for setting the learning speed as active is input, the adaptive filter 6 updates the coefficients to be supplied to the multipliers 630 to 63n in a short time to correct the vibration signal to be closer to the voice signal. Thus, the sound collecting device 200 can immediately supply a converted voice signal having a good voice quality to the line 11.
When an adaptive filter control signal for setting the learning speed as saving is input, the adaptive filter 6 either does not update the coefficients supplied to the multipliers 630 to 63n, or updates them gradually over a long time rather than immediately. As a result, the sound collecting device 200 can supply the converted voice signal to the line 11 with little or no degradation of its voice quality.
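One way to realize a fast, slow, or stopped coefficient update is to vary the step size of a gradient-type adaptive algorithm. The sketch below uses a normalized LMS (NLMS) update, which is an assumption made for illustration; the disclosure does not name the update algorithm. The step-size values and the names `STEP_SIZES` and `update_coefficients` are likewise hypothetical.

```python
# Hypothetical step sizes: large for "active", small for "saving",
# and zero when the update is stopped.
STEP_SIZES = {"active": 0.5, "saving": 0.01, "stop": 0.0}


def update_coefficients(coefficients, taps, residual, mode, eps=1e-8):
    """Return updated coefficients using an NLMS-style step (assumed)."""
    mu = STEP_SIZES[mode]
    if mu == 0.0:
        # Stopped: the coefficients are maintained as they are.
        return list(coefficients)
    # Normalizing by the tap energy keeps the step independent of signal level.
    power = sum(x * x for x in taps) + eps
    return [c + mu * residual * x / power
            for c, x in zip(coefficients, taps)]
```

A larger step size moves the coefficients toward a small residual within fewer samples, which corresponds to updating in a short time, while a small step size spreads the same change over many samples.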
The adaptive filter 6 obtains, by learning, a coefficient that brings the vibration signal closer to the voice signal when any of the first to fifth conditions is satisfied, and outputs a converted voice signal having good voice quality. Therefore, even when none of the first to fifth conditions is satisfied, the adaptive filter 6 generates the converted voice signal by using the already obtained coefficient that brings the vibration signal closer to the voice signal, so that a converted voice signal having good voice quality can be continuously output.
Using the flowcharts shown in
In
Following step S3, the adaptive filter 13 updates the coefficients supplied to the multipliers 1330 to 133n in a short time in step S5. Following step S4, the adaptive filter 13 updates the coefficients supplied to the multipliers 1330 to 133n over a long time or does not update the coefficients in step S6.
The adaptive controller 5 determines the voice section based on the vibration signal in step S7, and corrects the sound pressure level of the vibration signal in step S8. In parallel with steps S7 and S8, the adaptive controller 5 estimates the residual echo level in step S9. Subsequently, the adaptive controller 5 calculates the relative sound pressure level ratio between the corrected sound pressure level and the residual echo level in step S10.
In step S11 of
In step S13, the adaptive controller 5 supplies the adaptive filter control signal indicating “active” to the adaptive filter 6. In step S14, the adaptive controller 5 supplies the adaptive filter control signal indicating “saving” to the adaptive filter 6. Following step S13, the adaptive filter 6 updates the coefficients supplied to the multipliers 630 to 63n in a short time in step S15. Following step S14, the adaptive filter 6 updates the coefficients supplied to the multipliers 630 to 63n over a long time or does not update the coefficients in step S16.
In step S17 following steps S15 or S16, the sound collecting device 200 determines whether the power supply is turned off. When the power supply is not turned off (NO), the sound collecting device 200 returns the process to step S1 in
As described above, the sound collecting device 200 does not always update, in a short time, the coefficient by which the vibration signal is multiplied in the adaptive filter 6 so that the residual signal becomes small. When there is a possibility that the quality of the converted voice signal deteriorates due to the presence of the residual echo component, the sound collecting device 200 performs the update over a long time or does not perform the update. Therefore, the sound collecting device 200 can improve the quality of the voice signal (converted voice signal) based on the vibration signal generated by the vibration sensor 3.
The sound collecting device 200 can further improve the quality of the voice signal based on the vibration signal generated by the vibration sensor 3 in an environment where the echo component of the voice of the communication partner may overlap with the voice signal of the user.
The present invention is not limited to the first embodiment or the second embodiment described above, and can be modified in various ways without departing from the scope of the present invention. In
The sound collecting program according to the first embodiment should cause a computer to execute at least the following first to fourth steps. In the first step, a converted voice signal is generated by multiplying the vibration signal by a coefficient so that the vibration signal, which the vibration sensor 3 generates based on vibration transmitted to the human body, is corrected to be closer to the voice signal, which the microphone 1 generates based on air vibration. In the second step, a residual signal, which is a difference between the voice signal and the converted voice signal, is generated.
In the third step, when it is determined to be a voice section where voice is present, the coefficient is updated at the first speed so that the residual signal becomes small. In the fourth step, when it is determined to be a non-voice section where the voice is not present, the coefficient is updated so that the residual signal becomes small at the second speed, which is slower than the first speed, or the coefficient is maintained without updating. The sound collecting program of the first embodiment may further cause the computer to execute a fifth step of selecting either the voice signal or the converted voice signal, or outputting a mixture of both.
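The first to fourth steps can be pictured as a per-sample loop, as in the sketch below. It is an illustration under stated assumptions rather than the disclosed program: the NLMS-style update, the default speeds, the externally supplied voice-section flags, and the name `run_sound_collecting_steps` are all hypothetical. The optional fifth step is omitted.

```python
def run_sound_collecting_steps(vibration, voice, voice_flags,
                               n_taps=4, first_speed=0.5, second_speed=0.0,
                               eps=1e-8):
    """Run steps one to four sample by sample.

    Returns the converted voice signal and the final coefficients.
    """
    coefficients = [0.0] * n_taps
    taps = [0.0] * n_taps
    converted_signal = []
    for x, d, is_voice in zip(vibration, voice, voice_flags):
        taps = [x] + taps[:-1]  # shift the newest vibration sample in
        # First step: multiply the vibration samples by the coefficients.
        y = sum(c * t for c, t in zip(coefficients, taps))
        # Second step: residual between the voice and converted signals.
        e = d - y
        # Third and fourth steps: choose the update speed by section type.
        mu = first_speed if is_voice else second_speed
        if mu > 0.0:
            power = sum(t * t for t in taps) + eps
            coefficients = [c + mu * e * t / power
                            for c, t in zip(coefficients, taps)]
        converted_signal.append(y)
    return converted_signal, coefficients
```

Setting `second_speed` to zero corresponds to maintaining the coefficient in non-voice sections, while a small positive value corresponds to updating it slowly.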
In the second and third configuration examples of the adaptive controller 5 shown in
In
In
The sound collecting program according to the second embodiment should cause the computer to execute at least the following first to fourth steps. In the first step, an echo component superimposed on the first voice signal, which the microphone 1 generates based on air vibration, is suppressed. The echo component arises when the microphone 1 collects the sound that the speaker 16 reproduces from the second voice signal transmitted from the communication partner and received via the line.
In the second step, the first voice signal with the echo component suppressed is used as a target signal, and a converted voice signal is generated by multiplying the vibration signal by a coefficient, so that the vibration signal generated by the vibration sensor 3 based on the vibration transmitted to the human body by speech is brought closer to the target signal. In the third step, a residual signal, which is a difference between the target signal and the converted voice signal, is generated. In the fourth step, the coefficient by which the vibration signal is multiplied is updated so that the residual signal becomes small.
Number | Date | Country | Kind |
---|---|---|---|
2021-194233 | Nov 2021 | JP | national |
2022-006136 | Jan 2022 | JP | national |
This application is a continuation of PCT Application No. PCT/JP2022/033098, filed on Sep. 2, 2022, and claims the priority of Japanese Patent Application No. 2021-194233, filed on Nov. 30, 2021, and Japanese Patent Application No. 2022-006136, filed on Jan. 19, 2022, the entire contents of all of which are incorporated herein by reference.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/JP2022/033098 | Sep 2022 | WO |
Child | 18677136 | | US |