This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2014-040649, filed on Mar. 3, 2014, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to a voice processing device, a noise suppression method, and a computer-readable recording medium storing a voice processing program.
As mobile phones and hands-free telephone calls in automobiles have come into wide use, there has been a demand for noise suppression during calls in noisy environments. For example, in a noisy environment in which stationary noise, such as road noise, is large, there is a desire for a technique that increases the noise suppression amount and thus makes voice easier to hear. Therefore, there have been attempts to perform noise suppression with less voice distortion on voice data captured in a noisy environment.
For example, there is known a technique for estimating a target value that indicates a level to which the noise is suppressed, based on a representative value of signals obtained by transforming a signal of voice including noise for a predetermined period of time from the time domain to the frequency domain. There is also another known technique in which a coefficient used for noise suppression is calculated based on an amplitude component of voice for each predetermined frequency band, and the frequency-axis signal of the original signal is multiplied by the calculated coefficient, thereby suppressing noise. For noise suppression, a technique for controlling upper and lower limits of noise suppression and a technique for correcting a coefficient depending on whether a signal seems to be voice or non-voice are also known (see, for example, International Publication Pamphlet No. WO2012/098579, Japanese Laid-open Patent Publication No. 2001-267973, Japanese Laid-open Patent Publication No. 2010-204392, and Japanese Laid-open Patent Publication No. 2007-183306).
As a related technique, there is known a technique that determines whether each of a plurality of frames of a predetermined length, which are obtained from a voice signal, is a voice frame or a non-voice frame, and detects a non-stationary frame based on a non-stationary condition indicating that a non-voice frame is non-stationary (see, for example, Japanese Laid-open Patent Publication No. 2010-230814).
According to an aspect of the invention, a voice processing device includes a noise-originating coefficient calculation section that calculates a noise-originating coefficient that gradually decreases as a target value of stationary noise for each frequency increases, the target value being calculated based on an amplitude value of a frequency spectrum obtained by time-frequency transforming a voice signal for a predetermined period of time; and a suppression signal generation section that generates, when the frequency spectrum is determined as being stationary on the basis of the amplitude value, a suppression signal by multiplying a suppression coefficient based on the noise-originating coefficient by the amplitude value, the suppression signal being frequency-time transformed to be output.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, as claimed.
In noise suppression, noise is typically suppressed at a fixed ratio so that the suppression itself does not distort the voice. When such noise suppression is performed, the residual noise is expected to sound natural, like the noise heard when the volume is turned down. However, when the noise itself is large, both the residual stationary noise and the residual non-stationary noise increase. On the other hand, when the suppression ratio is simply lowered to increase the noise suppression amount, target voice may be mistakenly recognized as noise and excessively suppressed, so that voice distortion might occur. Conversely, when noise is mistakenly recognized as target voice, the suppression amount might change drastically in the time direction. Such a change might cause a drastic change in amplitude and thus result in noise distortion.
Accordingly, it is desired to allow noise suppression with less voice distortion.
A voice processing device 1 according to a first embodiment will be described with reference to the accompanying drawings. The voice processing device 1 is a device that performs noise suppression processing on a voice signal input thereto and outputs the resulting voice. The voice processing device 1 may be used for preprocessing of a reception sound or a transmission sound of a multifunctional mobile phone, an output sound of a voice output device, such as a speaker or an earphone, an input sound for voice recognition, and the like. The voice processing device 1 is provided, for example, in a multifunctional mobile phone, a car-mounted communication device, a voice output device, a voice recognition device, and the like.
The transformation section 5 transforms a voice signal on a time axis for a predetermined period of time to a frequency spectrum. In this case, the voice signal includes a mix of target voice, stationary noise, and non-stationary noise. The transformation section 5 cuts out a signal of a predetermined period of time as a frame in chronological order and transforms the frame. The processing, for example, may be performed using a window function such that chronologically adjacent predetermined periods of time at least partially overlap each other. For example, the transformation section 5 performs Fast Fourier Transform (FFT) on the voice signal. A frame herein is a signal corresponding to a predetermined period of time that is cut out when transformation to a signal on a frequency axis is performed, that is, a voice signal in a predetermined period of time, or a frequency spectrum obtained by transforming a voice signal in a predetermined period of time.
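As a non-limiting illustration, the framing and time-frequency transformation described above may be sketched as follows. The sketch assumes a periodic Hann window, a hop size smaller than the frame length so that adjacent frames overlap, and a naive discrete Fourier transform used as a stand-in for an FFT. The names hann_window, split_into_frames, dft, and amplitude_spectrum are hypothetical and do not appear in the embodiments.

```python
import cmath
import math


def hann_window(n: int) -> list[float]:
    """Periodic Hann window of length n (window choice is an assumption)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def split_into_frames(signal: list[float], frame_len: int, hop: int) -> list[list[float]]:
    """Cut windowed frames in chronological order; hop < frame_len gives overlap."""
    window = hann_window(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


def dft(frame: list[float]) -> list[complex]:
    """Naive O(N^2) discrete Fourier transform, standing in for an FFT."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def amplitude_spectrum(frame: list[float]) -> list[float]:
    """Amplitude value for each frequency of one frame."""
    return [abs(c) for c in dft(frame)]
```

With a frame length of 8 and a hop of 4, for example, each frame shares half of its samples with the preceding frame.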
The stationary noise estimation section 7 estimates a target value of stationary noise for each frequency, based on an amplitude value for each frequency of a frequency spectrum. The stationary noise estimation section 7 smooths, for example, the amplitude spectrum of a frequency spectrum in the time axis direction and estimates a target value of residual noise for each frequency. The target value of the estimated noise will be hereinafter also referred to as a value of a stationary noise model. Also, the target values estimated for each frequency will be collectively referred to as a stationary noise model.
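As a non-limiting illustration, smoothing of the amplitude spectrum in the time axis direction may be sketched as follows. The embodiments do not specify the smoothing method; the sketch assumes first-order recursive (exponential) smoothing, and the function name update_noise_model and the smoothing factor alpha are hypothetical.

```python
def update_noise_model(
    noise_model: list[float], amplitude: list[float], alpha: float = 0.9
) -> list[float]:
    """Update the stationary noise model with the amplitude spectrum of a new frame.

    Assumption: exponential smoothing per frequency. A larger alpha makes the
    model follow the input more slowly.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return [alpha * n + (1.0 - alpha) * a for n, a in zip(noise_model, amplitude)]
```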
The stationary determination section 9 determines, based on the amplitude value for each frequency of the frequency spectrum, whether a component of each frequency is stationary or non-stationary. Specifically, the stationary determination section 9 may be configured to use, for example, stationary/non-stationary determination described in Japanese Laid-open Patent Publication No. 2010-230814 to calculate the rate of change with time for each amplitude spectrum and determine that a frequency component is non-stationary, when the rate of change with time is higher than a threshold, and that a frequency component is stationary, when the rate of change with time is lower than the threshold.
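As a non-limiting illustration, the threshold comparison described above may be sketched as follows. The determination of Japanese Laid-open Patent Publication No. 2010-230814 is not reproduced here; the sketch assumes that the rate of change with time is the relative difference between the amplitude values of two consecutive frames at the same frequency, and that a rate equal to the threshold is treated as non-stationary. The name is_stationary and the default threshold are hypothetical.

```python
def is_stationary(
    prev_amp: float, curr_amp: float, threshold: float = 0.5, eps: float = 1e-12
) -> bool:
    """Return True when the rate of change with time is lower than the threshold.

    Assumption: rate of change = |current - previous| / previous.
    eps only guards against division by zero.
    """
    rate = abs(curr_amp - prev_amp) / (prev_amp + eps)
    return rate < threshold
```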
The noise-originating coefficient calculation section 11 calculates a noise-originating coefficient of “1” or less, which gradually decreases as the target value increases. A calculation formula may be stored, for example, in the storage section 19, and be read out. What is meant by calculating a noise-originating coefficient of “1” or less is that, when a suppression coefficient is “1”, suppression is not performed and, as the suppression coefficient decreases from “1”, the suppression amount increases, not that the noise-originating coefficient is strictly “1” or less.
When it is determined by the stationary determination section 9 that a frequency component is stationary, the suppression coefficient calculation section 13 obtains a suppression coefficient based on a noise-originating coefficient y, for example, by multiplying a constant C (0<C≦1) and the noise-originating coefficient y together. When it is determined that a frequency component is non-stationary, the suppression coefficient calculation section 13 obtains “1” as a suppression coefficient. The constant C is a value that indicates to what degree stationary noise is suppressed from a target value and, for example, may be stored in the storage section 19 in advance. What is meant by using the constant C of “1” or less is that, when the constant C is “1”, suppression is not performed and, as the constant C decreases from “1”, the suppression amount increases, not that the constant C is strictly “1” or less.
The suppression signal generation section 15 generates a suppression signal obtained by multiplying an amplitude value for each frequency of the frequency spectrum and a corresponding suppression coefficient. The inverse transformation section 17 frequency-time transforms the suppression signal and outputs the frequency-time transformed suppression signal. To collectively describe these, Expression 1 and Expression 2 below are obtained.
Suppression coefficient=Constant C×Noise-originating coefficient y (stationary). Expression 1
Suppression coefficient=1 (non-stationary). Expression 2
What is meant by making the suppression coefficient be “1” is that suppression is not positively performed, not that the suppression coefficient is strictly “1”.
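As a non-limiting illustration, Expression 1 and Expression 2 may be sketched as follows. The function name suppression_coefficient and the default value of the constant C are hypothetical; only the range 0<C≦1 is taken from the description.

```python
def suppression_coefficient(stationary: bool, y: float, c: float = 0.5) -> float:
    """Expression 1 for a stationary component, Expression 2 otherwise."""
    if not 0.0 < c <= 1.0:
        raise ValueError("constant C must satisfy 0 < C <= 1")
    if stationary:
        return c * y  # Expression 1: Constant C x Noise-originating coefficient y
    return 1.0        # Expression 2: suppression is not positively performed
```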
y=1.0−0.00002x. Expression 3
In this case, details of the noise-originating coefficient will be described.
In the stationary noise example 40, the value of the stationary noise model differs between the amplitude spectrum 42 and the amplitude spectrum 44 relative to the frequency 46. Referring to these relative to the noise-originating coefficient 30, for the amplitude spectrum 42, the noise-originating coefficient 30=y1 corresponds to the value x1 of the stationary noise model. For the amplitude spectrum 44, the noise-originating coefficient 30=y2 corresponds to the value x2 of the stationary noise model. In this case, as the value of the stationary noise model increases, the value of the noise-originating coefficient 30 decreases, and thus, noise is suppressed more.
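As a non-limiting illustration, the relationship of Expression 3 between the value x of the stationary noise model and the noise-originating coefficient y may be sketched as follows. The lower bound floor is an assumption added so that the coefficient does not become negative for very large x; it is not stated in the description. The function name noise_originating_coefficient is hypothetical.

```python
def noise_originating_coefficient(
    x: float, slope: float = 0.00002, floor: float = 0.0
) -> float:
    """Expression 3: y = 1.0 - 0.00002 x, kept within [floor, 1.0].

    Assumption: the clamp to [floor, 1.0] is illustrative only.
    """
    y = 1.0 - slope * x
    return min(1.0, max(floor, y))
```

For two values x1 < x2 of the stationary noise model, the sketch gives y1 > y2, so that the component with the larger model value is suppressed more.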
A suppression voice signal 60 represents an example of noise suppression performed when the noise-originating coefficient 30 is not used, that is, when the noise-originating coefficient 30=1. A suppression voice signal 62 represents an example where noise suppression is performed using the noise-originating coefficient 30. A suppression voice signal 70 and a suppression voice signal 72 represent examples where the suppression voice signal 60 and the suppression voice signal 62 are enlarged in the amplitude direction. In each of the suppression voice signals 60, 62, 70, and 72, the abscissa axis represents time and the ordinate axis represents amplitude.
In the example where the noise-originating coefficient 30 is not used, the suppression voice signal 70 has an amplitude 74 after being processed. In the example where the noise-originating coefficient 30 is used, the suppression voice signal 72 has an amplitude 76 after being processed, and the amplitude is reduced to be lower than the amplitude 74. Thus, noise suppression with a greater noise suppression amount and less distortion may be performed on the voice signal 50 by using the noise-originating coefficient 30.
A suppression voice signal 86 represents an example of change with time of the amplitude spectrum of a component of the suppression signal 82 at the frequency F. A suppression voice signal 88 represents an example of change with time of a component of a signal, noise of which is suppressed using the noise-originating coefficient 30 according to this embodiment, at the frequency F. Comparing the suppression voice signal 86 and the suppression voice signal 88 with each other shows that the change in the amplitude of noise on the time axis is made moderate by using the noise-originating coefficient 30. Thus, noise distortion is reduced.
The transformation section 5 time-frequency transforms the voice signal to output a frequency spectrum (S102). Time-frequency transform is performed, for example, by cutting out a part of the voice signal on the time axis, which corresponds to a predetermined period of time, from the voice signal in chronological order and performing Fast Fourier Transform thereon. The stationary noise estimation section 7 estimates a target value of stationary noise, based on the frequency spectrum (S103). That is, the stationary noise estimation section 7 estimates a value of a stationary noise model for each frequency, based on an amplitude value for each frequency of the frequency spectrum.
The noise-originating coefficient calculation section 11 calculates a noise-originating coefficient y of “1” or less, which gradually decreases as the value of the stationary noise model increases (S104). In this case, for example, the noise-originating coefficient calculation section 11 calculates the noise-originating coefficient y with reference to the coefficient calculation table 32.
The stationary determination section 9 determines, based on the amplitude value for each frequency of the frequency spectrum, whether a component for each frequency is stationary or non-stationary (S105). When it is determined that a frequency component is stationary (YES in S105), the suppression coefficient calculation section 13 multiplies the constant C of “1” or less and the noise-originating coefficient y together to obtain a suppression coefficient (S106). The suppression coefficient obtained in this case will also be referred to as a stationary noise suppression coefficient. When it is determined that a frequency component is non-stationary (NO in S105), the suppression coefficient calculation section 13 sets “1” as a suppression coefficient (S107).
The suppression signal generation section 15 generates a suppression signal obtained by multiplying the amplitude value for each frequency and the suppression coefficient together (S108). The inverse transformation section 17 frequency-time transforms the suppression signal (S109), and outputs the frequency-time transformed suppression signal (S110). When there is not an input to end a system (NO in S111), the voice processing device 1 repeats the processes in and after S101. When there is an input to end a system (YES in S111), the voice processing device 1 ends processing.
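As a non-limiting illustration, the generation of the suppression signal and the frequency-time transformation (S108 and S109) may be sketched as follows. The sketch assumes that multiplying the amplitude value by the suppression coefficient is realized by scaling each complex frequency component by a real coefficient, which leaves the phase unchanged; the description itself refers only to the amplitude value. A naive transform pair is used as a stand-in for an FFT, and the names dft, idft, and suppress_frame are hypothetical.

```python
import cmath
import math


def dft(frame: list[float]) -> list[complex]:
    """Naive time-frequency transform (stand-in for an FFT)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum: list[complex]) -> list[float]:
    """Naive frequency-time transform; only the real part is returned."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def suppress_frame(frame: list[float], coefficients: list[float]) -> list[float]:
    """Scale each frequency component and transform back to the time axis.

    Assumption: coefficients[k] == coefficients[n - k], so that the result
    remains a real signal; otherwise the discarded imaginary part is nonzero.
    """
    spectrum = dft(frame)
    suppressed = [g * s for g, s in zip(coefficients, spectrum)]  # S108
    return idft(suppressed)                                       # S109
```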
As described above, in the voice processing device 1, the noise-originating coefficient calculation section 11 calculates a noise-originating coefficient that gradually decreases as a target value of stationary noise for each frequency increases, where the target value is calculated based on the amplitude value of a frequency spectrum obtained by time-frequency transforming a voice signal of a predetermined period of time. When it is determined, based on the amplitude value of the frequency spectrum, that the frequency spectrum is stationary, the suppression signal generation section 15 generates a suppression signal by multiplying the amplitude value by a suppression coefficient based on the noise-originating coefficient to be output after frequency-time transforming.
That is, the voice processing device 1 transforms a voice signal on a time axis for a predetermined period of time to a frequency spectrum. The voice processing device 1 estimates a target value of stationary noise for each frequency, based on the amplitude value for each frequency of the frequency spectrum. The voice processing device 1 calculates a noise-originating coefficient of “1” or less, which gradually decreases as the target value increases. The voice processing device 1 multiplies a constant of 1 or less and the noise-originating coefficient together to obtain a suppression coefficient for a frequency component of the frequency spectrum that has been determined to be stationary. The voice processing device 1 sets “1” as a suppression coefficient for a frequency component that has been determined to be non-stationary. The voice processing device 1 generates a suppression signal obtained by multiplying the amplitude value for each frequency and a suppression coefficient together, frequency-time transforms the generated suppression signal, and outputs the frequency-time transformed suppression signal.
As described above, the voice processing device 1 uses the noise-originating coefficient that gradually decreases as the target value estimated as a value of the stationary noise model increases. By using the gradually decreasing noise-originating coefficient, which is continuous and has no discontinuous part with respect to the estimated value of the stationary noise model, an increase in noise suppression amount may be realized while reducing distortion that occurs due to noise suppression. Also, by multiplying a signal by the noise-originating coefficient corresponding to the value of the stationary noise model, the noise suppression amount of stationary noise may be increased as the value of the stationary noise model increases, and thus, the amplitude change of a voice signal may be made moderate.
By using a noise-originating coefficient, a frequency component of a frequency spectrum, which is determined to be stationary, is suppressed, and therefore, noise suppression with less distortion may be performed even when noise is large. By using a noise-originating coefficient corresponding to a value of stationary noise model, excessive suppression may be prevented, and noise distortion is reduced. Also, when the component is not determined to be stationary, suppression is not performed, and therefore, a voice is not suppressed as noise, and voice distortion is reduced.
Note that, although a case where whether a frequency component is stationary or non-stationary is determined for each frequency component has been described in the above-described example, the stationary determination section 9 may be configured to determine whether each frame is stationary or non-stationary. In this case, the suppression coefficient calculation section 13 preferably calculates a suppression coefficient for a frequency component included in a frame that has been determined to be stationary, based on Expression 1.
A voice processing device 130 according to a second embodiment will be described below with reference to the accompanying drawings. In the voice processing device 130 according to the second embodiment, similar configurations and operations to those of the voice processing device 1 according to the first embodiment are denoted by the same reference characters as the reference characters in the first embodiment and the overlapping description will be omitted.
The voice reception section 132 receives an analog voice signal as an electrical signal converted, for example, by a microphone, or the like, digitizes the received analog voice signal, and outputs the digitized signal as a voice signal on a time axis. When the stationary determination section 9 determines that a frequency component is non-stationary, the target sound determination section 134 determines whether or not the determined frequency component is a target sound.
Target sound determination may be performed, for example, by a method in which a target sound is determined as a sound of a frequency at which “the amplitude value of the frequency spectrum/the value of the stationary noise model” is equal to or higher than a threshold because a voice usually has a great amplitude. Using this method, it may be determined whether or not a component for each frequency is a target sound. For example, the threshold is set to be a value that is greater than a maximum value of a voice signal that is considered to include only noise. Using a statistical method, the threshold may be obtained from a plurality of voice signals which have been actually obtained, for example.
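As a non-limiting illustration, the ratio-based target sound determination described above may be sketched as follows. The comparison "amplitude value/value of the stationary noise model ≥ threshold" is rewritten as a multiplication so that a model value of zero does not cause a division by zero. The name is_target_sound and the default threshold are hypothetical; in practice the threshold would be obtained, for example, statistically from actually obtained voice signals.

```python
def is_target_sound(
    amplitude: float, noise_model_value: float, threshold: float = 4.0
) -> bool:
    """Return True when amplitude / noise_model_value >= threshold.

    Written as amplitude >= threshold * noise_model_value to avoid division.
    """
    return amplitude >= threshold * noise_model_value
```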
Another known method may also be applied to determine whether or not a frequency component is a target sound. Further, when another method is used together with the above-described method, a corresponding frequency component may be determined to be a target sound when the conditions of both methods are satisfied, or when the condition of either one of the methods is satisfied.
Similar to the suppression coefficient calculation section 13 according to the first embodiment, for a frequency component that has been determined to be stationary by the stationary determination section 9, the suppression coefficient calculation section 136 calculates a suppression coefficient, based on Expression 1. For a frequency component that has been determined to be a target sound, the suppression coefficient calculation section 136 sets “1” as a suppression coefficient, as expressed by Expression 2. When it is determined that a frequency component is neither stationary nor a target sound, the suppression coefficient calculation section 136 calculates the suppression coefficient, based on Expression 4 below. This suppression coefficient will be also referred to as a non-stationary noise suppression coefficient.
Suppression coefficient=Coefficient K(f)×Constant C×Noise-originating coefficient y. Expression 4
Note that the coefficient K(f) is a coefficient that represents the ratio of the value of the stationary noise model to the corresponding frequency component, that is, a coefficient by which the corresponding frequency component is suppressed to the level of the stationary noise model. The coefficient K(f) is calculated, based on the target value estimated by the stationary noise estimation section 7 and each frequency component obtained by performing transformation by the transformation section 5, using Expression 5 below.
Coefficient K(f)=Target value of each frequency (the value of the stationary noise model)/Amplitude value of each frequency component. Expression 5
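As a non-limiting illustration, Expression 4 and Expression 5 may be sketched as follows. The names coefficient_k and non_stationary_noise_coefficient, the default value of the constant C, and the small guard value eps against division by zero are hypothetical.

```python
def coefficient_k(noise_model_value: float, amplitude: float, eps: float = 1e-12) -> float:
    """Expression 5: value of the stationary noise model / amplitude value."""
    return noise_model_value / max(amplitude, eps)


def non_stationary_noise_coefficient(
    noise_model_value: float, amplitude: float, y: float, c: float = 0.5
) -> float:
    """Expression 4: Coefficient K(f) x Constant C x Noise-originating coefficient y."""
    return coefficient_k(noise_model_value, amplitude) * c * y
```

For example, with a model value of 1000 and an amplitude value of 4000, K(f) is 0.25, and multiplying the amplitude value by K(f) alone returns the model value of 1000. The factors C and y then suppress the component further, in the same manner as in Expression 1.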
The transformation section 5 time-frequency transforms the voice signal to output a frequency spectrum on a frequency axis (S152). Time-frequency transformation is performed, for example, by cutting out a part of the voice signal on the time axis, which corresponds to a predetermined period of time, from the voice signal, and performing Fast Fourier Transform thereon. The stationary noise estimation section 7 estimates a target value of stationary noise, based on the frequency spectrum (S153). That is, the stationary noise estimation section 7 estimates the value of the stationary noise model for each frequency, based on the amplitude value for each frequency of the frequency spectrum on the frequency axis.
The noise-originating coefficient calculation section 11 calculates a noise-originating coefficient of “1” or less, which gradually decreases as the value of the stationary noise model increases (S154). In this case, for example, the noise-originating coefficient calculation section 11 calculates a noise-originating coefficient y with reference to the coefficient calculation table 32.
The stationary determination section 9 determines, based on the amplitude value for each frequency of the frequency spectrum on the frequency axis, whether a component for each frequency is stationary or non-stationary (S155). When it is determined that a frequency component is stationary (YES in S155), the suppression coefficient calculation section 136 multiplies the constant C of “1” or less by the noise-originating coefficient y to calculate a stationary noise suppression coefficient, based on Expression 1 (S156). When it is determined that a frequency component is non-stationary (NO in S155), the target sound determination section 134 determines whether or not the frequency component is a target sound (S157). When it is determined that the frequency component is a target sound (YES in S157), the suppression coefficient calculation section 136 sets “1” as a suppression coefficient (S158). When it is determined that the frequency component is not a target sound (NO in S157), the suppression coefficient calculation section 136 calculates a non-stationary noise suppression coefficient, based on Expression 4 (S159).
The suppression signal generation section 15 generates a suppression signal obtained by multiplying the amplitude value for each frequency and the suppression coefficient together (S160). The inverse transformation section 17 frequency-time transforms the suppression signal (S161) and outputs the frequency-time transformed suppression signal (S162). When there is not an input to end a system (NO in S163), the voice processing device 130 repeats the processes in and after S151. When there is an input to end a system (YES in S163), the voice processing device 130 ends processing.
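As a non-limiting illustration, the branching of S155 to S159 for one frequency component may be sketched as follows. The stationary and target sound determinations are passed in as already-determined flags, and the function name second_embodiment_coefficient, the default value of the constant C, and the guard value 1e-12 are hypothetical.

```python
def second_embodiment_coefficient(
    amplitude: float,
    noise_model_value: float,
    stationary: bool,
    target_sound: bool,
    y: float,
    c: float = 0.5,
) -> float:
    """Select the suppression coefficient for one frequency component."""
    if stationary:                                   # YES in S155
        return c * y                                 # S156, Expression 1
    if target_sound:                                 # YES in S157
        return 1.0                                   # S158, Expression 2
    k = noise_model_value / max(amplitude, 1e-12)    # Expression 5
    return k * c * y                                 # S159, Expression 4
```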
As described above, the voice processing device 130 transforms a voice signal on the time axis for a predetermined period of time to a frequency spectrum on the frequency axis. The voice processing device 130 estimates a target value of stationary noise for each frequency, based on an amplitude value for each frequency of the frequency spectrum. The voice processing device 130 calculates a noise-originating coefficient of “1” or less, which gradually decreases as the target value increases. The voice processing device 130 multiplies the constant C of 1 or less and the noise-originating coefficient together to obtain a suppression coefficient for a frequency component of a frequency spectrum, which has been determined to be stationary. For a frequency component determined to be non-stationary, the voice processing device 130 further determines whether or not the frequency component is a target sound. When the frequency component is a target sound, the voice processing device 130 sets “1” as a suppression coefficient, while, when it is determined that the frequency component is not a target sound, the voice processing device 130 calculates a non-stationary noise suppression coefficient. The voice processing device 130 generates a suppression signal obtained by multiplying the amplitude value for each frequency and the suppression coefficient together, frequency-time transforms the generated suppression signal, and outputs the frequency-time transformed suppression signal.
As described above, in the voice processing device 130, similar to the voice processing device 1 according to the first embodiment, a noise-originating coefficient that gradually decreases as a target value calculated as a value of a stationary noise model increases is used. With the noise-originating coefficient, a frequency component of a frequency spectrum, which has been determined to be stationary, is suppressed. Accordingly, noise suppression with less distortion may be enabled even when noise is large. Furthermore, the voice processing device 130 determines, for a frequency component that has been determined to be non-stationary, whether or not the frequency component is a target sound and sets, when the frequency component is a target sound, the suppression coefficient=1 so as not to perform suppression. When the frequency component is not a target sound, the voice processing device 130 performs suppression using a non-stationary noise suppression coefficient. Therefore, in addition to the advantages of the voice processing device 1 according to the first embodiment, noise suppression may be performed while further reducing the voice distortion. Specifically, when stationary noise is larger, a greater noise suppression effect may be achieved. As described above, because whether or not a frequency component is a target sound is determined, noise may be suppressed by increasing the noise suppression amount and voice distortion may be reduced by reducing a voice suppression amount.
Note that, as a target sound determination method, the following method may be used. That is, the target sound determination section 134 may be configured to determine a target sound when an autocorrelation value between the corresponding frame and a frame before the corresponding frame in the time direction is higher than a threshold, utilizing the fact that a voice has a high autocorrelation and noise has a low autocorrelation. In this case, whether or not a target sound is present is determined for each time frame. Also, the determination may be performed, for example, for a frame including a frequency component that has been determined to be non-stationary by the stationary determination section 9.
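As a non-limiting illustration, the frame-based determination described above may be sketched as follows. The description does not specify how the autocorrelation value is calculated; the sketch assumes a normalized correlation between the sample sequence of the corresponding frame and that of the preceding frame. The names normalized_correlation and frame_is_target_sound and the default threshold are hypothetical.

```python
import math


def normalized_correlation(curr: list[float], prev: list[float]) -> float:
    """Correlation of two equal-length frames, normalized to [-1, 1]."""
    num = sum(x * y for x, y in zip(curr, prev))
    den = math.sqrt(sum(x * x for x in curr) * sum(y * y for y in prev))
    return num / den if den > 0.0 else 0.0


def frame_is_target_sound(
    curr: list[float], prev: list[float], threshold: float = 0.5
) -> bool:
    """Return True when the correlation with the preceding frame is higher than the threshold."""
    return normalized_correlation(curr, prev) > threshold
```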
When a target sound is determined for a frame in the above-described manner, the stationary determination section 9 may be configured to determine whether a frequency spectrum is stationary or non-stationary for each frame, based on an amplitude value for each frequency of a frequency spectrum on a frequency axis. Specifically, the stationary determination section 9 may be configured to use, for example, stationary/non-stationary determination described in Japanese Laid-open Patent Publication No. 2010-230814 to determine that the frequency spectrum is non-stationary when the rate of change with time of the amplitude spectrum of the corresponding frame is higher than a threshold, and determine, when the rate of change with time is lower than the threshold, that the frequency spectrum is stationary. As for the rate of change with time, various modified examples may be used, such as a method in which the rate of change with time is calculated for a statistical representative value, such as an average value, of the amplitude spectrum of the corresponding frame, and a method in which the rate of change with time is calculated for each frequency component and a statistical representative value thereof is set as the rate of change with time. As another method, a method may be used in which it is determined that the frequency spectrum is non-stationary when the statistical representative value of the amplitude spectrum of the corresponding frame is greater than the statistical representative value of the target value of stationary noise of the corresponding frame by a predetermined value or more. Note that, when whether or not each frame is stationary is determined, the suppression coefficient calculation section 13 preferably calculates a stationary noise suppression coefficient for all frequency components in a frame that has been determined to be stationary using Expression 1 described above.
A method in which a target sound is determined for each frame may be used in combination with the above-described method in which a target sound is determined for each frequency. For example, the target sound determination section 134 may be configured to determine, only when a target sound is determined by both of the above-described determination methods, that the frequency component is a target sound. As another option, the target sound determination section 134 may be configured to determine, when a target sound is determined by either one of the above-described methods, that the frame or the frequency component is a target sound.
A voice processing device 200 according to a third embodiment will be described below with reference to the accompanying drawings. In the voice processing device 200 according to the third embodiment, similar configurations and operations to those of the voice processing device 1 according to the first embodiment and the voice processing device 130 according to the second embodiment are denoted by the same reference characters as the reference characters in the first embodiment and the second embodiment, and the overlapping description will be omitted.
The target sound ratio calculation section 202 calculates a target sound ratio for each predetermined period of time extracted by the transformation section 5, that is, for each temporal frame. The target sound ratio is expressed by Expression 6 below, assuming that an FFT length is the number of frequency components in one frame.
Target sound ratio=The number of frequencies that have been determined to be a target sound in one frame/FFT length. Expression 6
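Expression 6 may be illustrated by the following minimal sketch. The function name `target_sound_ratio` and the representation of the per-frequency determination results as a list of Boolean values are assumptions made for this example.

```python
def target_sound_ratio(is_target_flags):
    """Expression 6: number of target-sound frequencies / FFT length.

    `is_target_flags` holds one Boolean per frequency component of a frame.
    """
    fft_length = len(is_target_flags)
    if fft_length == 0:
        raise ValueError("a frame must contain at least one frequency component")
    target_count = sum(1 for flag in is_target_flags if flag)
    return target_count / fft_length


# 4 of the 8 frequency components were determined to be a target sound.
flags = [True, False, True, True, False, False, False, True]
print(target_sound_ratio(flags))
```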
Similar to the suppression coefficient calculation section 13 and the suppression coefficient calculation section 136, the suppression coefficient calculation section 204 calculates, based on Expression 1, a suppression coefficient for a frequency component that has been determined to be stationary by the stationary determination section 9. For a frequency component that has been determined to be a target sound, the suppression coefficient calculation section 204 sets “1” as a suppression coefficient, as expressed by Expression 2. When a frequency component is determined to be neither stationary nor a target sound, the suppression coefficient calculation section 204 calculates a suppression coefficient in accordance with the target sound ratio.
In the sound ratio-based coefficient data table 210, when the target sound ratio is equal to or larger than a first predetermined value Th1 set in advance (that is, when the target sound ratio is high), the suppression coefficient is calculated by Expression 4, similar to the non-stationary suppression coefficient calculated in the voice processing device 130 according to the second embodiment. For the sake of convenience, Expression 4 is described again below.
Target sound ratio (high): Suppression coefficient=Coefficient K(f)×Constant C×Noise-originating coefficient y. Expression 4
When the target sound ratio is less than the first predetermined value Th1 and is equal to or greater than a second predetermined value Th2, which is smaller than the first predetermined value Th1 (that is, when the target sound ratio is intermediate), the suppression coefficient is calculated by Expression 7 below. When the target sound ratio is less than the second predetermined value Th2 (that is, when the target sound ratio is low), the suppression coefficient is calculated by Expression 8 below.
Target sound ratio (intermediate): Suppression coefficient=Coefficient K(f)×Constant C. Expression 7
Target sound ratio (low): Suppression coefficient=Coefficient K(f). Expression 8
Note that the target sound ratio may be calculated for several voice signals obtained in advance, for example, in a state where noise is small, and then, the first predetermined value Th1 and the second predetermined value Th2 may be determined based on the degree of a distribution of the calculated target sound ratio.
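The three-level selection among Expressions 4, 7, and 8 may be sketched as follows. The function name `ratio_based_coefficient` and all numerical values (the coefficient K(f), the constant C, the noise-originating coefficient y, and the thresholds Th1 and Th2) are hypothetical values chosen only for illustration.

```python
def ratio_based_coefficient(ratio, k_f, c, y, th1, th2):
    """Select the suppression coefficient from the target sound ratio.

    ratio >= Th1        (high):         K(f) x C x y   (Expression 4)
    Th2 <= ratio < Th1  (intermediate): K(f) x C       (Expression 7)
    ratio < Th2         (low):          K(f)           (Expression 8)
    """
    if th2 >= th1:
        raise ValueError("Th2 must be smaller than Th1")
    if ratio >= th1:
        return k_f * c * y
    if ratio >= th2:
        return k_f * c
    return k_f


# Hypothetical values, not taken from the description.
K_F, C, Y, TH1, TH2 = 0.5, 0.8, 0.5, 0.6, 0.3
for ratio in (0.7, 0.4, 0.1):
    print(ratio, ratio_based_coefficient(ratio, K_F, C, Y, TH1, TH2))
```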
As illustrated in
As illustrated in
The transformation section 5 time-frequency transforms the voice signal and outputs a frequency spectrum on a frequency axis (S232). Time-frequency transformation is performed, for example, by cutting out a part of the voice signal on the time axis, which corresponds to a predetermined period of time, from the voice signal, and performing Fast Fourier Transform thereon. The stationary noise estimation section 7 estimates a target value of stationary noise, based on the frequency spectrum (S233). That is, the stationary noise estimation section 7 estimates a value of a stationary noise model for each frequency, based on an amplitude value for each frequency of the frequency spectrum on the frequency axis.
The noise-originating coefficient calculation section 11 calculates a noise-originating coefficient of “1” or less, which gradually decreases as the value of the stationary noise model increases (S234). In this case, for example, the noise-originating coefficient calculation section 11 calculates a noise-originating coefficient y with reference to the coefficient calculation table 32.
The stationary determination section 9 determines, based on the amplitude value for each frequency of the frequency spectrum on the frequency axis, whether a component for each frequency is stationary or non-stationary. Also, the target sound determination section 134 determines whether or not the component for each frequency is a target sound (S235). Details of the process in S235 will be described later. The target sound ratio calculation section 202 calculates a target sound ratio (S236). That is, based on a result of sound type determination, which will be described later, the target sound ratio calculation section 202 calculates a target sound ratio for each frame. The suppression coefficient calculation section 204 calculates a suppression coefficient for each frequency (S237). Details of suppression coefficient calculation processing will be described later.
The suppression signal generation section 15 generates a suppression signal obtained by multiplying an amplitude value for each frequency and the suppression coefficient together (S238). The inverse transformation section 17 frequency-time transforms the suppression signal (S239), and outputs the frequency-time transformed suppression signal (S240). When there is not an input to end a system (NO in S241), the voice processing device 200 repeats the processes in and after S231. When there is an input to end a system (YES in S241), the voice processing device 200 ends processing.
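The per-frame flow of time-frequency transformation (S232), multiplication by the suppression coefficient (S238), and frequency-time transformation (S239) may be sketched as follows. As an assumption made to keep the example self-contained, a naive discrete Fourier transform is used in place of a Fast Fourier Transform, and the suppression coefficients are simply passed in as a list. The function names `dft`, `idft`, and `suppress_frame` are hypothetical.

```python
import cmath


def dft(samples):
    """Naive DFT standing in for the FFT of S232 (O(N^2), illustration only)."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT standing in for the frequency-time transformation of S239."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def suppress_frame(frame, coefficients):
    """Scale each frequency component by its suppression coefficient (S238).

    A real-valued coefficient scales the amplitude and leaves the phase as-is.
    The coefficients are assumed to be symmetric between mirrored frequency
    bins so that the output remains a real-valued signal.
    """
    if len(frame) != len(coefficients):
        raise ValueError("one coefficient is needed per frequency component")
    spectrum = dft(frame)
    suppressed = [g * s for g, s in zip(coefficients, spectrum)]
    return idft(suppressed)


frame = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
print(suppress_frame(frame, [0.5] * len(frame)))
```

With a uniform coefficient of 0.5, the output frame is the input frame at half amplitude, up to floating-point rounding.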
Next, sound type determination processing will be described with reference to
As illustrated in
The target sound determination section 134 determines, for a frequency component that has been determined to be not stationary sound, whether or not the frequency component is a target sound (S256). When it is determined that the frequency component is a target sound (YES in S256), the target sound determination section 134 sets n=n+1 (S257). When it is determined that the frequency component is not a target sound (NO in S256), the target sound determination section 134 sets flg=2 (S258).
The stationary determination section 9 sets i=i+1 (S259). When the variable i is not equal to the FFT length FFT_N (NO in S260), the process returns to S253 and is repeated. When the variable i is equal to FFT_N, that is, the number of frequency components in one frame (YES in S260), the stationary determination section 9 ends sound type determination processing, and the process returns to the process illustrated in
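The loop over the frequency components in the sound type determination processing may be sketched as follows. The two predicates passed to `determine_sound_types` are hypothetical stand-ins for the determinations made by the stationary determination section 9 and the target sound determination section 134. The use of flg=0 for a stationary component is an assumption; flg=1 for a target sound and flg=2 for non-stationary noise follow the values used in the suppression coefficient calculation processing.

```python
def determine_sound_types(amplitudes, is_stationary, is_target_sound):
    """Return (flags, n) for one frame.

    flags[i] is 0 (stationary, assumed value), 1 (target sound), or
    2 (neither, that is, non-stationary noise). n counts the frequency
    components determined to be a target sound.
    """
    fft_n = len(amplitudes)
    flags = [0] * fft_n
    n = 0
    i = 0
    while i != fft_n:                              # S260
        if is_stationary(i, amplitudes[i]):
            flags[i] = 0                           # stationary (assumed flag value)
        elif is_target_sound(i, amplitudes[i]):    # S256
            flags[i] = 1
            n += 1                                 # S257
        else:
            flags[i] = 2                           # S258
        i += 1                                     # S259
    return flags, n


# Hypothetical predicates based only on amplitude, for illustration.
def small_amplitude(index, amplitude):
    return amplitude < 20.0


def moderate_amplitude(index, amplitude):
    return amplitude < 100.0


flags, n = determine_sound_types([10.0, 80.0, 12.0, 300.0],
                                 small_amplitude, moderate_amplitude)
print(flags, n)
```

The count n divided by the FFT length gives the target sound ratio of Expression 6.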
Subsequently, details of suppression coefficient calculation processing will be described with reference to
When flg=1 (NO in S272, YES in S274), the suppression coefficient calculation section 204 sets the suppression coefficient=1. When flg=2 (NO in S274), the suppression coefficient calculation section 204 calculates a non-stationary noise suppression coefficient (S276). That is, the suppression coefficient calculation section 204 calculates the non-stationary noise suppression coefficient for each frequency component, based on the target sound ratio calculated in the process illustrated in
As described in detail above, the voice processing device 200 according to the third embodiment performs noise suppression in accordance with a target sound ratio. The target sound ratio is calculated in accordance with the ratio of the frequency component that is determined to be a target sound in each frame. When the target sound ratio is high, a suppression coefficient is calculated such that non-stationary noise in the corresponding frame is further suppressed.
As described above, with the voice processing device 200 according to the third embodiment, in addition to the advantages of the voice processing device 1 according to the first embodiment and the voice processing device 130 according to the second embodiment, noise suppression in accordance with a target sound ratio may be advantageously performed on a non-stationary noise portion. For example, the accuracy of determining whether a sound is a target sound or a non-voice sound that is not a target sound is not 100%, and therefore, when noise is mistakenly determined to be a target sound, the suppression amount might vary drastically in the time direction. This causes a drastic change in amplitude and, in turn, a noise distortion. However, by performing noise suppression in a stepwise fashion in accordance with the target sound ratio, such a noise distortion may be reduced.
Note that, in the third embodiment, the target sound ratio is divided into three levels, but the number of levels is not limited thereto. A case where the target sound ratio is divided into more or fewer levels is construed to be in the range of modification of noise suppression according to this embodiment.
A voice processing device 300 according to a fourth embodiment will be described below with reference to the accompanying drawings. In the voice processing device 300 according to the fourth embodiment, similar configurations and operations to those in the first to third embodiments are denoted by the same reference characters as the reference characters in the first to third embodiments, and the overlapping description will be omitted.
In the voice processing device 300, instead of the target sound determination section 134 in the second embodiment and the third embodiment, the target sound determination section 307 determines whether or not a frequency component is a target sound. The voice processing device 300 receives two voice signals. The voice reception section 132 receives one of the voice signals. The voice reception section 303 receives the other one of the voice signals. The two voice signals are signals of voices obtained at different places (spatial positions) at the same time. The two voice signals may be, for example, signals based on voices collected by two microphones placed at different positions. The second transformation section 305 transforms a voice signal from the voice reception section 303 to a frequency spectrum on a frequency axis.
The target sound determination section 307 determines, based on a phase difference or an amplitude ratio between two frequency spectrums, whether or not the corresponding frequency component is a target sound. When the phase difference is used, whether or not the phase difference between the two frequency spectrums is a value that indicates the direction of a target sound is determined. That is, the target sound determination section 307 calculates a phase difference between the two frequency spectrums for each frequency, and determines whether or not the calculated phase difference is included in the range of the phase difference that is possible in the direction of a predetermined sound source.
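The phase-difference determination for a single frequency component may be sketched as follows. The function names, the representation of each frequency component as a complex number, and the fixed range of allowed phase differences are assumptions made for this example. In practice, the allowed range would depend on the frequency and on the arrangement of the microphones, which this sketch does not model.

```python
import cmath


def phase_difference(component_1, component_2):
    """Phase of component_1 relative to component_2, in radians.

    Multiplying by the complex conjugate yields the difference of the two
    phases, wrapped into the range [-pi, pi] by cmath.phase.
    """
    return cmath.phase(component_1 * component_2.conjugate())


def is_target_by_phase(component_1, component_2, min_diff, max_diff):
    """True when the phase difference lies in the range for the target direction."""
    return min_diff <= phase_difference(component_1, component_2) <= max_diff


# Hypothetical frequency components of the two spectrums (amplitude, phase).
first = cmath.rect(1.0, 0.5)
second = cmath.rect(2.0, 0.3)
other = cmath.rect(1.0, -1.0)

print(is_target_by_phase(first, second, 0.0, 0.4))  # difference of about 0.2
print(is_target_by_phase(other, second, 0.0, 0.4))  # difference of about -1.3
```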
Ra=(ds/(ds+d×cos θ))(0≦θ≦180). Expression 9
In
Rmin≦R≦Rmax, Rmin=ds/(ds+d×cos θmin), Rmax=ds/(ds+d×cos θmax). Expression 10
When a frequency component has an amplitude spectrum ratio that satisfies Expression 10, the target sound determination section 307 determines the frequency component to be a target sound.
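The amplitude-ratio determination of Expressions 9 and 10 may be sketched as follows. The function names and the numerical values of ds, d, θmin, and θmax are hypothetical. The sketch takes the amplitude spectrum ratio R as an already calculated input and assumes that ds is larger than d so that the denominator of Expression 9 stays positive over the whole range of θ.

```python
import math


def ratio_for_direction(ds, d, theta_deg):
    """Expression 9: Ra = ds / (ds + d x cos(theta)), with 0 <= theta <= 180 degrees."""
    if not 0.0 <= theta_deg <= 180.0:
        raise ValueError("theta must be within 0 to 180 degrees")
    return ds / (ds + d * math.cos(math.radians(theta_deg)))


def is_target_by_amplitude_ratio(r, ds, d, theta_min_deg, theta_max_deg):
    """Expression 10: a target sound when Rmin <= R <= Rmax."""
    r_min = ratio_for_direction(ds, d, theta_min_deg)
    r_max = ratio_for_direction(ds, d, theta_max_deg)
    return r_min <= r <= r_max


# Hypothetical values: ds = 0.10, d = 0.05, target direction from 0 to 60 degrees.
# Rmin = 0.10 / 0.15 (about 0.667) and Rmax = 0.10 / 0.125 (about 0.8).
print(is_target_by_amplitude_ratio(0.75, 0.10, 0.05, 0.0, 60.0))
print(is_target_by_amplitude_ratio(1.20, 0.10, 0.05, 0.0, 60.0))
```

Because cos θ decreases as θ increases from 0 to 180 degrees, the ratio of Expression 9 increases with θ, so θmin gives the lower bound Rmin and θmax gives the upper bound Rmax.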
Note that, in this embodiment, the target sound ratio calculation section 202 calculates a target sound ratio using the number of frequency components that have been determined to be a target sound based on a phase difference or the amplitude ratio between two frequency spectrums.
As described in detail above, in this embodiment, the target sound determination section 307 determines whether or not a frequency component is a target sound, based on a phase difference or an amplitude ratio between two voice signals, depending on whether or not the direction of a sound source indicates the direction of a target sound. Thus, when the direction of a sound source is defined, determination of a target sound may be performed using two voice signals collected at the same time. The voice processing device 300 according to the fourth embodiment may achieve similar advantages to those of the voice processing device 200 according to the third embodiment. Furthermore, the direction of a sound source that is desired to be saved as a voice may be specified, and noise suppression may be performed accordingly.
A modified example of a noise-originating coefficient will be described.
In the example of
y=1.0−a×x (a=1.53×10^−5). Expression 11
In the example of
y=1.0−b×x^2 (b=4.66×10^−10). Expression 12
As illustrated in
As described above, the noise-originating coefficient 360 or the noise-originating coefficient 362 according to this modified example is applied to any one of the first to fourth embodiments, and thus, similar to the advantages of each of the embodiments, noise suppression that does not cause a distortion may be performed. With the noise-originating coefficient 362, as compared to a case where the noise-originating coefficient 360 is used, the noise suppression amount may be advantageously further increased when the value x of the stationary noise model is large.
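The two modified noise-originating coefficients may be sketched as follows. The sketch assumes that Expression 11 reads y = 1.0 − a×x and that Expression 12 reads y = 1.0 − b×x², a reading that agrees with the noise-originating coefficient being “1” or less and decreasing as the value x of the stationary noise model increases. The function names and the sample values of x, including 32768 as an illustrative full-scale amplitude, are also assumptions.

```python
A = 1.53e-5    # constant a of Expression 11
B = 4.66e-10   # constant b of Expression 12


def linear_coefficient(x, a=A):
    """Assumed form of Expression 11: y = 1.0 - a * x."""
    return 1.0 - a * x


def quadratic_coefficient(x, b=B):
    """Assumed form of Expression 12: y = 1.0 - b * x^2."""
    return 1.0 - b * x * x


# Both coefficients start at 1.0 and decrease as x grows.
for x in (0.0, 8192.0, 16384.0, 32768.0):
    print(x, linear_coefficient(x), quadratic_coefficient(x))
```

Under this reading, both coefficients are close to 0.5 at x = 32768, and the quadratic form changes more steeply than the linear form at large values of x.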
An example of a computer that may be commonly used to execute the operations of the noise suppression methods according to the first to fourth embodiments and the modified example will be described below.
The CPU 402 is an arithmetic processing unit that controls the operation of the entire control section 400. The memory 404 is a storage section that stores, in advance, a program that controls the operation of the control section 400 and is used as a working area, as appropriate, when a program is executed. The memory 404 is, for example, a random access memory (RAM), a read only memory (ROM), or the like. The input device 406 is a device that obtains, when being operated by a user of the computer, inputs of various types of information from the user, which are associated with the contents of the operation, and sends the obtained input information to the CPU 402, and is, for example, a keyboard device, a mouse device, or the like. The output device 408 is a device that outputs a result of processing executed by the control section 400 and includes a display device or the like. For example, the display device displays a text and an image in accordance with display data sent by the CPU 402.
The external storage device 412 is, for example, a storage device, such as a hard disk, a flash memory, and the like, which stores various types of control programs that are executed by the CPU 402, obtained data, and the like. The medium driving device 414 is a device that writes and reads data to and from a removable recording medium 416. The CPU 402 may be configured to read out a predetermined control program stored in the removable recording medium 416 via the medium driving device 414 to execute the predetermined control program and thereby perform various types of control processing. The removable recording medium 416 is, for example, a compact disc (CD)-ROM, a digital versatile disc (DVD), a universal serial bus (USB) memory, or the like. The network connection device 418 is an interface device that performs management of wired or wireless communication of various types of data with an external device. The bus 410 is a communication path which connects the above-described devices together and through which data is communicated.
Programs that cause a computer to execute the noise suppression methods according to the first to fourth embodiments are stored, for example, in the external storage device 412. The CPU 402 reads out a program from the external storage device 412 to cause the control section 400 to perform the operation of noise suppression. In this case, first, a control program used for causing the CPU 402 to perform the operation of noise suppression is generated and is stored in the external storage device 412. Then, a predetermined instruction is given to the CPU 402 from the input device 406 to cause the CPU 402 to read out the control program from the external storage device 412 and execute the control program. As another option, the programs may be stored in the removable recording medium 416.
Note that the present disclosure is not limited to the above-described embodiments, and various configurations and embodiments may be employed without departing from the gist of the present disclosure. For example, the first to fourth embodiments and the modified example are not limited to the description above, but may be combined as long as it is logically possible to combine them.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2014-040649 | Mar 2014 | JP | national |
Number | Name | Date | Kind |
---|---|---|---|
20070156399 | Matsuo | Jul 2007 | A1 |
20080192956 | Kazama | Aug 2008 | A1 |
20100092000 | Kim | Apr 2010 | A1 |
20100250246 | Matsumoto | Sep 2010 | A1 |
20100296665 | Ishikawa | Nov 2010 | A1 |
20110081026 | Ramakrishnan | Apr 2011 | A1 |
20130191118 | Makino | Jul 2013 | A1 |
20130216058 | Furuta et al. | Aug 2013 | A1 |
20130251170 | Every | Sep 2013 | A1 |
20140200887 | Nakadai | Jul 2014 | A1 |
20140241546 | Matsumoto | Aug 2014 | A1 |
Number | Date | Country |
---|---|---|
2001-267973 | Sep 2001 | JP |
2007-183306 | Jul 2007 | JP |
2010-204392 | Sep 2010 | JP |
2010-230814 | Oct 2010 | JP |
WO 2012098579 | Jul 2012 | WO |
Entry |
---|
Extended European Search Report issued Jul. 16, 2015 in corresponding European Patent Application No. 15156291.5. |
Masanori Kato et al., “Noise Suppression with High Speech Quality Based on Weighted Noise Estimation and MMSE STSA”, Electronics and Communications in Japan, Part 3, Vol. 89, No. 2, 2006, pp. 43-53. |
Nils Westerlund et al., “Speech enhancement for personal communication using an adaptive gain equalizer”, Signal Processing 85 (2005), pp. 1089-1101. |
Number | Date | Country | |
---|---|---|---|
20150248895 A1 | Sep 2015 | US |