The present invention relates to a noise reduction device, a noise reduction program and a noise reduction method, all of which make it possible to adaptively learn each of adaptive coefficients used respectively for obtaining estimated values of stationary noise and non-stationary noise at the same time, to thereby improve an effect of noise suppression, and to thus enhance speech adequate for speech recognition in an environment where both the stationary noise and the non-stationary noise are present.
First of all, descriptions will be provided for the current status of an in-vehicle speech recognition system which constitutes the background of the present invention. The in-vehicle speech recognition system has reached a level of practical use where it is applied mainly to the inputting of commands, addresses and the like in a car navigation system. In reality, however, CD music needs to be stopped from being played, or passengers need to refrain from talking, while speech recognition is being performed. In addition, speech recognition cannot be performed while a crossing bell is sounding at a nearby railroad crossing. Consequently, reviewing the present level of development of in-vehicle speech recognition, one may think that many restraints are still imposed on use of the in-vehicle speech recognition system, and that the system is still technically in a transition period.
One may think that noise robustness in the in-vehicle speech recognition system will be achieved step by step through its technological development ladder 1 to 5 as shown in
In the case of its development ladder 1, a multi-style training technique and a spectral subtraction technique have made great contributions to enhancing the noise robustness. The multi-style training technique is a technique for using sound, in which various noises are superimposed on speeches uttered by humans, for the adaptive learning of an acoustic model. In addition, stationary noise components are subtracted from an observed signal by use of the spectral subtraction technique, both when speech recognition is performed and when an acoustic model is adaptively trained. These techniques have remarkably enhanced noise robustness. As a consequence, the speech recognition system has reached the level of practical use as far as the stationary cruising noise is concerned.
The sounds coming from the CD/radio to be treated in its development ladder 2 are non-stationary noise as in the case of the non-stationary environment noise to be treated in its development ladder 3. However, the sounds coming from the CD/radio are different from the non-stationary environment noise in that they come from specific in-vehicle appliances. For this reason, electric signals which have not yet been converted to the sounds can be used, as reference signals, in order to suppress noise. A system for suppressing noise by use of such electric signals is termed an echo canceller. It is known that the echo canceller exhibits high performance in a silent environment where no noise exists except for sounds from the CD/radio. For this reason, it is expected that both the echo canceller and the spectral subtraction technique are used in the development ladder 2 of the in-vehicle speech recognition system. It is known, however, that performance of a conventional echo canceller is degraded in a vehicle compartment of a car which is moving. This is because noise, including driving noise irrelevant to reference signals, is observed at the same time as the reference signals are observed.
x=r*g
where * denotes a convolution calculation.
In this respect, the echo canceller 40 can cancel the echo signal x through the following process. An estimated value h of the impulse response g is figured out in an adaptive filter 42. Thus, an estimated echo signal r*h is generated. In a subtraction unit 43, the estimated echo signal r*h is subtracted from a signal In of sound received by the microphone 1. Thereby, the echo signal x can be cancelled. In general, a filter coefficient h is learned in a non-speech segment by use of a least-mean-square (LMS) algorithm or a normalized least-mean-square (N-LMS) algorithm. The echo canceller takes both a phase and an amplitude into consideration. For this reason, it can be expected that the echo canceller brings about a higher performance as far as a silent environment is concerned. It is known, however, that the performance decreases when environment noise around the echo canceller is high.
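As an illustration of the time-domain echo cancellation described above, the following is a minimal sketch of an N-LMS adaptive filter. It is not the implementation of the echo canceller 40. The function name `nlms_echo_cancel`, the parameters `taps`, `mu` and `eps`, and the synthetic echo path `g` are assumptions introduced only for this example. For simplicity the sketch adapts on every sample, whereas the text states that the filter coefficient is generally learned in a non-speech segment.

```python
import random

def nlms_echo_cancel(mic, ref, taps=4, mu=0.5, eps=1e-8):
    """Subtract an adaptive estimate r*h of the echo from the microphone signal.

    mic, ref: equal-length lists of time-domain samples.
    Returns (residual signal, learned filter h).
    """
    h = [0.0] * taps                      # running estimate of the impulse response g
    residual = []
    for n in range(len(mic)):
        # Latest `taps` reference samples, newest first, zero-padded at the start.
        window = [ref[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        echo_hat = sum(hk * rk for hk, rk in zip(h, window))   # estimated echo (r*h)[n]
        err = mic[n] - echo_hat                                 # role of the subtraction unit
        power = sum(rk * rk for rk in window) + eps             # N-LMS normalisation term
        h = [hk + mu * err * rk / power for hk, rk in zip(h, window)]
        residual.append(err)
    return residual, h

# Synthetic check (assumed data): known echo path, no speech, no background noise.
rng = random.Random(0)
g = [0.6, -0.3, 0.1, 0.0]
ref = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
mic = [sum(g[k] * ref[n - k] for k in range(len(g)) if n - k >= 0)
       for n in range(len(ref))]
residual, h = nlms_echo_cancel(mic, ref)
```

In this noise-free setting the learned filter `h` approaches `g`. Adding background noise to `mic` that is unrelated to `ref` disturbs the update, which corresponds to the performance decrease in a noisy environment mentioned above.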
As measures to increase performance of the echo canceller in a noisy environment, one may conceive that noise reduction is performed before echo cancellation is performed. In theory, however, the noise reduction using the spectral subtraction technique cannot be performed ahead of an echo canceller operating in the time domain. In addition, if noise reduction is designed to be performed by use of a filter, the echo canceller cannot follow change in the filter. Furthermore, if the noise reduction is performed before the echo cancellation is performed, this brings about a problem that echo components obstruct the estimating of stationary noise components for the purpose of the noise reduction. For this reason, there have been few cases where the noise reduction is performed before the echo cancellation is performed.
If an echo canceller using the spectral subtraction technique or a Wiener filter in the frequency domain is adopted as the echo canceller 70 in the rear stage, the noise reduction can be performed before the echo cancellation is performed, or at the same time as the echo cancellation is performed. In this case, however, echo components are included in noise components to be reduced, in the noise reduction unit 60. This makes it difficult to estimate stationary noise components exactly. With this difficulty taken into consideration, an application of the noise reduction device disclosed in Non-patent Literature 1 is limited to talks on the phone. The noise reduction device disclosed in Non-patent Literature 1 is designed to measure stationary noise components during a time when the two calling parties utter no speech, or during a time when only background noise exists.
In the case of these conventional noise reduction devices, the respective echo cancellers are constituted in a two-stage manner. These constitutions make it possible to reduce echo more securely. In the case of each of the noise reduction devices disclosed in Non-patent Literatures 3 and 4, echo components which are as large as designated by an estimate value of the echo are reduced as they are. For this reason, the echo components cannot be eliminated completely. In addition, in the case of the noise reduction device disclosed in Non-patent Literature 3, flooring is performed on the basis of a value of output from the preprocessing. In the case of the noise reduction device disclosed in Non-patent Literature 4, an original sound adding method for improving audibility is adopted. In each of the two cases, echo elements cannot be reduced to zero. On the other hand, in a case where residual noise is in the form of music or spoken news, no matter how much the power of the residual noise may be weakened, it is likely that the noise is treated as human speech, and that this treatment leads to a false recognition, when speech recognition is intended to be performed.
Non-patent Literature 4 also refers to a scheme for dealing with reverberation of echo. According to this scheme, while an echo cancellation process is being performed, an estimated value of echo, which has been found in a previous frame, is multiplied by a coefficient, and a value thus obtained is added to an estimated value of echo in the current frame. Thereby, the echo cancellation process is performed on both echo components and reverberation components. However, this brings about a problem that the coefficient needs to be given corresponding to an environment in a room in advance, and that the coefficient is not determined automatically.
An echo canceller using a power spectrum in the frequency domain can deal with not only a case where echo and reference signals to be referred to in order to reduce the echo are in the form of monophonic signals, but also a case where they are in the form of stereo signals. Specifically, a power spectrum of a reference signal may be defined as a weighted average of the right and left reference signals, and the weight may be determined in accordance with a degree of a correlation among the observed signal as well as its right and left reference signals, as described in Deligne, S., and Gopinath, R. [2001]. “Robust Speech Recognition with Multi-channel Codebook Dependent Cepstral Normalization (MCDCN),” Conference Proceedings of ASRU, 2001, pp. 151-154. In a case where a pre-process is intended to be performed for an echo canceller in the time domain, a stereo echo canceller technique, on which many research results have been disclosed, may be applied to the pre-process.
Thus, an aspect of the present invention is to provide a noise reduction technique which makes it possible to improve noise robustness in an environment where non-stationary noise, such as sounds coming from the CD/radio, exists in addition to stationary noise. The aspect is achieved by effective use of existing acoustic models and the like, without changing the framework of the spectral subtraction technique described above to a large extent.
Another aspect of the present invention is to provide a noise reduction technique which makes it possible to estimate stationary noise components even in conditions where echo sound always exists.
Another aspect of the present invention is to provide a noise reduction technique which makes it possible to more fully reduce echo components which are the chief cause of a source error in recognized characters. The aspect can be achieved by means of maintaining compatibility between the noise reduction technique and the acoustic model when stationary noise is intended to be reduced.
In another aspect of the present invention, an observed signal can be obtained by converting the sound wave to an electric signal and by thereafter converting the electric signal to a signal in the frequency domain.
In still another aspect of the present invention, an observed signal and a reference signal can be obtained by converting a signal in the time domain to a signal in the frequency domain in each predetermined frame.
In the case of yet another aspect of the present invention, each of the adaptive coefficients to be obtained by the learning is used in a noise segment where the observed signal does not include non-stationary noise components.
For a more complete understanding of the present invention and the advantages thereof, reference is now made to the following description taken in conjunction with the accompanying drawings, in which:
FIGS. 3(a) and 3(b) are diagrams respectively showing how the system shown in
FIGS. 4(a) and 4(b) are diagrams respectively showing, in cooperation with
As described above, the spectral subtraction technique is widely used in a speech recognition process nowadays. With this taken into consideration, the present invention provides a noise reduction technique which makes it possible to improve noise robustness in an environment where non-stationary noise, such as sounds coming from the CD/radio, exists in addition to stationary noise. This is achieved by effective use of existing acoustic models and the like, without changing the framework of the spectral subtraction technique to a large extent.
In addition, in a case where sounds coming from the in-vehicle CD/radio are a sound source of echo, it can not be expected that a time during which no echo exists occurs. For this reason, stationary noise components can not be estimated exactly by use of the conventional techniques as shown in
Moreover, the conventional technique as shown in
Furthermore, in the case of the aforementioned scheme for dealing with reverberation of echo, a coefficient by which to multiply an estimated value of the echo which has been figured out in the previous frame needs to be given corresponding to an environment of a room in advance. This brings about a problem that the coefficient cannot be determined automatically. Accordingly, the present invention further provides a noise reduction technique which makes it possible to reduce the reverberation of the echo while learning the coefficient whenever necessary.
In the case of a noise reduction device, a noise reduction program and a noise reduction method, a predetermined constant is calculated by use of its adaptive coefficient, and a predetermined reference signal in the frequency domain is calculated by use of its adaptive coefficient. Thereby, estimated values are obtained respectively for stationary noise components included in a predetermined observed signal in the frequency domain and non-stationary noise components corresponding to the reference signal. Subsequently, a noise reduction process is applied to the observed signal on the basis of each of the estimated values. Based on the results, each of the adaptive coefficients is updated. Each of the adaptive coefficients is learned by means of obtaining the estimated values and updating the adaptive coefficients in a repetitive manner.
In this respect, the noise reduction device, the noise reduction program and the noise reduction method are, for example, what is used for a speech recognition system and a hands-free telephone. The noise reduction process is, for example, that which uses the spectral subtraction technique or the Wiener filter.
In the case of this configuration, when the estimated values respectively of the stationary noise components and the non-stationary noise components included in the observed signal are obtained, the noise reduction process is applied to the observed signal on the basis of each of the estimated values. Based on this result, each of the adaptive coefficients is updated. Based on each of the adaptive coefficients thus updated, each of the estimated values is figured out once again. Each of the adaptive coefficients is learned through repeating this learning step. In other words, each time the learning step is performed, both of the adaptive coefficients are sequentially updated on the basis of a result of performing the noise reduction process by use of the estimated values respectively of the stationary noise and the non-stationary noise. Simultaneously, both of the adaptive coefficients are learned. If the noise reduction process is applied to the observed signal on the basis of the estimated values to be obtained by means of applying the respective adaptive coefficients which are finally obtained through this learning process, the stationary noise components and the non-stationary noise components can be reduced from the observed signal in a satisfactory manner.
In the case of the present invention, the adaptive coefficients respectively of the stationary noise components and the non-stationary noise components are designed to be learned at the same time. For this reason, the noise reduction process can be performed more exactly in comparison with a conventional scheme. In the case of the conventional scheme, a noise reduction process is performed on the basis of a result of learning components of one of the stationary noise and the non-stationary noise. Thereafter, with regard to the observed signal to which the noise reduction process has thus been applied, components of the other of the stationary noise and the non-stationary noise are learned separately. Thus, a result of this learning is reflected on the noise reduction process at high exactness.
In a case of the present invention, an observed signal is obtained by converting the sound wave to an electric signal and by thereafter converting the electric signal to a signal in the frequency domain. In addition, a reference signal can be obtained by converting, to a signal in the frequency domain, a signal corresponding to sound coming from a sound source of non-stationary noise which is a cause of non-stationary noise components included in the observed signal. A sound wave is converted to an electric signal, for example, by use of a microphone. An electric signal is converted to a signal in the frequency domain, for example, by use of the discrete Fourier transform (DFT). A sound source of non-stationary noise includes, for example, a CD player, a radio, a machine which produces non-stationary operating sound and a speaker of a telephone. A signal corresponding to sound coming from a sound source of non-stationary noise includes, for example, a speech signal which is in the form of an electric signal generated in a sound source of non-stationary noise, and what is in the form of an electric signal converted from sound coming from a sound source of non-stationary noise.
In this case, before the electric signal is converted to a signal in the frequency domain, an echo cancellation in the time domain may be applied to the electric signal on the basis of the reference signal which has not yet been converted to a signal in the frequency domain.
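The conversion of one time-domain frame to a frequency-domain power spectrum by the DFT can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `power_spectrum` and the test frame are hypothetical, a naive O(N²) DFT is used for self-containment, and no frame windowing or overlap is applied because the text does not specify either.

```python
import cmath
import math

def power_spectrum(frame):
    """Return |DFT|^2 for bins 0..N/2 of one time-domain frame (naive DFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        spectrum.append(abs(coeff) ** 2)
    return spectrum

# Assumed example frame: a cosine with exactly two cycles in eight samples,
# so all of its power falls into bin 2.
frame = [math.cos(2 * math.pi * 2 * t / 8) for t in range(8)]
x_power = power_spectrum(frame)
```

The same function would be applied to the observed signal and to the reference signal, frame by frame, to obtain the two spectra used in the subsequent processing.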
In another case of the present invention, an observed signal and a reference signal are obtained by converting a signal in the time domain to a signal in the frequency domain in each predetermined frame. In this case, estimated values of non-stationary noise components in each predetermined frame are obtained on the basis of reference signals in a plurality of predetermined frames preceding the frame. In addition, a coefficient for the reference signal is any one of a plurality of coefficients respectively for the reference signals in the plurality of predetermined frames.
In this case, a noise reduction process is performed by means of subtracting, from the observed signal, estimated values respectively of the stationary noise components and the non-stationary noise components. In addition, the learning is performed by means of updating the adaptive coefficients in a way that makes smaller a mean-square value of the difference between the observed signal and a sum of the estimated values respectively of the stationary noise components and the non-stationary noise components in each predetermined frame.
In another case of the present invention, each of the adaptive coefficients to be obtained by the learning is used in a noise segment where the observed signal does not include non-stationary noise components. In addition, the estimated values respectively of stationary noise components and non-stationary noise components included in the observed signal are obtained on the basis of the reference signal in a non-noise segment where the observed signal includes the non-stationary noise components. Thereby, a noise reduction process is applied to the observed signal on the basis of each of the estimated values. In this case, if the non-stationary components are based on speech uttered by a speaker, an output as a result of the noise reduction process is used for a speech recognition process to be applied to the speech uttered by the speaker.
In this case, the noise reduction process is performed by means of subtracting, from the observed signal, the estimated values respectively of the stationary noise components and the non-stationary noise components. In this respect, before the subtraction process is performed, the estimated values of the stationary noise components may be multiplied by a first subtraction coefficient. As a value of the first subtraction coefficient, a value may be used which is equivalent to that taken on by a subtraction coefficient used for reducing stationary noise components by means of the spectral subtraction technique when the acoustic model to be used for the speech recognition is learned. The “equivalent value” includes not only a “value equal” to that taken on by the subtraction coefficient but also a value in a range in which expected effects of the present invention are obtained. Furthermore, in this case, before the subtraction process is performed, the estimated values of the non-stationary noise components may be multiplied by a second subtraction coefficient. To this end, a value larger than that taken on by the first subtraction coefficient may be used as a value taken on by the second subtraction coefficient.
The noise reduction unit 10 reduces stationary noise by use of the echo canceller and the spectral subtraction technique integrally. In other words, the noise reduction unit 10 obtains, through the adaptive learning, an adaptive coefficient Wω(m) to be used for calculating an estimated value Qω(T) of the power spectrum in echo included in the observed signal Xω(T), in a non-speech segment where no speech exists. During the process, the noise reduction unit 10 figures out an estimated value Nω of the power spectrum of the stationary noise included in the observed signal Xω(T). On the basis of a result of this, the noise reduction unit 10 performs the echo cancellation process, and reduces the stationary noise, in a speech segment where speech s exists.
The noise reduction unit 10 includes an adaptation unit 11, multiplication units 12 and 13, a subtraction unit 14, a multiplication unit 15, and a flooring unit 16. The adaptation unit 11 calculates the estimated values Qω(T) and Nω on the basis of the adaptive coefficient Wω(m). The multiplication unit 12 multiplies the estimated value Nω by a subtraction weight α1. The multiplication unit 13 multiplies the estimated value Qω(T) by a subtraction weight α2. The subtraction unit 14 subtracts outputs of the multiplication units 12 and 13 from the observed signal Xω(T) and outputs a result Yω(T) of the subtraction. The multiplication unit 15 multiplies the estimated value Nω by a flooring coefficient β. The flooring unit 16 outputs a power spectrum Zω(T) which is used when a speech recognition process is applied to the speech s. When the adaptive learning is performed in the non-speech segment, the adaptation unit 11 makes reference to the reference signal Rω(T) in each sound frame, and hence updates the adaptive coefficient Wω(m) by means of using an output Yω(T) from the subtraction unit 14 as an error signal Eω(T). On the basis of the adaptive coefficient Wω(m), the adaptation unit 11 calculates the estimated values Nω and Qω(T). In addition, when the adaptive learning is performed in the speech segment, the adaptation unit 11 calculates the estimated value Qω(T), and outputs the estimated value Nω, on the basis of the reference signal Rω(T) and the adaptive coefficient Wω(m) on which the learning has been performed.
The subtraction weights α1 and α2 by which the estimated values Nω and Qω(T) are multiplied respectively in the multiplication units 12 and 13 shown in
$$E_\omega(T) = X_\omega(T) - Q_\omega(T) - N_\omega \tag{1}$$
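The estimated value Qω(T) of the echo is expressed by the following equation by use of the reference signal Rω(T−m) (m=0, . . . , M−1), covering the current frame and the previous M−1 frames, and the adaptive coefficient Wω(m).
$$Q_\omega(T) = \sum_{m=0}^{M-1} W_\omega(m)\, R_\omega(T-m) \tag{2}$$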
The reason why the reference signal Rω(T−m) representing the previous M−1 frames is referred to is that a reverberation whose length exceeds one frame is intended to be dealt with. The estimate value Nω of the stationary noise is defined by Equation (3) for reasons of convenience.
$$W_\omega(M) = N_\omega / \mathrm{Const} \tag{3}$$
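On the basis of the definitions respectively of Equations (2) and (3), Equation (1) can be expressed by Equation (4).
$$E_\omega(T) = X_\omega(T) - \sum_{m=0}^{M-1} W_\omega(m)\, R_\omega(T-m) - W_\omega(M) \cdot \mathrm{Const} \tag{4}$$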
The adaptive coefficient Wω(m) can be figured out through the adaptive learning in a way that minimizes Equation (5) in the non-speech segment.
$$\Phi_\omega = \mathrm{Expect}\left[\{E_\omega(T)\}^2\right] \tag{5}$$
where Expect[ ] denotes a manipulation of an expected value.
A manipulation for calculating an average of the frames in the non-speech segment is performed as the manipulation of the expected value. In this respect, a total sum of frames up to the Tth frame in the non-speech segment is expressed by the following symbol.
When Equation (5) is minimized, the following equation can be established.
Consequently, the following relationships can be obtained.
Consequently, the adaptive coefficient Wω(m) can be figured out by use of the following equation.
$$B_\omega = A_\omega^{-1} \cdot C_\omega \tag{10}$$
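The batch solution of Equation (10) can be sketched as follows for a single frequency bin. This is a sketch under assumptions, not the definitive form of the matrices: the text does not reproduce the definitions of Aω and Cω, so the code assumes the standard least-squares normal equations, in which Aω accumulates products of the regressor vector [Rω(T), . . . , Rω(T−M+1), Const] with itself, Cω accumulates products of Xω(T) with that vector, and Bω collects Wω(0), . . . , Wω(M). The names `solve` and `learn_batch` and the example data are hypothetical.

```python
def solve(a, c):
    """Solve a.b = c by Gauss-Jordan elimination with partial pivoting."""
    n = len(c)
    aug = [row[:] + [c[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [v - f * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def learn_batch(x, r, m_frames, const=1.0):
    """x, r: per-frame power values of one bin over non-speech frames.

    Returns (echo coefficients W(0..M-1), stationary noise estimate N).
    """
    dim = m_frames + 1
    a = [[0.0] * dim for _ in range(dim)]    # assumed form of matrix A
    c = [0.0] * dim                          # assumed form of vector C
    for t in range(m_frames - 1, len(x)):
        u = [r[t - m] for m in range(m_frames)] + [const]
        for i in range(dim):
            c[i] += x[t] * u[i]
            for j in range(dim):
                a[i][j] += u[i] * u[j]
    w = solve(a, c)                          # B = A^-1 . C without forming A^-1
    return w[:m_frames], w[m_frames] * const # N = W(M) . Const, per Equation (3)

# Assumed synthetic data: echo = 0.5 R(T) + 0.2 R(T-1), stationary noise = 3.0.
r = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0, 3.0, 6.0, 9.0, 2.0]
x = [0.5 * r[t] + 0.2 * (r[t - 1] if t > 0 else 0.0) + 3.0 for t in range(len(r))]
w_echo, n_hat = learn_batch(x, r, m_frames=2)
```

Because the constant term enters the regressor like an additional reference signal, the echo coefficients and the stationary noise level are obtained from one and the same linear system.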
If the aforementioned method is performed, an inverse matrix of the matrix Aω needs to be found. For this reason, an amount of the calculation is relatively large. If an approximation for a diagonalization is applied to the matrix Aω, an approximate value of Wω(m) can also be figured out sequentially as follows.
where ΔWω denotes an amount of the updating of Wω(m) in the frame T, ALMS denotes an update coefficient, and BLMS denotes a constant for stability.
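A sequential update of this kind can be sketched as follows. The exact update equations are not reproduced in the text, so the sketch assumes a normalized-LMS form in which the step is scaled by an update coefficient and the normalisation term is regularised by a stability constant. The function name `sequential_update`, the parameter names `a_lms` and `b_lms`, their default values, and the synthetic data are all assumptions made for illustration.

```python
def sequential_update(w, r_history, x_t, a_lms=0.5, b_lms=1e-6, const=1.0):
    """One assumed N-LMS-style step for a single frequency bin.

    w: [W(0), ..., W(M-1), W(M)]; r_history: [R(T), ..., R(T-M+1)].
    Returns (updated w, error signal E(T)).
    """
    u = list(r_history) + [const]                       # reference terms plus constant term
    err = x_t - sum(wi * ui for wi, ui in zip(w, u))    # error signal, as in Equation (1)
    norm = sum(ui * ui for ui in u) + b_lms             # b_lms keeps the division stable
    new_w = [wi + a_lms * err * ui / norm for wi, ui in zip(w, u)]
    return new_w, err

# Assumed synthetic non-speech data with M = 1: X = 0.5 R + 3.0.
r = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0, 3.0, 6.0, 9.0, 2.0]
w = [0.0, 0.0]
for step in range(5000):
    r_t = r[step % len(r)]
    x_t = 0.5 * r_t + 3.0
    w, err = sequential_update(w, [r_t], x_t)
n_hat = w[1] * 1.0                                      # N = W(M) . Const with Const = 1.0
```

No matrix inversion is needed here, which reflects the motivation given above for the sequential approach; the price is that the coefficients are reached only after a number of frames.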
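In the non-speech segment, the power spectrum Yω(T) as the consequence of reducing the stationary noise and the echo from the observed signal Xω(T) can be obtained by use of Wω(m) found in the non-speech segment in the aforementioned manner. In the speech segment, the power spectrum Yω(T) can be obtained in accordance with Equation (12), or Equation (13) which is obtained by applying Equations (2) and (3) to Equation (12).
$$Y_\omega(T) = X_\omega(T) - \alpha_2 \cdot Q_\omega(T) - \alpha_1 \cdot N_\omega \tag{12}$$
$$Y_\omega(T) = X_\omega(T) - \alpha_2 \sum_{m=0}^{M-1} W_\omega(m)\, R_\omega(T-m) - \alpha_1 \cdot W_\omega(M) \cdot \mathrm{Const} \tag{13}$$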
The acoustic model to be used for a speech recognition process has been heretofore learned with only stationary noise taken into consideration. For this reason, the acoustic model can be applied to the speech recognition process to be performed on the basis of the output Zω(T) in this system, if a value equal to that of the subtraction weight in the spectral subtraction applied when the acoustic model is learned is used as a value of the subtraction weight α1 to be assigned to the estimated value Nω of the stationary noise. The application of the acoustic model to the speech recognition process makes it possible to tune, to the best extent possible, performance of the speech recognition to be performed in a case where no echo exists. If a value larger than α1 is used as a value of the subtraction weight α2 to be assigned to the estimated value Qω(T) of the echo, this use makes it possible to more fully reduce echo which is not included when the acoustic model is learned. This makes it possible to remarkably enhance performance of the speech recognition to be performed in a case where the echo exists.
In general, in a case where the spectral subtraction technique is applied to the noise reduction process to be performed as the pre-process for the speech recognition process, adequate flooring is essentially required to be performed. This flooring can be performed, by use of the estimated value Nω of the stationary noise, in accordance with Equations (14a) and (14b), where β denotes the flooring coefficient. If β is set equal to the flooring coefficient used in the noise reduction process that was performed when the acoustic model, which is to be used for the speech recognition performed on the basis of the output Zω(T) in this system, was learned, this makes it possible to enhance exactness of the speech recognition process.
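$$Z_\omega(T) = Y_\omega(T) \quad \text{if } Y_\omega(T) \geq \beta \cdot N_\omega \tag{14a}$$
$$Z_\omega(T) = \beta \cdot N_\omega \quad \text{if } Y_\omega(T) < \beta \cdot N_\omega \tag{14b}$$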
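The weighted subtraction performed by the subtraction unit 14 and the flooring of Equations (14a) and (14b) can be sketched for one frequency bin as follows. The function name `subtract_and_floor` and the default values of `alpha1`, `alpha2` and `beta` are hypothetical; the text only states that α2 may be larger than α1 and that α1 and β should match the values used when the acoustic model was learned.

```python
def subtract_and_floor(x, q, n, alpha1=1.0, alpha2=2.0, beta=0.1):
    """Return Z for one bin and frame.

    x: observed power, q: estimated echo power, n: estimated stationary noise power.
    """
    y = x - alpha2 * q - alpha1 * n      # weighted subtraction of both estimates
    floor = beta * n                     # flooring level derived from the noise estimate
    return y if y >= floor else floor    # Equations (14a) and (14b)

z_loud = subtract_and_floor(x=10.0, q=2.0, n=1.0)    # y = 5.0, above the floor
z_quiet = subtract_and_floor(x=3.0, q=2.0, n=1.0)    # y = -2.0, replaced by beta * n
```

The floor prevents negative or near-zero power values, which would otherwise arise whenever the weighted estimates exceed the observed power in a bin.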
Through this flooring, the power spectrum Zω(T) which is inputted into the speech recognition, and which is the consequence of reducing the stationary noise and the echo, can be obtained. If the inverse discrete Fourier transform (I-DFT) is applied to Zω(T), and concurrently if a phase of the observed signal is used, speech z(t) in the time domain which is actually audible to the human ears can be obtained.
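The resynthesis of a time-domain frame from a processed power spectrum and the phase of the observed signal can be sketched as follows. This is a minimal sketch under assumptions: the amplitude is taken as the square root of the power, all N bins of the frame are supplied, naive O(N²) transforms are used, and the names `dft` and `resynthesize` and the example frame are hypothetical. Overlap-add across frames is not shown because the text does not describe it.

```python
import cmath
import math

def dft(frame):
    """Naive DFT over all N bins of one frame."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
            for k in range(n)]

def resynthesize(z_power, observed_spectrum):
    """Combine processed power with the observed phase, then apply the inverse DFT."""
    n = len(observed_spectrum)
    spectrum = [cmath.rect(math.sqrt(p), cmath.phase(o))   # amplitude from Z, phase from X
                for p, o in zip(z_power, observed_spectrum)]
    return [sum(s * cmath.exp(2j * math.pi * k * t / n)
                for k, s in enumerate(spectrum)).real / n
            for t in range(n)]

# Assumed example: an unmodified power spectrum reproduces the frame,
# and scaling the power by 0.25 halves the amplitude.
frame = [1.0, 2.0, 0.5, -1.0, 0.0, 3.0, -2.0, 1.5]
observed = dft(frame)
power = [abs(c) ** 2 for c in observed]
z_same = resynthesize(power, observed)
z_half = resynthesize([0.25 * p for p in power], observed)
```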
FIGS. 3(a), 3(b), 4(a) and 4(b) show how the addition of the constant term Const to Equation (4) representing the error signal Eω(T) to be used for the adaptive learning enables the stationary noise components to be estimated at the same time as an adaptive coefficient W concerning the reference signal R is estimated. Incidentally, the figures show it in a case where a value representing the number M of frames in the reference signal R to be used for calculating the estimated value of the echo components is defined as “1” for reasons of simplification.
On the other hand,
Then, by use of a publicly-known method based on the power of the observed signal and the like, the system determines, in step 33, whether or not the frame for which the power spectra Xω(T) and Rω(T) have been obtained this time belongs to a speech segment where a speaker utters speech. In a case where the system determines that the frame does not belong to a speech segment, the system proceeds to step 34. In a case where the frame belongs to a speech segment, the system proceeds to step 35.
In step 34, the system updates the estimated value of the stationary noise and the adaptive coefficient of the echo canceller. Specifically, the adaptation unit 11 finds the adaptive coefficient Wω(m) by use of Equations (7) to (10), and finds the estimated value Nω of the power spectrum of the stationary noise included in the observed signal. Incidentally, instead of this, the adaptive coefficient Wω(m) and the estimated value Nω of the power spectrum of the stationary noise may be sequentially updated by use of Equations (11a) and (11b). Subsequently, the system proceeds to step 35.
In step 35, the adaptation unit 11 finds the estimated value Qω(T) of the power spectrum of the echo included in the observed signal, by use of Equation (2), on the basis of the adaptive coefficient Wω(m) and the reference signals of the previous M−1 frames. Thereafter, in step 36, the multiplication units 12 and 13 multiply the estimated values Nω and Qω(T) thus figured out by the subtraction weights α1 and α2, respectively. The subtraction unit 14 subtracts the results of the multiplications from the power spectrum Xω(T) of the observed signal in accordance with Equation (12), accordingly obtaining the power spectrum Yω(T) as the consequence of reducing the stationary noise and the echo.
Thence, in step 37, the flooring is performed by use of the estimated value Nω of the stationary noise. Specifically, the multiplication unit 15 multiplies the estimated value Nω of the stationary noise, which has been found by the adaptation unit 11, by the flooring coefficient β. The flooring unit 16 compares the multiplication result β·Nω and the output Yω(T) from the subtraction unit 14 in accordance with Equations (14a) and (14b). The flooring unit 16 outputs Yω(T) as a value representing the power spectrum Zω(T) to be outputted therefrom, if Yω(T)≧β·Nω. The flooring unit 16 outputs β·Nω as a value representing the power spectrum Zω(T) to be outputted therefrom, if Yω(T)<β·Nω. In step 38, the flooring unit 16 outputs the power spectrum Zω(T) for one frame, which the flooring is applied to in this manner.
Subsequently, the system determines, in step 39, whether or not the sound frame to which the process is applied by means of obtaining the power spectra Xω(T) and Rω(T) this time is the last of the sound frames. In a case where the system determines that the sound frame is not the last one, the system returns to step 31. Thus, the system continues performing the process on the following frame. In a case where the system determines that the frame is the last one, the system completes the process shown in
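The per-frame flow of steps 33 to 39 can be sketched for a single frequency bin with M = 1 as follows. This is a sketch under stated assumptions rather than the system itself: the speech/non-speech decision is supplied by the caller as a list of flags, the power values Xω(T) and Rω(T) are assumed to have been obtained beforehand, the coefficient update uses an assumed normalized-LMS form, and the names `process_frames`, `a_lms` and `b_lms` together with all numeric values are hypothetical.

```python
def process_frames(x_frames, r_frames, is_speech, alpha1=1.0, alpha2=1.0,
                   beta=0.1, a_lms=0.5, b_lms=1e-6, const=1.0):
    """Return the floored output Z for every frame of one frequency bin (M = 1)."""
    w_echo, w_const = 0.0, 0.0
    z_out = []
    for x_t, r_t, speech in zip(x_frames, r_frames, is_speech):
        if not speech:                                       # step 33: non-speech frame
            # Step 34: update both coefficients from the same error signal.
            err = x_t - w_echo * r_t - w_const * const
            norm = r_t * r_t + const * const + b_lms
            w_echo += a_lms * err * r_t / norm
            w_const += a_lms * err * const / norm
        n_hat = w_const * const                              # stationary noise estimate
        q_hat = w_echo * r_t                                 # step 35: echo estimate
        y_t = x_t - alpha2 * q_hat - alpha1 * n_hat          # step 36: weighted subtraction
        z_out.append(max(y_t, beta * n_hat))                 # steps 37 and 38: flooring, output
    return z_out                                             # step 39: last frame reached

# Assumed synthetic data: 3000 non-speech frames with X = 0.5 R + 3.0,
# followed by one speech frame carrying an extra speech power of 20.0.
cycle = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0, 3.0, 6.0, 9.0, 2.0]
r_frames = [cycle[t % len(cycle)] for t in range(3000)] + [4.0]
x_frames = [0.5 * r + 3.0 for r in r_frames[:-1]] + [0.5 * 4.0 + 3.0 + 20.0]
is_speech = [False] * 3000 + [True]
z = process_frames(x_frames, r_frames, is_speech)
```

In this example the coefficients are frozen during the speech frame, so the output of that frame is close to the added speech power, while the last non-speech frame is floored to β times the noise estimate.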
Through the process shown in
In the case of this embodiment, the adaptive coefficients Wω(M) and Wω(m) (m=0, . . . , M−1) to be used for calculating the estimated values Nω and Qω(T) respectively of the stationary noise components and the non-stationary noise components are designed to be learned at the same time as described above. Accordingly, the adaptive coefficients can be learned exactly. This makes it possible to achieve Ladder 2 in the aforementioned development ladders, or noise robustness needed for the speech recognition process to be performed in a vehicle where stationary driving noise and echo coming from the CD/radio exist.
In addition, if a value equal to that representing the subtraction weight which is used for reducing the stationary noise when the acoustic model to be used for a speech recognition process to be performed in Ladder 1 is learned is used as a value representing the subtraction weight α1 to be assigned to the estimated value Nω of the stationary noise, the acoustic model for Ladder 1 can be used, as it is, in the speech recognition process to be performed in Ladder 2. In other words, its consistency with the acoustic model which is used for existing products is high.
Additionally, the noise reduction unit 10 is designed to perform the echo cancellation process, and to reduce the noise components, by use of the spectral subtraction technique. This makes it possible to package the system in the existing speech recognition system without changing the architecture of a speech recognition engine to a large extent.
Furthermore, if a value larger than the subtraction weight α1 is adopted as the subtraction weight α2 to be assigned to the estimated value Qω(T) of the echo, more of the echo components, which are the chief cause of the source error in recognized characters, can be reduced.
Moreover, if the estimated value Qω(T) of the echo in each frame is obtained with additional reference to the reference signals in the preceding M−1 frames, and if the adaptive coefficients of the reference signals are accordingly defined as M coefficients, one for the reference signal of the current frame and one for each of the preceding M−1 frames, the learning can be performed in a way that also reduces the reverberation of the echo.
In the case of Example 1, first of all, the microphone 1 shown in
In addition, when the vehicle was at a stop, the CD/radio 2 was operated, and accordingly music was outputted from the speaker 3. Thus, an observed signal from the microphone 1 and a reference signal from the CD/radio were recorded simultaneously. Then, the observed signal thus recorded (hereinafter referred to as “data concerning recorded music”) was superimposed on the data concerning the recorded speech at an adequate level.
Thereby, an experimental observed signal x(t) was generated for each of three cases: a speed of 0 km/h, a speed of 50 km/h, and a speed of 100 km/h.
Then, a noise reduction was applied to the recorded reference signal r(t) and the generated experimental observed signal x(t) by use of the system shown in
It should be noted that the digit task is sensitive to the insertion error in recognized characters in the non-speech segment, because the number of digits is not limited in the digit task. The digit task is accordingly suitable for observing the amount by which the echo, or the noise made from the musical sound in this case, is reduced. On the other hand, the command task is free from the source error in recognized characters, because the grammar in the command task consists of one sentence and one word. For this reason, one may think that the command task is suitable for observing the degree of speech distortion in a speech segment.
The noise reduction method of the system shown in
Word error rate (%) concerning the experimental observed signals to be observed respectively when the vehicle speeds were 0 km, 50 km and 100 km, as well as an average of the rates, are shown, as a result of performing the speech recognition by means of the digit task, in columns representing Example 1 in Table 3 shown in
As Example 2, the speech recognition was performed under the same conditions as the speech recognition as Example 1 was performed, except for by use of the system shown in
As Comparative Example 1, the speech recognition was performed by use of the noise reduction method shown in the columns representing Comparative Example 1 in Table 2, under the same conditions as in Example 1, except that the data concerning the recorded speech, on which no recorded musical sound was superimposed, was used instead of the experimental observed signals. Results of performing the speech recognition by means of the respective tasks are shown in columns representing Comparative Example 1 in Tables 3 and 4. In the case of this noise reduction method, only the spectral subtraction was applied as measures against the stationary noise and the echo. Even this method brought about sufficiently high performance of the speech recognition in an environment where only stationary noise exists.
As Comparative Examples 2 to 5, the speech recognitions were performed under the same conditions as in Example 1, except that the respective noise reduction methods shown in columns representing Comparative Examples 2 to 5 in Table 2 were used. Results of performing the speech recognitions are shown in columns representing Comparative Examples 2 to 5 in Tables 3 and 4.
In the case of the noise reduction method of Comparative Example 2, only the conventional mode of the spectral subtraction was performed, but no echo cancellation was performed, as shown in the columns representing Comparative Example 2 in Table 2. In this case, the performance of the speech recognition was relatively low in comparison with Comparative Examples 3 to 5 which used the same experimental observed signals as Comparative Example 2 used, as shown in Tables 3 and 4. This is because no echo cancellation was performed.
In the case of the noise reduction method of Comparative Example 3, the echo cancellation was performed in the front stage, and the spectral subtraction was performed in the rear stage, as measures against the stationary noise and the echo, as shown in columns representing Comparative Example 3 in Table 2. The echo cancellation in the front stage was performed by use of a normalized least-mean-square (N-LMS) algorithm with 2048 taps. This method was equivalent to the conventional technique shown in
In the case of the noise reduction method of Comparative Example 4, the stationary noise was reduced in the front stage by means of the spectral subtraction, and the echo was reduced in the rear stage by an echo canceller in the spectral subtraction mode, as shown in the corresponding columns in Table 2. This method was equivalent to the conventional technique shown in
The chief difference between Comparative Example 4 and Example 1 is that, in the case of Example 1, the stationary noise components were figured out simultaneously in the process of adapting the echo canceller. The method of Example 1 was superior to the methods of Comparative Examples 3 and 4 in performance.
The method of Comparative Example 5 was obtained by introducing the echo canceller in the time domain, as the pre-processor, to the front stage of the method of Comparative Example 4. This method was equivalent to the conventional technique shown in
The reason why the results of Examples 1 and 2 were superior to the results of Comparative Examples 3 and 4 can be considered as follows. Specifically, in the case of the method of Comparative Example 3, the observed signal to be inputted into the echo canceller in the front stage included the stationary noise components as they were, none of which components were reduced from the observed signal. This inclusion decreased the performance of the echo canceller in a high-noise environment. Furthermore, in the case of the method of Comparative Example 4, an averaged power N′ which was subtracted from the observed signal X in the front stage included influence of the echo. This made it impossible to reduce the stationary noise exactly.
In contrast, in the case of Example 1, the estimated value N″ of the stationary noise components and the adaptive coefficient W in the echo canceller were learned simultaneously, and the noise reduction was performed on the basis of the result. This made it possible to reduce both the stationary noise and the echo adequately. Moreover, in the case of Example 2, the echo canceller in the time domain was introduced as the pre-processor. This made it possible to further enhance the performance, as shown in Tables 3 and 4.
In Table 3 (
It should be noted that the present invention is not limited to the aforementioned embodiments, and that it may be modified as appropriate. For example, in the case of the aforementioned embodiments, the noise reduction process is performed by subtracting power spectra. Instead, however, the noise reduction process may be performed by subtracting magnitude spectra. In general, spectral subtraction is implemented in both the power domain and the magnitude domain.
Moreover, in the case of the aforementioned embodiments, the spectral subtraction technique is used in order to reduce stationary noise (background noise). Instead, however, another method of reducing the spectrum of the background noise, such as the Wiener filter, may be used to this end.
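The two alternatives mentioned above, subtraction in the magnitude domain and a Wiener-type filter, can be sketched for a single frequency bin as follows. This is an illustration only: the function names are hypothetical, the Wiener gain is the textbook estimate 1 − N/X computed from power spectra, and clamping at zero is an assumption that stands in for the flooring used in the embodiments.

```python
import math


def magnitude_subtract(x_power, n_power, alpha=1.0):
    """Subtract in the magnitude domain and return the result as power."""
    mag = math.sqrt(x_power) - alpha * math.sqrt(n_power)
    return max(mag, 0.0) ** 2            # clamp at zero (assumption)


def wiener_gain(x_power, n_power):
    """Wiener-type gain estimated from power spectra: max(0, 1 - N / X)."""
    if x_power <= 0.0:
        return 0.0
    return max(0.0, 1.0 - n_power / x_power)


# Hypothetical values: observed power 9.0 or 4.0, stationary noise power 4.0 or 1.0.
y_mag = magnitude_subtract(9.0, 4.0)          # (3 - 2) squared
y_wiener = wiener_gain(4.0, 1.0) * 4.0        # gain applied to the observed power
```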
Furthermore, the present invention has been described giving the example of the echo and the reference signal which are in the form of a monophonic signal. The present invention is not limited to this. The present invention can deal with the echo and the reference signal which are in the form of a stereo signal. Specifically, as described in the section of the prior art, the power spectrum of the reference signal may be defined as a weighted average of its right and left reference signals. In addition, the stereo echo canceller technique may be applied to the pre-process for the echo canceller in the time domain.
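As a small illustration of the stereo case, the reference power spectrum can be formed bin by bin as a weighted average of the left and right channels. The weight is not specified in this section, so the equal weighting used by default below is an assumption, and the function name is hypothetical.

```python
def stereo_reference_power(r_left, r_right, weight_left=0.5):
    """Weighted average of left and right reference power spectra, bin by bin."""
    if not 0.0 <= weight_left <= 1.0:
        raise ValueError("weight_left must lie in [0, 1]")
    if len(r_left) != len(r_right):
        raise ValueError("both channels must have the same number of bins")
    return [weight_left * l + (1.0 - weight_left) * r
            for l, r in zip(r_left, r_right)]


# Hypothetical two-bin spectra for the left and right channels.
r_mono = stereo_reference_power([2.0, 4.0], [6.0, 8.0])
```

The resulting single spectrum can then be used wherever the monophonic reference power spectrum Rω(T) appears.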
Additionally, in the case of the aforementioned embodiments, the sound signal outputted from the CD/radio 2 is used as the reference signal. Instead, however, a sound signal outputted from the car navigation system may be used as the reference signal. This makes it possible to realize barge-in which accepts an interruption of the system prompt with the user's speech through performing the speech recognition while the system is in the process of giving a message to the driver via voice.
As well, in the case of the aforementioned embodiments, the noise reduction is designed to be performed for the purpose of performing the speech recognition in the vehicle compartment. However, the present invention is not limited to this. The present invention can be applied for the purpose of performing the speech recognition in any other environment. For example, the speech recognition may be designed to be capable of being performed by use of a portable personal computer (hereinafter referred to as a “note PC”) while a speech file in the MP3 format, or musical sound of a CD or the like is being played back, by the following means. The speech recognition system for performing the noise reduction in accordance with the present invention is configured by use of the note PC. Thus, a speech signal outputted from the note PC is used as the reference signal in the system.
Commands may be designed to be capable of being inputted into a robot by use of speech while canceling internal noise, including noise from the servo motor, which becomes conspicuous during operations of the robot, by the following means. A speech recognition system for performing the noise reduction in accordance with the present invention is configured in the robot. A microphone with which to obtain the reference signal is set in the body of the robot. A microphone with which to receive commands, which microphone is directed outward from the body, is set in the body. Moreover, commands, including a channel change and preset timer record, may be designed to be capable of being given to a home TV set by use of speech while TV is being watched, by the following means. A speech recognition system for performing the noise reduction in accordance with the present invention is configured in the TV set. Sound outputted from the TV set is used as the reference signal.
In addition, the present invention has been described using the case of the application of the present invention to the speech recognition. However, the present invention is not limited to this. The present invention can be applied to various purposes for which stationary noise and echo need to be reduced. For example, in the case of calling with a hands-free telephone, a speech signal transmitted from a caller on the other end of the line is converted to speech by use of the speaker. This speech is inputted, as echo, through the microphone with which the user of the telephone inputs his/her speech. With this taken into consideration, if the present invention is applied to the telephone so that the speech signal transmitted from the caller on the other end of the line is used as the reference signal, this makes it possible to reduce the echo components from the input signal, thus enabling quality of the call to be improved.
In the case of the present invention, the adaptive coefficients to be used for calculating estimated values respectively of stationary noise components and non-stationary noise components are learned simultaneously on the basis of an observed signal and a reference signal in the frequency domain. This enables each of the adaptive coefficients to be learned more accurately even in a segment where both the stationary noise components and the non-stationary noise components are present, thus making it possible to obtain more accurate estimated values respectively of the stationary noise components and the non-stationary noise components. In this respect, a noise reduction process can be applied to both the stationary noise components and the non-stationary noise components by use of the spectral subtraction technique. This does not largely change the framework of the spectral subtraction which is widely used in current speech recognition practice.
Accordingly, if the first subtraction coefficient is set to a value equivalent to that of the subtraction coefficient used for reducing stationary noise by the spectral subtraction technique when the acoustic model to be used for speech recognition was trained, as described before, a noise reduction process suitable for the acoustic model can be performed. For this reason, the existing acoustic model can be utilized effectively.
Furthermore, in this case, if the second subtraction coefficient which takes on a value larger than that taken on by the first subtraction coefficient is adopted as described above, an over-subtraction technique can be introduced. In other words, if only the second subtraction coefficient concerning the echo components as the non-stationary noise components is set at a value larger than that taken on by a subtraction coefficient which is supposed in the acoustic model, more of the echo components, which are the chief cause of the source error in recognized characters, can be reduced while maintaining interchangeability between the noise reduction technique and the acoustic model when stationary noise is intended to be reduced.
As described above, moreover, if estimated values of non-stationary noise components in each of predetermined frames are acquired on the basis of reference signals respectively of a plurality of predetermined frames preceding the frame, and concurrently if adaptive coefficients concerning the respective reference signals are defined as a plurality of coefficients concerning the reference signals respectively of the plurality of frames, the learning can be performed in order to reduce the echo reverberation, which is the non-stationary noise components, inclusively.
Although the preferred embodiments of the present invention have been described in detail, it should be understood that various changes, substitutions and alternatives can be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Thus, the present invention can be realized in hardware, software, or a combination of hardware and software. It may be implemented as a method having steps to implement one or more functions of the invention, and/or it may be implemented as an apparatus having components and/or means to implement one or more steps of a method of the invention described above and/or known to those skilled in the art. The present invention can be realized in a centralized fashion in one computer system, or in a distributed fashion where different elements are spread across several interconnected computer systems. Any kind of computer system, or other apparatus adapted for carrying out the methods and/or functions described herein, is suitable. A typical combination of hardware and software could be a general purpose computer system with a computer program that, when loaded and executed, controls the computer system such that it carries out the methods described herein. The present invention can also be embedded in a computer program product, which comprises all the features enabling the implementation of the methods described herein, and which, when loaded in a computer system, is able to carry out these methods.
Computer program means or computer program in the present context include any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after conversion to another language, code or notation, and/or after reproduction in a different material form.
Thus the invention includes an article of manufacture which comprises a computer usable medium having computer readable program code means embodied therein for causing one or more functions described above. The computer readable program code means in the article of manufacture comprises computer readable program code means for causing a computer to effect the steps of a method of this invention. Similarly, the present invention may be implemented as a computer program product comprising a computer usable medium having computer readable program code means embodied therein for causing a function described above. The computer readable program code means in the computer program product comprises computer readable program code means for causing a computer to effect one or more functions of this invention. Furthermore, the present invention may be implemented as a program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for causing one or more functions of this invention. Methods of this invention may be implemented by an apparatus which provides the functions carrying out the steps of the methods. Apparatus and/or systems of this invention may be implemented by a method that includes steps to produce the functions of the apparatus and/or systems.
It is noted that the foregoing has outlined some of the more pertinent objects and embodiments of the present invention. This invention may be used for many applications. Thus, although the description is made for particular arrangements and methods, the intent and concept of the invention is suitable and applicable to other arrangements and applications. It will be clear to those skilled in the art that modifications to the disclosed embodiments can be effected without departing from the spirit and scope of the invention. The described embodiments ought to be construed to be merely illustrative of some of the more prominent features and applications of the invention. Other beneficial results can be realized by applying the disclosed invention in a different manner or modifying the invention in ways known to those familiar with the art.
Number | Date | Country | Kind |
---|---|---|---|
2004-357821 | Dec 2004 | JP | national |

Relation | Number | Date | Country |
---|---|---|---|
Parent | 11298318 | Dec 2005 | US |
Child | 12185954 | | US |