A noise suppressor embodying the invention will now be described with reference to the attached drawings, in which like elements are indicated by like reference characters. This noise suppressor may be used as a preprocessor in speech recognition apparatus, or as an initial stage for processing a speech signal picked up by a microphone in a mobile telephone or hands-free telephone, although the embodiment is not restricted to these applications.
Referring to
The analyzer 10 receives a digital speech signal x(n) including noise, and executes a fast Fourier transform (FFT) to analyze the signal into a complex-valued frequency spectrum C(m). The noise reducer 20 receives the frequency spectrum output from the analyzer 10 and removes noise components. The output generator 30 then generates an output speech signal y(n) by performing an inverse FFT on the output G(m) of the noise reducer 20.
The analyzer 10 comprises a window processor 101 and a fast Fourier transform (FFT) processor 102 as shown in
The notation x(n) in
The input digital speech signal is not limited to a signal picked up by a microphone and converted from analog to digital form. The signal may be read from a memory, or transmitted from another device.
The window processor 101 applies a window function to the N consecutive samples x(n) to improve the precision of the analysis. The output b(n) of the window processor 101 is obtained by multiplication by a window function w(n) as in equation (1). Various window functions are applicable; for example, the Hamming window given by equation (2) may be applied. The windowing process is executed in relation to the frame splicing process carried out in the output generator 30 as described later.
Although the use of a window function is preferred, it is not strictly necessary. In some situations the window processor 101 should be omitted, as noted below.
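Equations (1) and (2) are not reproduced in this text. As a minimal sketch, assuming equation (1) is the pointwise product b(n) = w(n)x(n) described above and that the Hamming window takes its common textbook form with an N−1 denominator (an assumption), the windowing step might look as follows. The function names are illustrative only.

```python
import math

def hamming_window(n_points):
    """Textbook Hamming window: w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)).

    The N-1 denominator is an assumption; the source's equation (2)
    is not available in this text.
    """
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_points - 1))
            for n in range(n_points)]

def apply_window(x, w):
    """Pointwise product b(n) = w(n) * x(n) over one frame of N samples."""
    if len(x) != len(w):
        raise ValueError("frame and window must have the same length")
    return [wn * xn for wn, xn in zip(w, x)]

N = 8                      # small frame length, for illustration only
w = hamming_window(N)
b = apply_window([1.0] * N, w)
```

The window tapers toward both ends of the frame, which is what allows overlapping frames to be spliced smoothly later on.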
The FFT processor 102 performs an N-point FFT on the output b(n) of the window processor 101. The spectrum C(m) obtained in the FFT processor 102 is accordingly the result of the discrete Fourier transform (DFT) given by equation (3), the integer m in which is known as the frequency number.
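Equation (3) is not reproduced in this text. The following sketch assumes the standard DFT definition with a negative exponent, C(m) = Σ b(n)·exp(−j2πmn/N), and evaluates it directly; an FFT computes the same values with fewer operations. The function name is illustrative.

```python
import cmath

def dft(b):
    """Direct N-point DFT; m is the frequency number, 0 <= m <= N-1.

    Assumes the usual sign convention exp(-j*2*pi*m*n/N).
    """
    N = len(b)
    return [sum(b[n] * cmath.exp(-2j * cmath.pi * m * n / N) for n in range(N))
            for m in range(N)]

impulse_spectrum = dft([1.0, 0.0, 0.0, 0.0])   # flat spectrum
constant_spectrum = dft([1.0, 1.0, 1.0, 1.0])  # energy only at m = 0
```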
The invention is not limited to use of the FFT; other methods of analyzing the signal into a frequency spectrum may be applied. Furthermore, if the noise suppressor 1 forms part of a device that already employs a frequency analyzer for another purpose, that frequency analyzer may be used as a component element of the noise suppressor 1, instead of providing a separate analyzer 10. Such a configuration is possible, for example, when the noise suppressor 1 is used in an Internet protocol (IP) telephone. An IP telephone inserts encoded FFT output into the IP packet payload; the FFT output prior to encoding may be used as the output of the analyzer 10 described above.
The noise reducer 20 has a magnitude characterizer 201, a peak detector 202, and a masking processor 203 as shown in
The magnitude characterizer 201 calculates a magnitude curve or amplitude characteristic of the frequency spectrum C(m) received from the FFT processor 102. As the frequency spectrum C(m) consists of complex values, the magnitude characterizer 201 takes their absolute values, and then performs a logarithmic conversion on the absolute values to obtain the amplitude characteristic D(m) as in equation (4). The logarithmic conversion provides perceptual linearity.
D(m)=log₁₀∥C(m)∥ (where ∥•∥ denotes absolute value) (4)
As the spectrum C(m) has the property C(m)=C*(N−m) (where 1≦m≦N/2−1, and C*(N−m) is the complex conjugate value of C(N−m)), it is sufficient to perform the processes in the noise reducer 20 on values of m in the range of 0≦m≦N/2.
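As an illustration of equation (4) and of the conjugate-symmetry property, the sketch below computes D(m) over 0≦m≦N/2 for a real-valued frame. The small `floor` value, which avoids taking the logarithm of zero, is an assumption not stated in the description, and the helper names are illustrative.

```python
import cmath
import math

def dft(b):
    """Direct DFT, used here only to produce a spectrum for the example."""
    N = len(b)
    return [sum(b[n] * cmath.exp(-2j * cmath.pi * m * n / N) for n in range(N))
            for m in range(N)]

def amplitude_characteristic(C, floor=1e-12):
    """D(m) = log10|C(m)| for 0 <= m <= N/2.

    'floor' is an assumed guard against log10(0).
    """
    half = len(C) // 2
    return [math.log10(max(abs(C[m]), floor)) for m in range(half + 1)]

x = [0.0, 1.0, 0.5, -0.5, -1.0, 0.25, 0.75, -0.25]  # real-valued frame, N = 8
C = dft(x)
D = amplitude_characteristic(C)
```

Because the input samples are real, C(m) and C(N−m) are complex conjugates, so D(m) carries no additional information for m greater than N/2.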
The peak detector 202 detects the positions of peaks in the amplitude characteristic D(m). The peak detector 202 finds peak points mp at which the value of the amplitude characteristic D(m) reaches a local maximum.
To reduce the effects of noise and to emphasize the peaks (local maxima) in the amplitude characteristic D(m), a local comparison function E(k) approximating the average shape of a typical speech signal spectrum around a peak position is used. The degree of dissimilarity F(m) between the amplitude characteristic D(m) and the local comparison function E(k) is calculated according to equation (5), and any position at which the degree of dissimilarity F(m) attains a local minimum value below a predetermined threshold level is taken as a peak point mp. Roughly speaking, the peak detector 202 detects peaks with shapes that strongly resemble a typical speech peak. The local comparison function E(k) is prestored in the peak detector 202. The symbols −M1 and M2 in equation (5) represent the beginning and end of the interval over which the local comparison function E(k) is defined.
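Equation (5) is not reproduced in this text, so the exact dissimilarity measure is unknown. The sketch below assumes, purely for illustration, a sum of squared differences between the local shape of D around m (taken relative to D(m)) and E(k) for −M1≦k≦M2. The function names, the level normalization, and the handling of positions near the ends of the spectrum are all assumptions.

```python
def dissimilarity(D, E, M1):
    """Assumed form: F(m) = sum over k of (D(m+k) - D(m) - E(k))**2.

    E is a list of length M1 + M2 + 1, with E[k + M1] holding E(k).
    Only positions where the whole interval fits inside D are evaluated.
    """
    M2 = len(E) - 1 - M1
    F = {}
    for m in range(M1, len(D) - M2):
        F[m] = sum((D[m + k] - D[m] - E[k + M1]) ** 2
                   for k in range(-M1, M2 + 1))
    return F

def detect_peaks(D, E, M1, threshold):
    """Peak points: local minima of F(m) that lie below the threshold."""
    F = dissimilarity(D, E, M1)
    peaks = []
    for m, f in F.items():
        left = F.get(m - 1, float("inf"))
        right = F.get(m + 1, float("inf"))
        if f < threshold and f <= left and f < right:
            peaks.append(m)
    return peaks

E = [-1.0, 0.0, -1.0]                      # hypothetical triangular template, M1 = M2 = 1
D = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
peaks = detect_peaks(D, E, M1=1, threshold=1.0)
```

In this toy example only the position whose neighborhood matches the template shape is reported; the flat region yields a nonzero dissimilarity above the threshold and is ignored.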
The masking processor 203 performs the following masking process on the detected peak points mp, starting with the peak point mm having the largest magnitude D(mm).
A masking function M(s, mm, D(mm)) created on the basis of known perceptual masking characteristics is prestored in a table in the masking processor 203 (see
D(mm)−D(s)>M(s,mm,D(mm)) (6)
This masking process yields the values of the noise-suppressed spectrum G(m) in the range of 0≦m≦N/2. The values of G(m) in the range of N/2+1≦m≦N−1 are obtained from the relationship G(m)=G*(N−m). The complete noise-suppressed spectrum G(m) thus obtained is received by the output generator 30.
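The completion of the spectrum by the relationship G(m)=G*(N−m) can be sketched as follows. The function name is illustrative, and the input is assumed to hold the N/2+1 values G(0) through G(N/2).

```python
def complete_spectrum(G_half, N):
    """Extend G(0..N/2) to G(0..N-1) using G(m) = conj(G(N-m))."""
    if len(G_half) != N // 2 + 1:
        raise ValueError("expected N/2 + 1 spectrum values")
    G = [complex(g) for g in G_half] + [0j] * (N // 2 - 1)
    for m in range(N // 2 + 1, N):
        G[m] = complex(G_half[N - m]).conjugate()
    return G

G_half = [1 + 0j, 2 + 1j, 0j, 3 - 2j, 4 + 0j]   # toy values for N = 8
G = complete_spectrum(G_half, 8)
```

Provided G(0) and G(N/2) are real, a spectrum completed in this way has an inverse transform that is real-valued, as required for a speech signal.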
The output generator 30 has an inverse FFT processor 301 and a splicer 302 as shown in
The inverse FFT processor 301 performs an inverse FFT on the noise-suppressed spectrum G(m) to obtain the noise-suppressed signal g(n). If, in place of the FFT, the analyzer 10 uses some other type of frequency analysis process, the inverse FFT processor 301 uses the corresponding inverse process.
The splicer 302 adds the values of the first N/2 data points in the noise-suppressed signal g(n) of the current frame to the values of the last N/2 data points in the noise-suppressed signal g′(n) of the immediately preceding frame to obtain the output speech signal y(n), as in equation (7).
y(n)=g(n)+g′(n+N/2) (7)
In the above process, the data are shifted so that half of the data (N/2 samples) in successive frames overlap; this is a well-known method of smoothly splicing waveforms. The time available to the analyzer 10, noise reducer 20 and output generator 30 in which to process one frame as described above is NT/2, where T is the sampling period of the speech signal. The sampling period T is generally in the range from 31.25 microseconds to 125 microseconds, so if N is 512, then NT/2 is in the range from 8 to 32 milliseconds.
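Equation (7) and the frame timing above can be sketched as follows. The function names are illustrative, and each frame is assumed to be a list of N real samples with N even.

```python
def splice(g_current, g_previous):
    """y(n) = g(n) + g'(n + N/2) for 0 <= n < N/2 (equation (7))."""
    N = len(g_current)
    if len(g_previous) != N or N % 2:
        raise ValueError("frames must have the same even length")
    half = N // 2
    return [g_current[n] + g_previous[n + half] for n in range(half)]

def frame_budget_seconds(N, T):
    """Time available to process one frame when frames advance by N/2 samples."""
    return N * T / 2

y = splice([1, 2, 3, 4], [10, 20, 30, 40])      # first half of current + last half of previous
fast = frame_budget_seconds(512, 31.25e-6)       # about 8 ms
slow = frame_budget_seconds(512, 125e-6)         # about 32 ms
```

Each call to `splice` produces N/2 output samples, which is why a new frame must be processed every NT/2 seconds.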
Depending on the use of the noise suppressor, it may be possible to omit the output generator 30 or to use the output generator of another device. When the noise suppressor is used in a speech recognition device, for example, the output generator 30 may be omitted by using the values of the noise-suppressed spectrum G(m) as recognition features. When the noise suppressor is used in an IP telephone set, the output generator already present in the IP telephone set may be used to perform the above processes.
The operation (noise suppression method) of the noise suppressor 1 having the structure described above will now be explained with reference to
As described above, the window processor 101 performs a windowing process on the N consecutive data samples x(n) received by the analyzer 10, the FFT processor 102 performs an N-point FFT on the windowed data b(n) output from the window processor 101, and the noise reducer 20 processes the resulting frequency spectrum C(m) in the range 0≦m≦N/2, taking advantage of the relationship C(m)=C*(N−m) to omit processing for values of m greater than N/2.
The magnitude characterizer 201 in the noise reducer 20 calculates the magnitude curve or amplitude characteristic of the spectrum C(m).
To detect peaks in the amplitude characteristic D(m) the peak detector 202 may use, for example, the local comparison function E(k) shown in
From among the peak points mp, the masking processor 203 determines the peak point mm having the largest amplitude D(mm), reads the prestored values M(s, mm, D(mm)) of the masking function corresponding to peak position mm and amplitude D(mm) from the table, and tests the condition on the amplitude D(s) given by inequality (6) above for values of s in the range of 0≦s≦N/2. When this condition is satisfied, the corresponding frequency spectrum value C(s) is replaced with zero, thereby removing the corresponding frequency component from the spectrum. The masking function is defined so that the masking process removes frequency components that are significantly smaller than the peak amplitude, where the criteria for being significantly smaller become more stringent with increasing distance from the peak.
After completing this masking process for the peak point mm with the largest amplitude, the masking processor 203 further modifies the frequency spectrum by performing a similar masking process for the peak position mp with the next largest amplitude, and proceeds in this way through all the detected peak points in their order of magnitude. When a frequency component is removed, if it was located at one of the peak positions mp, that position may be discarded from the list of peak positions, to avoid unnecessary masking processing for peaks that have themselves already been masked.
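The masking procedure described above can be sketched as follows. The masking function is passed in as a callable because its table values are not given in this text; the toy function used in the example is hypothetical, and the use of the original D(s) values for every comparison is an assumption.

```python
def mask_spectrum(C_half, D, peaks, masking_fn):
    """Zero every C(s) for which D(mm) - D(s) > M(s, mm, D(mm)) (inequality (6)).

    Peaks are processed in descending order of magnitude; a peak that has
    itself been removed is skipped.
    """
    G = list(C_half)
    removed = set()
    for mm in sorted(peaks, key=lambda m: D[m], reverse=True):
        if mm in removed:
            continue
        for s in range(len(G)):
            if D[mm] - D[s] > masking_fn(s, mm, D[mm]):
                G[s] = 0j
                removed.add(s)
    return G

def toy_masking_fn(s, mm, d_peak):
    """Hypothetical masking function: threshold grows with distance from the peak."""
    return 1.0 + 0.5 * abs(s - mm)

C_half = [1 + 1j, 5 + 0j, 2 - 1j, 1 + 0j, 4 + 2j]
D = [0.0, 3.0, 1.5, 0.0, 2.5]
G = mask_spectrum(C_half, D, peaks=[1, 4], masking_fn=toy_masking_fn)
```

In this example the components at positions 0 and 3 fall far enough below the largest peak to be removed, while the component at position 2 sits exactly on the threshold and is kept.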
The masking function is preferably designed so that masking increases with increasing frequency, as illustrated in
As can be appreciated from
Incidentally, the amplitude characteristic in
The inverse FFT processor 301 in the output generator 30 performs an N-point inverse FFT to convert the noise-suppressed spectrum G(m) to a noise-suppressed signal g(n), and the splicer 302 splices the noise-suppressed signals g(n) of successive frames to obtain the output speech signal y(n).
Like conventional spectral subtraction, the embodiment described above operates in the frequency domain, so it does not require extensive time-domain processing such as autocorrelation computation, and it does not require two microphones or the processing of two input signals. Unlike conventional spectral subtraction, the embodiment described above removes irregular noise even at high noise levels, and does not require the detection of speech-free intervals or the determination of a separate noise spectrum. Accordingly, the above embodiment provides an effective way to suppress a wide variety of irregular noise without requiring extra hardware or extensive signal processing.
Some exemplary variations of the above embodiment will now be described.
The overlapping of frames in the above embodiment is not essential; each successive frame may consist of an entirely new set of samples. Noise reduction can then be carried out with a processor of lower processing power than required in the embodiment above, or by a processor that must devote more of its power to other processes. When the frames do not overlap, it is also preferable not to execute the windowing process.
The computation carried out in the magnitude characterizer 201 may be simplified in two ways. One way is to omit the logarithmic conversion and to calculate the amplitude characteristic D(m) using equation (8) below. A further way is to omit the square-root operation required in the absolute-value calculation and to calculate the amplitude characteristic D(m) using equation (9). Either of these simplifications can produce results similar to those obtained in the embodiment above, provided the masking function M(s, mm, D(mm)) is altered accordingly.
D(m)=∥C(m)∥ (where ∥•∥ denotes absolute value) (8)
D(m)=∥C(m)∥² (where ∥•∥ denotes absolute value) (9)
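The two simplified amplitude characteristics of equations (8) and (9) can be sketched as follows; the function names are illustrative. The squared form needs only the real and imaginary parts, so no square root is taken.

```python
def amplitude_linear(C):
    """Equation (8): D(m) = |C(m)|, with no logarithmic conversion."""
    return [abs(c) for c in C]

def amplitude_squared(C):
    """Equation (9): D(m) = |C(m)|**2, computed without a square root."""
    return [c.real * c.real + c.imag * c.imag for c in C]

C = [3 + 4j, 0 + 1j]
linear = amplitude_linear(C)
squared = amplitude_squared(C)
```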
The peak detection process in the peak detector 202 may be simplified by averaging the amplitude characteristic D(m) over intervals from m−K to m+K (where K is a positive integer).
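A minimal sketch of this averaging follows. The description does not say how positions near the ends of the spectrum are handled; truncating the interval at the boundaries, as done here, is an assumption, and the function name is illustrative.

```python
def smooth(D, K):
    """Average D over the interval m-K .. m+K, truncated at the ends of D."""
    out = []
    for m in range(len(D)):
        lo = max(0, m - K)
        hi = min(len(D) - 1, m + K)
        window = D[lo:hi + 1]
        out.append(sum(window) / len(window))
    return out

smoothed = smooth([0.0, 3.0, 0.0, 0.0, 0.0], K=1)
```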
The masking function M(s, mm, D(mm)) may be simplified to the form in equation (10), which assigns a predetermined constant value H to positions s within a fixed distance P of the peak position mp and assigns the greatest expressible positive value to more distant positions. The masking value is accordingly constant within a local range including the peak position mp, and no components outside that local range are removed, because no component can have a magnitude exceeding the greatest expressible positive value. If the constant P is set to the average distance between peak points mp, then on average, the masking function given by equation (10) removes frequency components with amplitudes that are attenuated by more than H with respect to the amplitude of the nearest peak point mp.
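Equation (10) is not reproduced in this text. Based on the description above, it might be sketched as follows, with `math.inf` standing in for the greatest expressible positive value and with "within a fixed distance P" read as |s − mp| ≦ P; both choices are assumptions, as is the function name.

```python
import math

def simple_masking_fn(H, P):
    """Build M(s, mp, D(mp)): H within distance P of the peak, infinite elsewhere."""
    def M(s, mp, d_peak):
        # d_peak is unused: this simplified form ignores the peak magnitude.
        return H if abs(s - mp) <= P else math.inf
    return M

M = simple_masking_fn(H=2.0, P=3)
near = M(5, 4, 10.0)     # inside the local range
far = M(8, 4, 10.0)      # outside the local range
```

Since no finite difference D(mp)−D(s) can exceed an infinite threshold, components outside the local range are never removed.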
In another possible simplification, the masking function has the form M(s, mp, D(mp))=M1(s, mp)+M2(D(mp)), so that it is the sum of a first function M1 of the peak position mp and frequency number s and a second function M2 of the peak magnitude D(mp). With this type of masking function it is only necessary to store a single curve of the type shown in
Instead of completely removing masked frequency components, the masking process may only attenuate them. For example, the complex values C(m) of masked frequency components may be multiplied by a positive real number less than unity.
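A minimal sketch of this attenuating variation follows. The gain value of 0.1 is an arbitrary illustration, not a value from the description, and the function name and the use of a set of masked positions are assumptions.

```python
def attenuate_masked(C_half, masked_positions, gain=0.1):
    """Scale masked components by a positive real gain below unity."""
    if not 0.0 < gain < 1.0:
        raise ValueError("gain must be a positive real number less than unity")
    return [c * gain if s in masked_positions else c
            for s, c in enumerate(C_half)]

attenuated = attenuate_masked([2 + 2j, 4 + 0j], masked_positions={0}, gain=0.5)
```

Multiplying by a real gain reduces the magnitude of a masked component while leaving its phase unchanged.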
The noise suppressor according to the present invention may be used in combination with other noise suppressors. A sound source separator that uses two microphones to separate the speech of a plurality of speakers by independent component analysis (ICA) may be provided upstream of the inventive noise suppressor, and the inventive noise suppressor may be used to remove residual noise from each separated speech signal.
Those skilled in the art will recognize that further variations are possible within the scope of the invention, which is defined in the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
JP2006-229341 | Aug 2006 | JP | national |