1. Technical Field
This invention relates to a method and system for controlling a weighting filter based on the spectral content of the input speech signal, among other possible factors.
2. Related Art
An analog portion of a communications network may detract from the desired audio characteristics of vocoded speech. In a public switched telephone network, a trunk between exchanges or a local loop from a local office to a fixed subscriber station may use analog representations of the speech signal. For example, a telephone station typically transmits an analog modulated signal with approximately 3.4 kHz bandwidth to the local office over the local loop. The local office may include a channel bank that converts the analog signal to a digital pulse-code-modulated signal (e.g., DS0). An encoder in a base station may subsequently encode the digital signal, which remains subject to the frequency response originally imparted by the analog local loop, the telephone, and the speaker.
The analog portion of the communications network may skew the frequency response of a voice message transmitted through the network. A skewed frequency response may negatively impact the digital speech coding process because the digital speech coding process may be optimized for a different frequency response than the skewed frequency response. As a result, the analog portion may degrade the intelligibility, consistency, realism, clarity or another performance aspect of the digital speech coding.
The change in the frequency response may be modeled as one or more modeling filters interposed in a path of the voice signal traversing an ideal analog communications network with an otherwise flat spectral response. A Modified Intermediate Reference System (MIRS) refers to a modeling filter or another model of the spectral response of a voice signal path in a communications network. If a voice signal that has a flat spectral response is inputted into an MIRS filter, the output signal has a sloped spectral response with an amplitude that generally increases with a corresponding increase in frequency.
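The tilting effect described above may be illustrated with a minimal sketch. The first-order filter H(z) = 1 - alpha * z^-1, the value of alpha, and the 8 kHz sampling rate are assumptions chosen only for illustration; they are not the actual MIRS characteristic, which is not specified here. The sketch merely shows a response whose magnitude rises with frequency, so that a flat input spectrum would emerge with an upward slope.

```python
import cmath
import math


def tilt_filter_gain(freq_hz, sample_rate_hz=8000.0, alpha=0.7):
    """Magnitude of the illustrative filter H(z) = 1 - alpha * z^-1 at freq_hz.

    This is a hypothetical stand-in for a sloped (MIRS-like) response,
    not the real MIRS filter.
    """
    omega = 2.0 * math.pi * freq_hz / sample_rate_hz
    return abs(1.0 - alpha * cmath.exp(-1j * omega))


# Gains at a few frequencies inside an approximately 3.4 kHz voice band.
frequencies_hz = (300.0, 1000.0, 2000.0, 3400.0)
gains = [tilt_filter_gain(f) for f in frequencies_hz]

for f, g in zip(frequencies_hz, gains):
    print(f"{f:7.1f} Hz -> gain {g:.3f}")
```

With these assumed values the printed gain grows monotonically from the lowest to the highest frequency, which is the qualitative behavior attributed to the MIRS model.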
In the prior art, an encoder may use weighting filters with identical responses for a pitch-preprocessing weighting filter, an adaptive-codebook weighting filter, and a fixed-codebook weighting filter. The adaptive-codebook weighting filter may be used for open-loop pitch estimation. If identical filters are used for pitch pre-processing and open-loop pitch estimation and if the input speech has a skewed spectral response (e.g., MIRS response), the encoded speech signal may be degraded in perceptual quality. For example, if the input speech signal to the pitch-preprocessing weighting filter has an MIRS spectral response, the output speech signal from the pitch-preprocessing weighting filter may not be as periodic as it otherwise might be with a different spectral response of the input speech signal. Accordingly, the output of the pitch-preprocessing weighting filter may not be sufficiently periodic to capture coding efficiencies or perceptual aspects associated with generally periodic speech. Thus, the need exists for a pitch-preprocessing weighting filter that addresses the spectral response of the input speech signal to enhance the periodicity of the weighted speech signal.
If identical weighting filters are used for both open-loop pitch estimation and fixed-codebook search, the bandwidth of the encoded speech and the perceptual quality of the encoded speech may be degraded. For example, the weighting filters may filter out unwanted noise from the input speech signal, which may lead incidentally to a reduced bandwidth of the encoded speech signal. If the input speech signal has a desired noise component or another speech component that requires a wide bandwidth for accurate encoding, the weighting filters may attenuate the speech noise component of the encoded speech to such a degree that the encoded speech sounds artificial or synthetic when reproduced. Thus, a need exists for weighting filters of an encoder that filter out unwanted noise and yet maintain the appropriate bandwidth necessary for a perceptually accurate reproduction of the speech.
In accordance with the invention, a method for preparing a speech signal for encoding comprises determining whether the spectral content of an input speech signal is representative of a defined spectral characteristic (e.g., a defined characteristic slope). A weighting filter may be associated with a particular portion of the encoder and may comprise a frequency-specific component that has a response tailored to the particular portion of the encoder, consistent with perceptual quality considerations of the reproduced speech signal. A frequency-specific filter component of a weighting filter is controlled based on one or more of the following: the determination of the spectral content of the speech signal and an affiliation of the weighting filter with a particular portion of the encoder. A core weighting filter component of the weighting filter may be maintained regardless of the spectral content of the speech signal.
The frequency-specific filter component of a weighting filter may include a low-pass filter component, a high-pass filter component, or some other filter component. In one example, a low-pass filter component of a pre-processing weighting filter is controlled based on the determination of the spectral content of the input speech signal to enhance the periodicity of the weighted speech. In another example, a high-pass filter component of a fixed-codebook weighting filter is controlled based on the determination of the spectral content of the speech signal to enhance the perceptual quality of reproduced speech, derived from the encoded speech.
In accordance with another aspect of the invention, if multiple weighting filters are used in the encoder, the responses of at least two weighting filters may differ to correspond to the speech processing objectives of specific portions of the encoder, consistent with achieving a desired level of perceptual quality of the speech signal. In other words, different weighting filter responses could be used for different portions of the encoder to enhance the perceptual quality of the reproduced speech.
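The control scheme summarized above may be sketched as a small selection routine. The names `WeightingFilterConfig` and `configure_weighting_filter`, the portion labels, and in particular the policy of enabling a component when the defined slope is detected are hypothetical; the text states only that the frequency-specific component is controlled based on the spectral determination and on the portion of the encoder with which the filter is affiliated, while the core component is maintained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeightingFilterConfig:
    core_enabled: bool        # core weighting component, always maintained
    low_pass_enabled: bool    # frequency-specific low-pass component
    high_pass_enabled: bool   # frequency-specific high-pass component


def configure_weighting_filter(encoder_portion, has_defined_slope):
    """Pick a filter configuration for one portion of the encoder.

    ASSUMPTION: enabling the frequency-specific component when the defined
    spectral slope is detected is an illustrative policy only.
    """
    if encoder_portion == "pre_processing":
        return WeightingFilterConfig(True, has_defined_slope, False)
    if encoder_portion == "fixed_codebook":
        return WeightingFilterConfig(True, False, has_defined_slope)
    if encoder_portion == "adaptive_codebook":
        return WeightingFilterConfig(True, False, False)
    raise ValueError(f"unknown encoder portion: {encoder_portion!r}")


pre = configure_weighting_filter("pre_processing", has_defined_slope=True)
fixed = configure_weighting_filter("fixed_codebook", has_defined_slope=True)
print(pre)
print(fixed)
```

Under this assumed policy, the two filters receive different responses for the same input, while the core component stays enabled in every case.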
Other systems, methods, features and advantages of the invention will be apparent to one with skill in the art upon examination of the following figures and detailed description. It is intended that all such additional systems, methods, features and advantages be included within this description, be within the scope of the invention, and be protected by the accompanying claims.
Like reference numerals designate corresponding elements throughout the different figures.
The term coding refers to encoding of a speech signal, decoding of a speech signal or both. An encoder codes or encodes a speech signal, whereas a decoder codes or decodes a speech signal. The encoder may determine certain coding parameters that are used both in an encoder to encode a speech signal and a decoder to decode the encoded speech signal. The term coder refers to an encoder or a decoder.
For an uplink transmission from the mobile station 127 to the base station 112, the mobile station 127 has a microphone 124 that receives an audible speech message of acoustic vibrations from a speaker or source. The microphone 124 transduces the audible speech message into a speech signal. In one embodiment, the microphone 124 has a generally flat spectral response across a bandwidth of the audible speech message so long as the speaker has a proper distance and position with respect to the microphone 124. An audio stage 134 preferably amplifies and digitizes the speech signal. For example, the audio stage 134 may include an amplifier with its output coupled to an input of an analog-to-digital converter. The audio stage 134 inputs the speech signal into the spectral detector 221.
A spectral detector 221 detects the spectral contents or spectral response of the speech signal. In one embodiment, the spectral detector 221 determines whether or not the spectral contents conform to a defined spectral slope (e.g., an MIRS response). A spectral response refers to the energy distribution (e.g., magnitude versus frequency) of the voice signal over at least part of the bandwidth of the voice signal. A flat spectral response refers to an energy distribution that generally keeps the original spectrum of the input speech signal over the bandwidth. A sloped spectral response refers to an energy distribution that generally tilts the original spectral response (of an inputted speech signal) with respect to frequency of the inputted speech signal.
An MIRS spectral response refers to an energy distribution where an inputted speech signal is tilted upward in magnitude for a corresponding increase in frequency. For both a flat and MIRS speech signal, the energy distribution is usually not evenly distributed over the bandwidth of the speech signal.
A first spectral response refers to a voice signal with a sloped spectral response where the higher frequency components have relatively greater amplitude than the average amplitude of other frequency components of the voice signal. A second spectral response refers to a voice signal where the higher frequency components have approximately equal amplitudes to lower frequency components, or where amplitudes are within a range of each other. A third spectral response refers to a voice signal where the higher frequency components have relatively lower amplitude than the average amplitude of other frequency components of the voice signal.
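The three categories of spectral response defined above may be distinguished with a simple comparison, sketched below. The function name `classify_spectral_response`, the split of the band amplitudes into a lower half and an upper half, and the tolerance value are assumptions made for illustration; the text does not specify how the comparison is carried out.

```python
def classify_spectral_response(band_amplitudes, tolerance=0.1):
    """Label a spectrum as a 'first', 'second', or 'third' spectral response.

    band_amplitudes: non-negative amplitudes ordered from low to high frequency.
    tolerance: hypothetical relative range within which the upper and lower
               amplitudes are treated as approximately equal.
    """
    n = len(band_amplitudes)
    if n < 2:
        raise ValueError("need at least two frequency bands")
    lower = band_amplitudes[: n // 2]
    upper = band_amplitudes[n // 2:]
    lower_mean = sum(lower) / len(lower)
    upper_mean = sum(upper) / len(upper)
    if lower_mean == 0.0:
        # Degenerate case: no low-frequency energy at all.
        return "first" if upper_mean > 0.0 else "second"
    ratio = upper_mean / lower_mean
    if ratio > 1.0 + tolerance:
        return "first"   # higher frequencies relatively stronger
    if ratio < 1.0 - tolerance:
        return "third"   # higher frequencies relatively weaker
    return "second"      # approximately equal amplitudes


print(classify_spectral_response([1.0, 1.0, 2.0, 2.0]))
print(classify_spectral_response([1.0, 1.0, 1.0, 1.0]))
print(classify_spectral_response([2.0, 2.0, 1.0, 1.0]))
```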
At the mobile station 127, the spectral response of the outgoing speech signal may be influenced by one or more of the following factors: (1) frequency response of the microphone 124, (2) position and distance of the microphone 124 with respect to a source (e.g., speaker's mouth) of the audible speech message, and (3) frequency response of an audio stage 134 that amplifies the output of the microphone 124. The spectral response of the outgoing speech signal, which is inputted into the spectral detector 221, may vary. In one example, the spectral response may be generally flat with respect to most frequencies over the bandwidth of the speech message. In another example, the spectral response may have a slope that indicates an amplitude that increases with frequency over the bandwidth of the speech message. For instance, an MIRS response has an amplitude that increases with a corresponding increase in frequency over the bandwidth of the speech message.
The encoder 11 reduces redundant information in the speech signal or otherwise reduces a greater volume of data of an input speech signal to a lesser volume of data of an encoded speech signal. The encoder 11 may comprise a coder, a vocoder, a codec, or another device for facilitating efficient transmission of information over the air interface between the mobile station 127 and the base station 112. In one embodiment, the encoder 11 comprises a code-excited linear prediction (CELP) coder or a variant of the CELP coder. In an alternate embodiment, the encoder 11 may comprise a parametric coder, such as a harmonic encoder or a waveform-interpolation encoder. The encoder 11 is coupled to a transmitter 62 for transmitting the coded signal over the air interface to the base station 112.
The base station 112 may include a receiver 128 coupled to a decoder 120. At the base station 112, the receiver 128 receives a transmitted signal transmitted by the transmitter 62. The receiver 128 provides the received speech signal to the decoder 120 for decoding and reproduction on the speaker 126 (i.e., transducer). A decoder 120 reconstructs a replica or facsimile of the speech message inputted into the microphone 124 of the mobile station 127. The decoder 120 reconstructs the speech message by performing inverse operations on the encoded signal with respect to the encoder 11 of the mobile station 127. The decoder 120 or an affiliated communications device sends the decoded signal over the network to the subscriber station (e.g., fixed subscriber station 118).
For a downlink transmission from the base station 112 to the mobile station 127, a source at the fixed subscriber station 118 (e.g., a telephone set) may speak into a microphone 124 of the fixed subscriber station 118 to produce a speech message. The fixed subscriber station 118 transmits the speech message over the communications network 117 via one of various alternative communications paths to the base station 112.
Each of the alternate communications paths may provide a different spectral response of the speech signal that is applied to the spectral detector 221 of the base station 112. Three examples of communications paths are shown in
The spectral response of any of the three illustrative communications paths may be flat or may be sloped. The slope may or may not be consistent with an MIRS model of a telecommunications system, although the slope may vary from network to network.
The encoder 11 at the base station 112 encodes the speech signal from the spectral detector 221. For a downlink transmission, the transmitter 130 transmits an encoded signal over the air interface to a receiver 222 of the mobile station 127. The mobile station 127 includes a decoder 120 coupled to the receiver 222 for decoding the encoded signal. The decoded speech signal may be provided in the form of an audible, reproduced speech signal at a speaker 126 or another transducer of the mobile station 127.
An MIRS speech signal may be formed because of the network or filtering which tilts the original spectral response of an inputted speech signal. The MIRS speech signal contains more high-frequency energy than the original response of the inputted speech signal, but could still have a negative or a positive tilt because of the underlying slope of the original spectral response. In the context of an MIRS response, the slope shown in
Although the slope shown in
For example,
The encoder 11 includes an input section 10 coupled to an analysis section 12 and an adaptive codebook section 14. In turn, the adaptive codebook section 14 is coupled to a fixed codebook section 16. A multiplexer 60, associated with both the adaptive codebook section 14 and the fixed codebook section 16, is coupled to a transmitter 62.
The transmitter 62 and a receiver 128 along with a communications protocol represent an air interface 64 of a wireless system. The input speech from a source or speaker is applied to the encoder 11 at the encoding site. The transmitter 62 transmits an electromagnetic signal (e.g., radio frequency or microwave signal) from an encoding site to a receiver 128 at a decoding site, which is remotely situated from the encoding site. The electromagnetic signal is modulated with reference information representative of the input speech signal. A demultiplexer 68 demultiplexes the reference information for input to the decoder 120. The decoder 120 produces a replica or representation of the input speech, referred to as output speech, at the decoder 120.
The input section 10 has an input terminal for receiving an input speech signal. The input terminal feeds a high-pass filter 18 that attenuates the input speech signal below a cut-off frequency (e.g., 80 Hz) to reduce noise in the input speech signal. The high-pass filter 18 feeds a pre-processing weighting filter 21 and a linear predictive coding (LPC) analyzer 30. The pre-processing weighting filter 21 may feed both a pitch pre-processing module 22 and a pitch estimator 32. Further, the pre-processing weighting filter 21 may be coupled to an input of a first summer 46 via the pitch pre-processing module 22.
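The noise-reducing high-pass stage may be illustrated as follows. The first-order recursive structure and the 8 kHz sampling rate are assumptions; the text gives only the example cut-off frequency of 80 Hz and does not specify the order or design of the high-pass filter 18.

```python
import math


def high_pass(samples, cutoff_hz=80.0, sample_rate_hz=8000.0):
    """First-order high-pass filter (illustrative design, not from the text).

    Implements y[n] = alpha * (y[n-1] + x[n] - x[n-1]).
    """
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate_hz
    alpha = rc / (rc + dt)
    output = []
    prev_x = 0.0
    prev_y = 0.0
    for x in samples:
        y = alpha * (prev_y + x - prev_x)
        output.append(y)
        prev_x, prev_y = x, y
    return output


# A constant (0 Hz) input is attenuated toward zero ...
dc_out = high_pass([1.0] * 2000)
# ... whereas a 1 kHz tone, well above the cut-off, passes almost unchanged.
tone = [math.sin(2.0 * math.pi * 1000.0 * n / 8000.0) for n in range(2000)]
tone_out = high_pass(tone)
print(abs(dc_out[-1]), max(abs(v) for v in tone_out[-100:]))
```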
In one embodiment, a speech characteristic classifier 26 comprises a detector 24. The detector 24 may refer to a classification unit that (1) identifies noise-like unvoiced speech and (2) distinguishes between non-stationary voiced and stationary voiced speech in an interval of an input speech signal. The detector 24 may detect or facilitate detection of the presence or absence of a triggering characteristic (e.g., a generally voiced and generally stationary speech component) in an interval of the input speech signal. In another embodiment, the detector 24 may be integrated into the speech characteristic classifier 26 to detect a triggering characteristic in an interval of the input speech signal. Where the detector 24 is so integrated, the speech characteristic classifier 26 is coupled to a selector 34.
The analysis section 12 includes the LPC analyzer 30, the pitch estimator 32, a voice activity detector 28, a speech characteristic classifier 26, and a controller 27. The LPC analyzer 30 is coupled to the voice activity detector 28 for detecting the presence of speech or silence in the input speech signal. The pitch estimator 32 is coupled to a mode selector 34 for selecting a pitch pre-processing procedure or a responsive long-term prediction procedure based on input received from the detector 24. The controller 27 controls the pre-processing weighting filter 21, the adaptive-codebook weighting filter 23, or both based on the spectral content of the speech signal. The pre-processing weighting filter 21, the adaptive-codebook weighting filter 23, or the fixed-codebook weighting filter 25 may be referred to generally as a weighting filter.
The adaptive codebook section 14 includes a first excitation generator 40 coupled to a synthesis filter 42 (e.g., short-term predictive filter). In turn, the synthesis filter 42 feeds an adaptive-codebook weighting filter 23. The adaptive-codebook weighting filter 23 is coupled to an input of the first summer 46, whereas a minimizer 48 is coupled to an output of the first summer 46. The minimizer 48 provides a feedback command to the first excitation generator 40 to minimize an error signal at the output of the first summer 46. The adaptive codebook section 14 is coupled to the fixed codebook section 16 where the output of the first summer 46 feeds the input of a second summer 44 with the error signal.
The fixed codebook section 16 includes a second excitation generator 58 coupled to a synthesis filter 42 (e.g., short-term predictive filter). In turn, the synthesis filter 42 feeds a fixed-codebook weighting filter 25. The fixed-codebook weighting filter 25 is coupled to an input of the second summer 44, whereas a minimizer 48 is coupled to an output of the second summer 44. A residual signal is present at the output of the second summer 44. The minimizer 48 provides a feedback command to the second excitation generator 58 to minimize the residual signal.
In one alternate embodiment, the synthesis filter 42 and the adaptive-codebook weighting filter 23 of the adaptive codebook section 14 are combined into a single filter.
In another alternate embodiment, the synthesis filter 42 and the fixed-codebook weighting filter 25 of the fixed codebook section 16 are combined into a single filter. In yet another alternate embodiment, the three perceptual weighting filters (21, 23, and 25) of the encoder 11 may be replaced by two perceptual weighting filters, where each remaining perceptual weighting filter is coupled in tandem with the input of one of the minimizers 48. Accordingly, in the foregoing alternate embodiment the pre-processing weighting filter 21 from the input section 10 is deleted.
In accordance with
The pre-processing weighting filter 21 of the input section 10 has a first time versus amplitude response that opposes a second time versus amplitude response of the formants of the input speech signal. The formants represent key amplitude versus frequency responses of the speech signal that characterize the speech signal consistent with a linear predictive coding analysis of the LPC analyzer 30. The pre-processing weighting filter 21 is adjusted to compensate for the perceptually induced deficiencies in error minimization, which would otherwise result, between the reference speech signal (e.g., input speech signal) and a synthesized speech signal.
The input speech signal is provided to a linear predictive coding (LPC) analyzer 30 (e.g., LPC analysis filter) to determine LPC coefficients for the synthesis filters 42 (e.g., short-term predictive filters). The input speech signal is inputted into a pitch estimator 32. The pitch estimator 32 determines a pitch lag value and a pitch gain coefficient for voiced segments of the input speech. Voiced segments of the input speech signal refer to generally periodic waveforms.
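One conventional way to obtain LPC coefficients is the autocorrelation method with the Levinson-Durbin recursion, sketched below. The text does not state which method the LPC analyzer 30 uses, so the choice of this recursion, the function names, and the sign convention A(z) = 1 + a1 * z^-1 + ... + ap * z^-p are assumptions.

```python
def autocorrelation(signal, max_lag):
    """Autocorrelation values r[0..max_lag] of a finite signal."""
    return [
        sum(signal[n] * signal[n - lag] for n in range(lag, len(signal)))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve for prediction-error filter coefficients a[0..order], a[0] = 1.

    Returns (a, error), where error is the final prediction error energy.
    """
    if r[0] <= 0.0:
        raise ValueError("signal has no energy")
    a = [1.0] + [0.0] * order
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / error                      # reflection coefficient
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        error *= 1.0 - k * k
    return a, error


# Autocorrelation of an idealized first-order process with correlation 0.5.
coeffs, residual_energy = levinson_durbin([1.0, 0.5, 0.25], order=2)
print(coeffs, residual_energy)
```

For this idealized input, the recursion yields a first coefficient of -0.5 and a second coefficient of zero, as expected for a first-order process.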
The pitch estimator 32 may perform an open-loop pitch analysis at least once a frame to estimate the pitch lag. Pitch lag refers to a temporal measure of the repetition component (e.g., a generally periodic waveform) that is apparent in voiced speech or voice component of a speech signal. For example, pitch lag may represent the time duration between adjacent amplitude peaks of a generally periodic speech signal. As shown in
The pitch estimator 32 maximizes the correlations between signals occurring in different sub-frames to determine candidates for the estimated pitch lag. The pitch estimator 32 preferably divides the candidates within a group of distinct ranges of the pitch lag. After normalizing the delays among the candidates, the pitch estimator 32 may select a representative pitch lag from the candidates based on one or more of the following factors: (1) whether a previous frame was voiced or unvoiced with respect to a subsequent frame affiliated with the candidate pitch delay; (2) whether a previous pitch lag in a previous frame is within a defined range of a candidate pitch lag of a subsequent frame, and (3) whether the previous two frames are voiced and the two previous pitch lags are within a defined range of the subsequent candidate pitch lag of the subsequent frame. The pitch estimator 32 provides the estimated representative pitch lag to the adaptive codebook 36 to facilitate a starting point for searching for the preferential excitation vector in the adaptive codebook 36. The adaptive codebook section 14 later refines the estimated representative pitch lag to select an optimum or preferential excitation vector from the adaptive codebook 36.
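The correlation-maximizing step of the open-loop analysis may be sketched as a search over candidate lags. The function name, the lag range, and the use of a single normalized correlation over the whole analysis window are assumptions; the candidate grouping into ranges and the selection factors (1) through (3) described above are not modeled in this sketch.

```python
import math


def estimate_pitch_lag(signal, min_lag, max_lag):
    """Return (lag, score) maximizing the normalized correlation.

    The score compares the signal with a copy of itself delayed by `lag`.
    """
    best_lag, best_score = None, -float("inf")
    for lag in range(min_lag, max_lag + 1):
        current = signal[lag:]
        delayed = signal[: len(signal) - lag]
        norm = math.sqrt(
            sum(c * c for c in current) * sum(d * d for d in delayed)
        )
        if norm == 0.0:
            continue  # no energy in one of the windows
        score = sum(c * d for c, d in zip(current, delayed)) / norm
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score


# Synthetic voiced-like signal with a period of 40 samples.
voiced = [math.sin(2.0 * math.pi * n / 40.0) for n in range(320)]
lag, score = estimate_pitch_lag(voiced, min_lag=20, max_lag=70)
print(lag, round(score, 4))
```

The lag range in the usage example is limited to less than twice the true period, because a periodic signal correlates equally well at integer multiples of its period; resolving such multiples is one purpose of the candidate selection factors described above.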
The speech characteristic classifier 26 preferably executes a speech classification procedure in which speech is classified into various classifications during an interval for application on a frame-by-frame basis or a subframe-by-subframe basis. The speech classifications may include one or more of the following categories: (1) silence/background noise, (2) noise-like unvoiced speech, (3) unvoiced speech, (4) transient onset of speech, (5) plosive speech, (6) non-stationary voiced, and (7) stationary voiced. Stationary voiced speech represents a periodic component of speech in which the pitch (frequency) or pitch lag does not vary by more than a maximum tolerance during the interval of consideration. Non-stationary voiced speech refers to a periodic component of speech where the pitch (frequency) or pitch lag varies more than the maximum tolerance during the interval of consideration. Noise-like unvoiced speech refers to the nonperiodic component of speech that may be modeled as a noise signal, such as Gaussian noise. The transient onset of speech refers to speech that occurs immediately after silence of the speaker or after low amplitude excursions of the speech signal. A speech classifier may accept a raw input speech signal, pitch lag, pitch correlation data, and voice activity detector data to classify the raw speech signal as one of the foregoing classifications for an associated interval, such as a frame or a subframe. The foregoing speech classifications may define one or more triggering characteristics that may be present in an interval of an input speech signal. The presence or absence of a certain triggering characteristic in the interval may facilitate the selection of an appropriate encoding scheme for a frame or subframe associated with the interval.
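The distinction between stationary voiced and non-stationary voiced speech may be sketched as a tolerance test on the pitch lags observed during the interval. The function name and the use of the difference between the largest and smallest lag as the measure of variation are assumptions; the text states only that the pitch lag must not vary by more than a maximum tolerance.

```python
def classify_voiced_interval(pitch_lags, max_tolerance):
    """Classify a voiced interval by how much its pitch lag varies.

    pitch_lags: pitch lag values (in samples) observed within the interval.
    max_tolerance: largest permitted spread for stationary voiced speech.
    """
    if not pitch_lags:
        raise ValueError("at least one pitch lag is required")
    spread = max(pitch_lags) - min(pitch_lags)
    if spread <= max_tolerance:
        return "stationary voiced"
    return "non-stationary voiced"


print(classify_voiced_interval([40, 41, 40, 39], max_tolerance=3))
print(classify_voiced_interval([40, 47, 55, 62], max_tolerance=3))
```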
A first excitation generator 40 includes an adaptive codebook 36 and a first gain adjuster 38 (e.g., a first gain codebook). A second excitation generator 58 includes a fixed codebook 50, a second gain adjuster 52 (e.g., second gain codebook), and a controller 54 coupled to both the fixed codebook 50 and the second gain adjuster 52. The fixed codebook 50 and the adaptive codebook 36 define excitation vectors. Once the LPC analyzer 30 determines the filter parameters of the synthesis filters 42, the encoder 11 searches the adaptive codebook 36 and the fixed codebook 50 to select proper excitation vectors. The first gain adjuster 38 may be used to scale the amplitude of the excitation vectors of the adaptive codebook 36. The second gain adjuster 52 may be used to scale the amplitude of the excitation vectors in the fixed codebook 50. The controller 54 uses speech characteristics from the speech characteristic classifier 26 to assist in the proper selection of preferential excitation vectors from the fixed codebook 50, or a sub-codebook therein.
The adaptive codebook 36 may include excitation vectors that represent segments of waveforms or other energy representations. The excitation vectors of the adaptive codebook 36 may be geared toward reproducing or mimicking the long-term variations of the speech signal. A previously synthesized excitation vector of the adaptive codebook 36 may be inputted into the adaptive codebook 36 to determine the parameters of the present excitation vectors in the adaptive codebook 36. For example, the encoder may alter the present excitation vectors in its codebook in response to the input of past excitation vectors outputted by the adaptive codebook 36, the fixed codebook 50, or both. The adaptive codebook 36 is preferably updated on a frame-by-frame or a subframe-by-subframe basis based on a past synthesized excitation, although other update intervals may produce acceptable results and fall within the scope of the invention.
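The dependence of the adaptive codebook on past synthesized excitation may be sketched as a history buffer. The function names, the periodic repetition used when the pitch lag is shorter than the requested vector length, and the fixed history size are assumptions made for illustration.

```python
def adaptive_vector(past_excitation, pitch_lag, length):
    """Build an excitation vector from the last `pitch_lag` past samples.

    If the lag is shorter than `length`, the segment is repeated
    periodically (an illustrative choice, not taken from the text).
    """
    if not 0 < pitch_lag <= len(past_excitation):
        raise ValueError("pitch lag must lie within the stored history")
    segment = past_excitation[-pitch_lag:]
    return [segment[i % pitch_lag] for i in range(length)]


def update_adaptive_codebook(past_excitation, new_excitation, max_history):
    """Append the newest synthesized excitation and trim the history."""
    return (past_excitation + new_excitation)[-max_history:]


history = [1.0, 2.0, 3.0, 4.0]
vector = adaptive_vector(history, pitch_lag=2, length=5)
history = update_adaptive_codebook(history, [5.0, 6.0], max_history=4)
print(vector, history)
```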
The excitation vectors in the adaptive codebook 36 are associated with corresponding adaptive codebook indices. In one embodiment, the adaptive codebook indices may be equivalent to pitch lag values. The pitch estimator 32 initially determines a representative pitch lag in the neighborhood of the preferential pitch lag value or preferential adaptive index. A preferential pitch lag value minimizes an error signal at the output of the first summer 46, consistent with a codebook search procedure. The granularity of the adaptive codebook index or pitch lag is generally limited to a fixed number of bits for transmission over the air interface 64 to conserve spectral bandwidth. Spectral bandwidth may represent the maximum bandwidth of electromagnetic spectrum permitted to be used for one or more channels (e.g., downlink channel, an uplink channel, or both) of a communications system. For example, the pitch lag information may need to be transmitted in 7 bits for half-rate coding or 8 bits for full-rate coding of voice information on a single channel to comply with bandwidth restrictions. Thus, 128 states are possible with 7 bits and 256 states are possible with 8 bits to convey the pitch lag value used to select a corresponding excitation vector from the adaptive codebook 36.
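The relationship between the bit allocation and the number of representable pitch lag states may be verified with a short sketch. The mapping of a pitch lag to an index by subtracting a minimum lag, and the minimum lag value of 17 samples used in the example, are hypothetical; only the bit counts and the resulting state counts come from the text.

```python
def pitch_lag_states(bits):
    """Number of distinct values representable with `bits` bits."""
    return 2 ** bits


def encode_pitch_lag(pitch_lag, min_lag, bits):
    """Map an integer pitch lag to a codebook index (hypothetical mapping)."""
    index = pitch_lag - min_lag
    if not 0 <= index < pitch_lag_states(bits):
        raise ValueError("pitch lag not representable with this bit budget")
    return index


def decode_pitch_lag(index, min_lag):
    """Inverse of encode_pitch_lag."""
    return index + min_lag


half_rate_states = pitch_lag_states(7)   # half-rate coding
full_rate_states = pitch_lag_states(8)   # full-rate coding
index = encode_pitch_lag(60, min_lag=17, bits=7)
print(half_rate_states, full_rate_states, index)
```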
The encoder 11 may apply different excitation vectors from the adaptive codebook 36 on a frame-by-frame basis or a subframe-by-subframe basis. Similarly, the filter coefficients of one or more synthesis filters 42 may be altered or updated on a frame-by-frame basis. However, the filter coefficients preferably remain static during the search for or selection of each preferential excitation vector of the adaptive codebook 36 and the fixed codebook 50. In practice, a frame may represent a time interval of approximately 20 milliseconds and a sub-frame may represent a time interval within a range from approximately 5 to 10 milliseconds, although other durations for the frame and sub-frame fall within the scope of the invention.
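The frame and sub-frame durations may be converted to sample counts as sketched below. The 8 kHz sampling rate is an assumption typical of telephone-band speech and is not stated in the text; the function names are likewise hypothetical.

```python
def samples_per_interval(duration_ms, sample_rate_hz=8000):
    """Number of samples in an interval of `duration_ms` milliseconds."""
    return int(round(duration_ms * sample_rate_hz / 1000.0))


def split_into_subframes(frame, subframe_length):
    """Split a frame into equal, consecutive sub-frames."""
    if subframe_length <= 0 or len(frame) % subframe_length != 0:
        raise ValueError("frame length must be a multiple of the sub-frame length")
    return [
        frame[start:start + subframe_length]
        for start in range(0, len(frame), subframe_length)
    ]


frame_length = samples_per_interval(20)      # approximately 20 ms frame
short_subframe = samples_per_interval(5)     # 5 ms sub-frame
long_subframe = samples_per_interval(10)     # 10 ms sub-frame
subframes = split_into_subframes(list(range(frame_length)), short_subframe)
print(frame_length, short_subframe, long_subframe, len(subframes))
```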
The adaptive codebook 36 is associated with a first gain adjuster 38 for scaling the gain of excitation vectors in the adaptive codebook 36. The gains may be expressed as scalar quantities that correspond to respective excitation vectors. In an alternate embodiment, gains may be expressed as gain vectors, where the gain vectors are associated with different segments of the excitation vectors of the fixed codebook 50 or the adaptive codebook 36.
The first excitation generator 40 is coupled to a synthesis filter 42. The first excitation generator 40 may provide a long-term predictive component for a synthesized speech signal by accessing appropriate excitation vectors of the adaptive codebook 36. The synthesis filter 42 outputs a first synthesized speech signal based upon the input of a first excitation signal from the first excitation generator 40. In one embodiment, the first synthesized speech signal has a long-term predictive component contributed by the adaptive codebook 36 and a short-term predictive component contributed by the synthesis filter 42.
The first synthesized signal is compared to a weighted input speech signal. The weighted input speech signal refers to an input speech signal that has at least been filtered or processed by the pre-processing weighting filter 21. As shown in
The second excitation generator 58 may generate an excitation signal based on selected excitation vectors from the fixed codebook 50. The fixed codebook 50 may include excitation vectors that are modeled based on energy pulses, pulse position energy pulses, Gaussian noise signals, or any other suitable waveforms. The excitation vectors of the fixed codebook 50 may be geared toward reproducing the short-term variations or spectral envelope variation of the input speech signal. Further, the excitation vectors of the fixed codebook 50 may contribute toward the representation of noise-like signals, transients, residual components, or other signals that are not adequately expressed as long-term signal components.
The excitation vectors in the fixed codebook 50 are associated with corresponding fixed codebook indices 74. The fixed codebook indices 74 refer to addresses in a database, in a table, or references to another data structure where the excitation vectors are stored. For example, the fixed codebook indices 74 may represent memory locations or register locations where the excitation vectors are stored in electronic memory of the encoder 11.
The fixed codebook 50 is associated with a second gain adjuster 52 for scaling the gain of excitation vectors in the fixed codebook 50. The gains may be expressed as scalar quantities that correspond to respective excitation vectors. In an alternate embodiment, gains may be expressed as gain vectors, where the gain vectors are associated with different segments of the excitation vectors of the fixed codebook 50 or the adaptive codebook 36.
The second excitation generator 58 is coupled to a synthesis filter 42 (e.g., short-term predictive filter), which may be referred to as a linear predictive coding (LPC) filter. The synthesis filter 42 outputs a second synthesized speech signal based upon the input of an excitation signal from the second excitation generator 58. As shown, the second synthesized speech signal is compared to a difference error signal outputted from the first summer 46. The second synthesized signal and the difference error signal are inputted into the second summer 44 to obtain a residual signal at the output of the second summer 44. A minimizer 48 accepts the residual signal and minimizes the residual signal by adjusting (i.e., searching for and applying) the preferential selection of an excitation vector in the fixed codebook 50, by adjusting a preferential selection of the second gain adjuster 52 (e.g., second gain codebook), or by adjusting both of the foregoing selections. A preferential selection of the excitation vector and the gain scalar (or gain vector) apply to a subframe or an entire frame. The filter coefficients of the synthesis filter 42 remain fixed during the adjustment.
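The minimization performed on the residual signal may be sketched as an exhaustive search over candidate excitation vectors with a closed-form scalar gain. The function names, the toy codebook, the single LPC coefficient, and the omission of the fixed-codebook weighting filter 25 are simplifications assumed for illustration; the synthesis filter is taken as the all-pole filter 1/A(z) with A(z) = 1 + a1 * z^-1 + ..., and its coefficients are held fixed during the search.

```python
def synthesize(excitation, lpc):
    """All-pole synthesis: y[n] = x[n] - sum_k lpc[k-1] * y[n-k]."""
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for k, a in enumerate(lpc, start=1):
            if n - k >= 0:
                acc -= a * out[n - k]
        out.append(acc)
    return out


def search_codebook(target, codebook, lpc):
    """Return (index, gain, error) minimizing the squared residual."""
    best = None
    for index, vector in enumerate(codebook):
        synth = synthesize(vector, lpc)
        energy = sum(s * s for s in synth)
        if energy == 0.0:
            continue
        # Optimal scalar gain for this vector in the least-squares sense.
        gain = sum(t * s for t, s in zip(target, synth)) / energy
        error = sum((t - gain * s) ** 2 for t, s in zip(target, synth))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


lpc = [-0.5]                                  # hypothetical coefficients
codebook = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [1.0, 0.0, -1.0, 0.0]]
# Target built from entry 1 scaled by 2, so the search should recover it.
target = [2.0 * s for s in synthesize(codebook[1], lpc)]
best_index, best_gain, best_error = search_codebook(target, codebook, lpc)
print(best_index, best_gain, best_error)
```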
The LPC analyzer 30 provides filter coefficients for the synthesis filter 42 (e.g., short-term predictive filter). For example, the LPC analyzer 30 may provide filter coefficients based on the input of a reference excitation signal (e.g., no excitation signal) to the LPC analyzer 30. Although the difference error signal is applied to an input of the second summer 44, in an alternate embodiment, the weighted input speech signal may be applied directly to the input of the second summer 44 to achieve substantially the same result as described above.
The preferential selection of a vector from the fixed codebook 50 preferably minimizes the quantization error among other possible selections in the fixed codebook 50. Similarly, the preferential selection of an excitation vector from the adaptive codebook 36 preferably minimizes the quantization error among the other possible selections in the adaptive codebook 36. Once the preferential selections are made in accordance with
A transmitter 62 or a transceiver is coupled to the multiplexer 60. The transmitter 62 transmits the reference information from the encoder 11 to a receiver 128 via an electromagnetic signal (e.g., radio frequency or microwave signal) of a wireless system as illustrated in
The receiver 128 is coupled to a demultiplexer 68 for demultiplexing the reference information. In turn, the demultiplexer 68 is coupled to a decoder 120 for decoding the reference information into an output speech signal. As shown in FIG. 3, the decoder 120 receives reference information transmitted over the air interface 64 from the encoder 11. The decoder 120 uses the received reference information to create a preferential excitation signal. The reference information facilitates accessing of a duplicate adaptive codebook and a duplicate fixed codebook that correspond to those at the encoder 70. One or more excitation generators of the decoder 120 apply the preferential excitation signal to a duplicate synthesis filter. The same values or approximately the same values are used for the filter coefficients at both the encoder 11 and the decoder 120. The output speech signal obtained from the contributions of the duplicate synthesis filter and the duplicate adaptive codebook is a replica or representation of the input speech inputted into the encoder 11. Thus, the reference data is transmitted over an air interface 64 in a bandwidth-efficient manner because the reference data is composed of fewer bits, words, or bytes than the original speech signal inputted into the input section 10.
In an alternate embodiment, certain filter coefficients are not transmitted from the encoder to the decoder, where the filter coefficients are established in advance of the transmission of the speech information over the air interface 64 or are updated in accordance with internal symmetrical states and algorithms of the encoder and the decoder.
The synthesis filter 42 (e.g., a short-term synthesis filter) may have a response that generally conforms to the following equation:
where 1/A(z) is the filter response represented by a z transfer function, a_{i,revised} is a linear predictive coefficient, i = 1, ..., P, and P is the prediction or filter order of the synthesis filter. Although the foregoing filter response may be used, other filter responses for the synthesis filter 42 may be used. For example, the above filter response may be modified to include weighting or other compensation for input speech signals.
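As a rough illustration of an all-pole response 1/A(z), the sketch below filters an excitation sequence sample by sample. It assumes the sign convention A(z) = 1 + Σ a_i z^{-i} for i = 1, ..., P, which is not confirmed by the text above; under the opposite convention the coefficient signs flip. The function name `synthesis_filter` is hypothetical.

```python
def synthesis_filter(excitation, lpc):
    """Apply the all-pole filter 1/A(z) to an excitation sequence.

    Assumption: A(z) = 1 + sum_{i=1..P} lpc[i-1] * z^-i, so each output
    sample is s[n] = e[n] - sum_i lpc[i-1] * s[n-i].
    """
    order = len(lpc)
    output = []
    for n, sample in enumerate(excitation):
        acc = sample
        for i in range(1, order + 1):
            if n - i >= 0:  # samples before the start are taken as zero
                acc -= lpc[i - 1] * output[n - i]
        output.append(acc)
    return output


# First-order example: a unit impulse decays geometrically.
impulse_response = synthesis_filter([1.0, 0.0, 0.0, 0.0], [-0.5])
```

With the single coefficient -0.5, the impulse response halves at every sample, which is the expected behaviour of a one-pole filter with its pole at 0.5.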
If the response of the synthesis filter 42 of the encoder 11 is expressed as 1/A(z), a response of a corresponding analysis filter of the decoder 120 or the LPC analyzer 30 is expressed as A(z) in accordance with the following equation:
where a_{i,modified} is the non-quantized equivalent of a_{i,revised}. Thus, the same or similar bandwidth expansion constants or filter coefficients may be applied to a synthesis filter 42, a corresponding analysis filter, or both. During coding, the analysis filter coefficients (i.e., a_{i,modified}) undergo bandwidth expansion and are then quantized. Synthesis filter coefficients (i.e., a_{i,revised}) are derivable from the expanded, quantized analysis filter coefficients.
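Bandwidth expansion of linear predictive coefficients is commonly realized by scaling the i-th coefficient by the i-th power of an expansion constant, which is equivalent to evaluating A(z/γ). The sketch below assumes that common form; the text above does not give the expansion constant, so the value used in the example is a placeholder, and the quantization step is omitted.

```python
def bandwidth_expand(lpc, gamma):
    """Return the coefficients of A(z/gamma).

    lpc[i-1] holds a_i for i = 1..P; the i-th coefficient is scaled by
    gamma**i, which pulls the roots of A(z) toward the origin and so
    widens the resonances of 1/A(z).
    """
    return [a * gamma ** (i + 1) for i, a in enumerate(lpc)]


# Placeholder constant of 0.5, chosen only to keep the arithmetic obvious.
expanded = bandwidth_expand([1.0, 1.0], 0.5)
```

A constant of 1 leaves the coefficients unchanged, and smaller constants apply progressively stronger expansion.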
The encoder 11 may encode speech differently in accordance with differences in the detected spectral characteristics of the input speech. If the spectral response is regarded as generally sloped in accordance with a defined characteristic slope (e.g., first spectral response), the pre-processing weighting filter 21 may use a first value for the weighting constant (e.g., α=0.2). On the other hand, if the spectral response is regarded as generally flat (e.g., second spectral response), the pre-processing weighting filter 21 may use a second value for the weighting constant (e.g., α=0) distinct from the first value of the weighting constant. The first value of the weighting constant is an example of a first coding parameter value and the second value of the weighting constant is an example of a second coding parameter value.
In one embodiment, the encoder of
In one embodiment, the pre-processing weighting filter 21 comprises a core weighting filter component and a low-pass filter component. Further, the low-pass filter component may be selectively activated or deactivated in response to the spectral content of the input speech signal. The activation of the low-pass filter component may be used to enhance the periodicity of the modified weighted speech signal, derived from the input speech signal.
In one example, the filter response for the pre-processing weighting filter may be expressed as the following equation:
where 1/A(z) is an LPC synthesis filter response, α is a low-pass adaptive coefficient, and γ1 and γ2 are constant coefficients. In an alternate embodiment, γ1 and γ2 may represent adaptive coefficients, rather than constant coefficients. The core weighting component of the above pre-processing filter equation is:
The low-pass filter component of the above equation is 1+αZ^{-1}.
In one illustrative embodiment, the low-pass adaptive coefficient α has a value between 0 and 0.3. Further, γ1 may fall within a range between 0.9 and 0.97, whereas γ2 may fall within a range between 0.4 and 0.6.
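The following sketch shows one way the coefficients of such a pre-processing weighting filter might be assembled. The equation is not reproduced above, so the sketch assumes the conventional perceptual weighting form, core = A(z/γ1)/A(z/γ2), multiplied by a first-order low-pass term 1 + αz^{-1}, with A(z) = 1 + Σ a_i z^{-i}. The function names and the default values (taken from inside the stated ranges) are hypothetical.

```python
def expand(lpc, gamma):
    """Coefficients of A(z/gamma), leading 1 included (a_i scaled by gamma**i)."""
    return [1.0] + [a * gamma ** (i + 1) for i, a in enumerate(lpc)]


def polymul(p, q):
    """Multiply two polynomials in z^-1 given as coefficient lists."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            out[i + j] += p_i * q_j
    return out


def preprocessing_weighting(lpc, gamma1=0.95, gamma2=0.5, alpha=0.2,
                            lowpass_active=True):
    """Return (numerator, denominator) coefficient lists of the assumed
    response [A(z/gamma1) / A(z/gamma2)] * (1 + alpha * z^-1)."""
    core_num = expand(lpc, gamma1)
    core_den = expand(lpc, gamma2)
    # The core component is identical in both cases; only the low-pass
    # factor is switched in or out.
    num = polymul(core_num, [1.0, alpha]) if lowpass_active else core_num
    return num, core_den


active = preprocessing_weighting([-0.8], gamma1=0.5, gamma2=0.25, alpha=0.5)
inactive = preprocessing_weighting([-0.8], gamma1=0.5, gamma2=0.25, alpha=0.5,
                                   lowpass_active=False)
```

Switching the low-pass factor off changes only the numerator; the denominator, and hence the core weighting, is the same in both results.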
In one embodiment, the adaptive codebook weighting filter comprises the core weighting filter component. In one example, the adaptive codebook weighting filter may be expressed as the following equation.
where 1/A(z) is the LPC synthesis filter response, γ1 and γ2 are constant coefficients. In an alternate embodiment, γ1 and γ2 may represent adaptive coefficients, rather than constant coefficients.
In one illustrative embodiment, γ1 may fall within a range between 0.9 and 0.97, whereas γ2 may fall within a range between 0.4 and 0.6.
In one embodiment, the fixed codebook weighting filter 25 comprises a core weighting filter component and a high-pass filter component. Further, the high-pass filter component may be selectively activated or deactivated in response to the spectral content of the speech signal to improve the spectral characteristics of the encoded and reproduced speech signals.
In one example, the filter response for the fixed-codebook weighting filter 25 may be expressed as the following equation:
where 1/A(z) is the LPC synthesis filter response, μ is a high-pass adaptive coefficient, and γ1 and γ2 are constant coefficients. In an alternate embodiment, γ1 and γ2 may represent adaptive coefficients rather than constant coefficients. The core weighting component of the fixed codebook filter of the above equation is
The high-pass filter component of the above equation is 1−μZ^{-1}.
In one illustrative embodiment, the high-pass adaptive coefficient μ has a value between 0 and 0.5. Further, γ1 may fall within a range between 0.9 and 0.97, whereas γ2 may fall within a range between 0.4 and 0.6.
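The high-pass character of a first-order term of the form 1 − μz^{-1} can be checked by evaluating its magnitude at zero frequency and at half the sampling rate. The sketch below does so for μ = 0.5, the upper end of the stated range; the helper name `first_order_gain` is hypothetical.

```python
import cmath
import math


def first_order_gain(coefficient, omega):
    """Magnitude of 1 + coefficient * z^-1 evaluated at z = exp(j * omega)."""
    return abs(1 + coefficient * cmath.exp(-1j * omega))


mu = 0.5
dc_gain = first_order_gain(-mu, 0.0)           # gain at 0 Hz: 1 - mu
nyquist_gain = first_order_gain(-mu, math.pi)  # gain at half the sampling rate: 1 + mu
```

The gain rises from 1 − μ at zero frequency to 1 + μ at half the sampling rate, so larger values of μ give stronger high-frequency emphasis. A term with a positive coefficient behaves in the mirrored, low-pass manner.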
In an alternate embodiment, the frequency response of the perceptual weighting filter (21, 23, or 25) may be expressed generally as the following equation:
where α is a weighting constant, ρ and β are preset coefficients (e.g., values from 0 to 1), P is the predictive order or the filter order of the perceptual weighting filter 20, and {ai} is the linear predictive coding coefficient. The perceptual weighting filter 21 controls the value of α based on the spectral response of the input speech signal.
For example, in the adjusting or selection of preferential coding parameter values, different values of the weighting constant α may be selected to adjust the frequency response of the perceptual weighting filter in response to the determined slope or flatness of the speech signal. In one embodiment, α approximately equals 0.2 for generally sloped input speech consistent with the MIRS spectral response or a first spectral response. Similarly, in one embodiment α approximately equals 0 for an input speech signal with a generally flat signal response or a second spectral response.
A multi-rate encoder may include different encoding schemes to attain different transmission rates over an air interface. Each different transmission rate may be achieved by using one or more encoding schemes. The highest coding rate may be referred to as full-rate coding. A lower coding rate may be referred to as one-half-rate coding where the one-half-rate coding has a maximum transmission rate that is approximately one-half the maximum rate of the full-rate coding. An encoding scheme may include an analysis-by-synthesis encoding scheme in which an original speech signal is compared to a synthesized speech signal to optimize the perceptual similarities or objective similarities between the original speech signal and the synthesized speech signal. A code-excited linear predictive coding scheme (CELP) is one example of an analysis-by-synthesis encoding scheme. Although the signal processing system of the invention is primarily described in conjunction with an encoder 11 that is well-suited for full-rate coding and half-rate coding, the signal processing system of the invention may be applied to lesser coding rates than half-rate coding or other coding schemes.
In an alternate embodiment, the location of each weighting filter may be described with reference to one or more modules (e.g., the pitch pre-processing module 22, synthesis filter 42, or synthesis filter 56) or signal paths that interconnect the modules within the encoder (11 or 111). The physical or logical signal paths may be indicated by the arrows in
The method of
In step S100, an encoder (e.g., 11 or 111) or a spectral detector 221 determines whether the spectral content of an input speech signal is representative of a defined spectral characteristic. For example, the spectral detector 221 or a spectral analyzer may determine whether or not the input speech signal has a defined spectral slope as the defined spectral characteristic. The defined spectral slope may comprise an MIRS response, an IRS response, the first spectral response, the second spectral response, the third spectral response, or some other spectral response.
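The text does not specify how the spectral detector 221 measures slope. As one possible illustration, the sketch below uses the normalized first autocorrelation coefficient as a crude tilt measure: values near +1 indicate energy concentrated at low frequencies, while lower or negative values indicate relatively more high-frequency energy, as would be expected after a response whose amplitude rises with frequency. The measure, the function names, and the threshold value are all assumptions introduced here.

```python
def normalized_first_autocorrelation(signal):
    """Return r(1) / r(0) for the signal, or 0.0 for an all-zero signal."""
    energy = sum(x * x for x in signal)
    if energy == 0:
        return 0.0
    lag_one = sum(signal[n] * signal[n - 1] for n in range(1, len(signal)))
    return lag_one / energy


def has_defined_slope(signal, tilt_threshold=0.5):
    """Hypothetical decision rule: treat the signal as having the defined
    (high-frequency-emphasized) slope when the tilt measure falls below
    the threshold. The threshold value is a placeholder."""
    return normalized_first_autocorrelation(signal) < tilt_threshold


slowly_varying = has_defined_slope([1.0] * 8)        # low-frequency dominated
rapidly_varying = has_defined_slope([1.0, -1.0] * 4)  # high-frequency dominated
```

In practice such a measure would likely be smoothed over many frames before driving any filter control, since a single frame of speech may be tilted for reasons unrelated to the transmission path.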
In step S102, an encoder (e.g., 11 or 111) or a controller 27 controls a filter parameter (e.g., coefficient) or a filter response of a weighting filter (e.g., 21, 23 and 25) based on one or more of the following: (1) the determination of the spectral content of the speech signal and (2) the affiliation of the weighting filter in the encoder 11 with a particular location, portion or section of the encoder 11. For example, the controller 27 may control a frequency-specific filter component of a subject weighting filter (e.g., 21, 23 or 25) based on the determination of the spectral content of the speech signal and/or the location of a subject weighting filter in the encoder (11 or 111).
To control a filter response of a weighting filter in step S102, the controller 27 may control a frequency-specific filter component of the weighting filter. The control of the weighting filters (e.g., 21, 23 and 25) may differ with the identity of the weighting filters. With respect to a low-pass filter component of a pre-processing weighting filter 21, the controller 27 may control the pre-processing weighting filter 21 based on the determination of the spectral content of the speech signal. If the spectral detector 221 determines that the spectral content of the input speech signal is consistent with a low-frequency energy that falls below a low frequency energy threshold, the controller 27 may activate a low-pass filter component of a pre-processing weighting filter 21 to change a spectral response of the pre-processing weighting filter 21.
Alternately, the controller 27 may change filter parameters of a low-pass filter component of a pre-processing weighting filter 21 to increase filtering or attenuation of the low pass filter component, if the spectral detector 221 determines that the spectral content of the input speech signal is consistent with a low frequency energy that falls below a low frequency energy threshold.
With respect to a high-pass filter component of a fixed codebook weighting filter 25, the controller 27 may control the high-pass filter component based on the determination of the spectral content of the speech signal. For example, the controller 27 may control a high-pass filter component of a fixed codebook weighting filter 25 in response to the detection or absence of a noisy speech component or undesired noise (e.g., background noise) of the input speech signal. Undesired noise means an unwanted noise signal or background noise, as opposed to a desired noisy speech component that contributes to the accurate reproduction of a speech signal. If the spectral detector 221 detects an undesired noise level (e.g., an undesired background noise level) that meets or exceeds a minimum threshold level, the controller 27 may activate or otherwise invoke the high pass filter component to attenuate or remove the undesired noise (e.g., undesired background noise). However, if the undesired noise level (e.g., undesired background noise level) is less than the minimum threshold level, the high pass filter component is deactivated or decreased.
In an alternate embodiment, if the spectral detector 221 or the speech characteristic classifier 26 detects a noisy speech component that meets or exceeds a minimum threshold level (i.e., magnitude) over a certain spectral range, the controller 27 may activate or control a response (e.g., a complex response, as opposed to a high pass response) of a fixed codebook weighting filter 25 to maximize or increase the bandwidth (e.g., higher fidelity) of the reproduced speech signal.
In step S104, a core weighting filter component of the weighting filter is maintained regardless of the spectral content of the input speech signal. In one embodiment, even if the frequency specific component of the weighting filter was adjusted in step S102, the core weighting filter component is kept the same in step S104. In one configuration, the core weighting filter component may be defined by a filter response that does not lead to a perceptual degradation of the reproduced speech signal, even if the spectral response of the input speech signal varies or departs from a generally flat spectral response.
In an alternate embodiment, one or more filter parameters of the core weighting filter component may be changed in response to the spectral content of the input speech signal to enhance the perceptual quality of the reproduced speech. The core weighting filter component may be associated with one or more of the following: a pre-processing weighting filter 21, a fixed codebook weighting filter 25, and an adaptive-codebook weighting filter 23.
where 1/A(z) is an LPC synthesis filter response and γ1 and γ2 are constant coefficients. Conversely, if active, the pre-processing weighting filter 21 conforms to a second filter response of:
where 1/A(z) is an LPC synthesis filter response, α is a low-pass adaptive coefficient, and γ1 and γ2 are constant coefficients.
The method of
In step S12, a controller 27 of the encoder (e.g., 11 or 111) controls a low-pass filter component of a pre-processing weighting filter 21 based on the determination of the spectral content of the input speech signal. The pre-processing weighting filter 21 adapts in response to the spectral content of the input speech signal.
Step S12 may be carried out in accordance with several alternative techniques, which may or may not overlap in their scope. Under a first technique for executing step S12, if the spectral tilt of the speech signal is consistent with an MIRS or an IRS spectral response, the controller 27 activates or increases the contribution of the low-pass filter component of the pre-processing filter 21.
Under a second technique for executing step S12, if the spectral detector 221 detects or determines that the spectral tilt of the input speech signal is consistent with a low frequency energy that falls below a low frequency energy threshold, the controller 27 activates or increases the contribution of the low pass filter component of the pre-processing filter 21. However, if the detector 24 determines that the spectral tilt of the speech signal is consistent with a low frequency energy that meets or exceeds a low frequency energy threshold, the controller 27 deactivates, bypasses or decreases the contribution of the low pass filter component in the digital domain. The activation, deactivation, or bypass of the low-pass filter component is readily realized in the digital domain by digital signal processing or otherwise.
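The second technique may be sketched as a comparison of low-band energy against a threshold. The sketch below is illustrative only: it measures the fraction of frame energy below a cutoff bin of a discrete Fourier transform and returns a nonzero low-pass coefficient only when that fraction falls below the threshold. The cutoff bin, the threshold, the returned coefficient value, and the function names are assumptions and are not taken from the text.

```python
import cmath
import math


def band_energies(signal):
    """Squared DFT magnitudes for bins 0..N/2 of a real-valued frame."""
    n = len(signal)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * t / n)
                for t, s in enumerate(signal))) ** 2
        for k in range(n // 2 + 1)
    ]


def lowpass_alpha(signal, cutoff_bin, threshold, alpha_on=0.2):
    """Return alpha_on when the low-band energy fraction is below the
    threshold (low-pass component activated), else 0.0 (deactivated)."""
    energies = band_energies(signal)
    total = sum(energies)
    if total == 0:
        return 0.0  # silent frame: leave the component inactive (assumption)
    low_fraction = sum(energies[:cutoff_bin]) / total
    return alpha_on if low_fraction < threshold else 0.0


low_tone = [math.cos(2 * math.pi * t / 16) for t in range(16)]  # energy in bin 1
high_tone = [1.0, -1.0] * 8                                     # energy in bin 8
alpha_for_low_tone = lowpass_alpha(low_tone, cutoff_bin=4, threshold=0.5)
alpha_for_high_tone = lowpass_alpha(high_tone, cutoff_bin=4, threshold=0.5)
```

Returning 0.0 corresponds to deactivating or bypassing the low-pass component, since a first-order term with a zero coefficient reduces to unity gain.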
Accordingly, the control of the low-pass filter component facilitates the maintenance of a generally periodic nature of a speech signal. The pre-processing weighting filter 21 has a spectral response that is designed to maintain the generally periodic component of the input speech signal. If the periodic nature of the speech signal is maintained, the open-loop pitch search and coding may be executed with greater efficiency. In general, periodic speech signals may be represented accurately with fewer bits, for transmission over the air interface, than nonperiodic speech signals require for the same level of perceptual quality of the reproduced speech.
In an alternate embodiment of step S12, filter parameters of the pre-processing weighting filter 21 are changed in response to detection of the presence or the absence of a spectral tilt in the input speech signal. For example, if the detector determines that the spectral tilt of the input speech signal is consistent with a low frequency energy that falls below a low frequency energy threshold, the filter parameters of the pre-processing weighting filter 21 are changed to activate or increase the contribution of a low-pass filter component of the pre-processing filter. However, if the detector determines that the spectral tilt of the speech signal is consistent with a low frequency energy that meets or exceeds a low frequency energy threshold, the filter parameters of the pre-processing filter are changed to deactivate or decrease the contribution of the low-pass filter component of the pre-processing filter.
In step S14 after step S12, the encoder maintains a core weighting filter component of the pre-processing weighting filter 21 regardless of the spectral content of the speech signal. Accordingly, even though the low-pass filter component of the pre-processing weighting filter 21 may be changed, the core weighting filter component of the pre-processing weighting filter 21 may remain the same.
In one embodiment, the adaptive codebook weighting filter may be adjusted in addition to the pre-processing weighting filter 21. The adaptive codebook filter may comprise a core weighting filter component. The weighting filter may be controlled in accordance with several alternate control techniques following step S10 or elsewhere in the method of
where 1/A(z) is the LPC synthesis filter response, μ is a high-pass adaptive coefficient, and γ1 and γ2 are constant coefficients. In the above equation, the weighting filter component is
and the high-pass filter component is (1−μZ^{-1}). Like steps or procedures in
The method of
The spectral detector 221 may detect whether a noisy speech component or an undesired background noise exceeds a high frequency energy threshold over a certain defined range. In one embodiment, the spectral detector 221 may determine whether a spectral content of the speech signal is tilted such that the high frequency components have a greater magnitude than the lower frequency components as information for deciding how to control the filtering of the high-pass filter component.
In step S18, a controller 27 of the encoder (e.g., 11 or 111) controls a high-pass filter component of a fixed codebook weighting filter 25 based on one or more of the following: (1) the determination of the spectral content (of step S16) of the speech signal, (2) the detection of the presence of the background noise in speech signal, and (3) the detection of the presence of the noisy speech component in the speech signal. For example, if the detected background noise level meets or exceeds a minimum threshold in a certain spectral range, the presence of background noise is detected and the high-pass filter component of the fixed codebook weighting filter 25 may be activated or otherwise invoked to suppress the unwanted background noise. However, if the detected background noise level falls below the minimum threshold, the high pass filter component may be deactivated or made inactive to maximize the bandwidth of the output speech signal and to maintain the high frequency energy of a noisy speech component.
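The threshold decision of step S18 may be sketched as follows. The text does not say how the background noise level is estimated; the sketch assumes a simple noise-floor estimate, namely the minimum mean frame energy over the available frames, and ignores the restriction to a particular spectral range. The function names, the estimator, and the coefficient value returned when active are assumptions.

```python
def frame_energies(signal, frame_len):
    """Mean energy of each complete, non-overlapping frame."""
    return [
        sum(x * x for x in signal[start:start + frame_len]) / frame_len
        for start in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def highpass_mu(signal, frame_len, noise_threshold, mu_on=0.3):
    """Return mu_on when the estimated noise floor meets or exceeds the
    threshold (high-pass component activated), else 0.0 (deactivated)."""
    energies = frame_energies(signal, frame_len)
    # Assumption: the quietest frame approximates the background noise level.
    noise_floor = min(energies) if energies else 0.0
    return mu_on if noise_floor >= noise_threshold else 0.0


clean = [0.0] * 4 + [1.0] * 4  # one silent frame, so the noise floor is zero
noisy = [0.5] * 4 + [1.0] * 4  # the quietest frame still carries energy
mu_clean = highpass_mu(clean, frame_len=4, noise_threshold=0.01)
mu_noisy = highpass_mu(noisy, frame_len=4, noise_threshold=0.01)
```

A coefficient of 0.0 leaves the first-order high-pass term at unity gain, which corresponds to the deactivated case in which the bandwidth of the output speech signal is not reduced.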
Step S18 may be carried out as follows. If the high pass filter component is deactivated or inactive the fixed codebook weighting filter 25 has the response of
Conversely, if the high pass filter component is activated or active, the fixed codebook weighting filter 25 has the response of
The fixed codebook weighting filter 25 may activate or deactivate the high-pass filter component (e.g., 1−μZ^{-1}) in response to the detection or absence of at least one of a noisy speech component and background noise of the input speech. The high-pass filter component is arranged to increase the bandwidth of the output speech signal so that the output speech sounds more natural. If the detector or speech classifier 26 determines that the input speech signal has a noisy speech component of sufficient magnitude over a spectral range, the high pass filter component may be controlled (e.g., changed to inactive or activated in a frequency selective manner with respect to the spectral range) to maximize the bandwidth of the output speech signal and to maintain the high frequency energy.
In an alternate embodiment, filter parameters of the fixed codebook weighting filter 25 are changed in response to detection of the presence or the absence of a noisy speech component in the input speech signal. For example, if the detector (24 or 221) or speech classifier 26 determines that the high frequency range of the input speech signal is consistent with a high frequency energy that contains background noise components, the filter parameters of the fixed-codebook weighting filter are changed to activate or increase the contribution of a high-pass filter component of the fixed-codebook weighting filter. However, if the detector (24 or 221) or speech classifier 26 determines that the spectral content of the speech signal is consistent with a high frequency energy that does not have a background noise component, the filter parameters of the fixed codebook weighting filter 25 are changed to deactivate or decrease the contribution of the high-pass filter component.
In step S14 after step S18, the encoder maintains a core weighting filter component of the fixed-codebook weighting filter 25 regardless of the spectral content of the speech signal. Accordingly, even though the high-pass filter component of the fixed codebook weighting filter 25 may be changed, the core weighting component may remain static or unchanged. Similarly, the controller 27 may change a first filter response or first set of filter parameters of one weighting filter, without changing a second filter response or a second set of filter parameters for another weighting filter.
In one embodiment, the adaptive codebook weighting filter 23 may comprise a core weighting filter component. The adaptive codebook weighting filter 23 may be controlled in accordance with several alternate control techniques. Under a first control technique, the core weighting filter component of the adaptive codebook weighting filter 23 is static. Under a second control technique, the filter parameters of the core weighting filter component may be adaptive to improve the searching of the adaptive codebook.
While various embodiments of the invention have been described, it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible that are within the scope of this invention. Accordingly, the invention is not to be restricted except in light of the attached claims and their equivalents.
This application claims the benefit of provisional application Ser. No. 60/233,044, entitled SIGNAL PROCESSING SYSTEM FOR FILTERING SPECTRAL CONTENT OF A SIGNAL FOR SPEECH ENCODING, filed on Sep. 15, 2000 under 35 U.S.C. 119(e).