1. Field of the Invention
The present invention relates to speech coding algorithms and, more particularly, to a Phase Excited Linear Predictive (PELP) low bit rate speech synthesizer and a pitch detector for a PELP synthesizer.
2. Background of Related Art
Mobile communications are growing at a phenomenal rate due to the success of several different second-generation digital cellular technologies, including GSM, TDMA and CDMA. To improve data throughput and sound quality, considerable effort is being devoted to the development of speech coding algorithms. Indeed, speech coding is applicable to a wide range of applications, including mobile telephony, internet phones, automatic answering machines, secure speech transmission, storing and archiving speech and voice paging networks.
Waveform codecs are capable of providing good quality speech at bit rates down to about 16 kbits/s, but are of limited use at rates lower than 16 kbits/s. Vocoders, on the other hand, can provide intelligible speech at 2.4 kbits/s and below, but cannot provide natural sounding speech at any bit rate. Hybrid codecs attempt to fill the gap between waveform codecs and source codecs (vocoders). The most commonly used hybrid codecs are time domain Analysis-by-Synthesis (AbS) codecs. Such codecs use the same linear prediction filter model of the vocal tract as found in Linear Predictive Coding (LPC) vocoders. However, instead of applying a simple two-state, voiced/unvoiced, model to find the necessary filter input, the excitation signal is chosen by matching the reconstructed speech waveform as closely as possible to the original speech waveform.
The distinguishing feature of AbS codecs is how the excitation waveform for the synthesis filter is chosen. AbS codecs split the input speech to be coded into frames, typically about 20 ms long. For each frame, parameters are determined for a synthesis filter, and then the excitation to the synthesis filter is determined by finding the excitation signal which when passed into the synthesis filter minimizes the error between the input speech and the reconstructed speech. Thus, the encoder analyses the input speech by synthesizing many different approximations to the input speech. For each frame, the encoder transmits information representing the synthesis filter parameters and the excitation to the decoder and, at the decoder, the given excitation is passed through the synthesis filter to generate the reconstructed speech. However, the numerical complexity involved in passing every possible excitation signal through the synthesis filter is quite large and thus, must be reduced, but without significantly compromising the performance of the codec.
The synthesis filter is usually an all pole, short-term, linear filter intended to model the correlations introduced into speech by the action of the vocal tract. The synthesis filter may also include a pitch filter to model the long-term periodicities present in voiced speech. Alternatively these long-term periodicities may be exploited by using an adaptive codebook in the excitation generator so that the excitation signal includes a component of the estimated pitch period.
There are various kinds of AbS codecs, such as Multi-Pulse Excited (MPE), Regular-Pulse Excited (RPE), and Code-Excited Linear Predictive (CELP). Generally MPE and RPE codecs will work without a pitch filter, although their performance will be improved if one is included. For CELP codecs a pitch filter is extremely important.
The differences between MPE, RPE and CELP codecs arise from the representation of the excitation signal. In MPE codecs, the excitation signal is given by a fixed number of non-zero pulses for every frame of speech. The positions of these non-zero pulses within the frame and their amplitudes must be determined by the encoder and transmitted to the decoder. In theory it is possible to find the best values for all the pulse positions and amplitudes, but this is not practical due to the excessive complexity required. In practice some sub-optimal method of finding the pulse positions and amplitudes must be used. Typically about 4 pulses per 5 ms can be used for good quality reconstructed speech at a bit-rate of around 10 kbits/s.
Like the MPE codec, the RPE codec uses a number of non-zero pulses to represent the excitation signal. However, the pulses are regularly spaced at a fixed interval, and the encoder only needs to determine the position of the first pulse and the amplitude of all the pulses. Therefore less information needs to be transmitted about pulse positions, so for a given bit rate the RPE codec can use more non-zero pulses than the MPE codec. For example, at a bit rate of about 10 kbits/s around 10 pulses per 5 ms can be used, compared to 4 pulses for MPE codecs. This allows RPE codecs to give slightly better quality reconstructed speech than MPE codecs.
Although MPE and RPE codecs provide good quality speech at rates of around 10 kbits/s and higher, they are not suitable for lower rates due to the large amount of information that must be transmitted about the excitation pulses' positions and amplitudes. If the bit rate is reduced by using fewer pulses or by coarsely quantizing the pulse amplitudes, the reconstructed speech quality deteriorates rapidly.
Currently the most commonly used algorithm for producing good quality speech at rates below 10 kbits/s is CELP. CELP differs from MPE and RPE in that the excitation signal is effectively vector quantized. The excitation signal is given by an entry from a large vector quantizer codebook and a gain term to control its power. The codebook index is represented with about 10 bits and the gain is coded with about 5 bits. Thus, about 15 bits are needed to transmit the excitation information. CELP coding has been used to produce toll quality speech communications at bit rates between 4.8 and 16 kbits/s.
It is an object of the present invention to provide an efficient speech coding algorithm operable at low bit rates yet capable of reproducing high quality speech.
The present invention provides a speech encoder including a content extraction module, a pitch detector, and a naturalness enhancement module. The content extraction module includes a band pass filter that receives a speech input signal and generates a band limited speech signal. A first speech buffer connected to the band pass filter stores the band limited speech signal. An LP analysis block, connected to the first speech buffer, reads the stored speech signal and generates a plurality of LP coefficients therefrom. An LPC to LSF block connected to the LP analysis block converts the LP coefficients to a line spectral frequency (LSF) vector. An LP analysis filter connected to the LPC to LSF block extracts an LP residual signal from the LSF vector. An LSF quantizer connected to the LPC to LSF block receives the LSF vector and determines an LSF index therefor. The pitch detector is connected to the LP analysis block of the content extraction module. The pitch detector classifies the band filtered speech signal as one of a voiced signal and an unvoiced signal. The naturalness enhancement module is connected to the content extraction module and the pitch detector. The naturalness enhancement module includes a means for extracting parameters from the LP residual signal, where for an unvoiced signal the extracted parameters include pitch and gain and for a voiced signal the extracted parameters include pitch, gain and excitation level. A quantizer quantizes the extracted parameters and generates quantized parameters.
In another embodiment, the present invention provides a content extraction module for a speech encoder. The content extraction module includes a band pass filter that receives a speech input signal and generates a band limited speech signal, and a first speech buffer connected to the band pass filter that stores the band limited speech signal. An LP analysis block connected to the first speech buffer reads the stored speech signal and generates a plurality of LP coefficients therefrom. An LPC to LSF block connected to the LP analysis block converts the LP coefficients to a line spectral frequency (LSF) vector. An LP analysis filter connected to the LPC to LSF block extracts an LP residual signal from the LSF vector, and an LSF quantizer connected to the LPC to LSF block receives the LSF vector and determines an LSF index therefor.
In a further embodiment, the present invention provides a naturalness enhancement module for a speech encoder, where the speech encoder includes a pitch detector for determining whether an input speech signal is a voiced signal or an unvoiced signal and a content extraction module for generating an LP residual signal from the input speech signal. The naturalness enhancement module includes a means for extracting parameters from the LP residual signal, where for an unvoiced signal the extracted parameters include pitch and gain and for a voiced signal the extracted parameters include pitch, gain and excitation level, and a quantizer for quantizing the extracted parameters and generating quantized parameters.
In a further embodiment, the present invention provides a pitch detector for a speech encoder. The pitch detector includes a first operation level for analyzing a speech signal and, based on a first predetermined ambiguity value of the speech signal, generating a first estimated pitch period. A second operation level analyzes the speech signal and, based on a second predetermined ambiguity value of the speech signal, generates a second estimated pitch period.
In yet another embodiment, the present invention provides a speech signal preprocessor for preprocessing an input speech signal prior to providing the speech signal to a speech encoder. The preprocessor includes a band pass filter that receives the speech input signal and generates a band limited speech signal, and a scale down unit connected to the band pass filter for limiting a dynamic range of the band limited speech signal.
The present invention also provides a method of encoding a speech signal, including the steps of filtering the speech signal to limit its bandwidth, fragmenting the filtered speech signal into speech segments, and decomposing the speech segments into a spectral envelope and an LP residual signal. The spectral envelope is represented by a plurality of LP filter coefficients (LPC). Then, the LPC are converted into a plurality of line spectral frequencies (LSF) and each speech segment is classified as one of a voiced segment and an unvoiced segment based on a pitch of the segment. Next, parameters are extracted from the LP residual signal, where for an unvoiced segment the extracted parameters include pitch and gain and for a voiced segment the extracted parameters include pitch, gain and excitation level. Finally, the extracted parameters are quantized to generate quantized parameters.
The foregoing summary, as well as the following detailed description of preferred embodiments of the invention, will be better understood when read in conjunction with the appended drawings. For the purpose of illustrating the invention, there is shown in the drawings embodiments that are presently preferred. It should be understood, however, that the invention is not limited to the precise arrangements and instrumentalities shown. In the drawings:
a is a schematic block diagram of a naturalness enhancement module for an unvoiced signal of a PELP encoder in accordance with the present invention;
b is a schematic block diagram of a naturalness enhancement module for a voiced signal of a PELP encoder in accordance with the present invention;
The detailed description set forth below in connection with the appended drawings is intended as a description of the presently preferred embodiments of the invention, and is not intended to represent the only forms in which the present invention may be practiced. It is to be understood that the same or equivalent functions may be accomplished by different embodiments that are intended to be encompassed within the spirit and scope of the invention. In the drawings, like numerals are used to indicate like elements throughout.
The present invention is directed to a low bit rate Phase Excited Linear Predictive (PELP) speech synthesizer. In PELP coding, a speech signal is classified as either voiced speech or unvoiced speech and then different coding schemes are used to process the two signals.
For voiced speech, the voiced speech signal is decomposed into a spectral envelope and a speech excitation signal. An instantaneous pitch frequency is updated, for example every 5 ms, to obtain a pitch contour. The pitch contour is used to extract an instantaneous pitch cycle from the speech excitation signal. The instantaneous pitch cycle is used as a reference to extract the excitation parameters, including gain and excitation level. The spectral envelope, instantaneous pitch frequency, gains and excitation level are quantized. For unvoiced speech, a spectral envelope and gain are used, together with an unvoiced indicator.
A decoder is used to synthesize the voiced speech signal. A Linear Predictive (LP) excitation signal is constructed using a deterministic signal and a noisy signal. The LP excitation signal is then passed through a synthesis filter to generate the synthesized speech signal. To synthesize the unvoiced speech signal, a unity-power white-Gaussian noise sequence is generated and normalized to the gains to form an unvoiced excitation signal. The unvoiced excitation signal is then passed through a LP synthesis filter to generate a synthesized speech signal.
PELP coding uses linear predictive coding and mixed speech excitation to produce a natural synthesized speech signal. Different from other linear prediction based coders, the mixed speech excitation is obtained by adjusting only the phase information. The phase information is obtained using a modified speech production model. Using the modified speech production model, the information required to characterize a speech signal is reduced, which reduces the data sent over the channel. The present invention allows a natural speech signal to be synthesized with few data bits, such as at bit rates from 2.0 kb/s to below 1.0 kb/s.
The present invention further provides a pitch detector for the PELP coder. The pitch detector is used to classify a speech frame as either voiced or unvoiced. For voiced speech, the pitch frequency of the voiced sound is estimated. The pitch detector is a key component of the PELP coder.
Referring now to the drawings,
The purpose of the content extraction module 100 is to extract the information content from an input speech signal s'(n). The content extraction module 100 has a pre-processing unit that includes a band pass filter (BPF) 110, a scale down unit 112, and a first speech buffer 113. The input speech signal s'(n) is provided to the BPF 110, which limits the input speech signal s'(n) to a band from about 150 Hz to 3400 Hz. Preferably, the BPF 110 uses an eighth order IIR filter. The aim of the lower cut-off is to reject low frequency disturbances, which could be perceptually very sensitive. The upper cut-off is to attenuate the signals at the higher frequencies. The 8th order IIR filter may be formed using a 4th order low-pass section and a 4th order high-pass section. The transfer functions of the low-pass and high-pass sections are defined in equations (1) and (2), respectively.
The BPF 110 thus produces a band-limited speech signal, which is provided to the scale down unit 112. The scale down unit 112 scales this signal down by about a half (0.5) to limit the dynamic range and hence to yield a speech signal s(n). The speech signal s(n) is segmented into frames, for example 20 ms frames, and stored in the first speech buffer 113. For an 8 kHz sampling system, a speech frame contains 160 samples. In the presently preferred embodiment, the first speech buffer 113 stores 560 samples Bsp1(n) for n=0 to 559 for analysis by an LP analysis block 114. When a frame (160 samples) of the speech signal s(n) is available, it is loaded into the first speech buffer 113 at samples n=400 to 559. The samples preceding Bsp1(400) are made up of the previous consecutive frames.
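The buffering just described can be illustrated with a minimal Python sketch. The function names (scale_down, update_speech_buffer, lp_analysis_window) are hypothetical and are not taken from the specification; the sketch assumes the buffer is simply shifted left by one frame so that the newest 160 samples occupy positions 400 to 559, and that the 170-sample LP analysis frame is read from positions 390 to 559.

```python
FRAME_LEN = 160    # 20 ms frame at an 8 kHz sampling rate
BUFFER_LEN = 560   # size of the first speech buffer Bsp1(n), n = 0 to 559


def scale_down(frame, factor=0.5):
    """Scale the band-limited signal by about a half to limit dynamic range."""
    return [factor * x for x in frame]


def update_speech_buffer(buffer, frame):
    """Drop the oldest frame and place the new frame at n = 400 to 559.

    Assumption: older samples are shifted toward index 0 by one frame.
    """
    if len(buffer) != BUFFER_LEN or len(frame) != FRAME_LEN:
        raise ValueError("unexpected buffer or frame length")
    return list(buffer[FRAME_LEN:]) + list(frame)


def lp_analysis_window(buffer):
    """Return the 170-sample LP analysis frame Bsp1(390) to Bsp1(559)."""
    return buffer[390:560]
```

In this sketch a frame is first passed through scale_down and then handed to update_speech_buffer, after which lp_analysis_window yields the samples used for LP analysis.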
In the presently preferred embodiment, the LP analysis block 114 performs a 10th order Burg's LP analysis to estimate the spectral envelope of the speech frame. The LP analysis frame contains 170 samples, from Bsp1(390) to Bsp1(559). The result of the LP analysis is ten LP coefficients (LPC), a″(i) where i=1 to 10. A bandwidth expansion block 116 is used to expand the set of LP coefficients using equation (3), which generates bandwidth expanded LP coefficients a′(i).
a′(i)=0.996^i × a″(i) for i=1, 2, . . . , 10 Eqn 3
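As an illustration of equation (3), reading the factor as 0.996 raised to the power i, the following Python sketch applies the expansion to a list of ten coefficients. The function name bandwidth_expand and the list layout (index 0 holds the coefficient for i=1) are assumptions made for this example only.

```python
def bandwidth_expand(lpc, gamma=0.996):
    """Apply a'(i) = gamma**i * a''(i) for i = 1 .. len(lpc).

    lpc[0] is taken to hold a''(1), lpc[1] to hold a''(2), and so on.
    """
    return [(gamma ** i) * a for i, a in enumerate(lpc, start=1)]
```

Because the weight decays with i, higher order coefficients are attenuated slightly more than lower order ones.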
A frame of an LP residual signal r(n) is extracted using an LP analysis filter in the following manner. After the set of bandwidth expanded LP coefficients a′(i) is generated, the coefficients a′(i) are converted to line spectral frequencies (LSF) ω′l(i) (i=1 to 10), at an LPC to LSF block 118. The current set of LSF ω′l(i) is then linearly interpolated with the set of the previous frame LSF at an interpolate LSF block 120 to compute a set of intermediate LSF ωl(i), preferably every 5 ms. Hence there are four sets of intermediate LSF ωl(m,i) (m=1 to 4; i=1 to 10) in a speech frame. The four intermediate LSF sets ωl(m,i) are converted back to corresponding LP coefficients a(m,i) (m=1 to 4; i=1 to 10) at an LSF to LPC block 122. Then, a frame of the residual signal r(n) is obtained using an inverse filter 124 operating in accordance with equation (4).
A first residual buffer 130 stores the residual signal r(n). The size of the first residual buffer 130 is preferably 320 samples. That is, the stored data is Brd1(n) for n=0 to 319, which is the current residual frame and a previous consecutive frame. To compute the current residual frame, the inverse filter 124 is operated as shown in Table 1.
The LSF ω′l(i) from the LPC to LSF block 118 are also quantized by an LSF codebook or quantizer 126 to determine an index IL. That is, as is understood by those of ordinary skill in the art, the LSF quantizer 126 stores a number of reference LSF vectors, each of which has an index associated with it. A target LSF vector ω′l(i) is compared with the LSF vectors stored in the LSF quantizer 126. The best matched LSF vector is chosen and an index IL of the best matched LSF vector is sent over the channel for decoding.
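The codebook search described above can be sketched in Python as a nearest-neighbour search. The specification does not state the matching criterion, so the squared Euclidean distance used here is an assumption, as are the function name quantize_lsf and the representation of the codebook as a list of reference vectors.

```python
def quantize_lsf(target, codebook):
    """Return the index of the reference vector closest to the target.

    Assumption: "best matched" means minimum squared Euclidean distance.
    """
    best_index = 0
    best_dist = float("inf")
    for index, reference in enumerate(codebook):
        dist = sum((t - r) ** 2 for t, r in zip(target, reference))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index
```

Only the returned index would be sent over the channel; the decoder would look up the same reference vector in an identical codebook.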
As previously discussed, for the LP residual signal r(n), different coding schemes are used for different signal types. For a voiced segment, a pitch cycle is extracted from the LP residual signal r(n) every 5 ms, i.e. an instantaneous pitch cycle. The gain, pitch frequency and excitation level for the instantaneous pitch cycle are extracted. A consecutive set for each parameter is arranged to form a parameter contour. The sensitivity of each parameter to the synthesised speech quality is different. Hence, different update rates are used to sample each parameter contour for coding efficiency. In the presently preferred embodiment, a 5 ms update is used for gain and a 10 ms update is used for the pitch frequency and excitation level. For an unvoiced segment, only the gain contour is useful. An unvoiced sub-segment is extracted from the LP residual signal r(n) every 5 ms. The gain of each unvoiced sub-segment is computed and arranged in time to form a gain contour. Once again a 5 ms update rate is used to sample the unvoiced gain. A pitch detector 128 is used to classify the speech signal s(n) as either voiced or unvoiced. In the case of voiced speech the pitch frequency is estimated.
Referring now to
In level (1), the speech signal s(n) is filtered with a low pass filter 300 to reject the higher frequency content that may obstruct the detection of true pitch. The cut-off frequency of the low-pass filter 300 is preferably set to 1000 Hz. Preferably the filter 300 has a filter transfer function as defined in equation (5).
The output sl(n) of the low-pass filter 300 is loaded into a second speech buffer 302. In the presently preferred embodiment, the second speech buffer 302 is used to store two consecutive frames Bsp2(n) where n=0 to 319, which is 320 samples. More particularly, the input to the low pass filter 300 is taken from the first speech buffer 113 as Bsp1(400) to Bsp1(559), and a modified speech signal sl(n) output from the low pass filter 300 is stored in the second speech buffer 302 as Bsp2(160) to Bsp2(319).
The stored modified speech signal Bsp2(n), n=160 to 319, is provided to an inverse filter 304 to obtain a band-limited residual signal rl(n). The filter coefficients of the inverse filter 304 are set to a(4,i) for i=0 to 10. The residual signal rl(n) output from the inverse filter 304 is stored in a second residual buffer 306. The second residual buffer 306 preferably stores 320 samples Brd2(n) where n=0 to 319, and thus, the residual buffer 306 holds two consecutive residual frames. The current residual signal rl(n) is stored in Brd2(n), where n=160 to 319.
After a new residual signal rl(n) is loaded into the second residual buffer 306, a cross-correlation function is computed at block 308 using data read from the buffer 306 Brd2(n) in accordance with equation (6).
A peak detector 310 finds the global maximum Crmax and its location prmax across the cross-correlation function Cr(m), m=16 to 160. A level detector 312 checks if Crmax is greater than or equal to about 0.7, in which case the confidence for a voiced signal is high. In this case, the cross-correlation function Cr(m) is re-examined to eliminate possible multiple pitch errors and hence to yield the estimated pitch-period pest and its correlation function Cest at block 314. The multiple-pitch error checking is preferably carried out as follows:
If the level detector 312 determines that Crmax is less than about 0.7, level (2) pitch detection processing is used.
Level (2)
Level (2) of the pitch detector 128 is delegated to the detection of an unvoiced signal. This is done by assessing the RMS level and energy distribution Ru of the speech signal s(n). The RMS value of the speech signal s(n) is computed at block 316 in accordance with equation (7).
The vocal tract has certain major resonant frequencies that change as the configuration of the vocal tract changes, such as when different sounds are produced. The resonant peaks in the vocal tract transfer function (or frequency response) are known as “formants”. It is by the formant positions that the ear is able to differentiate one speech sound from another. The energy distribution Ru, defined as the energy ratio between the higher formants and all the detectable formants, for a pre-emphasized spectral envelope, is computed at block 318. The pre-emphasized spectral envelope is computed from a set of pre-emphasized filter coefficients that defines a system with the transfer function shown in equation (8).
A#(z)=(1+0.99z^−1)A′(z) Eqn 8
If a′ and a# are the filter coefficients for A′(z) and A#(z), they are related as shown in equation (9).
a#0=1.0
a#i=a′i+0.99a′i−1 for i=1,2, . . . , 10 Eqn 9
a#11=0.99a′10
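The relation between the two coefficient sets follows from multiplying out equation (8), and can be sketched in Python as below. The function name pre_emphasize is hypothetical, and the sketch assumes the input list holds a′0 to a′10 with a′0 equal to 1.0, so that the output holds the twelve coefficients a#0 to a#11.

```python
def pre_emphasize(a_prime, mu=0.99):
    """Multiply A'(z) by (1 + mu * z^-1) and return the coefficients of A#(z).

    a_prime[i] is the coefficient of z^-i in A'(z), with a_prime[0] == 1.0.
    """
    order = len(a_prime)
    out = []
    for i in range(order + 1):
        current = a_prime[i] if i < order else 0.0
        previous = a_prime[i - 1] if i >= 1 else 0.0
        out.append(current + mu * previous)
    return out
```

The first output equals a′0 and the last equals 0.99 times the final input coefficient, which matches the first and last lines of equation (9).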
After filter coefficients a# are available, a# are zero padded to 256 samples and an FFT analysis is applied to yield a smoothed spectral envelope. For example, assuming Xk where k=1 to M are the magnitude values for formants (1) to (M), where formants (1) to (m) are below 2 kHz and formants (m+1) to (M) are above 2 kHz, the energy distribution is defined as:
Detection of an unvoiced signal is done at block 320 by checking if either RMS is less than about 58.0 or Ru is greater than about 0.5. If either of these conditions is met, an unvoiced frame is declared and Cest and pest are cleared or set to zero. Otherwise, the pitch detector 128 will call upon the level (3) analysis.
Level (3)
In level (3), a cross-correlation function Cs(m) is computed at block 322, using equation (11), from the low-pass filtered speech signal stored in the second speech buffer 302.
A peak detector 324 is connected to the block 322 and detects the global maximum Csmax and its location psmax of Cs(m). The correlation function Cs(m) calculated at block 322 is examined at block 326, in a similar manner as is done in level (1) with Cr(m), and then the appropriate cross-correlation function Cr(m) or Cs(m) is selected at block 328 to eliminate multiple pitch errors.
For example, assume the estimated pitch-period and its associated correlation function for Cr(m) and Cs(m) are prest and Crest and psest and Csest respectively. The value Csmax is then assessed and the following logic decisions are performed. If Csmax is greater than or equal to about 0.7, a voiced signal is declared and pitch logic (1) is used to choose p′est from prest and psest and determine Cest. The estimated pitch-period pest is obtained by post processing p′est. Otherwise, the sum of Crmax and Csmax is computed, Csum=Crmax+Csmax. When the value of Csum is available, the logic decisions are made as follows.
If Csum≧1.0, a voiced signal is declared and pitch logic (2) is used to choose p′est from prest and psest, and determine Cest. The estimated pitch-period pest is obtained by post-processing p′est, as described below. Otherwise, an unvoiced signal is declared, Cest=0.0 and pest=0.
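The three-level voicing decision described above can be summarized by the following Python sketch. It covers only the threshold logic; the correlation maxima, RMS value and energy distribution are assumed to have been computed already, and the function name classify_frame and the returned labels are hypothetical. In the described detector Csmax would be computed only when level (3) is reached, whereas here it is passed in for simplicity.

```python
def classify_frame(cr_max, rms, ru, cs_max):
    """Return (decision, stage) following the level (1) to (3) thresholds."""
    # Level (1): high residual-domain correlation means a confident voiced frame.
    if cr_max >= 0.7:
        return "voiced", "level 1"
    # Level (2): low energy or energy concentrated in higher formants.
    if rms < 58.0 or ru > 0.5:
        return "unvoiced", "level 2"
    # Level (3): fall back on the speech-domain correlation.
    if cs_max >= 0.7:
        return "voiced", "pitch logic 1"
    if cr_max + cs_max >= 1.0:
        return "voiced", "pitch logic 2"
    return "unvoiced", "level 3"
```

For example, a frame with Crmax of 0.55 and Csmax of 0.5 that passes the level (2) checks would be declared voiced through pitch logic (2), since the sum is at least 1.0.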
Pitch logic (1)
For pitch logic (1), two conditions are analyzed at a first decision block:
Pitch logic (2) is a simple comparison between the two correlation maxima. If Csmax>Crmax, the confidence in the voicing decision made from Cs(m) may be higher, and hence the result is taken from Cs(m), p′est=psest and Cest=Csmax. Otherwise, if Crmax>Csmax, then p′est=prest and Cest=Crmax.
After the pitch period p′est is selected, the pitch period p′est is smoothed by a pitch post-processing unit 330. The pitch post-processing unit 330 is a median smoother used to smooth out an isolated error such as a multiple pitch error or a sub-multiple pitch error. In the presently preferred embodiment, the pitch post-processing unit 330 differs from conventional median smoothers, which operate on the pitch-periods taken from both the previous and future frames, because the median smoother uses the current estimated pitch-period and pitch-periods estimated in the two previous consecutive frames.
Assume the estimated pitch-period for the l-th speech frame is p(l), and p(l−1) and p(l−2) are the estimated pitch-periods for the two previous consecutive frames.
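A causal median smoother of this kind can be sketched in Python as follows. This is an assumption about the general form only: it takes the plain median of the current estimate and the two previous estimates, and the preferred embodiment may apply additional conditions not reproduced here. The function name smooth_pitch is hypothetical.

```python
def smooth_pitch(p_current, p_previous1, p_previous2):
    """Return the median of the current and two previous pitch-period estimates."""
    return sorted([p_current, p_previous1, p_previous2])[1]
```

With previous estimates of 40 and 41 samples, an isolated doubled estimate of 80 samples in the current frame would be replaced by 41.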
Referring now to
A contour is a sequence of parameters, which in the presently preferred embodiment are updated every 5 ms. As previously discussed, the length of a speech frame is 20 ms, hence there are four (4) parameters (m) in a frame, which make up a contour. The parameters for an unvoiced signal are pitch and gain. On the other hand, the parameters for a voiced signal are pitch, gain and excitation level.
Unvoiced signal
For an unvoiced signal, at block 210 the contours are extracted from the data Brd1(n) stored in the first residual buffer 130. The contours required for an unvoiced signal are pitch and gain. The pitch contour ωp is used to specify the pitch frequency of a speech signal at each update point. For the unvoiced signal, the pitch contour ωp is set to zero to distinguish it from a voiced signal.
ωp(m)=0 for m=1 to 4.
Gain factors λ(m) are computed using the residual signal r(n) data Brd1(n) stored in the first residual buffer 130.
where n1=160+40×(m−1) and m=1 to 4.
The encoder parameters must be quantized before being transmitted over the air to the decoder side. For the unvoiced signal, the pitch frequency and gain are quantized at block 212, which then outputs a quantized pitch and quantized gain.
Voiced Signal
Three contours are required for a voiced signal: pitch, gain and excitation level. The four parameters (m) for each of these contours are extracted from the instantaneous pitch cycles u(n) every 5 ms. Thus, at block 250 the pitch cycles u(n) are extracted from the data Brd1(n) stored in the first residual buffer 130. The length of each pitch cycle u(n) is known as the instantaneous pitch-period p(m). The value of p(m) is chosen from a range of pitch-period candidates pc. The range of pc is computed from the estimated pitch-period pest generated by the pitch detector 128. Assume pc(1) and pc(M) are the lowest and highest pitch-period candidates, such that:
pc(1)<pc(2)<pc(3)< . . . <pc(M)
The values of pc(1) and pc(M) are computed as:
pc(1)=integer(0.9×pest) Eqn 13a
pc(M)=integer(1.1×pest) Eqn 13b
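Equations (13a) and (13b) can be illustrated with a short Python sketch. Two assumptions are made here that the text does not confirm: that integer() denotes truncation toward zero, and that the candidates are the consecutive whole-sample periods between pc(1) and pc(M). The function name pitch_candidates is hypothetical.

```python
def pitch_candidates(p_est):
    """Return candidate pitch-periods from integer(0.9*p_est) to integer(1.1*p_est)."""
    lowest = int(0.9 * p_est)    # pc(1), Eqn 13a
    highest = int(1.1 * p_est)   # pc(M), Eqn 13b
    return list(range(lowest, highest + 1))
```

For an estimated pitch-period of 50 samples this gives the eleven candidates 45 to 55, so the search stays within about ten percent of the estimate from the pitch detector.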
A cross-correlation function C(k) is then computed for each of the pc(k). The pc(k) that yields the highest cross-correlation function is chosen to be the p(m) at the update point. The cross-correlation function C(k) is defined in equation (14).
The value of n1 is set as 200, 240, 280 and 320 for each update point. After p(m) is obtained, the instantaneous pitch cycle u(n) is extracted from Brd1(n) for the four update points.
Once an instantaneous pitch cycle u(n) is available, the three contours (pitch frequency, gain and excitation level) are computed at block 252. The gain factor λ is calculated using equation (15).
To compute the excitation level ε, the absolute maximum value for the pitch cycle u(n) is determined using equation (16).
A(m)=max(|u(m,n)|) for n=0,1,2, . . . , p(m)−1 Eqn 16
The excitation level is computed using equation (17).
Finally, for the pitch frequency ωp, a fractional pitch-period p′ is first computed from the cross-correlation function C(pc(1)) . . . C(pc(M)). Suppose p(m) is the instantaneous pitch-period and p(m)=pc(k). The fractional pitch-period p′(m) is computed as shown in equation (18).
The pitch frequency is defined as shown in equation (19).
Table 2 summarizes the PELP coder parameters.
As with the unvoiced parameters, the encoder parameters must be quantized before being transmitted over the air to the decoder side. For the voiced signal, to achieve very low bit rate coding, the pitch frequency ωp and excitation level ε are downsampled at block 254 to reduce the information content, for example at a 4:1 rate. After the pitch frequency ωp and excitation level ε are downsampled, they are quantized at block 256. The outputs of the quantization block 256 are a quantized pitch, a quantized gain, and a quantized excitation level.
Hence, only one pitch frequency and excitation level is quantized for each 20 ms voiced frame. An example of the quantization scheme for a 1.8 kb/s PELP coder is shown in Table 3.
Further quality enhancement may be achieved by reducing the downsampling rate of the pitch frequency ωp and the excitation level ε, for example to 2:1 and so on, as will be understood by those of ordinary skill in the art.
PELP Decoder
The PELP decoder uses the LP residual parameters generated by the encoder (gain, pitch frequency, excitation level) to reconstruct the LP excitation signal. The reconstructed LP excitation signal is a quasi-periodic signal for voiced speech and a white Gaussian noise signal for unvoiced speech. The quasi-periodic signal is generated by linearly interpolating the pitch cycles at 5 ms intervals. Each pitch cycle is constructed using a deterministic component and a noise component. In addition, the LSF vector is linearly interpolated with the one in the previous frame to obtain an intermediate LSF vector and converted to LPC. After the excitation signal is constructed, it is passed through an LP synthesis filter to obtain the synthesised speech output signal s(n).
The parameters needed for speech synthesis are listed in Table 4. If the parameters are further downsampled for lower bit rates, the intermediate parameters are recovered via a linear interpolation.
Referring now to
To synthesize an unvoiced speech frame, at block 404 a random excitation signal is generated. More particularly, four segments of a unity-power white-Gaussian sequence (40 samples each) are generated, i.e. g′(m,n) for m=1 to 4; n=0 to 39. The white Gaussian noise generator is implemented by a random number generator that has a Gaussian distribution and white frequency spectrum. At block 406, each sequence g′(m,n) is scaled to the corresponding gain λ(m) to yield g(m,n), as shown by equation (20).
g(m, n)=λ(m)g′(m, n) Eqn 20
for m=1,2,3,4
for n=0,1,2, . . . ,39
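The scaling of equation (20) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `unvoiced_excitation` and the constants are hypothetical, and Python's `random.gauss` stands in for whatever Gaussian generator an actual implementation uses. Samples drawn with unit variance have unity power only in expectation, so a stricter implementation might normalize each segment.

```python
import random

SEGMENTS = 4       # update points per 20 ms frame (m = 1..4)
SEGMENT_LEN = 40   # samples per 5 ms segment (n = 0..39)


def unvoiced_excitation(gains, rng=None):
    """Return four gain-scaled Gaussian segments g(m, n), following Eqn 20.

    gains: sequence of four gains, one per segment (lambda(m)).
    rng:   optional random.Random instance, useful for repeatable tests.
    """
    if len(gains) != SEGMENTS:
        raise ValueError("expected one gain per segment")
    rng = rng or random.Random()
    excitation = []
    for gain in gains:
        # g'(m, n): zero-mean, unit-variance white Gaussian samples
        unit = [rng.gauss(0.0, 1.0) for _ in range(SEGMENT_LEN)]
        # g(m, n) = lambda(m) * g'(m, n)
        excitation.append([gain * x for x in unit])
    return excitation
```

Passing a seeded `random.Random` makes the output repeatable, which is convenient when comparing a decoder against a reference.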
In addition, using the codebook index IL generated by the encode (
ωl′(m,i)=ωl(l−1,i)+0.25*m*(ωl(l,i)−ωl(l−1,i)) Eqn 21
for m=1,2,3,4
for i=1,2, . . . ,10
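As a minimal sketch of equation (21), the following Python function interpolates between the LSF vector of the previous frame and that of the current frame. The function name `interpolate_lsf` and the `steps` parameter are illustrative assumptions; with `steps=4` the weight equals 0.25·m as in the equation.

```python
def interpolate_lsf(prev_lsf, cur_lsf, m, steps=4):
    """Linearly interpolate LSF vectors as in Eqn 21.

    prev_lsf: LSF vector of frame l-1.
    cur_lsf:  LSF vector of frame l.
    m:        update point within the frame, 1..steps.
    """
    if not 1 <= m <= steps:
        raise ValueError("m must lie in 1..steps")
    if len(prev_lsf) != len(cur_lsf):
        raise ValueError("LSF vectors must have the same length")
    weight = m / steps  # equals 0.25 * m when steps == 4
    return [p + weight * (c - p) for p, c in zip(prev_lsf, cur_lsf)]
```

At m = 4 the result equals the current frame's vector, so the last update point of each frame uses the transmitted LSFs unchanged.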
Finally, the synthesized unvoiced speech signal is obtained by passing the Gaussian sequence g(m,n) to an LP synthesis filter 412. The operation of the LP synthesis filter 412 is defined by difference equation (22).
where e(n) is the input to the LP synthesis filter. The filtering is done according to Table 5.
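For illustration, an all-pole LP synthesis filter can be sketched as below. Because equation (22) is not reproduced here, the sign convention is an assumption: the sketch uses s(n) = e(n) + Σ a(i)·s(n−i), and a codec that defines the predictor with the opposite sign would negate the coefficients. The function name `lp_synthesis` and the `history` argument are hypothetical.

```python
def lp_synthesis(excitation, lpc, history=None):
    """All-pole synthesis filter: s(n) = e(n) + sum_i a(i) * s(n - i).

    excitation: input samples e(n).
    lpc:        predictor coefficients a(1)..a(P) (assumed sign convention).
    history:    previous outputs [s(n-1), s(n-2), ...], or None for zeros.
    Returns (output samples, updated history) so that consecutive
    segments can be filtered without a discontinuity.
    """
    order = len(lpc)
    past = list(history) if history is not None else [0.0] * order
    if len(past) != order:
        raise ValueError("history length must equal the filter order")
    out = []
    for e in excitation:
        s = e + sum(a * y for a, y in zip(lpc, past))
        out.append(s)
        past = ([s] + past)[:order]  # newest output first
    return out, past
```

Returning the filter memory lets each 5 ms segment be filtered with its own interpolated coefficients while carrying the state across segment boundaries.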
A voiced speech signal is processed differently from an unvoiced speech signal. For a voiced speech signal, a quasi-periodic excitation signal is generated at block 414. The quasi-periodic signal is generated by interpolating the four synthetic pitch cycles in a 20 ms frame. Each synthetic pitch cycle is generated using the corresponding gain λ, pitch frequency ωp and excitation level ε.
For example, suppose the synthetic pitch cycle u(n) at an update point within the 20 ms frame is defined in the frequency domain by its pitch-period p, a magnitude spectrum Uk and a phase spectrum φk. Only half of the frequency spectrum is used, i.e., k is defined from
The pitch-period p is calculated as shown in equation (23).
A flat magnitude spectrum is used in the PELP coding for Uk and is defined as shown in equation (24).
U0=0
Uk=λ√p Eqn 24
The phase spectrum φk includes deterministic phases φd at the lower frequency band and random phase components φr at the higher frequency band.
The separation between the two bands is known as the separation frequency ωs, where:
ωs=π×ε Eqn 26
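The band split implied by equation (26) can be sketched as follows. Several details are assumptions made only for this illustration: the pitch frequency is taken to be in radians per sample, harmonic k is taken to lie at k·ωp, harmonics are counted up to π, and a harmonic exactly at ωs is placed in the deterministic band. The function name `split_bands` is hypothetical.

```python
import math


def split_bands(pitch_freq, excitation_level):
    """Split harmonic indices into deterministic and random phase bands.

    pitch_freq:       pitch frequency in radians per sample (assumed unit).
    excitation_level: epsilon, so that omega_s = pi * epsilon (Eqn 26).
    Returns (omega_s, low_band_indices, high_band_indices).
    """
    if pitch_freq <= 0.0:
        raise ValueError("pitch frequency must be positive")
    omega_s = math.pi * excitation_level
    # Assumption: harmonics k = 1..K with k * pitch_freq not exceeding pi.
    count = int(math.pi / pitch_freq)
    low = [k for k in range(1, count + 1) if k * pitch_freq <= omega_s]
    high = [k for k in range(1, count + 1) if k * pitch_freq > omega_s]
    return omega_s, low, high
```

Under these assumptions, an excitation level of 0 makes every harmonic random-phase and a level of 1 makes every harmonic deterministic.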
The deterministic phases φd are derived from a modified speech production model as shown in equation (27).
The ways in which α, β and γ can be computed are well understood by those of ordinary skill in the art. The random phase spectrum is generated using a random number generator. The random number generator provides uniformly distributed random numbers in the range from 0 to 1.0, which are then scaled to the range from 0 to π.
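A minimal sketch of the random phase generation, assuming Python's `random.Random.random` as the uniform generator; the function name `random_phases` is hypothetical.

```python
import math
import random


def random_phases(count, rng=None):
    """Return `count` random phases, uniform on [0, 1) and scaled to [0, pi)."""
    rng = rng or random.Random()
    return [math.pi * rng.random() for _ in range(count)]
```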
After the magnitude and phase spectra for the pitch cycle are obtained, they are transformed to real and imaginary spectra for interpolation as shown in equation (28).
Rk=|Uk| cos(φk)
Ik=|Uk| sin(φk) Eqn 28
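Equation (28) is a polar-to-rectangular conversion applied to each harmonic. A short Python sketch, with the hypothetical function name `to_rectangular`:

```python
import math


def to_rectangular(magnitudes, phases):
    """Convert magnitude and phase spectra to real and imaginary parts (Eqn 28)."""
    if len(magnitudes) != len(phases):
        raise ValueError("magnitude and phase spectra must have equal length")
    real = [abs(u) * math.cos(phi) for u, phi in zip(magnitudes, phases)]
    imag = [abs(u) * math.sin(phi) for u, phi in zip(magnitudes, phases)]
    return real, imag
```

Interpolating the real and imaginary parts, rather than the phases themselves, avoids the wrap-around problem that arises when two phases lie on either side of a 2π boundary.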
To synthesize a voiced excitation, the pitch frequency and the real and imaginary spectra are linearly interpolated from one pitch cycle to the next to provide a smooth change of both the signal energy and shape. For example, suppose u(m−1)(n) and u(m)(n) are adjacent pitch cycles (5 ms apart). The pitch frequencies and the real and imaginary spectra for the two cycles are denoted as ωp(m−1), Rk(m−1), Ik(m−1) and ωp(m), Rk(m), Ik(m), respectively. The voiced excitation signal v(m)(n), for n = 0, …, 39, is synthesized from these two pitch cycles using equation (29).
where ψ(n) is a linear interpolation function defined by equation (30).
The value p(m)(n) is the instantaneous pitch-period for each time sample (n), and is computed from the instantaneous pitch frequency ωp(m)(n) as shown in equation (31).
The instantaneous pitch frequency ωp(m)(n) is computed as:
K(n) is a parameter related to the instantaneous pitch period as:
The instantaneous phase value σ(m)(n) is calculated as:
After the four pieces of voiced excitation v(m)(n), for m = 1, …, 4 and n = 0, …, 39, are available, they are used as inputs to the LP synthesis filter 412 for synthesizing the voiced speech, in the same manner as is done for unvoiced speech, according to Table 6.
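The per-sample linear interpolation of the pitch frequency and of the real and imaginary spectra between two adjacent pitch cycles can be sketched as below. This illustrates only the interpolation step, not the full synthesis of equation (29). The interpolation function ψ(n) = n/40 is an assumption standing in for equation (30), and the function name `interpolate_cycle` and the tuple layout are hypothetical.

```python
def interpolate_cycle(prev, cur, n, length=40):
    """Interpolate pitch-cycle parameters at sample n of a 5 ms segment.

    prev, cur: (pitch_freq, real_spectrum, imag_spectrum) for cycles
               m-1 and m; the spectra are equal-length lists.
    n:         sample index, 0..length-1.
    Assumes psi(n) = n / length, a stand-in for Eqn 30.
    """
    if not 0 <= n < length:
        raise ValueError("n must lie in 0..length-1")
    psi = n / length

    def lerp(a, b):
        # Weighted mix: previous cycle at psi = 0, current cycle as psi -> 1
        return (1.0 - psi) * a + psi * b

    pitch_freq = lerp(prev[0], cur[0])
    real = [lerp(a, b) for a, b in zip(prev[1], cur[1])]
    imag = [lerp(a, b) for a, b in zip(prev[2], cur[2])]
    return pitch_freq, real, imag
```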
A voiced onset frame is a voiced frame that directly follows an unvoiced frame. In a voiced onset frame, the parameters for pitch cycle {u(0)(n)} are not available for interpolation with {u(1)(n)}. To solve this problem, the parameters for {u(1)(n)} are re-used for {u(0)(n)} as shown below, and then the normal voiced synthesis is resumed.
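The onset handling amounts to a simple fallback when selecting the starting pitch cycle. In the Python sketch below, the function name `starting_cycle` and the dictionary representation of the pitch-cycle parameters are hypothetical.

```python
def starting_cycle(prev_cycle, first_cycle, prev_frame_voiced):
    """Return the parameters to use for pitch cycle u(0).

    prev_cycle:        parameters of the last cycle of the previous frame,
                       or None if that frame was unvoiced.
    first_cycle:       parameters of cycle u(1) of the current frame.
    prev_frame_voiced: True if the previous frame was voiced.
    """
    if prev_frame_voiced and prev_cycle is not None:
        return prev_cycle
    # Voiced onset: no usable history, so re-use the u(1) parameters.
    return dict(first_cycle)
```

With identical parameters at both ends, the interpolation over the first 5 ms segment of an onset frame reduces to a constant pitch cycle.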
As is apparent, the present invention provides a Phase Excited Linear Prediction type vocoder. The description of the preferred embodiments of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or to limit the invention to the forms disclosed. It will be appreciated by those skilled in the art that changes could be made to the embodiments described above without departing from the broad inventive concept thereof. For example, the present invention is not limited to a vocoder having any particular bit rate. It is understood, therefore, that this invention is not limited to the particular embodiments disclosed, but covers modifications within the spirit and scope of the present invention as defined by the appended claims.
Number | Date | Country | |
---|---|---|---|
20030074192 A1 | Apr 2003 | US |