The present disclosure is related generally to audio encoding and decoding and, more particularly, to a system and method for conversion of linear predictive coding (“LPC”) coefficients using the auto-regressive (“AR”) extension of correlation coefficients for use in sub-band speech or other audio encoder-decoders (“codecs”).
Many devices used for communication or entertainment purposes possess the ability to play back or reproduce sound based on a signal representing that sound. For example, a personal computer, laptop computer, or tablet computer may be used to play a video that has both image and sound. A smart-phone may be able to play such a video and may also be used for voice communications, i.e., by sending and receiving signals that represent a human voice.
In all such systems, there is a need to electrically encode the sound signal for transmission or storage and conversely to electrically decode the encoded signal upon receipt. Early forms of sound encoding included encoding sound as bumps in plastic or wax (e.g., early gramophones and record players), while later forms of analog encoding became more symbolic, recording sound as magnetic magnitudes on discrete regions of a magnetic tape. Digital recording, coming later still, converted the sound signal to a series of numbers and provided for more efficient usage of transmission and storage facilities.
However, as the transmission of sound data became more prevalent and the computing power of the devices involved became increasingly greater, more complex and efficient systems for encoding were devised. For example, many cell-phone conversations today are encoded for transmission by way of a class of LPC algorithms. Algorithms in this class, such as algebraic codebook linear predictive algorithms, decompose speech, for example, into a model and an excitation for that model, mimicking the manner in which the human vocal tract (akin to the model) is excited by vibration of the vocal cords (akin to the excitation). The LPC coefficients describe the model.
While algorithms of this class are efficient with respect to bandwidth consumption, the process required to create the transmitted data is quite complex and computationally expensive. Moreover, the continued increase in consumer demands upon their computing devices raises a need for yet a further increase in computational efficiency. The present disclosure is directed to a system and method that may provide enhanced computational efficiency in audio coding and decoding. However, it should be appreciated that any particular benefit is not a limitation on the scope of the disclosed principles or of the attached claims, except to the extent expressly recited in the claims. Additionally, the discussion of technology in this Background section is merely reflective of inventor observations or considerations and is not an indication that the discussed technology represents actual prior art.
While the appended claims set forth the features of the present techniques with particularity, these techniques, together with their objects and advantages, may be best understood from the following detailed description taken in conjunction with the accompanying drawings of which:
Before providing a detailed discussion of the figures, a brief overview is given to guide the reader. The disclosed systems and methods provide for the efficient conversion of linear predictive coefficients. This method is usable, for example, in the conversion of full band LPCs to sub-band LPCs of a sub-band speech codec. The sub-bands may or may not be down-sampled. In this method, the LPCs of the sub-bands are obtained from the correlation coefficients, which are in turn obtained by filtering the AR extended auto-correlation coefficients of the full band LPCs. The method then allows the generation of an LPC approximation of a pole-zero weighted synthesis filter. While one may attempt to employ FFT-based methods to strive for the same general result, such methods tend to be much less suitable in terms of both complexity and accuracy.
Turning now to a more detailed discussion in conjunction with the attached figures, techniques of the present disclosure are illustrated as being implemented in a suitable environment. The following description is based on embodiments of the disclosed principles and should not be taken as limiting the claims with regard to alternative embodiments that are not explicitly described herein. Thus, for example, while
The schematic diagram of
In the illustrated embodiment, the components of the user device 110 include a display screen 120, a camera 130, a processor 140, a memory 150, one or more audio codecs 160, and one or more input components 170.
The processor 140 can be any of a microprocessor, microcomputer, application-specific integrated circuit, or the like. For example, the processor 140 can be implemented by one or more microprocessors or controllers from any desired family or manufacturer. Similarly, the memory 150 may reside on the same integrated circuit as the processor 140. Additionally or alternatively, the memory 150 may be accessed via a network, e.g., via cloud-based storage. The memory 150 may include a random-access memory (e.g., Synchronous Dynamic Random-Access Memory, Dynamic Random-Access Memory, RAMBUS Dynamic Random-Access Memory, or any other type of random-access memory device). Additionally or alternatively, the memory 150 may include a read-only memory (e.g., a hard drive, flash memory, or any other desired type of memory device).
The information that is stored by the memory 150 can include program code associated with one or more operating systems or applications as well as informational data, e.g., program parameters, process data, etc. The operating system and applications are typically implemented via executable instructions stored in a non-transitory computer readable medium (e.g., memory 150) to control basic functions of the electronic device 110. Such functions may include, for example, interaction among various internal components and storage and retrieval of applications and data to and from the memory 150.
The illustrated device 110 also includes a network interface module 180 to provide wireless communications from and to the device 110. The network interface module 180 may include multiple communications interfaces, e.g., for cellular, WiFi, broadband, and other communications. A power supply 190, such as a battery, is included for providing power to the device 110 and to its components. In an embodiment, all or some of the internal components communicate with one another by way of one or more shared or dedicated internal communication links 195, such as an internal bus.
Further with respect to the applications, these typically utilize the operating system to provide more specific functionality, such as file-system service and handling of protected and unprotected data stored in the memory 150. Although many applications may govern standard or required functionality of the user device 110, in many cases applications govern optional or specialized functionality, which can be provided, in some cases, by third-party vendors unrelated to the device manufacturer.
Finally, with respect to informational data, e.g., program parameters and process data, this non-executable information can be referenced, manipulated, or written by the operating system or an application. Such informational data can include, for example, data that are preprogrammed into the device during manufacture, data that are created by the device, or any of a variety of types of information that is uploaded to, downloaded from, or otherwise accessed at servers or other devices with which the device 110 is in communication during its ongoing operation.
In an embodiment, the device 110 is programmed such that the processor 140 and memory 150 interact with the other components of the device 110 to perform a variety of functions. The processor 140 may include or implement various modules and execute programs for initiating different activities such as launching an application, transferring data, and toggling through various graphical user interface objects (e.g., toggling through various icons that are linked to executable applications).
Although the device 110 described in reference to
The encoder 200 receives input speech s at an LPC analysis filter 201 as well as at a first sub-band filter 202 and at a second sub-band filter 203. The LPC analysis filter 201 processes the input speech s to produce quantized LPC coefficients Aq. Because the quantized LPCs are common to both the bands, and the codec for each band requires an estimate of the spectrum of each of the respective bands, the quantized LPC coefficients Aq are provided as input to a first LPC and correlation conversion module 204 associated with the first sub-band and to a second LPC and correlation conversion module 205 associated with the second sub-band.
The first and second LPC and correlation conversion modules 204, 205 provide band-specific LPC coefficients Al (low) and Ah (high) to respective sub-band encoder modules 206, 207. The sub-band encoder modules 206, 207 receive respective filtered speech inputs Sl (low) and Sh (high) from the first sub-band filter 202 and the second sub-band filter 203. The sub-band encoder modules 206, 207 produce respective quantized LPC parameters for the associated bands. As such, the output of the encoder 200 comprises the quantized LPC coefficients Aq as well as quantized parameters corresponding to each sub-band.
As will be appreciated, quantization of a value entails setting that value to a closest allowed increment. In the illustrated arrangement, the quantized LPC coefficients are shown as the only common parameter. However, it will be appreciated that there may be other common parameters as well, e.g., pitch, residual energy, etc.
The band spectra may be represented in any suitable form known in the art. For example, a band spectrum may be represented as direct LPCs, correlation or reflection coefficients, log area ratios, line spectrum parameters or frequencies, or a frequency-domain representation of the band spectrum. It will be appreciated that the LPC conversion is dependent on the form of the filter coefficients of the sub-band filters.
The decoder 300 is similar to but essentially inverted from the encoder 200. The decoder 300 receives the quantized LPC coefficients Aq as well as the quantized parameters corresponding to each sub-band. The quantized parameters corresponding to the low and high sub-bands are input to a respective first sub-band decoder 301 and a second sub-band decoder 302. The quantized LPC coefficients Aq are provided to a first LPC and correlation conversion module 303 associated with the first sub-band and to a second LPC and correlation conversion module 304 associated with the second sub-band.
The first LPC and correlation conversion module 303 and the second LPC and correlation conversion module 304 output, respectively, the band-specific LPC coefficients Al (low) and Ah (high), which are in turn provided to the first sub-band decoder 301 and to the second sub-band decoder 302. The outputs of the first sub-band decoder 301 and the second sub-band decoder 302 are provided to respective sub-band filters 305, 306, which produce, respectively, a low-band speech signal sl and a high-band speech signal sh. The low-band speech signal sl and the high-band speech signal sh are combined in combiner 307 to yield a final recreated speech signal.
As noted above, one might use a frequency-domain approach for the LPC conversion. In this approach, the full band LPC is converted to the frequency domain using the FFT. The Fourier spectrum of the full band LPC is then multiplied by the power spectrum of the filter coefficients to obtain the power spectrum of the baseband signal. The LPC of the baseband signal is then computed using the inverse FFT of the power spectrum.
However, the accuracy of this frequency-domain approach is dependent on the length (N) of the FFT; the greater the FFT length, the better the estimation accuracy. Unfortunately, as the FFT length increases, complexity also increases. Moreover, since the LPC coefficients are representative of an AR process with an infinite impulse response (“IIR”), it may be inferred that irrespective of the FFT length, the frequency-domain approach will not result in the exact values of the correlation coefficients of the baseband signal. Intuitively, an IIR signal, which must be truncated and windowed for FFT processing, will result in response inaccuracies regardless of the length of the FFT.
In contrast, the described system and method provide a low complexity, high accuracy estimate of the correlation coefficients, from which an LPC of the filtered signal may be derived. In an LPC-based speech codec, speech is assumed to correspond to an AR process of certain order n (typically n=10 for 4 kHz bandwidth, n=16 or 18 for 7 kHz bandwidth). For an AR signal s(j) with order n, the correlation coefficients R(k), k>n, can be obtained from the values of R(k) for 0≦k≦n using the following recursive equation:
where ai are the LPC coefficients. If a signal is passed through a filter h(j), then the correlation coefficients Ry(k) of the filtered signal y(j) are given by:
$$R_y(k) = R(k) * h(k) * h(-k), \qquad (2)$$
where * is a convolution operator. In sub-band speech codecs, the filters are usually symmetric and are of finite length (“FIR”), and the lengths L of these filters are constrained by the codec delay requirements. With the symmetric assumption, the above equation can now be written as:
$$R_y(k) = R(k) * h(k) * h(k). \qquad (3)$$
If h(j) is symmetric and has length L, then h(j)*h(j) is also symmetric and has length 2L−1. Evaluating Equation (3) for large values of k would be very complex. However, the LPC order n0 of the filtered signal is typically smaller (≦n), and hence it is only necessary to calculate Ry(k) for 0≦k≦n0. This can be achieved by limiting the R(k) calculation to 0≦k≦n0+L−1.
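The autoregressive extension and the filtering of the correlations described above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation. It assumes the sign convention A(z) = 1 + Σ a_i z^−i, so that the extension reads R(k) = −Σ a_i R(k−i); that convention, and the function names `ar_extend` and `filter_correlations`, are assumptions made for the example.

```python
def ar_extend(r, a, k_max):
    """Extend correlations r[0..n] up to lag k_max by autoregression.

    a holds a_1..a_n. Assumed convention: A(z) = 1 + sum_i a_i z^-i,
    which gives R(k) = -sum_i a_i * R(k - i) for k > n.
    """
    n = len(a)
    ext = list(r[: n + 1])
    for k in range(n + 1, k_max + 1):
        ext.append(-sum(a[i - 1] * ext[k - i] for i in range(1, n + 1)))
    return ext


def filter_correlations(r, h, n0):
    """Return Ry(k) = R(k) * h(k) * h(-k) for 0 <= k <= n0.

    Requires r[0..n0 + len(h) - 1]; uses R(-k) = R(k).
    """
    L = len(h)
    # g(m) = sum_j h(j) h(j + m) is h(k) * h(-k) at lag m (symmetric in m).
    g = [sum(h[j] * h[j + m] for j in range(L - m)) for m in range(L)]
    ry = []
    for k in range(n0 + 1):
        acc = g[0] * r[k]
        for m in range(1, L):
            acc += g[m] * (r[abs(k - m)] + r[k + m])
        ry.append(acc)
    return ry


# Hypothetical first-order example: R(0) = 1, R(1) = 0.5, a_1 = -0.5.
h = [0.5, 0.5]                      # two-tap symmetric FIR filter, L = 2
n0 = 1
r_ext = ar_extend([1.0, 0.5], [-0.5], n0 + len(h) - 1)
ry = filter_correlations(r_ext, h, n0)
```

Only lags up to n0+L−1 are ever touched, which is the limit on the R(k) calculation stated above.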
A flow diagram for an exemplary LPC conversion process 400 is shown in
At stage 403 of the process 400, the correlation coefficients R(k) for n≦k≦L+n−1 are extended via autoregression, using equation (1) above, for example. At stage 404 of the process, the R(k) are filtered, using equation (2) above, for example, to obtain Ry(k). Finally, at stage 405, the Levinson-Durbin algorithm is used to obtain LPC coefficients Al of order n0 from Ry(k).
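The final stage can be illustrated with a textbook Levinson-Durbin recursion. The sketch below is not taken from the disclosure; it assumes the convention A(z) = 1 + Σ a_i z^−i, and the name `levinson_durbin` is chosen for the example.

```python
def levinson_durbin(r, order):
    """Solve the normal equations for a_1..a_order from correlations r[0..order].

    Assumed convention: A(z) = 1 + sum_i a_i z^-i.
    Returns (coefficients a_1..a_order, final prediction error).
    """
    a = [1.0] + [0.0] * order       # a[0] is the leading 1 of A(z)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err              # reflection coefficient of stage i
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a[1:], err


# Hypothetical filtered correlations Ry(0..2) of a first-order AR signal.
coeffs, err = levinson_durbin([1.0, 0.5, 0.25], 2)
```

For these correlations the second coefficient comes out as zero, since a first-order model already explains them.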
It will be appreciated that with R(0)=1, and the LPC coefficients ai known, the above equation can be viewed as a set of n simultaneous equations with R(1), R(2), . . . , R(n) as unknowns. This set of equations is solvable with stable LPC coefficients. In order to avoid the high complexity (order n³) of direct solutions such as Gaussian elimination, the equation in matrix form can be assumed to have a Toeplitz structure. In this way, the LPC coefficients are converted to reflection coefficients and thence to the correlation values. Both of these algorithms have a complexity of the order n², and hence the overall complexity of obtaining correlation coefficients from LPC is of order n².
Flow diagrams showing exemplary processes for converting LPC coefficients ai to reflection coefficients and converting reflection coefficients to correlations are shown in
Otherwise the process 500 flows to stage 505, wherein ρi←ai and c←1−ρi². From there the process 500 flows to stage 506, wherein ∀j<i,
At stage 507, the value of i is decremented, and the process flow returns to stage 503. Once i reaches 0, the process provides an output at stage 504 as discussed above.
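The loop over stages 503 to 507 can be sketched as a step-down recursion. The update applied at stage 506 is not reproduced in the text above, so the sketch assumes the standard form a_j ← (a_j − ρ_i·a_{i−j})/c. That update, the stability check, and the name `lpc_to_reflection` are assumptions for illustration.

```python
def lpc_to_reflection(a):
    """Convert LPC coefficients a_1..a_n to reflection coefficients rho_1..rho_n.

    Assumed step-down update: a_j <- (a_j - rho_i * a_{i-j}) / c, with
    rho_i = a_i and c = 1 - rho_i**2, for i = n down to 1.
    """
    a = list(a)
    n = len(a)
    rho = [0.0] * n
    for i in range(n, 0, -1):
        rho_i = a[i - 1]
        rho[i - 1] = rho_i
        c = 1.0 - rho_i * rho_i
        if c <= 0.0:
            # |rho_i| >= 1 means the LPC filter is not stable.
            raise ValueError("unstable LPC coefficients")
        inv_c = 1.0 / c             # one divide per order, n divides in total
        a = [(a[j - 1] - rho_i * a[i - j - 1]) * inv_c for j in range(1, i)]
    return rho


# Hypothetical second-order example.
rho = lpc_to_reflection([-0.6, 0.2])
```

One reciprocal is computed per order, which is consistent with the n divide operations counted in the complexity discussion.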
Turning to
for (k = 1; k ≦ j/2; ++k) { t = λ_k + ρ_j·λ_{j−k}; λ_{j−k} = λ_{j−k} + ρ_j·λ_k; λ_k = t; }
At stage 605, R(j) is calculated according to
and the value of j is incremented at stage 606 before the process 600 returns to stage 603. If j>n at stage 603, then the process 600 terminates at stage 607 and outputs the correlation values R. Otherwise, the foregoing steps are again executed until j>n.
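The process 600 loop can be sketched as follows. The expression for R(j) at stage 605 is not reproduced in the text above, so the sketch assumes R(j) = −Σ_{k=1..j} λ_k·R(j−k), which follows from the order-j normal equations under the convention A(z) = 1 + Σ a_i z^−i. It also assumes λ_j is set to ρ_j before the in-place update; both points, and the name `reflection_to_correlations`, are assumptions for the example.

```python
def reflection_to_correlations(rho):
    """Return normalized correlations R(0..n) from reflection coefficients.

    lam holds the order-j predictor lambda_1..lambda_j, updated in place
    with the symmetric step-up loop for k = 1..j/2.
    """
    n = len(rho)
    r = [1.0]                       # R(0) = 1
    lam = []
    for j in range(1, n + 1):
        rj = rho[j - 1]
        lam.append(rj)              # assumed: lambda_j = rho_j
        for k in range(1, j // 2 + 1):
            t = lam[k - 1] + rj * lam[j - k - 1]
            lam[j - k - 1] = lam[j - k - 1] + rj * lam[k - 1]
            lam[k - 1] = t
        # Assumed stage 605 expression: R(j) = -sum_k lambda_k R(j - k).
        r.append(-sum(lam[k - 1] * r[j - k] for k in range(1, j + 1)))
    return r


# Hypothetical second-order example.
r = reflection_to_correlations([-0.5, 0.2])
```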
As noted above, embodiments of the described autoregressive extension technique are generally superior to ordinary FFT techniques in terms of complexity and accuracy. For example, consider a full band input signal (having 8 kHz bandwidth) which is an order 16 AR process. Assume that the LPC analysis for n=16 (i.e., no mismatch between the actual order and the analysis order) is performed on the full band signal, and the full band signal is passed through an L=51 tap symmetric FIR low-pass filter to obtain a filtered signal. The normalized correlations (n0=16) of the filtered signal can be obtained using the autocorrelation method, and the actual spectrum can be obtained from the correlations.
For purposes of comparison, spectra were obtained using the described LPC conversion method as well as two FFT-based LPC conversion methods (using FFT of lengths 256 and 1024).
By way of summary,
The process of LPC conversion described herein is also applicable when upsampling or downsampling is involved. In this situation, the upsampling and downsampling can be applied to the extended correlations.
In order to more generally compare the resource cost of the described algorithm to that of the FFT-based methods, consider the differences in computational complexity between certain example steps from the two approaches. In the described approach, the computational complexity of obtaining the correlations from the LPC is approximately equal to 2.5·n·(n+1) operations. The autoregressive extension of the correlations requires an additional (L+n0−n)·n operations. Finally, filtering of the correlations requires (2·L−1)·n0 operations. Thus the total number of simple (multiply and add) operations C1 is:
$$C_1 = 2.5\,n\,(n+1) + (L + n_0 - n)\,n + (2L - 1)\,n_0.$$
So, given an example of L=50 and n=n0=16, then the number of simple mathematical operations is C1=2984. Additionally, there are n divide operations, which require more processing cycles than simple multiply and add operations. Assuming the computational complexity of a divide operation is 15 processing cycles, then the overall complexity of the described approach is approximately 2984+16·15=3224 operations.
Turning now to the complexity of the FFT approach, the complexity of real FFT or Inverse FFT is assumed to be 2·N log(N/2). The complexity of a divide is again assumed to be 15 times the complexity of multiply and add operations. The overall complexity C2 is therefore given by:
$$C_2 = 4\,N \log(N/2) + 7.5\,N.$$
For N=256, C2 is approximately 9000 operations. Even for an FFT length of 256, the FFT-based approach is therefore approximately three times as complex as the approach described herein.
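The FFT-side figure can be checked with a few lines of arithmetic. The sketch assumes the logarithm in the C2 expression is base 2, which the text does not state explicitly; the function name `fft_cost` is hypothetical.

```python
import math


def fft_cost(n_fft):
    """Evaluate C2 = 4*N*log(N/2) + 7.5*N, taking the logarithm as base 2 (assumed)."""
    return 4 * n_fft * math.log2(n_fft / 2) + 7.5 * n_fft


cost_256 = fft_cost(256)        # FFT length used in the comparison above
cost_1024 = fft_cost(1024)      # the longer FFT length also considered
ratio = cost_256 / 3224         # 3224 is the figure quoted for the described approach
```

Under that assumption the N=256 cost evaluates to 9088 operations, in line with the approximate figure of 9000.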
In keeping with a further embodiment, the described principles are also applicable in the context of analysis-by-synthesis (“AbS”) speech codecs (e.g., Code-Excited Linear Prediction (“CELP”) codecs). In AbS speech codecs, an excitation vector is passed through an LPC synthesis filter to obtain the synthetic speech as described further above. At the encoder side, the optimum excitation vector is obtained by conducting a closed loop search where the squared distortion of an error vector between the input speech signal and the fed-back synthetic speech signal is minimized. For improved audio quality, the minimization is performed in the weighted speech domain, wherein the error signal is further processed through a weighting filter W(z) derived from the LPC synthesis filter.
Let 1/A(z) be the LPC synthesis filter, where:
and where n is the LPC order. The weighting filter is typically a pole-zero filter given by:
The synthesis and post-filtering steps of a CELP decoder provide another context within AbS speech codecs where filters are cascaded and where the process described herein may be used. Again, an LPC synthesis filter of the following form is used:
where n is the LPC order. This filter is then cascaded with a weighting filter W(z). In this case W(z) is of the form:
where μ<1 is a tilt factor. Note that these synthesis and weighting filters may occupy the full bandwidth of the encoded speech signal or alternatively form just a sub-band of a broader bandwidth speech signal.
In both of these cases, the weighting filter may be written in the form:
where P(z) is an all zero filter of order L and 1/Q(z) is an all pole filter of order M. The weighted synthesis filter is now:
Passing the excitation vectors through the weighted synthesis filter is generally a complex operation. To reduce the complexity of the above operation, a method for approximating the weighted synthesis filter to an LP filter of order n0<n+M+L has been proposed in the past. However, such a method requires generating the approximate LP filter through the generation of the impulse response of the weighted synthesis filter and then obtaining the correlations from the impulse response. Similar to the FFT-based method, this method requires truncation and windowing of the impulse response and hence suffers from the same drawbacks as the FFT-based methods.
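The effect of truncating the impulse response can be illustrated with a small numerical sketch. It uses a hypothetical first-order all-pole filter 1/(1 − 0.9 z^−1), whose exact normalized lag-1 correlation is 0.9, and assumes the convention A(z) = 1 + Σ a_i z^−i. No windowing is applied, so the example isolates the truncation error only.

```python
def impulse_response(a, length):
    """First `length` samples of the impulse response of 1/A(z).

    a holds a_1..a_n under the assumed convention A(z) = 1 + sum_i a_i z^-i.
    """
    h = []
    for j in range(length):
        x = 1.0 if j == 0 else 0.0
        h.append(x - sum(a[i - 1] * h[j - i] for i in range(1, min(j, len(a)) + 1)))
    return h


def normalized_lag1(h):
    """Lag-1 correlation of a finite sequence h, normalized by its energy."""
    r0 = sum(v * v for v in h)
    r1 = sum(h[j] * h[j + 1] for j in range(len(h) - 1))
    return r1 / r0


a = [-0.9]                                                # hypothetical filter
short_estimate = normalized_lag1(impulse_response(a, 8))  # heavily truncated
long_estimate = normalized_lag1(impulse_response(a, 400)) # nearly untruncated
```

The short estimate falls noticeably below 0.9, while the long one approaches it only at the cost of a much longer response.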
The problem of truncation can be resolved by using the autoregressive correlation extension approach described herein to approximate the LPC of a weighted synthesis filter. When only an all zero filter P(z) is used as a weighting filter, the weighted synthesis filter is given by:
In this situation, one can directly use the method of
When an all pole filter 1/Q(z) is used as a weighting filter, the weighted synthesis filter is given by:
If one were to use the approach described in
When a pole-zero filter P(z)/Q(z) is used as a weighting filter, the weighted synthesis filter is given by:
In this case, a combination of the two foregoing approaches may be applied. In particular, the polynomials A(z) and Q(z) in the denominator of Ws(z) are multiplied to obtain B(z)=A(z)·Q(z), which is a polynomial of order n+M. Ws(z)=1/B(z) is assumed to be an LPC synthesis filter of order n+M. At this point, the approach described in
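The denominator product B(z)=A(z)·Q(z) is an ordinary polynomial multiplication, i.e., a convolution of the two coefficient sequences. A minimal sketch, with hypothetical coefficient values and a hypothetical helper name `poly_mul`:

```python
def poly_mul(p, q):
    """Multiply two polynomials in z^-1 given as coefficient lists [c0, c1, ...]."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


# Hypothetical A(z) of order n = 2 and Q(z) of order M = 1.
A = [1.0, -0.5, 0.25]
Q = [1.0, 0.25]
B = poly_mul(A, Q)          # order n + M = 3, so four coefficients
```

The resulting B(z) has leading coefficient 1, so its remaining coefficients can be treated like LPC coefficients of order n+M.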
A method of LPC conversion by filtering of the auto-regressively extended correlation coefficients has been described. This method is in many embodiments an improvement over FFT-based methods in terms of both complexity and accuracy. However, in view of the many possible embodiments to which the principles of the present disclosure may be applied, it should be recognized that the embodiments described herein with respect to the drawing figures are meant to be illustrative only and should not be taken as limiting the scope of the claims. Therefore, the techniques as described herein contemplate all such embodiments as may come within the scope of the following claims and equivalents thereof.
The present application claims priority to U.S. Provisional Patent Application 61/774,777, filed on Mar. 8, 2013, which is incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
61774777 | Mar 2013 | US