The present invention relates generally to signal encoding and in particular to speech encoding.
For most of the period since the advent of wireless communication, information has been communicated by continuously modulating a carrier signal with an information-bearing signal, for example, an audio or video signal.
In the 1990s, advances in digital circuitry, in terms of processing power and integrated-circuit cost, allowed digital technology to supplant analog technology in cellular telephony. Digital technology is less prone to various types of analog signal degradation, such as fading. Moreover, digital technology facilitates the use of advanced techniques such as error correction, which improves quality, and data compression, which lowers bandwidth requirements for the same quality.
For cellular telephony in particular, the primary form of data to be communicated is speech audio. Typically, superior compression can be achieved by using a compression algorithm that is specifically designed for the type of data to be compressed. A compression technique that is especially suited to speech audio is known as Code-Excited Linear Prediction (CELP). CELP is based on a model of the human vocal apparatus, viz., the vocal cords and the vocal tract. In the model, the vocal tract is represented by a discrete-time signal filter whose frequency response mimics the resonances of the vocal tract. Sounds that in reality are generated by bursts of air passing the vocal cords and exciting acoustic resonances in the vocal tract are simulated (e.g., in a cell phone) by the output of the filter when a series of pulses is input into the filter. A discrete portion of speech (e.g., a frame or sub-frame) is then represented by a set of pulses and optionally by filter coefficients defining the filter. The set of pulses is described by the number of pulses, the magnitudes of the pulses, the positions of the pulses within the frame (or sub-frame), and the signs (±) of the pulses. As a person speaks into his or her communication device, the foregoing information must be transmitted for each successive sub-frame; however, typically the information itself is not transmitted. Rather, the information is encoded, and a code representing the information is transmitted. One way of doing this is to store every possible combination of the number, magnitudes, positions, and signs of the pulses in a codebook, with each possible combination having a unique address in the codebook, and to transmit the address in some form rather than transmitting the information about the pulses.
A drawback of this approach is that if it is desired to achieve higher audio fidelity by allowing for more pulses, or more precision in describing the positions or magnitudes of the pulses, the size of the codebook will increase, thereby increasing the memory and search requirements for the codebook.
According to one aspect, the invention provides a transmitting voice communication device that has an audio encoder, which encodes audio, coupled to an arithmetic encoder, which further encodes the output of the audio encoder. According to certain embodiments the audio encoder is a CELP audio encoder. According to other embodiments the audio encoder is a Discrete Cosine Transform (DCT) encoder.
According to another aspect, the invention provides a receiving voice communication device that has an arithmetic decoder that decodes received information encoding audio and passes its output to an audio decoder, which further decodes the output of the arithmetic decoder. According to certain embodiments the audio decoder is a CELP decoder, and according to other embodiments the audio decoder is a DCT decoder.
The accompanying figures, where like reference numerals refer to identical or functionally similar elements throughout the separate views and which together with the detailed description below are incorporated in and form part of the specification, serve to further illustrate various embodiments and to explain various principles and advantages all in accordance with the present invention.
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help to improve understanding of embodiments of the present invention.
Before describing in detail embodiments that are in accordance with the present invention, it should be observed that the embodiments reside primarily in combinations of method steps and apparatus components related to digital speech communication. Accordingly, the apparatus components and method steps have been represented where appropriate by conventional symbols in the drawings, showing only those specific details that are pertinent to understanding the embodiments of the present invention so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.
If n is the number of pulse positions in a sub-frame and m is an upper bound on the sum of the integer pulse magnitudes for the sub-frame, then the number of pulses in the sub-frame denoted by k is bounded as follows:
1 ≤ k ≤ min(m, n)
The number of possible sets of pulse positions in the sub-frame is given by the binomial coefficient: C(n, k) = n!/(k!(n−k)!).
The number of possible ways to distribute the energy in the pulses (i.e., to divide a total magnitude of m among k pulses, each having magnitude at least one) is given by: C(m−1, k−1) = (m−1)!/((k−1)!(m−k)!),
and the number of combinations of different signs of the pulses is given by 2^k.
Accordingly, the number of different unique sets of pulses for a sub-frame is given by: N = Σk C(n, k)·C(m−1, k−1)·2^k, where the sum runs over 1 ≤ k ≤ min(m, n).
The preceding expression also gives the number of unique codes that would need to be stored if the prior art code-book approach were used.
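The counting argument above can be sanity-checked with a short Python sketch (function names are hypothetical), under the assumption that the pulse magnitudes sum to exactly m; the text states m is an upper bound, so this covers the boundary case. The closed-form count is compared against a brute-force enumeration of signed integer vectors:

```python
from itertools import product
from math import comb

def pulse_configurations(n, m):
    # Closed-form count suggested by the combinatorics above: choose k of n
    # positions, split a total magnitude of m among k pulses (each >= 1),
    # and pick a sign for each pulse.
    return sum(comb(n, k) * comb(m - 1, k - 1) * 2**k
               for k in range(1, min(m, n) + 1))

def brute_force(n, m):
    # Enumerate all length-n integer vectors whose magnitudes sum to
    # exactly m; each nonzero entry is a signed pulse.
    return sum(1 for v in product(range(-m, m + 1), repeat=n)
               if sum(abs(e) for e in v) == m)

assert pulse_configurations(4, 3) == brute_force(4, 3) == 88
```

Even for the tiny case n = 4, m = 3 the codebook would need 88 entries, which illustrates how quickly the prior-art storage requirement grows.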
Referring again to
The magnitudes of the pulses can be represented by a magnitude vector in which each element is an integer representing the magnitude of a pulse. Such magnitude vectors can be converted to binary vectors (i.e., vectors in which each element is a single bit, viz., 0 or 1) by the pulse information encoder 211 by replacing each magnitude integer with a sequence of zeros numbering one less than the magnitude integer, followed by a one. Inasmuch as the last bit in the binary vector would always be a one, it can be ignored. The following are examples (for m=6 and k=3) of magnitude vectors at the left and corresponding binary vectors at the right that result from the foregoing conversion process:
The binary vectors can then be encoded using the arithmetic encoder 212. The magnitude vectors can be recovered, after arithmetic decoding, by counting the number of zeros preceding each one.
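The conversion just described, and its inverse, can be sketched in a few lines of Python (helper names are hypothetical):

```python
def magnitudes_to_bits(mags):
    # Each integer magnitude becomes (magnitude - 1) zeros followed by a one;
    # the final one is always present and so is dropped as implicit.
    bits = []
    for mag in mags:
        bits.extend([0] * (mag - 1))
        bits.append(1)
    return bits[:-1]

def bits_to_magnitudes(bits):
    # Recover magnitudes by counting the zeros preceding each one, after
    # restoring the implicit trailing one.
    mags, run = [], 0
    for b in bits + [1]:
        if b == 0:
            run += 1
        else:
            mags.append(run + 1)
            run = 0
    return mags

# One of the m = 6, k = 3 cases: magnitudes (1, 2, 3) map to an
# (m - 1)-bit binary vector, and back.
assert magnitudes_to_bits([1, 2, 3]) == [1, 0, 1, 0, 0]
assert bits_to_magnitudes([1, 0, 1, 0, 0]) == [1, 2, 3]
```

Note that when the magnitudes sum to exactly m, the binary vector always has m − 1 bits, regardless of k.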
The signs of the pulses can be represented by a binary vector in which the bit value represents the sign, e.g., a bit value of 1 can represent a negative sign, and a bit value of 0 a positive sign. If the CELP encoder 210 outputs sign information differently, the pulse information encoder 211 can reformat the sign information in the foregoing manner.
The pulse information encoder 211 is coupled to the arithmetic encoder 212. The arithmetic encoder 212 encodes the pulse information received from the CELP encoder 210 through the pulse information encoder 211. The operation of the arithmetic encoder 212 is described more fully below. Use of an arithmetic encoder avoids storing a large codebook.
The arithmetic encoder 212 is coupled to a channel encoder 217 which is coupled to a transmitter 214 of a transceiver 216. The transceiver 216 also includes a receiver 218. The receiver 218 is coupled to an arithmetic decoder 220 through a channel decoder 219. The arithmetic decoder 220 outputs pulse information. The operation of the arithmetic decoder 220 is described more fully below. The arithmetic decoder 220 is coupled through a pulse information decoder 221 to a CELP decoder 222. The pulse information decoder 221 performs the inverse of the processes performed by the pulse information encoder 211. The CELP decoder 222 reconstructs a digital representation of speech audio (digitized audio signal) using the pulse information. The CELP decoder 222 is coupled to a digital-to-analog converter (D/A) 224 that is coupled through a second amplifier 226 to a speaker 228.
According to alternative embodiments of the invention, parts of the methods shown in
Although in the foregoing the probability of ones and zeros is assumed to remain fixed, alternatively the probabilities can vary. In certain embodiments, the total number of ones (or zeros) is known a priori or is separately transmitted beforehand, and at any bit position in a sequence being encoded, the probability of a zero is computed as the ratio of the number of zeros yet to be encountered to the total number of bits yet to be processed.
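A useful property of this adaptive model is that every n-bit sequence containing exactly k ones receives the same probability, 1/C(n, k), so each admissible sequence is assigned an equal code-space width. A small Python check using exact rational arithmetic (function name hypothetical):

```python
from fractions import Fraction
from math import comb

def sequence_probability(bits, k):
    # At each position, P(zero) = (zeros yet to be encountered) /
    # (bits yet to be processed), per the adaptive model described above.
    zeros_left, bits_left = len(bits) - k, len(bits)
    p = Fraction(1)
    for b in bits:
        p0 = Fraction(zeros_left, bits_left)
        p *= p0 if b == 0 else 1 - p0
        if b == 0:
            zeros_left -= 1
        bits_left -= 1
    return p

# Every 6-bit sequence with exactly two ones gets probability 1/C(6,2) = 1/15.
assert sequence_probability([0, 1, 0, 0, 1, 0], 2) == Fraction(1, comb(6, 2))
assert sequence_probability([1, 1, 0, 0, 0, 0], 2) == Fraction(1, 15)
```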
In the example shown in
However, in practice, the width of the code space interval corresponding to a source sequence may not be exactly equal to 1/NP(n,k) because of the rounding operations necessary to perform fixed-precision arithmetic. The actual width of the interval corresponding to a source sequence depends on the sequence itself and the precision used in the calculations. While this is cumbersome to compute, a bound can be derived for the minimum length of the code words lP(n, k, w) based on a few conservative assumptions. For example, it can be shown that (see Appendix I):

lP(n, k, w) = ┌log2 NP(n, k) + Ω(n, k, w)┐, where

Ω(n, k, w) = log2(1/(1 − (n/k)·2^−(w+1))) + log2(1/(1 − ((n−1)/(k−1))·2^−(w+1))) + . . . + log2(1/(1 − ((n−k+1)/1)·2^−(w+1))) + log2(1/(1 − (n/(n−k))·2^−(w+1))) + log2(1/(1 − ((n−1)/(n−k−1))·2^−(w+1))) + . . . + log2(1/(1 − ((k+1)/1)·2^−(w+1)))
In the equations above, w represents a precision parameter: (starting) positions and (the widths of the) intervals in the code space are stored using w+2 and w+1 bits, respectively. In general, in order to compute such positions (denoted x) and intervals (denoted y) in the code space, binary registers up to 2(w+2) bits wide will need to be used, assuming that the input symbol probabilities (e.g., the probabilities of binary digits 0 and 1) are also represented using w+1 bits. Binary registers of such width are used to store a numerator of a parameter z that is discussed below in the context of
A mathematical foundation of arithmetic encoding is given in the first part of Appendix I. Referring to
After block 602, decision block 604 tests if there are any remaining ones in the sequence α being encoded. If so, the flowchart branches to block 606, in which the quantity z is computed, the number of information bits yet to be coded ñ is decremented, and the index i is incremented. Initially, the outcome of decision block 604 is positive. The quantity z is related to the size of the portion of the code space that is associated with a zero value for a current bit position in the sequence being encoded and is a fraction of the portion of the code space associated with a previous bit. This can be understood by referring to the second column of
Next the flowchart 600 reaches decision block 608 which tests if the current bit in the sequence being encoded, identified by index i, is a zero or one. If the current bit is a zero then in block 610 the value y is set equal to z and ñ0 (the number of zeros yet to be encountered) is decremented. The value of x is unchanged. On the other hand, if the current bit is a one then in block 612 y is set equal to a previous value of y minus z and x is set equal to a previous value of x plus z. The new value of y is a proportion of the previous value of y with the proportion given by the probability of the current bit value (zero or one). x and y are related respectively to the starting point and the width of the area within the code space [0,1), as represented by [0, 2^w), that corresponds to the bit sequence encoded so far.
After either block 610 or 612 decision block 614 is reached. Decision block 614 tests if the value of y is less than 2^w. (Note that blocks 606, 610 and 612 will reduce the value of y.) If so then in block 616 the value of y is scaled up by a factor of 2 (e.g., by a left bit shift), the value of e is computed, and the value of x is reset to 2(x mod 2^w). Using the mod function essentially isolates a portion of x that is relevant to remaining, less significant code bits. Because both y and x are scaled up in block 616 in a process referred to as renormalization, even as the encoding continues and more and more information bits are being encoded, the full value of 2^w is still used as the basis of comparison of x in the floor function to determine the value of the code bits. Similarly, the full value of 2^w is still used as the basis of comparison of y in the decision block 614.
After block 616, decision block 618 tests if the variable e is equal to 1. If the outcome of decision block 618 is negative, then the flowchart 600 branches to decision block 620 which tests if the variable e is greater than 1 (e.g., if there is an overflow condition). If not, meaning that the value of e is zero, the flowchart 600 branches to block 622 wherein the value of the run bit variable rb is set equal to 1.
Next the flowchart 600 reaches block 624 in which the code bit index j is incremented, the code bit vj is set equal to the value of nb, and then nb is set equal to e. Note that for the first two executions of block 624, j is set to values less than one, so the values of vj that are set will not be utilized as part of the output code.
When the outcome of decision block 618 is positive, the flowchart 600 will branch through block 626, in which the run length variable rl is incremented, and then return to decision block 614. Decision block 628 tests if the run length variable rl is greater than zero (the initial value). If so then in block 630 the index j is incremented, code bit vj is set to the run bit variable rb, and the run length rl is decremented, before returning to decision block 628. When it is determined in decision block 628 that the run length variable rl is zero, the flowchart 600 returns to block 614.
If the outcome of decision block 620 is positive, i.e., an overflow condition has been detected, then the flowchart 600 branches to block 632 in which the nb variable is incremented, the rb variable is zeroed, and the e is decremented by 2, after which the flowchart 600 proceeds with block 624.
If it is determined in decision block 604 that only zeros remain in the sequence being encoded, then the flowchart 600 branches to block 634 in which the value of the variable e is computed as the floor function of x divided by 2^w. Next decision block 636 tests if e is greater than 1. If so then in block 638 the next bit variable nb is incremented, the run bit variable rb is set equal to 0, and the variable e is decremented by 2. If the outcome of decision block 636 is negative, then in block 640 the run bit variable rb is set equal to 1. After either block 638 or 640, in block 642, the index j is incremented, the code bit vj is set equal to the next bit variable nb, and the next bit variable nb is set equal to e.
Next decision block 644 tests if the run length variable rl is greater than zero. If so then in block 646 the index j is incremented, the code bit vj is set equal to the run bit variable rb, and the run length variable rl is decremented, after which the flowchart 600 returns to block 644.
When the outcome of decision block 644 is negative, in block 648 the index j is incremented and the code bit vj is set equal to the next bit variable nb. Next decision block 650 tests if the index j is less than the code length l. If so then block 652 sets the remaining code bits to 1. When j reaches l the encoding terminates.
Referring to
After block 712 block 710 is executed. When it is determined in decision block 704 that y is not less than 2^w, the flowchart 700 branches to block 714 which computes the value of z as shown, decrements the number of information bits yet to be decoded ñ, and increments the index i which points to bits of the decoded sequence. Next decision block 716 tests if x is less than z. If not, then in block 718 an ith decoded bit ui is set equal to one, and x and y are decremented by z to account for the parts of x and y represented by the ith bit just decoded. If decision block 716 determines that x is less than z, then in block 720 the ith decoded bit ui is set equal to zero, y is set equal to z, and the number of zeros yet to be encountered ñ0 is decremented to account for the zero bit ui just decoded.
After either block 718 or 720 decision block 722 tests if the number of zeros remaining is less than the total number of bits remaining. If the outcome of block 722 is affirmative, the flowchart 700 loops back to decision block 704. If the outcome of block 722 is negative, the flowchart branches to decision block 724 which tests if i is less than n. If so block 726 zero fills the remaining bits. When the outcome of decision block 724 is negative the decoding process terminates.
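The overall encode/decode flow can be illustrated with a simplified Python sketch (function names are hypothetical). It follows the interval recursions of Appendix I using exact rational arithmetic, and therefore deliberately omits the fixed-precision renormalization and carry handling of flowcharts 600 and 700; it is a model of what those flowcharts compute, not an implementation of them:

```python
from fractions import Fraction
from math import ceil, log2

def p_zero(zeros_left, bits_left):
    # Adaptive model: probability of a zero at the current position.
    return Fraction(zeros_left, bits_left)

def encode(bits, k):
    x, y = Fraction(0), Fraction(1)          # interval [x, x + y) in [0, 1)
    zeros_left, bits_left = len(bits) - k, len(bits)
    for b in bits:
        z = y * p_zero(zeros_left, bits_left)
        if b == 0:                           # keep the lower (zero) part
            y = z
            zeros_left -= 1
        else:                                # keep the upper (one) part
            x, y = x + z, y - z
        bits_left -= 1
    # Any l-bit dyadic point inside [x, x + y) identifies the sequence;
    # l = ceil(log2(1/y)) + 1 bits always suffice.
    l = ceil(log2(1 / y)) + 1
    return ceil(x * 2**l), l                 # codeword value and length

def decode(code, l, n, k):
    v = Fraction(code, 2**l)                 # point selected by the encoder
    x, y = Fraction(0), Fraction(1)
    zeros_left, bits_left = n - k, n
    bits = []
    for _ in range(n):
        z = y * p_zero(zeros_left, bits_left)
        if v - x < z:                        # point lies in the zero part
            bits.append(0)
            y = z
            zeros_left -= 1
        else:                                # point lies in the one part
            bits.append(1)
            x, y = x + z, y - z
        bits_left -= 1
    return bits

seq = [0, 1, 0, 0, 1, 0]
code, l = encode(seq, k=2)
assert decode(code, l, n=6, k=2) == seq
```

For n = 6, k = 2 every interval has width 1/15, so five code bits suffice, close to the ideal log2 15 ≈ 3.9 bits.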
In block 808 information as to the position of any non-zero coefficients is encoded in a first binary vector. The length of the first binary vector is equal to the number of DCT coefficients, and each bit in the first binary vector is set to a one or a zero depending on whether the corresponding (by position) coefficient of the quantized DCT output is non-zero or zero.
In block 810 the signs of the non-zero quantized DCT coefficients are encoded in a second binary vector. The second binary vector need only be as long as the number of non-zero quantized DCT coefficients. Each bit in the second binary vector is set equal to a zero or a one depending on whether the corresponding non-zero quantized DCT coefficient is negative or positive. As discussed above, arithmetic coding and decoding of binary vectors encoding sign information can be based on assumed fixed probabilities of ½ for both zero and one, and therefore it is not necessary to transmit the number of ones (or zeros) in such vectors.
In block 812 the magnitudes of the non-zero quantized DCT coefficients are encoded in a third binary vector. The method of encoding magnitudes described above with reference to the pulse information encoder 211 is suitably used. Note that according to certain embodiments the sum of the magnitudes of the coefficients is a fixed (design) value, and in such cases the number of zeros in binary vectors encoding the magnitudes will also be fixed and therefore need not be transmitted.
In block 814 one or more of the first through third binary vectors are encoded using an arithmetic encoder. Two or more of the first through third binary vectors can be concatenated and encoded together by the arithmetic encoder, or the binary vectors can be encoded separately by the arithmetic encoder. In block 816 the number of non-zero DCT coefficients is transmitted. The number of non-zero DCT coefficients can be encoded (e.g., arithmetic encoded or Huffman encoded) prior to transmission. In block 818 the encoded binary vectors are transmitted.
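The construction of the three binary vectors in blocks 808 through 812 can be sketched as follows (the function name is hypothetical; the sign convention of zero for negative and one for positive follows the description above):

```python
def dct_coefficient_vectors(coeffs):
    # First vector (block 808): one bit per coefficient, 1 where non-zero.
    positions = [0 if c == 0 else 1 for c in coeffs]
    nonzero = [c for c in coeffs if c != 0]
    # Second vector (block 810): a zero for a negative coefficient and a
    # one for a positive one; only non-zero coefficients contribute.
    signs = [0 if c < 0 else 1 for c in nonzero]
    # Third vector (block 812): magnitudes encoded as runs of zeros
    # terminated by ones, with the final one left implicit.
    magnitudes = []
    for c in nonzero:
        magnitudes.extend([0] * (abs(c) - 1))
        magnitudes.append(1)
    return positions, signs, magnitudes[:-1]

p, s, m = dct_coefficient_vectors([0, 3, 0, -1, 2, 0])
assert p == [0, 1, 0, 1, 1, 0]
assert s == [1, 0, 1]
assert m == [0, 0, 1, 1, 0]
```

Each of the three vectors can then be handed to the arithmetic encoder, separately or concatenated, as described for block 814.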
The transceiver module 1102 is coupled to the antenna 1004. Carrier signals that are modulated with data, e.g., audio data, pass between the antenna 1004 and the transceiver module 1102.
The microphone 1012 is coupled to the A/D 1106. Audio, including spoken words and ambient noise, is input through the microphone 1012 and converted to digital format by the A/D 1106.
A switch matrix 1122 that is part of the keypad 1008 is coupled to the key input decoder 1108. The key input decoder 1108 serves to identify depressed keys and to provide information identifying each depressed key to the processor 1104.
The D/A 1112 is coupled to the speaker 1010. The D/A 1112 converts decoded digital audio to analog signals and drives the speaker 1010. The display driver 1114 is coupled to the display 1006.
The program memory 1116 is used to store programs that control the wireless communication device 1000. The programs stored in the program memory 1116 are executed by the processor 1104. The workspace memory 1118 is used as a workspace by the processor 1104 in executing programs. Methods that are carried out by programs stored in the program memory 1116 are described above with reference to
In the foregoing specification, specific embodiments of the present invention have been described. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the present invention as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present invention. The benefits, advantages, solutions to problems, and any element(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as critical, required, or essential features or elements of any or all of the claims. The invention is defined solely by the appended claims, including any amendments made during the pendency of this application, and all equivalents of those claims as issued.
A) Mathematical Foundation of Arithmetic Coding:
In arithmetic coding, each information word to be coded is assigned a unique subinterval within the unit interval [0, 1). The computation of this interval can be performed recursively with the knowledge of the probabilities of the symbols within the information word. A point within the interval is then selected, and a fractional representation of this point is used as the codeword.
Mathematically, let α denote a binary information word and l(α)=[x(α), x(α)+y(α)) denote the interval corresponding to α where x(α) denotes the start of the interval and y(α) denotes the width of the interval. When α is just the empty sequence ε, we define
x(ε)=0.0 and y(ε)=1.0,
so that l(ε)=[0, 1). If the interval corresponding to α is known, then the intervals corresponding to α0 and α1 (i.e., the concatenation of α and either 0 or 1 respectively) can be computed as follows.
x(α0)=x(α),
y(α0)=y(α) P(0|α),
x(α1)=x(α)+y(α) P(0|α), and
y(α1)=y(α) P(1|α)=y(α) (1−P(0|α))=y(α)−y(α) P(0|α),
where P(0|α) and P(1|α) (=1−P(0|α)) denote respectively the probabilities of a 0 or 1 bit following the bit sequence α. Using the notation z(α)=y(α) P(0|α) in the above equations, we have
x(α0)=x(α),
y(α0)=z(α),
x(α1)=x(α)+z(α), and
y(α1)=y(α)−z(α).
Computation of the interval l(α) corresponding to α using the above recursive equations requires infinite precision. In arithmetic coding, rounding and scaling (or renormalization) operations are used which allow the computation of l(α) to be performed using finite precision arithmetic. However, the computed interval is now only an approximation of the actual interval. Let us define the integers x*(α), y*(α), L(α), and w so that x(α) and y(α) can be expressed using finite precision (i.e., using L(α)+w bits) as
x(α) = x*(α)/2^(L(α)+w), and
y(α) = y*(α)/2^(L(α)+w).
The recursive equations for the computation of the interval l(α) are now reformulated as follows. For the empty sequence ε, we define
x*(ε) = 0, y*(ε) = 2^w, and L(ε) = 0.
If x*(α), y*(α), and L(α) are known for a sequence α, then we have
for the sequence α0:
z*(α) = └y*(α) P(0|α) + 1/2┘,
x*(α0) = x*(α)·2^d0,
y*(α0) = z*(α)·2^d0, and
L(α0) = L(α) + d0,
where d0 is an integer for which 2^w ≤ y*(α0) < 2^(w+1); and
for the sequence α1:
z*(α) = └y*(α) P(0|α) + 1/2┘,
x*(α1) = (x*(α) + z*(α))·2^d1,
y*(α1) = (y*(α) − z*(α))·2^d1, and
L(α1) = L(α) + d1,
where d1 is an integer for which 2^w ≤ y*(α1) < 2^(w+1).
In the above equations, the rounding operation used in the computation of z*(α) ensures that it is expressed in finite precision (w+1 bits). Also, the choice of d0 (respectively d1) used in scaling y*(α0) (respectively y*(α1)) ensures that the scaled interval width has enough precision (w+1 bits) for further subdivision. The precision parameter w is a design value and should be chosen to suit the coding application. A choice of w=14, for example, provides enough precision for general applications and also allows standard integer arithmetic to be used in computing the codeword.
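One fixed-precision refinement step of the recursions above can be sketched in Python (names are hypothetical; a toy precision of w = 6 is used so the numbers stay small, whereas the text suggests w = 14 for real applications):

```python
from fractions import Fraction
from math import floor

W = 6  # small precision parameter for illustration; the text suggests w = 14

def refine(x_star, y_star, L, p0, bit):
    # One fixed-precision step of the Appendix recursions: z* is rounded,
    # then the chosen subinterval is scaled up (renormalized) by 2^d so
    # that 2^W <= y* < 2^(W+1) again.
    z_star = floor(y_star * p0 + Fraction(1, 2))
    if bit == 0:
        y_star = z_star                       # keep the zero subinterval
    else:
        x_star, y_star = x_star + z_star, y_star - z_star
    d = 0
    while y_star < 2**W:
        x_star, y_star, d = 2 * x_star, 2 * y_star, d + 1
    return x_star, y_star, L + d

# Start from the empty sequence: x* = 0, y* = 2^W, L = 0, then refine
# with two example bits and their zero-probabilities.
x_star, y_star, L = 0, 2**W, 0
for bit, p0 in [(0, Fraction(2, 3)), (1, Fraction(1, 2))]:
    x_star, y_star, L = refine(x_star, y_star, L, p0, bit)
# y* stays within [2^W, 2^(W+1)) after every renormalized step.
assert 2**W <= y_star < 2**(W + 1)
```

The invariant checked at the end is exactly the condition on d0 and d1 above: renormalization keeps the interval width representable in w+1 bits.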
The binary fractional representations of x(α) and y(α) are shown in
B) Bounding the Codeword Length:
Consider the encoding of an n-bit sequence using the flowchart 600 in
Because of the rounding operation used in computing z*(α), we have

y*(α0)/2^(L(α0)+w) > (y*(α) P(0|α) − 1/2)/2^(L(α)+w), and
y*(α1)/2^(L(α1)+w) ≥ (y*(α) P(1|α) − 1/2)/2^(L(α)+w).

Combining the two expressions, we have

y*(αu)/2^(L(αu)+w) ≥ (y*(α) P(u|α) − 1/2)/2^(L(α)+w),

where u is a 0 or 1. The above expression can be rewritten as

y(αu) ≥ y(α) P(u|α) (1 − 1/(2 y*(α) P(u|α))).

Since y*(α) ≥ 2^w, we have

y(αu) ≥ y(α) P(u|α) (1 − δ/P(u|α)),

where δ = 2^−(w+1). Applying the above relationship recursively to the input bit sequence (i.e., information word) α = u1, u2, . . . , un and recalling that y(ε) = 1, we have

y(α) ≥ P(u1|ε) P(u2|u1) . . . P(un|u1u2 . . . un−1) × (1 − δ/P(u1|ε)) (1 − δ/P(u2|u1)) . . . (1 − δ/P(un|u1u2 . . . un−1)).
The expression P(u1|ε)P(u2|u1) . . . P(un|u1u2 . . . un−1) represents the probability P(α) of the sequence α and is also the ideal interval width. If α is an n-bit sequence with k ones and if the probability of a zero at any position is given by ñ0/ñ (the number of zeros yet to be encountered divided by the number of bits yet to be processed), then it can be shown that P(α) = 1/NP(n,k), where NP(n,k) = n!/(k!(n−k)!).
Simplifying the notation by replacing P(ui|u1u2 . . . ui−1) by Pi, we have

y(α) ≥ P(α) (1 − δ/P1) (1 − δ/P2) . . . (1 − δ/Pn).

Each term of the form (1 − δ/Pi) reduces the interval width from the ideal value P(α), with the greatest reduction occurring for the smallest value of Pi. While the actual set of probabilities {Pi, i = 1, 2, . . . , n} depends on the particular n-bit sequence, the set of n probabilities {k/n, (k−1)/(n−1), . . . , 1/(n−k+1), (n−k)/n, (n−k−1)/(n−1), . . . , 1/(k+1)} provides a lower bound for any sequence α. The codeword length lP(n,k,w) should be chosen such that 2^−lP(n,k,w) does not exceed the minimum possible interval width, which leads to:

lP(n, k, w) = ┌log2 NP(n, k) + Ω(n, k, w)┐, where

Ω(n, k, w) = log2(1/(1 − (n/k)·2^−(w+1))) + log2(1/(1 − ((n−1)/(k−1))·2^−(w+1))) + . . . + log2(1/(1 − ((n−k+1)/1)·2^−(w+1))) + log2(1/(1 − (n/(n−k))·2^−(w+1))) + log2(1/(1 − ((n−1)/(n−k−1))·2^−(w+1))) + . . . + log2(1/(1 − ((k+1)/1)·2^−(w+1)))
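The bound can be evaluated numerically; the sketch below (function names hypothetical) computes Ω(n, k, w) as a sum over the n worst-case probabilities and the resulting codeword length:

```python
from math import ceil, comb, log2

def omega(n, k, w):
    # One term per input bit; the ratios are the reciprocals 1/Pi of the
    # worst-case set {k/n, ..., 1/(n-k+1), (n-k)/n, ..., 1/(k+1)}.
    inv_p = [(n - i) / (k - i) for i in range(k)]           # bits that are ones
    inv_p += [(n - i) / (n - k - i) for i in range(n - k)]  # bits that are zeros
    return sum(log2(1 / (1 - r * 2 ** -(w + 1))) for r in inv_p)

def codeword_length(n, k, w):
    return ceil(log2(comb(n, k)) + omega(n, k, w))

# With w = 14, the overhead beyond the ideal length log2 C(20,5) ~ 13.92
# bits is only a few thousandths of a bit, so 14 code bits suffice.
assert codeword_length(20, 5, 14) == 14
```

This illustrates the practical point of the bound: with a modest precision parameter such as w = 14, fixed-precision arithmetic coding typically costs at most a fraction of a bit over the ideal length log2 NP(n, k).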