The transmission of high-resolution signals typically includes data compression and recovery via codec devices for encoding and decoding the data stream before and after transmission. Requirements for faster speeds and higher resolutions (e.g., real time audio and video) have created a need for improved encoding and decoding of data signals generally and, in particular, for high resolution signals. Conventional approaches have been limited in their capacity to accommodate these requirements in many practical applications where the resolution requirements can result in unacceptably slow speeds.
Certain embodiments enable improved encoding and decoding of a vector of coefficients by associating a vector element of a signed pyramid with an encoded value that includes a first portion and a second portion, where the first portion identifies a corresponding vector element of an unsigned pyramid and the second portion characterizes sign values for nonzero components of the vector element of the signed pyramid. As a result, computational constraints such as word size apply to the unsigned pyramid instead of the signed pyramid. The smaller size of the unsigned pyramid enables extending the range of signed-pyramid parameters that are operable within the computational constraints.
One embodiment relates to a method of processing audio signals. A first operation includes accessing an input audio signal from an audio source. A second operation includes encoding the input audio signal by determining a plurality of encoded values. An encoded value of the plurality of encoded values includes a first portion and a second portion, the first portion including an index to an element of an unsigned pyramid that is defined by a vector size and a quantization parameter, and the second portion including a corresponding sign value for each nonzero component of the element of the unsigned pyramid. A third operation includes decoding the encoded values in accordance with the encoding of the input audio signal to generate an output audio signal. A fourth operation includes providing the output audio signal to an audio player. Additional operations related to data transmission of the encoded values may be included between the second operation for encoding and the third operation for decoding.
Another embodiment relates to an apparatus for carrying out any one of the above-described methods, where the apparatus includes a computer for executing instructions related to the method. For example, the computer may include a processor for executing at least some of the instructions. Additionally or alternatively, the computer may include circuitry or other specialized hardware for executing at least some of the instructions. In some operational settings, the apparatus may be configured as a system that includes one or more units, each of which is configured to carry out some aspects of the method in software, in hardware, or in some combination thereof. At least some values for the results of the method can be saved for later use in a computer-readable medium, including memory units and storage devices. Another embodiment relates to a computer-readable medium that stores (e.g., tangibly embodies) a computer program for carrying out any one of the above-described methods with a computer. In these ways, aspects of the disclosed embodiments enable improved encoding and decoding of signals in a variety of operational settings and provide improvements in computer-related technology in response to increasing requirements for faster speeds and higher resolutions.
Certain embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings.
The description that follows includes systems, methods, techniques, instruction sequences, and computer-program products that illustrate embodiments of the present disclosure. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide an understanding of various embodiments of the disclosed subject matter. It will be evident, however, to those skilled in the art that embodiments of the disclosed subject matter may be practiced without these specific details. In general, well-known instruction instances, protocols, structures and techniques have not been shown in detail.
For positive integers $n, k$, the signed pyramid $S(n, k)$ denotes the subset of vectors from $\mathbb{Z}^n$ whose $L_1$ norm equals $k$, where $\mathbb{Z}$ denotes the set of integers:

$$S(n,k)=\{(y_1,\ldots,y_n)\in\mathbb{Z}^n:\ \textstyle\sum_{i=1}^{n}|y_i|=k\}. \qquad (1)$$
A pyramid vector quantizer (PVQ) maps vectors in $\mathbb{R}^n$, where $\mathbb{R}$ denotes the set of real numbers, to vectors in $S(n, k)$ for some positive integer $k$, the quantization parameter that characterizes the quantizer resolution. In a conventional coding application, a vector $x\in\mathbb{R}^n$ is quantized to a vector $y\in S(n, k)$, and this quantized vector is then encoded and transmitted to a decoder, where the process is reversed.
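By way of example and not limitation, the following Python sketch illustrates one common greedy heuristic for this quantization step; the function name and the particular projection heuristic are illustrative assumptions, as the specific mapping used by any given codec may differ.

```python
def pvq_quantize(x: list[float], k: int) -> list[int]:
    # Scale |x| onto the L1 sphere of radius k, take floors, then hand out
    # the remaining units to the components with the largest residuals.
    # The result lies in S(n, k) by construction.
    s = sum(abs(v) for v in x)
    if s == 0:
        return [k] + [0] * (len(x) - 1)   # arbitrary valid pyramid point
    scaled = [abs(v) * k / s for v in x]
    mags = [int(v) for v in scaled]       # floors; their sum cannot exceed k
    for _ in range(k - sum(mags)):
        i = max(range(len(x)), key=lambda j: scaled[j] - mags[j])
        mags[i] += 1
    return [m if v >= 0 else -m for m, v in zip(mags, x)]
```

For instance, pvq_quantize([0.7, -2.1, 1.4], 4) returns [1, -2, 1], whose component magnitudes sum to the quantization parameter $k=4$.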
One way to encode the quantized vector $y$ is to enumerate the set $S(n, k)$, that is, order its elements so that each vector $y\in S(n, k)$ has a unique index $I_S(y)$, $0\le I_S(y)<|S(n, k)|$, where $|S|$ denotes the cardinality of the set $S$. Then, to encode $y$, its index $I_S(y)$ is transmitted. On the receiving side, the decoder, using the same enumeration, reconstructs $y$ from its index. For a practical application, the enumeration should be efficiently computable, so that for a given vector $y$ its index $I_S(y)$ can be computed efficiently, and conversely, for a given index $I_S(y)$, the corresponding vector $y$ can be computed efficiently. Such efficient enumerations of $S(n, k)$ are known to those skilled in the art of coding for a PVQ. The encoding of the index itself can be accomplished, for example, by using a fixed-length code with the binary representation of $I_S(y)$ in $\lceil\log_2|S(n, k)|\rceil$ bits (i.e., the ceiling of the base-2 logarithm). Alternatively, a somewhat more efficient encoding can be accomplished by using an optimal two-length code, with some indices coded in $\lceil\log_2|S(n, k)|\rceil$ bits and the remaining indices coded in $\lfloor\log_2|S(n, k)|\rfloor$ bits (i.e., the floor of the base-2 logarithm), where this approach assumes a uniform probability distribution on the vectors in $S(n, k)$.
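For illustration, the two index codes mentioned above can be sketched as follows; the truncated binary construction shown is one standard way to realize an optimal two-length code for a uniform source and is an assumption, not a construction mandated by the text.

```python
from math import floor, log2

def fixed_length(index: int, size: int) -> str:
    # Fixed-length code: every index is written in ceil(log2(size)) bits.
    width = max(1, (size - 1).bit_length())   # equals ceil(log2(size))
    return format(index, f"0{width}b")

def truncated_binary(index: int, size: int) -> str:
    # Two-length code for `size` equiprobable indices: the first
    # 2**(l+1) - size indices get l bits, the remaining get l + 1 bits.
    l = floor(log2(size))
    short = 2 ** (l + 1) - size
    if index < short:
        return format(index, f"0{l}b")
    return format(index + short, f"0{l + 1}b")
```

For example, with size = 5 the truncated binary codewords are 00, 01, 10, 110, and 111, averaging 2.4 bits per index instead of a fixed 3.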
The enumeration algorithms involve arithmetic operations on integers of size roughly $\log_2|S(n, k)|$. In an efficient computer implementation, one would prefer to have these integers fit in the natural word size, $m$, of the processor in use (e.g., $m=32$ or $m=64$ bits). This places a constraint on the feasible parameters $n, k$ of the PVQ, namely $\log_2|S(n, k)|\le m$. In practical applications, it is often the case that a relatively long vector needs to be quantized and encoded with a given budget of $B$ bits. If $B>m$, this cannot be done in one encoding, since a pyramid of size $2^B$ is too large for a single enumeration. A common solution is to partition the vector into two halves and divide the budget between them (not necessarily in equal parts), as in, for example, $B=B_1+B_2-c$, where $c$ denotes the possible cost of describing the budget allocation or other properties of the partition to the decoder. One can then attempt the quantization/encoding on each of the halves, and the process can be continued recursively until the budget $b_i$ for each part of the vector satisfies $b_i\le m$.
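A minimal sketch of this recursive subdivision follows; for simplicity it halves the length and budget and ignores the side-information cost $c$ and any rate-allocation heuristics that an actual codec would apply.

```python
def split_budget(n: int, b: int, m: int) -> list[tuple[int, int]]:
    # Recursively split (vector length, bit budget) pairs until each
    # part's budget fits within the machine word size m.
    if b <= m:
        return [(n, b)]
    n1, b1 = n // 2, b // 2
    return split_budget(n1, b1, m) + split_budget(n - n1, b - b1, m)

# e.g., split_budget(64, 100, 32) yields four length-16 parts of 25 bits each
```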
As discussed below in detail, certain embodiments are based on an alternative coding process that relaxes the above constraints on $n$ and $k$, thereby allowing for pyramids of sizes larger than $2^m$, so that the number of splits in the above-described recursive subdivision process can be reduced. For positive integers $n, k$, the unsigned pyramid $P(n, k)$ denotes the subset of vectors in $S(n, k)$ whose components are all nonnegative:

$$P(n,k)=\{(y_1,\ldots,y_n)\in\mathbb{Z}^n:\ y_i\ge 0,\ 1\le i\le n,\ \textstyle\sum_{i=1}^{n}|y_i|=k\}. \qquad (2)$$
As described below, the encoding of a vector $y\in S(n, k)$ can be performed in two steps, resulting in two portions.

In the first step, one determines the corresponding vector $y'\in P(n, k)$, defined by $y'_i=|y_i|$, $1\le i\le n$, and then finds and encodes the corresponding index $I_P(y')$ in an enumeration of $P(n, k)$. With the assumption of a uniform distribution of vectors in $P(n, k)$, this requires $\ell(y')$ bits, where $\ell(y')$ satisfies $\lfloor\log_2|P(n, k)|\rfloor\le\ell(y')\le\lceil\log_2|P(n, k)|\rceil$. The first portion of the encoded vector is the index $I_P(y')$.

In the second step, the sign of each nonzero component (e.g., entry) of $y$ is characterized by a 1-bit code. The second portion of the encoded vector is a characterization of these signs, for example a simple encoding utilizing one bit per sign (e.g., ‘0’ representing ‘+’, ‘1’ representing ‘−’).
To perform the first step, one can use a conventional enumeration of $P(n, k)$, which is generally much smaller than $S(n, k)$. The constraint on the sizes of the integers operated on becomes $\log_2|P(n, k)|\le m$, which is less stringent than the constraint $\log_2|S(n, k)|\le m$.
The second portion of the encoding from the second step is relatively small, requires no arithmetic operations, and is not subject to any particular limitation. The length of this part, however, can vary from 1 bit up to $\min(n, k)$ bits, depending on the number of nonzero components in $y$ (the vector $y$ is assumed to be nonzero). As a result, this portion of the encoding may have a variable length, in contrast to conventional methods that employ a fixed-length encoding.
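By way of example and not limitation, the following sketch implements the two-step encoding and its inverse for one particular (lexicographic) enumeration of $P(n, k)$; any efficiently computable enumeration could be substituted, and the function names are illustrative.

```python
from math import comb

def p_size(n: int, k: int) -> int:
    # |P(n, k)| = C(n + k - 1, n - 1); P(0, 0) holds only the empty vector.
    return comb(n + k - 1, n - 1) if n > 0 else int(k == 0)

def index_p(y: list[int]) -> int:
    # Lexicographic rank of a nonnegative integer vector y within P(n, k).
    n, k, rank = len(y), sum(y), 0
    for i, yi in enumerate(y):
        for v in range(yi):                 # vectors smaller at position i
            rank += p_size(n - i - 1, k - v)
        k -= yi
    return rank

def vector_p(n: int, k: int, rank: int) -> list[int]:
    # Inverse of index_p: recover the unsigned vector from its rank.
    y = []
    for i in range(n):
        v = 0
        while rank >= p_size(n - i - 1, k - v):
            rank -= p_size(n - i - 1, k - v)
            v += 1
        y.append(v)
        k -= v
    return y

def encode(y: list[int]) -> tuple[int, list[int]]:
    # First portion: index of |y| in P(n, k). Second portion: one bit per
    # nonzero component, 0 representing '+' and 1 representing '-'.
    first = index_p([abs(c) for c in y])
    second = [int(c < 0) for c in y if c != 0]
    return first, second

def decode(n: int, k: int, first: int, second: list[int]) -> list[int]:
    mags = vector_p(n, k, first)
    bits = iter(second)
    # Consume one sign bit per nonzero magnitude, in component order.
    return [-m if m and next(bits) else m for m in mags]
```

For instance, with $n=3$ and $k=2$, encode([-1, 0, 1]) yields (3, [1, 0]), and decode(3, 2, 3, [1, 0]) recovers [-1, 0, 1]; all arithmetic involves integers bounded by $|P(n, k)|$ rather than $|S(n, k)|$.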
In addition to the above-described advantage in extending the range of feasible parameters for the PVQ, it should be noted that $P(n, k)$ is a simpler combinatorial object than $S(n, k)$, so the corresponding enumeration algorithms are also simpler. As discussed below, elements of $P(n, k)$ can be identified with combinations of $(n+k-1)$ elements taken $(n-1)$ at a time.
In a typical high-resolution coding application with an $m$-bit machine, one is given a budget of $b$ bits and a vector $x$ of length $n$ to quantize and encode with the largest available quantization parameter $k$. In the case of the above-described conventional method based on an enumeration of $S(n, k)$ with a fixed-length encoding, if $b\le m$, then one finds the largest value of $k$ that satisfies $|S(n, k)|\le 2^b$. If $b>m$, then the vector can be split as described above.
For embodiments based on a variable-length encoding as described above, the code length may depend strongly on x, and so different quantization levels may be possible for a given bit budget b and different values of x. Alternatively, one can focus on finding the quantization level corresponding to the parameter k so that the bound is achieved on average (e.g., over many coding operations uniformly sampled from possible vectors x). That is, the actual bit consumption may be above or below the set budget for individual cases so long as the long-term average satisfies the budget.
For certain embodiments the quantization parameter $k$ can be determined for a given vector length $n$ by evaluating an expected code length, so that the value of $k$ yields an expected code length that is relatively (or optimally) close to the bit budget $b$. With the assumption that all vectors in $S(n, k)$ are equally likely, the expected code length is bounded above by $L_k$, defined as

$$L_k=\lceil\log_2|P(n,k)|\rceil+n-Z(n,k), \qquad (3)$$

where $Z(n,k)$ is the expected number of zero components in a vector selected randomly from $P(n,k)$.
For the evaluation of this formula for $L_k$, the number of vectors in $P(n, k)$ is

$$|P(n,k)|=\binom{n+k-1}{n-1},$$

which follows from a stars-and-bars argument: $(n+k-1)$ positions mark the locations of boundaries between vector elements plus the locations of unitary contributions to the vector elements, with one vector-element boundary fixed at the start (e.g., left-most or right-most). The locations of the $(n-1)$ additional boundaries can then be chosen from the $(n+k-1)$ possible positions.
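As a quick illustrative check of this count, feasible only for small $n$ and $k$, the pyramid can be enumerated exhaustively:

```python
from itertools import product
from math import comb

def brute_count(n: int, k: int) -> int:
    # Count nonnegative integer n-vectors with component sum k by exhaustion.
    return sum(1 for y in product(range(k + 1), repeat=n) if sum(y) == k)

assert brute_count(4, 6) == comb(4 + 6 - 1, 4 - 1) == 84
```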
The value of $Z(n,k)$ is

$$Z(n,k)=\frac{1}{|P(n,k)|}\sum_{j=\max(0,\,n-k)}^{n-1} j\,\binom{n}{j}\,|P(n-j,\,k-n+j)|,$$

where the sum runs over the feasible numbers $j$ of zero components. In this formula for $Z(n,k)$, the factor $\binom{n}{j}$ is the number of different choices for $j$ zero components, and $|P(n-j,\,k-n+j)|$ is the number of strictly positive integer vectors of dimension $(n-j)$ with components summing to $k$.
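A direct transcription of these formulas into a self-contained Python sketch (with illustrative function names) is:

```python
from math import ceil, comb, log2

def p_size(n: int, k: int) -> int:
    # |P(n, k)| = C(n + k - 1, n - 1), as derived above.
    return comb(n + k - 1, n - 1) if n > 0 else int(k == 0)

def z_expected(n: int, k: int) -> float:
    # Z(n, k): expected number of zero components over P(n, k). Exactly j
    # zeros: C(n, j) position choices times |P(n - j, k - n + j)| strictly
    # positive completions of the remaining n - j components.
    total = sum(j * comb(n, j) * p_size(n - j, k - n + j)
                for j in range(max(0, n - k), n))
    return total / p_size(n, k)

def expected_code_length(n: int, k: int) -> float:
    # The upper bound L_k of Eq. 3: index bits plus expected sign bits.
    return ceil(log2(p_size(n, k))) + n - z_expected(n, k)
```

For example, z_expected(2, 2) returns 2/3, matching the three vectors (0, 2), (1, 1), and (2, 0) of $P(2, 2)$, which have 1, 0, and 1 zero components, respectively.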
The above-described formulas can be used to derive expressions that determine the optimal (or nearly optimal) quantization parameter $k$ for a given vector size $n$ and bit budget $b$. For integers $n, b$ greater than 1, the function $K(n, b)$ is defined as the value of $k$ such that $\lceil L_k\rceil$ is closest to $b$, with ties broken in favor of the larger value of $k$. In some operational settings, a pre-determined estimate for this function can be stored for convenient access. For example, certain embodiments include a functional form for an approximation to the optimal value of $k$ with coefficients $c_0(n)$ and $c_1(n)$, where the coefficients can be obtained through a least-squares fit to $K(n, b)$ over a range of relevant values. This approximation enables an efficient evaluation of the quantization parameter $k$ for given values of the vector length $n$ and bit budget $b$.
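As an illustrative sketch (the fitted closed form itself is not reproduced here), the reference values $K(n, b)$ to which such a least-squares fit would be made can be computed by direct search, assuming expected_code_length from the previous listing is in scope and that $\lceil L_1\rceil\le b$:

```python
from math import ceil

def best_k(n: int, b: int) -> int:
    # K(n, b): the k whose ceil(L_k) is closest to the budget b, with ties
    # broken toward the larger k. Since L_k grows with k, a linear scan to
    # the crossing point suffices.
    k = 1
    while ceil(expected_code_length(n, k + 1)) <= b:
        k += 1                              # largest k not past the budget
    err_lo = b - ceil(expected_code_length(n, k))
    err_hi = ceil(expected_code_length(n, k + 1)) - b
    return k + 1 if err_hi <= err_lo else k

# Tabulated (b, best_k(n, b)) pairs over the relevant range of b then
# supply the data for the least-squares fit of c0(n) and c1(n).
```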
Although certain embodiments are described here with reference to audio signals, those skilled in the art of data processing will appreciate the extensions to alternative data sets (e.g., video signals).
A second operation 304 includes encoding the input audio signal to determine encoded values. An encoded value includes a first portion and a second portion, the first portion including an index to an element of an unsigned pyramid that is defined by a vector size and a quantization parameter, and the second portion including a corresponding sign value for each nonzero component of the element of the unsigned pyramid.
A third operation 306 includes decoding the encoded values in accordance with the encoding of the input audio signal to generate an output audio signal. A fourth operation 308 includes providing the output audio signal to an audio player (e.g., including an output speaker). As discussed below, optional operations 310 related to data transmission are typically performed between the encoding operation 304 and the decoding operation 306.
The unsigned pyramid is typically defined as in Eq. 2. That is, the unsigned pyramid includes a plurality of vectors of the vector size, and each of the plurality of vectors of the unsigned pyramid has non-negative integral vector elements with a sum of the non-negative integral vector elements being equal to the quantization parameter. As discussed above with respect to Eq. 2, the first portion of the encoded value may include a first sequence of bits, a length of the first sequence being selected from one or more values based on a size of the unsigned pyramid. As discussed above, this length may be a fixed length based on the size of the unsigned pyramid or varied somewhat to minimize or reduce the number of bits, as in the optimal two-length code discussed above. Further, the second portion of the encoded value may include a second sequence of bits, a variable length of the second sequence being based on the number of nonzero components of the vector encoded in the first sequence. That is, the unsigned-pyramid element identified from the first portion includes a number of nonzero components whose sign values (plus or minus) are identified from the second portion in order to recover the corresponding signed-pyramid element.
The related signed pyramid is typically defined as in Eq. 1. That is, the related signed pyramid includes a plurality of vectors of the vector size, and each of the plurality of vectors of the signed pyramid has integral vector elements with a sum of magnitudes of the integral vector elements being equal to the quantization parameter.
As described by Eq. 1 and Eq. 2, each element of the unsigned pyramid can be identified with an element of the signed pyramid by taking absolute values (i.e., magnitudes) of the vector elements. In this way, an element of the signed pyramid can be encoded by a first portion that identifies a corresponding element of the unsigned pyramid and a second portion that accounts for the omitted sign values of the vector elements. It should be noted that words such as first and second are used here and elsewhere for labeling purposes only and are not intended to denote any specific spatial or temporal ordering. Furthermore, the labeling of a first element does not imply the presence of a second element.
In one embodiment, the encoding operation 304 begins with a first operation in which a frequency transform is applied to the input audio signal to determine frequency-transform coefficients.
A second operation 404 includes quantizing a vector of the frequency-transform coefficients with the quantization parameter to determine elements of a signed pyramid (e.g., Eq. 1) that is defined by the vector size and the quantization parameter. That is, the frequency components may be combined (e.g., via partitioning), scaled, and mapped to integral values so that the sum of the magnitudes of the integral vector elements is equal to the quantization parameter, which is also an integer value. For example, based on a real-valued vector (e.g., the vector of scaled frequency components), the mapping may determine an optimally close or sufficiently close integer-valued vector. A third operation 406 includes determining the first portion and the second portion of the encoded value from the element of the signed pyramid, where the first portion identifies a corresponding element of the unsigned pyramid (e.g., Eq. 2) and the second portion characterizes sign values for nonzero components of the element of the signed pyramid (e.g., Eq. 1). Similar operations may be carried out for determining each encoded value in the encoding operation 304, where different pyramids may be used depending, for example, on the partitioning of the frequency spectrum and the relevant bit budgets for portions of the frequency spectrum.
In general, the decoding operation 306 is consistent with the encoding operation 304 so that the output audio signal is an approximation relative to the quantization parameter and the vector size for the input audio signal. That is, the operations related to the decoding include a reversal of the operations related to the encoding in accordance with the vector size and the quantization parameter.
As noted above, operations 310 related to data transmission are typically performed between the encoding operation 304 and the decoding operation 306.
System embodiments related to the above-described method embodiments may range in complexity depending on the requirements of the operational setting.
Initially, audio content (such as musical or vocal tracks) is created in a content creation environment 802 (e.g., corresponding to the input unit 702 of FIG. 7).
The audio signal 806 is then encoded via a transform-based encoder with energy smoothing 808 (e.g., corresponding to the encoder 704 of FIG. 7).
The encoded bitstream 810 is delivered for consumption by a listener through a delivery environment 812 (e.g., corresponding to the hardware units 710 for data transmission of FIG. 7).
The output of the delivery environment 812 is a transmitted encoded bitstream 818 that is input to a transform-based decoder with energy smoothing 820 (e.g., corresponding to the decoder 706 of FIG. 7).
The blocks of frequency transform coefficients 902 may be modified discrete cosine transform (MDCT) coefficients extracted from the audio signal 806 via MDCT processing as described above with reference to FIG. 8.
In this context, it should be noted that the partitioning of frequency bands into separate frequency segments for separate processing by the vector quantizer 908 typically has a significant impact on the overall computational expense, which depends on the number of vector quantizations required to represent the audio signal.
The vector quantizer 908 may then determine the encoded values for the processed frequency-transform coefficients 906 in accordance with the second operation 304 of FIG. 3.
Further details related to elements of the above-described embodiments are discussed below with reference to an example computer system.
The example computer system 1100 includes a processor 1102 (e.g., a central processing unit (CPU), a graphics processing unit (GPU) or both), a main memory 1104, and a static memory 1106, which communicate with each other via a bus 1108. The computer system 1100 may further include a video display unit 1110 (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)). The computer system 1100 also includes an alphanumeric input device 1112 (e.g., a keyboard), a user interface (UI) cursor control device 1114 (e.g., a mouse), a storage unit 1116 (e.g., a disk drive), a signal generation device 1118 (e.g., a speaker), and a network interface device 1120.
In some contexts, a computer-readable medium may be described as a machine-readable medium. The storage unit 1116 includes a machine-readable medium 1122 on which is stored one or more sets of data structures and instructions 1124 (e.g., software) embodying or utilizing any one or more of the methodologies or functions described herein. The instructions 1124 may also reside, completely or at least partially, within the static memory 1106, within the main memory 1104, or within the processor 1102 during execution thereof by the computer system 1100, with the static memory 1106, the main memory 1104, and the processor 1102 also constituting machine-readable media. For example, the instructions 1124 may correspond to elements of any of the above-described methods or a control system that implements any of those methods.
While the machine-readable medium 1122 is shown in an example embodiment to be a single medium, the terms “machine-readable medium” and “computer-readable medium” may each refer to a single storage medium or multiple storage media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of data structures and instructions 1124. These terms shall also be taken to include any tangible or non-transitory medium that is capable of storing, encoding or carrying instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies disclosed herein, or that is capable of storing, encoding or carrying data structures utilized by or associated with such instructions. These terms shall accordingly be taken to include, but not be limited to, solid-state memories, optical media, and magnetic media. Specific examples of machine-readable or computer-readable media include non-volatile memory, including by way of example semiconductor memory devices, e.g., erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; compact disc read-only memory (CD-ROM) and digital versatile disc read-only memory (DVD-ROM). However, the terms “machine-readable medium” and “computer-readable medium” are intended to specifically exclude non-statutory signals per se.
The instructions 1124 may further be transmitted or received over a communications network 1126 using a transmission medium. The instructions 1124 may be transmitted using the network interface device 1120 and any one of a number of well-known transfer protocols (e.g., hypertext transfer protocol (HTTP)). Examples of communication networks include a local area network (LAN), a wide area network (WAN), the Internet, mobile telephone networks, plain old telephone (POTS) networks, and wireless data networks (e.g., WiFi and WiMax networks). The term “transmission medium” shall be taken to include any intangible medium that is capable of storing, encoding or carrying instructions for execution by the machine, and includes digital or analog communications signals or other intangible media to facilitate communication of such software.
Certain embodiments are described herein as including logic or a number of components, modules, or mechanisms. Modules may constitute either software modules or hardware-implemented modules. A hardware-implemented module is a tangible unit capable of performing certain operations and may be configured or arranged in a certain manner. In example embodiments, one or more computer systems (e.g., a standalone, client or server computer system) or one or more processors may be configured by software (e.g., an application or application portion) as a hardware-implemented module that operates to perform certain operations as described herein.
In various embodiments, a hardware-implemented module (e.g., a computer-implemented module) may be implemented mechanically or electronically. For example, a hardware-implemented module may comprise dedicated circuitry or logic that is permanently configured (e.g., as a special-purpose processor, such as a field programmable gate array (FPGA) or an application-specific integrated circuit (ASIC)) to perform certain operations. A hardware-implemented module may also comprise programmable logic or circuitry (e.g., as encompassed within a general-purpose processor or other programmable processor) that is temporarily configured by software to perform certain operations. It will be appreciated that the decision to implement a hardware-implemented module mechanically, in dedicated and permanently configured circuitry, or in temporarily configured circuitry (e.g., configured by software) may be driven by cost and time considerations.
Accordingly, the term “hardware-implemented module” (e.g., a “computer-implemented module”) should be understood to encompass a tangible entity, be that an entity that is physically constructed, permanently configured (e.g., hardwired), or temporarily or transitorily configured (e.g., programmed) to operate in a certain manner and/or to perform certain operations described herein. Considering embodiments in which hardware-implemented modules are temporarily configured (e.g., programmed), each of the hardware-implemented modules need not be configured or instantiated at any one instance in time. For example, where the hardware-implemented modules comprise a general-purpose processor configured using software, the general-purpose processor may be configured as respective different hardware-implemented modules at different times. Software may accordingly configure a processor, for example, to constitute a particular hardware-implemented module at one instance of time and to constitute a different hardware-implemented module at a different instance of time.
Hardware-implemented modules can provide information to, and receive information from, other hardware-implemented modules. Accordingly, the described hardware-implemented modules may be regarded as being communicatively coupled. Where multiple of such hardware-implemented modules exist contemporaneously, communications may be achieved through signal transmission (e.g., over appropriate circuits and buses) that connect the hardware-implemented modules. In embodiments in which multiple hardware-implemented modules are configured or instantiated at different times, communications between such hardware-implemented modules may be achieved, for example, through the storage and retrieval of information in memory structures to which the multiple hardware-implemented modules have access. For example, one hardware-implemented module may perform an operation and store the output of that operation in a memory device to which it is communicatively coupled. A further hardware-implemented module may then, at a later time, access the memory device to retrieve and process the stored output. Hardware-implemented modules may also initiate communications with input or output devices and may operate on a resource (e.g., a collection of information).
The various operations of example methods described herein may be performed, at least partially, by one or more processors that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors may constitute processor-implemented modules that operate to perform one or more operations or functions. The modules referred to herein may, in some example embodiments, comprise processor-implemented modules.
Similarly, the methods described herein may be at least partially processor-implemented. For example, at least some of the operations of a method may be performed by one or more processors or processor-implemented modules. The performance of certain of the operations may be distributed among the one or more processors, not only residing within a single machine, but deployed across a number of machines. In some example embodiments, the processor or processors may be located in a single location (e.g., within a home environment, an office environment or as a server farm), while in other embodiments the processors may be distributed across a number of locations.
The one or more processors may also operate to support performance of the relevant operations in a “cloud computing” environment or as a “software as a service” (SaaS). For example, at least some of the operations may be performed by a group of computers (as examples of machines including processors), these operations being accessible via a network (e.g., the Internet) and via one or more appropriate interfaces (e.g., application program interfaces (APIs)).
Although only certain embodiments have been described in detail above, those skilled in the art will readily appreciate that many modifications are possible without materially departing from the novel teachings of this disclosure. For example, aspects of embodiments disclosed above can be combined in other combinations to form additional embodiments. Accordingly, all such modifications are intended to be included within the scope of this disclosure.
This application claims the benefit of U.S. Provisional Application No. 62/381,479, filed Aug. 30, 2016, which is incorporated herein by reference in its entirety.