1. Field
The present disclosure relates generally to communications, and more particularly, to techniques for processing audio and speech signals.
2. Introduction
In the world of communications, where bandwidth is a fundamental limitation, audio and speech processing plays an important role in multimedia applications. Audio and speech processing often involves various forms of signal compression to drastically decrease the amount of information required to represent audio and speech signals, and thereby reduce the transmission bandwidth. These processing systems are often referred to as encoders for compressing the audio and speech and decoders for decompressing the audio and speech.
Traditional audio and speech processing systems achieve significant compression ratios using complex psychoacoustic models and filters at the cost of high complexity and delay. However, in the context of body area networks, tight constraints on power and latency demand simpler, low-complexity solutions to signal compression. Compression ratios are often traded off for power and latency gains.
In one aspect of the disclosure, a method of audio or speech processing includes generating a plurality of frames, each of the frames comprising a plurality of transform coefficients, and allocating bits to the transform coefficients in each of the frames such that at least two of the transform coefficients in the same frame have different bit allocations and the total number of the bits allocated to the transform coefficients in at least two of the frames is equal.
In another aspect of the disclosure, an apparatus for audio or speech processing includes a processing system configured to generate a plurality of frames, each of the frames comprising a plurality of transform coefficients, and allocate bits to the transform coefficients in each of the frames such that at least two of the transform coefficients in the same frame have different bit allocations and the total number of the bits allocated to the transform coefficients in at least two of the frames is equal.
In yet another aspect of the disclosure, an apparatus for audio or speech processing includes means for generating a plurality of frames, each of the frames comprising a plurality of transform coefficients, and means for allocating bits to the transform coefficients in each of the frames such that at least two of the transform coefficients in the same frame have different bit allocations and the total number of the bits allocated to the transform coefficients in at least two of the frames is equal.
In a further aspect of the disclosure, a computer-program product for processing audio or speech includes a computer-readable medium encoded with codes executable by one or more processors to generate a plurality of frames, each of the frames comprising a plurality of transform coefficients, and allocate bits to the transform coefficients in each of the frames such that at least two of the transform coefficients in the same frame have different bit allocations and the total number of the bits allocated to the transform coefficients in at least two of the frames is equal.
In yet a further aspect of the disclosure, a headset includes a transducer, a processing system configured to generate a plurality of frames from audio or speech output from the transducer, each of the frames comprising a plurality of transform coefficients, and allocate bits to the transform coefficients in each of the frames such that at least two of the transform coefficients in the same frame have different bit allocations and the total number of the bits allocated to the transform coefficients in at least two of the frames is equal, and a transmitter configured to transmit the frames.
In another aspect of the disclosure, a watch includes a user interface, a processing system configured to generate a plurality of frames from audio or speech output from the user interface, each of the frames comprising a plurality of transform coefficients, and allocate bits to the transform coefficients in each of the frames such that at least two of the transform coefficients in the same frame have different bit allocations and the total number of the bits allocated to the transform coefficients in at least two of the frames is equal, and a transmitter configured to transmit the frames.
In yet another aspect of the disclosure, a sensing apparatus includes a sensor, a processing system configured to generate a plurality of frames from audio or speech output from the sensor, each of the frames comprising a plurality of transform coefficients, and allocate bits to the transform coefficients in each of the frames such that at least two of the transform coefficients in the same frame have different bit allocations and the total number of the bits allocated to the transform coefficients in at least two of the frames is equal, and a transmitter configured to transmit the frames.
Various aspects of methods and apparatus are described more fully hereinafter with reference to the accompanying drawings. These methods and apparatus may, however, be embodied in many different forms and should not be construed as limited to any specific structure or function presented in this disclosure. Rather, these aspects are provided so that this disclosure will be thorough and complete, and will fully convey the scope of these methods and apparatus to those skilled in the art. Based on the teachings herein, one skilled in the art should appreciate that the scope of the disclosure is intended to cover any aspect of the methods and apparatus disclosed herein, whether implemented independently of or combined with any other aspect of the disclosure. For example, an apparatus may be implemented or a method may be practiced using any number of the aspects set forth herein. In addition, the scope of the disclosure is intended to cover such an apparatus or method which is practiced using other structure, functionality, or structure and functionality in addition to or other than the aspects presented throughout this disclosure. It should be understood that any aspect of the disclosure herein may be embodied by one or more elements of a claim.
Several aspects of audio and speech processing will now be presented. These aspects will be presented with reference to a transmitting and receiving apparatus in a wireless communications network. The transmitting apparatus includes an encoder for compressing audio or speech for transmission over a wireless medium. The receiving apparatus includes a decoder for expanding the audio or speech received over the wireless medium from the transmitting apparatus. In many applications, the transmitting apparatus may be part of an apparatus that receives as well as transmits. Such an apparatus would therefore require a decoder, which may be a separate processing system or integrated with the encoder into a single processing system known as a “codec.” Similarly, the receiving apparatus may be part of an apparatus that transmits as well as receives. Such an apparatus would therefore require an encoder, which may be a separate processing system or integrated with the decoder into a codec. As those skilled in the art will readily appreciate, the various concepts described throughout this disclosure are applicable to any suitable encoding or decoding function, regardless of whether such function is implemented in a stand-alone processing system, integrated into a codec, or distributed across multiple entities in a wireless apparatus or a wireless communications network.
The various audio and speech processing techniques presented throughout this disclosure are well suited for integration into various wireless apparatus including a headset, a phone (e.g., cellular phone), a personal digital assistant (PDA), an entertainment device (e.g., a music or video device), a microphone, a medical sensing device (e.g., a biometric sensor, a heart rate monitor, a pedometer, an EKG device, a smart bandage, etc.), a user I/O device (e.g., a watch, a remote control, a light switch, a keyboard, a mouse, etc.), a medical monitor that may receive data from the medical sensing device, an environment sensing device (e.g., a tire pressure monitor), a computer, a point-of-sale device, an entertainment device, a hearing aid, a set-top box, or any other device that processes audio or speech signals. The wireless apparatus may include other functions in addition to the audio or speech processing. By way of example, a headset, watch, or sensor may include various audio or speech transducers (e.g., microphone and speakers) for user interaction with the apparatus.
An example of a wireless communications network that may benefit from the various concepts presented throughout this disclosure is illustrated in
The various audio and speech processing techniques presented throughout this disclosure may be used in wireless apparatus supporting any suitable radio technology or wireless protocol. By way of example, the wireless apparatus shown in
The audio or speech source 202 represents conceptually any suitable source of audio or speech. By way of example, the audio or speech source 202 may represent various applications running in the apparatus 200 that retrieve compressed audio files (e.g., MP3 files) from memory and decompress them using an appropriate file format decoding scheme. Alternatively, the audio or speech source 202 may represent a microphone and associated circuitry to process an analog speech signal from the user of the apparatus into digital samples. The audio or speech source 202 could instead represent a transceiver or modem capable of accessing audio or speech from a wired or wireless backhaul. As those skilled in the art will readily appreciate, the manner in which the audio or speech source 202 is implemented will depend on the particular design and application of the transmitting apparatus 200.
The audio or speech sink 204 represents conceptually any suitable entity capable of receiving audio or speech. By way of example, the audio or speech sink 204 may represent various applications running in the apparatus 200 that compress audio files using an appropriate file format encoding scheme (e.g., MP3 files) for storing in memory. Alternatively, the audio or speech sink 204 may represent a speaker and associated circuitry to provide audio or speech to the user of the apparatus 200. The audio or speech sink 204 could instead represent a transceiver or modem capable of transmitting audio or speech over a wired or wireless backhaul. As those skilled in the art will readily appreciate, the manner in which the audio or speech sink 204 is implemented will depend on the particular design and application of the transmitting apparatus 200.
The audio or speech processing system 206 may implement a compression algorithm to encode and decode audio and speech. The compression algorithm may use transforms to convert between sampled audio and speech and a transform domain, typically the frequency domain. In the transform domain, the component frequencies are allocated bits according to their audibility. In this example, the processing system 206 may take advantage of the frame-by-frame processing involved in any transform domain approach to ensure optimal bit allocation for each frame. Although the bit allocations are specialized to each frame, the processing system 206 may be configured to ensure a constant bit rate across frames. This approach enables an optimal bit allocation strategy over the entire signal of interest which, in turn, ensures an optimal compression ratio for a given quality requirement, and optimal quality for a given compression ratio.
The transceiver 208 may be used to perform various physical (PHY) and Medium Access Control (MAC) layer functions in connection with the transmission of audio or speech across a wireless medium. The PHY layer functions may include several signal processing functions such as forward error correction (e.g., Turbo coding/decoding), digital modulation/demodulation (e.g., FSK, PSK, QAM, etc.), and analog modulation/demodulation of an RF carrier. The MAC layer functions may include managing the audio or speech content that is sent across the PHY layer so that several apparatus can share access to the wireless medium.
The transmitting apparatus 300 is shown with an audio or speech source 302, an audio or speech processing system 304, and a transmitter 306. The receiving apparatus 310 is shown with a receiver 312, an audio or speech processing system 314, and an audio or speech sink 316. The audio or speech source 302 and transmitter 306 in the transmitting apparatus 300 and the receiver 312 and the audio or speech sink 316 in the receiving apparatus 310 function in the same way as described earlier in connection with
The audio or speech processing system 304 in the transmitting apparatus 300 includes a transform 322. The transform 322 may be a Discrete Cosine Transform (DCT) that converts audio or speech from the source 302 into a series of transform coefficients in the frequency domain. The output of the transform 322 is processed in sets of coefficients called frames. Each frame consists of N transform coefficients. The N transform coefficients in each frame are logarithmically compressed by a log compressor 324 before being input to a quantizer 326. The quantizer 326 quantizes the logarithmically compressed N transform coefficients before being provided to the transmitter 306 and modulated onto an RF carrier for transmission over a wireless medium 308.
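The transform, log compressor, and quantizer chain described above might be sketched as follows. This is a minimal illustration, not the disclosed implementation: the disclosure does not name a specific compression law or quantizer, so the mu-law style curve, the uniform quantizer, the assumption that coefficients lie in [-1, 1], and all function names (`dct_ii`, `log_compress`, `quantize`, `encode_frame`) are assumptions made here for illustration.

```python
import math

def dct_ii(samples):
    """Orthonormal DCT-II of one frame of N time-domain samples."""
    n = len(samples)
    coeffs = []
    for k in range(n):
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(samples))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * total)
    return coeffs

def log_compress(c, mu=255.0):
    """Sign-preserving logarithmic compression (mu-law style, an assumption).
    Magnitudes above 1 are clipped to 1 before compression."""
    a = min(abs(c), 1.0)
    return math.copysign(math.log1p(mu * a) / math.log1p(mu), c)

def quantize(value, bits):
    """Uniform quantizer on [-1, 1]; returns an integer code in [0, 2**bits - 1].
    A coefficient allocated zero bits always maps to code 0."""
    if bits == 0:
        return 0
    levels = 2 ** bits
    code = int((value + 1.0) / 2.0 * levels)
    return min(max(code, 0), levels - 1)

def encode_frame(samples, bit_vector):
    """Transform a frame, compress each coefficient, and quantize it with
    the number of bits given for that coefficient in bit_vector."""
    coeffs = dct_ii(samples)
    return [quantize(log_compress(c), b) for c, b in zip(coeffs, bit_vector)]
```

Calling `encode_frame([0.5] * 4, [4, 2, 1, 1])` would return four integer codes, one per coefficient, each fitting within the number of bits assigned to it.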
A bit allocator 328 is configured to control the level of quantization applied by the quantizer 326 to the logarithmically compressed N transform coefficients. In at least one configuration of the processing system 304, the bit allocator 328 is configured to distribute a fixed number of bits B across the logarithmically compressed N coefficients for each frame. This may be achieved by computing a metric M′ based on at least one metric Mi (i=1, 2, . . . , N) that is correlated to the energy of each coefficient in a frame. By way of example, Mi can simply be the square of the coefficient's amplitude. M′ can also be computed over more than one frame and be the variance of each transform bin. A theoretically optimal bit allocation vector v of length N is computed by distributing the B bits in proportion to M′. This ideal vector v is then mapped to the one of the K available vectors in a dictionary V 330 of size (K×N) that is “closest” to v. The K available vectors may be represented by dk.
The dictionary 330 contains a set of vectors, dk, each of which is N elements long. Each element in a vector dk represents a possible bit-allocation for a corresponding coefficient in a frame. The sum of elements of each vector dk in the dictionary 330 is equal to B. This ensures a constant bit rate across frames and across a collection of frames (e.g., MAC packets). For each frame, once a vector dk is selected by the bit allocator 328, it may be provided to the quantizer 326 to quantize the logarithmically compressed N transform coefficients of the said frame.
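The allocation and dictionary lookup described above can be illustrated with a short sketch. It assumes the metric is the squared coefficient amplitude and interprets “closest” as smallest squared Euclidean distance; the disclosure does not fix a distance measure, so that choice, the function names, and the example dictionary values are all hypothetical.

```python
def ideal_allocation(coeffs, total_bits):
    """Distribute total_bits in proportion to each coefficient's energy,
    using the squared amplitude as the metric. Returns real-valued shares."""
    energies = [c * c for c in coeffs]
    total = sum(energies)
    if total == 0.0:
        # Silent frame: fall back to an even split (an assumption).
        return [total_bits / len(coeffs)] * len(coeffs)
    return [total_bits * e / total for e in energies]

def select_vector(ideal, dictionary):
    """Return (index, vector) for the dictionary entry nearest to the ideal
    vector. Squared Euclidean distance is an assumed closeness measure."""
    def distance(candidate):
        return sum((a - b) ** 2 for a, b in zip(ideal, candidate))
    index = min(range(len(dictionary)), key=lambda k: distance(dictionary[k]))
    return index, dictionary[index]

# Hypothetical dictionary with K = 4 vectors, N = 4 coefficients, B = 8 bits.
B = 8
DICTIONARY = [
    [2, 2, 2, 2],
    [4, 2, 1, 1],
    [1, 1, 2, 4],
    [3, 3, 1, 1],
]

# Every entry sums to B, so every frame costs the same number of bits.
assert all(sum(d) == B for d in DICTIONARY)

v = ideal_allocation([2.0, 1.0, 0.5, 0.5], B)
index, chosen = select_vector(v, DICTIONARY)
```

Because every dictionary entry sums to B, the choice of vector changes how bits are spread inside a frame but never changes the frame's total size.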
For a dictionary V comprising K vectors, ceiling(log2(K)) bits are required to index the elements of the dictionary. Once a vector dk is selected by the bit allocator 328 for a frame, a corresponding index identifying the selected vector dk may be transmitted along with the frame to the receiving apparatus 310 for decoding the frame. The index may be sent via out-of-band signaling, side channel, interleaved within the frame, or by some other suitable means. The number of vectors in the dictionary 330 may generally be a function of the bandwidth limitations for sending the index over the wireless medium 308.
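As a small worked illustration of the index cost, the following sketch computes ceiling(log2(K)) with integer arithmetic. The function name `index_bits` is a hypothetical helper, not something named in the disclosure.

```python
def index_bits(num_vectors):
    """Bits needed to index a dictionary of num_vectors entries,
    i.e. ceiling(log2(num_vectors)), computed without floating point."""
    if num_vectors < 1:
        raise ValueError("dictionary must contain at least one vector")
    return (num_vectors - 1).bit_length()

# A 16-entry dictionary costs 4 index bits per frame, while a
# single-entry dictionary costs none.
sixteen = index_bits(16)
single = index_bits(1)
```

Under this arithmetic, doubling the dictionary size adds one bit of side information per transmitted index.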
Various methods may be used to create the dictionary 330. By way of example, a statistical metric, Si, may be computed for each bin across multiple frames of a training database. The statistical metric Si can then be used in techniques like k-means clustering to create the elements of the dictionary. Each vector in the dictionary may be constructed to ensure that the sum of its elements equals B. Additionally, each vector may be constrained to comprise positive whole numbers.
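One step of dictionary construction, turning a real-valued profile (for example, a cluster centroid) into a vector of positive whole numbers that sums exactly to B, might look like the sketch below. The largest-remainder rounding used here is an assumption chosen for illustration; the disclosure does not specify how the constraints are enforced, and `to_bit_vector` is a hypothetical name. The clustering step itself is not shown.

```python
def to_bit_vector(profile, total_bits):
    """Convert a non-negative per-bin profile into integer bit allocations
    that are each at least 1 and sum exactly to total_bits."""
    n = len(profile)
    if total_bits < n:
        raise ValueError("need at least one bit per coefficient")
    spare = total_bits - n          # bits left after one bit per coefficient
    weight = sum(profile)
    if weight == 0:
        shares = [spare / n] * n
    else:
        shares = [spare * p / weight for p in profile]
    bits = [1 + int(s) for s in shares]
    remaining = total_bits - sum(bits)
    # Hand the leftover bits to the bins with the largest fractional parts.
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]),
                   reverse=True)
    for i in order[:remaining]:
        bits[i] += 1
    return bits

# Example: an energy-like profile rounded to an 8-bit allocation.
entry = to_bit_vector([4.0, 1.0, 0.25, 0.25], 8)
```

Applying this rounding to each centroid would yield entries that satisfy both constraints stated above, at the cost of a small departure from the real-valued centroid.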
At the receiving apparatus 310, each frame and its corresponding index are recovered from the RF carrier by the receiver 312 and provided to the audio or speech processing system 314. The processing system 314 includes an inverse quantizer 332 which uses the index to expand the coefficients in the frame. The frame of expanded coefficients may then be provided to a log expander 334, which performs an inverse log function, before being provided to an inverse transform 336 to convert the coefficients in the frame back to digital samples in the time domain. The time domain samples may be provided to the audio or speech sink 316 for further processing.
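The receive path described above can be sketched as the reverse chain: inverse quantization, log expansion, and inverse transform. As before, this is only an illustration under stated assumptions: a uniform quantizer on [-1, 1] reconstructed at cell midpoints, a mu-law style expansion curve, an orthonormal inverse DCT, and hypothetical function names. In practice, `bit_vector` would be the dictionary entry looked up from the received index.

```python
import math

def dequantize(code, bits):
    """Map an integer code back to the midpoint of its quantizer cell on
    [-1, 1]. A coefficient that was allocated zero bits decodes to 0."""
    if bits == 0:
        return 0.0
    levels = 2 ** bits
    return (code + 0.5) / levels * 2.0 - 1.0

def log_expand(y, mu=255.0):
    """Inverse of a mu-law style compression curve (an assumed law)."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def idct_ii(coeffs):
    """Inverse of the orthonormal DCT-II: coefficients to time samples."""
    n = len(coeffs)
    samples = []
    for i in range(n):
        total = math.sqrt(1.0 / n) * coeffs[0]
        total += sum(math.sqrt(2.0 / n) * coeffs[k]
                     * math.cos(math.pi * (i + 0.5) * k / n)
                     for k in range(1, n))
        samples.append(total)
    return samples

def decode_frame(codes, bit_vector):
    """Expand quantizer codes with the selected bit-allocation vector and
    return the reconstructed time-domain samples for one frame."""
    expanded = [log_expand(dequantize(c, b))
                for c, b in zip(codes, bit_vector)]
    return idct_ii(expanded)

# Example: one frame of four 8-bit codes.
samples = decode_frame([255, 128, 128, 128], [8, 8, 8, 8])
```

The decoder needs only the codes and the bit-allocation vector, which is why transmitting a short dictionary index alongside each frame is sufficient.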
The audio and speech processing techniques could be extended to processing multiple frames at a time using their joint-statistics to decide on the ideal bit-allocation vector for that set of frames. This would reduce the amount of information required to be sent over the wireless medium by using the same bit allocation vector across multiple consecutive frames. This would be suitable for signals like speech or audio where there is considerable correlation between frames.
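A minimal sketch of this multi-frame extension follows. It assumes the joint statistic is the per-bin energy summed over the group of frames, which is one possible choice rather than the one the disclosure prescribes; `joint_allocation` and `side_info_bits` are hypothetical names.

```python
def joint_allocation(frames, total_bits):
    """Ideal bit allocation for a group of frames, in proportion to the
    per-bin energy summed across the group (an assumed joint statistic)."""
    n = len(frames[0])
    energy = [sum(frame[i] ** 2 for frame in frames) for i in range(n)]
    total = sum(energy)
    if total == 0:
        return [total_bits / n] * n
    return [total_bits * e / total for e in energy]

def side_info_bits(num_frames, group_size, bits_per_index):
    """Total index bits sent when a single index is shared by each group
    of group_size consecutive frames."""
    groups = -(-num_frames // group_size)   # ceiling division
    return groups * bits_per_index

# Two frames with energy concentrated in opposite bins share an even split.
shared = joint_allocation([[2.0, 0.0], [0.0, 2.0]], 4)

# Sharing one 3-bit index across groups of 4 frames, over 100 frames.
grouped_cost = side_info_bits(100, 4, 3)
per_frame_cost = side_info_bits(100, 1, 3)
```

In this example the side information drops from 300 bits to 75 bits over 100 frames, at the cost of an allocation that is no longer tailored to each individual frame.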
In cases where a single bit allocation vector is required due to architectural and/or capacity constraints, the audio or speech processing system may be specialized to a one-element dictionary that does not require any additional information to be transmitted with the frames across the wireless medium.
The various concepts presented throughout this disclosure provide a method for specializing compression factors to the frame level. This approach essentially maintains a constant bit rate while at the same time ensuring that each speech or audio frame is optimally compressed. This approach also eliminates the need for a variable bit rate pipe for transport, which is generally associated with dynamic bit allocation schemes and makes the design of the MAC/PHY more complex.
In addition, these concepts are agnostic to the signal structure and do not require any psycho-acoustic or a-priori knowledge of the signal's structure in either the temporal or transform domain. Bit allocation decisions are optimally made using the energy of individual components in each frame.
The “audio or speech processing system” shall be construed broadly to mean any apparatus, component, device, circuit, block, unit, module, element, or any other entity, whether implemented as hardware, software, or a combination of both, that performs the various functions presented throughout this disclosure. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application.
The processing system may be implemented with one or more processors. The one or more processors, or any of them, may be dedicated hardware or a hardware platform for executing software on a computer-readable medium. Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executables, threads of execution, procedures, functions, etc., whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise. The one or more processors may include, by way of example, any combination of microprocessors, microcontrollers, digital signal processors (DSPs), field programmable gate arrays (FPGAs), programmable logic devices (PLDs), state machines, gated logic, discrete hardware circuits, and other suitable processors configured to perform the various functionalities described throughout this disclosure. The computer-readable medium may include, by way of example, a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk (e.g., compact disk (CD), digital versatile disk (DVD)), a smart card, a flash memory device (e.g., card, stick, key drive), random access memory (RAM), read only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, a removable disk, a carrier wave, a transmission line, or any other suitable medium for storing or transmitting software. The computer-readable medium may be resident in the processing system, external to the processing system, or distributed across multiple entities including the processing system. The computer-readable medium may be embodied in a computer-program product. By way of example, a computer-program product may include a computer-readable medium in packaging materials.
The computer-readable medium may also be used to implement the dictionary.
The processing system, or any part of the processing system, may provide the means for performing the functions recited herein. Turning to
It is understood that the specific order or hierarchy of steps in the processes disclosed is an illustration of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged. The accompanying method claims present elements of the various steps in a sample order, and are not meant to be limited to the specific order or hierarchy presented.
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language of the claims, wherein reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more.” Unless specifically stated otherwise, the term “some” refers to one or more. Pronouns in the masculine (e.g., his) include the feminine and neuter gender (e.g., her and its) and vice versa. All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims. No claim element is to be construed under the provisions of 35 U.S.C. §112, sixth paragraph, unless the element is expressly recited using the phrase “means for” or, in the case of a method claim, the element is recited using the phrase “step for.”
The present application for patent claims priority to Provisional Application No. 61/289,287 entitled “AUDIO AND SPEECH PROCESSING WITH OPTIMAL BIT-ALLOCATION FOR CONSTANT BIT RATE APPLICATION” filed Dec. 22, 2009, and assigned to the assignee hereof and hereby expressly incorporated by reference herein.
Number | Date | Country | |
---|---|---|---|
20110153315 A1 | Jun 2011 | US |
Number | Date | Country | |
---|---|---|---|
61289287 | Dec 2009 | US |