Scalable coding method for high quality audio

Information

  • Patent Number
    6,446,037
  • Date Filed
    Monday, August 9, 1999
  • Date Issued
    Tuesday, September 3, 2002
Abstract
Scalable coding of audio into a core layer, in response to a desired noise spectrum established according to psychoacoustic principles, supports coding augmentation data into augmentation layers in response to various criteria, including offsets of such desired noise spectrum. Compatible decoding provides a plurality of decoded resolutions from a single signal. Coding is preferably performed on subband signals generated according to spectral transform, quadrature mirror filtering, or other conventional processing of audio input. A scalable data structure for audio transmission includes core and augmentation layers, the former for carrying a first coding of an audio signal that places post-decode noise beneath a desired noise spectrum, the latter for carrying offset data regarding the desired noise spectrum and data about coding of the audio signal that places post-decode noise beneath the desired noise spectrum shifted by the offset data.
Description




TECHNICAL FIELD




The present invention relates to audio coding and decoding and relates more particularly to scalable coding of audio data into a plurality of layers of a standard data channel and scalable decoding of audio data from a standard data channel.




BACKGROUND ART




Due in part to the widespread commercial success of compact disc (CD) technologies over the last two decades, sixteen bit pulse code modulation (PCM) has become an industry standard for distribution and playback of recorded audio. Over much of this time period, the audio industry touted the compact disc as providing sound quality superior to that of vinyl records and cassette tapes, and many people believed that little audible benefit would be obtained by increasing the resolution of audio beyond that obtainable from sixteen bit PCM.




Over the last several years, this belief has been challenged for various reasons. The dynamic range of sixteen bit PCM is too limited for noise-free reproduction of all musical sounds, and subtle detail is lost when audio is quantized to sixteen bit PCM. Moreover, the belief may fail to account for the practice of reducing quantization resolution to provide additional headroom, at the cost of a lower signal-to-noise ratio and reduced signal resolution. Due to such concerns, there currently is strong commercial demand for audio processes that provide improved signal resolution relative to sixteen bit PCM.




There currently is also strong commercial demand for multi-channel audio. Multi-channel audio provides multiple channels of audio which can improve spatialization of reproduced sound relative to traditional mono and stereo techniques. Common systems provide for separate left and right channels both in front of and behind a listening field, and may also provide for a center channel and subwoofer channel. Recent modifications have provided numerous audio channels surrounding a listening field for reproducing or synthesizing spatial separation of different types of audio data.




Perceptual coding is a class of techniques for improving the perceived resolution of an audio signal relative to PCM signals of comparable bit rate. Perceptual coding can reduce the bit rate of an encoded signal while preserving the subjective quality of the audio recovered from the encoded signal by removing information that is deemed irrelevant to the preservation of that subjective quality. This can be done by splitting an audio signal into frequency subband signals and quantizing each subband signal at a quantization resolution that introduces a level of quantization noise low enough to be masked by the decoded signal itself. Within the constraints of a given bit rate, an increase in perceived signal resolution relative to a first PCM signal of given resolution can be achieved by perceptually coding a second PCM signal of higher resolution to reduce the bit rate of the encoded signal to essentially that of the first PCM signal. The coded version of the second PCM signal may then be used in place of the first PCM signal and decoded at the time of playback.
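The masking-based quantization described above can be sketched numerically. This sketch assumes a common rule of thumb that each bit of uniform-quantizer resolution lowers quantization noise by about 6.02 dB; the function names and the example threshold values are illustrative, not taken from the patent:

```python
import numpy as np

def bits_for_mask(signal_db, mask_db):
    """Bits needed so quantization noise, roughly 6.02 dB below the
    signal per bit of resolution, stays beneath the masking threshold."""
    snr_needed = max(0.0, signal_db - mask_db)
    return int(np.ceil(snr_needed / 6.02))

def quantize(subband, bits):
    """Uniform quantizer over [-1.0, 1.0) at the chosen resolution;
    returns integer indices and the quantizer step size."""
    step = 2.0 / (2 ** bits)
    return np.round(subband / step).astype(int), step

# A subband at -10 dBFS whose masking threshold sits at -40 dBFS
# needs enough bits to push the noise 30 dB down: about 5 bits.
bits = bits_for_mask(-10.0, -40.0)
indices, step = quantize(np.array([0.25, -0.5]), bits)
```

Noise that lands below the mask is inaudible by construction of the psychoacoustic model, which is what lets the coder spend fewer bits than straight PCM would require.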




One example of perceptual coding is embodied in devices that conform to the public AC-3 bitstream specification set forth in Advanced Television Systems Committee (ATSC) document A/52 (1994). This perceptual coding technique, as well as other perceptual coding techniques, is embodied in various versions of Dolby Digital® coders and decoders, which are commercially available from Dolby Laboratories, Inc. of San Francisco, California. Another example of a perceptual coding technique is embodied in devices that conform to the MPEG-1 audio coding standard, ISO/IEC 11172-3 (1993).




One disadvantage of conventional perceptual coding techniques is that the bit rate of the perceptually coded signal for a given level of subjective quality may exceed the available data capacity of communication channels and storage media. For example, the perceptual coding of a twenty-four bit PCM audio signal may yield a perceptually coded signal that requires more data capacity than is provided by a sixteen bit wide data channel. Attempts to reduce the bit rate of the encoded signal to a lower level may degrade the subjective quality of audio that can be recovered from the encoded signal. Another disadvantage of conventional perceptual coding techniques is that they do not support the decoding of a single perceptually coded signal to recover an audio signal at more than one level of subjective quality.




Scalable coding is one technique that can provide a range of decoding quality. Scalable coding uses the data in one or more lower resolution codings together with augmentation data to supply a higher resolution coding of an audio signal. Lower resolution codings and the augmentation data may be supplied in a plurality of layers. There is also strong need for scalable perceptual coding, and particularly, for scalable perceptual coding that is backward compatible at the decoding stage with commercially available sixteen bit digital signal transport or storage means.




DISCLOSURE OF INVENTION




Scalable audio coding is disclosed that supports coding of audio data into a core layer of a data channel in response to a first desired noise spectrum. The first desired noise spectrum preferably is established according to psychoacoustic and data capacity criteria. Augmentation data may be coded into one or more augmentation layers of the data channel in response to additional desired noise spectra. Alternative criteria such as conventional uniform quantization may be utilized for coding augmentation data.




Systems and methods for decoding just a core layer of a data channel are disclosed. Systems and methods for decoding both a core layer and one or more augmentation layers of a data channel are also disclosed, and these provide improved audio quality relative to that obtained by decoding just the core layer.




Some embodiments of the present invention are applied to subband signals. As is understood in the art, subband signals may be generated in numerous ways including the application of digital filters such as the quadrature mirror filter, and by a wide variety of time-domain to frequency-domain transforms and wavelet transforms.




Data channels employed by the present invention preferably have a sixteen bit wide core layer and two four bit wide augmentation layers conforming to standard AES3, which is published by the Audio Engineering Society (AES). This standard is also known as standard ANSI S4.40 of the American National Standards Institute (ANSI). Such a data channel is referred to herein as a standard AES3 data channel.




Scalable audio coding and decoding according to various aspects of the present invention can be implemented by discrete logic components, one or more ASICs, program-controlled processors, and other commercially available components. The manner in which these components are implemented is not important to the present invention. Preferred embodiments use program-controlled processors, such as those in the DSP563xx line of digital signal processors from Motorola. Programs for such implementations may include instructions conveyed by machine readable media, such as baseband or modulated communication paths and storage media. Communication paths preferably are in the spectrum from supersonic to ultraviolet frequencies. Essentially any magnetic or optical recording technology may be used as storage media, including magnetic tape, magnetic disk, and optical disc.




According to various aspects of the present invention, audio information coded according to the present invention can be conveyed by such machine readable media to routers, decoders, and other processors, and may be stored by such machine readable media for routing, decoding, or other processing at later times. In preferred embodiments, audio information is coded according to the present invention, and stored on machine readable media, such as compact disc. Such data preferably is formatted in accordance with various frame and/or other disclosed data structures. A decoder can then read the stored information at later times for decoding and playback. Such decoder need not include encoding functionality.




Scalable coding processes according to one aspect of the present invention utilize a data channel having a core layer and one or more augmentation layers. A plurality of subband signals are received. A respective first quantization resolution for each subband signal is determined in response to a first desired noise spectrum, and each subband signal is quantized according to the respective first quantization resolution to generate a first coded signal. A respective second quantization resolution is determined for each subband signal in response to a second desired noise spectrum, and each subband signal is quantized according to the respective second quantization resolution to generate a second coded signal. A residue signal is generated that indicates a residue between the first and second coded signals. The first coded signal is output in the core layer, and the residue signal is output in the augmentation layer.
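The core/residue relationship of the process above can be illustrated with two uniform quantizers whose step sizes differ by the augmentation-layer bit count. This is a minimal sketch under assumed fixed bit counts; in the patent the two resolutions are instead derived from first and second desired noise spectra:

```python
import numpy as np

def scalable_code(subband, core_bits, aug_bits):
    """Quantize at a coarse (core) and a fine (core + augmentation)
    resolution; the residue is what the fine coding adds beyond the core."""
    core_step = 2.0 / (2 ** core_bits)
    fine_step = 2.0 / (2 ** (core_bits + aug_bits))
    core = np.round(subband / core_step).astype(int)
    fine = np.round(subband / fine_step).astype(int)
    residue = fine - core * (2 ** aug_bits)
    return core, residue

def scalable_decode(core, residue, core_bits, aug_bits):
    """Combine the core-layer codes with the augmentation-layer residue
    to reconstruct the subband at the finer resolution."""
    fine_step = 2.0 / (2 ** (core_bits + aug_bits))
    return (core * (2 ** aug_bits) + residue) * fine_step
```

A decoder that ignores the residue simply reconstructs `core * core_step`, which is the backward-compatible, lower-resolution decoding; adding the residue recovers the fine-resolution signal.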




According to another aspect of the present invention, a process of coding an audio signal uses a standard data channel that has a plurality of layers. A plurality of subband signals are received. A perceptual coding and second coding of the subband signals are generated. A residue signal that indicates a residue of the second coding relative to the perceptual coding is generated. The perceptual coding is output in a first layer of the data channel, and the residue signal is output in a second layer of the data channel.




According to another aspect of the present invention, a processing system for a standard data channel includes a memory unit and a program-controlled processor. The memory unit stores a program of instructions for coding audio information according to the present invention. The program-controlled processor is coupled to the memory unit for receiving the program of instructions, and is further coupled to receive a plurality of subband signals for processing. Responsive to the program of instructions, the program controlled processor processes the subband signals in accordance with the present invention. In one embodiment, this comprises outputting a first coded or perceptually coded signal in one layer of the data channel, and outputting a residue signal in another layer of the data channel, for example, in accordance with the scalable coding process disclosed above.




According to another aspect of the present invention, a method of processing data uses a multi-layer data channel having a first layer that carries a perceptual coding of an audio signal and having a second layer that carries augmentation data for increasing the resolution of the perceptual coding of the audio signal. According to the method, the perceptual coding of the audio signal and the augmentation data are received via the data channel. The perceptual coding is routed to a decoder or other processor for further processing. This may include decoding of the perceptual coding, without further consideration of the augmentation data, to yield a first decoded signal. Alternatively, the augmentation data can be routed to the decoder or other processor, and therein combined with the perceptual coding to generate a second coded signal, which is decoded to yield a second decoded signal having higher resolution than the first decoded signal.




According to another aspect of the present invention, a processing system for processing data on a multi-layer data channel is disclosed. The multi-layer data channel has a first layer that carries a perceptual coding of an audio signal and a second layer that carries augmentation data for increasing the resolution of the perceptual coding of the audio signal. The processing system includes signal routing circuitry, a memory unit, and a program-controlled processor. The signal routing circuitry receives the perceptual coding and augmentation data via the data channel, and routes the perceptual coding and optionally the augmentation data to the program-controlled processor. The memory unit stores a program of instructions for processing audio information according to the present invention. The program-controlled processor is coupled to the signal routing circuitry for receiving the perceptual coding, and is coupled to the memory unit for receiving the program of instructions. Responsive to the program of instructions, the program controlled processor processes the perceptual coding and optionally the augmentation data according to the present invention. In one embodiment, this comprises routing and decoding of one or more layers of information as disclosed above.




According to another aspect of the present invention, a machine readable medium carries a program of instructions executable by a machine to perform a coding process according to the present invention. According to another aspect of the present invention, a machine readable medium carries a program of instructions executable by a machine to perform a method of routing and/or decoding data carried by a multi-layer data channel in accordance with the present invention. Examples of such coding, routing, and decoding are disclosed above and in the detailed description below. According to another aspect of the present invention, a machine readable medium carries coded audio information coded according to the present invention, such as any information processed in accordance with a disclosed process or method.




According to another aspect of the present invention, coding and decoding processes of the present invention may be implemented in a variety of manners. For example, a program of instructions executable by a machine, such as a programmable digital signal processor or computer processor, to perform such a process can be conveyed by a medium readable by the machine, and the machine can read the medium to obtain the program and responsive thereto perform such process. The machine may be dedicated to performing only a portion of such processes, for example, by only conveying corresponding program material via such medium.




The various features of the present invention and its preferred embodiments may be better understood by referring to the following discussion and the accompanying drawings in which like reference numerals refer to like elements in the several figures. The contents of the following discussion and the drawings are set forth as examples only and should not be understood to represent limitations upon the scope of the present invention.











BRIEF DESCRIPTION OF DRAWINGS





FIG. 1A

is a schematic block diagram of a processing system for coding and/or decoding audio signals that includes a dedicated digital signal processor.





FIG. 1B

is a schematic block diagram of a computer-implemented system for coding and/or decoding audio signals.





FIG. 2A

is a flowchart of a process for coding an audio channel according to psychoacoustic principles and a data capacity criterion.





FIG. 2B

is a schematic diagram of a data channel that comprises a sequence of frames, each frame comprising a sequence of words, each word being sixteen bits wide.





FIG. 3A

is a schematic diagram of a scalable data channel that includes a plurality of layers that are organized as frames, segments, and portions.





FIG. 3B

is a schematic diagram of a frame for a scalable data channel.





FIG. 4A

is a flowchart of a scalable coding process.





FIG. 4B

is a flowchart of a process for determining appropriate quantization resolutions for the scalable coding process illustrated in FIG. 4A.





FIG. 5

is a flowchart illustrating a scalable decoding process.





FIG. 6A

is a schematic diagram of a frame for a scalable data channel.





FIG. 6B

is a schematic diagram of preferred structure for the audio segment and audio extension segments illustrated in FIG. 6A.





FIG. 6C

is a schematic diagram of preferred structure for the metadata segment illustrated in FIG. 6A.





FIG. 6D

is a schematic diagram of preferred structure for the metadata extension segment illustrated in FIG. 6A.











MODES FOR CARRYING OUT THE INVENTION




The present invention relates to scalable coding of audio signals. Scalable coding uses a data channel that has a plurality of layers. These include a core layer for carrying data that represents an audio signal according to a first resolution and one or more augmentation layers for carrying data that in combination with the data carried in the core layer represents the audio signal according to a higher resolution. The present invention may be applied to audio subband signals. Each subband signal typically represents a frequency band of audio spectrum. These frequency bands may overlap one another. Each subband signal typically comprises one or more subband signal elements.




Subband signals may be generated by various techniques. One technique is to apply a spectral transform to audio data to generate subband signal elements in a spectral domain. One or more adjacent subband signal elements may be assembled into groups to define the subband signals. The number and identity of subband signal elements forming a given subband signal can be predetermined or alternatively can be based on characteristics of the audio data being encoded. Examples of suitable spectral transforms include the Discrete Fourier Transform (DFT) and various Discrete Cosine Transforms (DCT), including a particular Modified Discrete Cosine Transform (MDCT) sometimes referred to as a Time-Domain Aliasing Cancellation (TDAC) transform, which is described in Princen, Johnson and Bradley, “Subband/Transform Coding Using Filter Bank Designs Based on Time Domain Aliasing Cancellation,” Proc. Int. Conf. Acoust., Speech, and Signal Proc., May 1987, pp. 2161-2164. Another technique for generating subband signals is to apply a cascaded set of quadrature mirror filters (QMF) or some other bandpass filter to audio data to generate subband signals. Although the choice of implementation may have a profound effect on the performance of a coding system, no particular implementation is important in concept to the present invention.




The term “subband” is used herein to refer to a portion of the bandwidth of an audio signal. The term “subband signal” is used herein to refer to a signal that represents a subband. The term “subband signal element” is used herein to refer to elements or components of a subband signal. In implementations that use a spectral transform, for example, subband signal elements are the transform coefficients. For simplicity, the generation of subband signals is referred to herein as subband filtering, regardless of whether such signal generation is accomplished by the application of a spectral transform or another type of filter. The filter itself is referred to herein as a filter bank or, more particularly, an analysis filter bank. In conventional manner, a synthesis filter bank refers to an inverse or substantial inverse of an analysis filter bank.
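As a toy illustration of the analysis/synthesis terminology, a block DFT can stand in for the filter bank, with each complex bin serving as a subband signal element. Practical coders use MDCT or QMF banks with overlapping windows; this perfect-reconstruction sketch only shows the roles the two banks play:

```python
import numpy as np

def analysis(block):
    """Toy analysis filter bank: a block DFT of real audio samples;
    each complex bin is a subband signal element."""
    return np.fft.rfft(block)

def synthesis(coeffs, n):
    """Toy synthesis filter bank: the inverse transform, i.e. the
    (substantial) inverse of the analysis bank."""
    return np.fft.irfft(coeffs, n)

x = np.sin(2 * np.pi * np.arange(64) / 16)   # one block of a pure tone
assert np.allclose(synthesis(analysis(x), 64), x)
```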




Error correction information may be supplied for detecting one or more errors in data processed in accordance with the present invention. Errors may arise, for example, during transmission or buffering of such data, and it is often beneficial to detect such errors and correct the data appropriately prior to playback of the data. The term error correction refers to essentially any error detection and/or correction scheme such as parity bits, cyclic redundancy codes, checksums and Reed-Solomon codes.
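For example, a CRC-32 check word appended to each frame lets a decoder detect (though not correct) corruption. This sketch uses Python's standard `binascii.crc32`; the 4-byte big-endian trailer layout is an illustrative assumption, not a format from the patent:

```python
import binascii

def frame_with_crc(payload: bytes) -> bytes:
    """Append a CRC-32 check word so corruption can be detected."""
    return payload + binascii.crc32(payload).to_bytes(4, "big")

def check_frame(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare with the trailer."""
    payload, crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    return binascii.crc32(payload) == crc

frame = frame_with_crc(b"quantized subband words")
assert check_frame(frame)
assert not check_frame(b"\x00" + frame[1:])   # corrupted frame is detected
```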




Referring now to FIG. 1A, there is shown a schematic block diagram of an embodiment of a processing system 100 for encoding and decoding audio data according to the present invention. Processing system 100 comprises program-controlled processor 110, read only memory 120, random access memory 130, and audio input/output interface 140, interconnected in conventional manner by bus 116. The program-controlled processor 110 is a model DSP563xx digital signal processor that is commercially available from Motorola. The read only memory 120 and random access memory 130 are of conventional design. The read only memory 120 stores a program of instructions which allows the program-controlled processor 110 to perform analysis and synthesis filtration and to process audio signals as described with respect to FIGS. 2A through 7D. The program remains intact in the read only memory 120 while the processing system 100 is in a powered-down state. The read only memory 120 may alternatively be replaced by virtually any magnetic or optical recording technology, such as those using a magnetic tape, a magnetic disk, or an optical disc, according to the present invention. The random access memory 130 buffers instructions and data, including received and processed signals, for the program-controlled processor 110 in conventional manner. The audio input/output interface 140 includes signal routing circuitry for routing one or more layers of received signals to other components, such as the program-controlled processor 110. The signal routing circuitry may include separate terminals for input and output signals, or alternatively, may use the same terminal for both input and output. Processing system 100 may alternatively be dedicated to encoding by omitting the synthesis and decoding instructions, or alternatively dedicated to decoding by omitting the analysis and encoding instructions. Processing system 100 is a representation of typical processing operations beneficial for implementing the present invention, and is not intended to portray a particular hardware implementation thereof.




To perform encoding, the program-controlled processor 110 accesses a program of coding instructions from the read only memory 120. An audio signal is supplied to the processing system 100 at audio input/output interface 140, and routed to the program-controlled processor 110 to be encoded. Responsive to the program of coding instructions, the audio signal is filtered by an analysis filter bank to generate subband signals, and the subband signals are coded to generate a coded signal. The coded signal is supplied to other devices through the audio input/output interface 140, or alternatively, is stored in random access memory 130.




To perform decoding, the program-controlled processor 110 accesses a program of decoding instructions from the read only memory 120. An audio signal which preferably has been coded according to the present invention is supplied to the processing system 100 at audio input/output interface 140, and routed to the program-controlled processor 110 to be decoded. Responsive to the program of decoding instructions, the audio signal is decoded to obtain corresponding subband signals, and the subband signals are filtered by a synthesis filter bank to obtain an output signal. The output signal is supplied to other devices through the audio input/output interface 140, or alternatively, is stored in random access memory 130.




Referring now also to FIG. 1B, there is shown a schematic block diagram of an embodiment of a computer-implemented system 150 for encoding and decoding audio signals according to the present invention. Computer-implemented system 150 includes a central processing unit 152, random access memory 153, hard disk 154, input device 155, terminal 156, and output device 157, interconnected in conventional manner by bus 158. Central processing unit 152 preferably implements the Intel® x86 instruction set architecture and preferably includes hardware support for floating-point arithmetic, and may, for example, be an Intel® Pentium® III microprocessor, which is commercially available from Intel® Corporation of Santa Clara, California. Audio information is provided to the computer-implemented system 150 via terminal 156, and routed to the central processing unit 152. A program of instructions stored on hard disk 154 allows computer-implemented system 150 to process the audio data in accordance with the present invention. Processed audio data in digital form is then supplied via terminal 156, or alternatively written to and stored on the hard disk 154.




It is anticipated that processing system 100, computer-implemented system 150, and other embodiments of the present invention will be used in applications that may include both audio and video processing. A typical video application would synchronize its operation with a video clocking signal and an audio clocking signal. The video clocking signal provides a synchronization reference for video frames, for example, frames of NTSC, PAL, or ATSC video signals. The audio clocking signal provides a synchronization reference for audio samples. Clocking signals may have substantially any rate; for example, 48 kilohertz is a common audio clocking rate in professional applications. No particular clocking signal or clocking signal rate is important for practicing the present invention.




Referring now to FIG. 2A, there is shown a flowchart of a process 200 that codes audio data into a data channel according to psychoacoustic and data capacity criteria. Referring now also to FIG. 2B, there is shown a block diagram of the data channel 250. Data channel 250 comprises a sequence of frames 260, each frame 260 comprising a sequence of words. Each word is designated as a sequence of bits (n), where n is an integer between zero and fifteen inclusive, and where the notation bits (n˜m) represents bit (n) through bit (m) of the word. Each frame 260 includes a control segment 270 and an audio segment 280, each comprising a respective integer number of the words of the frame 260.




A plurality of subband signals are received 210 that represent a first block of an audio signal. Each subband signal comprises one or more subband elements, and each subband element is represented by one word. The subband signals are analyzed 212 to determine an auditory masking curve. The auditory masking curve indicates the maximum amount of noise that can be injected into each respective subband without becoming audible. What is audible in this respect is based on psychoacoustic models of human hearing and may involve cross-channel masking characteristics where the subband signals represent more than one audio channel. The auditory masking curve serves as a first estimate of a desired noise spectrum. The desired noise spectrum is analyzed 214 to determine a respective quantization resolution for each subband signal such that when the subband signals are quantized accordingly and then dequantized and converted into sound waves, the resulting coding noise is beneath the desired noise spectrum. A determination 216 is made whether accordingly quantized subband signals can fit within and substantially fill the audio segment 280. If not, the desired noise spectrum is adjusted 218 and steps 214, 216 are repeated. If so, the subband signals are accordingly quantized 220 and output 222 in the audio segment 280.
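One way to realize the adjust-and-repeat loop of steps 214 through 218 is to shift the desired noise spectrum upward by a fixed increment until the resulting bit allocation fits the segment. The 6.02 dB-per-bit rule and the 1.5 dB step used here are illustrative assumptions, not values from the patent:

```python
import numpy as np

def fit_to_capacity(signal_db, mask_db, capacity_bits, step_db=1.5):
    """Raise the desired noise spectrum uniformly until the total
    subband bit allocation fits the audio segment's capacity."""
    offset = 0.0
    while True:
        # bits per subband so quantization noise stays under mask + offset
        bits = [int(np.ceil(max(0.0, s - (m + offset)) / 6.02))
                for s, m in zip(signal_db, mask_db)]
        if sum(bits) <= capacity_bits:
            return bits, offset
        offset += step_db   # permit more noise in every subband

# Three subbands whose initial allocation overshoots a 15-bit segment:
alloc, offset = fit_to_capacity([60.0, 50.0, 40.0], [20.0, 20.0, 20.0], 15)
```

Because the loop only ever raises the noise spectrum, it terminates once every subband's allocation reaches zero, and the final offset records how far above the masking curve the coding noise was allowed to rise.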




Control data is generated for the control segment 270 of frame 260. This includes a synchronization pattern that is output in the first word 272 of the control segment 270. The synchronization pattern allows decoders to synchronize to sequential frames 260 in the data channel 250. Additional control data that indicates the frame rate of frames 260, boundaries of segments 270, parameters of coding operations, and error detection information are output in the remaining portion 274 of the control segment 270. This process may be repeated for each block of the audio signal, with each sequential block preferably being coded into a corresponding sequential frame 260 of the data channel 250.




Process 200 can be applied to coding data into one or more layers of a multi-layer audio channel. Where more than one layer is coded according to process 200, there is likely to be substantial correlation between the data carried in such layers, and accordingly substantial waste of data capacity of the multi-layer audio channel. Discussed below are scalable processes that output augmentation data into a second layer of a data channel to improve the resolution of data carried in a first layer of such data channel. Preferably, the improvement in resolution can be expressed as a functional relationship of coding parameters of the first layer, such as an offset that, when applied to the desired noise spectrum used for coding the first layer, yields a second desired noise spectrum used for coding the second layer. Such offset may then be output in an established location of the data channel, such as in a field or segment of the second layer, to indicate to decoders the value of the improvement. This may then be used to determine the location of each subband signal element or information relating thereto in the second layer. Next addressed are frame structures for organizing scalable data channels accordingly.




Referring now to FIG. 3A, there is shown a schematic diagram of an embodiment of a scalable data channel 300 that includes core layer 310, first augmentation layer 320, and second augmentation layer 330. Core layer 310 is L bits wide, first augmentation layer 320 is M bits wide, and second augmentation layer 330 is N bits wide, with L, M, N being positive integer values. The core layer 310 comprises a sequence of L-bit words. The combination of the core layer 310 and the first augmentation layer 320 comprises a sequence of (L+M)-bit words, and the combination of core layer 310, first augmentation layer 320, and second augmentation layer 330 comprises a sequence of (L+M+N)-bit words. The notation bits (n˜m) is used herein to represent bits (n) through (m) of a word, where n and m are integers and m>n, and where m, n can be between zero and twenty-three inclusive. Scalable data channel 300 may, for example, be a twenty-four bit wide standard AES3 data channel with L, M, N equal to sixteen, four, and four respectively.
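A twenty-four bit channel word with L, M, N equal to sixteen, four, and four can be pictured as three bit fields. The mapping below, with the core layer in the sixteen most significant bits, is an illustrative sketch only and does not reproduce the actual AES3 subframe layout:

```python
def pack_word(core16, aug1, aug2):
    """Pack one 24-bit channel word: 16-bit core layer in the high bits,
    two 4-bit augmentation layers below it (illustrative layout)."""
    assert 0 <= core16 < 1 << 16 and 0 <= aug1 < 1 << 4 and 0 <= aug2 < 1 << 4
    return (core16 << 8) | (aug1 << 4) | aug2

def unpack_word(word24):
    """Split a 24-bit word back into core and augmentation fields."""
    return word24 >> 8, (word24 >> 4) & 0xF, word24 & 0xF

word = pack_word(0xBEEF, 0xA, 0x5)
assert unpack_word(word) == (0xBEEF, 0xA, 0x5)
```

Under this layout, a legacy sixteen-bit decoder that reads only the high sixteen bits of each word recovers the core layer and can ignore the augmentation fields entirely.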




Scalable data channel 300 may be organized as a sequence of frames 340 according to the present invention. Each frame 340 is partitioned into a control segment 350 followed by an audio segment 360. Control segment 350 includes core layer portion 352 defined by the intersection of the control segment 350 with the core layer 310, first augmentation layer portion 354 defined by the intersection of the control segment 350 with the first augmentation layer 320, and second augmentation layer portion 356 defined by the intersection of the control segment 350 with the second augmentation layer 330. The audio segment 360 includes first and second subsegments 370, 380. The first subsegment 370 includes a core layer portion 372 defined by the intersection of the first subsegment 370 with the core layer 310, a first augmentation layer portion 374 defined by the intersection of the first subsegment 370 with the first augmentation layer 320, and a second augmentation layer portion 376 defined by the intersection of the first subsegment 370 with the second augmentation layer 330. Similarly, the second subsegment 380 includes a core layer portion 382 defined by the intersection of the second subsegment 380 with the core layer 310, a first augmentation layer portion 384 defined by the intersection of the second subsegment 380 with the first augmentation layer 320, and a second augmentation layer portion 386 defined by the intersection of the second subsegment 380 with the second augmentation layer 330.




In this embodiment, core layer portions 372, 382 carry coded audio data that is compressed according to psychoacoustic criteria so that the coded audio data fits within core layer 310. Audio data that is provided as input to the coding process may, for example, comprise subband signal elements each represented by a P-bit wide word, with integer P greater than L. Psychoacoustic principles may then be applied to code the subband signal elements into encoded values or "symbols" having an average width of about L bits. The data volume occupied by the subband signal elements is thereby compressed sufficiently that it can be conveniently transmitted via the core layer 310. Coding operations preferably are consistent with conventional audio transmission criteria for audio data on an L-bit wide data channel so that core layer 310 can be decoded in a conventional manner. First augmentation layer portions 374, 384 carry augmentation data that can be used in combination with the coded information in core layer 310 to recover an audio signal having a higher resolution than can be recovered from only the coded information in core layer 310. Second augmentation layer portions 376, 386 carry additional augmentation data that can be used in combination with the coded information in core layer 310 and first augmentation layer 320 to recover an audio signal having a higher resolution than can be recovered from only the coded information carried in a union of core layer 310 with first augmentation layer 320. In this embodiment, the first subsegment 370 carries coded audio data for a left audio channel CH_L, and the second subsegment 380 carries coded audio data for a right audio channel CH_R.




Core layer portion 352 of control segment 350 carries control data for controlling operation of decoding processes. Such control data may include synchronization data that indicates the location of the beginning of the frame 340, format data that indicates program configuration and frame rate, segment data that indicates boundaries of segments and subsegments within the frame 340, parameter data that indicates parameters of coding operations, and error detection information that protects data in core layer portion 352. Predetermined or established locations preferably are provided in core layer portion 352 for each variety of control data to allow decoders to quickly parse each variety from the core layer portion 352. According to this embodiment, all control data that is essential for decoding and processing the core layer 310 is included in core layer portion 352. This allows augmentation layers 320, 330 to be stripped off or discarded, for example by signal routing circuitry, without loss of essential control data, and thereby supports compatibility with digital signal processors designed to receive data formatted as L-bit words. Additional control data for augmentation layers 320, 330 can be included in augmentation layer portion 354 according to this embodiment.




Within control segment 350, each layer 310, 320, 330 preferably carries parameters and other information for decoding respective portions of the encoded audio data in audio segment 360. For example, core layer portion 352 can carry an offset of an auditory masking curve that yields a first desired noise spectrum used for perceptually coding information into core layer portions 372, 382. Similarly, the first augmentation layer portion 354 can carry an offset of the first desired noise spectrum that yields a second desired noise spectrum used for coding information into augmentation layer portions 374, 384, and the second augmentation layer portion 356 can carry an offset of the second desired noise spectrum that yields a third desired noise spectrum used for coding information into the second augmentation layer portions 376, 386.
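The chain of offsets carried in the control-segment portions can be pictured as successive uniform shifts of a per-subband noise curve. The following is a minimal sketch; the curve values and offsets are hypothetical, not values from the patent.

```python
# Hypothetical sketch: each layer's desired noise spectrum is the previous
# spectrum shifted by the offset carried in the corresponding control field.
def apply_offset(spectrum_db, offset_db):
    """Shift a per-subband noise spectrum (dB values) by a uniform offset."""
    return [s + offset_db for s in spectrum_db]

amc_db  = [30.0, 42.0, 25.0]            # auditory masking curve (hypothetical values)
fdns_db = apply_offset(amc_db, 6.0)     # first desired noise spectrum (offset in portion 352)
sdns_db = apply_offset(fdns_db, -12.0)  # second desired noise spectrum (offset in portion 354)
tdns_db = apply_offset(sdns_db, -12.0)  # third desired noise spectrum (offset in portion 356)
```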




Referring now to FIG. 3B, there is shown a schematic diagram of an alternative frame 390 for the scalable data channel 300. Frame 390 includes the control segment 350 and audio segment 360 of frame 340. In frame 390, the control segment 350 also includes fields 392, 394, 396 in the core layer 310, first augmentation layer 320 and second augmentation layer 330 respectively.




Field 392 carries a flag that indicates the organization of augmentation data. According to a first flag value, augmentation data is organized according to a predetermined configuration. This preferably is the configuration of frame 340, so that augmentation data for left audio channel CH_L is carried in the first subsegment 370 and augmentation data for right audio channel CH_R is carried in the second subsegment 380. A configuration wherein each channel's core and augmentation data are carried in the same subsegment is referred to herein as an aligned configuration. According to a second flag value, augmentation data is distributed in the augmentation layers 320, 330 in an adaptive manner, and fields 394, 396 respectively carry an indication of where augmentation data for each respective audio channel is carried.
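A decoder's handling of this flag might look like the following sketch. The concrete flag values (0 for aligned, 1 for adaptive) are assumptions; the patent speaks only of first and second flag values.

```python
# Sketch of interpreting field 392 (flag values 0/1 are assumed, not specified).
ALIGNED, ADAPTIVE = 0, 1

def augmentation_locations(flag, field_394=None, field_396=None):
    """Return where each channel's augmentation data is carried."""
    if flag == ALIGNED:
        # Predetermined (aligned) configuration of frame 340: augmentation
        # data follows the same subsegments as the core data.
        return {"CH_L": "subsegment 370", "CH_R": "subsegment 380"}
    # Adaptive distribution: fields 394 and 396 carry per-channel locations.
    return {"CH_L": field_394, "CH_R": field_396}
```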




Field 392 preferably has sufficient size to carry an error detection code for data in the core layer portion 352 of control segment 350. It is desirable to protect this control data because it controls decoding operations of the core layer 310. Field 392 may alternatively carry an error detection code that protects the core layer portions 372, 382 of audio segment 360. No error detection need be provided for the data in augmentation layers 320, 330 because the effect of such errors will usually be at most barely audible where the width L of the core layer 310 is sufficient. For example, where the core layer 310 is perceptually coded to a sixteen bit word depth, the augmentation data primarily provides subtle detail, and errors in augmentation data typically will be difficult to hear upon decode and playback.




Fields 394, 396 may each carry an error detection code. Each code provides protection for the augmentation layer 320, 330 in which it is carried. This preferably includes error detection for control data, but may alternatively include error correction for audio data, or for both control and audio data. Two different error detection codes may be specified for each augmentation layer 320, 330. A first error detection code specifies that augmentation data for the respective augmentation layer is organized according to a predetermined configuration, such as that of frame 340. A second error detection code for each layer specifies that augmentation data for the respective layer is distributed in the respective layer and that pointers are included in the control segment 350 to indicate locations of this augmentation data. Preferably the augmentation data is in the same frame 390 of the data channel 300 as corresponding data in the core layer 310. A predetermined configuration can be used to organize one augmentation layer and pointers to organize the other. The error detection codes may alternatively be error correction codes.




Referring now to FIG. 4A, there is shown a flowchart of an embodiment of a scalable coding process 400 according to the present invention. This embodiment uses the core layer 310 and first augmentation layer 320 of the data channel 300 shown in FIG. 3A. A plurality of subband signals are received 402, each comprising one or more subband signal elements. In step 404, a respective first quantization resolution for each subband signal is determined in response to a first desired noise spectrum. The first desired noise spectrum is established according to psychoacoustic principles and preferably also in response to a data capacity requirement of the core layer 310. This requirement may, for example, be the total data capacity limits of core layer portions 372, 382. Subband signals are quantized according to the respective first quantization resolution to generate a first coded signal. The first coded signal is output 406 in core layer portions 372, 382 of the audio segment 360.




In step 408, a respective second quantization resolution is determined for each subband signal. The second quantization resolution preferably is established in response to a data capacity requirement of the union of the core and first augmentation layers 310, 320 and preferably also according to psychoacoustic principles. The data capacity requirement may, for example, be a total data capacity limit of the union of core and first augmentation layer portions 372, 374. Subband signals are quantized according to the respective second quantization resolution to generate a second coded signal. A first residue signal is generated 410 that conveys some residual measure or difference between the first and second coded signals. This preferably is implemented by subtracting the first coded signal from the second coded signal in accordance with two's complement or other form of binary arithmetic. The first residue signal is output 412 in first augmentation layer portions 374, 384 of the audio segment 360.
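The residue computation of step 410 can be sketched numerically. This is a simplified model: reconstruction levels stand in for the integer codewords whose two's-complement difference the patent describes, and the step sizes are hypothetical.

```python
# Simplified sketch (not the patented coder): code one subband signal at two
# resolutions and form the first residue as the difference between the finer
# and coarser codings, as in step 410.
def quantize(elements, step):
    # Uniform quantization: snap each element to the nearest multiple of `step`.
    return [step * round(e / step) for e in elements]

subband = [0.37, -0.82, 0.15]               # hypothetical subband signal elements
coarse  = quantize(subband, 2.0 ** -4)      # first (core-layer) quantization resolution
fine    = quantize(subband, 2.0 ** -8)      # second, finer quantization resolution
residue = [f - c for f, c in zip(fine, coarse)]   # first residue signal
# A decoder adds the residue back to the core-layer coding to recover the
# finer second coding: coarse[i] + residue[i] == fine[i] for every element.
```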




In step 414, a respective third quantization resolution is determined for each subband signal. The third quantization resolution preferably is established according to the data capacity of the union of layers 310, 320, 330. Psychoacoustic principles preferably are used to establish the third quantization resolution as well. Subband signals are quantized according to the respective third quantization resolution to generate a third coded signal. A second residue signal is generated 416 that conveys some residual measure or difference between the second and third coded signals. The second residue signal preferably is generated by forming the two's complement (or other binary arithmetic) difference between the second and third coded signals. The second residue signal may alternatively be generated to convey a residual measure or difference between the first and third coded signals. The second residue signal is output 418 in second augmentation layer portions 376, 386 of the audio segment 360.




In steps 404, 408, 414, when a subband signal includes more than one subband signal element, the quantization of the subband signal to a particular resolution may comprise uniformly quantizing each element of the subband signal to that resolution. Thus if a subband signal (ss) includes three subband signal elements (se1, se2, se3), the subband signal may be quantized according to a quantization resolution Q by uniformly quantizing each of its subband signal elements according to this quantization resolution Q. The quantized subband signal may be written as Q(ss) and the quantized subband signal elements may be written as Q(se1), Q(se2), Q(se3). Quantized subband signal Q(ss) thus comprises the collection of quantized subband signal elements (Q(se1), Q(se2), Q(se3)). A coding range that identifies a range of quantization of subband signal elements that is permissible relative to a base point may be specified as a coding parameter. The base point preferably is the level of quantization that would yield injected noise substantially matching the auditory masking curve. The coding range may, for example, extend from about 144 decibels of removed noise to about 48 decibels of injected noise relative to the auditory masking curve, or more briefly, −144 dB to +48 dB.
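The per-element quantization and the coding-range limit just described can be sketched as follows; the step size and sample values are hypothetical.

```python
# Sketch: uniform quantization of one subband signal's elements, and a clamp
# keeping a requested offset within the stated coding range relative to the
# auditory masking curve.
CODING_RANGE_DB = (-144.0, 48.0)   # permissible range relative to the masking curve

def clamp_to_coding_range(offset_db):
    """Limit a quantization offset (dB) to the specified coding range."""
    lo, hi = CODING_RANGE_DB
    return max(lo, min(hi, offset_db))

def quantize_subband(ss, step):
    """Uniformly quantize each element of one subband signal to the same
    resolution Q, yielding Q(ss) = (Q(se1), Q(se2), Q(se3), ...)."""
    return tuple(step * round(se / step) for se in ss)
```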




In an alternative embodiment of the present invention, subband signal elements within the same subband signal are on average quantized to a particular quantization resolution Q, but individual subband signal elements are non-uniformly quantized to different resolutions. In yet another alternative embodiment that provides non-uniform quantization within a subband, a gain-adaptive quantization technique quantizes some subband signal elements within the same subband to a particular quantization resolution Q and quantizes other subband signal elements in that subband to a different resolution that may be finer or more coarse than resolution Q by some determinable amount. A preferred method for carrying out non-uniform quantization within a respective subband is disclosed in a patent application by Davidson et al. entitled “Using Gain-Adaptive Quantization and Non-Uniform Symbol Lengths for Improved Audio Coding” filed Jul. 7, 1999, which is incorporated herein by reference.




In step 402, the received subband signals preferably include a set of left subband signals SS_L that represent left audio channel CH_L and a set of right subband signals SS_R that represent right audio channel CH_R. These audio channels may be a stereo pair or may alternatively be substantially unrelated to one another. Perceptual coding of the audio channels CH_L, CH_R is preferably carried out using a pair of desired noise spectra, one spectrum for each of the audio channels CH_L, CH_R. A subband signal of set SS_L may thus be quantized at a different resolution than a corresponding subband signal of set SS_R. The desired noise spectrum for one audio channel may be affected by the signal content of the other channel by taking into account cross-channel masking effects. In preferred embodiments, cross-channel masking effects are ignored.




The first desired noise spectrum for the left audio channel CH_L is established in response to auditory masking characteristics of subband signals SS_L, optionally the cross-channel masking characteristics of subband signals SS_R, and additional criteria such as the available data capacity of core layer portion 372, as follows. Left subband signals SS_L, and optionally right subband signals SS_R as well, are analyzed to determine an auditory masking curve AMC_L for left audio channel CH_L. The auditory masking curve indicates the maximum amount of noise that can be injected into each respective subband of the left audio channel CH_L without becoming audible. What is audible in this respect is based on psychoacoustic models of human hearing and may involve cross-channel masking characteristics of right audio channel CH_R. Auditory masking curve AMC_L serves as an initial value for a first desired noise spectrum for left audio channel CH_L, which is analyzed to determine a respective quantization resolution Q1_L for each subband signal of set SS_L such that when the subband signals of set SS_L are quantized accordingly, Q1_L(SS_L), and then dequantized and converted into sound waves, the resulting coding noise is inaudible. For clarity, it is noted that the term Q1_L refers to a set of quantization resolutions, with such set having a respective value Q1_Lss for each subband signal ss in the set of subband signals SS_L. It should be understood that the notation Q1_L(SS_L) means that each subband signal in the set SS_L is quantized according to a respective quantization resolution. Subband signal elements within each subband signal may be quantized uniformly or non-uniformly, as described above.




In like manner, right subband signals SS_R, and preferably left subband signals SS_L as well, are analyzed to generate an auditory masking curve AMC_R for right audio channel CH_R. This auditory masking curve AMC_R may serve as an initial first desired noise spectrum for right audio channel CH_R, which is analyzed to determine a respective quantization resolution Q1_R for each subband signal of set SS_R.




Referring now also to FIG. 4B, there is shown a flowchart of a process for determining quantization resolutions according to the present invention. Process 420 may be used, for example, to find appropriate quantization resolutions for coding each layer according to process 400. Process 420 will be described with respect to the left audio channel CH_L; the right audio channel CH_R is processed in like manner.




An initial value for a first desired noise spectrum FDNS_L is set 422 equal to the auditory masking curve AMC_L. A respective quantization resolution for each subband signal of set SS_L is determined 424 such that were these subband signals accordingly quantized, and then dequantized and converted into sound waves, any quantization noise thereby generated would substantially match the first desired noise spectrum FDNS_L. In step 426, it is determined whether accordingly quantized subband signals would meet a data capacity requirement of the core layer 310. In this embodiment of process 420, the data capacity requirement is specified as whether the accordingly quantized subband signals would fit in and substantially use up the data capacity of core layer portion 372. In response to a negative determination in step 426, the first desired noise spectrum FDNS_L is adjusted 428. The adjustment comprises shifting the first desired noise spectrum FDNS_L by an amount that preferably is substantially uniform across the subbands of the left audio channel CH_L. The direction of the shift is upward, which corresponds to coarser quantization, where the accordingly quantized subband signals from step 426 did not fit in core layer portion 372. The direction of the shift is downward, which corresponds to finer quantization, where the accordingly quantized subband signals from step 426 did fit in core layer portion 372. The magnitude of the first shift is preferably equal to about one-half the remaining distance to the extremum of the coding range in the direction of the shift. Thus, where the coding range is specified as −144 dB to +48 dB, the first such shift may, for example, comprise shifting the FDNS_L upward by about 24 dB. The magnitude of each subsequent shift is preferably about one-half the magnitude of the immediately prior shift. Once the first desired noise spectrum FDNS_L is adjusted 428, steps 424 and 426 are repeated. When a positive determination is made in a performance of step 426, the process terminates 430 and the determined quantization resolutions Q1_L are considered to be appropriate.
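The adjust-and-retry loop of steps 424 through 430 amounts to a coarse binary search over a uniform dB shift. The sketch below substitutes a toy `bits_needed` model for the real coder and runs a fixed number of passes rather than the fit-and-fill termination test of step 426; both are simplifying assumptions.

```python
def find_offset(bits_needed, capacity, first_shift=24.0, passes=10):
    """Search for a uniform dB shift of the desired noise spectrum such that
    the coded size fits in (and nearly fills) `capacity` bits.  Each pass
    shifts up (coarser) on overflow or down (finer) otherwise, then halves
    the shift, as described for step 428."""
    offset, shift = 0.0, first_shift
    for _ in range(passes):
        if bits_needed(offset) > capacity:
            offset += shift   # did not fit: raise noise floor, coarser quantization
        else:
            offset -= shift   # fit: lower noise floor, finer quantization
        shift /= 2.0          # each subsequent shift is half the prior one
    return offset

# Toy model: raising the noise floor by 1 dB saves 4 bits of coded size,
# so a 100-bit capacity is exactly filled at a 5 dB offset.
needed = lambda offset_db: 120.0 - 4.0 * offset_db
best = find_offset(needed, capacity=100.0)   # converges toward the 5 dB crossover
```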




The subband signals of set SS_L are quantized at the determined quantization resolutions Q1_L to generate quantized subband signals Q1_L(SS_L). The quantized subband signals Q1_L(SS_L) serve as a first coded signal FCS_L for the left audio channel CH_L. The quantized subband signals Q1_L(SS_L) can be conveniently output in core layer portion 372 in any pre-established order, such as by increasing spectral frequency of subband signal elements. Allocation of the data capacity of core layer portion 372 among quantized subband signals Q1_L(SS_L) is thus based on hiding as much quantization noise as practicable given the data capacity of this portion of the core layer 310. Subband signals SS_R for the right audio channel CH_R are processed in similar manner to generate a first coded signal FCS_R for that channel CH_R, which is output in core layer portion 382.




Appropriate quantization resolutions Q2_L for coding first augmentation layer portion 374 are determined according to process 420 as follows. An initial value for a second desired noise spectrum SDNS_L for the left audio channel CH_L is set 422 equal to the first desired noise spectrum FDNS_L. The second desired noise spectrum SDNS_L is analyzed to determine a respective second quantization resolution Q2_Lss for each subband signal ss of set SS_L such that were subband signals of set SS_L quantized according to Q2_L(SS_L), and then dequantized and converted to sound waves, the resulting quantization noise would substantially match the second desired noise spectrum SDNS_L. In step 426, it is determined whether accordingly quantized subband signals would meet a data capacity requirement of the first augmentation layer 320. In this embodiment of process 420, the data capacity requirement is specified as whether a residue signal would fit in and substantially use up the data capacity of first augmentation layer portion 374. The residue signal is specified as a residual measure or difference between the accordingly quantized subband signals Q2_L(SS_L) and the quantized subband signals Q1_L(SS_L) determined for core layer portion 372.




In response to a negative determination in step 426, the second desired noise spectrum SDNS_L is adjusted 428. The adjustment comprises shifting the second desired noise spectrum SDNS_L by an amount that preferably is substantially uniform across the subbands of the left audio channel CH_L. The direction of the shift is upward where the residue signals from step 426 did not fit in the first augmentation layer portion 374, and otherwise it is downward. The magnitude of the first shift is preferably equal to about one-half the remaining distance to the extremum of the coding range in the direction of the shift. The magnitude of each subsequent shift is preferably about one-half the magnitude of the immediately prior shift. Once the second desired noise spectrum SDNS_L is adjusted 428, steps 424 and 426 are repeated. When a positive determination is made in a performance of step 426, the process terminates 430 and the determined quantization resolutions Q2_L are considered to be appropriate.




The subband signals of set SS_L are quantized at the determined quantization resolutions Q2_L to generate respective quantized subband signals Q2_L(SS_L), which serve as a second coded signal SCS_L for the left audio channel CH_L. A corresponding first residue signal FRS_L for the left audio channel CH_L is generated. A preferred method is to form a residue for each subband signal element and output bit representations for such residues by concatenation in a pre-established order, such as according to increasing frequency of subband signal elements, in first augmentation layer portion 374. Allocation of the data capacity of first augmentation layer portion 374 among quantized subband signals Q2_L(SS_L) is thus based on hiding as much quantization noise as practicable given the data capacity of this portion 374 of the first augmentation layer 320. Subband signals SS_R for the right audio channel CH_R are processed in similar manner to generate a second coded signal SCS_R and first residue signal FRS_R for that channel CH_R. The first residue signal FRS_R for the right audio channel CH_R is output in first augmentation layer portion 384.




The quantized subband signals Q2_L(SS_L) and Q1_L(SS_L) can be determined in parallel. This is preferably implemented by setting the initial value of the second desired noise spectrum SDNS_L for the left audio channel CH_L equal to the auditory masking curve AMC_L or another specification that does not depend on the first desired noise spectrum FDNS_L determined for coding the core layer. The data capacity requirement is then specified as whether the accordingly quantized subband signals Q2_L(SS_L) would fit in and substantially use up the union of core layer portion 372 with the first augmentation layer portion 374.




An initial value for the third desired noise spectrum for audio channel CH_L is obtained, and process 420 is applied to obtain respective third quantization resolutions Q3_L, as is done for the second desired noise spectrum. Accordingly quantized subband signals Q3_L(SS_L) serve as a third coded signal TCS_L for the left audio channel CH_L. A second residue signal SRS_L for the left audio channel CH_L may then be generated in a manner similar to that used for the first augmentation layer. In this case, however, residue signals are obtained by subtracting subband signal elements in the third coded signal TCS_L from corresponding subband signal elements in second coded signal SCS_L. The second residue signal SRS_L is output in second augmentation layer portion 376. Subband signals SS_R for the right audio channel CH_R are processed in similar manner to generate a third coded signal TCS_R and second residue signal SRS_R for that channel CH_R. The second residue signal SRS_R for the right audio channel CH_R is output in second augmentation layer portion 386.




Control data is generated for core layer portion 352. In general, the control data allows decoders to synchronize with each frame in a coded stream of frames, and indicates to decoders how to parse and decode the data supplied in each frame such as frame 340. Because a plurality of coded resolutions are provided, the control data typically is more complex than that found in non-scalable coding implementations. In a preferred embodiment of the present invention, control data includes a synchronization pattern, format data, segment data, parameter data, and an error detection code, all of which are discussed below. Additional control information is generated for the augmentation layers 320, 330 that specifies how these layers 320, 330 can be decoded.




A predetermined synchronization word may be generated to indicate the beginning of a frame. The synchronization pattern is output in the first L bits of the first word of each frame to indicate where the frame begins. The synchronization pattern preferably does not occur at any other location in the frame. Synchronization patterns indicate to decoders how to parse frames from a coded data stream.




Format data may be generated that indicates program configuration, bitstream profile, and frame rate. Program configuration indicates the number and distribution of channels included in the coded bitstream. Bitstream profile indicates what layers of the frame are utilized. A first value of bitstream profile indicates that coding is supplied in only the core layer 310. The augmentation layers 320, 330 preferably are omitted in this instance to save data capacity on the data channel. A second value of bitstream profile indicates that coded data is supplied in core layer 310 and in first augmentation layer 320. The second augmentation layer 330 preferably is omitted in this instance. A third value of bitstream profile indicates that coded data is supplied in each layer 310, 320, 330. The first, second, and third values of bitstream profile preferably are determined in accordance with the AES3 specification. The frame rate may be determined as a number, or approximate number, of frames per unit time, such as 30 Hertz, which for standard AES3 corresponds to about one frame per 3,200 words. The frame rate helps decoders to maintain synchronization and effective buffering of incoming coded data.




Segment data is generated that indicates boundaries of segments and subsegments. These include boundaries of control segment 350, audio segment 360, first subsegment 370, and second subsegment 380. In alternative embodiments of scalable coding process 400, additional subsegments are included in a frame, for example, for multi-channel audio. Additional audio segments can also be provided to reduce the average volume of control data in frames by combining audio information from a plurality of frames into a larger frame. A subsegment may also be omitted, for example, for audio applications requiring fewer audio channels. Data regarding boundaries of additional subsegments or omitted subsegments can be provided as segment data. The depths L, M, N respectively of the layers 310, 320, 330 can also be specified in similar manner. Preferably, L is specified as sixteen to support backward compatibility with conventional 16-bit digital signal processors. Preferably, M and N are specified as four and four to support scalable data channel criteria specified by standard AES3. Specified depths preferably are not explicitly carried as data in a frame but are presumed at coding to be appropriately implemented in decoding architectures.




Parameter data is generated that indicates parameters of coding operations. Such parameters indicate which species of coding operation is used for coding data into a frame. A first value of parameter data may indicate that core layer 310 is coded according to the public ATSC AC-3 bitstream specification as specified in the Advanced Television Standards Committee (ATSC) A52 document (1994). A second value of parameter data may indicate that the core layer 310 is coded according to a perceptual coding technique embodied in Dolby Digital® coders and decoders. Dolby Digital® coders and decoders are commercially available from Dolby Laboratories, Inc. of San Francisco, Calif. The present invention may be used with a wide variety of perceptual coding and decoding techniques. Various aspects of such perceptual coding and decoding techniques are disclosed in U.S. Pat. No. 5,913,191 (Fielder), U.S. Pat. No. 5,222,189 (Fielder), U.S. Pat. No. 5,109,417 (Fielder, et al.), U.S. Pat. No. 5,632,003 (Davidson, et al.), U.S. Pat. No. 5,583,962 (Davis, et al.), and U.S. Pat. No. 5,623,577 (Fielder), and in U.S. patent application Ser. No. 09/289,865 by Ubale, et al., each of which is incorporated by reference in its entirety. No particular perceptual coding or decoding technique is essential for practicing the present invention.




One or more error detection codes are generated for protecting data in core layer portion 352 and, if data capacity allows, data in the core layer portions 372, 382 of core layer 310. Core layer portion 352 preferably is protected to a greater degree than any other portion of frame 340 because it includes all essential information for synchronizing to frames 340 in a coded data stream and for parsing the core layer 310 of each frame 340.




In this embodiment of the present invention, data is output into a frame as follows. First coded signals FCS_L, FCS_R are output respectively in core layer portions 372, 382, first residue signals FRS_L, FRS_R are output respectively in first augmentation layer portions 374, 384, and second residue signals SRS_L, SRS_R are output respectively in second augmentation layer portions 376, 386. This may be achieved by multiplexing these signals FCS_L, FCS_R, FRS_L, FRS_R, SRS_L, SRS_R together to form a stream of words each of length L+M+N, with, for example, signal FCS_L carried by the first L bits, FRS_L carried by the next M bits, and SRS_L carried by the final N bits, and similarly for signals FCS_R, FRS_R, SRS_R. This stream of words is output serially in the audio segment 360. The synchronization word, format data, segment data, parameter data, and data protection information are output in core layer portion 352. Additional control information for augmentation layers 320, 330 is supplied to their respective layers 320, 330.
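The word-multiplexing step described above can be sketched as follows. This is an illustrative sketch only; the function names are not taken from the patent, and the field widths shown are the preferred L=16, M=4, N=4 values.

```python
L, M, N = 16, 4, 4  # bit widths of the core, first, and second augmentation fields

def mux(fcs, frs, srs):
    """Pack a core word (L bits), a first residue (M bits), and a second
    residue (N bits) into one (L+M+N)-bit word, core bits first."""
    return (fcs << (M + N)) | (frs << N) | srs

def demux(word):
    """Recover the three fields from a multiplexed (L+M+N)-bit word."""
    fcs = word >> (M + N)
    frs = (word >> N) & ((1 << M) - 1)
    srs = word & ((1 << N) - 1)
    return fcs, frs, srs
```

A decoder that needs only the core resolution can simply keep the top L bits of each word and discard the rest.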




According to preferred embodiments of scalable audio coding process 400, each subband signal in the core layer is represented in a block-scaled form comprising a scale factor and one or more scaled values representing each subband signal element. For example, each subband signal may be represented in block-floating-point form, in which a block-floating-point exponent is the scale factor and each subband signal element is represented by a floating-point mantissa. Essentially any form of scaling may be used. To facilitate parsing the coded data stream to recover the scale factors and scaled values, the scale factors may be coded into the data stream at pre-established positions within each frame, such as at the beginning of each subsegment 370, 380 within audio segment 360.
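A minimal sketch of such a block-floating-point representation is shown below. The function names, mantissa width, and the exact normalization rule are illustrative assumptions, not the patent's specified coding.

```python
def block_float_encode(elements, mantissa_bits=16):
    """Represent a subband signal in block-scaled form: one shared
    exponent (the scale factor) plus one scaled mantissa per element."""
    peak = max(abs(x) for x in elements)
    # The shared exponent counts left shifts that keep the peak below 1.0;
    # all elements share it, so only the mantissas differ per element.
    exponent = 0
    while peak != 0 and peak * (1 << (exponent + 1)) < 1.0:
        exponent += 1
    scale = 1 << exponent
    q = (1 << (mantissa_bits - 1)) - 1
    mantissas = [round(x * scale * q) for x in elements]
    return exponent, mantissas

def block_float_decode(exponent, mantissas, mantissa_bits=16):
    """Invert the block scaling to recover approximate element values."""
    q = (1 << (mantissa_bits - 1)) - 1
    return [m / q / (1 << exponent) for m in mantissas]
```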




In preferred embodiments, the scale factors provide a measure of subband signal power that can be used by a psychoacoustic model to determine the auditory masking curves AMC_L, AMC_R discussed above. Preferably, scale factors for the core layer 310 are used as scale factors for the augmentation layers 320, 330, and it is thus not necessary to generate and output a distinct set of scale factors for each layer. Typically, only the most significant bits of the differences between corresponding subband signal elements of the various coded signals are coded into the augmentation layers.




In preferred embodiments, additional processing is performed to eliminate reserved or forbidden data patterns from the coded data. For example, data patterns in the encoded audio data that would mimic a synchronization pattern reserved to appear at the start of a frame should be avoided. One simple way in which a particular non-zero data pattern may be avoided is to modify the encoded audio data by performing a bit-wise exclusive OR between the encoded audio data and a suitable key. Further details and additional techniques for avoiding forbidden and reserved data patterns are disclosed in U.S. Pat. No. 6,233,718 entitled “Avoiding Forbidden Data Patterns in Coded Audio Data” by Vernon, et al. A key or other control information may be included in each frame to reverse the effects of any modifications performed to eliminate these patterns.
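The exclusive-OR masking described above can be sketched as follows. The synchronization pattern value and the key-selection strategy here are illustrative assumptions; the patent only requires that the chosen key eliminate the reserved pattern and be carried so its effect can be reversed.

```python
def choose_key(words, sync, width=16):
    """Pick a key such that no masked word equals the reserved sync pattern.
    A word w would become sync only if key == w ^ sync, so at most
    len(words) key values are excluded out of 2**width candidates."""
    bad = {w ^ sync for w in words}
    return next(k for k in range(1, 1 << width) if k not in bad)

def apply_key(words, key):
    """Bit-wise exclusive OR with the key; applying it a second time
    restores the original data exactly."""
    return [w ^ key for w in words]
```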




Referring now to FIG. 5, there is shown a flowchart illustrating a scalable decoding process 500 according to the present invention. Scalable decoding process 500 receives an audio signal coded into a series of layers. The first layer includes a perceptual coding of the audio signal. This perceptual coding represents the audio signal with a first resolution. Each remaining layer includes data about another respective coding of the audio signal. The layers are ordered according to increasing resolution of coded audio. More particularly, data from the first K layers may be combined and decoded to provide audio with greater resolution than data in the first K−1 layers, where K is an integer greater than one and not greater than the total number of layers.




According to process 500, a resolution for decoding is selected 511. The layer associated with the selected resolution is determined. If the data stream was modified to remove reserved or forbidden data patterns, the effects of the modifications should be reversed. Data carried in the determined layer is combined 513 with data in each predecessor layer and then decoded 515 according to an inverse operation of the coding process employed to code the audio signal to the respective resolution. Layers associated with resolutions higher than that selected can be stripped off or ignored, for example, by signal routing circuitry. Any process or operation that is required to reverse the effects of scaling should be performed prior to decoding.




An embodiment is now described where scalable decoding process 500 is performed by processing system 100 on audio data received via a standard AES3 data channel. The standard AES3 data channel provides data in a series of twenty-four bit wide words. Each bit of a word may conveniently be identified by a bit number ranging from zero (0), which is the most significant bit, through twenty-three (23), which is the least significant bit. The notation bits (n˜m) is used herein to represent bits (n) through (m) of a word, where n and m are integers and m>n. The AES3 data channel is partitioned into a series of frames such as frame 340 in accordance with scalable data channel 300 of the present invention. Core layer 310 comprises bits (0˜15), first augmentation layer 320 comprises bits (16˜19), and second augmentation layer 330 comprises bits (20˜23).




Data in layers 310, 320, 330 is received via audio input/output interface 140 of processing system 100. Responsive to the program of decoding instructions, processing system 100 searches for a sixteen-bit synchronization pattern in the data stream to align its processing with each frame boundary, then partitions the data serially, beginning with the synchronization pattern, into twenty-four bit wide words represented as bits (0˜23). Bits (0˜15) of the first word are thus the synchronization pattern. Any processing required to reverse the effects of modifications made to avoid reserved patterns can be performed at this time.
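The frame-alignment search can be sketched as follows, assuming the words have already been framed at the channel level. The synchronization pattern value here is a hypothetical placeholder, not the value defined by the patent.

```python
SYNC = 0xF872  # hypothetical 16-bit synchronization pattern

def align_frames(words, sync=SYNC):
    """Return the index of the word whose bits (0~15) — the sixteen most
    significant bits of the 24-bit word — match the synchronization
    pattern; parsing of the frame begins at that word."""
    for i, w in enumerate(words):
        if (w >> 8) & 0xFFFF == sync:
            return i
    raise ValueError("no synchronization pattern found")
```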




Pre-established locations in core layer 310 are read to obtain format data, segment data, parameter data, offsets, and data protection information. Error detection codes are processed to detect any error in the data in core layer portion 352. Muting of corresponding audio or retransmission of data may be performed in response to detection of a data error. Frame 340 is then parsed to obtain data for subsequent decoding operations.




To decode just the core layer 310, the sixteen bit resolution is selected 511. Established locations in core layer portions 372, 382 of first and second audio subsegments 370, 380 are read to obtain the coded subband signal elements. In preferred embodiments using block-scaled representations, this is accomplished by first obtaining the block scaling factor for each subband signal and using these scale factors to generate the same auditory masking curves AMC_L, AMC_R that were used in the encoding process. First desired noise spectra for audio channels CH_L, CH_R are generated by shifting the auditory masking curves AMC_L, AMC_R by respective offsets O1_L, O1_R for each channel read from core layer portion 352. First quantization resolutions Q1_L, Q1_R are then determined for the audio channels in the same manner used by coding process 400. Processing system 100 can now determine the length and location of the coded scaled values in core layer portions 372, 382 of audio subsegments 370, 380, respectively, that represent the scaled values of the subband signal elements. The coded scaled values are parsed from subsegments 370, 380 and combined with the corresponding subband scale factors to obtain the quantized subband signal elements for audio channels CH_L, CH_R, which are then converted into digital audio streams. The conversion is performed by applying a synthesis filter bank complementary to the analysis filter bank applied during the encode process. The digital audio streams represent the left and right audio channels CH_L, CH_R. These digital signals may be converted into analog signals by digital-to-analog conversion, which beneficially can be implemented in conventional manner.




The core and first augmentation layers 310, 320 can be decoded as follows. The twenty bit coding resolution is selected 511. Subband signal elements in the core layer 310 are obtained as just described. Additional offsets O2_L are read from augmentation layer portion 354 of control segment 350. A second desired noise spectrum for audio channel CH_L is generated by shifting the first desired noise spectrum of left audio channel CH_L by the offset O2_L, and responsive to the obtained noise spectrum, second quantization resolutions Q2_L are determined in the manner described for perceptually coding the first augmentation layer according to coding process 400. These quantization resolutions Q2_L indicate the length and location of each component of residue signal RES1_L in augmentation layer portion 374. Processing system 100 reads the respective residue signals and obtains the scaled representation of the quantized subband signal elements by combining 513 the residue signal RES1_L with the scaled representation obtained from core layer 310. In this embodiment of the present invention, this is achieved using two's complement addition, performed on a subband signal element by subband signal element basis. The quantized subband signal elements are obtained from the scaled representations of each subband signal and are then converted by an appropriate signal synthesis process to generate a digital audio stream for each channel. The digital audio stream may be converted to analog signals by digital-to-analog conversion. The core and first and second augmentation layers 310, 320, 330 can be decoded in a manner similar to that just described.
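The element-by-element two's complement combination can be sketched as follows. The four-bit residue width and the assumption that the residue aligns directly with the least significant end of the core value are illustrative; the actual alignment depends on the quantization resolutions determined above.

```python
def combine(core_scaled, residue, residue_bits=4):
    """Combine core-layer scaled values with augmentation-layer residues
    by two's complement addition, one subband signal element at a time."""
    half = 1 << (residue_bits - 1)
    out = []
    for c, r in zip(core_scaled, residue):
        # Interpret the residue field as a signed two's complement value.
        signed = r - (1 << residue_bits) if r >= half else r
        out.append(c + signed)
    return out
```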




Referring now to FIG. 6A, there is shown a schematic diagram of an alternative embodiment of a frame 700 for scalable audio coding according to the present invention. Frame 700 defines the allocation of data capacity for a twenty-four bit wide AES3 data channel 701. The AES3 data channel comprises a series of twenty-four bit wide words. The AES3 data channel includes a core layer 710 and two augmentation layers, identified as an intermediate layer 720 and a fine layer 730. The core layer 710 comprises bits (0˜15), the intermediate layer 720 comprises bits (16˜19), and the fine layer 730 comprises bits (20˜23), respectively, of each word. The fine layer 730 thus comprises the four least significant bits of the AES3 data channel, and the intermediate layer 720 the next four least significant bits of that data channel.




Data capacity of the data channel 701 is allocated to support decoding of audio at a plurality of resolutions. These resolutions are referred to herein as a sixteen bit resolution supported by the core layer 710, a twenty bit resolution supported by the union of the core layer 710 and intermediate layer 720, and a twenty-four bit resolution supported by the union of the three layers 710, 720, 730. It should be understood that the number of bits in each resolution mentioned above refers to the capacity of each respective layer during transmission or storage and does not refer to the quantization resolution or bit length of the symbols carried in the various layers to represent encoded audio signals. As a result, the so-called "sixteen bit resolution" corresponds to perceptual coding at a basic resolution and typically is perceived upon decode and playback to be more accurate than sixteen bit PCM audio signals. Similarly, the twenty and twenty-four bit resolutions correspond to perceptual codings at progressively higher resolutions and typically are perceived to be more accurate than corresponding twenty and twenty-four bit PCM audio signals, respectively.




Frame 700 is divided into a series of segments that include a synchronization segment 740, metadata segment 750, and audio segment 760, and may optionally include a metadata extension segment 770, audio extension segment 780, and a meter segment 790. The metadata extension segment 770 and audio extension segment 780 are dependent on one another, and accordingly, either both are included or neither is included. In this embodiment of frame 700, each segment includes portions in each layer 710, 720, 730. Referring now also to FIGS. 6B, 6C, and 6D, there are shown schematic diagrams of preferred structures for the audio and audio extension segments 760 and 780, the metadata segment 750, and the metadata extension segment 770.




In the synchronization segment 740, bits (0˜15) carry a sixteen bit synchronization pattern, bits (16˜19) carry one or more error detection codes for the intermediate layer 720, and bits (20˜23) carry one or more error detection codes for the fine layer 730. Errors in augmentation data typically yield subtle audible effects, and accordingly data protection is beneficially limited to codes of four bits per augmentation layer to save data capacity in the AES3 data channel. Additional data protection for augmentation layers 720, 730 may be provided in the metadata segment 750 and metadata extension segment 770 as discussed below. Optionally, two different data protection values may be specified for each respective augmentation layer 720, 730. Either provides data protection for the respective layer 720, 730. The first value of data protection indicates that the respective layer of the audio segment 760 is configured in a predetermined manner such as aligned configuration. The second value of data protection indicates that pointers carried by the metadata segment 750 indicate where augmentation data is carried in the respective layer of the audio segment 760 and, if the audio extension segment 780 is included, that pointers in the metadata extension segment 770 indicate where augmentation data is carried in the respective layer of the audio extension segment 780.




Audio segment 760 is substantially similar to the audio segment 360 of frame 390 described above. Audio segment 760 includes a first subsegment 761 and a second subsegment 7610. The first subsegment 761 includes a data protection segment 767 and four channel subsegments (CS_0, CS_1, CS_2, CS_3), each comprising a respective subsegment 763, 764, 765, 766 of first subsegment 761, and may optionally include a prefix 762. The channel subsegments correspond to four respective audio channels (CH_0, CH_1, CH_2, CH_3) of a multi-channel audio signal.




In optional prefix 762, the core layer 710 carries a forbidden pattern key (KEY1_C) for avoiding forbidden patterns within that portion of the first subsegment carried by core layer 710, the intermediate layer 720 carries a forbidden pattern key (KEY1_I) for avoiding forbidden patterns within that portion of the first subsegment carried by intermediate layer 720, and the fine layer 730 carries a forbidden pattern key (KEY1_F) for avoiding forbidden patterns within that portion of the first subsegment carried by fine layer 730.




In channel subsegment CS_0, the core layer 710 carries a first coded signal for audio channel CH_0, the intermediate layer 720 carries a first residue signal for audio channel CH_0, and the fine layer 730 carries a second residue signal for audio channel CH_0. These preferably are coded into each corresponding layer using the coding process 400 modified as discussed below. Channel subsegments CS_1, CS_2, CS_3 carry data respectively for audio channels CH_1, CH_2, CH_3 in like manner.




In data protection segment 767, the core layer 710 carries one or more error detection codes for that portion of the first subsegment carried by core layer 710, the intermediate layer 720 carries one or more error detection codes for that portion of the first subsegment carried by intermediate layer 720, and the fine layer 730 carries one or more error detection codes for that portion of the first subsegment carried by fine layer 730. Data protection preferably is provided by a cyclic redundancy code (CRC) in this embodiment.




The second subsegment 7610 includes in like manner a data protection segment 7670 and four channel subsegments corresponding to audio channels CH_4, CH_5, CH_6, CH_7, each comprising a respective subsegment 7630, 7640, 7650, 7660 of second subsegment 7610, and may optionally include a prefix 7620. The second subsegment 7610 is configured in a similar manner as the first subsegment 761. The audio extension segment 780 is configured like the audio segment 760 and allows for two or more segments of audio within a single frame, and may thereby reduce expended data capacity in the standard AES3 data channel.




The metadata segment 750 is configured as follows. That portion of metadata segment 750 carried by core layer 710 includes a header segment 751, a frame control segment 752, a metadata subsegment 753, and a data protection subsegment 754. That portion of metadata segment 750 carried by the intermediate layer 720 includes an intermediate metadata subsegment 755 and a data protection subsegment 757, and that portion of metadata segment 750 carried by the fine layer 730 includes a fine metadata subsegment 756 and a data protection subsegment 758. The data protection subsegments 754, 757, 758 need not be aligned between layers, but each preferably is located at the end of its respective layer or at some other predetermined location.




Header 751 carries format data that indicates program configuration and frame rate. Frame control segment 752 carries segment data that specifies boundaries of segments and subsegments in the synchronization, metadata, and audio segments 740, 750, 760. Metadata subsegments 753, 755, 756 carry parameter data that indicates parameters of encoding operations performed for coding audio data into the core, intermediate, and fine layers 710, 720, 730, respectively. These indicate which type of coding operation is used to code the respective layer. Preferably, the same type of coding operation is used for each layer, with the resolution adjusted to reflect relative amounts of data capacity in the layers. It is alternatively permissible to carry parameter data for the intermediate and fine layers 720, 730 in the core layer 710. However, all parameter data for the core layer 710 preferably is included only in the core layer 710 so that augmentation layers 720, 730 can be stripped off or ignored, for example by signal routing circuitry, without affecting the ability to decode the core layer 710. Data protection subsegments 754, 757, 758 carry one or more error detection codes for protecting the core, intermediate, and fine layers 710, 720, 730, respectively.




The metadata extension segment 770 is substantially similar to the metadata segment 750, except that the metadata extension segment 770 does not include a frame control segment 752. The boundaries of segments and subsegments in the metadata extension and audio extension segments 770, 780 are indicated by their substantial similarity to the metadata and audio segments 750, 760 in combination with the segment data carried by the frame control segment 752 in the metadata segment 750.




Optional meter segment 790 carries average amplitudes of coded audio data carried in frame 700. In particular, where the audio extension segment 780 is omitted, bits (0˜15) of meter segment 790 carry a representation of an average amplitude of coded audio data carried in bits (0˜15) of audio segment 760, and bits (16˜19) and (20˜23) respectively carry extension data designated as intermediate meter (IM) and fine meter (FM). The IM may be an average amplitude of coded audio data carried in bits (16˜19) of audio segment 760, and the FM may be an average amplitude of coded audio data carried in bits (20˜23) of audio segment 760, for example. Where the audio extension segment 780 is included, the average amplitudes, IM, and FM preferably reflect the coded audio carried in respective layers of that segment 780. The meter segment 790 supports convenient display of average audio amplitude at decode. This typically is not essential to proper decoding of audio and may be omitted, for example, to save data capacity on the AES3 data channel.




Coding of audio data into frame 700 preferably is implemented using scalable coding processes 400 and 420 modified as follows. Audio subband signals for each of the eight channels are received. These subband signals preferably are generated by applying a block transform to blocks of samples for eight corresponding channels of time-domain audio data and grouping the transform coefficients to form the subband signals. The subband signals are each represented in block-floating-point form comprising a block exponent and a mantissa for each coefficient in the subband.




The dynamic range of subband exponents of a given bit length may be expanded by using a "master exponent" for a group of subbands. Exponents for the subbands in the group are compared to a threshold to determine the value of the associated master exponent. If each subband exponent in the group is greater than a threshold of three, for example, the value of the master exponent is set to one and the associated subband exponents are reduced by three; otherwise the master exponent is set to zero.
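The master-exponent rule just described, with the example threshold of three, can be sketched as follows (the function name is illustrative):

```python
def apply_master_exponent(subband_exponents, threshold=3):
    """Expand exponent dynamic range with a shared master exponent: if
    every exponent in the group exceeds the threshold, set the master
    exponent to one and reduce each subband exponent by the threshold;
    otherwise set the master exponent to zero and leave them unchanged."""
    if all(e > threshold for e in subband_exponents):
        return 1, [e - threshold for e in subband_exponents]
    return 0, list(subband_exponents)
```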




The gain-adaptive quantization technique discussed briefly above may also be used. In one embodiment, mantissas for each subband signal are assigned to two groups according to whether they are greater than one-half in magnitude. Mantissas less than or equal to one-half are doubled in value to reduce the number of bits needed to represent them. Quantization of the mantissas is adjusted to reflect this doubling. Mantissas can alternatively be assigned to more than two groups. For example, mantissas may be assigned to three groups depending on whether their magnitudes are between 0 and ¼, ¼ and ½, or ½ and 1, scaled respectively by 4, 2, and 1, and quantized accordingly to save additional data capacity. Additional information may be obtained from the U.S. patent application cited above.
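The three-group variant above can be sketched as follows. The group numbering and the treatment of boundary magnitudes are illustrative assumptions:

```python
def gain_adapt(mantissas):
    """Assign each mantissa to a gain group by magnitude and scale it up,
    reducing the bits needed to represent small values:
    (0, 1/4] scaled by 4, (1/4, 1/2] scaled by 2, (1/2, 1] scaled by 1."""
    out = []
    for m in mantissas:
        a = abs(m)
        if a <= 0.25:
            out.append((2, m * 4))  # (group index, scaled value)
        elif a <= 0.5:
            out.append((1, m * 2))
        else:
            out.append((0, m))
    return out
```

The group index must also be conveyed (or inferred) so the decoder can undo the scaling before dequantization.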




Auditory masking curves are generated for each channel. Each auditory masking curve may be dependent on audio data of multiple channels (up to eight in this implementation) and not just one or two channels. Scalable coding process 400 is applied to each channel using these auditory masking curves and with the modifications to quantization of mantissas discussed above. The iterative process 420 is applied to determine appropriate quantization resolutions for coding each layer. In this embodiment, a coding range is specified as about −144 dB to about +48 dB relative to the corresponding auditory masking curve. The resulting first coded signal and first and second residue signals for each channel generated by processes 400 and 420 are then analyzed to determine forbidden pattern keys KEY1_C, KEY1_I, KEY1_F for the first subsegment 761 (and similarly for the second subsegment 7610) of the audio segment 760.




Control data for the metadata segment 750 is generated for the first block of multi-channel audio. Control data for the metadata extension segment 770 is generated for a second block of the multi-channel audio in similar manner, except that segment information for the second block is omitted. These are modified by the respective forbidden pattern keys as discussed above and output in the metadata segment 750 and metadata extension segment 770, respectively.




The above described process is also performed on a second block of the eight audio channels, with the generated coded signals output in similar manner in the audio extension segment 780. Control data is generated for the second block of multi-channel audio in essentially the same manner as for the first such block, except that no segment data is generated for the second block. This control data is output in the metadata extension segment 770.




A synchronization pattern is output in bits (0˜15) of the synchronization segment 740. Two four bit wide error detection codes are generated respectively for the intermediate and fine layers 720, 730 and output respectively in bits (16˜19) and bits (20˜23) of the synchronization segment 740. In this embodiment, errors in augmentation data typically yield subtle audible effects, and accordingly, error detection is beneficially limited to codes of four bits per augmentation layer to save data capacity in the standard AES3 data channel.




According to the present invention, the error detection codes can have predetermined values, such as "0001", that do not depend on the bit pattern of the data protected. Error detection is provided by inspecting such an error detection code to determine whether the code itself has been corrupted. If so, it is presumed that other data in the layer is corrupt, and another copy of the data is obtained or, alternatively, the error is muted. A preferred embodiment specifies multiple predetermined error detection codes for each augmentation layer. These codes also indicate the layer's configuration. A first error detection code, "0101" for example, indicates that the layer has a predetermined configuration, such as aligned configuration. A second error detection code, "1001" for example, indicates that the layer has a distributed configuration, and that pointers or other data are output in the metadata segment 750 or another location to indicate the distribution pattern of data in the layer. There is little possibility that one code could be corrupted during transmission to yield the other, because two bits of the code must be corrupted without corrupting the remaining bits. The embodiment is thus substantially immune to single bit transmission errors. Moreover, any error in decoding the augmentation layers typically yields at most a subtle audible effect.
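Using the example codes "0101" and "1001" given above, the single-bit-error immunity follows from their Hamming distance of two, as this sketch verifies:

```python
CODE_ALIGNED = 0b0101      # layer has the predetermined (aligned) configuration
CODE_DISTRIBUTED = 0b1001  # metadata pointers describe the distribution pattern

def hamming(a, b):
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")

def classify(code):
    """Return the indicated configuration, or None if the code itself is
    corrupted, in which case the layer's data is presumed corrupt."""
    if code == CODE_ALIGNED:
        return "aligned"
    if code == CODE_DISTRIBUTED:
        return "distributed"
    return None
```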




In an alternative embodiment of the present invention, other forms of entropy coding are applied to compress the audio data. For example, in one alternative embodiment a sixteen bit entropy coding process generates compressed audio data that is output in the core layer. The coding is then repeated at a higher resolution to generate a trial coded signal. The trial coded signal is combined with the compressed audio data to generate a trial residue signal. This is repeated as necessary until the trial residue signal efficiently utilizes the data capacity of a first augmentation layer, and the trial residue signal is output in the first augmentation layer. This is repeated for a second or for multiple additional augmentation layers by again increasing the resolution of the entropy coding.




Upon reviewing the application, various modifications and variations of the present invention will be apparent to those skilled in the art. Such modifications and variations are provided for by the present invention, which is limited only by the following claims.



Claims
  • 1. A scalable coding process, the process using a standard data channel that has a core layer and an augmentation layer, the process comprising:receiving a plurality of subband signals; determining a respective first quantization resolution for each subband signal in response to a first desired noise spectrum and quantizing each subband signal according to the respective first quantization resolution to generate a first coded signal; determining a respective second quantization resolution for each subband signal in response to a second desired noise spectrum and quantizing each subband signal according to the respective second quantization resolution to generate a second coded signal; generating a residue signal that indicates a residue between the first and second coded signals; and outputting the first coded signal in the core layer and the residue signal in the augmentation layer.
  • 2. The process of claim 1, wherein the first desired noise spectrum is established in response to auditory masking characteristics of the subband signals determined according to psychoacoustic principles.
  • 3. The process of claim 1, wherein the first quantization resolutions are determined responsive to subband signals quantized according to such first quantization resolutions meeting a data capacity requirement of the core layer.
  • 4. The process of claim 1, wherein the first coded signal and residue signal are output in aligned configuration.
  • 5. The process of claim 1, wherein additional data is output to indicate a configuration pattern of the residue signal with respect to the first coded signal.
  • 6. The process of claim 1, wherein the second desired noise spectrum is offset from the first desired noise spectrum by a substantially uniform amount, and wherein an indication of the substantially uniform amount is output in the standard data channel.
  • 7. The process of claim 1, wherein the first coded signal comprises a plurality of scale factors, and wherein the residue signal is represented by the scale factors of the first coded signal.
  • 8. The process of claim 1, wherein a subband signal quantized to respective second quantization resolution is represented by a scaled value comprising a sequence of bits, and wherein the subband signal quantized to respective first quantization resolution is represented by another scaled value comprising a subsequence of said bits.
  • 9. A scalable coding process, the process using a standard data channel that has a plurality of layers, the process comprising:receiving a plurality of subband signals; generating a perceptual coding and a second coding of the subband signals; generating a residue signal that indicates a residue of the second coding relative to the perceptual coding; and outputting the perceptual coding in a first layer and the residue signal in a second layer.
  • 10. The scalable coding process of claim 9, further comprising:generating a third coding of the subband signals; generating a second residue signal that indicates a residue of the third coding relative to at least one of the perceptual and second codings; and outputting the second residue signal in a third layer.
  • 11. The scalable coding process of claim 10, wherein the data channel conforms to standard AES3 of the Audio Engineering Society, the first layer is a 16 bit wide layer of the data channel, and the second and third layers are each a 4 bit wide layer of the data channel.
  • 12. The process of claim 9, further comprising:generating error detection data that indicates configuration of the residue signal with respect to the perceptual coding; and outputting the error detection data in the standard data channel.
  • 13. The process of claim 9, further comprising:generating a sequence of bits; outputting the sequence of bits in the standard data channel; receiving a sequence of bits corresponding to the output sequence of bits at a receiver; analyzing the received sequence of bits to determine whether it matches the generated sequence of bits; and determining in response to the analysis whether one of the perceptual coding and the residue signal includes a transmission error.
  • 14. The process of claim 9, wherein the second coding is generated responsive to data capacity of the union of the first and second layers.
  • 15. A method of processing data carried by a multi-layer data channel, wherein a first layer of the data channel carries a perceptual coding of an audio signal and a second layer of the data channel carries augmentation data for increasing the resolution of the perceptual coding of the audio signal, the method using a decoder and comprising:receiving the perceptual coding and augmentation data via the data channel; and routing the perceptual coding of the audio signal to the decoder.
  • 16. The method of claim 15, further comprising decoding the perceptual coding of the audio signal.
  • 17. The method of claim 15, further comprising:combining the perceptual coding with the augmentation data to generate a second coding of the audio signal having higher resolution than the perceptual coding of the audio signal; and decoding the second coding of the audio signal.
  • 18. The method of claim 17, wherein the perceptual coding is received along a core sixteen bit layer of a data channel conforming to standard AES3 of the Audio Engineering Society, and wherein the augmentation data is received along at least one four bit wide augmentation layer of the data channel.
  • 19. The method of claim 17, wherein combining the perceptual coding with the augmentation data comprises:identifying a plurality of segments along the data channel each corresponding to a distinct audio channel; and combining each portion of the perceptual coding carried by one of the segments with each portion of the augmentation data carried by said one of the segments to generate an intermediate signal that represents one of the audio channels.
  • 20. The method of claim 17, wherein combining the perceptual coding with the augmentation data comprises:identifying a segment along the data channel that corresponds to a single audio channel; processing the augmentation data to determine a location of a residue for said audio channel and recovering the residue; and combining each portion of the perceptual coding carried by the segment with the residue to generate an intermediate signal that represents said audio channel at a resolution higher than the perceptual coding of the audio signal.
  • 21. A processing system for a standard data channel, the standard data channel having a core layer and an augmentation layer, the processing system comprising:a memory unit that stores a program of instructions; a program-controlled processor coupled to receive a plurality of subband signals, and coupled to the memory unit for receiving the program; responsive to the program, the program-controlled processor determining a respective first quantization resolution for each subband signal in response to a first desired noise spectrum and quantizing each subband signal according to the respective first quantization resolution to generate a first coded signal, determining a respective second quantization resolution for each subband signal in response to a second desired noise spectrum and quantizing each subband signal according to the respective second quantization resolution to generate a second coded signal, generating a residue signal that indicates a residue between the first and second coded signals, and outputting the first coded signal on the core layer and the residue signal on the augmentation layer.
  • 22. The processing system of claim 21, wherein, in response to the program, the program-controlled processor determines auditory masking characteristics of the subband signals according to psychoacoustic principles and establishes the first desired noise spectrum in response to the determined auditory masking characteristics.
  • 23. The processing system of claim 21, wherein, in response to the program, the program-controlled processor determines the first quantization resolutions so that subband signals quantized according to the determined first quantization resolutions meet a data capacity requirement of the core layer.
  • 24. The processing system of claim 21, wherein, in response to the program, the program-controlled processor outputs the first coded signal and residue signal in aligned configuration.
  • 25. The processing system of claim 21, wherein, in response to the program, the program-controlled processor outputs on the data channel additional data that indicates a configuration pattern of the residue signal with respect to the first coded signal.
  • 26. The processing system of claim 21, wherein, responsive to the program, the program-controlled processor determines the second desired noise spectrum by offsetting the first desired noise spectrum by a substantially uniform amount and outputs an indication of the substantially uniform amount in the standard data channel.
  • 27. The processing system of claim 21, wherein, responsive to the program, the program-controlled processor generates a plurality of scale factors that represent the first coded signal and uses the generated scale factors to represent the residue signal.
  • 28. The processing system of claim 21, wherein a subband signal quantized to respective second quantization resolution is represented by a scaled value comprising a sequence of bits, and wherein the subband signal quantized to respective first quantization resolution is represented by another scaled value comprising a subsequence of said bits.
  • 29. A processing system for a multi-layer data channel, wherein a first layer of the data channel carries a perceptual coding of an audio signal and a second layer of the data channel carries augmentation data for increasing the resolution of the perceptual coding of the audio signal, the processing system comprising:signal routing circuitry that receives the perceptual coding and augmentation data via the data channel; a memory unit that stores a program of instructions; and a program-controlled processor coupled to the signal routing circuitry for receiving the perceptual coding and augmentation data, and coupled to the memory unit for receiving the program, and responsive to the program, generating a decoded signal.
  • 30. The processing system of claim 29, wherein the program-controlled processor decodes the perceptual coding of the audio signal to generate the decoded signal.
  • 31. The processing system of claim 29, wherein the program-controlled processor:combines the perceptual coding with the augmentation data to generate a second coding of the audio signal having higher resolution than the perceptual coding of the audio signal; and decodes the second coding of the audio signal to generate the decoded signal.
  • 32. The processing system of claim 29, wherein the signal routing circuitry receives the perceptual coding along a core sixteen bit layer of a data channel conforming to standard AES3 of the Audio Engineering Society, and receives the augmentation data along at least one four bit wide augmentation layer of the data channel.
  • 33. The processing system of claim 29, wherein the program-controlled processor:identifies a plurality of segments along the data channel each corresponding to a distinct audio channel; and combines each portion of the perceptual coding carried by one of the segments with each portion of the augmentation data carried by said one of the segments to generate an intermediate signal that represents one of the audio channels.
  • 34. The processing system of claim 29, wherein the program-controlled processor:identifies a segment along the data channel that corresponds to a single audio channel; processes the augmentation data to determine a location of a residue for said audio channel and recovering the residue; and combines each portion of the perceptual coding carried by the segment with the residue to generate an intermediate signal that represents said audio channel at a resolution higher than the perceptual coding of the audio signal.
  • 35. A medium readable by a machine, the medium carrying a program of instructions executable by the machine to perform a coding process, the coding process using a standard data channel that has a core layer and an augmentation layer, the process comprising:receiving a plurality of subband signals; determining a respective first quantization resolution for each subband signal in response to a first desired noise spectrum and quantizing each subband signal according to the respective first quantization resolution to generate a first coded signal; determining a respective second quantization resolution for each subband signal in response to a second desired noise spectrum and quantizing each subband signal according to the respective second quantization resolution to generate a second coded signal; generating a residue signal that indicates a residue between the first and second coded signals; and outputting the first coded signal in the core layer and the residue signal in the augmentation layer.
  • 36. The medium of claim 35, wherein the first desired noise spectrum is established in response to auditory masking characteristics of the subband signals determined according to psychoacoustic principles.
  • 37. The medium of claim 35, wherein the first quantization resolutions are determined responsive to subband signals quantized according to such first quantization resolutions meeting a data capacity requirement of the core layer.
  • 38. The medium of claim 35, wherein the first coded signal and residue signal are output in aligned configuration.
  • 39. The medium of claim 35, wherein additional data is output to indicate a configuration pattern of the residue signal with respect to the first coded signal.
  • 40. The medium of claim 35, wherein the second desired noise spectrum is offset from the first desired noise spectrum by a substantially uniform amount, and wherein an indication of the substantially uniform amount is output in the standard data channel.
  • 41. The medium of claim 35, wherein the first coded signal comprises a plurality of scale factors, and wherein the residue signal is represented by the scale factors of the first coded signal.
  • 42. The medium of claim 35, wherein a subband signal quantized to respective second quantization resolution is represented by a scaled value comprising a sequence of bits, and wherein the subband signal quantized to respective first quantization resolution is represented by another scaled value comprising a subsequence of said bits.
  • 43. A medium readable by a machine, the medium carrying a program of instructions executable by the machine to perform a method of processing data carried by a multi-layer data channel, wherein a first layer of the data channel carries a perceptual coding of an audio signal and a second layer of the data channel carries augmentation data for increasing the resolution of the perceptual coding of the audio signal, the method using a decoder and comprising:receiving the perceptual coding and augmentation data via the data channel; and routing the perceptual coding of the audio signal to the decoder.
  • 44. The medium of claim 43, further comprising decoding the perceptual coding of the audio signal.
  • 45. The medium of claim 43, further comprising:combining the perceptual coding with the augmentation data to generate a second coding of the audio signal having higher resolution than the perceptual coding of the audio signal; and decoding the second coding of the audio signal.
  • 46. The medium of claim 43, wherein the perceptual coding is received along a core sixteen bit layer of a data channel conforming to standard AES3 of the Audio Engineering Society, and wherein the augmentation data is received along at least one four bit wide augmentation layer of the data channel.
  • 47. The medium of claim 45, wherein combining the perceptual coding with the augmentation data comprises:identifying a plurality of segments along the data channel each corresponding to a distinct audio channel; and combining each portion of the perceptual coding carried by one of the segments with each portion of the augmentation data carried by said one of the segments to generate an intermediate signal that represents one of the audio channels.
  • 48. The medium of claim 45, wherein combining the perceptual coding with the augmentation data comprises:identifying a segment along the data channel that corresponds to a single audio channel; processing the augmentation data to determine a location of a residue for said audio channel and recovering the residue; and combining each portion of the perceptual coding carried by the segment with the residue to generate an intermediate signal that represents said audio channel at a resolution higher than the perceptual coding of the audio signal.
  • 49. A machine readable medium that carries encoded audio information, the encoded audio information generated according to a coding process that comprises:receiving a plurality of subband signals; determining a respective first quantization resolution for each subband signal in response to a first desired noise spectrum and quantizing each subband signal according to the respective first quantization resolution to generate a first coded signal; determining a respective second quantization resolution for each subband signal in response to a second desired noise spectrum and quantizing each subband signal according to the respective second quantization resolution to generate a second coded signal; generating a residue signal that indicates a residue between the first and second coded signals; and outputting the first coded signal in a core layer of a standard data channel and the residue signal in an augmentation layer of the standard data channel.
  • 50. The medium of claim 49, wherein the first desired noise spectrum is established in response to auditory masking characteristics of the subband signals determined according to psychoacoustic principles.
  • 51. The medium of claim 49, wherein the first quantization resolutions are determined responsive to subband signals quantized according to such first quantization resolutions meeting a data capacity requirement of the core layer.
  • 52. The medium of claim 49, wherein the first coded signal and residue signal are output in aligned configuration.
  • 53. The medium of claim 49, wherein additional data is output to indicate a configuration pattern of the residue signal with respect to the first coded signal.
  • 54. The medium of claim 49, wherein the second desired noise spectrum is offset from the first desired noise spectrum by a substantially uniform amount, and wherein an indication of the substantially uniform amount is output in the standard data channel.
  • 55. The medium of claim 49, wherein the first coded signal comprises a plurality of scale factors, and wherein the residue signal is represented by the scale factors of the first coded signal.
  • 56. The medium of claim 49, wherein a subband signal quantized to respective second quantization resolution is represented by a scaled value comprising a sequence of bits, and wherein the subband signal quantized to respective first quantization resolution is represented by another scaled value comprising a subsequence of said bits.
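The bit-subsequence relation recited in claims 8, 28, 42, and 56 can be sketched concretely: a subband value quantized to the finer "second" resolution is a sequence of bits whose high-order subsequence is the value quantized to the coarser "first" resolution, so the augmentation-layer residue is simply the low-order bits that the core layer omits. The following Python sketch is illustrative only; the function names, the uniform quantizer, and the 16/20 bit widths are assumptions for demonstration and are not part of the patent.

```python
def quantize(sample: float, bits: int) -> int:
    """Uniformly quantize a sample in [-1.0, 1.0) to a signed integer."""
    levels = 1 << (bits - 1)
    q = int(sample * levels)
    return max(-levels, min(levels - 1, q))

def split_core_and_residue(sample: float, core_bits: int, fine_bits: int):
    """Quantize once at the finer resolution, then derive the core-layer
    value and the augmentation-layer residue from the same bit sequence."""
    fine = quantize(sample, fine_bits)
    shift = fine_bits - core_bits
    core = fine >> shift                  # high-order subsequence -> core layer
    residue = fine & ((1 << shift) - 1)   # low-order bits -> augmentation layer
    return core, residue

def recombine(core: int, residue: int, core_bits: int, fine_bits: int) -> int:
    """Decoder side: append the residue bits to recover the finer value."""
    shift = fine_bits - core_bits
    return (core << shift) | residue

# A core-only decoder uses `core` alone; an augmented decoder recombines.
core, residue = split_core_and_residue(0.3, core_bits=16, fine_bits=20)
assert recombine(core, residue, 16, 20) == quantize(0.3, 20)
```

Because Python's right shift is arithmetic and its integers behave as unbounded two's complement, the recombination is exact for negative quantized values as well.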
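Claims 11, 18, 32, and 46 describe a 16 bit wide core layer and 4 bit wide augmentation layers within an AES3 data channel. One way to picture this is as a 24 bit audio sample word partitioned into a core field and two augmentation fields, so that a legacy receiver reading only the 16 most significant bits still obtains the perceptual coding. The bit positions below are assumptions for illustration, not the patent's or the AES3 standard's prescribed layout.

```python
def pack_word(core16: int, aug_a: int, aug_b: int) -> int:
    """Pack a 16 bit core value and two 4 bit augmentation nibbles
    into a single 24 bit word, core in the most significant bits."""
    return ((core16 & 0xFFFF) << 8) | ((aug_a & 0xF) << 4) | (aug_b & 0xF)

def unpack_word(word: int):
    """Split a 24 bit word back into its core and augmentation layers."""
    return (word >> 8) & 0xFFFF, (word >> 4) & 0xF, word & 0xF

word = pack_word(0xABCD, 0x3, 0xC)
assert unpack_word(word) == (0xABCD, 0x3, 0xC)
```

In this arrangement the routing step of claim 15 reduces to taking `word >> 8` for the core decoder, while an augmented decoder (claim 17) also extracts the low nibbles to rebuild the higher-resolution coding.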
US Referenced Citations (16)
Number Name Date Kind
4972484 Theile et al. Nov 1990 A
5253055 Civanlar et al. Oct 1993 A
5253056 Puri et al. Oct 1993 A
5270813 Puri et al. Dec 1993 A
5530655 Lokhoff et al. Jun 1996 A
5537510 Kim Jul 1996 A
5640486 Lim Jun 1997 A
5712920 Spille Jan 1998 A
5721806 Lee Feb 1998 A
5812672 Herre et al. Sep 1998 A
5832427 Shibuya Nov 1998 A
5930750 Tsutsui Jul 1999 A
6092041 Pan et al. Jul 2000 A
6094636 Kim Jul 2000 A
6108625 Kim Aug 2000 A
6349284 Park et al. Feb 2002 B1
Foreign Referenced Citations (9)
Number Date Country
9669248 Apr 1997 AU
9855571 Sep 1998 AU
0734021 Sep 1996 EP
0869622 Oct 1998 EP
0884850 Dec 1998 EP
0918401 May 1999 EP
0918407 May 1999 EP
0919989 Jun 1999 EP
2320870 Jul 1998 GB
Non-Patent Literature Citations (14)
Entry
Advanced Television Systems Committee (ATSC), “Digital Audio Compression Standard (AC-3),” Document A/52, pp. i-vii and 1-130, USA, Dec. 1995.
ISO/IEC 11172-3:1993, “Information technology - Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s - Part 3: Audio,” pp. i-v and 1-150, Genève, Switzerland, (Aug. 1993).
G. Stoll, M. Link and G. Theile, “Masking-pattern adapted subband coding: use of the dynamic bit-rate margin,” presented at the 84th Convention of the Audio Engineering Society, Paris, France, Preprint 2585, pp. 1-33, (Mar. 1988).
J. Stautner, “Scalable Audio Compression for Mixed Computing Environments,” presented at the 93rd Convention of the Audio Engineering Society, San Francisco, California, Preprint 3357, pp. 1-6, Figs. 1-3, Table 1, (Oct. 1992).
P. Tudor and N. Wells, “Scalable source coding for HDTV,” from Audio and Video Digital Radio Broadcasting Systems and Techniques, pp. 131-142, Elsevier Science BV, Surrey, United Kingdom, (1994).
ISO/IEC 13818-3:1998(E), “Information Technology - Generic coding of moving pictures and associated audio information - Part 3: Audio,” pp. i-x and 1-115, Geneva, Switzerland, (Apr. 1998).
K. Brandenburg and B. Grill, “First Ideas on Scalable Audio Coding,” presented at the 97th Convention of the Audio Engineering Society, San Francisco, California, Preprint 3924, pp. 1-6, Figs. 1-3, and Table 1, (Nov. 1994).
B. Grill and K. Brandenburg, “A Two- or Three-Stage Bit Rate Scalable Audio Coding System,” presented at the 99th Convention of the Audio Engineering Society, New York, NY, Preprint 4132, pp. 1-7, Figs. 1-3, (Oct. 1995).
B. Grill, “A Bit Rate Scalable Perceptual Coder for MPEG-4 Audio,” presented at the 103rd Convention of the Audio Engineering Society, New York, NY, Preprint 4620, pp. 1-16 and Fig. 1-8, (Sep. 1997).
S. Park, Y. Kim, S. Kim and Y. Seo, “Multi-Layer Bit-Sliced Bit-Rate Scalable Audio Coding,” presented at the 103rd Convention of the Audio Engineering Society, New York, NY, Preprint 4520, pp. 1-11, (Sep. 1997).
P. Kudumakis and M. Sandler, “Wavelet Packet Based Scalable Audio Coding,” Proceedings of the IEEE International Symposium on Circuits and Systems, Atlanta, vol. 2, pp. 41-44, (May 1996).
Y. Nakajima, H. Yanagihara, A. Yoneyama and M. Sugano, “MPEG Audio Bit Rate Scaling on Coded Data Domain,” Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 6, pp. 3669-3672, (1998).
G. Davidson, L. Fielder and B. Link, “Parametric Bit Allocation in a Perceptual Audio Coder,” presented at the 97th Convention of the Audio Engineering Society, San Francisco, California, Preprint 3921, pp. 1-15 and Figs. 1-9, (Nov. 1994).
A. Jin, T. Moriya, T. Norimatsu, M. Tsushima and T. Ishikawa, “Scalable Audio Coder Based on Quantizer Units of MDCT Coefficients,” presented at the International Conference on Acoustics, Speech and Signal Processing, Phoenix, (May 1999).