1. Field of the Invention
The present invention relates to speech coding/decoding, and more particularly, to an apparatus, method, and medium for reproducing a scalable wide-band speech signal.
2. Description of the Related Art
With the growing number of speech communication applications in various fields, and with increasing network transmission speeds, there is an emerging demand for high-fidelity speech communication. Accordingly, wide-band speech signals in the range of 0.05 kHz to 7 kHz, which offer better naturalness and intelligibility than the conventional speech communication band of 0.3 kHz to 3.4 kHz, need to be transmitted.
In a packet switching network in which data is transmitted in units of packets, a channel bottleneck may occur, which may lead to packet loss and poor speech quality. Although a technique for hiding packet damage is known, it is not a satisfactory solution. Thus, a technique for scalable coding/decoding of a wide-band speech signal has been proposed in which the wide-band speech signal can be effectively compressed and the channel bottleneck can be reduced.
Currently proposed methods of coding/decoding wide-band speech signals include a method in which speech signals in the range of 0.05 kHz to 7 kHz are simultaneously compressed and then restored, and a method in which speech signals are hierarchically compressed by being divided into signals in the range of 0.05 kHz to 4 kHz and signals in the range of 4 kHz to 7 kHz, and then restored. The latter is a wide-band speech coding/decoding method that uses a bandwidth scalability function to enable optimum communication under a given channel condition by controlling the size of the layers to be transmitted according to the data bottleneck condition.
In the speech coding method using a bandwidth scalability function, a speech signal is coded and decoded using a hierarchical coding method. That is, the speech signal is coded after being divided into a core layer and a speech enhancement layer. The core layer transmits only the information needed to restore a minimum speech quality. The speech enhancement layer transmits additional information that enhances speech quality. A method for providing a bandwidth scalability function in order to enhance speech quality is disclosed in U.S. Pat. No. 5,455,888, which is incorporated by reference in its entirety.
However, if a high-band speech signal is coded using conventional methods, the speech signal cannot be easily restored with high fidelity when the speech signal is transmitted at a low bit-rate. Further, the lower the bit-rate, the poorer the speech restoring capability. In addition, the conventional methods have not provided scalable wide-band speech reproduction for reducing/eliminating the channel bottleneck.
Additional aspects, features and/or advantages of the invention will be set forth in part in the description which follows and, in part, will be apparent from the description, or may be learned by practice of the invention.
The present invention provides an apparatus, method, and medium capable of reproducing a scalable wide-band speech signal. In scalable wide-band speech coding/decoding, a high-quality speech signal is ensured for all layers by solving the problem that, when a high-band speech signal is coded and transmitted, speech restoration capability deteriorates as the bit-rate decreases.
The present invention also provides an apparatus, method, and medium for coding/decoding wide-band speech, wherein, in a wide-band speech coding/decoding apparatus having a quality and bandwidth extension function, the bits required for extension have a scalable structure.
According to an aspect of the present invention, there is provided a scalable speech coding apparatus having a mixed structure, the apparatus comprising: a band divider dividing a speech input signal into a low-band signal and a high-band signal according to a specific frequency, and outputting the low-band signal and the high-band signal; a low-band coder outputting a low-band first index by coding the low-band signal, transmitting information required for coding the high-band signal to a high-band coder, and transmitting an uncoded first error signal to a wide-band coder; a high-band coder outputting a high-band second index obtained when the high-band signal is coded by using information received from the low-band coder, and transmitting an uncoded second error signal to the wide-band coder; a wide-band coder quantizing coefficients of the first and second error signals using a modified discrete cosine transform (MDCT) method through time-frequency mapping, and outputting a low-band third index; and a bit-stream generator outputting a scalable bit-stream composed of the low-band first index received from the low-band coder, the high-band second index received from the high-band coder, and the low-band third index received from the wide-band coder.
According to another aspect of the present invention, there is provided a scalable speech coding method having a mixed structure, the method comprising: (a) dividing a speech input signal into a low-band signal and a high-band signal according to a specific frequency, and outputting the low-band signal and the high-band signal; (b) generating and outputting a low-band first index by coding the output low-band signal, and outputting specific information required for coding the high-band signal and an uncoded first error signal; (c) coding the output high-band signal by using the specific information, and outputting a high-band second index and an uncoded second error signal; (d) quantizing coefficients of the first and second error signals using a modified discrete cosine transform (MDCT) through time-frequency mapping, and outputting a low-band third index; and (e) outputting a scalable bit-stream composed of the low-band first index, the high-band second index, and the low-band third index.
According to another aspect of the present invention, there is provided a computer-readable medium having embodied thereon a computer program for executing the above-described scalable speech coding method having a mixed structure.
According to another aspect of the present invention, there is provided a scalable speech decoding apparatus having a mixed structure, the apparatus comprising: a bit-stream divider receiving a scalable bit-stream transmitted at a specific transmission rate according to a network condition, and transmitting the scalable bit-stream to each decoder of a corresponding frequency band by dividing the scalable bit-stream according to a frequency band used in reproduction; a low-band decoder receiving a low-band signal into which the scalable bit-stream is divided by the bit-stream divider, decoding and outputting the decoded low-band signal, and transmitting specific information required for decoding a high-band signal among coefficients decoded in a low-band; a high-band decoder decoding and outputting the high-band signal into which the scalable bit-stream is divided by the bit-stream divider, by using the specific information; a wide-band decoder decoding a wide-band signal into which the scalable bitstream is divided by the bit-stream divider and dividing and outputting the decoded wide-band signal into a low-band signal and a high-band signal according to a specific frequency; and a band combiner outputting a wide-band synthetic signal of a combined band by receiving a first synthetic signal, which is generated when a signal output from the low-band decoder is combined with the low-band signal output from the wide-band decoder, and a second synthetic signal which is generated when a signal output from the high-band decoder is combined with the high-band signal output from the wide-band decoder.
According to another aspect of the present invention, there is provided a scalable speech decoding method having a mixed structure, the method comprising: (a) receiving a scalable bit-stream transmitted at a specific transmission rate according to a network condition, and dividing and outputting the scalable bit-stream into a low-band signal, a high-band signal, and a wide-band signal according to a frequency band used for reproduction; (b) decoding and outputting the low-band signal of the scalable bitstream and outputting information on a pitch signal among coefficients decoded in a low-band; (c) receiving the high-band signal of the scalable bitstream and the pitch signal information and decoding and outputting the high-band signal using the pitch signal information; (d) receiving and decoding the wide-band signal of the scalable bitstream and dividing and outputting the decoded wide-band signal into a low-band signal and a high-band signal according to a specific frequency; and (e) outputting a wide-band synthetic signal of a combined band by receiving a first synthetic signal, which is generated when a signal output in (b) is combined with a low-band signal output in (d), and a second synthetic signal which is generated when a signal output in (c) is combined with a high-band signal output in (d).
According to another aspect of the present invention, there is provided a computer-readable medium having embodied thereon a computer program for executing the above-described scalable speech decoding method having a mixed structure.
These and/or other aspects, features, and advantages of the invention will become apparent and more readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
Reference will now be made in detail to exemplary embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the like elements throughout. Exemplary embodiments are described below to explain the present invention by referring to the figures.
Referring to
In operation 102, the speech coding apparatus according to an exemplary embodiment of the present invention illustrated in
In operation 104, the band divider 100 classifies the wide-band speech signal received in operation 102 into a low-band signal in the frequency range of 0˜4 kHz, and a high-band signal in the frequency range of 4˜8 kHz by using a reference frequency, for example 4 kHz. Then the band divider 100 outputs the low-band signal to the low-band coder 200 (A in
In operation 106, the low-band coder 200 receives a low-band signal component in the frequency range of 0˜4 kHz.
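The band division of operation 104 can be illustrated with a minimal sketch. It assumes a 16 kHz input, so that each critically sampled sub-band spans 4 kHz, and it uses a two-tap Haar filter pair as the simplest possible quadrature mirror filter; a practical coder would use a much longer QMF. The function names `split_bands` and `merge_bands` are hypothetical and do not come from the description.

```python
import math

def split_bands(x):
    """Split an even-length signal into low-band and high-band sub-signals
    using a two-tap Haar filter pair (illustrative stand-in for a QMF)."""
    if len(x) % 2:
        raise ValueError("expected an even number of samples")
    s = 1.0 / math.sqrt(2.0)
    low = [s * (x[i] + x[i + 1]) for i in range(0, len(x), 2)]
    high = [s * (x[i] - x[i + 1]) for i in range(0, len(x), 2)]
    return low, high

def merge_bands(low, high):
    """Inverse of split_bands: rebuild the full-band signal from both halves."""
    s = 1.0 / math.sqrt(2.0)
    x = []
    for l, h in zip(low, high):
        x.append(s * (l + h))  # even-indexed sample
        x.append(s * (l - h))  # odd-indexed sample
    return x
```

Each output runs at half the input sample rate, and the pair reconstructs the input exactly, which is the property the band combiner on the decoding side relies on.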
In operation 108, the low-band coder 200 codes the received low-band signal component using a code excited linear prediction (CELP) method.
Now, a process of coding the received low-band signal by using the CELP method will be described with reference to
The low-band coder 200 includes a core layer coder 210, a speech enhancement layer coder 220, and a multiplexer 230.
Now, a process of coding a low-band signal received from the low-band coder 200 of
In operation 110, the core layer coder 210 performs quantization after a linear prediction analyzer/quantizer (not shown) obtains a linear prediction coefficient, and transmits the quantized linear prediction coefficient to the multiplexer 230. An excited signal generated by using the quantized linear prediction coefficient is passed through a synthetic filter (not shown), thereby generating a first synthetic signal included in the core layer. The speech enhancement layer coder 220 also generates a first synthetic signal included in the speech enhancement layer corresponding to the first synthetic signal included in the core layer. The first synthetic signal included in the core layer and the first synthetic signal included in the speech enhancement layer are combined to generate a first synthetic signal. A difference between the low-band signal input to the low-band coder 200 and the first synthetic signal output from the low-band coder 200 is defined as a first error signal. The first error signal is transmitted to the wide-band coder 400 of
A perceptual weighting filter (not shown) performs perceptual weighting linear prediction by using the quantized linear prediction coefficient. A pitch analyzer (not shown) searches for a pitch by using a prediction signal output from the perceptual weighting filter. A contribution factor for the pitch of a signal passing through the perceptual weighting filter is removed by using the found pitch, and a signal which has to be searched for in a fixed codebook is obtained. The signal obtained from the fixed codebook is transmitted to the low-band coder 200. The core layer coder 210 obtains an index and gain of an adaptive codebook as well as an index and gain of the fixed codebook by using an analysis-by-synthesis method. Further, the core layer coder 210 quantizes gain values of the adaptive codebook and the fixed codebook, and transmits information on the quantized gain value of the fixed codebook to the speech enhancement layer coder 220. The core layer coder 210 transmits to the multiplexer 230 information obtained by quantizing the fixed codebook index, the adaptive codebook index and gain value in addition to the quantized linear prediction coefficient.
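As a rough illustration of the analysis-by-synthesis search for a codebook index and gain, the following sketch picks the codevector that minimizes the squared error against a target signal. It is a deliberate simplification: it omits the synthesis filter and the perceptual weighting filter that a CELP coder applies to each candidate, and the name `search_codebook` is hypothetical.

```python
def search_codebook(target, codebook):
    """Return (index, gain) of the codevector that best matches `target`
    in the least-squares sense, with the gain chosen optimally per vector."""
    best_index, best_gain, best_score = -1, 0.0, -1.0
    for index, vec in enumerate(codebook):
        corr = sum(t * c for t, c in zip(target, vec))
        energy = sum(c * c for c in vec)
        if energy == 0.0:
            continue  # an all-zero codevector cannot match anything
        score = corr * corr / energy  # error reduction achieved by this vector
        if score > best_score:
            best_index, best_gain, best_score = index, corr / energy, score
    return best_index, best_gain
```

Maximizing `corr * corr / energy` is equivalent to minimizing the residual error once the optimal gain `corr / energy` is applied, which is why the gain falls out of the same loop.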
The speech enhancement layer coder 220 generates a fixed codebook index and quantization information on a gain value difference included in the speech enhancement layer by using the signal obtained from a fixed codebook and which is received from the core layer coder 210 and information on a quantized gain value of the fixed codebook, and then transmits the generated information to the multiplexer 230.
The low-band coder 200 outputs information on low-band pitch delay generated by decoding the adaptive codebook index to the high-band coder 300. Further, the low-band coder 200 generates low-band excited signal energy by integrating quantized values of the adaptive codebook index and gain included in the core layer, the fixed codebook index and gain included in the core layer, the fixed codebook index included in the speech enhancement layer, and the gain value included in the speech enhancement layer, and then outputs the result to the high-band coder 300.
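The description does not state how the quantized contributions are combined into the low-band excited signal energy. A minimal sketch, assuming the usual CELP convention that the excitation is a gain-weighted sum of the codebook vectors, could look as follows; the function name and argument names are hypothetical.

```python
def excitation_energy(adaptive_vec, adaptive_gain,
                      core_fixed_vec, core_fixed_gain,
                      enh_fixed_vec, enh_gain):
    """Energy of the combined excitation, assuming it is the gain-weighted
    sum of the adaptive, core fixed, and enhancement fixed codevectors."""
    excitation = [
        adaptive_gain * a + core_fixed_gain * c + enh_gain * e
        for a, c, e in zip(adaptive_vec, core_fixed_vec, enh_fixed_vec)
    ]
    return sum(v * v for v in excitation)
```

Because only quantized values enter the computation, a decoder can derive the same energy without any extra side information.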
The multiplexer 230 outputs a low-band index indicating a low-band by using information received from the core layer coder 210, such as linear prediction coefficient quantization information, information on low-band pitch delay, an adaptive codebook index, gain value quantization information, and by using information received from the speech enhancement layer coder 220, such as the fixed codebook index included in the speech enhancement layer, and gain value difference quantization information. Referring back to FIG. 10, the high-band coder 300 receives a high-band signal component in the frequency range of 4˜8 kHz in operation 112.
In operation 114, the high-band coder 300 receives information required for coding a high-band signal received from the low-band coder 200.
When a harmonic method is used as a coding method according to an exemplary embodiment of the present invention, examples of information required for coding a high-band signal include information on low-band pitch delay and information on low-band excited signal energy. In operation 116, the high-band coder 300 codes the received high-band signal by using the low-band pitch delay information and the low-band excited signal energy information received from the low-band coder 200.
Now, a coding process using a harmonic method will be described with reference to
The high-band coder 300 includes a linear prediction analyzer/quantizer 301, a time/frequency mapping unit 302, a harmonic analyzer 303, a harmonic phase quantizer 304, and an RMS power quantizer 306, each of which has a coding function. Further, the high-band coder 300 includes a harmonic phase dequantizer 305, an RMS power dequantizer 307, a harmonic synthesizer 308, a frequency/time mapping unit 309, a linear prediction synthesizer 310, and a multiplexer 311, each of which has a decoding function.
The linear prediction analyzer/quantizer 301 obtains a linear prediction coding coefficient from the high-band input signal received from a quadrature mirror filter (QMF), using a general code excited linear prediction (CELP) method, and then quantizes the coefficient. The quantized coefficient is output and transmitted to the multiplexer 311. The linear prediction analyzer/quantizer 301 performs linear prediction by using the quantized coefficient. Since linear prediction coding represents the signal by parameters, a residual signal is generated for the portion of the signal that the parameters cannot represent. The generated residual signal is transmitted to the time/frequency mapping unit 302. The time/frequency mapping unit 302 obtains amplitudes and phases of the input residual signal with respect to each frequency component. The amplitudes and phases for each frequency component obtained by the time/frequency mapping unit 302 are transmitted to the harmonic analyzer 303. The harmonic analyzer 303 searches for a harmonic position by using the amplitudes and phases for each frequency component received from the time/frequency mapping unit 302 and information on low-band pitch delay received from the low-band coder 200. Then, frequency information associated with the found harmonic position is coded. The pitch may differ according to features of the actual input speech signal, and in this case the number of harmonics may vary. Thus, only some harmonics may be quantized. For this reason, in order to code frequency information associated with a harmonic position at a limited transmission rate, a signal associated with an important harmonic position has to be determined. The harmonic analyzer 303 selects the signal associated with an important harmonic position.
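How candidate harmonic positions might follow from the low-band pitch delay can be sketched as below. The sketch assumes that the pitch delay is expressed in samples at an 8 kHz low-band sample rate and that harmonics are exact integer multiples of the fundamental; neither assumption is stated in the description, and `harmonic_frequencies` is a hypothetical name.

```python
import math

def harmonic_frequencies(pitch_delay, sample_rate=8000.0, band=(4000.0, 8000.0)):
    """List the multiples of the fundamental frequency that fall in `band`.

    pitch_delay is in samples at `sample_rate`; the lower band edge is
    inclusive and the upper edge is exclusive.
    """
    f0 = sample_rate / pitch_delay
    low_edge, high_edge = band
    k = max(1, math.ceil(low_edge / f0))  # first harmonic at or above the edge
    freqs = []
    while k * f0 < high_edge:
        freqs.append(k * f0)
        k += 1
    return freqs
```

A longer pitch delay gives a lower fundamental and therefore more harmonics in the high band, which is why the number of harmonics to be coded varies with the input speech.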
The signal associated with an important harmonic position may contain a value of a harmonic component located in a relatively low frequency band, a value of a harmonic component having a relatively large energy magnitude over the entire frequency band, or a value of a harmonic component associated with a Formant frequency position when restored by using the linear prediction coding coefficient. Once a harmonic component to be coded by the harmonic analyzer 303 is determined, phase information associated with each harmonic position is extracted, and the extracted harmonic phase information is quantized by the harmonic phase quantizer 304. The harmonic phase quantizer 304 quantizes each harmonic phase obtained as above. When quantizing, various quantization methods may be used such as scalar quantization (SQ) or vector quantization (VQ).
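One of the selection criteria above (largest energy) and the scalar quantization of harmonic phases can be sketched as follows. The uniform quantizer, the bit allocation, and all function names are illustrative assumptions; the description only states that SQ or VQ may be used.

```python
import math

def select_harmonics(amplitudes, count):
    """Indices of the `count` largest-amplitude harmonics, in ascending order."""
    ranked = sorted(range(len(amplitudes)),
                    key=lambda i: amplitudes[i], reverse=True)
    return sorted(ranked[:count])

def quantize_phase(phase, bits):
    """Uniform scalar quantization of a phase (radians) to a `bits`-bit index."""
    levels = 1 << bits
    step = 2.0 * math.pi / levels
    wrapped = phase % (2.0 * math.pi)          # map into [0, 2*pi)
    return int(round(wrapped / step)) % levels  # wrap the top level back to 0

def dequantize_phase(index, bits):
    """Reconstruct the phase (radians, in [0, 2*pi)) from its index."""
    return index * (2.0 * math.pi / (1 << bits))
```

With `bits` bits per harmonic the reconstruction error is at most half a step, that is `math.pi / (1 << bits)` radians, measured on the circle.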
In addition, the harmonic analyzer 303 obtains a high-band root mean square (RMS) power. When various scalability factors are given, the high-band RMS power makes a separate gain for each layer unnecessary. That is, a speech signal is synthesized by using the signal associated with an important harmonic position and the linear prediction coding coefficient, and is then scaled by the high-band energy magnitude. The obtained high-band RMS power is quantized by the RMS power quantizer 306. In order to code the high-band RMS power more effectively, the RMS power quantizer 306 uses statistical information coded in the low-band. According to an exemplary embodiment of the present invention, energy information on a low-band excited signal received from the low-band coder 200 is used. Quantization can be achieved more effectively when the ratio of the low-band excited signal energy to the high-band RMS power is quantized.
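A minimal sketch of quantizing the high-band power relative to the low-band excited signal energy is given below. The log-domain uniform quantizer, the 2 dB step, the -40 dB floor, and the 5-bit index are all assumptions chosen for illustration; the description only says that the ratio is quantized.

```python
import math

def quantize_power_ratio(high_rms, low_energy,
                         min_db=-40.0, step_db=2.0, bits=5):
    """Quantize the ratio of high-band power to low-band energy in dB.
    Both inputs are assumed to be strictly positive."""
    ratio_db = 10.0 * math.log10((high_rms * high_rms) / low_energy)
    index = int(round((ratio_db - min_db) / step_db))
    return max(0, min((1 << bits) - 1, index))  # clamp to the index range

def dequantize_power_ratio(index, low_energy, min_db=-40.0, step_db=2.0):
    """Recover the high-band RMS value from the index and the low-band energy."""
    ratio_db = min_db + index * step_db
    return math.sqrt(low_energy * 10.0 ** (ratio_db / 10.0))
```

Since the decoder already knows the low-band energy, only the small ratio index has to be transmitted, which is where the bit saving comes from.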
Although coding is completed as described above, since a high-band portion is one sub-module of a coder/decoder (CODEC), an output signal can be synthesized only when a decoding module is included in a high-band coding module after coding is completed. Therefore, a decoding process is required as follows.
The harmonic phase dequantizer 305 dequantizes a phase by using a quantized parameter, and transmits the dequantized phase to the harmonic synthesizer 308. The RMS power dequantizer 307 recovers the quantized RMS power by inversely applying the quantization process performed by the RMS power quantizer 306, utilizing the information on low-band excited signal energy received from the low-band coder 200, and transmits this value to the harmonic synthesizer 308. The harmonic synthesizer 308 synthesizes a harmonic component by using the transmitted value, predetermined harmonic position information, and the number of harmonics to be restored. Information on the phase and amplitude of each frequency is obtained by using the synthesized harmonic information.
The information on the phase and amplitude of frequency is transformed into a time-domain signal by the frequency/time mapping unit 309. The transformed signal becomes an excited signal of the linear prediction synthesizer 310. The linear prediction synthesizer 310 passes the excited signal through a synthetic filter, and outputs a finally synthesized second synthetic signal. The difference between the high-band signal input to the high-band coder 300 and the second synthetic signal is transmitted to the wide-band coder 400 as a second error signal.
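The synthesis and error computation can be sketched as follows. The sketch sums cosines directly in place of the frequency/time mapping unit and leaves out the linear prediction synthesis filter, so it illustrates only the signal flow; both function names are hypothetical.

```python
import math

def synthesize_harmonics(freqs_hz, amplitudes, phases, num_samples, sample_rate):
    """Time-domain sum of cosines, one per restored harmonic."""
    out = []
    for n in range(num_samples):
        t = n / sample_rate
        out.append(sum(a * math.cos(2.0 * math.pi * f * t + p)
                       for f, a, p in zip(freqs_hz, amplitudes, phases)))
    return out

def error_signal(original, synthetic):
    """Sample-wise difference between the input and the synthesized signal."""
    return [o - s for o, s in zip(original, synthetic)]
```

The same `error_signal` shape applies to the first error signal of the low-band coder, with the first synthetic signal in place of the second.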
Referring back to
In operation 122, the wide-band coder 400 codes the received first and second error signals by using a modified discrete cosine transform (MDCT) method through time/frequency mapping.
Now, a coding process using the MDCT method will be described with reference to
The wide-band coder 500 includes a time/frequency mapping unit 510, a band divider 520, a normalization module 530, and a quantizer 540.
First and second error signals, that is, time-domain input signals of the wide-band coder 500, are first input to the time/frequency mapping unit 510. In the input first and second error signals, a low-band signal is first subjected to the MDCT through time-frequency mapping. Thereafter, a high-band signal is subjected to the MDCT through time-frequency mapping. Transformed coefficients are sequentially integrated in the order of low-band to high-band, thereby obtaining a wide-band signal. The wide-band signal is processed by the band divider 520 after being divided for each band. A band may be partitioned using various methods. For example, a band may be partitioned into uniformly spaced sections. In addition, by taking a human auditory model into account, a low-band may be narrowly partitioned, and a high-band may be widely partitioned.
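The transform and band partition can be illustrated with a direct (unwindowed, O(N²)) MDCT and a uniform partition. The window, the frame length, and the handling of leftover coefficients are assumptions; `mdct` and `partition_uniform` are hypothetical names, and a real coder would use a windowed, FFT-based transform.

```python
import math

def mdct(frame):
    """Direct MDCT of a frame of 2N samples, returning N coefficients.
    No window is applied; the frame length is assumed to be even."""
    n_half = len(frame) // 2
    coeffs = []
    for k in range(n_half):
        acc = 0.0
        for n, x in enumerate(frame):
            acc += x * math.cos(
                math.pi / n_half * (n + 0.5 + n_half / 2.0) * (k + 0.5))
        coeffs.append(acc)
    return coeffs

def partition_uniform(coeffs, num_bands):
    """Split coefficients into equally sized bands; any remainder joins
    the last band. Assumes num_bands <= len(coeffs)."""
    size = len(coeffs) // num_bands
    bands = [list(coeffs[i * size:(i + 1) * size]) for i in range(num_bands)]
    bands[-1].extend(coeffs[num_bands * size:])
    return bands
```

Under these assumptions the wide-band coefficient vector would be formed as `mdct(first_error_frame) + mdct(second_error_frame)`, low-band coefficients first, before being passed to `partition_uniform`. A perceptually motivated partition would instead use narrower bands at low frequencies.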
The normalization module 530 separates each band produced by the band divider 520 into a band power and normalized coefficients for that band. Preferably, the RMS power of each band is obtained first, and the normalized coefficients are then obtained by dividing all coefficients of the band by that RMS power. The normalized coefficients are quantized by the quantizer 540.
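The per-band normalization described above amounts to the following sketch. The zero-power guard and the name `normalize_band` are additions made for illustration.

```python
import math

def normalize_band(band):
    """Return (rms_power, normalized_coefficients) for one band."""
    power = math.sqrt(sum(c * c for c in band) / len(band))
    if power == 0.0:
        # A silent band has no shape to normalize; avoid dividing by zero.
        return 0.0, [0.0] * len(band)
    return power, [c / power for c in band]
```

After normalization every non-silent band has unit RMS, so a single quantizer design can serve all bands while the band power carries the level.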
Referring back to
In operation 128, the bit-stream generator 500 combines the received first, second, and third indexes so as to generate a bit-stream, and then outputs the bit-stream.
The bit-stream is constructed in the order of a low-band layer coded by the low-band coder 200 having a CELP structure, a high-band layer coded by the high-band coder 300 having a harmonic structure, and a wide-band layer coded by the wide-band coder 400 having an MDCT structure. Further, the bit-stream can be divided into one core layer, which is not optional, and a plurality of enhancement layers. Whenever the enhancement layers are added to the core layer, speech quality is improved, or bandwidth increases. Moreover, the bit-stream may be divided into narrow-band information and wide-band information. The narrow-band information is obtained from a low-band. K layers can be constructed in a scalable manner by using the narrow-band information. The wide-band information includes high-band information and wide-band information. L layers can be constructed by using the wide-band information. Therefore, according to an exemplary embodiment of the present invention, the number of bit-stream layers is K+L.
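The layered structure lends itself to a simple truncation rule: layers are sent in order and dropped from the end when the channel cannot carry them. The following sketch assumes per-layer sizes in bits and a per-frame bit budget, neither of which is specified in the description; both function names are hypothetical.

```python
def build_layers(k_narrow, l_wide):
    """Label K narrow-band layers followed by L wide-band layers,
    in transmission order."""
    return ([("narrow", i + 1) for i in range(k_narrow)]
            + [("wide", j + 1) for j in range(l_wide)])

def layers_within_budget(layer_bits, budget_bits):
    """Number of leading layers whose cumulative size fits the budget."""
    total = 0
    count = 0
    for bits in layer_bits:
        if total + bits > budget_bits:
            break
        total += bits
        count += 1
    return count
```

A result of 0 means that even the core layer does not fit; since the core layer is not optional, a transmitter would have to treat that case separately.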
Referring to
In operation 1010, the bit-stream divider 1000 receives a bit-stream transmitted at a specific transmission rate according to a network environment.
In operation 1020, the bit-stream divider 1000 disassembles the received bit-stream according to a desired syntax. When disassembled, a corresponding portion of the bit-stream is divided according to whether a frequency band to be used in reproduction is a low-band (0˜4 kHz), or a wide-band (0˜8 kHz) including a high-band (4˜8 kHz).
In operation 1030, the bit-stream divider 1000 outputs the bit-stream divided according to a frequency band to each band decoder.
A low-band signal (0˜4 kHz) is output to the low-band decoder 2000. A high-band signal (4˜8 kHz) is output to the high-band decoder 3000. A wide-band signal (0˜8 kHz) is output to the wide-band decoder 4000.
In operation 1040, the low-band decoder 2000 decodes a signal portion of the low-band (0˜4 kHz) included in the divided bit-stream.
In operation 1050, the low-band decoder 2000 outputs information required for decoding a high-band signal among coefficients decoded in a low-band, and transmits the information to the high-band decoder 3000. The information required for decoding a high-band signal includes pitch information.
In operation 1060, the low-band decoder 2000 outputs a reproduction signal decoded in operation 1040, and transmits the reproduction signal to the band combiner 5000.
In operation 1070, the high-band decoder 3000 decodes a signal portion of a high-band (4˜8 kHz) included in the divided bit-stream. In this operation, the high-band decoder 3000 obtains a harmonic position by using a pitch signal received from the low-band decoder 2000, and uses a harmonic method in which a high-band signal is decoded by using information associated with the obtained harmonic position.
In operation 1080, the high-band decoder 3000 outputs the reproduction signal decoded in operation 1070, and transmits the reproduction signal to the band combiner 5000.
In operation 1090, the wide-band decoder 4000 decodes a signal portion of a wide-band (0˜8 kHz) included in the divided bit-stream.
In operation 1100, the wide-band decoder 4000 divides the decoded reproduction signal into a low-band signal and a high-band signal, and then transmits the divided signals.
Referring back to
In operation 1120, the band combiner 5000 combines signals received from the low-band decoder 2000, the high-band decoder 3000, and the wide-band decoder 4000, and then outputs the combined signals included in corresponding layers. A signal output to a (K+1)th layer is composed of only signals output from the low-band decoder 2000 and the high-band decoder 3000. Signals output to a (K+2)th layer through a (K+L)th layer are output after all signals output from the low-band decoder 2000, the high-band decoder 3000, and the wide-band decoder 4000 are combined.
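The layer-dependent combination can be sketched as follows. That layers 1 through K use only the low-band decoder is an inference from the narrow-band layer structure rather than an explicit statement, and the function names are hypothetical.

```python
def active_decoders(layer, k):
    """Decoders contributing to the output of a given layer (1-based)."""
    if layer < 1:
        raise ValueError("layer numbers start at 1")
    if layer <= k:
        return ("low",)             # assumed: narrow-band layers only
    if layer == k + 1:
        return ("low", "high")
    return ("low", "high", "wide")

def synthetic_signals(low_dec, high_dec, wide_low, wide_high, layer, k):
    """Return (first, second) synthetic signals fed to the band combiner."""
    used = active_decoders(layer, k)
    first = list(low_dec)
    second = list(high_dec) if "high" in used else [0.0] * len(high_dec)
    if "wide" in used:
        # The wide-band decoder refines both bands with its decoded residual.
        first = [a + b for a, b in zip(first, wide_low)]
        second = [a + b for a, b in zip(second, wide_high)]
    return first, second
```

The two returned signals correspond to the first and second synthetic signals that the band combiner merges into the wide-band output.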
According to the present invention, scalable speech service can be achieved, and a high-band signal can be effectively compressed using a bandwidth extension method. Further, the present invention can be easily applied in combination with a conventional speech coding method for a narrow-band signal. Since a code excited linear prediction (CELP) structure is used as the low-band coding method, excellent speech quality can be provided at a low bit-rate. A signal output from the high-band coder is combined with the low-band signal, so that a speech signal can be output with high fidelity at a low transmission rate. Since a wide-band output signal can also be combined therewith, not only can a speech signal be output that is close to the original speech signal, but a music signal can also be reproduced.
In addition to the above-described exemplary embodiments, exemplary embodiments of the present invention can also be implemented by executing computer readable code/instructions in/on a medium/media, e.g., a computer readable medium/media. The medium/media can correspond to any medium/media permitting the storing and/or transmission of the computer readable code/instructions. The medium/media may also include, alone or in combination with the computer readable code/instructions, data files, data structures, and the like. Examples of computer readable code/instructions include both machine code, such as produced by a compiler, and files containing higher level code that may be executed by a computing device and the like using an interpreter. The computer readable code/instructions can be recorded/transferred in/on a medium/media in a variety of ways, with examples of the medium/media including magnetic storage media (e.g., floppy disks, hard disks, magnetic tapes, etc.), optical media (e.g., CD-ROMs, or DVDs), magneto-optical media (e.g., floptical disks), hardware storage devices (e.g., read only memory media, random access memory media, flash memories, etc.) and storage/transmission media such as carrier waves transmitting signals, which may include computer readable code/instructions, data files, data structures, etc. Examples of storage/transmission media may include wired and/or wireless transmission (such as transmission through the Internet). For example, wired storage/transmission media may include optical wires/lines, waveguides, and metallic wires/lines including a carrier wave transmitting signals specifying program instructions, data structures, data files, etc. The medium/media may also be a distributed network, so that the computer readable code/instructions is stored/transferred and executed in a distributed fashion. The medium/media may also be the Internet. The computer readable code/instructions may be executed by one or more processors. 
In addition, the above hardware devices may be configured to act as one or more software modules in order to perform the operations of the above-described exemplary embodiments.
Although a few exemplary embodiments of the present invention have been shown and described, it would be appreciated by those skilled in the art that changes may be made in these exemplary embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the claims and their equivalents.
Number | Date | Country | Kind |
---|---|---|---|
10-2006-0049038 | May 2006 | KR | national |
This application claims the benefit of U.S. Provisional Patent Application No. 60/701,502, filed on Jul. 22, 2005, in the U.S. Patent and Trademark Office, and Korean Patent Application No. 10-2006-0049038, filed on May 30, 2006, in the Korean Intellectual Property Office, the disclosures of which are incorporated herein in their entirety by reference.
Number | Name | Date | Kind |
---|---|---|---|
5455888 | Iyengar et al. | Oct 1995 | A |
5819212 | Matsumoto et al. | Oct 1998 | A |
6895375 | Malah et al. | May 2005 | B2 |
7177804 | Wang et al. | Feb 2007 | B2 |
7469206 | Kjorling et al. | Dec 2008 | B2 |
7562021 | Mehrotra et al. | Jul 2009 | B2 |
7624022 | Son et al. | Nov 2009 | B2 |
20020007273 | Chen | Jan 2002 | A1 |
20020007280 | McCree | Jan 2002 | A1 |
20030187634 | Li | Oct 2003 | A1 |
20040111257 | Sung et al. | Jun 2004 | A1 |
20050004794 | Son et al. | Jan 2005 | A1 |
20050017879 | Linzmeier et al. | Jan 2005 | A1 |
20060149538 | Lee et al. | Jul 2006 | A1 |
20060277038 | Vos et al. | Dec 2006 | A1 |
20070088558 | Vos et al. | Apr 2007 | A1 |
20110280337 | Lee et al. | Nov 2011 | A1 |
Number | Date | Country |
---|---|---|
10-2004-0050141 | Jun 2004 | KR |
Number | Date | Country |
---|---|---|
20070033023 A1 | Feb 2007 | US |
Number | Date | Country |
---|---|---|
60701502 | Jul 2005 | US |