1. Technical Field
The present invention relates to an audio signal processing method and device which can encode or decode audio signals.
2. Background Art
Transmission of audio signals, especially transmission of speech signals, improves as encoding and decoding delay of speech signals decreases since the purpose of transmission of speech signals is often real-time communication.
When a speech signal or an audio signal is transmitted to a receiving side, an error or loss may occur causing a reduction in audio quality.
The present invention has been made in order to overcome this problem, and it is an object of the present invention to provide an audio signal processing method and device for concealing frame loss at a receiver.
It is another object to provide an audio signal processing method and device for minimizing propagation of an error to a next frame due to a signal that is arbitrarily generated to conceal frame loss.
The present invention provides the following advantages and benefits.
First, since a receiver-based loss concealment method is performed, bits for additional information for frame error concealment are not required and therefore it is possible to efficiently conceal loss even in a low bit rate environment.
Second, when the loss concealment method of the present invention is performed, it is possible to minimize propagation of an error to a next frame and thereby prevent audio quality degradation as much as possible.
An audio signal processing method according to the present invention to accomplish the above objects includes receiving an audio signal including data of a current frame, performing, when an error has occurred in the data of the current frame, frame error concealment on the data of the current frame using a random codebook to generate a first temporary output signal of the current frame, performing at least one of short term prediction, long term prediction, and fixed codebook search based on the first temporary output signal to generate a parameter, and updating a memory with the parameter for a next frame, wherein the parameter includes at least one of a pitch gain, a pitch delay, a fixed codebook gain, and a fixed codebook.
According to the present invention, the audio signal processing method may further include performing, when an error has occurred in the data of the current frame, extrapolation on a past input signal to generate a second temporary output signal, and selecting the first temporary output signal or the second temporary output signal according to speech characteristics of a previous frame, wherein the parameter may be generated by performing at least one of short term prediction, long term prediction, and fixed codebook search on the selected temporary output signal.
According to the present invention, the speech characteristics of the previous frame may be associated with whether voiced sound characteristics or unvoiced sound characteristics of the previous frame are greater, and the voiced sound characteristics may be greater when the pitch gain is high and the pitch delay changes little.
According to the present invention, the memory may include a memory for long term prediction and a memory for short term prediction and may also include a memory used for parameter quantization of a prediction scheme.
According to the present invention, the audio signal processing method may further include generating a final output signal of the current frame by performing at least one of fixed codebook acquisition, adaptive codebook synthesis, and short term synthesis using the parameter.
According to the present invention, the audio signal processing method may further include updating the memory with the final output signal and an excitation signal acquired through the long term synthesis and fixed codebook synthesis.
According to the present invention, the audio signal processing method may further include performing at least one of long term synthesis and short term synthesis on a next frame based on the memory when no error has occurred in data of the next frame.
An audio signal processing device according to the present invention to accomplish the above objects includes a demultiplexer for receiving an audio signal including data of a current frame and checking whether or not an error has occurred in the data of the current frame, an error concealment unit for performing, when an error has occurred in the data of the current frame, frame error concealment on the data of the current frame using a random codebook to generate a first temporary output signal of the current frame, a re-encoder for performing at least one of short term prediction, long term prediction, and fixed codebook search based on the first temporary output signal to generate a parameter, and a decoder for updating a memory with the parameter for a next frame, wherein the parameter includes at least one of a pitch gain, a pitch delay, a fixed codebook gain, and a fixed codebook.
According to the present invention, the error concealment unit may include an extrapolation unit for performing, when an error has occurred in the data of the current frame, extrapolation on a past input signal to generate a second temporary output signal, and a selector for selecting the first temporary output signal or the second temporary output signal according to speech characteristics of a previous frame, wherein the parameter may be generated by performing at least one of short term prediction, long term prediction, and fixed codebook search on the selected temporary output signal.
According to the present invention, the speech characteristics of the previous frame may be associated with whether voiced sound characteristics or unvoiced sound characteristics of the previous frame are greater, and the voiced sound characteristics may be greater when the pitch gain is high and the pitch delay changes little.
According to the present invention, the memory may include a memory for long term prediction and a memory for short term prediction and may also include a memory used for parameter quantization of a prediction scheme.
According to the present invention, the decoder may generate a final output signal of the current frame by performing at least one of fixed codebook acquisition, adaptive codebook synthesis, and short term synthesis using the parameter.
According to the present invention, the decoder may update the memory with the final output signal and an excitation signal acquired through the long term synthesis and fixed codebook synthesis.
According to the present invention, the decoder may perform at least one of long term synthesis and short term synthesis on a next frame based on the memory when no error has occurred in data of the next frame.
Preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings. Prior to the description, it should be noted that the terms and words used in the present specification and claims should not be construed as being limited to common or dictionary meanings but instead should be understood to have meanings and concepts in agreement with the spirit of the present invention based on the principle that an inventor can define the concept of each term suitably in order to describe his/her own invention in the best way possible. Thus, the embodiments described in the specification and the configurations shown in the drawings are simply the most preferable examples of the present invention and are not intended to illustrate all aspects of the spirit of the present invention. As such, it should be understood that various equivalents and modifications can be made to replace the examples at the time of filing of the present application.
The following terms used in the present invention may be construed as described below and other terms, which are not described below, may also be construed in the same manner. A term “coding” may be construed as encoding or decoding as needed and “information” is a term encompassing values, parameters, coefficients, elements, and the like and the meaning thereof varies as needed although the present invention is not limited to such meanings of the terms.
Here, in the broad sense, the term “audio signal” is distinguished from “video signal” and indicates a signal that can be audibly identified when reproduced. In the narrow sense, the term “audio signal” is discriminated from “speech signal” and indicates a signal which has little to no speech characteristics. In the present invention, the term “audio signal” should be construed in the broad sense and, when used as a term distinguished from “speech signal”, the term “audio signal” may be understood as an audio signal in the narrow sense.
In addition, although the term “coding” may indicate only encoding, it may also have a meaning including both encoding and decoding.
First, as shown in
The demultiplexer 110 receives an audio signal including data of a current frame through a network (S100). Here, the demultiplexer 110 performs channel decoding on a packet of the received audio signal and checks whether or not an error has occurred (S200). Then, the demultiplexer 110 provides the received data of the current frame to the decoder 120 or the error concealment unit 130 according to a bad frame indicator (BFI), which is the error check result. Specifically, the demultiplexer 110 provides the data of the current frame to the error concealment unit 130 when an error has occurred (yes in step S300) and provides the data of the current frame to the decoder 120 when no error has occurred (no in step S300).
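The routing performed in steps S200 and S300 can be sketched as follows. This is only an illustrative sketch: the function name `route_frame` and the callables `decode` and `conceal` are hypothetical stand-ins for the decoder 120 and the error concealment unit 130, and the error check itself is assumed to have already produced the BFI.

```python
from typing import Callable, List

Frame = bytes
Samples = List[float]


def route_frame(frame: Frame,
                bad_frame_indicator: bool,
                decode: Callable[[Frame], Samples],
                conceal: Callable[[Frame], Samples]) -> Samples:
    """Dispatch one frame according to the bad frame indicator (BFI).

    `decode` and `conceal` are hypothetical callables standing in for the
    decoder 120 and the error concealment unit 130.
    """
    if bad_frame_indicator:
        # Error detected (yes in S300): hand the frame to concealment.
        return conceal(frame)
    # No error (no in S300): decode the received data normally.
    return decode(frame)
```

In the described device the concealed signal is further re-encoded before decoding; that later stage is omitted here for brevity.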
Then, the error concealment unit 130 performs error concealment on the current frame using a random codebook and past information to generate a temporary output signal (S400). A procedure performed by the error concealment unit 130 will be described later in detail with reference to
The re-encoder 140 performs re-encoding on the temporary output signal to generate an encoded parameter (S500). Here, re-encoding may include at least one of short-term prediction, long-term prediction, and codebook search and the parameter may include at least one of a pitch gain, pitch delay, a fixed codebook gain, and a fixed codebook. A detailed configuration of the re-encoder 140 and step S500 will be described later in detail with reference to
When it is determined in step S300 that no error has occurred (i.e., no in step S300), the decoder 120 performs decoding on data of the current frame extracted from a bitstream (S700) or performs decoding based on the encoded parameter of the current frame received from the re-encoder 140 (S700). Operation of the decoder 120 and step S700 will be described later in detail with reference to
First, as shown in
First, the long term synthesizer 132 acquires an arbitrary pitch gain gpa and an arbitrary pitch delay Da (S410). The pitch gain and the pitch delay are parameters that are generated through long term prediction (LTP) and the LTP filter may be expressed by the following expression.
[Expression 1]
Here, gp denotes the pitch gain and D denotes the pitch delay.
That is, the received pitch gain and the received pitch delay, which may constitute an adaptive codebook, are substituted into Expression 1. Since the pitch gain and the pitch delay of the received data of the current frame may contain an error, the long term synthesizer 132 acquires the arbitrary pitch gain gpa and the arbitrary pitch delay Da to replace the received pitch gain and the received pitch delay. Here, the arbitrary pitch gain gpa may be equal to the pitch gain value of a previous frame, or may be calculated by applying a weight to the most recent of the gain values stored for previous frames, although the present invention is not limited thereto. The arbitrary pitch gain gpa may also be obtained by appropriately reducing the weighted gain value according to characteristics of the speech signal. The arbitrary pitch delay Da may also be equal to that of data of a previous frame although the present invention is not limited thereto.
In the case in which data of a previous frame is used to generate the arbitrary pitch gain gpa and the arbitrary pitch delay Da, a value (not shown) received from a memory of the decoder 120 may be used.
An adaptive codebook is generated using the arbitrary pitch gain gpa and the arbitrary pitch delay Da acquired in step S410, for example, by substituting the arbitrary pitch gain gpa and the arbitrary pitch delay Da into Expression 1 (S420). Here, a past excitation signal of a previous frame received from the decoder 120 may be used in step S420.
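Steps S410 and S420 may be illustrated by the following minimal sketch. It assumes that the adaptive codebook is formed by reading the past excitation signal at a lag equal to the pitch delay, which is the usual reading of a long term synthesis filter, and that the arbitrary pitch gain is the previous gain scaled by a fixed weight. The function names and the weight of 0.9 are assumptions made for illustration and do not appear in the description.

```python
from typing import List, Tuple


def arbitrary_pitch_params(prev_gain: float, prev_delay: int,
                           weight: float = 0.9) -> Tuple[float, int]:
    """Return (g_pa, D_a): weighted previous gain and reused previous delay.

    The weight value is an assumed example, not a value from the description.
    """
    return weight * prev_gain, prev_delay


def adaptive_codebook(past_excitation: List[float], delay: int,
                      length: int) -> List[float]:
    """Build v(n) = u(n - delay) for n = 0..length-1.

    When delay < length, samples generated earlier in the same frame are
    reused, so the pitch cycle repeats periodically.
    """
    if delay <= 0 or delay > len(past_excitation):
        raise ValueError("delay must be in 1..len(past_excitation)")
    buf = list(past_excitation)
    start = len(buf)
    for n in range(length):
        buf.append(buf[start + n - delay])
    return buf[start:]
```

For example, with a past excitation of `[1.0, 2.0, 3.0, 4.0]` and a delay of 2, the last two samples are repeated to fill the frame.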
Referring back to
$u_{fec}(n) = g_{pa}\,v(n) + g_{ca}\,\mathrm{rand}(n)$ [Expression 2]
Here, ufec(n) denotes the error-concealed excitation signal, gpa denotes the arbitrary pitch gain (adaptive codebook gain), v(n) denotes the adaptive codebook, gca denotes the arbitrary codebook gain, and rand(n) denotes the random codebook.
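Expression 2 can be sketched in a few lines. The random codebook is modeled here as uniform noise in the range [-1, 1] drawn from a seeded generator; the distribution, the seed parameter, and the function name `concealed_excitation` are assumptions for illustration only.

```python
import random
from typing import List


def concealed_excitation(v: List[float], g_pa: float, g_ca: float,
                         seed: int = 0) -> List[float]:
    """Compute u_fec(n) = g_pa * v(n) + g_ca * rand(n) for each sample.

    rand(n) is assumed to be uniform noise in [-1, 1]; a real codec may
    use a different random codebook.
    """
    rng = random.Random(seed)
    rand = [rng.uniform(-1.0, 1.0) for _ in v]
    return [g_pa * vn + g_ca * rn for vn, rn in zip(v, rand)]
```

Setting `g_ca` to zero reduces the result to the scaled adaptive codebook, which makes the contribution of each term easy to inspect.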
The enhancer 136 removes, from the error-concealed excitation signal ufec(n), artifacts which may occur in a low transfer rate mode or which may occur due to insufficient information when error concealment has been applied. First, the enhancer 136 makes the fixed codebook more natural through an FIR filter, which compensates for a shortage of pulses in the fixed codebook, and adjusts gains of the fixed codebook and the adaptive codebook through a speech characteristics classification process. However, the present invention is not limited to this method.
The short term synthesizer 138 first acquires a spectrum vector I[0] whose arbitrary short term prediction coefficient (or arbitrary linear prediction coefficient) has been converted for the current frame. Here, the arbitrary short term prediction coefficient has been generated in order to replace the received short term prediction coefficient since an error has occurred in data of the current frame. The arbitrary short term prediction coefficient is generated based on a short term prediction coefficient of a previous frame (including an immediately previous frame) and may be generated according to the following expression although the present invention is not limited thereto.
$I^{[0]} = \alpha I^{[-1]} + (1-\alpha) I_{ref}$ [Expression 3]
Here, I[0] denotes an Immittance Spectral Pair (ISP) vector corresponding to the arbitrary short term prediction coefficient, I[−1] denotes an ISP vector corresponding to a short term prediction coefficient of a previous frame, Iref denotes an ISP vector of each order corresponding to a stored short term prediction coefficient, and α denotes a weight.
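Expression 3 is an element-wise weighted average of two vectors, as the following sketch shows. The function name `conceal_isp` and the default weight of 0.9 are assumptions; the description does not specify a value for α.

```python
from typing import List


def conceal_isp(prev_isp: List[float], ref_isp: List[float],
                alpha: float = 0.9) -> List[float]:
    """Compute I[0] = alpha * I[-1] + (1 - alpha) * I_ref per vector element.

    alpha = 0.9 is an assumed example value.
    """
    if len(prev_isp) != len(ref_isp):
        raise ValueError("ISP vectors must have the same order")
    return [alpha * p + (1.0 - alpha) * r for p, r in zip(prev_isp, ref_isp)]
```

A weight close to 1 keeps the spectral envelope of the previous frame, while a smaller weight pulls the envelope toward the stored reference vector.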
The short term synthesizer 138 performs short term prediction synthesis or linear prediction (LPC) synthesis using the arbitrary short term spectrum vector I[0]. Here, the STP synthesis filter may be represented by the following expression although the present invention is not limited thereto.
[Expression 4]
Here, ai is an ith-order short term prediction coefficient.
The short term synthesizer 138 then generates a first temporary output signal using a signal obtained by short term synthesis and the excitation signal generated in step S440 (S460). The first temporary output signal may be generated by passing the excitation signal through the short term prediction synthesis filter since the excitation signal corresponds to an input signal of the short term prediction synthesis filter.
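Passing the excitation signal through the short term synthesis filter may be sketched as below. Expression 4 is not reproduced in this text, so the sketch assumes the common all-pole form in which each output sample is the excitation sample plus a weighted sum of previous output samples, s(n) = u(n) + Σ a_i s(n−i). Some codecs use the opposite sign convention for the coefficients. The function name and the memory layout are illustrative assumptions.

```python
from typing import List, Optional, Tuple


def short_term_synthesis(excitation: List[float], a: List[float],
                         memory: Optional[List[float]] = None
                         ) -> Tuple[List[float], List[float]]:
    """All-pole synthesis: s(n) = u(n) + sum_i a[i-1] * s(n - i).

    `memory` holds previous outputs, most recent first (memory[0] = s(-1)).
    Returns the synthesized samples and the updated memory.
    The sign convention of `a` is an assumption.
    """
    order = len(a)
    hist = list(memory) if memory is not None else [0.0] * order
    out = []
    for u in excitation:
        s = u + sum(ai * h for ai, h in zip(a, hist))
        out.append(s)
        # Shift the newest output into the filter state.
        hist = ([s] + hist)[:order]
    return out, hist
```

Returning the filter state allows the next frame to continue from the same memory, which is the property the memory update described later relies on.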
The extrapolator 138-2 performs extrapolation to generate a future signal based on a past signal in order to generate a second temporary output signal for error concealment (S470). Here, the extrapolator 138-2 may perform pitch analysis on a past signal and store a signal corresponding to one pitch period and may then generate a second temporary output signal by sequentially coupling signals in an overlap and add manner through a Pitch Synchronous Overlap and Add (PSOLA) method although the extrapolation method of the present invention is not limited to PSOLA.
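A greatly simplified form of the extrapolation in step S470 is sketched below. It repeats the last pitch period of the past signal to fill the lost frame and deliberately omits the overlap and add smoothing of PSOLA, so it should be read as a plain periodic repetition rather than as PSOLA itself. The pitch period is assumed to be known from a prior pitch analysis, and the function name is hypothetical.

```python
from typing import List


def extrapolate_by_pitch(past: List[float], pitch_period: int,
                         frame_length: int) -> List[float]:
    """Fill a lost frame by repeating the last pitch cycle of `past`.

    This is plain periodic repetition; the overlap-add smoothing used by
    PSOLA is intentionally left out of this sketch.
    """
    if not 0 < pitch_period <= len(past):
        raise ValueError("pitch_period must be in 1..len(past)")
    cycle = past[-pitch_period:]
    return [cycle[n % pitch_period] for n in range(frame_length)]
```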
The selector 139 selects a target signal of the re-encoder 140 from among the first temporary output signal and the second temporary output signal (S480). The selector 139 may select the first temporary output signal upon determining, through speech characteristics classification of the past signal, that the input sound is unvoiced sound and select the second temporary output signal upon determining that the input sound is voiced sound. A function embedded in a codec may be used to perform speech characteristics classification and it may be determined that the input sound is voiced sound when the long term gain is great and the long term delay value changes little although the present invention is not limited thereto.
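The selection rule of step S480 can be sketched as follows. The voicing decision uses the two cues named above, a high long term gain and a stable long term delay, but the thresholds (0.6 for the mean gain and 2 samples for the delay spread) and the function names are assumptions chosen only to make the example concrete.

```python
from typing import List


def is_voiced(pitch_gains: List[float], pitch_delays: List[int],
              gain_threshold: float = 0.6, max_delay_change: int = 2) -> bool:
    """Classify past frames as voiced when gain is high and delay is stable.

    Both thresholds are assumed example values.
    """
    if not pitch_gains or not pitch_delays:
        raise ValueError("history must not be empty")
    high_gain = sum(pitch_gains) / len(pitch_gains) >= gain_threshold
    stable_delay = max(pitch_delays) - min(pitch_delays) <= max_delay_change
    return high_gain and stable_delay


def select_target(first: List[float], second: List[float],
                  voiced: bool) -> List[float]:
    """Pick the extrapolated (second) signal for voiced sound, else the first."""
    return second if voiced else first
```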
Hereinafter, the re-encoder 140 is described with reference to
First, referring to
As shown in
Then, the perceptual weighting filter 144 applies perceptual weighting filtering to a residual signal r(n) which is the difference between a temporary output signal and a predicted signal obtained through short term prediction (S520). Here, the perceptual weighting filtering may be represented by the following expression.
[Expression 5]
Here, γ1 and γ2 are weights.
It is preferable to use the same weights as used in encoding. For example, γ1 may be 0.94 and γ2 may be 0.6 although the present invention is not limited thereto.
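The perceptual weighting of step S520 may be sketched as below. Expression 5 is not reproduced in this text, so the sketch assumes the form commonly used in CELP coders, W(z) = A(z/γ1)/A(z/γ2) with A(z) = 1 − Σ a_i z^(−i); both the form and the sign convention are assumptions. The default weights are the example values given above.

```python
from typing import List


def perceptual_weighting(x: List[float], a: List[float],
                         gamma1: float = 0.94,
                         gamma2: float = 0.6) -> List[float]:
    """Filter x through the assumed W(z) = A(z/gamma1) / A(z/gamma2).

    With A(z) = 1 - sum_i a_i z^-i this gives
    y(n) = x(n) - sum_i a_i*gamma1^i*x(n-i) + sum_i a_i*gamma2^i*y(n-i).
    """
    num = [ai * gamma1 ** (i + 1) for i, ai in enumerate(a)]
    den = [ai * gamma2 ** (i + 1) for i, ai in enumerate(a)]
    y: List[float] = []
    for n in range(len(x)):
        acc = x[n]
        for i in range(1, len(a) + 1):
            if n - i >= 0:
                acc -= num[i - 1] * x[n - i]
                acc += den[i - 1] * y[n - i]
        y.append(acc)
    return y
```

When the two weights are equal, the numerator and denominator cancel and the signal passes through unchanged, which is a convenient sanity check.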
The long term predictor 146 may obtain a long term prediction delay value D by performing open loop search on a weighted input signal to which the perceptual weighting filtering has been applied, and may then perform closed loop search within a range of ±d around the long term prediction delay value D to select a final long term prediction delay value T and a corresponding gain (S530). Here, d may be 8 samples although the present invention is not limited thereto.
Here, it is preferable to use the same long term prediction method as used in the encoder.
Specifically, a long term prediction delay value (pitch delay) D may be calculated according to the following expression.
[Expression 6]
Here, the long term prediction delay D is k which maximizes the value of the function.
The long term prediction gain (pitch gain) may be calculated according to the following expression.
[Expression 7]
Here, d(n) denotes a long term prediction target signal, u(n) denotes a perceptually weighted input signal, L denotes the length of a subframe, D denotes a long term prediction delay value (pitch delay), and gp denotes a long term prediction gain (pitch gain).
d(n) may be an input signal x(n) in the closed-loop scheme and may be wx(n) to which the perceptual weighting filtering has been applied in the open-loop scheme.
Here, the long term prediction gain is obtained using the long term prediction delay value D that is determined according to Expression 6 as described above.
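The delay search and gain computation of step S530 may be illustrated as follows. Expressions 6 and 7 are not reproduced in this text, so the sketch assumes a conventional criterion: the delay is the lag k that maximizes the correlation between the current subframe and the signal delayed by k, normalized by the square root of the delayed signal energy, and the gain is the ratio of that correlation to the delayed signal energy, clipped to a bounded range. The single-signal layout, the clipping bound of 1.2, and the function names are all assumptions.

```python
import math
from typing import List


def open_loop_pitch(signal: List[float], subframe_len: int,
                    min_delay: int, max_delay: int) -> int:
    """Return the lag k in [min_delay, max_delay] maximizing the
    normalized correlation (an assumed criterion).

    `signal` is the history followed by the current subframe, which
    occupies the last `subframe_len` samples.
    """
    start = len(signal) - subframe_len
    if start < max_delay:
        raise ValueError("not enough history for max_delay")
    best_delay, best_score = min_delay, -math.inf
    for k in range(min_delay, max_delay + 1):
        corr = sum(signal[start + n] * signal[start + n - k]
                   for n in range(subframe_len))
        energy = sum(signal[start + n - k] ** 2 for n in range(subframe_len))
        score = corr / math.sqrt(energy) if energy > 0 else 0.0
        if score > best_score:
            best_delay, best_score = k, score
    return best_delay


def pitch_gain(signal: List[float], subframe_len: int, delay: int,
               max_gain: float = 1.2) -> float:
    """Return the correlation-to-energy ratio at `delay`, clipped to
    [0, max_gain]. The clipping bound is an assumed example value."""
    start = len(signal) - subframe_len
    corr = sum(signal[start + n] * signal[start + n - delay]
               for n in range(subframe_len))
    energy = sum(signal[start + n - delay] ** 2 for n in range(subframe_len))
    if energy <= 0:
        return 0.0
    return min(max(corr / energy, 0.0), max_gain)
```

A closed loop refinement within ±d of the open loop result could reuse the same search by calling `open_loop_pitch` with `min_delay = D - d` and `max_delay = D + d` on the appropriate target signal.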
The long term predictor 146 generates the pitch gain gp and the long term prediction delay value D through the above procedure and provides a fixed codebook target signal c(n), which is obtained by removing an adaptive codebook signal generated through long term prediction from the short term prediction residual signal r(n), to the codebook searcher 148.
$c(n) = r(n) - g_p\,v(n)$ [Expression 8]
Here, c(n) denotes the fixed codebook target signal, r(n) denotes the short term prediction residual signal, gp denotes the adaptive codebook gain, and v(n) denotes a pitch signal corresponding to the adaptive codebook delay D.
Here, v(n) may represent an adaptive codebook obtained using a long term predictor from a previous excitation signal memory which may be the memory of the decoder 120 described above with reference to
The codebook searcher 148 generates a fixed codebook gain gc and a fixed codebook ĉ(n) by performing codebook search on the fixed codebook target signal c(n) (S540). Here, it is preferable to use the same codebook search method as used in the encoder.
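The target computation of Expression 8 and a toy version of the search in step S540 are sketched below. The search simply picks, from a small explicit list of candidate vectors, the entry and gain that minimize the squared error against the target. Practical codecs search structured (for example algebraic) codebooks in a weighted domain, so this exhaustive search and the function names are assumptions for illustration only.

```python
from typing import List, Tuple


def fixed_codebook_target(r: List[float], v: List[float],
                          g_p: float) -> List[float]:
    """Compute c(n) = r(n) - g_p * v(n) (Expression 8)."""
    return [rn - g_p * vn for rn, vn in zip(r, v)]


def search_codebook(target: List[float],
                    codebook: List[List[float]]) -> Tuple[int, float]:
    """Return (index, gain) of the entry minimizing |target - gain*entry|^2.

    For each entry the optimal gain is corr/energy, and the error is
    smallest where corr^2/energy is largest. Returns (-1, 0.0) when every
    entry has zero energy.
    """
    best_index, best_gain, best_score = -1, 0.0, -1.0
    for idx, code in enumerate(codebook):
        energy = sum(c * c for c in code)
        if energy == 0:
            continue
        corr = sum(t * c for t, c in zip(target, code))
        score = corr * corr / energy
        if score > best_score:
            best_index, best_gain, best_score = idx, corr / energy, score
    return best_index, best_gain
```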
Here, the parameters may be generated in a closed loop manner such that encoded parameters are re-determined taking into consideration results of synthesis processes (such as long term synthesis and short term synthesis) that are performed using the parameters (including the short term prediction coefficient, the long term prediction gain, the long term prediction delay value, the fixed codebook gain, and the fixed codebook) generated in steps S510, S530, and S540.
The parameters generated through the above procedure are provided to the decoder 120 as described above with reference to
Referring to
The long term synthesizer 122 performs long term synthesis based on the long term prediction gain gp and the long term prediction delay D to generate an adaptive codebook (S720). The long term synthesizer 122 is similar to the long term synthesizer 132 described above with the difference being the input parameters.
The codebook acquirer 124 generates a fixed codebook signal ĉ(n) using the received fixed codebook gain gc and fixed codebook parameter (S730).
An excitation signal u(n) is generated by summing the pitch signal and the codebook signal.
Unlike the random signal generator 134 described above with reference to
The short term synthesizer 126 performs short term synthesis based on a signal of a previous frame and the short term prediction coefficient and adds the excitation signal u(n) to the short term synthesis signal to generate a final output signal (S740). Here, the following expression may be applied.
$u(n) = g_p\,v(n) + g_c\,\hat{c}(n)$ [Expression 9]
Here, u(n) denotes an excitation signal, gp denotes an adaptive codebook gain, v(n) denotes an adaptive codebook corresponding to a pitch delay D, gc denotes a fixed codebook gain, and ĉ(n) denotes a fixed codebook having a unit size.
A detailed description of operation of the short term synthesizer 126 is omitted herein since it is similar to operation of the short term synthesizer 138 described above with reference to
Then, the memory 128 is updated with the received parameters, signals generated based on the parameters, the final output signal, and the like (S750). Here, the memory 128 may be divided into a memory 128-1 (not shown) for error concealment and a memory 128-2 (not shown) for decoding. The memory 128-1 for error concealment stores data required for the error concealment unit 130 (for example, a long term prediction gain, a long term prediction delay value, a past delay value history, a fixed codebook gain, and a short term prediction coefficient) and the memory 128-2 for decoding stores data required for the decoder 120 to perform decoding (for example, an excitation signal of a current frame for synthesis of a next frame, a gain value, and a final output signal). The two memories may be implemented as a single memory 128 rather than being separated. The memory 128-2 for decoding may include a memory for long term prediction and a memory for short term prediction. Specifically, the memory 128-2 for decoding may include a memory required to generate an excitation signal of a next frame through long term synthesis and a memory required for short term synthesis.
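The memory update of step S750 may be pictured with the following sketch, which keeps the items listed above in a single structure, corresponding to the case in which the two memories are implemented as a single memory 128. The class name, field names, history length, and buffer size are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DecoderMemory:
    """Hypothetical single-structure version of the memory 128."""
    pitch_gain: float = 0.0
    pitch_delay: int = 0
    fixed_codebook_gain: float = 0.0
    stp_coefficients: List[float] = field(default_factory=list)
    delay_history: List[int] = field(default_factory=list)
    past_excitation: List[float] = field(default_factory=list)
    past_output: List[float] = field(default_factory=list)

    def update(self, pitch_gain: float, pitch_delay: int,
               fixed_codebook_gain: float, stp_coefficients: List[float],
               excitation: List[float], output: List[float],
               history_len: int = 4, max_samples: int = 512) -> None:
        """Store the parameters and signals of the frame just processed.

        The same call applies whether the parameters came from the
        bitstream or from re-encoding a concealed frame.
        """
        self.pitch_gain = pitch_gain
        self.pitch_delay = pitch_delay
        self.fixed_codebook_gain = fixed_codebook_gain
        self.stp_coefficients = list(stp_coefficients)
        # Keep only the most recent delays and samples.
        self.delay_history = (self.delay_history + [pitch_delay])[-history_len:]
        self.past_excitation = (self.past_excitation
                                + list(excitation))[-max_samples:]
        self.past_output = (self.past_output + list(output))[-max_samples:]
```

Because the update path is identical for received and re-encoded parameters, the next frame is decoded from a state that is consistent with the concealed output.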
In the case in which parameters are received from the demultiplexer 110 through the switch 121 of
By updating data of a frame which contains an error with parameters corresponding to an error-concealed signal in the above manner, it is possible to prevent error propagation as much as possible upon decoding of the next frame.
The audio signal processing method according to the present invention may be implemented as a program to be executed by a computer and the program may then be stored in a computer readable recording medium. Multimedia data having a data structure according to the present invention may also be stored in a computer readable recording medium. The computer readable recording medium includes any type of storage device that stores data that can be read by a computer system. Examples of the computer readable recording medium include read only memory (ROM), random access memory (RAM), CD-ROMs, magnetic tapes, floppy disks, optical data storage devices, and so on. The computer readable recording medium can also be embodied in the form of carrier waves (for example, signals transmitted over the Internet). A bitstream generated through the encoding method described above may be stored in a computer readable recording medium or may be transmitted over a wired/wireless communication network.
Although the present invention has been described above with reference to specific embodiments and drawings, the present invention is not limited to the specific embodiments and drawings and it will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit of the invention and the scope of the appended claims and their equivalents.
The present invention is applicable to audio signal processing and output.
This application is a continuation of U.S. application Ser. No. 13/511,331, filed May 22, 2012, now allowed, which is a U.S. National Phase of International Application PCT/KR2010/008336, filed on Nov. 24, 2010, which claims the benefit of U.S. Provisional Application No. 61/264,248, filed on Nov. 24, 2009, U.S. Provisional Application No. 61/285,183, filed on Dec. 10, 2009 and U.S. Provisional Application No. 61/295,166, filed on Jan. 15, 2010, all of which are hereby incorporated by reference in their entireties.
Number | Date | Country | |
---|---|---|---|
20150221311 A1 | Aug 2015 | US |
Number | Date | Country | |
---|---|---|---|
61264248 | Nov 2009 | US | |
61285183 | Dec 2009 | US | |
61295166 | Jan 2010 | US |
Relation | Number | Country |
---|---|---|
Parent | 13511331 | US |
Child | 14687991 | US |