The present invention relates generally to audio encoding/decoding and more specifically to audio frame loss recovery.
In the last twenty years microprocessor speed has increased by several orders of magnitude and Digital Signal Processors (DSPs) have become ubiquitous. As a result, it has become feasible and attractive to transition from analog communication to digital communication. Digital communication offers the advantage of being able to utilize bandwidth more efficiently and allows for error correcting techniques to be used. Thus, by using digital communication, one can send more information through an allocated spectrum space and send the information more reliably. Digital communication can use wireless links (e.g., radio frequency) or physical network media (e.g., fiber optics, copper networks).
Digital communication can be used for transmitting and receiving different types of data, such as audio data (e.g., speech), video data (e.g., still images or moving images) or telemetry. For audio communications, various standards have been developed, and many of those standards rely upon frame based coding in which, for example, high quality audio is encoded and decoded using audio frames (e.g., 20 millisecond frames containing information that describes the audio that occurs during the 20 milliseconds). For certain wireless systems, audio coding standards have evolved that use sequentially mixed time domain coding and frequency domain coding. Time domain coding is typically used when the source audio is voice and typically involves the use of CELP (code excited linear prediction) based analysis-by-synthesis coding. Frequency domain coding is typically used for non-voice sources such as music and is typically based on quantization of MDCT (modified discrete cosine transform) coefficients. Frequency domain coding is also referred to as “transform domain coding.” During transmission, a mixed time domain and transform domain signal may experience a frame loss. When a device receiving the signal decodes the signal, the device will encounter the portion of the signal having the frame loss, and may request that the transmitter resend the signal. Alternatively, the receiving device may attempt to recover the lost frame. Frame loss recovery techniques typically use information from frames in the signal that occur before and after the lost frame to construct a replacement frame.
The features of the invention believed to be novel are set forth with particularity in the appended claims. The invention itself however, both as to organization and method of operation, together with objects and advantages thereof, may be best understood by reference to the following detailed description, which describes embodiments of the invention. The description is meant to be taken in conjunction with the accompanying drawings in which:
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help to improve understanding of embodiments of the present invention.
While this invention is susceptible of embodiment in many different forms, there is shown in the drawings and will herein be described in detail specific embodiments, with the understanding that the present disclosure is to be considered as an example of the principles of the invention and not intended to limit the invention to the specific embodiments shown and described. In the description below, like reference numerals are used to describe the same, similar or corresponding parts in the several views of the drawings.
Embodiments described herein provide a method of generating an audio frame as a replacement for a lost frame when the lost frame directly follows a transform domain coded audio frame. The decoder obtains pitch information related to the transform domain frame that precedes the first lost frame and uses that to construct replacement audio for the lost frame. The technique provides a replacement frame that has reduced distortion compared to other techniques.
Referring to
The voice audio can be effectively compressed by using certain time domain coding techniques, while music and other non-voice audio can be effectively compressed by certain transform domain encoding (frequency encoding) techniques. In some systems, CELP (code excited linear prediction) based analysis-by-synthesis coding is the time domain coding technique that is used. The transform domain coding is typically based on quantization of MDCT (modified discrete cosine transform) coefficients. The audio signal received at the UE 120 is a mixed audio signal that uses time domain coding and transform domain coding in a sequential manner. Although the UE 120 is described as a user device for the embodiments described herein, in other embodiments it may be a device not commonly thought of as a user device. For example, it may be an audio device used for presenting audio for a movie in a cinema. The network 110 and UE 120 may communicate in both directions using an audio frame based communication protocol, wherein a sequence of audio frames is used, each audio frame having a duration and being encoded with compression encoding that is appropriate for the desired audio bandwidth. For example, analog source audio may be digitally sampled 16000 times per second and sequences of the digital samples may be used to generate compression coded audio frames every 20 milliseconds. The compression encoding (e.g., CELP and/or MDCT) conveys the audio signal in a manner that has an acceptably high quality using far fewer bits than the quantity of bits resulting directly from the digital sampling. It will be appreciated that the frames may include other information such as error mitigation information, a sequence number and other metadata, and the frames may be included within groupings of frames that may include error mitigation, sequence number, and metadata for more than one frame. Such frame groups may be, for example, packets or audio messages. 
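As a non-limiting illustration of the framing arithmetic described above, the following sketch computes the number of digital samples covered by one audio frame. The function name samples_per_frame is hypothetical and is used here only for illustration.

```python
def samples_per_frame(sample_rate_hz: int, frame_ms: int) -> int:
    """Return the number of PCM samples covered by one audio frame."""
    return sample_rate_hz * frame_ms // 1000


# 16000 samples per second and 20 millisecond frames, as in the example above.
L = samples_per_frame(16000, 20)
print(L)  # 320
```

With these example values, each 20 millisecond frame describes 320 samples, which is the frame length L used in the expressions that follow.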
It will be appreciated that in some embodiments, most particularly those systems that include packet transmission techniques, frames may not be received sequentially in the order in which they are transmitted, and in some instances a frame or frames may be lost.
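As a non-limiting illustration, a receiver might identify lost frames from the sequence numbers mentioned above, even when frames arrive out of order. The sketch below assumes that every frame carries an integer sequence number and that the range of expected numbers is known; the function name find_lost_frames and its parameters are hypothetical.

```python
from typing import Iterable, List


def find_lost_frames(received: Iterable[int], first: int, last: int) -> List[int]:
    """Return the sequence numbers in [first, last] that were never received.

    Arrival order does not matter, so reordered frames are not reported as lost.
    """
    seen = set(received)
    return [seq for seq in range(first, last + 1) if seq not in seen]


# Frames 2 and 3 arrive swapped, frames 5 and 6 arrive swapped, frame 4 never arrives.
print(find_lost_frames([0, 1, 3, 2, 6, 5], first=0, last=6))  # [4]
```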
Some embodiments are designed to handle a mixed audio signal that changes between voice and non-voice by providing for changing from time domain coding to transform domain coding and also from transform domain coding to time domain coding. When changing from a transform domain portion of the audio signal to a subsequent time domain portion of the audio signal, the first frame that is transform coded is called the transform domain to time domain transition frame. As used herein, decoding means generating, from the compressed audio encoded within each frame, a set of audio sample values that may be used as an input to a digital to analog converter. The method that is typically used for encoding and decoding transform coded frames (MDCT transform) results, at the output of the decoder, in a set of audio samples representing each audio frame as well as a set of audio samples called MDCT synthesis memory samples that are usable for decoding the next audio frame.
In some embodiments, frame error recovery bits are added by the encoder 111 to certain defined ones or all of the transform domain encoded frames that are determined to be pitch based frame error recovery transform domain type frames. Referring to
At step 208, a time domain encoding technique is used to encode and transmit the current frame.
At step 210, which is used in those embodiments in which a speech/music classification is provided, the state of the speech/music indication is determined. A further determination is then made as to whether the current transform frame is to be classified as a pitch based frame error recovery transform domain type of frame (PITCH FER frame) or an MDCT frame error recovery type of frame (MDCT FER frame) based on some parameters received from the audio encoder, such as a speech/music indication, an open loop pitch gain of the frame or part of the frame, and a ratio of high frequency to low frequency energy in the frame. When the open loop pitch gain of the frame is less than an open loop pitch gain threshold, the frame is classified as an MDCT FER frame, and when the open loop pitch gain is above the threshold, the frame is classified as a PITCH FER frame. When the frame is classified as an MDCT FER frame at step 210, an FER indicator (which may be a single bit) is set at step 215 to indicate that the frame is an MDCT FER frame, and the FER indicator is transmitted to the decoder with other frame information (e.g., coefficients) at step 220. When the frame is classified as a PITCH FER frame, the FER indicator is set at step 225 to indicate a PITCH FER frame. A frame error recovery parameter referred to as the FER pitch delay is determined as described below at step 230. The FER indicator and FER pitch delay are transmitted as parameters to the decoder at step 235 with either eight or nine bits that represent the pitch along with other frame information (e.g., coefficients).
In those embodiments in which the speech/music classification is provided, the threshold used to classify the frame as a PITCH FER frame or an MDCT FER frame may be dependent upon whether the frame is classified as speech or music, and may be dependent upon a ratio of high frequency energy versus low frequency energy of the frame. For example, the threshold above which a frame that has been classified as speech becomes classified as a PITCH FER frame may be an open loop pitch gain of 0.5 and the threshold above which a frame that has been classified as music becomes classified as a PITCH FER frame may be an open loop pitch gain of 0.75. Furthermore, in certain embodiments these thresholds may be modifiable based on a ratio of energies (gains) of a range of high frequencies versus a range of low frequencies. For example, the high frequency range may be 3 kHz to 8 kHz and the low frequency range may be 100 Hz to 3 kHz. In certain embodiments the speech and music thresholds are increased linearly with the ratio of energies, or in some cases, if the ratio is very high (i.e., the high frequency to low frequency ratio is more than 5), then the frame is classified as an MDCT FER frame independent of the value of the open loop pitch gain.
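As a non-limiting illustration, the classification described above may be sketched as follows. The example thresholds (0.5 for speech, 0.75 for music) and the ratio limit of 5 are taken from the description. The function name classify_fer_frame, the slope parameter that raises the threshold linearly with the energy ratio, and the treatment of a gain exactly equal to the threshold as an MDCT FER frame are assumptions made only for this sketch.

```python
def classify_fer_frame(open_loop_gain: float,
                       is_speech: bool,
                       hf_lf_ratio: float,
                       speech_threshold: float = 0.5,
                       music_threshold: float = 0.75,
                       ratio_limit: float = 5.0,
                       slope: float = 0.0) -> str:
    """Classify a transform domain frame as 'PITCH_FER' or 'MDCT_FER'."""
    # A very high ratio of high to low frequency energy forces MDCT FER,
    # independent of the open loop pitch gain.
    if hf_lf_ratio > ratio_limit:
        return "MDCT_FER"
    base = speech_threshold if is_speech else music_threshold
    # Assumption: the threshold grows linearly with the energy ratio;
    # slope = 0.0 leaves the example thresholds unchanged.
    threshold = base + slope * hf_lf_ratio
    # Assumption: a gain exactly equal to the threshold is treated as MDCT FER.
    return "PITCH_FER" if open_loop_gain > threshold else "MDCT_FER"


print(classify_fer_frame(0.6, is_speech=True, hf_lf_ratio=1.0))   # PITCH_FER
print(classify_fer_frame(0.6, is_speech=False, hf_lf_ratio=1.0))  # MDCT_FER
print(classify_fer_frame(0.9, is_speech=True, hf_lf_ratio=6.0))   # MDCT_FER
```

The same open loop pitch gain of 0.6 yields different classifications for speech and for music because the music threshold is higher.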
Since both the FER classification and the pitch FER information are going to be used for frame error recovery of the following frame, and because the parameters representing values near the end of the frame provide better information about the following frame than the parameters at the start of a frame, the classification at step 210 may be based on the open loop pitch gain near the end of the frame. Similarly, the pitch delay information determined at step 230 may be based on the pitch delay near the end of the frame. The position that such parameters may represent within a frame may be dependent upon the source of the current frame at step 205. Audio characterization functions associated with certain frame sources (e.g., speech/audio classifiers and audio pitch parameter estimators) may provide parameters from different position ranges of each frame. For example, some speech/audio classifiers provide the open loop pitch gain and the pitch delay for three locations in each frame: the beginning, the middle and the end. In this case the open loop pitch gain and the pitch delay defined to be at the end of the frame would be used. Some audio characterization functions may utilize look-ahead audio samples to provide look-ahead values, which would then be used as best estimates of the audio characteristics of the next frame. Thus, the open loop pitch gain and pitch delay values that are selected as frame error recovery parameters are the parameters that are the best estimates for those values for the next frame (which may be a lost frame).
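As a non-limiting illustration, the selection of frame error recovery parameters described above may be sketched as follows: a look-ahead estimate is preferred when one is available, and otherwise the estimate nearest the end of the frame is used. The type PitchEstimate and the function select_fer_parameters are hypothetical names introduced only for this sketch.

```python
from typing import NamedTuple, Optional, Sequence


class PitchEstimate(NamedTuple):
    gain: float  # open loop pitch gain
    delay: int   # pitch delay in samples


def select_fer_parameters(per_position: Sequence[PitchEstimate],
                          look_ahead: Optional[PitchEstimate] = None) -> PitchEstimate:
    """Pick the estimate that best predicts the next (possibly lost) frame.

    per_position is ordered from the beginning of the frame to its end.
    """
    if look_ahead is not None:
        return look_ahead
    if not per_position:
        raise ValueError("at least one in-frame estimate is required")
    return per_position[-1]  # the estimate at the end of the frame


begin, middle, end = PitchEstimate(0.4, 80), PitchEstimate(0.6, 82), PitchEstimate(0.7, 84)
print(select_fer_parameters([begin, middle, end]).delay)  # 84
```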
The frame error recovery parameters for pitch in most systems can be determined with significantly better accuracy at the encoder at steps 210 and 230 than at the decoder because the encoder may have information of audio samples from the next frame in its look-ahead buffer.
In the event of a frame loss, if the most recent previous good frame (hereafter, the previous transform frame, or PTF) was a PITCH FER type frame, then a combination of a frame repeat approach and a pitch based extension approach may be used for frame error mitigation, and if the PTF is an MDCT FER frame, then only a frame repeat approach may be used for frame error mitigation.
Referring to
When a determination is made at step 315 that the PTF is a PITCH FER frame, the FER pitch delay value is determined from the FER parameters sent with the PTF frame at step 315 and a pitch extended synthesized signal (PESS) is synthesized at step 320 using estimated linear predictive coefficients (LPC) of the PTF, the decoded audio of the PTF, and the FER pitch delay of the PTF. The PESS is a signal that extends at least slightly beyond the lost frame and may be extended further if more than one frame is lost. As noted above, there may be a limit as to how many lost frames are decoded by extension in these embodiments, depending on the type of audio. At step 325, a decoded audio for at least the lost frame is generated using at least the PESS. (In some other embodiments later described, the decoded audio is determined further based on audio determined using a frame repeat method based on the transform decoding of the PTF.) At step 330, a plurality of parameters are received for a next good frame that follows the lost frame, which may be a time domain frame, a transform domain frame, or a transform domain to time domain transition frame. The parameters for these frames are known and include, depending upon frame type, LPC coefficients and MDCT coefficients. At step 335 a decoded audio is generated from the plurality of parameters. More details for at least two of the above steps follow.
Referring to
r(L+n)=γ·r(L+n−D), 0≦n<2·L, γ≦1 (1)
wherein γ is a predefined value which may be frame dependent, and wherein n=0 defines the beginning of the lost frame. When only one frame is lost, γ may be 1 or slightly less, e.g., 0.8 to 1.0 (part of step 320,
The extended residual r(L+n) is passed through an LPC synthesis filter at step 445 using the inverse estimated LPC coefficients, generating the pitch extended synthesis signal (PESS). When there is one lost frame, the PESS is given by
sp(n) for 0≦n<2·L (2)
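As a non-limiting illustration, equations (1) and (2) may be sketched as follows. The sketch assumes that the residual of the PTF (L samples) and the last samples of the decoded PTF audio are already available, and that the synthesis filter is the all-pole filter 1/A(z) with A(z) = 1 + a(1)z^-1 + ... + a(P)z^-P. The sign convention and the function names extend_residual and lpc_synthesis are assumptions made only for this sketch.

```python
from typing import List, Sequence


def extend_residual(residual: Sequence[float], pitch_delay: int,
                    num_samples: int, gamma: float = 1.0) -> List[float]:
    """Equation (1): r(L+n) = gamma * r(L+n-D) for 0 <= n < num_samples."""
    L = len(residual)
    if not 0 < pitch_delay <= L:
        raise ValueError("pitch delay must lie within the available residual")
    r = list(residual)
    for n in range(num_samples):
        # For n >= D this reads samples that were themselves extended.
        r.append(gamma * r[L + n - pitch_delay])
    return r[L:]


def lpc_synthesis(excitation: Sequence[float], lpc: Sequence[float],
                  history: Sequence[float]) -> List[float]:
    """Filter the excitation through 1/A(z); history holds past output samples."""
    order = len(lpc)
    if len(history) < order:
        raise ValueError("history must hold at least one sample per coefficient")
    mem = list(history[len(history) - order:])  # most recent sample last
    out = []
    for x in excitation:
        y = x - sum(a * mem[-k] for k, a in enumerate(lpc, start=1))
        out.append(y)
        mem.append(y)
        mem.pop(0)
    return out


# One pulse per pitch period of D = 4 samples, extended over 2*L = 8 samples.
extended = extend_residual([1.0, 0.0, 0.0, 0.0], pitch_delay=4, num_samples=8)
print(extended)  # [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
pess = lpc_synthesis(extended, lpc=[-0.5], history=[0.0])
```

In this sketch, the list pess plays the role of sp(n) in equation (2), with history supplied from the end of the decoded PTF audio so that the synthesized signal continues from the previous frame.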
Note that the multiplier for L is larger when more than one frame is lost. E.g., for two lost frames, the multiplier is 3. In certain embodiments, another synthesis signal, referred to herein as the PTF repeat frame (PTFRF) is generated at step 450 based on MDCT decoding of scaled MDCT coefficients of the PTF frame and the synthesis memory values of the PTF frame. The scaling may be a value of 1 when one frame is lost. The decoded scaled MDCT coefficients and synthesis memory values are overlap added to generate the PTFRF. The PTFRF is given by
sr(n) for 0≦n<L (3)
In certain embodiments, a decoded audio signal for the lost frame is generated at step 455 as
s(n)=w(n)·sp(n)+(1−w(n))·sr(n), 0≦n<L (4)
where w(n) is a predefined weighting function (part of step 325,
One value of m that has been experimentally determined to minimize the perceived distortion in the event of a lost frame, over a combination of PTF and next good frame values that represent a range of expected values, is ⅛ L. The reason for using the combination of MDCT based approach and the residual based approach in the initial part of the lost frame following a PTF is to make use of the MDCT synthesis memory of the PTF. In some embodiments the decoded audio for the lost frame is determined with w(n)=1 from 0≦n<L, or in other words, directly from the PESS (the portion of equation (2) for which 0≦n<L).
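As a non-limiting illustration, the combination of equation (4) may be sketched as follows. The exact form of the weighting function w(n) is not reproduced here, so this sketch assumes a linear ramp that rises from 0 to 1 over the first m samples and is 1 afterwards, with m defaulting to one eighth of L. The ramp shape and the function names ramp_weight and conceal_lost_frame are assumptions made only for this sketch.

```python
from typing import List, Optional, Sequence


def ramp_weight(n: int, m: int) -> float:
    """Assumed weighting: linear rise over the first m samples, then 1."""
    return n / m if n < m else 1.0


def conceal_lost_frame(pess: Sequence[float], ptfrf: Sequence[float],
                       m: Optional[int] = None) -> List[float]:
    """Equation (4): s(n) = w(n)*sp(n) + (1 - w(n))*sr(n) for 0 <= n < L."""
    L = len(ptfrf)
    if len(pess) < L:
        raise ValueError("the PESS must cover at least the lost frame")
    if m is None:
        m = max(1, L // 8)  # one eighth of L, as in the description
    out = []
    for n in range(L):
        w = ramp_weight(n, m)
        out.append(w * pess[n] + (1.0 - w) * ptfrf[n])
    return out


# With L = 320 the transition occupies the first 40 samples of the lost frame.
s = conceal_lost_frame([1.0] * 640, [0.0] * 320)
print(s[0], s[20], s[40])  # 0.0 0.5 1.0
```

Under this assumed weighting, the start of the lost frame is taken from the PTF repeat frame, which makes use of the MDCT synthesis memory, and the remainder is taken from the PESS.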
Referring to
One value of m that has been experimentally determined to minimize the perceived distortion in the event of a lost frame and matching pitch epochs, over a combination of PESS and next good frame values that represent a range of expected values, is ½ L. Alternatively in step 525, when the pitch epochs match, equation (6) may be used to modify the next good frame based on the PESS with an alternative weighting equation (8), in which m1 and m2 have experimentally determined values of weight boundaries that minimize the perceived distortion in the event of a lost frame and matching pitch epochs, over a combination of PESS and next good frame values that represent a range of expected values.
At step 520, when the pitch epoch values do not match, a determination is made at step 530 as to whether their difference is greater than one half the FER pitch delay obtained with the PTF. When the value of the difference is greater than one half the FER pitch delay, then m1 in equation (8) is set at step 535 to a location after the pitch epoch of the PESS. However, when the value of the difference in step 530 is less than one half the FER pitch delay, the value for m1 in equation (8) is set to a location after the pitch epoch of the audio output of the next good frame (as received). This avoids a problem of cancellation of pitch epochs and/or generation of two pitch epochs which are very close, which results in audible harmonic discontinuity. At step 545, m2 (which is greater than m1) of equation (8) is set to be before the next pitch epoch of the two output signals, which for one lost frame are Sp(n+L) and Sg(n). Now the values of m1 and m2 are set in equation (8) and a modified output signal is generated as the decoded audio for the next good frame for step 335 of
Thus, the values of m1 and m2 may be fixed in some embodiments or may be dependent on the FER pitch delay value of the PTF and the positions of the pitch epochs of the two outputs (the audio output of the PTF and the audio output of the next good frame). In certain embodiments, a pitch value may be obtained for the next good frame and that pitch value may be used as an additional value from which to determine the values of m1 and m2. If the pitch values of the PTF and the next good frame are significantly different, or the next good frame is not a PITCH FER frame, then equation (6) is used as described above.
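As a non-limiting illustration, the selection of m1 for the case in which the pitch epochs do not match may be sketched as follows. The description states only that m1 is placed after one epoch or the other, so the guard offset, the use of an absolute difference, the handling of a difference exactly equal to one half the FER pitch delay, and the function name choose_m1 are assumptions made only for this sketch. The placement of m2 is not shown.

```python
def choose_m1(epoch_pess: int, epoch_good: int,
              fer_pitch_delay: int, guard: int = 1) -> int:
    """Return a start position m1 for the weighting transition.

    epoch_pess and epoch_good are pitch epoch positions (in samples) in the
    PESS and in the decoded next good frame. Intended for the case in which
    the two epochs do not match.
    """
    difference = abs(epoch_pess - epoch_good)
    if difference > fer_pitch_delay / 2:
        # Epochs far apart: start the transition after the PESS epoch.
        return epoch_pess + guard
    # Epochs close together: start after the next good frame's epoch.
    # Assumption: a difference exactly equal to half the delay lands here.
    return epoch_good + guard


print(choose_m1(epoch_pess=10, epoch_good=50, fer_pitch_delay=60))  # 11
print(choose_m1(epoch_pess=10, epoch_good=30, fer_pitch_delay=60))  # 31
```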
Referring to
Referring back to
s(n)=w(n)·sp(n)+(1−w(n))·sr(n), 0≦n<L+p (9)
wherein p is 15 for a WB signal and 30 for a SWB signal, and sp(n) is given by equation (2). It will be appreciated that for other types of decoded audio signals, p may be different, and may be a value up to L.
In some embodiments, the techniques for using a CELP state generator may be those described in U.S. patent application Ser. No. 13/190,517, filed in the U.S. on Jul. 7, 2011, entitled “Method and Apparatus for Audio Encoding and Decoding” (hereafter “USPAN '517”) or U.S. patent application Ser. No. 13/342,462, filed in the U.S. on Jan. 1, 2012, entitled “Method and Apparatus for Processing Audio Frames to Transition Between Differing Codecs” (hereafter “USPAN '462”), which are incorporated herein by reference, but with the techniques modified by substituting the above described decoded audio signal as the input to the CELP state generators that are described in USPAN '517 and USPAN '462. The CELP generator in USPAN '462 is described with reference to FIG. 4 of USPAN '462, with the input that is being replaced labeled “RECONSTRUCTED AUDIO (FRAME M)”. The CELP generator in USPAN '517 is described with reference to FIG. 5 of USPAN '517, with the input that is substituted being labeled “RECONSTRUCTED AUDIO (FRAME m)”. The extension to the decoded audio signal s(n) of equation (4) is obtained by using the pitch extended synthesis signal of equation (2) in generating the output signal of equation (4) and changing the upper length limit of equation (2) accordingly. This approach minimizes a discontinuity that would otherwise result from using the MDCT synthesis memory for extension values from the decoded lost frame that are needed to compensate for the delay of the down sampling filter used in the ACELP part (15). (Use of MDCT synthesis memory as an extension for generating CELP state in frames following lost frames which use PESS would result in discontinuity.) Using the approach described above, an audio output signal is generated at step 510 as the decoded audio output of a transform domain to time domain transition frame for the next good frame for step 335 of
Referring to
In this document, relational terms such as first and second, top and bottom, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms “comprises,” “comprising,” or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by “comprises . . . a” does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.
Reference throughout this document to “one embodiment”, “certain embodiments”, “an embodiment” or similar terms means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of such phrases in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments without limitation.
The term “or” as used herein is to be interpreted as an inclusive or meaning any one or any combination. Therefore, “A, B or C” means “any of the following: A; B; C; A and B; A and C; B and C; A, B and C”. An exception to this definition will occur only when a combination of elements, functions, steps or acts are in some way inherently mutually exclusive.
The processes illustrated in this document, for example (but not limited to) the method steps described in
It will be appreciated that some embodiments may comprise one or more generic or specialized processors (or “processing devices”) such as microprocessors, digital signal processors, customized processors and field programmable gate arrays (FPGAs) and unique stored program instructions (including both software and firmware) that control the one or more processors to implement, in conjunction with certain non-processor circuits, some, most, or all of the functions of the methods and/or apparatuses described herein. Alternatively, some, most, or all of these functions could be implemented by a state machine that has no stored program instructions, or in one or more application specific integrated circuits (ASICs), in which each function or some combinations of certain of the functions are implemented as custom logic. Of course, a combination of the approaches could be used.
Further, it is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein will be readily capable of generating such stored program instructions and ICs with minimal experimentation.
In the foregoing specification, specific embodiments of the present invention have been described. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the present invention as set forth in the claims below. As examples, in some embodiments some method steps may be performed in a different order than that described, and the functions described within functional blocks may be arranged differently (e.g.,). As another example, any specific organizational and access techniques known to those of ordinary skill in the art may be used for tables. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present invention. The benefits, advantages, solutions to problems, and any element(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as critical, required, or essential features or elements of any or all the claims. The invention is defined solely by the appended claims including any amendments made during the pendency of this application and all equivalents of those claims as issued.
Number | Date | Country | |
---|---|---|---|
20140088974 A1 | Mar 2014 | US |