1. Field
One or more embodiments relate to technologies and techniques for encoding and decoding audio, and more particularly, to technologies and techniques for encoding and decoding audio with improved frame error concealment using a multi-rate speech and audio codec.
2. Description of the Related Art
In the technical field of speech and audio coding for environments where frames of encoded speech or audio are expected to be subjected to occasional losses during their transport, coded speech and audio transporting or decoding systems are designed to limit frame losses to the order of a few percent.
To limit these frame losses, or to compensate for the loss of frames, frame erasure concealment (FEC) algorithms may be implemented by a decoding system independent of the speech codec used to encode or decode the speech or audio. Many codecs use decoder-only algorithms to reduce the degradation caused by frame loss.
Such FEC algorithms have recently been utilized in cellular communication networks or environments, which operate in accordance with a given standard or specification. For example, the standard or specification may define the communication protocols and/or parameters that shall be used for a connection and communication. Examples of the different standards and/or specifications include Global System for Mobile Communications (GSM), GSM/Enhanced Data rates for GSM Evolution (EDGE), Advanced Mobile Phone System (AMPS), Wideband Code Division Multiple Access (WCDMA) or 3rd generation (3G) Universal Mobile Telecommunications System (UMTS), and International Mobile Telecommunications 2000 (IMT 2000). Here, speech coding has previously been performed with either variable rate or fixed rate encoding. In variable rate encoding, the source uses an algorithm to classify speech into different rates, and encodes the classified speech according to respective predetermined bit rates. Alternatively, speech coding has been performed using fixed bit rates, where detected speech audio may be coded according to a fixed bit rate. Examples of such fixed rate codecs include the multi-rate speech codecs developed by the 3rd Generation Partnership Project (3GPP) for GSM/EDGE and WCDMA communication networks, such as the adaptive multi-rate (AMR) codec and the adaptive multi-rate wideband (AMR-WB) codec, which code the speech according to such detected voice information, and further based upon factors such as the network capacity and radio channel conditions of the air interface. The term multi-rate refers to fixed rates being available depending on the mode of operation of the codec. For example, AMR contains eight available bit rates from 4.75 kbit/s to 12.2 kbit/s for speech, while AMR-WB contains nine bit rates from 6.6 kbit/s to 23.85 kbit/s for speech.
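As only an example, the following sketch illustrates what "multi-rate" implies for the size of each coded frame, by converting each fixed bit rate into a number of speech bits per frame. The rate tables are the published AMR and AMR-WB mode sets; the 20 ms frame duration and the helper name bits_per_frame are assumptions of this sketch rather than part of the description above.

```python
# Speech bit rates in bit/s (AMR: eight modes, AMR-WB: nine modes).
AMR_RATES_BPS = [4750, 5150, 5900, 6700, 7400, 7950, 10200, 12200]
AMR_WB_RATES_BPS = [6600, 8850, 12650, 14250, 15850,
                    18250, 19850, 23050, 23850]

FRAME_MS = 20  # assumed frame duration for this sketch


def bits_per_frame(rate_bps: int, frame_ms: int = FRAME_MS) -> int:
    """Number of encoded speech bits carried by one frame at a fixed rate."""
    return rate_bps * frame_ms // 1000


if __name__ == "__main__":
    for name, rates in (("AMR", AMR_RATES_BPS), ("AMR-WB", AMR_WB_RATES_BPS)):
        print(name, [bits_per_frame(r) for r in rates])
```

Integer arithmetic is used so that the per-frame bit counts are exact and not subject to floating point rounding.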
The specifications of the AMR and AMR-WB codecs are respectively available in the 3GPP TS 26.090 and 3GPP TS 26.190 technical specifications for the third generation of the 3GPP wireless systems, and the voice detection aspect of the AMR-WB codec can be found in the 3GPP TS 26.194 technical specification for the third generation of the 3GPP wireless systems, the disclosures of which are incorporated herein.
In such cellular environments, losses may be due to interference in a cellular radio link or router overflow in an IP network, for example. A new fourth generation of the 3GPP wireless system is currently being developed, known as Enhanced Packet Services (EPS), with a primary air interface for EPS being referred to as Long Term Evolution (LTE). As an example,
Even though LTE has been developed in view of the potential transmission interference and failures in cellular or wireless networks, speech frames transported in 3GPP cellular networks will still be subject to erasure, with a small percentage of frames and/or packets being lost during transmission. Erasure is a classification, e.g., by a decoder, under which the decoder assumes that the information of a packet has been lost or is unusable. In the case of the EPS network, for example, frame erasures may still be expected. To address the erased frames, the decoder will typically implement frame error concealment (FEC) algorithms to mitigate the impact of the corresponding lost frames.
Some FEC approaches use only the decoder to address the concealment of the erased frame, i.e., the lost frame. For example, the decoder is aware or is made aware that a frame erasure has occurred, and estimates the contents of the erased frame from known good frames that arrive at the decoder just before and sometimes also just after the erased frame.
A feature of some 3GPP cellular networks is the ability to identify and notify the receiving station of frame erasures that take place. Therefore, the speech decoder knows whether a received speech frame is to be considered a good frame or considered an erased frame. Due to the nature of speech and audio, a small percentage of frame erasures can be tolerated if proper frame erasure mitigation or concealment measures are put in place. Some FEC algorithms may merely substitute noise or silence in place of the lost packet, or apply some type of fading out/in or some type of interpolation, for example, to help make the loss of the frame less noticeable.
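As only an example, a decoder-only concealment of the kind described above may be sketched as follows. An erased frame is represented by None (standing in for the erasure notification from the network), and is replaced by a faded repeat of the most recent output frame, with silence used if no good frame has yet arrived. The function name conceal_stream and the attenuation factor of 0.5 are hypothetical choices for this sketch only.

```python
from typing import List, Optional


def conceal_stream(frames: List[Optional[List[float]]],
                   frame_len: int,
                   attenuation: float = 0.5) -> List[List[float]]:
    """Replace erased frames (None) by a faded repeat of the last output."""
    out: List[List[float]] = []
    last = [0.0] * frame_len  # silence until a good frame arrives
    for frame in frames:
        if frame is None:
            # Erased frame: repeat the previous output, faded further
            # for every consecutive erasure.
            last = [s * attenuation for s in last]
        else:
            last = list(frame)
        out.append(last)
    return out


if __name__ == "__main__":
    stream = [[1.0, 1.0], None, None, [2.0, 2.0]]
    print(conceal_stream(stream, frame_len=2))
```

Because the attenuation is applied to the previous output rather than to the last good frame, a run of consecutive erasures fades progressively toward silence.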
Alternate FEC approaches include having the encoder send specific information in a redundant fashion. For example, the ITU Telecommunication Standardization Sector G.718 (ITU-T G.718) standard, incorporated herein by reference, recommends sending redundant information pertaining to a core encoder output, in an enhancement layer. This enhancement layer could be sent in a different packet from the core layer.
In one or more embodiments, there is provided a terminal, including a coding mode setting unit to set a mode of operation, from plural modes of operation, for coding by a codec of input audio data, and the codec configured to code the input audio data based on the set mode of operation such that when the set mode of operation is a high frame erasure rate (FER) mode of operation the codec codes a current frame of the input audio data according to one frame erasure concealment (FEC) mode of one or more FEC modes, wherein, upon the coding mode setting unit setting the mode of operation to be the High FER mode of operation, the coding mode setting unit selects the one FEC mode, from the one or more FEC modes predetermined for the High FER mode of operation, to control the codec based on an incorporating of redundancy within a coding of the input audio data or as separate redundancy information separate from the coded input audio according to the selected one FEC mode.
The coding mode setting unit may perform the selecting of the one FEC mode from the one or more FEC modes for each of plural frames of the input audio data.
The High FER mode of operation may be a mode of operation for an Enhanced Voice Services (EVS) codec of a 3GPP standard and the codec may be the EVS codec, wherein, when the EVS codec encodes audio of a current frame, the EVS codec adds encoded audio from at least one neighboring frame, including respectively encoded audio of one or more previous frames and/or one or more future frames, to results of the encoding of the current frame in a current packet for the current frame as combined EVS encoded source bits, with the combined EVS encoded source bits being represented in the current packet distinct from any RTP payload portion of the current packet, and wherein the EVS codec may be configured to respectively encode audio from each of the at least one neighboring frame, as the encoded audio, and include the respectively encoded audio from each of the at least one neighboring frame in separate packets from the current packet.
At least one of the one or more FEC modes may control the codec to code the current frame and neighboring frames according to selectively different fixed bit rates and/or different packet sizes, control the codec to code the current frame and neighboring frames according to same fixed bit rates, or control the codec to encode the current frame and neighboring frames according to same packet sizes, wherein each of the at least one FEC mode of the one or more FEC modes controls the codec to divide the current frame into sub-frames, calculate respective numbers of codebook bits for each sub-frame based on the sub-frame being coded according to a bit rate less than the same fixed bit rate, and encode the sub-frame using the same fixed bit rate with the respective number of codebook bits being used to define codewords for the bits of the sub-frame.
The EVS codec may be configured to provide unequal redundancy for bits of the current frame based on the division of the bits of the current frame into the sub-frames, including at least a first and second sub-frame, and to add results of an encoding of the bits of the current frame classified in the first sub-frame to respective one or more neighboring packets differently from any adding of results of an encoding of the bits of the current frame classified into the second sub-frame in neighboring packets.
The EVS codec may be configured to provide unequal redundancy for linear prediction parameters of the current frame based on the division of the bits of the current frame into the sub-frames, including at least a first and second sub-frame, and to add linear prediction parameter results of an encoding of the bits of the current frame classified in a first sub-frame to respective one or more neighboring packets differently from any adding of linear prediction parameter results of an encoding of the bits of the current frame classified into the second sub-frame in neighboring packets.
The codec may be further configured to add a High FER mode flag to the current packet for the current frame to identify the set mode of operation for the current frame as being the High FER mode of operation, wherein the High FER mode flag may be represented in the current packet by a single bit in the RTP payload portion of the current packet. The codec may be further configured to add a FEC mode flag to the current packet for the current frame identifying which one of the one or more FEC modes was selected for the current frame, wherein the FEC mode flag may be represented in the current packet by a predetermined number of bits, as only an example, and wherein the codec codes the FEC mode flag for the current frame with redundancy in packets of different frames. As only an example, in one embodiment, the predetermined number of bits could be 2, though alternative embodiments are equally available.
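As only an example, the signaling of the two flags may be sketched as a small bit field. The single-bit High FER mode flag and the 2-bit FEC mode flag follow the example widths given above; the ordering of the bits within the field, and the function names pack_flags and unpack_flags, are assumptions of this sketch and are not defined by the description.

```python
from typing import Optional, Tuple

FEC_MODE_BITS = 2  # example width from the text; other widths are possible


def pack_flags(high_fer: bool, fec_mode: int = 0) -> int:
    """Return a 3-bit field laid out as [high_fer | fec_mode (2 bits)]."""
    if not 0 <= fec_mode < (1 << FEC_MODE_BITS):
        raise ValueError("fec_mode out of range")
    # The FEC mode bits are only meaningful when the High FER flag is set.
    return (int(high_fer) << FEC_MODE_BITS) | (fec_mode if high_fer else 0)


def unpack_flags(field: int) -> Tuple[bool, Optional[int]]:
    """Read the High FER flag first, then the FEC mode only if it is set."""
    high_fer = bool((field >> FEC_MODE_BITS) & 1)
    fec_mode = field & ((1 << FEC_MODE_BITS) - 1) if high_fer else None
    return high_fer, fec_mode


if __name__ == "__main__":
    field = pack_flags(True, 2)
    print(bin(field), unpack_flags(field))
```

The decoder side mirrors the order described for decoding: the FEC mode flag is interpreted only upon detection of the High FER mode flag.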
The High FER mode of operation may be a mode of operation for an Enhanced Voice Services (EVS) codec of a 3GPP standard and the codec may be the EVS codec, wherein the EVS codec may be further configured to decode a High FER mode flag in at least the current packet to identify the set mode of operation for the current frame as being the High FER mode of operation, and upon detection of the High FER mode flag, decode a FEC mode flag for the current frame from the current packet identifying which one of the one or more FEC modes was selected for the current frame, wherein the coding of the input audio data may be a decoding of the input audio data according to the selected FEC mode, and wherein, when the EVS codec is decoding the input audio data, encoded redundant audio from at least one neighboring frame is parsed from the current packet, including respectively encoded audio of one or more previous frames and/or one or more future frames to the current frame, and a lost frame from the one or more previous frames and/or one or more future frames is decoded based on the respectively parsed encoded redundant audio in the current packet.
Here, the EVS codec may be configured to decode the current frame based on unequal redundancy for bits or parameters for the current frame within the input audio data, wherein the unequal redundancy may be based on a previous classification of the bits or parameters of the current frame into at least first and second categories, and an adding of results of an encoding of the bits or parameters of the current frame classified in the first category to respective one or more neighboring packets as respective redundant information differently from any adding of results of an encoding of the bits or parameters of the current frame classified into the second category in neighboring packets as respective redundant information, wherein the coding of the current frame includes decoding the current frame based on decoded audio of the current frame from the one or more neighboring packets when the current frame is lost.
The High FER mode of operation may be a mode of operation for an Enhanced Voice Services (EVS) codec of a 3GPP standard and the codec may be the EVS codec, wherein the EVS codec may be further configured to decode a High FER mode flag in at least the current packet to identify the set mode of operation for the current frame as being the High FER mode of operation, and upon detection of the High FER mode flag, decode a FEC mode flag for the current frame from the current packet identifying which one of the one or more FEC modes was selected for the current frame, and wherein the coding of the input audio data may be an encoding of the input audio data according to the selected FEC mode, wherein the EVS codec may be configured to decode the current frame based on unequal redundancy for bits or parameters for the current frame within the input audio data, wherein the unequal redundancy may be based on a previous classification of the bits or parameters of the current frame into at least first and second categories, and an adding of results of an encoding of the bits or parameters of the current frame classified in the first category to respective one or more neighboring packets unequally from any adding of results of an encoding of the bits or parameters of the current frame classified into the second category in neighboring packets, and wherein the coding of the current frame includes decoding the current frame based on decoded audio for the current frame from the one or more neighboring packets when the current frame is lost.
Here, the EVS codec may be configured to provide unequal redundancy for bits or parameters of the current frame by classifying the bits of the current frame into at least first and second categories, and to add results of an encoding of the bits of the current frame classified in the first category to respective one or more neighboring packets differently from any adding of results of an encoding of the bits of the current frame classified into the second category in neighboring packets.
The EVS codec may be configured to provide unequal redundancy for linear prediction parameters of the current frame by classifying the bits or parameters of the current frame into at least first and second categories, and to add linear prediction parameter results of an encoding of the bits or parameters of the current frame classified in the first category to respective one or more neighboring packets differently from any adding of linear prediction parameter results of an encoding of the bits or parameters of the current frame classified into the second category in neighboring packets.
When the codec encodes audio of a current frame, the codec may add encoded audio from at least one neighboring frame, including respectively encoded audio of one or more previous frames and/or one or more future frames, to a frame error concealment (FEC) portion of a current packet for the current frame distinct from a codec encoded source bits portion of the current packet including results of the encoding of the current frame, with the codec encoded source bits portion of the current packet and the FEC portion of the current packet each being represented in the current packet distinct from any RTP payload portion of the current packet, and wherein the codec may be configured to respectively encode audio from each of the at least one neighboring frame, as the encoded audio, and include the respectively encoded audio from each of the at least one neighboring frame in separate packets from the current packet.
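As only an example, the unequal redundancy described above may be sketched as a plan that decides, for each parameter of the current frame, how many neighboring packets carry a repeated copy. The parameter names, the two categories, and the number of copies assigned to each category are hypothetical values chosen for illustration; the description does not fix them.

```python
from typing import Dict

# Hypothetical policy: number of extra copies sent for each category.
REDUNDANCY_BY_CATEGORY = {1: 2, 2: 0}


def plan_redundancy(frame_index: int,
                    params: Dict[str, bytes],
                    categories: Dict[str, int],
                    copies: Dict[int, int] = REDUNDANCY_BY_CATEGORY
                    ) -> Dict[int, Dict[str, bytes]]:
    """Map neighboring packet index -> parameters of this frame repeated there."""
    plan: Dict[int, Dict[str, bytes]] = {}
    for name, value in params.items():
        # A parameter in a more important category is repeated in more packets.
        for k in range(1, copies[categories[name]] + 1):
            plan.setdefault(frame_index + k, {})[name] = value
    return plan


if __name__ == "__main__":
    params = {"lpc": b"\x01", "gain": b"\x02"}
    categories = {"lpc": 1, "gain": 2}  # assumed classification
    print(plan_redundancy(10, params, categories))
```

In this sketch the linear prediction parameters of frame 10 are repeated in packets 11 and 12, while the second-category parameter is sent only once, in the packet of its own frame.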
The codec may be configured to provide redundancy for bits of at least one neighboring frame by adding respective results of encodings of the bits of at least one neighboring frame to the current packet as separate distinct FEC portions. Further, the separate packets may not be contiguous.
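As only an example, the placement of redundant encoded frames in the FEC portion of other packets, and the corresponding recovery at the decoder, may be sketched as follows. The dictionary-based packet layout, the function names, and the restriction to copies of previous frames are assumptions of this sketch; the description equally contemplates copies of future frames. An offset of 2, for instance, places the copy in a packet that is not contiguous with the packet of the original frame.

```python
from typing import Dict, List, Optional, Sequence


def build_packets(frames: List[bytes],
                  offsets: Sequence[int] = (1,)) -> List[dict]:
    """Packet n carries frame n plus copies of frames n - k for each offset k."""
    packets = []
    for n, frame in enumerate(frames):
        redundant: Dict[int, bytes] = {
            n - k: frames[n - k] for k in offsets if n - k >= 0
        }
        packets.append({"seq": n, "primary": frame, "redundant": redundant})
    return packets


def recover_frames(received: List[Optional[dict]]) -> List[Optional[bytes]]:
    """Return one entry per frame; None marks a frame that stays erased."""
    out: List[Optional[bytes]] = [None] * len(received)
    for pkt in received:
        if pkt is not None:
            out[pkt["seq"]] = pkt["primary"]
    # Fill erased frames from redundant copies parsed out of other packets.
    for pkt in received:
        if pkt is None:
            continue
        for idx, payload in pkt["redundant"].items():
            if out[idx] is None:
                out[idx] = payload
    return out


if __name__ == "__main__":
    frames = [b"f0", b"f1", b"f2", b"f3"]
    packets = build_packets(frames, offsets=(2,))
    received = [packets[0], None, packets[2], packets[3]]  # packet 1 lost
    print(recover_frames(received))
```

Frames that remain None after recovery would still be handed to a decoder-side concealment algorithm.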
The coding mode setting unit may set the mode of operation to be the FER mode of operation with different, increased, and/or varied redundancy compared to remaining modes of operation of the plural modes of operation for non-FER modes of operation, based upon an analysis of feedback information available to the terminal based upon one or more determined qualities of transmissions outside the terminal and/or a determination of the current frame in the input audio data being more sensitive to frame erasure upon transmission or having greater importance over other frames of the input audio data.
The feedback information may include at least one of: fast feedback (FFB) information, as hybrid automatic repeat request (HARQ) feedback transmitted at a physical layer; slow feedback (SFB) information, as fed back from network signaling transmitted at a layer higher than the physical layer; in-band feedback (ISB) information, as in-band signaling from a codec at a far end; and high sensitivity frame (HSF) information, as a selection by the codec of specific critical frames to be sent in a redundant fashion.
The terminal may receive at least one of the FFB information, the HARQ feedback, the SFB information, and ISB information and perform the analysis of the received feedback information to determine the one or more qualities of transmission outside the terminal.
The terminal may receive information indicating that the analysis of at least one of the FFB information, the HARQ feedback, the SFB information, and ISB information has been previously performed based upon a received flag in a packet indicating that the current frame in the current packet is coded according to the High FER mode or indicating that an encoding of the current packet should be performed by the codec in the High FER mode.
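As only an example, the analysis of feedback information by the coding mode setting unit may be sketched as a simple decision function. The field names, the reduction of the FFB and SFB information to loss-rate estimates, and the 10% threshold are hypothetical simplifications made for this sketch; an actual embodiment may weigh the feedback sources quite differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feedback:
    ffb_loss: Optional[float] = None  # loss estimate from HARQ (fast) feedback
    sfb_loss: Optional[float] = None  # loss estimate from network signaling
    isb_request: bool = False         # far-end codec requests the High FER mode
    hsf: bool = False                 # current frame flagged as high sensitivity


def use_high_fer(fb: Feedback, threshold: float = 0.10) -> bool:
    """Decide whether the High FER mode should be set for the current frame."""
    if fb.isb_request or fb.hsf:
        return True
    estimates = [x for x in (fb.ffb_loss, fb.sfb_loss) if x is not None]
    # Without any loss estimate, stay in the normal modes of operation.
    return bool(estimates) and max(estimates) >= threshold


if __name__ == "__main__":
    print(use_high_fer(Feedback(ffb_loss=0.12)))
    print(use_high_fer(Feedback(sfb_loss=0.02)))
```

Because the decision is evaluated per frame, the mode of operation may change from one frame of the input audio data to the next as the feedback changes.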
The coding mode setting unit may set the mode of operation to be at least one of the one or more FEC modes based upon one of a determined coding type of the current frame and/or neighboring frames, from plural available coding types, or a determined frame classification of the current frame and/or neighboring frames, from plural available frame classifications.
The plural available coding types may include an unvoiced wideband type for unvoiced speech frames, a voiced wideband type for voiced speech frames, a generic wideband type for non-stationary speech frames, and a transition wideband type used for enhanced frame erasure performance. The plural available frame classifications may include an unvoiced frame classification for unvoiced, silence, noise, and voiced offset frames, an unvoiced transition classification for transition from unvoiced to voiced components, a voiced transition classification for transition from voiced to unvoiced components, a voiced classification for voiced frames where the previous frame was also voiced or classified as an onset frame, and an onset classification for voiced onset being sufficiently well established to follow with a voice concealment by a decoder.
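As only an example, a selection of redundancy based on the frame classification may be sketched as a lookup table. The classification names follow the list above, but the number of redundant copies assigned to each classification is purely hypothetical; the sketch merely assumes that onset and transition frames, on which subsequent concealment depends, are protected more strongly than unvoiced frames.

```python
from enum import Enum


class FrameClass(Enum):
    UNVOICED = "unvoiced"
    UNVOICED_TRANSITION = "unvoiced_transition"
    VOICED_TRANSITION = "voiced_transition"
    VOICED = "voiced"
    ONSET = "onset"


# Hypothetical policy: redundant copies requested for each classification.
COPIES_BY_CLASS = {
    FrameClass.UNVOICED: 0,
    FrameClass.UNVOICED_TRANSITION: 1,
    FrameClass.VOICED_TRANSITION: 1,
    FrameClass.VOICED: 1,
    FrameClass.ONSET: 2,
}


def select_redundancy(frame_class: FrameClass) -> int:
    """Return the number of redundant copies for a classified frame."""
    return COPIES_BY_CLASS[frame_class]


if __name__ == "__main__":
    for fc in FrameClass:
        print(fc.value, select_redundancy(fc))
```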
In one or more embodiments, there is provided a codec coding method, including setting a mode of operation, from plural modes of operation, for coding input audio data, coding the input audio data based on the set mode of operation such that when the set mode of operation is a high frame erasure rate (FER) mode of operation the coding includes coding a current frame of the input audio data according to one frame erasure concealment (FEC) mode of one or more FEC modes, wherein, upon the setting of the mode of operation to be the High FER mode of operation, selecting the one FEC mode, from the one or more FEC modes predetermined for the High FER mode of operation, and coding the input audio data based on an incorporating of redundancy within a coding of the input audio data or as separate redundancy information separate from the coded input audio according to the selected one FEC mode.
Additional aspects and/or advantages of one or more embodiments will be set forth in part in the description which follows and, in part, will be apparent from the description, or may be learned by practice of one or more embodiments of the disclosure. One or more embodiments are inclusive of such additional aspects.
These and/or other aspects will become apparent and more readily appreciated from the following description of embodiments, taken in conjunction with the accompanying drawings of which:
Reference will now be made in detail to one or more embodiments, illustrated in the accompanying drawings, wherein like reference numerals refer to like elements throughout. In this regard, embodiments of the present invention may be embodied in many different forms and should not be construed as being limited to embodiments set forth herein, as various changes, modifications, and equivalents of the systems, apparatuses and/or methods described herein will be understood to be included in the invention by those of ordinary skill in the art after embodiments discussed herein are understood. Accordingly, embodiments are merely described below, by referring to the figures, to explain aspects of the present invention.
One or more embodiments relate to the technical field of speech and audio coding wherein frames of encoded speech or audio may be subjected to occasional losses during their transport. Losses can be due to interference in a cellular radio link or router overflow in an IP network, as only examples.
Here, though embodiments may be discussed regarding one or more EVS codecs for future adoption within the fourth generation of the 3GPP wireless system architecture, embodiments are not limited to the same.
3GPP is in the process of standardizing a new speech and audio codec for future cellular or wireless systems. This codec, known as the Enhanced Voice Services (EVS) codec, is being designed to efficiently compress speech and audio into a wide range of encoded bit rates for 3GPP's fourth generation network known as Enhanced Packet Services (EPS). One key feature of EPS is the use of packet-based transport for all services including those of speech and audio, including over the EPS air interface, known as Long Term Evolution (LTE). The EVS codec is designed to operate efficiently in a packet-based environment.
The EVS codec will have the capability to compress audio bandwidths from narrowband up to full-band, in addition to stereo capability, and could be viewed as an eventual replacement for existing 3GPP codecs. The motivations for a new codec in 3GPP include the advancement of speech and audio coding algorithms, expected new applications requiring higher audio bandwidths and stereo, and the migration of speech and audio services from a circuit-switched to a packet-switched environment.
A key aspect of the environment in which the EVS codec will operate, as is the case with previous 3GPP-based networks, is the loss of speech/audio frames as they are transported from the sender to the receiver. This is an expected consequence of transport in a cellular network and is taken into account during the design of speech and audio codecs designed to operate in such environments. The EVS codec is no exception and will also include algorithms to minimize the impact of the loss of frames of speech, or frame erasures. EPS, as well as the legacy 3GPP cellular networks, is designed to maintain a reasonable frame erasure rate for most users during normal conditions.
It is envisioned herein that the EVS codec, such as the EVS codec 26 of
This High FER mode may address frame erasure rates that are at the extreme of operating conditions in LTE, for example. The High FER mode would trade off additional resources (bit rate, delay) in return for better performance in frame erasure rates on the order of 10% or higher.
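As only an example, the trade-off between additional resources and erasure performance may be illustrated numerically. The sketch assumes that packet losses are independent with probability p, which is a simplification since losses on a radio link are often bursty, and that each redundant copy is a full copy carried in a separate packet; the function names and the 20 ms frame duration are likewise assumptions of this sketch.

```python
def residual_fer(p: float, copies: int) -> float:
    """Probability that a frame and all of its redundant copies are lost."""
    return p ** (copies + 1)


def added_delay_ms(max_offset_frames: int, frame_ms: int = 20) -> int:
    """Extra wait at the receiver for the farthest redundant copy to arrive."""
    return max_offset_frames * frame_ms


if __name__ == "__main__":
    for copies in (0, 1, 2):
        print(copies, residual_fer(0.10, copies), added_delay_ms(copies))
```

Under these assumptions, one redundant copy reduces a 10% erasure rate to approximately 1%, at the cost of roughly doubling the transmitted speech bits and adding one frame of delay when the copy is carried in the next packet.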
One or more embodiments are directed to a frame erasure concealment (FEC) framework for this High FER mode of the EVS codec 26, as only an example. One or more embodiments propose a redundancy scheme wherein various encoded parameters of a speech frame are transmitted with varying redundancy based on the importance of the particular parameter. In addition, FEC bits generated at the encoder, but not part of the encoded speech, may also be prioritized and transmitted with varying redundancy. Redundancy is achieved through repetition of some or all of the bits in multiple packets, and depending on embodiment is performed in an unequal manner between frames or within frames.
In audio coding, FEC approaches have previously been implemented by the decoding system independent of the speech codec used to encode or decode the speech or audio. However, a potentially more effective approach, if there is the opportunity, is to design FEC algorithms into the EVS codec 26 during the development phases of the decoder side of the EVS codec 26. On the encoder side, the encoders have also typically only provided redundancies in data independent of the underlying codec being implemented to encode the speech or audio data. Thus, though previous codecs have used decoder-only algorithms to reduce the degradation caused by frame loss, a potentially more effective approach proposed herein, albeit at the additional cost of system bandwidth and potentially delay, is to incorporate FEC algorithms into at least the encoder side of the EVS codec 26, e.g., during the development phases of the encoder side of the EVS codec 26, according to one or more embodiments. One or more embodiments may include FEC algorithms applied by the encoder, as well as appropriate FEC algorithms of the decoder to conceal errors or lost packets, and may also be used in combination with additional frame error concealment algorithms or approaches of the decoder to adequately reconstruct erred bit(s) or lost packets, e.g., for the maintenance of proper timing in the decoded audio data and potentially with audio characteristics that are less noticeable as being erred or lost, or for identical reconstruction. Accordingly, the EVS codec 26 may implement both the previously discussed approaches to frame loss concealment, as well as aspects of the FEC framework discussed herein.
Accordingly, one or more embodiments involve at least encoder-based FEC algorithms, such as in a fourth generation 3GPP wireless system, with one or more embodiments including an encoder and/or decoder that can perform respective encoding and decoding operations.
As an example, the encoding unit 205 digitally encodes input audio based on an FEC algorithm or framework, according to one or more embodiments. Stored codebooks may be selectively used based upon the FEC algorithm applied, such as codebooks stored in the memories of the encoding unit 205 and decoding unit 250. The encoded digital audio may then be transmitted in packets modulated onto a carrier signal and transmitted by an antenna 240. The encoded audio data may also be stored for later playback in the memory 215, which can be non-volatile or volatile memory, for example. As another example, the decoding unit 250 may decode input audio based on an FEC algorithm of one or more embodiments. The audio being decoded by the decoding unit 250 may be provided from the antenna 240, or obtained from the memory 215 as the previously stored encoded audio data. In addition, stored codebooks may be stored in the memories of the encoding unit 205 and decoding unit 250, or in the memory 215, and selectively used based upon the FEC algorithm applied, in one or more embodiments. As noted, depending on embodiment, the encoding unit 205 and the decoding unit 250 each include a memory, such as to store the appropriate codebooks and the appropriate codec algorithm or FEC algorithm. The encoding unit 205 and decoding unit 250 may be a single unit, e.g., together representing a same use of an included processing device as the codec that is used for encoding and/or decoding audio data. In an embodiment, the processing device is configured to perform encoding and/or decoding codec processing in parallel for different portions of input audio or different audio streams.
The terminal 200 further sets forth codec mode setting units 255 which select from plural available modes of operation of the encoding unit 205 and/or decoding unit 250. Here, there could be one codec mode setting unit 255 for both of the encoding unit 205 and the decoding unit 250. The EVS codec can encode both speech and music with the same modes of operation. Further, if the input audio is non-speech audio then the encoding unit 205 or decoding unit 250 may encode or decode, respectively, for music or greater fidelity audio, for example. If the input audio is speech audio, then the codec mode setting unit may determine in which of plural modes of operation the encoding unit 205 or decoding unit 250 should operate to encode or decode, respectively, the audio data. If the codec mode setting units 255 determine that the High FER mode of operation is to be set, then one of the one or more FEC modes will be selected by the codec mode setting units 255 for operating within the High FER mode of operation. Though other modes of operation available for speech coding are not implemented, due to the setting of the mode of operation to the High FER mode of operation, the FEC modes may incorporate the use of the other speech coding modes within the FEC framework discussed herein. The codec mode setting units 255 may also perform parsing of encoded input packets to parse out information identifying whether received encoded audio is speech, the mode of operation for non-speech audio, whether the High FER mode is set, any potential one or more FEC modes of operation for the High FER mode, etc. The codec mode setting units 255 may also add this information to encoded output packets, though this information may also be added by the encoding unit 205, for example, based upon the ultimate encoding that is performed.
In one or more embodiments, the EVS codec 26 includes several modes of operation for speech audio. Each mode of operation will have an associated encoded bit rate, for example. Depending on the bit rate of a particular mode, some are capable of multiple uses to transport a choice of audio bandwidths, or to transport speech encoded with the legacy AMR-WB codec, for example. Examples of these modes of operation for speech audio are demonstrated below in Table 1.
The LTE air interface has been designed with a fixed number of transport block sizes for use in transporting packets of a wide variety of sizes. The smaller of the transport block sizes are designed for the existing 3GPP codecs, e.g., for the third generation 3GPP wireless systems, and may be reused by the EVS codec 26 through judicious selection of the bit rate modes in which the codec will operate. In an embodiment, the EVS codec 26 encodes speech into 20 ms frames, and to minimize end-to-end delay, one frame may be transported per packet, though embodiments are not limited to the same.
Table 1 below illustrates these example speech EVS codec bit rates at the lower end of the bit rate range and the associated transport block sizes used in conjunction with the bit rate modes. The example size of the RTP payload is based upon the existing RTP payload size in the AMR-WB codec, noting that embodiments are not limited to this RTP payload size, or to such a payload being required to be an RTP payload.
The above description is that of a fixed-rate codec, or a codec that encodes all active speech frames at a constant rate. For operation in packet-switched environments, the silence or pauses between speech utterances are encoded and transmitted at a very low rate and in a discontinuous fashion.
As discussed above, speech frames transported in networks are subject to erasure, in particular in 3GPP cellular networks, where a small percentage of the transmitted data is expected to be lost during transmission.
Frame erasure concealment (FEC) algorithms can be broadly classified into two categories: those that are codec independent and those that are codec dependent. Codec independent FEC algorithms are generic enough to be applied without the knowledge of the specific coding algorithms involved, and as a result are not as effective as codec dependent algorithms. Codec dependent algorithms are designed in conjunction with the codec during its development phase, and are typically more effective. One or more embodiments include at least codec dependent FEC algorithms, and codec dependent and independent FEC algorithms.
Frame erasure concealment algorithms herein can also be divided into another set of two broad categories: receiver based and sender based. Receiver based algorithms may be located solely in the speech decoder and/or the jitter buffer of the decoding unit 250 and are triggered by the frame erasure flags that the receiving side generates for the decoder. Error concealment of the decoding unit 250 may include data concealment approaches, including concealment based on the use of silence, white noise, waveform substitution, sample interpolation, pitch waveform replacement, time scale modification, regeneration based on knowledge of neighboring audio characteristics, and/or model-based recovery that matches speech characteristics on either side of an error or loss to a model, as only examples. Simple algorithms include silence or noise substitution in the restored audio for erased frames, or repetition of a previous good frame, with the desire to minimize the user's observance of the packet loss. For a continuing string of frame erasures, the decoder would typically gradually mute the volume of the decoded speech. The more advanced algorithms could take into account the characteristics of a previously received good frame of speech and interpolate the previously received good parameters. If a jitter buffer is involved, there is an opportunity to use good frames of speech on both sides of the erased frame (assuming a single frame erasure) for interpolation purposes.
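As only an illustration of the simplest receiver-based approach described above, the following Python sketch repeats the previous good frame for each erased frame and progressively mutes the output over a continuing string of erasures. The function name conceal_stream, the representation of an erased frame as None, and the attenuation factor of 0.5 are assumptions made for this sketch and are not taken from any codec specification.

```python
from typing import List, Optional, Sequence


def conceal_stream(frames: Sequence[Optional[List[float]]],
                   frame_len: int,
                   attenuation: float = 0.5) -> List[List[float]]:
    """Replace each erased frame (None) with a faded copy of the last good frame.

    `attenuation` is a hypothetical per-erasure gain factor; a real decoder
    would choose its own muting schedule.
    """
    output: List[List[float]] = []
    last_good = [0.0] * frame_len  # silence until a first good frame arrives
    gain = 1.0
    for frame in frames:
        if frame is not None:
            last_good = list(frame)
            gain = 1.0  # a good frame resets the fade
            output.append(list(frame))
        else:
            gain *= attenuation  # mute further with each consecutive erasure
            output.append([gain * sample for sample in last_good])
    return output
```

In this sketch a single erasure is played back at half the level of the previous frame, and a second consecutive erasure at a quarter, so a long burst of erasures fades toward silence rather than repeating a frame at full volume.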
Sender-based FEC algorithms consume more resources but are more powerful than receiver-only techniques. Sender-based FEC algorithms usually involve sending redundant information to the receiver in a side channel for use in reconstructing a lost frame in the case of a frame erasure. The performance of sender-based algorithms is attributable to the ability to de-correlate the transmission of side information from that of the primary channel. In real-time speech coding applications in cellular networks, a partial de-correlation can be achieved by delaying the transmission of the redundant information by one or more frames. This will typically incur a delay to the transmission path of an already delay-constrained system, a delay that may be partially mitigated by the jitter buffer at the receiving end, e.g., the jitter buffer of the decoding unit 250.
According to one or more embodiments, the side or redundancy information that is provided to the receiver may include a complete copy of the original speech frame (full redundancy) or a critical subset of that frame (partial redundancy). Selective redundancy is a technique herein wherein a selected subset of speech frames is sent with side information. The full speech frame or a subset of the frame can be sent in a selective manner. Another approach herein is to encode speech with two separate codecs, one a desired codec for most coding and the other a low-rate low-fidelity codec, according to one or more embodiments. In an example embodiment including multiple renderings, both versions of encoded speech are transmitted to the decoder, with the low-rate version considered the side channel.
In addition, one or more embodiments implement unequal error protection, where encoded bits of a frame are separated into classes, for example, A, B and C based upon the sensitivity of the respective bits or parameters to erasure. Erasure of class A bits or parameters may have a higher impact on voice quality than when class C bits or parameters are lost. The separating of the encoded bits or parameters of the frame into classes may also be referred to as dividing the frame into sub-frames, noting that the use of the term sub-frame does not require the separated encoded bits to all be contiguous for each sub-frame.
The receiver's task in a sender-based FEC system is to identify a frame erasure, and to determine if redundant side information for that erased frame has been received. If that side information is also lost, the situation is similar to that of a receiver-based FEC system and receiver-based FEC algorithms can be applied. If the redundant side information is present, it is used to conceal the lost frame along with any other relevant information that the receiver has available for concealment purposes.
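The receiver-side decision described above can be sketched as a three-way choice, as only an example. The function name choose_concealment, the dictionaries keyed by frame index, and the returned labels are hypothetical names introduced for this illustration.

```python
from typing import Dict, Optional, Tuple


def choose_concealment(frame_index: int,
                       primary: Dict[int, bytes],
                       redundant: Dict[int, bytes]) -> Tuple[str, Optional[bytes]]:
    """Decide how the receiver handles one frame.

    `primary` holds frames whose own packet arrived; `redundant` holds side
    information that arrived in other packets (both are assumed containers).
    """
    if frame_index in primary:
        # No erasure: decode the frame normally.
        return ("decode", primary[frame_index])
    if frame_index in redundant:
        # Frame erased, but its redundant side information survived.
        return ("sender_based", redundant[frame_index])
    # Frame and side information both lost: fall back to receiver-based FEC.
    return ("receiver_based", None)
```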
As introduced above, the EVS codec 26 may include a High FER mode of operation, distinguished from other modes of operation. The High FER mode of operation of the EVS codec 26 may not be a primary mode of operation, but a mode that is chosen when it is known that the user is experiencing a higher than normal rate of frame loss. The terminals 200 and network 140 implement the LTE air interface with use of a hybrid automatic repeat request (HARQ) to transmit blocks of bits at the physical layer level. The success or failure of this mechanism can provide quick feedback as to whether a frame was successfully transmitted through the air interface. Feedback on link quality involving the entire transmission path may typically be slow and could involve either higher layer communication or dedicated in-band signaling between EVS codecs 26 in the case of a mobile-to-mobile call, in one or more embodiments.
One or more embodiments provide the FEC framework for the High FER mode of operation of the EVS codec 26. This framework is valid for fixed rate modes and bandwidths of the EVS codec 26. In an embodiment, this FEC framework is valid for all fixed rate modes and bandwidths of the EVS codec 26. According to one or more embodiments, the framework includes a method for partial and full redundancy transport of fixed-rate encoded frames. In an embodiment, both the partial and the full redundancy transport use fixed size transport blocks during the High FER mode. The transition from a normal mode of operation to the High FER mode may also include a change in transport block size. Embodiments equally include methods using partial, unequal, or full redundancy with fixed size transport blocks with fixed or variable bit rates, and partial, unequal, or full redundancy with variable size transport blocks with fixed or variable bit rates.
According to one or more embodiments, the High-FER mode of the EVS codec 26 of
As noted below, there are two example interaction points with the EVS codec 26 in an EPS environment, e.g., feedback from the decoding unit 150 to the encoding unit 100, so the encoding unit 100 makes the decision of whether to enter the High FER mode of operation, and the decoding unit 150 makes the decision of whether to enter the High FER mode of operation based on the decoding unit 150 monitoring the frame erasure rate, for example. If the decoding unit 150 makes the decision to enter the High FER mode of operation, that decision is transmitted to the encoding unit 100 so the next frames of audio or speech are encoded in the High FER mode of operation. Similarly, with the arrangement of
Depending on embodiment, the EVS codec 26 enters the High FER mode of operation based upon information processed from one or more of four sources: 1) fast feedback (FFB) information: HARQ feedback transmitted at the physical layer; 2) slow feedback (SFB) information: feedback from network signaling transmitted at a layer higher than the physical layer; 3) in-band feedback (ISB) information: in-band signaling from the EVS codec 26 at a far end; and 4) high sensitivity frame (HSF) information: selection by the EVS codec 26 of specific critical frames to be sent in a redundant fashion. Sources (1) and (2) may be independent of the EVS codec 26, while (3) and (4) are dependent on the EVS codec 26 and would require EVS codec 26 specific algorithms.
The decision to enter the High FER mode of operation, HFM, is made by a High FER Mode Decision Algorithm. In one or more embodiments, the coding mode setting units 255 of
As noted above, depending on embodiment, coding mode setting units 255 of
In one or more embodiments, subsequent to the decision to enter a High FER mode of operation, there are a number of sub-modes within the High FER mode of operation that are further chosen from for encoding the audio or speech information. Thereafter, the High-FER mode of operation operates in one or more of the number of sub-modes, and a small number of bits may be used for signaling which of the respective sub-modes has been chosen. This small number of bits may become part of the overhead, and potentially may be reserved bits within a current or future fourth generation 3GPP wireless network, as only an example.
In an embodiment, only one bit in an RTP payload may be required to signal the High FER mode of operation; this one bit can be considered a High FER mode flag. As an example, the RTP payload in the existing AMR-WB has four extra bits (in the octet mode), i.e., bits that are reserved or not assigned. Additionally, once in the High FER mode of operation only a few bits may need to be reserved to signal the sub-modes; these bits can be considered an FEC mode flag. These bits can be protected with redundancy similar to the below redundancy for the class A bits of Table 3, for example.
Sender-based FEC algorithms typically use a side channel to transport redundant information. In the context of the EVS codec 26 and its use in EPS, one or more embodiments make efficient use of the transport blocks defined for the LTE air interface, even though the expected EVS codec does not provide for such side channels. For each mode of operation, the below Table 2 shows a number of additional bits available by selecting the next higher or second next higher transport block size (TBS). In an embodiment, for efficient operation, all of the additional bits may be used.
Robustness to frame loss is achieved by sending redundant bits or parameters associated with frame n in a packet not associated with frame n. For example, frame n encoded bits are sent in packet N, while redundancy bits associated with frame n are sent in packet N+1. This is known as time diversity. If packet N is erased and packet N+1 survives, the redundancy bits can be used to conceal or reconstruct frame n.
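As only an example, the time diversity described above can be sketched in Python for the full redundancy case, where the redundancy carried in packet N+1 is assumed to be a complete copy of frame n. The names Packet, packetize, and receive, and the use of strings to stand in for encoded bits, are assumptions made for this illustration.

```python
from typing import List, NamedTuple, Optional, Set


class Packet(NamedTuple):
    index: int                 # packet number N
    primary: str               # encoded bits of frame N
    redundancy: Optional[str]  # redundancy for frame N-1 (None in packet 0)


def packetize(frames: List[str]) -> List[Packet]:
    """Send frame n in packet n and its redundant copy in packet n+1."""
    return [Packet(n, bits, frames[n - 1] if n > 0 else None)
            for n, bits in enumerate(frames)]


def receive(packets: List[Packet], erased: Set[int]) -> List[Optional[str]]:
    """Return the frames available to the decoder; None marks a lost frame."""
    recovered: List[Optional[str]] = [None] * len(packets)
    for packet in packets:
        if packet.index in erased:
            continue  # this packet never arrived
        recovered[packet.index] = packet.primary
        # Use the redundancy only if the earlier frame's own packet was lost.
        if packet.redundancy is not None and recovered[packet.index - 1] is None:
            recovered[packet.index - 1] = packet.redundancy
    return recovered
```

With a single erased packet every frame is still available at the decoder, whereas two consecutive erasures lose the first of the two frames, because its redundancy travelled in the second erased packet.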
In
As illustrated in
In the
In addition to the placement of the redundancy bits in one or more different neighboring packets, redundancy bits may be selectively included with more or less redundancy based upon their perceptual importance.
Accordingly, in one or more embodiments, a High FER mode of operation for fixed bit rates uses an unequal redundancy protection concept wherein encoded speech bits are prioritized and protected with more, equal, or less redundancy according to their perceptual importance. In an example using 3GPP codecs AMR and AMR-WB, encoded bits are classified into classes, for example class A, B and C where class A bits are the most sensitive to erasure and class C bits are the least sensitive to erasure, according to one or more embodiments. Different mechanisms exist for providing protection of these bits, depending on whether the application uses circuit-switched or packet-switched transport.
According to one or more embodiments, the provision of unequal redundancy protection may be extended to both source encoded bits as well as additional FEC side information. The different classes of bits are transported in a redundant manner using time diversity, with the amount of redundancy depending upon the class of bits.
As illustrated in the embodiment of
With sufficient jitter buffer depth of the decoder, e.g., the decoding unit 250, the decoder has three opportunities to decode the class A bits or parameters, two opportunities to decode the class B bits or parameters and one opportunity to decode the class C bits or parameters. As a result, it takes three consecutive packet erasures to lose the class A bits or parameters and two consecutive packet erasures to lose the class B bits or parameters. As only examples, alternative embodiments may at least include an approach that divides the encoded source bits into more or fewer classes, for example (A, B) or (A, B, C, D), an approach that achieves full redundancy rather than partial redundancy by also redundantly transporting the class C bits, an approach in which, for a desired very high efficiency operation, the class C bits are not transmitted, and an approach where only the class A bits are redundantly transmitted for efficiency purposes.
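The decoding opportunities described above can be checked with a short Python sketch, offered as only an example. It assumes that the class A bits of frame n travel in packets n, n+1 and n+2, the class B bits in packets n and n+1, and the class C bits in packet n only; these packet offsets, and the names COPIES, packet_contents, and recoverable_classes, are assumptions of the sketch. It further assumes that the jitter buffer is deep enough to wait for the later packets.

```python
from typing import Dict, List, Set

# Number of packets that carry each class of bits of a given frame.
COPIES: Dict[str, int] = {"A": 3, "B": 2, "C": 1}


def packet_contents(n: int) -> Dict[str, List[int]]:
    """Frames whose class A, B and C bits are carried in packet n."""
    return {cls: [n - k for k in range(copies) if n - k >= 0]
            for cls, copies in COPIES.items()}


def recoverable_classes(frame: int, erased: Set[int]) -> Set[str]:
    """Classes of `frame` with at least one copy in a packet that was not erased."""
    return {cls for cls, copies in COPIES.items()
            if any((frame + k) not in erased for k in range(copies))}
```

Under these assumptions, erasing only packet 5 still leaves the class A and class B bits of frame 5 recoverable, and the class A bits are lost only when packets 5, 6 and 7 are all erased.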
Accordingly, in one or more embodiments, in addition to including FEC bits for a current frame in previous or subsequent neighboring frames, the bits of a source frame may be categorized based upon priority, such as according to their perceptual importance. Bits or parameters of the source frame that have the greatest perceptual importance, or which would be more noticeable to the human ear if lost, would be redundantly transmitted in more neighboring packets than bits or parameters of the same source frame that are differently categorized to have a lesser perceptual importance.
Side information from the encoder can be part of the encoding algorithm. This side information can also be redundantly transmitted as the other bits or parameters, as discussed in greater detail below.
For concealment purposes, a decoder can benefit not only from redundant copies of the encoded source bits, such as in
As only an example, we use the 6.6 Kbps mode of the EVS codec 26 and the side information from the G.718 codec in the below Table 3 example. The 6.6K mode of the EVS codec 26 contains 132 source bits. In addition we define 2 additional bits for FEC signaling and 16 more bits for FEC side information, similar to G.718. The table below shows an example allocation of the EVS source and FEC bits according to priorities, according to one or more embodiments.
In the example of Table 3 above, there are a total of 45+57+48 bits to be transported. Using the redundancy method outlined above, each packet will contain a total of 3A+2B+C bits, =297 bits+74 RTP payload bits for a total of 371 bits. This fits in the example transport block of size 376 with 5 bits left over. Here, differently classified A, B, and C bits may represent differently classified parameters of the speech, such as linear prediction parameters for when the codec operates as a code-excited linear prediction (CELP) codec based on the mode of operation.
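The arithmetic of this example can be reproduced with a few lines of Python, as only an illustration. The bit counts (45, 57, 48), the 74 RTP payload bits, and the transport block size of 376 are the values stated in the text; the constant and function names are assumptions of the sketch.

```python
from typing import Dict

CLASS_BITS: Dict[str, int] = {"A": 45, "B": 57, "C": 48}  # bits per class, from the text
COPIES: Dict[str, int] = {"A": 3, "B": 2, "C": 1}          # 3A + 2B + C per packet
RTP_PAYLOAD_BITS = 74                                      # value stated in the text
TRANSPORT_BLOCK_BITS = 376                                 # example transport block size


def packet_bits(class_bits: Dict[str, int], copies: Dict[str, int], overhead: int) -> int:
    """Total bits per packet: redundantly carried class bits plus overhead."""
    return sum(class_bits[cls] * copies[cls] for cls in class_bits) + overhead


total = packet_bits(CLASS_BITS, COPIES, RTP_PAYLOAD_BITS)
spare = TRANSPORT_BLOCK_BITS - total
```

Here sum(CLASS_BITS.values()) is 150, i.e., the 132 source bits plus 2 FEC signaling bits and 16 FEC side information bits, while total evaluates to 371 and spare to 5, in agreement with the figures above.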
Accordingly, once the High FER mode of operation has been entered, according to one or more embodiments, there are several sub-modes available depending on the amount of bandwidth available (capacity) and FEC protection (robustness) desired, as only examples. These parameters can be traded off with the amount of intrinsic speech quality required, for example. In one or more embodiments and only as an example, there are six sub-modes, each addressing differing priorities of bandwidth (capacity), quality, and error robustness. The attributes of the various sub-modes are listed in the below Table 4.
In the examples below, we assume only redundancy transport of source bits (represented by class A, B and C) and that there are no dedicated FEC bits. As only a convenience, an RTP payload size of 74 is assumed in all examples.
As illustrated in
As illustrated in
Similarly,
As an alternative to the transport block sizes being different in
Contrary to previous methods used by other 3GPP codecs in circuit-switched transport, e.g., where the multimode AMR and AMR-WB codecs can have their mode switched to lower or raise the bit rate based on channel conditions,
As illustrated in
According to one or more embodiments, another approach that maintains the same transport block size after entering the High FER mode of operation involves a procedure termed codebook ‘robbing’, and may be useful when it is desired to provide a small amount of redundancy similar to sub-mode 1 in Table 4 and
In this embodiment, as only an example, if the EVS codec 26 regular mode of operation is 12.65 Kbps, that mode is maintained as the High FER mode of operation is entered. When in the High FER mode of operation, the encoder, for one of the four sub-frames, computes the codebook bits as if the mode of operation was 8.85 Kbps, even though the mode of operation is actually 12.65 Kbps. The sub-frames may be represented by bits of the frame or parameters representing the audio of the frame, such as with linear prediction parameters of a code-excited linear prediction (CELP) coding produced by the codec, when the codec acts as a CELP codec. As indicated in the above Table 5, 20 bits can be used to define the codewords for the bits of the 1st-3rd sub-frames instead of the 36 bits that would have been required if the codebook bits were calculated according to the 12.65 Kbps mode of operation. The 16 bits that are saved by this codebook ‘robbing’ approach are then used for FEC purposes. Transport of the FEC bits can be performed in the same packet size as in the original mode since there is the same number of bits. As in most of the High FER sub-modes, there is some quality degradation associated with this approach.
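As only an example, the bit accounting of this codebook robbing approach can be sketched in Python. The per-sub-frame codebook sizes of 36 bits (12.65 Kbps mode) and 20 bits (8.85 Kbps mode) and the use of one of four sub-frames are taken from the text above; which sub-frame is robbed (the first one in this sketch) and the names used are assumptions of the illustration.

```python
from typing import Dict, List, Tuple

# Codebook bits per sub-frame for the two modes named in the text.
CODEBOOK_BITS_PER_SUBFRAME: Dict[str, int] = {"12.65": 36, "8.85": 20}


def robbed_allocation(regular_mode: str,
                      robbed_mode: str,
                      num_subframes: int = 4,
                      robbed_subframes: int = 1) -> Tuple[List[int], int]:
    """Return the codebook bits per sub-frame and the bits freed for FEC."""
    regular = CODEBOOK_BITS_PER_SUBFRAME[regular_mode]
    robbed = CODEBOOK_BITS_PER_SUBFRAME[robbed_mode]
    # Assumption: the robbed sub-frames are placed first in the frame.
    per_subframe = [robbed] * robbed_subframes + [regular] * (num_subframes - robbed_subframes)
    fec_bits = num_subframes * regular - sum(per_subframe)
    return per_subframe, fec_bits
```

Calling robbed_allocation("12.65", "8.85") yields an allocation of 20, 36, 36 and 36 codebook bits with 16 bits freed for FEC, so the codebook bits plus the FEC bits still total the 144 codebook bits of the regular mode and the packet size is unchanged.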
Accordingly, different from the approaches of Table 4 and
Alternatively, in operation 1320, the number of codebook bits is calculated for each of the divided or separated bits or parameters, e.g., as classified into the separate classes or divided into separate sub-frames, for a bit rate less than the bit rate of the corresponding mode of operation the frame is being encoded in. Thereafter, in operation 1330, defined codewords based on the calculated number of codebook bits may be encoded.
Still further, in operation 1340, in consideration of the defined codewords, redundant information of the encoded separate classes or sub-frames may be unequally provided in the neighboring packets, similar to
The aforementioned approaches for the High FER mode of operation of
However, in some speech codecs, including the G.718 codec and an expected EVS candidate codec, input speech frames may be encoded with a variety of coding types, depending upon the type of speech. In both the G.718 codec and the EVS candidate codec, the encoded speech frames are further classified for FEC purposes. The classification of these frames is based upon the coding type and position of the speech frame in a sequence of speech frames.
As an example, Table 6 below shows, for wideband speech, the four coding types used in both the G.718 and EVS candidate codecs.
According to the G.718 codec, the coding type information is transmitted in a side channel. However, this side channel is currently not available in the expected EVS codec candidate. To overcome this lack of a side channel, side information similar to the approach of the G.718 codec can be transmitted as FEC bits using the concepts presented above and as shown in Table 3, as only an example. Given a dependence of one frame classification type on an adjacent frame classification type, the five coding types can be signaled with only two bits. According to one or more embodiments, such coding types are shown in the below table 7, as only an example.
As noted above, variations of the packet structure shown in
According to one or more embodiments, considering the approach of
With this approach, four subtypes of packets can be used for redundancy transport, as shown in
In this example, packet type “1” of
Using a selection of a data packet subtype from the four packet subtypes of
In the example of
As shown in
In view of the above,
In addition, as the terminal 200 of
One or more embodiments of the one or more terminals 200 include a landline telephone, a mobile phone, a personal digital assistant, a smartphone, a tablet computer, a set top box, a network terminal, a laptop computer, a desktop computer, server, router, or gateway, for example. The terminal 200 includes at least one processing device, such as a digital signal processor (DSP), Main Control Unit (MCU), or CPU, as only examples.
Depending on embodiment, the wireless network 140 is any of a Wireless Personal Area Network (WPAN) (such as through Bluetooth or IR communications), a Wireless LAN (such as in IEEE 802.11), a Wireless Metropolitan Area Network, any WiMax network (such as in IEEE 802.16), any WiBro network (such as in IEEE 802.16e), a Global System for Mobile Communications (GSM) network, a Personal Communications Service (PCS) network, and any 3GPP network, as only non-limiting examples. The wired network can be any landline and/or satellite based telephone networks, cable television or internet access, fiber-optic communication, waveguide (electromagnetism), any Ethernet communication network, any Integrated Services Digital Network (ISDN) network, any Digital Subscriber Line (DSL) network, such as any ISDN Digital Subscriber Line (IDSL) network, any High bit rate Digital Subscriber Line (HDSL) network, any Symmetric Digital Subscriber Line (SDSL) network, any Asymmetric Digital Subscriber Line (ADSL) network, any local exchange carriers (ILECs) provision Rate-Adaptive Digital Subscriber Line (RADSL) network, any VDSL network, and any switched digital service (non-IP) and POTS system. A source terminal can be communicating with a network 140 that is different from the network 140 the receiving terminal communicates with, and audio data may be communicated through more than two different networks 140 with the terminal being at any point in a path between an audio source and an audio receiver 140. One or more embodiments include any encoding, transferring, storing, and/or decoding of audio data having the FEC information of one or more embodiments, and the audio data may be encased in a packet that is appropriate for the transport protocol carrying the audio data.
The transport protocol may be any protocol capable of supporting an RTP packet or HTTP packet, which may respectively have at least a header, table of contents, and payload data, as only an example, and may alternatively be any TCP protocol, UDP protocol, Cyclic UDP protocol, DCCP protocol, Fiber Channel Protocol, NetBIOS protocol, Reliable Datagram Protocol, RDP, SCTP protocol, Sequenced Packet Exchange (SPX), Structured Stream Transport (SST), VSP protocol, Asynchronous Transfer Mode (ATM), Multipurpose Transaction Protocol (MTP/IP), Micro Transport Protocol (μTP), and/or LTE, as only examples. One or more embodiments include a communication of a Quality of Service (QoS), e.g., to/from the decoding terminal 150 and an encoding terminal 100, and the QoS may be transmitted through any path or protocol, including RTCP or a separate path from the audio data transmission path, as only examples. The QoS may be determined based on error checking code included in the data packet. One or more embodiments include changing a coding bitrate and/or changing of coding modes while applying the FEC approach of one or more embodiments, including changing the FEC mode based on the QoS, for example.
One or more embodiments include using one or more thresholds to compare to the QoS to determine whether to apply the FEC approach of one or more embodiments, and/or what mode of the FEC approach of one or more embodiments should be applied. There may be more than one threshold for each comparison, including a threshold Th1 indicating that the FEC mode needs to be adjusted (decreased or increased) for more reliability if the QoS is less than, or less than or equal to, Th1, and a threshold Th2 indicating that the bit rate or FEC mode needs to be adjusted (decreased or increased) for less reliability if the QoS is greater than, or greater than or equal to, Th2, with Th1 and Th2 being equal in an embodiment.
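As only an example, the two-threshold comparison described above can be sketched in Python. The sketch assumes a numeric QoS for which a larger value means a better link, and a hypothetical integer protection level in which 0 denotes the normal mode and larger values denote more redundancy; the function name adjust_fec_level and the default maximum level are likewise assumptions and are not taken from the embodiments.

```python
def adjust_fec_level(qos: float, level: int, th1: float, th2: float,
                     max_level: int = 6) -> int:
    """Return the next protection level given the measured QoS.

    Level 0 is assumed to mean the normal mode; higher levels are assumed to
    mean more redundancy. The comparisons use <= and >=, one of the options
    named in the text.
    """
    if qos <= th1:
        # Poor link: move toward more reliability.
        return min(level + 1, max_level)
    if qos >= th2:
        # Good link: move toward less reliability (more capacity).
        return max(level - 1, 0)
    # Between the thresholds: keep the current level.
    return level
```

Choosing Th2 greater than Th1 leaves a band in which the level is held, which avoids toggling on small QoS fluctuations; when Th1 and Th2 are equal, this sketch gives priority to the more reliable setting at the threshold itself.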
One or more embodiments include any audio codec used by the encoding terminal 100 and/or the decoding terminal 150 to code the audio data using the FEC approach of one or more embodiments, with the audio coding using one or more algorithms using LPC (LAR, LSP), WLPC, CELP, ACELP, A-law, μ-law, ADPCM, DPCM, MDCT, Bit rate control (CBR, ABR, VBR), and/or Sub-band coding, and may be any codec capable of incorporating the FEC approach of one or more embodiments, including AMR, AMR-WB (G.722.2), AMR-WB+, GSM-HR, GSM-FR, GSM-EFR, G.718, and any 3GPP codec, including any EVS codec, as only examples. In one or more embodiments, the used codec is backward compatible with at least a previous version of the codec. The encoded audio data packet produced by the encoding terminal 100 may include audio data encoded according to more than one codec by encoder-side codec 120, and may include super wideband audio (SWB), which may be a mono signal that is downmixed by the encoder, binaural stereo audio data, which may also be downmixed by the encoder, full band audio (FB) and/or multi-channel audio. One or more embodiments include encoding one or more of the different types of audio data with the same or different bitrates. In one or more embodiments, the decoding terminal 150 is configured similarly to parse such an encoded audio data packet. Accordingly, one or more embodiments of the terminal 200 include a codec that performs a constant, multi-rate, and/or variable encoding, or translation within the communication path, and/or include a codec that performs any scalable coding, such as with multiple layers or enhancement layers, which may have the same sampling rate or different sampling rates. In one or more embodiments, the decoder includes a jitter buffer.
The encoder-side codec 120 may include spatial parameter estimation and mono or binaural downmixing, and one or more of the above listed audio codecs to produce the one or more different audio data, and the decoder-side codec 150 may include corresponding codecs and a mono or binaural upmixing and spatial rendering based on a decoding of the estimated parameters.
In one or more embodiments, any apparatus, system, and unit descriptions herein include one or more hardware devices or hardware processing elements. For example, in one or more embodiments, any described apparatus, system, and unit may further include one or more desirable memories, and any desired hardware input/output transmission devices. Further, the term apparatus should be considered synonymous with elements of a physical system, not limited to a single device or enclosure or all described elements embodied in single respective enclosures in all embodiments, but rather, depending on embodiment, is open to being embodied together or separately in differing enclosures and/or locations through differing hardware elements.
In addition to the above described embodiments, embodiments can also be implemented through computer readable code/instructions in/on a non-transitory medium, e.g., a computer readable medium, to control at least one processing device, such as a processor or computer, to implement any above described embodiment. The medium can correspond to any defined, measurable, and tangible structure permitting the storing and/or transmission of the computer readable code.
The media may also include, e.g., in combination with the computer readable code, data files, data structures, and the like. One or more embodiments of computer-readable media include: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD ROM disks and DVDs; magneto-optical media such as optical disks; and hardware devices that are specially configured to store and perform program instructions, such as read-only memory (ROM), random access memory (RAM), flash memory, and the like. Computer readable code may include both machine code, such as produced by a compiler, and files containing higher level code that may be executed by the computer using an interpreter, for example. The media may also be any defined, measurable, and tangible distributed network, so that the computer readable code is stored and executed in a distributed fashion. Still further, as only an example, the processing element could include a processor or a computer processor, and processing elements may be distributed and/or included in a single device.
The computer-readable media may also be embodied in at least one application specific integrated circuit (ASIC) or Field Programmable Gate Array (FPGA), as only examples, which execute (process like a processor) program instructions.
While aspects of the present invention have been particularly shown and described with reference to differing embodiments thereof, it should be understood that these embodiments should be considered in a descriptive sense only and not for purposes of limitation. Descriptions of features or aspects within each embodiment should typically be considered as available for other similar features or aspects in the remaining embodiments. Suitable results may equally be achieved if the described techniques are performed in a different order and/or if components in a described system, architecture, device, or circuit are combined in a different manner and/or replaced or supplemented by other components or their equivalents.
Thus, although a few embodiments have been shown and described, with additional embodiments being equally available, it would be appreciated by those skilled in the art that changes may be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the claims and their equivalents.
This application claims the benefit of Provisional Application No. 61/474,140, filed Apr. 11, 2011, in the U.S. Patent and Trademark Office, the disclosure of which is incorporated herein by reference.