The present invention relates generally to packet-based communications networks and more particularly to a method and apparatus for enhancing voice intelligibility for telecommunications technologies such as VoIP (Voice-Over-Internet-Protocol) in general, and wireless VoIP in particular, in the presence of packets which arrive too late for normal playout.
The telecommunications industry in North America and Europe is currently preparing the launch of “3G” (third generation) wireless technologies from both the CDMA and GSM worlds. (CDMA and GSM are wireless communication standards fully familiar to those of ordinary skill in the art.) On the CDMA side, the CDMA1xEvDO standard (also familiar to those skilled in the art) can provide wireless data connections that are ten times as fast as a regular modem. However, as the name EvDO (Evolution Data Only or Evolution Data Optimized) implies, voice traffic is still routed through 3G1xCS channels. Naturally, the next step is to move voice traffic over IP on wireless high-speed packet channels.
In order to achieve high quality VoIP (Voice over IP) on wireless packet channels, many challenges lie ahead. IP overhead is typically quite large relative to the speech payload. In addition, the end-to-end delay across a typical communications network needs to be reduced. One way of reducing such end-to-end delay is to minimize the jitter buffer playback delay at the decoder. Unfortunately, a direct effect of minimizing the jitter buffer playback delay is an associated increase in the packet loss rate due to packets that arrive late.
When one or more packets arrive late at the receiving end for playout, a conventional decoder simply discards the late packets, since the decoder has already provided replacement material in accordance with a packet loss concealment (PLC) scheme. (As is well known to those of ordinary skill in the art, PLC schemes are used by most speech decoders in response to lost packets. These schemes use various techniques to attempt to minimize the deleterious effects of missing the speech signal encoded in the lost packet, but most commonly, they use some sort of packet repetition scheme in which the previous packet, possibly modified, is repeated in place of the lost packet.)
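By way of illustration only, the following Python sketch shows the kind of repetition-based concealment step described above; the 160-sample frame length and the 0.8 attenuation factor are assumptions chosen for the example and are not taken from any particular codec.

```python
# Minimal sketch of a repetition-based packet loss concealment (PLC) step.
# The 160-sample frame size and the 0.8 attenuation factor are illustrative
# assumptions; real codecs use more elaborate, pitch-aware concealment.

def conceal_lost_frame(previous_frame, attenuation=0.8):
    """Return a replacement frame derived from the last good frame."""
    if previous_frame is None:
        # Nothing to repeat yet: play out silence.
        return [0.0] * 160
    # Repeat the previous frame, scaled down to avoid audible buzzing
    # when several consecutive frames are concealed.
    return [attenuation * s for s in previous_frame]

if __name__ == "__main__":
    last_good = [0.1] * 160                   # stand-in for the last decoded frame
    replacement = conceal_lost_frame(last_good)
    print(len(replacement), replacement[0])   # 160 samples, attenuated copy
```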
In one prior art technique for use with prediction-based speech coders, however, some improvement over conventional decoders has been obtained by utilizing the late packets to re-synchronize the decoder, so that the error resulting from the late packet (actually, the error resulting from the replacement packet generated by the PLC scheme) does not propagate adversely. Such an approach can significantly improve voice quality over conventional schemes. However, even with this re-synchronizing scheme, the late packets are never actually played out, which means that a part of the sound may be missing. This can lead to a potential intelligibility problem. For example, if packets carrying the phoneme “s” from the word “spy” are lost, the resultant speech may end up sounding like “pie” rather than “spy.” A PLC scheme alone, even with re-synchronization of the decoder using late packets, is unlikely to be able to rectify such a problem.
In accordance with the principles of the present invention, a method and apparatus for enhancing voice intelligibility for network communications of speech such as, for example, VoIP (Voice-Over-Internet-Protocol), in the presence of packets which arrive too late for normal playout is provided. Specifically, according to the principles of the present invention, when a late speech packet is received by a speech decoder, that packet and, if necessary, one or more additional packets subsequent thereto, are played out at a shorter than normal time scale so that the decoder can “catch up” with the encoder. Moreover, this is advantageously done without losing any potentially important sound segments—that is, the late packets are advantageously handled in such a way that phoneme segments are preserved thereby maintaining high voice quality.
In particular, illustrative embodiments of the present invention take advantage of the fact that a voice frame is usually decoded in several sub-frames—typically two or three. Thus, in accordance with one illustrative embodiment of the present invention, one sub-frame from each frame is skipped, while advantageously maintaining the phase relationship between successive frames. For example, if a frame is decoded in two sub-frames, skipping one sub-frame of a given frame effectively plays out the speech for a time period equal to half of the original duration (e.g., 10 milliseconds for a 20 millisecond packet). (Note that this is not the same as playing the entire packet at twice the speed, which would severely distort the pitch of the speech.) If, on the other hand, a frame is decoded in three sub-frames, skipping one sub-frame of a given frame effectively plays out the speech at only two-thirds of its normal time scale. Thus, when a single frame is late, the decoder is advantageously synchronized with the encoder within at most three frames (or, alternately, at a subsequent silence segment).
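A minimal sketch of this sub-frame skipping idea is given below, assuming a hypothetical decoder object with a decode_subframe() method (not an actual codec API) and a policy of decoding every sub-frame, to preserve the decoder state and the phase relationship between frames, while omitting one sub-frame per frame from playout.

```python
# Illustrative sketch of shortened playout via sub-frame skipping.  The
# MockSubframeDecoder and its decode_subframe() method are hypothetical
# stand-ins for a real codec; every sub-frame is still decoded so that the
# internal state (and hence the phase relationship between frames) is kept,
# but one sub-frame per frame is omitted from the audio that is played out.

class MockSubframeDecoder:
    SUBFRAME_SAMPLES = 80    # assumed: 10 ms sub-frames at 8 kHz sampling

    def decode_subframe(self, frame, k):
        # A real decoder would run its synthesis filter here; this mock
        # simply returns a silent buffer of the correct length.
        return [0.0] * self.SUBFRAME_SAMPLES

def play_frame_shortened(decoder, frame, subframes_per_frame):
    """Decode all sub-frames but play out only the first n-1 of them,
    shortening the frame to (n-1)/n of its normal duration."""
    audio_out = []
    for k in range(subframes_per_frame):
        subframe_audio = decoder.decode_subframe(frame, k)
        if k < subframes_per_frame - 1:
            audio_out.extend(subframe_audio)   # played out
        # else: decoded only to keep the decoder state consistent
    return audio_out

if __name__ == "__main__":
    dec = MockSubframeDecoder()
    out = play_frame_shortened(dec, frame=b"...", subframes_per_frame=2)
    print(len(out))   # 80 samples instead of 160: half the normal duration
```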
Suppose now that packet n is not available in time for playout (e.g., the jitter buffer is empty) because packet n is either lost or late, as determined by decision box 11. The illustrative algorithm of
More specifically, if there are packets available in the jitter buffer when the decoder checks at the end of the current cycle, it advantageously retrieves one packet and determines whether the new packet is packet n arriving late, or packet n+1 with packet n having been skipped. If the new packet is in fact packet n+1, packet n may be assumed to be lost, and the decoder therefore decodes packet n+1. If, on the other hand, the new packet is the late packet n, this late packet n is also decoded and played before the decoder proceeds to the next packet n+1. (Note that in this scenario in prior art systems, the late packet n is discarded and the decoder proceeds to the next packet n+1 in order to keep up with the encoder; that is, packet n is never played out. In this manner, the decoder and the encoder remain synchronized, but the speech material in packet n is discarded.)
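The control flow just described might be sketched as follows; the Packet representation, the deque-based jitter buffer, and the conceal/play helpers are all hypothetical stand-ins introduced for illustration, not elements of any particular decoder.

```python
# Sketch of the late/lost decision made at the end of a decoding cycle.
# Packet, the deque-based buffer, and the callback helpers are hypothetical.

from collections import deque, namedtuple

Packet = namedtuple("Packet", ["seq", "payload"])

def handle_next_cycle(buffer, expected_seq, conceal, play_normal, play_shortened):
    """One decoding cycle: returns the next expected sequence number."""
    packet = buffer.popleft() if buffer else None

    if packet is None:
        conceal(expected_seq)          # nothing arrived in time
        return expected_seq            # keep waiting for packet n

    if packet.seq == expected_seq + 1:
        play_normal(packet)            # packet n is deemed lost
        return packet.seq + 1

    if packet.seq == expected_seq:
        play_shortened(packet)         # packet n arrived late: catch up
        return packet.seq + 1

    return expected_seq                # stale/out-of-range packet: ignore

if __name__ == "__main__":
    buf = deque([Packet(seq=2, payload=b"late frame")])
    nxt = handle_next_cycle(
        buf, expected_seq=2,
        conceal=lambda s: print("conceal", s),
        play_normal=lambda p: print("normal", p.seq),
        play_shortened=lambda p: print("shortened", p.seq))
    print("next expected:", nxt)       # prints: shortened 2 / next expected: 3
```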
In order to synchronize the decoder with the encoder, however, the late packet n is advantageously played over a shorter time scale than the original packet length in accordance with the principles of the present invention. Moreover, additional future frames may be played over a shorter time scale as well (as needed to synchronize the decoder). In particular, the number of packets that are shortened depends on the time scale modification factor that is chosen. For example, if frame n arrived late and was played at a time scale of two-thirds of its normal duration, then frames n+1 and n+2 are also advantageously played at two-thirds of their normal durations, so that the decoder is synchronized with the encoder after packet n+2 has been played. (In accordance with other illustrative embodiments of the present invention, if there continue to be late packets, and the delay budget allows it, a decision may be made to allow the packets to play out over their regular time course, effectively allowing more jitter to be accommodated.)
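As a simple numeric illustration of how many frames must be shortened for a given time scale modification factor, consider the sketch below, which assumes 20 millisecond frames and a single accumulated-delay value to be absorbed.

```python
# How many consecutive frames must be played at a shortened time scale to
# absorb a given amount of lag.  The 20 ms frame duration is an assumption
# of this example.

import math

def frames_to_resynchronize(delay_ms, frame_ms=20.0, scale=2.0 / 3.0):
    """Number of consecutive frames that must be played at the given time
    scale (fraction of normal duration) to absorb delay_ms of lag."""
    saved_per_frame = frame_ms * (1.0 - scale)
    return math.ceil(delay_ms / saved_per_frame)

if __name__ == "__main__":
    # One 20 ms frame arrived a full frame late:
    print(frames_to_resynchronize(20.0, scale=2.0 / 3.0))  # 3 frames (n, n+1, n+2)
    print(frames_to_resynchronize(20.0, scale=0.5))        # 2 frames (n, n+1)
```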
Clearly, the decoder cannot wait for frames indefinitely. Thus, a predetermined time limit is advantageously provided in order to determine whether a packet is late or should be deemed to be actually lost. (See the discussion of the time threshold used in decision box 16 above.) Illustratively, this predetermined time limit may be advantageously set to be equal to the length of either 2 or 3 packets (which is typically 40-60 milliseconds). Then, any packets that arrive later than this threshold (i.e., the time limit) may, in accordance with one illustrative embodiment of the present invention, be used to update the decoder's internal state, but these packets are otherwise advantageously discarded (as shown in block 18 of the figure) without being played out. (In other words, if these “too late” packets are in fact used to update the decoder's internal state, any decoder output therefrom is advantageously discarded.)
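The lateness threshold might be applied as sketched below; the 40 millisecond limit follows the two-to-three packet guideline above, and update_state_only() is a hypothetical name for a decoder hook that refreshes internal state without producing playable output.

```python
# Sketch of the "late versus lost" threshold check.  LATE_LIMIT_MS follows
# the 2-3 packet (40-60 ms) guideline given above; update_state_only() is a
# hypothetical decoder hook that updates internal state and discards output.

LATE_LIMIT_MS = 40.0   # assumed: two 20 ms packets

def handle_arriving_packet(packet, lateness_ms, play_shortened, update_state_only):
    if lateness_ms <= LATE_LIMIT_MS:
        # Late but usable: play it over a shortened time scale.
        play_shortened(packet)
    else:
        # Too late to play: use it only to re-synchronize the decoder's
        # internal state, then discard any decoded output.
        update_state_only(packet)

if __name__ == "__main__":
    handle_arriving_packet("pkt", 25.0,
                           play_shortened=lambda p: print("play shortened"),
                           update_state_only=lambda p: print("state update only"))
```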
a) shows a timing sequence diagram for an encoder and a decoder in a case where all packets arrive in time. In particular, the figure shows five packets, all of which arrive in time with small jitter. All packets are decoded and played out normally. This timing sequence diagram applies to both a prior art decoder and to a decoder in accordance with an illustrative embodiment of the present invention.
b) shows a timing sequence diagram for an encoder and a decoder in a case where a packet is missing and not received late. In particular, the figure shows that when a packet is lost (packet 2), a packet loss concealment algorithm fills the gap (represented as 1′ in the figure) by generating a replacement packet based on the previous packet (i.e., packet 1), skips packet 2, and then continues with packet 3 (which has been received in time). Again, this timing sequence diagram applies to both a prior art decoder and to a decoder in accordance with an illustrative embodiment of the present invention.
c) shows a timing sequence diagram for an encoder and a prior art decoder in a case where a packet is received late. In particular, for a prior art decoder, when a packet experiences excessive jitter and misses its sync (as is the case for packet 2 in the figure), a packet loss concealment algorithm again fills the gap (as in
d) shows a timing sequence diagram for an encoder and an illustrative decoder in accordance with an illustrative embodiment of the present invention in the case where a packet is received late. That is, in accordance with an illustrative decoder of the present invention, both the late packet 2 and (timely) packet 3 are advantageously played out, but with a shorter than normal duration, in order that the decoder is synchronized with the encoder (in this case, at packet 4) while not losing any sound that may be critical for intelligibility of the speech. Specifically, in
e) shows a timing sequence diagram for an encoder and an illustrative decoder in accordance with an illustrative embodiment of the present invention in a case where several consecutive packets are received late, and some, but not all, of the late packets are played out. As described above, a maximum timeout threshold is advantageously set so that the decoder does not wait indefinitely for late packets.
And finally,
There are several methods for time scale modification of speech signals which may be used in accordance with various illustrative embodiments of the present invention. In accordance with one illustrative embodiment of the invention, the well-known pitch synchronous overlap add (PSOLA) method may be used. This method yields high resultant voice quality, and it is the most popular signal processing method used in text-to-speech synthesis applications in which time scale modification is employed.
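A full PSOLA implementation requires pitch-mark estimation and pitch-synchronous windowing, which is beyond the scope of a short sketch; the fragment below instead shows a much simpler windowed overlap-add time compression, offered only to make the notion of playing material over a shorter time scale concrete, not as a representation of the PSOLA method itself.

```python
# Simplified (non-pitch-synchronous) overlap-add time compression.
# Offered only as an illustration of time scale modification; PSOLA proper
# would place windows pitch-synchronously to avoid pitch artifacts.

import numpy as np

def ola_time_compress(x, rate=1.5, frame_len=512):
    """Compress signal x in time by 'rate' (>1) via plain overlap-add.
    rate=1.5 plays the material in roughly 2/3 of its original duration."""
    hop_out = frame_len // 2                 # synthesis hop (50% overlap)
    hop_in = int(round(hop_out * rate))      # analysis hop (larger => shorter output)
    window = np.hanning(frame_len)
    n_frames = max(1, (len(x) - frame_len) // hop_in + 1)
    out_len = (n_frames - 1) * hop_out + frame_len
    y = np.zeros(out_len)
    norm = np.zeros(out_len)
    for i in range(n_frames):
        seg = x[i * hop_in : i * hop_in + frame_len]
        if len(seg) < frame_len:
            seg = np.pad(seg, (0, frame_len - len(seg)))
        y[i * hop_out : i * hop_out + frame_len] += window * seg
        norm[i * hop_out : i * hop_out + frame_len] += window
    return y / np.maximum(norm, 1e-8)        # normalize by the window overlap

if __name__ == "__main__":
    sr = 8000
    t = np.arange(sr) / sr                   # one second of a 200 Hz tone
    x = np.sin(2 * np.pi * 200 * t)
    y = ola_time_compress(x, rate=1.5)
    print(len(x), len(y))                    # output is roughly 2/3 as long
```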
In accordance with other illustrative embodiments of the present invention, a simpler alternative (as compared to the use of the PSOLA method) is to merely control the number of sub-frames decoded and played at the decoder. In typical voice codecs (encoder/decoder systems), a voice frame is decoded into either two sub-frames (e.g., in the well known G.729 voice coding standard) or three sub-frames (e.g., in the well known EVRC coding standard). If a frame is decoded into two sub-frames, skipping one sub-frame is effectively the same as playing out the speech for half of the interval. In this case, when a single frame is late, the decoder is synchronized with the encoder after decoding two frames including the late one. If, on the other hand, a frame is decoded into three sub-frames, skipping one sub-frame (out of three) is equivalent to playing it out at two-thirds of its normal time scale. In this case, when a single frame is late, the decoder is synchronized with the encoder after decoding three frames including the late one.
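The relationship between the number of sub-frames per frame and the number of frames needed to re-synchronize can be tabulated directly, as in the brief sketch below; the 20 millisecond frame duration is an assumption of the example, and the codec names serve only as labels for their sub-frame counts.

```python
# Frames needed to recover one full frame (assumed 20 ms) of lag when one
# sub-frame per frame is skipped.  Codec names are labels only.

FRAME_MS = 20.0

for codec, subframes in (("G.729-style (2 sub-frames)", 2),
                         ("EVRC-style (3 sub-frames)", 3)):
    saved_per_frame_ms = FRAME_MS / subframes      # one sub-frame skipped per frame
    frames_to_sync = round(FRAME_MS / saved_per_frame_ms)
    print(f"{codec}: shortened to {subframes - 1}/{subframes} of normal "
          f"duration, synchronized after {frames_to_sync} frames")
```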
It should be noted that all of the preceding discussion merely illustrates the general principles of the invention. It will be appreciated that those skilled in the art will be able to devise various other arrangements, which, although not explicitly described or shown herein, embody the principles of the invention, and are included within its spirit and scope. Furthermore, all examples and conditional language recited herein are principally intended expressly to be only for pedagogical purposes to aid the reader in understanding the principles of the invention and the concepts contributed by the inventors to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the invention, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. It is also intended that such equivalents include both currently known equivalents as well as equivalents developed in the future—i.e., any elements developed that perform the same function, regardless of structure.
Thus, for example, it will be appreciated by those skilled in the art that any flow charts, flow diagrams, state transition diagrams, pseudocode, and the like represent various processes which may be substantially represented in computer readable medium and so executed by a computer or processor, whether or not such computer or processor is explicitly shown. Thus, the blocks shown, for example, in such flowcharts may be understood as potentially representing physical elements, which may, for example, be expressed in the instant claims as means for specifying particular functions such as are described in the flowchart blocks. Moreover, such flowchart blocks may also be understood as representing physical signals or stored physical data, which may, for example, be comprised in such aforementioned computer readable medium such as disc or semiconductor storage devices.