The invention relates generally to audio communication over a network and more particularly to real time communication over the Internet.
The unreliable and stateless nature of today's Internet protocol (IP) results in a best-effort service, i.e., packets may be delivered with arbitrary delay or may even be lost. This quality of service (QoS) limitation is a major challenge for real-time voice communication over IP networks (VoIP). Since excessive end-to-end delay impairs the interactivity of human conversation, active error control techniques such as retransmission cannot be applied. Therefore, any packet loss directly degrades the quality of the reconstructed speech. Furthermore, delay variation (also known as jitter) obstructs the proper reconstruction of the voice packets in their original sequential and periodic pattern.
Considerable efforts have been made in different layers of current communication systems to reduce the delay, smooth the jitter, and recover the loss. On the application layer, receiver-based, passive methods have the advantage that no cooperation of the sender is required. Furthermore, these methods can operate independently of the network infrastructure.
The common way to control the playout of packets is to employ a playout buffer at the receiver to absorb the delay jitter before the audio is output. When using this jitter absorption technique, packets are not played out immediately after reception but held in a buffer until their scheduled playout time (playout deadline) arrives. Though this introduces additional delay for packets arriving early, it allows the playing of packets that arrive with a larger amount of delay. Note that there is a trade-off between the average time that packets spend in the buffer (buffering delay) and the number of packets that have to be dropped because they arrive too late (late loss). Scheduling a later deadline increases the possibility of playing out more packets and results in lower loss rate, but at the cost of higher buffering delay. On the other hand, it is difficult to decrease the buffering delay without significantly increasing the loss rate. Therefore, packet loss in delay-sensitive applications, such as VoIP, is a result of not only packets being dropped over the network, but also delay jitter, which greatly impairs communication quality.
Prior art attempts to solve this problem have mainly focused on improving the trade-off between delay and loss, while trying to compensate for the jitter completely or almost completely within talkspurts. By setting the same fixed time for all the packets in a talkspurt, the output packets are played in the original, continuous, and periodic pattern, e.g., every 20 ms. Therefore, even though there may be delay jitter on the network, the audio is reconstructed without any playout jitter. Other prior art solutions apply adaptive scheduling of audio and other types of multimedia, accepting a certain amount of playout jitter. However, in these methods, the playout time adjustment is made without regard to the audio signal, and the question of how continuous playout of the audio stream can actually be achieved is not addressed. As a result, the playout jitter that can be tolerated has to be small in order to preserve reasonable audio quality.
A method of adapting a playout schedule of a stream of media packets according to network and channel conditions includes (a) setting a playout schedule for a next packet i+1 of the stream upon receiving a current packet i; (b) computing a length of the current packet i based at least in part on a target playout schedule for the next packet i+1; (c) scaling packet i if necessary; (d) outputting packet i; and (e) updating the playout schedule for next packet i+1 based at least in part on the playout schedule and the length of current packet i.
A complete understanding of the present invention may be gained by considering the following detailed description in conjunction with the accompanying drawings, in which:
With reference to
The simplest method, shown in
With improved playout algorithms, the network delay is monitored and the playout time is adaptively adjusted during silence periods. This is based on the observation that, for a typical conversation the audio stream can be grouped into talkspurts separated by silence periods. The playout time of a new talkspurt may be adjusted by extending or compressing the silence periods. This approach is shown in
In a preferred embodiment of the invention, the playout is not only adjusted in silence periods but also within talkspurts. Each individual packet may have a different scheduled playout time, which is set according to the varying delay statistics. The results of the method of the invention are shown in
With reference to
The task of a particular scheduling scheme is to set the maximum allowable total delay dmaxi (playout deadline) for each packet. Note that for the methods shown in
When evaluating different scheduling schemes two quantities are of interest. The first one is the average buffering delay, which is given by
where $\mathcal{P} = \{i \mid t_p^i > t_r^i\}$ is the set of played packets, and $|\mathcal{P}|$ denotes the cardinality of this set. The second quantity is the associated late loss rate, given by $\varepsilon_l = (|\mathcal{R}| - |\mathcal{P}|)/N$, where $\mathcal{R}$ denotes the set of received packets. These two metrics also reflect the above-mentioned trade-off between loss and delay and are used herein to compare the performance of different playout scheduling algorithms.
The link loss rate is defined as $\varepsilon_n = (N - |\mathcal{R}|)/N$. The total loss rate is the sum of the late loss rate and the link loss rate, i.e., $\varepsilon = \varepsilon_n + \varepsilon_l$. The burst loss rate, denoted by $\varepsilon_b$, quantifies the burstiness of the loss. Burst losses are considered separately because they are more difficult to conceal and impair sound quality more severely. Defining the set of packets with two consecutive losses as $\mathcal{B} = \{i \mid t_p^i < t_r^i,\ t_p^{i+1} < t_r^{i+1}\}$, the burst loss rate is given by $\varepsilon_b = |\mathcal{B}|/N$.
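The loss and delay metrics defined above can be computed from a trace of receive and playout times. The following is a minimal sketch, not the implementation used in the experiments. The function name, the use of `None` to mark a packet lost on the link, and the treatment of a link-lost packet as "not played" when counting burst losses are assumptions made for illustration. The average buffering delay is taken here as the mean of playout time minus receive time over the played packets, which is also an assumption.

```python
from typing import Dict, Optional, Sequence


def playout_metrics(t_r: Sequence[Optional[float]],
                    t_p: Sequence[float]) -> Dict[str, float]:
    """Compute delay and loss metrics for one trace.

    t_r[i] is the receive time of packet i, or None if it was lost on the link.
    t_p[i] is the scheduled playout time of packet i.
    """
    n = len(t_p)
    received = [i for i in range(n) if t_r[i] is not None]
    # A packet is played only if it arrived before its playout time.
    played = [i for i in received if t_p[i] > t_r[i]]
    played_set = set(played)

    # Assumed definition: mean time spent in the buffer by played packets.
    avg_buffering = (sum(t_p[i] - t_r[i] for i in played) / len(played)
                     if played else 0.0)
    link_loss = (n - len(received)) / n
    late_loss = (len(received) - len(played)) / n
    # Count packets i for which both i and i+1 were not played.
    burst = sum(1 for i in range(n - 1)
                if i not in played_set and (i + 1) not in played_set)
    return {
        "avg_buffering_delay": avg_buffering,
        "link_loss_rate": link_loss,
        "late_loss_rate": late_loss,
        "total_loss_rate": link_loss + late_loss,
        "burst_loss_rate": burst / n,
    }
```

For example, a five-packet trace in which one packet is lost on the link and two arrive after their deadline yields a link loss rate of 0.2 and a late loss rate of 0.4.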
As described above, adaptive playout can only be achieved when individual voice packets can be scaled without impairing speech quality. The scaling of voice packets is realized by time-scale modification based on the Waveform Similarity Overlap-Add (WSOLA) algorithm, which is an interpolation-based method operating entirely in the time domain. This technique, described in W. Verhelst and M. Roelands, “An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech,” in Proc. ICASSP 93, April 1993, vol. II, pp. 554-557 and incorporated herein, is used to scale long audio blocks. The technique is described in modified form for loss concealment by expanding a block of several packets in A. Stenger, K. Ben Younes, R. Reng, and B. Girod, “A new error concealment technique for audio transmission with packet loss,” in Proc. European Signal Processing Conference, September 1996, vol. 3, pp. 1965-68 and in H. Sanneck, A. Stenger, K. Ben Younes, and B. Girod, “A new technique for audio packet loss concealment,” in IEEE GLOBECOM, November 1996, pp. 48-52 and incorporated herein. The basic idea of WSOLA is to decompose the input into overlapping segments of equal length, which are then realigned and superimposed to form the output with equal and fixed overlap. The realignment leads to a modified output length. For those segments to be added in overlap, their relative positions in the input are found by searching for the maximum correlation between them, so that they have the maximum similarity and the superposition does not cause any discontinuity in the output. Weighting windows are applied to the segments before they are superimposed to generate smooth transitions in the reconstructed output. For speech processing, WSOLA has the advantage of maintaining the pitch period, which results in improved quality compared to resampling.
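The correlation search at the heart of WSOLA can be illustrated with a short sketch. This is only an illustration of the similarity search, not the overlap-add step and not the specific single-packet procedure of the invention. The function name and parameters are hypothetical, and the use of a normalized correlation as the similarity measure is an assumption; the text only states that the maximum correlation is sought.

```python
import math
from typing import Sequence


def find_similar_segment(signal: Sequence[float], template_start: int,
                         seg_len: int, search_start: int,
                         search_len: int) -> int:
    """Return the start index, within the search region, of the segment
    that is most similar to the template segment."""
    template = signal[template_start:template_start + seg_len]
    t_energy = sum(a * a for a in template)
    best_start, best_score = search_start, -math.inf
    for start in range(search_start, search_start + search_len):
        if start + seg_len > len(signal):
            break  # candidate would run past the end of the input
        cand = signal[start:start + seg_len]
        c_energy = sum(b * b for b in cand)
        dot = sum(a * b for a, b in zip(template, cand))
        # Normalized correlation (assumed similarity measure).
        if t_energy > 0 and c_energy > 0:
            score = dot / math.sqrt(t_energy * c_energy)
        else:
            score = 0.0
        if score > best_score:
            best_start, best_score = start, score
    return best_start
```

For a periodic input, the best match lies one period after the template, which is why realigning segments at the found position preserves the pitch period.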
To scale a voice packet, a template segment 300 of constant length in the input 302 is selected, and then a search for a similar segment 304 that exhibits maximum similarity to the template segment 300 is conducted. The start of the similar segment 304 is searched in a search region 310, as is shown in
In
The operations of searching for a similar segment 304 and extending the packet by multiple pitch periods, as described above, constitute one iteration of the method of the invention. If the output speech has not reached the desired length after such operations, additional iterations are performed. In a subsequent iteration, a new template segment of the same length is defined that immediately follows the template in the last iteration. All the defined template segments and the remaining samples following the last template in the input should cover the entire output with the target length. The total number of defined template segments, and hence the number of iterations used here is
where $\lfloor x \rfloor$ represents the greatest integer that is smaller than or equal to $x$, $\hat{L}_i$ is the target length of the output, and $W$ is the length of a segment (either template segment or similar segment).
Packet compression is done in a similar way, as depicted in
Comparing the input 302 and output waveforms 320, 334 in
One advantage of working with a short packet is that the input is divided into fewer template segments so that typically only one or two iterations will yield the output with the desired length. Another important feature of the packet scaling method apparent in
Since the scaling of packets has to be performed in integer multiples of pitch periods, it is not possible to achieve arbitrary packet lengths and playout times as would be desirable for adaptive playout. In other words, the actual resulting packet length, $L_i$, after single packet WSOLA can only approximate the required target length, $\hat{L}_i$. For this reason, expansion and compression thresholds are defined. A packet is compressed to speed up the playout only if the desired playout time precedes the currently scheduled playout time by more than the compression threshold. The compression threshold is usually greater than a typical pitch period. The same strategy is used for the expansion of a packet, except that the two thresholds are asymmetric. To prevent unnecessary late loss, compression is applied conservatively enough to avoid dropping the playout time below a target. On the other hand, smaller expansion thresholds are defined, which might be smaller than a pitch period. In this way, the packet is expanded and the playout is slowed down in order to accommodate a sudden increase of the network delay. This asymmetry results in a hysteresis that can be observed in
To avoid extreme scaling, maximum and minimum target packet lengths are defined and denoted by Lmax and Lmin respectively. In the simulations described herein, Lmax=2.3L0 and Lmin=0.3L0 are used. However, during silence periods, the amount of adjustment made for the playout schedule is not limited by Lmax or Lmin, so that the playout schedule can be modified.
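The threshold and limit rules described above can be sketched as a small decision function. This is a minimal sketch under stated assumptions: the function and parameter names are hypothetical, `scheduled_next` is assumed to be the playout time of the next packet if the current packet keeps its nominal length, and the threshold values in the usage note are illustrative only. The limits 0.3 and 2.3 times the nominal length are the values stated in the text.

```python
def target_packet_length(l0: float, scheduled_next: float,
                         desired_next: float, comp_thresh: float,
                         exp_thresh: float, min_ratio: float = 0.3,
                         max_ratio: float = 2.3) -> float:
    """Return the target length for the current packet.

    l0 is the nominal (unscaled) packet length. All times and lengths
    share one unit, e.g. milliseconds.
    """
    diff = desired_next - scheduled_next  # > 0: play later, < 0: play earlier
    if diff > exp_thresh:
        target = l0 + diff        # expand to slow down the playout
    elif -diff > comp_thresh:
        target = l0 + diff        # compress to speed up the playout
    else:
        target = l0               # inside the hysteresis band: no scaling
    # Limit the scaling to avoid extreme ratios.
    return min(max(target, min_ratio * l0), max_ratio * l0)
```

With a 20 ms packet, a compression threshold of 8 ms, and an expansion threshold of 2 ms (illustrative values), a desired playout time 3 ms later than scheduled triggers expansion, whereas a desired playout time 5 ms earlier leaves the packet unscaled. Because the actual scaling proceeds in whole pitch periods, the returned value is only a target that the scaled packet approximates.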
The general procedure of playout schedule adjustment is described by the algorithm in
Due to the real-time nature of the packet scaling operation and the low-delay requirement, the algorithm has to be computationally efficient. Hence, the complexity of single packet WSOLA is analyzed. Denoting the length of a segment in samples by $W$, and the length of the search region in samples by $R$, the number of operations in one iteration is $WR$ multiplications for the correlation calculation, plus $2W$ multiplications for windowing. For a typical 20 ms packet sampled at 8 kHz, if the maximum scaling ratio is limited to $L_{\max}/L_0 = 2.3$, there would be at most 3 iterations in total according to
Considering typical values of W=80 and R=100, and 3 iterations, the maximum complexity of scaling one packet is approximately 24,000 multiplications and 24,000 additions. Based on experiments on a 733 MHz Pentium III machine, this operation requires approximately 0.35 ms. In practice, scaling by Lmax/L0 is carried out infrequently, and the average load will be significantly lower than the peak load estimated above.
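The peak-load estimate can be checked from the per-iteration operation counts, using only the values stated in the text (W=80, R=100, 3 iterations, 0.35 ms per packet, 20 ms packet interval):

```latex
\begin{align*}
\text{per iteration:} \quad & WR + 2W = 80 \cdot 100 + 2 \cdot 80 = 8160 \text{ multiplications} \\
\text{three iterations:} \quad & 3 \cdot 8160 = 24{,}480 \approx 24{,}000 \text{ multiplications} \\
\text{share of one packet interval:} \quad & 0.35\,\text{ms} \,/\, 20\,\text{ms} = 1.75\%
\end{align*}
```

Even at the peak scaling ratio, the scaling operation therefore occupies under 2% of the real-time budget of one packet on the machine cited.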
The basic operation of the playout scheduler is to set the playout time $t_p$ for each packet. Before packet $i$ can be played out, the length $L_i$ must be computed to perform the required scaling. According to $L_i = t_p^{i+1} - t_p^i$, this implicitly sets the playout time of the next packet to $t_p^{i+1} = t_p^i + L_i$. Therefore, in order to play packet $i$, the arrival and playout time of packet $i+1$ must be estimated, or equivalently, the network delay $d_n^{i+1}$, which is Step 2 in
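The relation between packet lengths and playout times can be sketched in a few lines. The function names are hypothetical, and the sketch covers only the bookkeeping: each packet's output length fixes the playout time of the packet that follows it.

```python
from typing import List, Sequence


def required_length(tp_current: float, tp_next_target: float) -> float:
    """Length packet i must have so that packet i+1 starts at its target time."""
    return tp_next_target - tp_current


def playout_times(first_playout: float,
                  lengths: Sequence[float]) -> List[float]:
    """Accumulate playout times: each packet starts when the previous one ends."""
    times = [first_playout]
    for length in lengths:
        times.append(times[-1] + length)
    return times
```

For instance, if packet i is played at 100 ms and the next packet should start at 125 ms, packet i must be scaled to 25 ms. Stretching a nominal 20 ms packet in this way delays all subsequent playout times by 5 ms.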
According to an embodiment of the invention, the delays are collected for a sliding window of the past $w$ packets. A threshold of the total delay for the next packet, $d_{\max}^{i+1}$, is defined according to the user-specified loss rate, $\hat{\varepsilon}_l$. The next packet must arrive before that deadline in order to be played out. The determination of $d_{\max}^{i+1}$ is described in detail as follows.
The network delays recorded for the past $w$ packets are $d_n^{i-w+1}, d_n^{i-w+2}, \ldots, d_n^{i}$. Their order statistics, i.e., the sorted version of $d_n^{i-w+1}, d_n^{i-w+2}, \ldots, d_n^{i}$, are denoted as $D_1, D_2, \ldots, D_w$, where $D_1 \le D_2 \le \ldots \le D_w$. The probability that the network delay $d_n$ is no greater than the $r$th order statistic $D_r$ is $F(D_r) = P(d_n \le D_r)$, $r = 1, 2, \ldots, w$. It has been shown that
which is the expected probability that a packet with the same delay statistics can be received by Dr.
The method of the invention extends D1≦D2≦ . . . ≦Dw by adding an estimate of the lowest possible delay D0=max(D1−2sd
This solves the problem that the expected playout probability in
cannot reach beyond
or below
Given a user-specified loss rate $\hat{\varepsilon}_l$, the index $\hat{r}$ and corresponding delay $D_{\hat{r}}$ that achieve $\hat{\varepsilon}_l$ with the smallest possible delay are sought. Put differently, the greatest $D_{\hat{r}}$ such that $E[F(D_{\hat{r}})] \le 1 - \hat{\varepsilon}_l$ is sought. From
the corresponding index is given by $\hat{r} = \lfloor (w+1)(1-\hat{\varepsilon}_l) \rfloor$. Given this index, the playout deadline $d_{\max}^{i+1}$ can be approximated by interpolating between $D_{\hat{r}}$ and $D_{\hat{r}+1}$ as $d_{\max}^{i+1} = D_{\hat{r}} + (D_{\hat{r}+1} - D_{\hat{r}})\,[(w+1)(1-\hat{\varepsilon}_l) - \hat{r}]$.
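The order-statistics estimate of the playout deadline can be sketched as follows. This is a simplified illustration: the function name is hypothetical, and where the index falls outside the recorded window the sketch simply clamps to the smallest or largest observed delay. That clamping is an assumption and stands in for the extension of the order statistics by estimated lowest and highest delays.

```python
import math
from typing import Sequence


def playout_deadline(delays: Sequence[float], target_loss: float) -> float:
    """Estimate the deadline for the next packet from the last w delays.

    delays: network delays of the past w packets (any order).
    target_loss: user-specified late loss rate, between 0 and 1.
    """
    d = sorted(delays)            # order statistics D_1 <= ... <= D_w
    w = len(d)
    pos = (w + 1) * (1.0 - target_loss)
    r = math.floor(pos)           # index of the lower order statistic
    if r < 1:
        return d[0]               # assumed clamp: below the smallest sample
    if r >= w:
        return d[-1]              # assumed clamp: above the largest sample
    # Linear interpolation between D_r and D_(r+1); d is 0-indexed.
    return d[r - 1] + (d[r] - d[r - 1]) * (pos - r)
```

With nine recorded delays of 1 through 9 ms and a target loss rate of 25%, the position is 7.5, so the deadline falls halfway between the seventh and eighth order statistics, at 7.5 ms. Lowering the target loss rate moves the deadline toward the largest observed delay.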
Note that, due to the heavy-tailed nature of network delay, the maximum possible value of the delay dni+1 cannot be determined from a limited sample space. Hence, the statistic obtained from the last w samples is often too optimistic. By adding an estimate of the maximum delay, Dw+1, as shown by Dw+1=Dw+2sd
A more accurate estimation of the delay distribution is also possible by using a larger window size w. However, this has the disadvantage that the window-based estimator will adapt less responsively to the varying network delay. Hence, the choice of w determines how fast the algorithm is in adapting to the variation and is subject to a trade-off between accuracy and responsiveness. The experimental values are described herein.
One important feature of the history-based estimation is that the user can specify the acceptable loss rate, $\hat{\varepsilon}_l$, and the algorithm automatically adjusts the delay accordingly. Therefore, the trade-off between buffering delay and late loss can be controlled explicitly. In practice, loss rates of up to 10% can be tolerated when good loss concealment techniques are employed, as discussed in more detail herein.
From network delay traces, it is common to observe sudden high delays (“spikes”) incurred by voice packets, as packets 113-115 show in
Delay spikes usually occur when new traffic enters the network and a shared link becomes congested, in which case past statistics are not useful for predicting future delays. In an embodiment of the invention, a rapid adaptation mode is entered when the present delay exceeds the previous one by more than a threshold value. In rapid adaptation mode, a first packet with unpredictably high delay has to be discarded. After that, the delay estimate is set to the last “spike delay” without considering or further updating the order statistics. Rapid adaptation mode is switched off when the delays drop back to the level observed before the mode came into force, and the algorithm returns to its normal operation, reusing the state of the order statistics from before the spike occurred. This rapid adaptation is only possible when individual packets are scheduled and scaled as in the scheme of the invention. It is often helpful to avoid burst loss as illustrated in
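The switching logic of the rapid adaptation mode can be sketched as a small state machine. This is a minimal sketch under stated assumptions: the class and attribute names are hypothetical, the threshold value in the usage note is illustrative, and the exit test (delay at or below the last delay seen before the spike) is one possible reading of "the level before the mode is in force".

```python
from typing import Optional


class SpikeDetector:
    """Tracks whether the scheduler is in rapid adaptation (spike) mode."""

    def __init__(self, spike_threshold: float) -> None:
        self.spike_threshold = spike_threshold
        self.in_spike = False
        self.prev_delay: Optional[float] = None
        self.pre_spike_delay: Optional[float] = None

    def update(self, delay: float) -> bool:
        """Feed the delay of the newest packet; return True while in spike mode."""
        if self.in_spike:
            # Leave spike mode once delays return to the pre-spike level.
            if delay <= self.pre_spike_delay:
                self.in_spike = False
        elif (self.prev_delay is not None
              and delay - self.prev_delay > self.spike_threshold):
            # Sudden jump relative to the previous packet: enter spike mode.
            self.in_spike = True
            self.pre_spike_delay = self.prev_delay
        self.prev_delay = delay
        return self.in_spike
```

While `update` returns True, a caller would use the most recent delay as the delay estimate and leave its order statistics untouched; once it returns False, the order-statistics estimate resumes from its saved state. With a threshold of 50 ms, the delay sequence 20, 22, 120, 100, 60, 21, 23 ms enters spike mode at the third packet and leaves it at the sixth.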
Even with adaptive playout scheduling, a certain number of packets will arrive after their scheduled playout time or be lost over the network. To recover the lost information as well as possible, various loss recovery techniques have been investigated in the past. A survey studying different trade-offs among algorithm delay, voice quality, and complexity is presented in C. Perkins, O. Hodson, and V. Hardman, “A survey of packet loss recovery techniques for streaming audio,” IEEE Network, vol. 12, no. 5, pp. 40-48, September-October 1998 and incorporated herein. In an embodiment of the invention, a method based on the packet scaling operations is described herein. It is a hybrid of time-scale modification and waveform substitution, which is used to conceal both late loss and link loss by exploiting the redundancy in the audio signal. The good sound quality achieved by time-scale modification has already been demonstrated in the prior art. However, in the prior art an algorithm delay of 2-3 packet times is introduced by using one-sided information and working on a block of 2-3 packets. The method of the invention takes advantage of scaling one packet and using two-sided information by working together with adaptive playout scheduling. This concealment method reduces the delay to one packet time and results in better voice quality. Waveform repetition is built into the method of the invention to repair burst loss. Waveform repetition does not introduce any algorithm delay other than a short merging period; however, it does not provide as good a sound quality as time-scale modification.
The concealment method of the invention is illustrated in
Besides single packet loss, the method of the invention can also handle interleaved loss patterns, or burst loss, as shown in
Finally, the concealment of two or more consecutive packet losses is illustrated in
The algorithm for the proposed loss concealment method is summarized in
The performance of the three playout scheduling schemes shown in
The metrics used for the comparison between different algorithms are the late loss rate, εl, and the average buffering delay, db, as defined herein. These two quantities are of major concern since they are directly associated with the subjective quality, and they are the receiver-controllable components of the total loss rate and total delay respectively.
In the experiments, receiving and playing out of the voice packets are simulated offline using the three playout scheduling schemes under comparison. Delay traces and voice packets are read in from recorded files and different playout algorithms are executed to calculate the playout time, scale voice packets if necessary, and generate output audio. In this way, the playout scheduling schemes are compared under the same conditions. After the simulation of a whole trace, the loss rate and average buffering delay are calculated and plotted in
The trade-off of using different window size w is discussed herein. For the playout scheduling scheme shown in
In all cases, the playout scheduling scheme of the invention results in the lowest buffering delay for the same loss rate and hence outperforms the other two playout scheduling schemes. If targeting a late loss rate of 5% for Trace 1, the average buffering delay is reduced by 40.0 ms when using the playout scheduling scheme of the invention instead of the playout scheduling scheme shown in
On the other hand, if allowing the same buffering delay for different algorithms, the playout scheduling scheme of the invention also results in the lowest loss rate. For the example of Trace 1, if the same 40 ms buffering time is consumed, the total loss rate resulting from playout scheduling scheme of the invention is more than 10% lower than that from the playout scheduling schemes shown in
More importantly, the burst loss rate also drops when using the playout scheduling scheme of the invention. For Trace 1, by using the playout scheduling scheme of the invention, the burst loss rate drops from 12% to 1% at 40 ms buffering delay. As discussed above, burst loss significantly impairs voice quality even if its rate is as low as 5%. Even for Trace 3, where the gain from the playout scheduling scheme of the invention in terms of late loss rate and buffering delay is the smallest, the burst loss rate is 3.9% lower at 10 ms buffering delay.
The performance gain of the playout scheduling scheme of the invention over the playout scheduling schemes shown in
The adaptive playout scheduling scheme of the invention estimates the network delay based on short-term order statistics covering a relatively small window, e.g., the past 35 packets. In the case of delay spikes, a special mode is used to follow delay variations more rapidly. Given the estimate, the playout time of the voice packets is adaptively adjusted to the varying network statistics. In contrast to the prior art, the adjustment is not only performed between talkspurts, but also within talkspurts in a highly dynamic way. Proper reconstruction of continuous playout speech is achieved by scaling individual voice packets using a Single Packet WSOLA algorithm that works on individual packets without introducing additional delay or discontinuities at packet boundaries. Results of subjective listening tests show that the DMOS score for this operation is between inaudible and audible but not annoying. This negligible quality degradation can also be observed for extreme network conditions that require scaling ratios of 35-230% for up to 25% of the packets.
Simulation results based on Internet measurements show that the trade-off between buffering delay and late loss can be improved significantly. For a typical buffering delay of 40 ms, the late loss rate can be reduced by more than 10%. More importantly, the playout scheduling scheme of the invention is very well suited to avoid the loss of multiple consecutive packets, which is particularly important for loss concealment. For example, the burst loss rate can be reduced from 12 to 1% at 40 ms buffering delay, which results in significantly improved audio quality.
An embodiment of the invention includes a WSOLA-based loss concealment technique that combats loss and works together with adaptive playout scheduling. Compared to the prior art, the proposed scheme operates at very low delay, i.e., one packet time, and can handle various loss patterns more effectively. The loss concealment method of the invention can take advantage of the flexibility provided by adaptive playout, and hence the use of time-scale modification for adaptive playout scheduling integrates seamlessly into the overall system.
Although the invention is disclosed by way of the preferred embodiment, the embodiment is not intended to limit the invention. Those skilled in the art can make modifications within the scope and spirit of the invention, which is determined by the claims below. By way of example, the methods of the invention can be used in applications other than VoIP such as streaming audio, streaming video, the audio component of streaming multimedia, and music. In the case of streaming video, those skilled in the art will recognize that media scaling may be accomplished by adjusting a frame rate rather than using time-scale modification.
This application claims priority from U.S. Provisional application Ser. No. 60/362,582 filed on Mar. 5, 2002, the specification of which is herein incorporated.