1. Field of the Invention
Embodiments of the invention relate generally to the field of digital networking communications. More particularly, an embodiment of the invention relates to methods and systems for packet (and/or frame) switched networking that include an adaptive slip double buffer.
2. Discussion of the Related Art
With the advent of Internet Protocol (“IP”), packet-based transmission and routing schemes are becoming ever more popular. It is well accepted that Next Generation Networks (“NGN”s) will be built upon these principles. However, several services, such as real-time voice and voice-band communication, that are well suited for circuit-switched (“TDM”) transmission and switching, have to be supported by this new architecture. VoIP (“voice over IP”) is one such example. The underlying premise of VoIP is that speech, after conversion from analog to digital format, can be packetized and several protocols such as RTP and RTCP (see Ref. [1,2]) have been developed to support the ability of IP networks to provide such real-time services.
One of the premises of NGNs is that the Quality of Experience (QoE) should be at least as good as, or even better than, that provided by the legacy circuit-switched network or PSTN (Public Switched Telephone Network). It is clear that delay is an important parameter in determining the QoE. It is well known that one-way delays that are very large (of the order of 400 ms or larger) are extremely detrimental from the view of subjective quality, making regular full-duplex conversation difficult. At lower one-way delays, the impact of echo is important. The Quality of Experience, for a given level of Echo Return Loss (ERL), drops rapidly with increasing delay.
The overall delay has four principal components. The process of packetization involves buffering information to fill the packet payload and thus introduces delay. The encoding and decoding algorithms, especially in the case of source codecs, require buffering as well. These two delays are often known quantities. The third component is the delay through the network. This delay is difficult to predict a priori since it depends on the physical distance, the number of intermediate packet switches involved in the end-to-end transport of a packet, and the bandwidth of the links between switches (routers). However, for two given end-points there is, in principle, a minimal network delay corresponding to the transit time of the fastest possible packet transmission. Considering that in a pure IP network the transmission path could be different for different packets, and the queuing delay in intermediate nodes is a function of congestion, the delay experienced by packets will be variable, ranging from the minimal delay to infinity (a packet lost in the network is construed as an instance of infinite delay). Some maximum delay threshold must be determined, and packets with delay greater than this maximum are discarded. Received packets are stored in a buffer whose size corresponds to the difference between minimum and maximum delays and so, practically speaking, fast packets are delayed so that the packets can be decoded and converted back to analog signals in a smooth fashion. The notion of play-out, or dejittering, whereby some delay is introduced via a jitter buffer, constitutes the fourth delay component. Clearly, in order to maximize the subjective quality of the call, the play-out buffer, also referred to as the jitter buffer, should be as small as possible.
There is a need for the following embodiments of the invention. Of course, the invention is not limited to these embodiments.
According to an embodiment of the invention, a process comprises: monitoring a fill in an adaptive slip buffer of a digital to analog convertor; adjusting a number of samples that are read from the adaptive slip buffer per page as a function of the fill; and reading the number of samples from the adaptive slip buffer. According to another embodiment of the invention, a machine comprises: a digital to analog convertor including an adaptive slip buffer and a read address generator coupled to the adaptive slip buffer, wherein the read address generator includes an increment control that adjusts a number of samples that are read from the adaptive slip buffer per page as a function of fill of the adaptive slip buffer.
These, and other, embodiments of the invention will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following description, while indicating various embodiments of the invention and numerous specific details thereof, is given for the purpose of illustration and does not imply limitation. Many substitutions, modifications, additions and/or rearrangements may be made within the scope of an embodiment of the invention without departing from the spirit thereof, and embodiments of the invention include all such substitutions, modifications, additions and/or rearrangements.
The drawings accompanying and forming part of this specification are included to depict certain embodiments of the invention. A clearer concept of embodiments of the invention, and of components combinable with embodiments of the invention, and operation of systems provided with embodiments of the invention, will be readily apparent by referring to the exemplary, and therefore nonlimiting, embodiments illustrated in the drawings (wherein identical reference numerals (if they occur in more than one view) designate the same elements). Embodiments of the invention may be better understood by reference to one or more of these drawings in combination with the following description presented herein. It should be noted that the features illustrated in the drawings are not necessarily drawn to scale.
Embodiments of the invention and the various features and advantageous details thereof are explained more fully with reference to the nonlimiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well known starting materials, processing techniques, components and equipment are omitted so as not to unnecessarily obscure the embodiments of the invention in detail. It should be understood, however, that the detailed description and the specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only and not by way of limitation. Various substitutions, modifications, additions and/or rearrangements within the spirit and/or scope of the underlying inventive concept will become apparent to those skilled in the art from this disclosure.
Within this application several publications are referenced by Arabic numerals, or principal author's name followed by year of publication, within parentheses or brackets. Full citations for these, and other, publications may be found at the end of the specification immediately preceding the claims after the section heading References. The disclosures of all these publications in their entireties are hereby expressly incorporated by reference herein for the purpose of indicating the background of embodiments of the invention and illustrating the state of the art.
The invention described herein presents a novel approach to the play-out buffer, providing a method to maintain optimal performance even in situations where the analog-to-digital converter (ADC) and digital-to-analog converter (DAC) have different underlying time-bases. In particular, a method based on controlled slips, a technique that is well known as being efficient in TDM architectures for addressing clock offset, is presented. The invention is an extension of controlled slip behavior. In particular, the slip mechanism is invoked primarily when the speech segment represents a synthetic signal, such as during periods of silence, or if the characteristics of the speech segment are such that the repetition/deletion of a speech sample will have minimal subjective annoyance. It will be seen that an adaptive play-out buffer of the manner described here can form an integral part of an adaptive jitter buffer mechanism. Extensions of the invention include methods to implement adaptive clock control with minimal impact on subjective quality.
Strictly speaking, the term synchronization applies to alignment of time and the term syntonization applies to alignment of frequency, but in the telecommunication environment we often use the term synchronization to refer to either time-alignment, or frequency-alignment, or both. It is generally clear from the context which meaning is appropriate. All real-time communication carried over a digital network requires synchronization to some degree. This can be illustrated by considering the example of delivering a real-time voice signal between two geographically disparate points across a network.
The situation is depicted in
It is important to recognize that at each end the digital-to-analog converter (DAC or D/A) and analog-to-digital converter (ADC or A/D) are usually in the same integrated circuit chip or on the same circuit board and thus the same clock is used for both functions at any one end. In the event that the (digital) signal processing includes echo cancellation, it is mandatory that the same clock be used for both functions else the echo canceller will exhibit sub-par performance and there will be instances of echo leakage and other phenomena that negatively impact the quality of experience. In
The rate at which packets are generated (in the encoder) is determined by the A/D clock, shown as fA in
If the frequencies of the A/D clock (fA) and the D/A clock (fD) are not equal, then slips will occur. The notion of a slip is simple. If fA>fD then the DAC will experience a surfeit of samples; if fA<fD then the DAC will experience a shortage of samples. Rate-adaptation then requires that samples be deleted or inserted. In the circuit-switched architecture of the legacy PSTN, every transmission boundary element is required to extract DS0s from an incoming digital signal (typically a DS1) and reinsert the information into an outgoing digital signal (typically a DS1) that may, potentially, have a different time-base. Therefore slip buffers are very common. To minimize the occurrence of slips, the circuit-switched network is well synchronized and this approach to network synchronization has the derivative benefit that the clock offset between the end points is minimized. In an NGN, where asynchronous transport is employed, there is no guarantee that the clock offset between the end points is negligible.
This phenomenon is not necessarily catastrophic, but the DAC would have to either insert or delete a sample to account for the difference in sampling rates. This insertion or deletion of a block of information, such as a sample, is referred to as a slip. Note that a slip is the result of the difference in sampling rates and is independent of the word length associated with the quantization and compression. The degradation of perceptual quality caused by slips is in addition to any degradation caused by other factors. In conventional circuit-switched telephony, the unit of information inserted or deleted is one sample (or octet). Considering the nominal sampling rate is 8 kHz (one sample every 125 µs), a slip occurs when the accumulated phase difference, expressed in time units, caused by the aforementioned frequency difference, crosses 125 µs. In a packetized scenario, the unit could be as large as a block of speech, typically of duration 20 ms, and thus slips would have an impact similar to packet loss. Note that 20-ms slips occur much less frequently than 125-µs slips but have a greater impact each time they occur. The thrust of the current invention is to get the benefits of single-octet (single-sample) slips in a packet environment. Furthermore, the thrust of the current invention is to get the benefits of a single-octet slip in low-cost implementations such as in customer-premises-equipment (CPE) integrated-access-devices (IADs) and residential gateways.
In the following table we provide the slip rate assuming that the D/A conversion clock uses a free-running oscillator and that the A/D clock is accurate (relative to a Primary Reference Source). Also provided is the typical technology used for that accuracy and a budgetary estimate (order of magnitude) of the cost of the oscillator. The last three columns provide an approximate time between slip occurrences for different block sizes. In generating this table it was assumed that the transmission link between the A/D and D/A is equivalent to a “null” link that adds no impairments such as excessive time-delay variation or transmission errors. The intent is to lay the baseline for the minimum impairment that is introduced by the lack of synchronization between the end-points.
With regard to Table I as shown below, the terminology used includes: XO: Crystal Oscillator; TCXO: Temperature-Compensated Crystal Oscillator; and OCXO: Oven-Controlled Crystal Oscillator.
It should be noted that in carrier-grade equipment such as that used in large telecom service provider networks, the higher quality clock sources (oscillators) are appropriate. For customer-premise equipment, including cases where the application runs on a personal computer, the quality of the oscillator is likely to be of the XO or, at best, TCXO class.
The perceptual degradation in quality caused by slips is very subjective. The impact of an isolated slip in conventional telephony using uncompressed signals (G.711) is typically a “click” that could well be imperceptible, especially if it occurs during a silent interval. However, the perceived quality degrades rapidly as the slip-rate increases. The various digital switches in the PSTN are all provided a PRS (Primary Reference Source) traceable reference and thus have an absolute accuracy of better than 1×10⁻¹¹. A call traversing two distinct timing domains may experience slips corresponding to a worst-case frequency difference of 2×10⁻¹¹. Considering that this equates to one slip every 72 days, we can, for all practical purposes, ignore the phenomenon of slips in the traditional circuit-switched network. In VoIP applications, the end points are quite cost sensitive and therefore it is likely that the quality of oscillator deployed will be represented by one of the last three rows of Table I and clearly slips may play an important role in determining the quality of experience (or lack thereof).
Most studies for evaluating the perceptual quality of compressed voice are done in a controlled environment and consider only a single compression/expansion. Additional study is required to assess the impact of tandem connections wherein there may be multiple conversions of format. Furthermore, the impact of an isolated slip may have a different perceptual effect on synthetic speech, such as that inherent in CELP (Code Excited Linear Prediction) methods for compression, such as G.729 (see Ref. [5]). However, it is quite well accepted that the controlled slip method, where one sample (octet) is deleted/inserted in an “uncompressed” stream, works very well provided that slips do not manifest themselves too often.
If the size of the buffer is large, then the relative frequency of occurrence of buffer overflow/underflow events will be small. However, large buffers imply the introduction of delay and the decrease in quality of experience. Nevertheless, even with large buffers deployed to mitigate the occurrence of buffer overflow/underflow, there are other impairments that arise because of a difference in clock between the end-points. Note that if there is a long-term-average difference in the clock (frequency) at the two end-points then buffer overflow/underflow will occur—the size of the buffer will just determine the interval between these catastrophic events.
The analog signal from the source enters the network and is converted into a digital signal by the analog-to-digital converter (ADC). The network acts as a pipe for these digital words (samples) that are delivered to the far-end digital-to-analog converter (DAC) for conversion back to analog. The conversion points could be in equipment, such as a customer-premise located IAD or PBX or even a Class-5 switch operated by the local telephone company. It is important to recognize that the time-base governing the A/D clock could be different from the time-base governing the D/A clock and thus there could be a difference in the sampling rates associated with these two conversions. That is, in every digital network there is the potential of encountering the pitch modification effect. The frequency difference could be small, of the order of 2 parts in 10¹¹, if the conversion clocks are traceable to a Stratum-1 source (or sources); the frequency difference could be significant, of the order of 64 parts in 10⁶ (64 parts per million or 64 ppm), if the only guarantee given is that the conversion clocks are Stratum-4 quality (Stratum-4 implies an accuracy of no worse than ±32 ppm). The notions of clock strata and the frequency accuracy of different classes of clocks are available in Ref. [6,7].
Clearly, if the conversion rates are different, then the DAC will experience a surfeit of samples if the ADC clock is higher than the DAC clock, or a dearth of samples if the situation is reversed. In fact, such a phenomenon could be manifested at multiple places in the network where there is a connection between two Network Elements with different clock references. Clock offsets of this type are accommodated by the use of slip-buffers. Whereas buffers are always required to compensate for accumulated jitter and wander, it is the effect of a frequency offset that is the primary focus here.
Again for simplicity, we shall assume that there is just one buffer, and that this buffer is associated with the DAC. This buffer will be of a FIFO (first-in-first-out) nature where the data is written into the buffer under control of the ADC clock and read out of the buffer under control of the DAC clock. Clearly, if there is a frequency offset between the two clocks, the buffer will, eventually, either overflow (ADC clock is higher) or underflow (DAC clock is higher). In practice the buffering method is called “double buffering” wherein there are two pages, say A and B, and while data is being written into page A, data is being read out of page B. If there is no frequency offset, then the opposite-page nature of read and write will, for the most part, be preserved. Such a buffer needs to be just big enough to accommodate any relative wander or jitter between the two clocks. It is convenient to describe the size of the buffer in terms of time. For example, if each page is “10 ms”, then each page has 80 octets, assuming a nominal sampling rate of 8 kHz and one octet per sample (e.g. G.711; see Ref. [3] or [4]). The overall buffer is then 20 ms deep, introduces a nominal delay of 10 ms and can accommodate ±10 ms of wander.
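The sizing arithmetic in the preceding paragraph can be sketched as follows. This is only an illustrative calculation under the stated assumptions (8 kHz sampling, one octet per sample); the function names `page_octets` and `double_buffer_summary` are hypothetical and do not appear in the specification.

```python
def page_octets(page_ms: float, sample_rate_hz: int = 8000,
                octets_per_sample: int = 1) -> int:
    """Octets held by one page of the double buffer (hypothetical helper)."""
    samples = round(page_ms * sample_rate_hz / 1000)
    return samples * octets_per_sample


def double_buffer_summary(page_ms: float) -> dict:
    """Depth, nominal delay and wander tolerance of a two-page buffer.

    Assumes the read and write pointers start on opposite pages, so the
    nominal delay and the tolerated wander each equal one page.
    """
    return {
        "octets_per_page": page_octets(page_ms),
        "total_depth_ms": 2 * page_ms,
        "nominal_delay_ms": page_ms,
        "wander_tolerance_ms": page_ms,  # i.e. plus or minus one page
    }


if __name__ == "__main__":
    print(double_buffer_summary(10))
```

For a 10 ms page this reproduces the figures given above: 80 octets per page, a 20 ms overall depth, a 10 ms nominal delay and a tolerance of ±10 ms.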
A good way of visualizing the double-buffer action is to consider a circular buffer as depicted in
One special case is when the buffer is 250 µs deep. This is the notion of a conventional slip buffer. Considering the sampling rate is 8 kHz (125 µs period), a slip buffer has two octets and the overflow/underflow results in either the deletion of an octet or the repetition of an octet. This is called a controlled slip. A slip occurs when the relative time interval error between read and write clocks exceeds 125 µs. For example, if the relative frequency offset between the two clocks is 64 ppm, then a slip will occur approximately every 2 seconds.
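The relationship between frequency offset and slip rate used in this description reduces to dividing the slip unit (expressed in time) by the fractional frequency offset. The following is a minimal sketch of that arithmetic, assuming a constant offset and an otherwise unimpaired link; the name `slip_interval_seconds` is hypothetical.

```python
def slip_interval_seconds(block_seconds: float, fractional_offset: float) -> float:
    """Time for a constant fractional frequency offset to accumulate
    a time interval error equal to one slip unit (hypothetical helper)."""
    if fractional_offset <= 0:
        raise ValueError("fractional_offset must be positive")
    return block_seconds / fractional_offset


SAMPLE_PERIOD_S = 125e-6   # one octet at 8 kHz
FRAME_S = 20e-3            # one 20 ms block of speech

if __name__ == "__main__":
    # 64 ppm offset, single-octet slips
    print(f"{slip_interval_seconds(SAMPLE_PERIOD_S, 64e-6):.2f} s")          # 1.95 s
    # 2 parts in 10^11, single-octet slips, expressed in days
    print(f"{slip_interval_seconds(SAMPLE_PERIOD_S, 2e-11) / 86400:.1f} d")  # 72.3 d
    # 64 ppm offset, 20 ms block slips
    print(f"{slip_interval_seconds(FRAME_S, 64e-6):.1f} s")                  # 312.5 s
```

The first two results agree with the figures quoted in the text (approximately 2 seconds at 64 ppm and about 72 days at 2 parts in 10¹¹); the third illustrates why block-sized slips are rarer but individually more severe.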
In packet-switched networks the delay through the network is not steady as is the case of circuit-switched networks. Therefore, even if the rates of the ADC and DAC are equal, the write clock may, on a short-term basis, appear to be faster (or slower) than the read clock. This requires the use of a buffer that is called a jitter buffer because the term used in the industry for variable transit delay is “jitter”.
Now suppose that the buffer is 200 ms deep. The buffer will overflow (underflow) when the relative time interval error between the two clocks exceeds 100 ms. A 64 ppm offset will thus result in overflows (underflows) approximately every 3000 seconds. Considering that a telephone call rarely lasts 50 minutes, it is clear that overflows (underflows) that are a result of a clock offset may be ignored for all practical purposes. This is one of the (incorrect) reasons given by proponents of IP networks that frequency synchronization is not required because free-running clocks can support VoIP considering that buffer overflows and underflows can be made rare by increasing the size of the buffer.
It should be recognized that:
The thrust of this invention is to use multiple buffers. One buffer is similar to a traditional jitter buffer. The incoming packets are written into the jitter buffer upon arrival. Note that this write operation is tied, effectively, to the ADC clock (of the far end) with additional jitter introduced by the packet delay variation in the network. The packets are extracted (read out) from the jitter buffer using the DSP block (explained later) that is nominally uniform. The rate of packet extraction by the DSP block is determined by the rate of the DAC clock. The second buffer is a double buffer whose size is altered occasionally to adjust the rate at which the jitter buffer data is extracted by the DSP block.
A network based on packet switching and transmission can be quite complex, but the simple model depicted in
In terms of the important processes involved after call set-up, a simple, though accurate, view is depicted in
Speech implementations also allow for voice activity detection (VAD) whereby intervals of silence are detected and transmission bandwidth conserved by just transmitting an indication of silence rather than (encoded) speech sample information. At the receiving end intervals of silence are synthesized using comfort noise.
Whereas packet architectures are superior to circuit-switched architectures in terms of efficiency of bandwidth utilization (because of statistical multiplexing), they have some drawbacks, comparatively speaking. Packet architectures tend to increase latency (average delay) and introduce time delay variations. In order to accommodate time delay variations, jitter buffers are required. That is, buffers of an “elastic” nature are used to account for the burstiness of the packet arrival pattern. In order to avoid loss of data the depth of these buffers must be large enough to span the peak-to-peak time delay variation over the network. Put another way, the size (depth) of the jitter buffer determines the peak-to-peak time delay variation that is allowed for the network and a variation greater than this maximum value will result in packets being lost or used incorrectly.
If the jitter buffer is too small, time delay variation can be the primary cause of packet loss. For normal voice (speech) calls, packet loss concealment (“PLC”) algorithms are available to mitigate the impact of lost packets. However, it should be emphasized that the mitigation of the deleterious impact does not mean that the problem is eliminated. In Ref. [8] a general picture of the impact of packet loss on Quality of Experience is provided. One way to reduce packet loss is to increase the size of the jitter buffer. However, this approach, too, has its drawbacks since the increase in delay caused by increasing the depth of the jitter buffer has a negative impact on the Quality of Experience for voice calls for several reasons (see Ref. [8]). Consequently, most prior art VoIP implementations utilize what is referred to as an adaptive jitter buffer: algorithms have been developed to make the jitter buffer size dynamic, the intent being to keep the buffer just large enough that the loss of packets due to time delay variation is within an acceptable limit, which the ITU-T Recommendations specify as 0.05%. However, adaptive jitter buffer operation in the prior art has a major problem because the proponents of VoIP and adaptive jitter buffers have ignored the effects of lack of clock synchronization.
With the jitter buffer set at its “optimum” size, and providing adequate traffic engineering is in place to provide the real-time services (such as VoIP) the appropriate priority, it is assumed that time delay variation will not cause packet loss except in situations of high traffic congestion. However, the frequency offset between source and destination has two deleterious effects. One is the pitch modification effect that has been described elsewhere (see Ref. [12], for example) and while important, is not the thrust of this invention. The other is a “buffer shrink” effect. If the DAC clock is faster than the ADC clock, the jitter buffer will empty faster than it is being filled. Suppose for example the buffer size is 200 ms. Then, whereas at the start of the call a 200 ms buffer will, theoretically, allow a ±100 ms time delay variation, the emptying of the buffer will affect the lower threshold. Similarly, if the ADC clock is faster than the DAC clock, the buffer will fill faster than it is being emptied and this will affect the upper threshold. For example, a frequency difference of 50 ppm will cause a threshold reduction (either the upper or the lower) of 50 µs every second or 1 ms every 20 seconds. Therefore, whereas the probability of losing a packet due to time delay variation may have been small to nonexistent at the start of the call, the probability increases with the duration of the call and, for calls of long duration, could become appreciable.
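The “buffer shrink” effect can be illustrated numerically. The sketch below assumes a constant frequency offset and no corrective action of any kind, and simply tracks how much of the initial threshold margin remains as the call proceeds; the name `remaining_margin_ms` is hypothetical.

```python
def remaining_margin_ms(initial_margin_ms: float, offset_ppm: float,
                        elapsed_s: float) -> float:
    """Margin left on the affected threshold after elapsed_s seconds.

    A fractional offset of offset_ppm * 1e-6 erodes the margin by that
    many seconds per second of call time. Illustrative only: assumes a
    constant offset and no slips or buffer re-centering.
    """
    drift_ms = offset_ppm * 1e-6 * elapsed_s * 1000.0
    return max(0.0, initial_margin_ms - drift_ms)


if __name__ == "__main__":
    # 200 ms buffer centered at the start of the call: 100 ms of margin.
    for t in (0, 20, 600, 2000):
        print(t, "s ->", round(remaining_margin_ms(100.0, 50.0, t), 3), "ms")
```

With a 50 ppm offset the margin falls by 1 ms in the first 20 seconds, consistent with the figure given above, and under these assumptions the 100 ms margin would be fully consumed after 2000 seconds.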
For voice calls there have been several methods described in the literature to handle such problems. The notion of an adaptive jitter buffer is to modify the size of the jitter buffer to match the existing time-delay variation condition being experienced. Silence-stretching and silence-compressing algorithms have been proposed to delete or expand sections (sub-intervals) of silence. Packet loss concealment algorithms have been developed to insert or delete sections of “non-silence” in such a manner as to reduce (subjectively) any annoying effects of packet loss. The interested reader is pointed to Ref [9,10] for further information on these methods.
In the context of this invention, silence-manipulation and packet loss concealment will be designated as extreme measures. Such measures are necessary because the general behaviour of IP networks is such that packets will be lost in the network for a variety of reasons, including excessive time-delay variation that could lead to jitter buffer overflow or underflow. In the context of this invention, the block labeled “Depacketization, Jitter Buffer, and Signal Processing” in
a. Depacketization. The packets received from the IP network are processed and the information content required for synthesis of the speech signal extracted. As part of the depacketization process, the protocol wrappers are examined to detect whether a packet was lost in the network. If a packet is detected as “lost”, then the packet loss concealment algorithm must be invoked. The current invention does not relate in particular to depacketization algorithms and implementations and most methods prevalent in the state-of-the-art can be employed. Packets contain both time-stamps and sequence numbers (also called frame numbers) and between these two it is straightforward to decide whether there was a missing packet or whether the apparent missing packet was actually a “no_transmission” corresponding to a silence packet. Basically the block labeled “Extract Frames” in
b. Jitter Buffer. The jitter buffer in prior art VoIP decoders comprised a first-in first-out (FIFO) buffer that was large enough to accommodate the time delay variation encountered by packets as they traverse the IP network from source (encoder/packetization) to the destination decoder. In one possible first implementation, the incoming packets are written in as they arrive and read out by the signal processing entity at the play-out rate. That is, the jitter buffer contains the actual received packets with, possibly, the protocol wrappers removed. In a second possible implementation, the incoming packets are treated by the signal processing entity as they arrive and the synthesized speech samples written into the FIFO. In this second implementation the FIFO contains actual speech samples destined for the DAC and is emptied based on the clock of the DAC. The invention disclosed herein applies to both modes of operation. The reason for the first mode of operation is that the jitter buffer module includes the logic required to handle missing packets as well as “silence” when there are really no packets available and the missing packets are synthesized as “silence” based on other information such as time-stamps available in the packets. Specifically, if the sequence numbers of consecutive packets are in correct sequence but the time-stamps indicate a time gap greater than the unit (frames or packets) then it is deemed that there were silent frames/packets between the two in-correct-order-sequence-number packets. In the second mode of operation there must be logic to determine silence packets. The invention described here is applicable to both implementations though, for specificity, the first implementation scheme is assumed.
c. Signal Processing. The information extracted from the received packet is processed with the appropriate algorithms to generate the speech segment. This includes the codec function, echo treatment (if any), comfort noise generation to synthesize silence, and packet loss concealment. The current invention does not relate in particular to the signal processing algorithms and implementation and just about any methods prevalent in the state-of-the-art can be employed.
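The distinction drawn in items a and b between a packet lost in the network and a silence interval can be sketched as follows. This is a simplified illustration rather than the depacketization logic of any particular implementation: the `Packet` type and `classify_gap` function are hypothetical, time-stamps are assumed to be expressed in sample units, and wrap-around of sequence numbers and time-stamps is ignored.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    seq: int         # sequence (frame) number
    timestamp: int   # time-stamp in sample units (assumption)


def classify_gap(prev: Packet, cur: Packet, samples_per_packet: int):
    """Classify what happened between two consecutively received packets.

    Returns (kind, count) where kind is "contiguous", "silence" or "lost"
    and count is the number of frames to synthesize in between.
    """
    seq_gap = cur.seq - prev.seq
    ts_gap = cur.timestamp - prev.timestamp
    if seq_gap > 1:
        # Sequence numbers skipped: packets went missing in the network,
        # so packet loss concealment would be invoked.
        return "lost", seq_gap - 1
    if seq_gap == 1 and ts_gap > samples_per_packet:
        # Sequence is intact but time jumped: sender suppressed silence.
        return "silence", ts_gap // samples_per_packet - 1
    return "contiguous", 0


if __name__ == "__main__":
    first = Packet(seq=1, timestamp=0)
    print(classify_gap(first, Packet(2, 160), 160))   # next frame, no gap
    print(classify_gap(first, Packet(2, 800), 160))   # silent frames between
    print(classify_gap(first, Packet(4, 480), 160))   # packets lost
```

In this sketch a silence gap leads to comfort noise frames and a sequence gap leads to concealment frames; either kind of synthesized frame is a natural candidate for the “actionable” flag described next.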
There is one additional (though optional) requirement on the signal processing implementation arising from the current invention. That is, a flag is associated with each sample (octet) of speech signal recreated/synthesized. This flag is asserted (“true”) if the speech sample generated was part of a silence segment or a segment of signal artificially created via the packet loss concealment algorithm or had some particular characteristic as will be described later. The intent of this flag is to indicate that the sample is “actionable” and will have a minimal subjective annoyance in the event that the sample was deleted (or repeated) as part of the adaptive slip double buffer that is the crux of the invention disclosed herein. If the signal processing entity is incapable of providing such a flag for any reason, then the play-out buffer will, in essence, ignore the flag and assume that all samples are “actionable”.
The notion of “actionable” is that the frame of speech is either representative of silence or is representative of a synthetic frame of speech used for packet loss concealment. In the case where the speech is compressed, the nominal short-term power of the speech is computed by the encoding function (at the analog-to-digital converter side) and communicated to the decoding side (the digital-to-analog converter side). In the case where there is no compression, the decoding side must compute the short-term power of the signal and invoke suitable algorithms to determine whether the current decoded speech is part of a silence interval. Implementing slips introduces degradation, but the degradation is much less consequential if the slip is invoked during periods of silence.
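One way the “actionable” flag might steer a single-sample slip is sketched below. This is an illustrative reading of the preceding two paragraphs, not the disclosed mechanism itself: the function `apply_slip` is hypothetical, and the choice to defer the slip when a page contains no actionable sample is an assumption made for the example.

```python
from typing import List, Optional, Sequence, Tuple


def apply_slip(samples: Sequence[int],
               actionable: Optional[Sequence[bool]],
               direction: int) -> Tuple[List[int], bool]:
    """Delete (direction < 0) or repeat (direction > 0) one sample.

    The slip is placed on the first sample flagged as actionable. If no
    flags are supplied, every sample is treated as actionable. If flags
    are supplied but none is set, the page is returned unchanged and the
    caller may retry on a later page (assumption for this sketch).
    Returns (new_samples, slip_applied).
    """
    out = list(samples)
    if direction == 0 or not out:
        return out, False
    if actionable is None:
        idx = 0
    else:
        idx = next((i for i, flag in enumerate(actionable) if flag), -1)
        if idx < 0:
            return out, False
    if direction < 0:
        del out[idx]                # drop one sample
    else:
        out.insert(idx, out[idx])   # repeat one sample
    return out, True


if __name__ == "__main__":
    page = [1, 2, 3, 4]
    flags = [False, True, False, False]
    print(apply_slip(page, flags, -1))   # ([1, 3, 4], True)
    print(apply_slip(page, flags, +1))   # ([1, 2, 2, 3, 4], True)
```

Passing `None` for the flags models a signal processing entity that cannot provide them, in which case the first sample of the page is used.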
The invention disclosed here deals with an adaptive play-out buffer that is also called an adaptive slip double buffer. This is described below by considering the fundamentals of prior-art and the extensions that comprise the invention.
The underlying principle of retiming is quite straightforward. The play-out buffer can be viewed as a retimer as described here. Fundamentally, the data (speech samples or octets) as well as a clock (“recovered clock”) are recovered from the incoming packet stream. The “recovered clock” is used to write the incoming packets into a buffer that is operated in a FIFO (“first-in-first-out”) mode. The recovered clock in this scenario is a burst mode clock corresponding to packet arrival instants. The data is read out of the buffer using, effectively, the DAC clock (the retiming function generally involves inserting the “reference” clock), and then packets read out from the FIFO can be applied to the signal processing function to generate the digital speech samples for the DAC. The function of “retiming” is illustrated in
Referring to
In
For illustrative purposes, the FIFO can be viewed as a “pipe” with the receive data that is written into the FIFO viewed as being pushed into the pipe. The transmit data that is read out of the FIFO is viewed as being pulled out of the pipe. The arrow designated as “fill position” indicates where the next frame/packet that must be read out is located within the pipe. The action of “write” moves the fill position to the right and each read operation moves the fill position to the left. At the beginning or “reset” situation, the fill position, arbitrarily, points to the middle of the FIFO buffer. With such an arrangement, if the size of the FIFO buffer is 2N units (typically frames), short-term frequency variations, referred to as wander, can be accommodated without loss of data. In particular, up to N unit intervals (“UI”) of time-delay variation in the packet network (2N UI, peak-to-peak) can be absorbed (1 UI is equivalent to 1 frame-time, 10 ms for a frame size of 80 samples if the underlying sample rate is 8 kHz). Needless to say, the arrangement adds transmission delay of, on the average, N UI. A FIFO of this nature can serve as a jitter buffer accommodating up to ±N UI of time-delay variation. For reference, if N is 10, up to ±100 ms of time-delay variation (wander) can be absorbed.
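The arithmetic in the preceding paragraph can be checked with a short sketch. The function names are chosen here for illustration only; the default values (80 samples per frame, 8 kHz sampling) are those of the example above.

```python
def frame_time_ms(frame_samples=80, sample_rate_hz=8000):
    """Duration of one unit interval (one frame) in milliseconds."""
    return 1000.0 * frame_samples / sample_rate_hz


def absorbable_variation_ms(n_units, frame_samples=80, sample_rate_hz=8000):
    """Time-delay variation (+/-) absorbed by a FIFO of 2N units that is
    centered at reset; this is also the average delay the FIFO adds."""
    return n_units * frame_time_ms(frame_samples, sample_rate_hz)


print(frame_time_ms())               # one UI for the example frame size
print(absorbable_variation_ms(10))   # +/- variation absorbed when N is 10
```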
If the (long-term) average frequencies of the write clock and read clock are different, then the buffer will either overflow or underflow. With respect to
One key element of the disclosed invention is the anticipation of overflow/underflow events.
This will be described shortly.
Another key element of the disclosed invention is the manner in which the clock used by the DSP to read frames out of the jitter buffer is derived from the DAC and adjusted to minimize the impact of clock offset between the local DAC and the far-end ADC.
This is described next.
The arrangement for delivering samples to the digital-to-analog converter generally involves a double buffer arrangement. The reason for this buffering is that the actual conversion is done on a sample by sample basis using a “continuous” clock. The DSP unit will usually generate the samples as a block of samples. Thus while the DSP unit generates the correct number of samples per unit time on the average, it generates the samples in bursts.
The most common arrangement for implementing the double-buffer function involves the use of two buffers of equal size, say N octets, and referred to as “Page-A” and “Page-B”. One of the sides (we shall assume the “write” side for specificity and ease of explanation) accesses the buffer(s) sequentially. That is, the write operation first fills buffer Page-A, moves to buffer Page-B, fills it, and returns to filling buffer Page-A. The read operation empties the buffers. Under “normal” conditions, the read side is accessing buffer Page-B while the write side is accessing buffer Page-A, and vice-versa. If the average (long-term) frequencies of the read and write operations are equal, then the accesses will, substantially, remain in opposite buffers. This arrangement is sometimes referred to as a linear buffer arrangement to distinguish it from a circular buffer arrangement. The advantage of a linear buffer arrangement is that the memory allocation for the buffer can be slightly more than the actual page size.
In
Write X[N1] into address defined by write_pointer [write instruction “W2”]
Switch page designation for next block (from page A to page B and vice versa)
In the above block of code N1 is 79 if the block size is 80 since the range of the index I starts at 0. It is assumed that the DSP has computed the requisite sample values and these are available in the array {X[j]; j=0, 1, 2, . . . , N1}. At the start write_pointer identifies the memory address of the first element in the appropriate page (page A or page B). The instruction following the loop (in bold font) is important. What this achieves is the replication of sample value X[N1] into page-A/B location N1 as well as (N1+1). Thus for the case of an 80-sample frame, the same value is placed in the 80th as well as the 81st location of the buffer. Note that this approach is suitable for a linear buffer arrangement; slight modifications are required for circular buffer operation.
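A minimal sketch of the page-write operation just described follows. The loop body (write instruction W1 followed by a pointer increment) is an assumption inferred from the description, and the function name and the use of a Python list as the page are illustrative only.

```python
def write_page(samples):
    """Write one block of samples into a page of len(samples) + 1 slots,
    replicating the last sample into the extra slot."""
    if not samples:
        raise ValueError("at least one sample is required")
    n1 = len(samples) - 1              # 79 for an 80-sample block
    page = [0] * (len(samples) + 1)    # page is one slot larger than the block
    write_pointer = 0                  # start of the page
    for i in range(n1 + 1):
        page[write_pointer] = samples[i]   # write instruction "W1"
        write_pointer += 1
    page[write_pointer] = samples[n1]      # write instruction "W2"
    return page


page = write_page(list(range(80)))
print(len(page), page[79], page[80])
```

The extra slot is what allows the read side to extract one more sample than was generated without reading uninitialized memory.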
Note that the speed of the write operation is determined by the speed at which the DSP operates and not by the rate of the DAC clock. Generally speaking the machine-cycle time of the DSP will be very small and the entire process of writing 81 samples will be a very small fraction of the 10 ms frame duration.
In common implementations in customer-premises equipment such as the Integrated Access Device (IAD), the DAC clock is locally generated and may or may not be locked to a network reference. That is, it may be derived from a free-running oscillator. In either case it is not controlled by the DSP module that is reading out from the jitter buffer, because implementing a clock synchronization method based on jitter-buffer fill (also referred to as adaptive clock recovery) requires expensive oscillators to smooth out the jitter introduced by the packet network, which can be quite large (see Ref. [11] for example).
A key aspect of the invention is to allow the DAC clock to run asynchronously with respect to the far-end ADC clock and yet account for the frequency offset using a slip mechanism that is based on single-sample slips while simulating clock synchronization as applied to the jitter buffer read/write. That is, the intent of this simulated synchronization is to avoid the “buffer shrink” effect, and keeping the data corrupted by a slip small (one sample) minimizes the deleterious effect on end-user quality of experience.
The typical manner in which the “read clock” (the “DAC-derived clock” in
That is, the method of changing the “final value” provides the means to either shorten or lengthen the apparent frame interval corresponding to an apparent increase or decrease of the apparent DAC clock frequency from the viewpoint of the read clock.
Some important points associated with this method:
a. If the final value (N1) is set at 78 then the DAC will extract only 79 out of the 80 valid samples from the page. That is, effectively we have deleted one sample.
b. If the final value (N1) is set at 80 then the DAC will attempt to extract 81 samples from the page though there are only 80 valid samples. To ensure that this is done in a reasonable manner, the buffer size should be 81 and when the DSP writes 80 samples into the buffer it repeats the last sample to get the 81st sample. That is, sample 80 and 81 are the same. Consequently the DAC is repeating one sample.
c. The controlling entity should change the final value occasionally, and only when necessary. At all other times it should be left at the nominal value of 80 (N1 set to 79). Note that the example cited above assumed a frame size of 10 ms and a sampling rate of 8 kHz. The same technique is applicable for different frame sizes and different sampling rates though the specific values such as 79, 80, and 81 for the “final value” will depend on the sampling rate and chosen frame size.
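The effect of the “final value” on the number of samples extracted can be sketched as follows. The function name and the list representation of the page are illustrative assumptions; the page is taken to hold 80 valid samples plus a replicated 81st sample, as in points (a) and (b) above.

```python
def read_page(page, final_value):
    """DAC side: extract the samples at indices 0..final_value inclusive."""
    if not 0 <= final_value < len(page):
        raise ValueError("final_value outside the page")
    return page[: final_value + 1]


# 80 valid samples followed by a copy of the last one (81 slots in total).
page = list(range(80)) + [79]

nominal = read_page(page, 79)    # normal frame, 80 samples
deleted = read_page(page, 78)    # one sample deleted, 79 samples
repeated = read_page(page, 80)   # last sample repeated, 81 samples
print(len(nominal), len(deleted), len(repeated))
```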
The overall adaptive jitter double buffer arrangement can be viewed as a combination of the linear double buffer between the DSP block and the DAC and a “traditional” jitter buffer that stores packets between the depacketization block and the DSP block (as depicted by the FIFO in
A simplified view of the circular buffer arrangement is depicted in
The “Write Add. Gen.” block is quite straightforward. The starting address is provided as the initial value of the write_pointer and then for every write operation the write_pointer is incremented. Since a circular buffer operation is used, modulo-2N arithmetic provides the wrap-around feature. When the write instruction is asserted (see write instruction W1 in pseudo-code; this applies for the jitter buffer as well), the input data is written into the buffer in the location pointed to by the counter contents, “WR_ADD”, and the write_pointer incremented by one. In the case of a linear buffer arrangement software instructions are needed to determine the suitable memory address of the start of the page.
The “page ctrl” block represents a function that monitors whether the read operation and the write operation are happening in the same “location”. If so, then the buffer has overflowed/underflowed and the correct action is to forcibly move one or the other side to the opposite part of the circular buffer. This is achieved by adding “N” (modulo-2N) to the write_address or to the read_address (depending on which is to be forcibly moved to the other page). Minor modifications are required in the case of a linear buffer arrangement.
The block labeled “Δ” generates the difference [“RD_ADD”-“WR_ADD”]=Δn. This difference is done modulo-2N; when the memory addresses are at diametrically opposite parts of the circular buffer the difference will be N; when the addresses are close to each other the difference is small in magnitude; when they coincide the difference is zero. Considering the circular nature of the buffer, defining which is “ahead” is somewhat moot. For our purposes, if Δn is positive the write pointer is “catching up” to the read pointer; if Δn is negative the read pointer is catching up to the write pointer.
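The modulo-2N difference can be sketched as below. The function name is illustrative, and the sketch returns the difference as a value in the range 0 to 2N−1; how that value is mapped to a positive or negative sign is left to the implementation and is not assumed here.

```python
def pointer_delta(rd_add, wr_add, n):
    """Difference RD_ADD - WR_ADD taken modulo 2N (result in 0..2N-1)."""
    return (rd_add - wr_add) % (2 * n)


n = 10                               # circular buffer of 2N = 20 locations
print(pointer_delta(10, 0, n))       # diametrically opposite: N
print(pointer_delta(3, 3, n))        # coincident pointers: 0
print(pointer_delta(0, 19, n))       # wrap-around case: 1
```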
Assigning appropriate actions based on the value of Δn is a key aspect of the invention.
To this end, three “threshold values”, T3>T2>T1 are predetermined. Suitable choices for these thresholds and the underlying rationale are provided later. Comparison of Δn with these determines the “state” of the adaptive play-out buffer; the state then determines the appropriate action.
a. If |Δn−N|≦T1, the state is “green”. The implication of the “green” state is that the read and write pointers are far apart and no special action is taken. Note that the furthest they can be apart is, essentially, N, implying that the read and write operations are occurring in diametrically opposite parts of the circular buffer. The “increment” applied to the read address pointer (discussed shortly) is unity implying the read function operates in a normal manner.
b. If T2>|Δn−N|≧T1, the state is “yellow”. The implication of the “yellow” state is that the read and write pointers are possibly coming closer and some action is required. This takes the form of a controlled slip provided some other conditions are met. A controlled slip involves repeating or deleting one signal sample by changing the final_value in the linear double-buffer arrangement between the DSP and the DAC.
This is achieved by modifying the final_value to (N1+1) as described earlier. As described before, this implies that we are essentially repeating a sample. This is done if Δn is negative (read catching up with write). What this accomplishes is artificially increasing the duration of a “frame” from the viewpoint of accessing the jitter buffer, slowing down the rate at which the read is catching up with the write.
Making the final_value equal to (N1−1) means the read address reads one less location from the page, essentially deleting a sample. This is done if Δn is positive (write catching up with read). What this accomplishes is artificially decreasing the duration of a “frame” from the viewpoint of accessing the jitter buffer, slowing down the rate at which the write is catching up with the read.
The aforementioned conditions for allowing a slip operation to take place are the following:
1) The flag associated with the current read data should be true. The flag will be set true by the signal processing block if the sample is part of an “actionable” signal segment.
2) The timer has expired. The timer is essentially a counter that is reset (to zero) when a slip event (repetition/deletion) has occurred. The timer counter is incremented by the DAC clock and saturates at a (pre-determined) maximum value. Until it reaches this maximum count, slip events are inhibited. The intent is to ensure that slip events are not allowed to occur too close together.
c. If T3>|Δn−N|≧T2, the state is “orange”. The implication of the “orange” state is that the read and write pointers are very likely coming closer and some action is definitely required. This takes the form of a controlled slip provided some other conditions are met. This is similar to the yellow state with relaxed conditions. In particular, the flag is ignored. The timer constraint is the same as for the yellow state.
d. If |Δn−N|>T3, the state is “red”. The implication of the “red” state is that the read and write pointers are very close to each other and some extreme action is required. This takes the form of a controlled slip provided the timer constraint is met (as in the orange state) as well as a request to the signal processing entity that packet loss concealment must be initiated. If Δn is negative a segment of synthetic speech must be inserted; if Δn is positive a segment of speech must be deleted. In the red state we invoke not just effective change of frame duration by 1 DAC sample interval, but an entire frame in addition.
Traditional “adaptive” jitter buffers adjust the size of the jitter buffer to mitigate the occurrence of such overflow/underflow events. That is, the size of the jitter buffer is increased if the trend is seen to be towards such overflow/underflow events. Traditional adaptive algorithms for jitter buffers malfunction because they make no distinction between overflow/underflow that is the result of packet delay variation and the result of a clock offset. The slip function implemented in this algorithm addresses the clock offset issue and therefore if overflow/underflow does occur it is because the jitter buffer is not large enough to accommodate the packet delay variation in the network. Consequently the invention disclosed here will improve and enhance the behavior of conventional adaptive jitter buffer algorithms.
e. If Δn=0, the state is “catastrophic”, implying that the write pointer and read pointer are coincident. This requires very drastic action. This is achieved by re-centering the jitter buffer. That is, the read pointer is “reset” to be diametrically opposite to the write pointer. N packets will be lost or repeated by this action, which is equivalent to jitter buffer overflow/underflow.
Suitable values for the thresholds are T3=(¾)N; T2=(½)N; T1=(¼)N, where the size of the overall jitter buffer is 2N. If the packet loss concealment algorithm is not very sophisticated and thus should be minimally invoked, an alternate set of threshold values is T3=(⅞)N; T2=(¾)N; T1=(⅛)N. These choices are well suited for efficient implementation, and it is unlikely that “optimum” values for these thresholds, derived by any sophisticated means, will provide an efficacy so much greater than this particular set as to warrant an increase in implementation complexity. The value for N, the buffer size, depends on the expected time-delay variation. If we assume a packet size of 10 ms (80 speech samples) a “typical” time-delay variation will be ±10 ms, corresponding to ±0.5 packet duration.
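The state determination of items (a) through (e) can be sketched as follows, using the first set of threshold values as defaults. The function name is illustrative. The treatment of the exact boundary values is an assumption of this sketch: a deviation exactly equal to a threshold is assigned to the more urgent of the two adjacent states.

```python
def classify_state(delta_n, n, t1=None, t2=None, t3=None):
    """Map the modulo-2N pointer difference to a buffer state.

    Defaults are T1 = N/4, T2 = N/2, T3 = 3N/4. Ties at a threshold go to
    the more urgent state (an assumption made for this sketch).
    """
    t1 = n / 4 if t1 is None else t1
    t2 = n / 2 if t2 is None else t2
    t3 = 3 * n / 4 if t3 is None else t3
    if delta_n == 0:
        return "catastrophic"          # pointers coincide
    deviation = abs(delta_n - n)       # 0 when diametrically opposite
    if deviation < t1:
        return "green"
    if deviation < t2:
        return "yellow"
    if deviation < t3:
        return "orange"
    return "red"


for delta in (8, 6, 12, 2, 0):         # N = 8, so T1, T2, T3 = 2, 4, 6
    print(delta, classify_state(delta, 8))
```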
A suitable value for the timer is the closest power of 2 less than the packet size and in this case is 64. With this choice of timer, the slip events will be constrained to no more than twice per packet duration.
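The timer can be sketched as a saturating counter, as below. The class and function names are illustrative, and starting the counter at zero on creation is an assumption of this sketch; the description specifies only that the counter is reset when a slip occurs.

```python
def largest_power_of_two_below(value):
    """Largest power of 2 strictly less than value (64 for value = 80)."""
    if value < 2:
        raise ValueError("value must be at least 2")
    power = 1
    while power * 2 < value:
        power *= 2
    return power


class SlipTimer:
    """Saturating counter clocked by the DAC; slips are inhibited until it
    reaches max_count."""

    def __init__(self, packet_size=80):
        self.max_count = largest_power_of_two_below(packet_size)
        self.count = 0                 # assumed initial state

    def tick(self):
        """Advance by one DAC clock, saturating at max_count."""
        self.count = min(self.count + 1, self.max_count)

    def expired(self):
        return self.count >= self.max_count

    def reset(self):
        """Called when a slip event (repetition/deletion) has occurred."""
        self.count = 0


timer = SlipTimer(80)
for _ in range(64):
    timer.tick()
print(timer.max_count, timer.expired())
```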
The block labeled “Read Add. Gen.” is important since this is a key aspect of the invention. A simplified view of this block is shown in
The entity M-WR_ADD represents the WR_ADD modified to represent the address diametrically opposite the current location that is being written into. If Δn=0, the drastic action taken is to make the select control choose M-WR_ADD to load into the read address register (see item “e” above). The read address counter is implemented as an accumulator that is updated based on the DAC-derived clock (“Read_Clock”). Under normal operation the increment is one unit (corresponding to packet size). That is, the read operation will sequence through the jitter buffer in a normal manner. The adjustment of the “Read_Clock” interval based on the slip buffer mechanism between DSP and DAC will account for frequency offset between DAC and far-end ADC clock. If the condition is “red” (see item “d” above) then the increment is either 0 units (the packet loss concealment algorithm is invoked) or 2 units (one packet is effectively deleted).
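The read address update described above can be sketched as follows. The function name and argument names are illustrative; the sketch assumes the addresses count packets (units) in a circular buffer of 2N locations.

```python
def next_read_address(rd_add, wr_add, n, increment=1, reset=False):
    """Return the next jitter-buffer read address.

    reset     -- True in the catastrophic case: load M-WR_ADD, the address
                 diametrically opposite the current write address.
    increment -- 1 in normal operation; 0 or 2 in the red state.
    """
    size = 2 * n
    if reset:
        return (wr_add + n) % size     # M-WR_ADD
    return (rd_add + increment) % size


n = 10
print(next_read_address(19, 5, n))               # normal read with wrap-around
print(next_read_address(4, 5, n, increment=0))   # hold: concealment frame inserted
print(next_read_address(4, 5, n, increment=2))   # skip: one packet deleted
print(next_read_address(5, 5, n, reset=True))    # re-center opposite the write side
```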
The notion of “Final_value” is the control value for the double buffer between the DSP block and the DAC. The nominal value will be called “N” in the following. (N−1) and (N+1) are the values for Final_value that will delete or repeat a (DAC) sample, respectively.
The block labeled “Increment Control” is one aspect of the invention of the adaptive play-out buffer. The actions have been described before but are summarized here for completeness. Based on the various state conditions this block controls the generation of the increment used by the read address counter:
1. If State is catastrophic (Δn=0):
i. Assert reset (forcing read pointer to be diametrically opposite to write pointer)
ii. Reset timer. This is optional. Included for specificity.
iii. Set increment to one unit. This is optional since counter action is overridden by reset action. Set Final_value to “N”.
2. If State is red:
i. Deliver message to signal processing entity that packet loss concealment (deletion or synthesis, based on sign of Δn) is required.
ii. If timer has not expired, set Final_value to “N”.
iii. If timer has expired, set Final_value to (N−1) or (N+1) depending on sign of Δn and reset timer.
3. If State is orange:
i. If timer has not expired, set Final_value to “N”.
ii. If timer has expired, set Final_value to (N−1) or (N+1) depending on sign of Δn and reset timer.
4. If State is yellow:
i. If timer has not expired, or flag is false, set Final_value to “N”.
ii. If timer has expired, and flag is true, set Final_value to (N−1) or (N+1) depending on sign of Δn and reset timer.
iii. Note: If the signal processing entity does not provide the flag it is deemed to be always true.
5. If State is green:
i. Set Final_value to “N”. (Normal slip buffer operation)
Note: In states orange, yellow, and green the increment for the read address for the jitter buffer (i.e. RD_ADD in
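The summary of the “Increment Control” actions given in items 1 through 5 can be sketched as a single decision function. All names below are illustrative. The direction of the slip is passed as a boolean rather than as the sign of Δn, and the nominal Final_value defaults to 79 as in the 80-sample example given earlier; both are choices made for this sketch.

```python
from typing import NamedTuple


class Action(NamedTuple):
    final_value: int           # control value for the DSP/DAC double buffer
    reset_read_pointer: bool   # re-center the jitter buffer
    reset_timer: bool
    request_plc: bool          # ask signal processing for concealment


def increment_control(state, timer_expired, flag=True,
                      read_catching_up=False, nominal=79):
    """Decide the action for one frame.

    flag is deemed True when the signal processing entity does not supply it.
    read_catching_up=True repeats a sample (nominal + 1); otherwise a sample
    is deleted (nominal - 1).
    """
    slipped = nominal + 1 if read_catching_up else nominal - 1
    if state == "catastrophic":
        return Action(nominal, True, True, False)
    if state == "red":
        if timer_expired:
            return Action(slipped, False, True, True)
        return Action(nominal, False, False, True)
    if state == "orange":
        slip_allowed = timer_expired            # flag is ignored
    elif state == "yellow":
        slip_allowed = timer_expired and flag   # both conditions required
    elif state == "green":
        slip_allowed = False
    else:
        raise ValueError("unknown state: " + repr(state))
    if slip_allowed:
        return Action(slipped, False, True, False)
    return Action(nominal, False, False, False)


print(increment_control("yellow", timer_expired=True, flag=True,
                        read_catching_up=True))
```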
One of the problems associated with communication of real-time information over packet networks is the time-delay variation introduced. A second problem is that the transport is asynchronous and therefore the receiving end may be operating at a different timing-base from the sending end. The packetized nature of VoIP necessitates the use of a jitter buffer and, possibly, a second buffer to interface to the actual digital to analog converter (DAC). The invention described herein deals with simple and efficient methods to address the jitter buffer and clock offset issues.
Salient points of the invention are:
1) The DAC double buffer is made adaptive in the sense that controlled slips are implemented.
2) The signal-processing entity can flag samples from segments of speech that are considered “actionable”.
3) The slip action can, optionally, be inhibited if the sample affected has been flagged as “nonactionable”.
4) The controlled slip action is instantiated by monitoring the fill of the jitter buffer.
5) The jitter buffer FIFO is implemented as a circular buffer and the difference between the read and write pointers used as a measure of buffer fill.
6) A timer is used to ensure that slip events do not occur too close to each other.
7) A timer is used to ensure that the frequency control is not too rapid.
The term program and/or the phrase computer program are intended to mean a sequence of instructions designed for execution on a computer system (e.g., a program and/or computer program, may include a subroutine, a function, a procedure, an object method, an object implementation, an executable application, an applet, a servlet, a source code, an object code, a shared library/dynamic load library and/or other sequence of instructions designed for execution on a computer or computer system).
The term substantially is intended to mean largely but not necessarily wholly that which is specified. The term approximately is intended to mean at least close to a given value (e.g., within 10% of). The term generally is intended to mean at least approaching a given state. The term coupled is intended to mean connected, although not necessarily directly, and not necessarily mechanically. The term proximate, as used herein, is intended to mean close, near adjacent and/or coincident; and includes spatial situations where specified functions and/or results (if any) can be carried out and/or achieved. The term distal, as used herein, is intended to mean far, away, spaced apart from and/or non-coincident, and includes spatial situation where specified functions and/or results (if any) can be carried out and/or achieved. The term deploying is intended to mean designing, building, shipping, installing and/or operating.
The terms first or one, and the phrases at least a first or at least one, are intended to mean the singular or the plural unless it is clear from the intrinsic text of this document that it is meant otherwise. The terms second or another, and the phrases at least a second or at least another, are intended to mean the singular or the plural unless it is clear from the intrinsic text of this document that it is meant otherwise. Unless expressly stated to the contrary in the intrinsic text of this document, the term or is intended to mean an inclusive or and not an exclusive or. Specifically, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present). The terms a and/or an are employed for grammatical style and merely for convenience.
The term plurality is intended to mean two or more than two. The term any is intended to mean all applicable members of a set or at least a subset of all applicable members of the set. The phrase any integer derivable therein is intended to mean an integer between the corresponding numbers recited in the specification. The phrase any range derivable therein is intended to mean any range within such corresponding numbers. The term means, when followed by the term “for” is intended to mean hardware, firmware and/or software for achieving a result. The term step, when followed by the term “for” is intended to mean a (sub)method, (sub)process and/or (sub)routine for achieving the recited result. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. In case of conflict, the present specification, including definitions, will control.
The described embodiments and examples are illustrative only and not intended to be limiting. Although embodiments of the invention can be implemented separately, embodiments of the invention may be integrated into the system(s) with which they are associated. All the embodiments of the invention disclosed herein can be made and used without undue experimentation in light of the disclosure. Although the best mode of the invention contemplated by the inventor(s) is disclosed, embodiments of the invention are not limited thereto. Embodiments of the invention are not limited by theoretical statements (if any) recited herein. The individual steps of embodiments of the invention need not be performed in the disclosed manner, or combined in the disclosed sequences, but may be performed in any and all manner and/or combined in any and all sequences.
Various substitutions, modifications, additions and/or rearrangements of the features of embodiments of the invention may be made without deviating from the spirit and/or scope of the underlying inventive concept. All the disclosed elements and features of each disclosed embodiment can be combined with, or substituted for, the disclosed elements and features of every other disclosed embodiment except where such elements or features are mutually exclusive. The spirit and/or scope of the underlying inventive concept as defined by the appended claims and their equivalents cover all such substitutions, modifications, additions and/or rearrangements.
The appended claims are not to be interpreted as including means-plus-function limitations, unless such a limitation is explicitly recited in a given claim using the phrase(s) “means for” and/or “step for.” Subgeneric embodiments of the invention are delineated by the appended independent claims and their equivalents. Specific embodiments of the invention are differentiated by the appended dependent claims and their equivalents.
This application claims a benefit of priority under 35 U.S.C. 119(e) from copending provisional patent applications U.S. Ser. No. 61/340,923, filed Mar. 24, 2010, U.S. Ser. No. 61/340,922, filed Mar. 24, 2010 and U.S. Ser. No. 61/340,906, filed Mar. 24, 2010, the entire contents of all of which are hereby expressly incorporated herein by reference for all purposes.
Number | Date | Country
---|---|---
61340923 | Mar 2010 | US
61340906 | Mar 2010 | US
61340922 | Mar 2010 | US