The present invention relates to adjusting an algorithmic time delay for a signal encoder, which may function in a speech codec.
End-to-end time delay often affects the overall quality of service of a communication system. For example, with speech communications, the time delay should be short enough to allow natural conversation. While the recommended target one-way delay is less than 150 ms, it is generally assumed that one-way delays of up to 200 ms provide a high level of interactivity with no degradation of subjective quality. With certain assumptions, delays of up to 400 ms are considered acceptable. However, although pushing one-way delays clearly below 200 ms cannot be expected to provide a substantial improvement in subjective quality of service, many communications systems are designed for, and thus operate in, the delay range of 200 to 400 ms. Furthermore, packet-switched networks, e.g., IP-based networks, operate in a best-effort manner, and therefore the delays during peak load can even exceed 400 ms. Thus, even small time-delay reductions can significantly contribute to minimizing the overall delay of a communications system and thereby improve the user experience.
An aspect of the present invention provides methods and apparatus for adjusting an algorithmic time delay of a signal encoder. An input signal, e.g., a speech signal, is sampled at a predetermined sampling rate. A processing module processes a segment of the input signal consisting of a current frame and a segment of future signal, typically referred to as a look-ahead segment. When look-ahead operation is initiated, the algorithmic time delay is increased by the look-ahead time duration. When look-ahead operation is terminated, the algorithmic time delay is decreased by the look-ahead time duration. A set of input signal samples is aligned in accordance with the algorithmic time delay, and an output signal that is representative of the set of signal samples is formed.
With another aspect of the invention, a first signal segment is added to an input signal waveform when the look-ahead operation is initiated, and a second signal segment is removed from the input signal waveform when the look-ahead operation is terminated.
With another aspect of the invention, a first pointer is equal to a second pointer when the look-ahead operation is terminated. The first pointer points to a beginning of the current frame and the second pointer points to new input signal samples. When the look-ahead operation is initiated, the first pointer is offset from the second pointer by the look-ahead time duration, as illustrated in the sketch following these aspects.
With another aspect of the invention, input signal samples are smoothed around a point of discontinuity when the operational mode changes.
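By way of illustration, the pointer arrangement described above may be realized as in the following C sketch. The buffer layout, constants, and function name are hypothetical and chosen only to illustrate the two modes of operation; they are not taken from any particular encoder implementation.

#define FRAME_LEN     160   /* 20 ms frame at a sampling rate of 8 kHz      */
#define LOOKAHEAD_LEN  40   /* 5 ms look-ahead at 8 kHz                     */
#define HIST_LEN      120   /* history retained for analysis (illustrative) */

/* One contiguous input buffer: [ history | current frame | look-ahead ] */
static float speech_buf[HIST_LEN + FRAME_LEN + LOOKAHEAD_LEN];

static float *current_frame;  /* first pointer: start of the current frame    */
static float *new_speech;     /* second pointer: where new input samples land */

/* Arrange the pointers for the requested mode of operation. */
void set_lookahead_mode(int lookahead_on)
{
    current_frame = speech_buf + HIST_LEN;
    if (lookahead_on) {
        /* New samples land LOOKAHEAD_LEN samples after the frame start,
           so the encoder always sees 5 ms of future signal and the
           algorithmic delay grows by the look-ahead duration.           */
        new_speech = current_frame + LOOKAHEAD_LEN;
    } else {
        /* Look-ahead-free operation: the two pointers coincide and the
           algorithmic delay shrinks by the look-ahead duration.         */
        new_speech = current_frame;
    }
}

With this arrangement, toggling the mode only moves new_speech; the signal content of the buffer itself is adjusted by inserting or removing a 5 ms segment, as described below.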
A more complete understanding of the present invention and the advantages thereof may be acquired by referring to the following description in consideration of the accompanying drawings, in which like reference numbers indicate like features and wherein:
In the following description of the various embodiments, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration various embodiments in which the invention may be practiced. It is to be understood that other embodiments may be utilized and structural and functional modifications may be made without departing from the scope of the present invention.
An adaptive multi-rate algorithm is the default speech codec that is used for the narrowband telephony service in 3rd generation 3GPP networks. (The term CODEC denotes CODer-DECoder, i.e., the encoder-decoder combination. The adaptive multi-rate (AMR) algorithm is also the third codec option for GSM and an optional codec for VoIP using RTP.) The algorithm has different algorithmic delay requirements for different configurations. Look-ahead operation is typically used for the LPC analysis to provide a smoother transition of the signal spectrum from frame to frame, and partially also for the Voice Activity Detection (VAD) algorithm. However, the highest bit-rate mode (12.2 kbits/sec) does not use the look-ahead. The standard version of the AMR encoder (as used in 3rd generation 3GPP networks) nevertheless imposes the look-ahead delay for the 12.2 kbits/sec mode as well, which enables fast adaptation between the 12.2 kbits/sec mode and the other AMR modes that employ the look-ahead. However, in certain applications, the set of active modes may be limited to the 12.2 kbits/sec mode only, which would make the 5 ms look-ahead an unnecessary delay component. Such services include 3G circuit-switched telephony, voice over IP (VoIP), and unlicensed mobile access (UMA). All of these services typically have high enough bandwidth to provide the highest-quality AMR mode for all voice traffic. Embodiments of the invention, as shown in
Referring to
In accordance with embodiments of the invention, the speech encoder 400 (as shown in
Another class of encoder typically uses time-domain or frequency-domain coding and attempts to reproduce the original signal (waveform) without assuming that the original signal is a speech signal. Consequently, a waveform encoder does not assume any prior knowledge about the signal. The decoder output waveform is very similar to the signal input to the coder. Examples of these general encoders include uniform binary coding for music compact discs and pulse code modulation for telecommunications. A pulse code modulation (PCM) encoder is a general encoder often used in standard voice-grade circuits.
As shown in
Speech and audio codecs typically operate with a fixed algorithmic delay. Consequently, the time delay associated with the coding algorithm remains constant. The time delay may be a constant value for a given codec or may depend on the employed configuration of the codec. An example of a codec in which different configurations have different time delay requirements is the AMR-WB+ codec, in which mono operation has an algorithmic delay of approximately 114 ms, while stereo operation imposes an algorithmic delay of approximately 163 ms. However, once the codec/encoder is initialized to operate using a certain configuration, the configuration typically cannot be changed without re-initializing the codec and starting a new session.
With the embodiment shown in
In step 303, process 300 determines whether the operational mode should change to look-ahead operation (corresponding to
An improvement in voice quality when switching between look-ahead operation and look-ahead-free operation (when look-ahead operation is initiated or terminated) may be obtained by modifying the signal around the point of discontinuity, i.e., between the input signal from the previous frame and the new input signal, to ensure a smooth transition. One way to perform this is to use "cross-fading." (This approach is termed the non-pitch-synchronous method.) Because the signal segment is added in step 307, the signal waveform may be smoothed (cross-faded) around the resulting point of discontinuity in step 309. With an embodiment of the invention, the generation of the first signal segment when initiating the look-ahead operation is determined by:
current_frame(k) = w1(k)*current_frame(k-40) + w2(k)*new_speech(k)   (EQ. 1)
where 0 <= k < 40, and
current_frame(k+40) = new_speech(k)   (EQ. 2)
where 0 <= k < 160, and
w1(k) = (k+1)/41   (EQ. 3)
and
w2(k) = 1 - w1(k)   (EQ. 4)
From EQs. 1-4, the first signal segment (as determined in step 307) is a weighted sum of the 5 ms pieces of signal surrounding the inserted signal segment. In this case, the whole new input frame (indices 0 to 159) is written into the buffer unmodified. EQs. 1-4 are exemplary for providing smoothing (as determined by step 309) around the point of discontinuity resulting from initiating look-ahead operation. For example, different weighting functions w1 and w2 may be used. The above computation implies that, in addition to inserting a 5 ms segment of signal, the first 5 ms of the new input speech also contributes, in weighted form, to that inserted segment, providing a smoother change from the signal segment that precedes the inserted piece of signal; the new input frame itself then follows the inserted segment in the buffer unmodified.
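A minimal C sketch of this insertion, following EQs. 1-4 directly, is given below. The floating-point buffers and the function name are illustrative only; the constants FRAME_LEN (160) and LOOKAHEAD_LEN (40) are those of the earlier pointer sketch, and the 40 samples immediately preceding current_frame are assumed to hold the tail of the previously processed signal.

/* Insert a 5 ms segment when look-ahead operation is initiated (EQs. 1-4). */
void insert_lookahead_segment(float *current_frame, const float *new_speech)
{
    for (int k = 0; k < LOOKAHEAD_LEN; k++) {
        float w1 = (float)(k + 1) / (LOOKAHEAD_LEN + 1);   /* EQ. 3 */
        float w2 = 1.0f - w1;                              /* EQ. 4 */
        /* EQ. 1: cross-fade the past signal with the new input */
        current_frame[k] = w1 * current_frame[k - LOOKAHEAD_LEN]
                         + w2 * new_speech[k];
    }
    /* EQ. 2: the whole new input frame follows the inserted segment */
    for (int k = 0; k < FRAME_LEN; k++)
        current_frame[k + LOOKAHEAD_LEN] = new_speech[k];
}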
With smoothing according to EQs. 1-4 around the point of discontinuity, the energy of the signal waveform changes smoothly, so no sudden and potentially annoying disturbances are introduced. For non-speech and unvoiced signals this approach provides an essentially seamless transition. However, for voiced speech having a periodic structure with a period length clearly different from 40 sample points (corresponding to 5 ms at a predetermined sampling rate of 8000 samples per second), the processing may introduce an irregularity in the periodicity and thereby degrade quality.
Referring to
Similar to the above discussion, an improvement in voice quality when switching from look-ahead operation to look-ahead-free operation may be obtained by "cross-fading" the signal around the point of discontinuity, i.e., between the input signal from the previous frame and the new input signal. Because the signal segment is removed in step 317, the signal waveform may be smoothed (cross-faded) around the resulting point of discontinuity in step 319. When look-ahead operation is terminated, one can mix a portion of speech (having a 5 ms time duration, corresponding to 40 samples of signal at an 8 kHz sampling rate) that was used as the look-ahead for the previous frame (i.e., the signal segment between "current_frame" and "new_speech" as shown in
current_frame(k) = w2(k)*current_frame(k) + w1(k)*new_speech(k)   (EQ. 5)
where 0 <= k < 40, and
current_frame(k) = new_speech(k)   (EQ. 6)
where 40 <= k < 160, and
w1(k) = (k+1)/41   (EQ. 7)
and
w2(k) = 1 - w1(k)   (EQ. 8)
Note that with the above embodiment, the weighting factors w1 and w2 are the same whether look-ahead operation is initiated or terminated (corresponding to EQs. 3, 4, 7, and 8).
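A corresponding C sketch of the removal according to EQs. 5-8 follows, using the same illustrative conventions and constants as the insertion sketch above; the new input now starts at the frame start, so only the first 40 samples are cross-faded.

/* Remove the 5 ms look-ahead segment when look-ahead operation is
   terminated (EQs. 5-8): the segment that served as look-ahead for the
   previous frame is mixed with the beginning of the new input.          */
void remove_lookahead_segment(float *current_frame, const float *new_speech)
{
    for (int k = 0; k < LOOKAHEAD_LEN; k++) {
        float w1 = (float)(k + 1) / (LOOKAHEAD_LEN + 1);   /* EQ. 7 */
        float w2 = 1.0f - w1;                              /* EQ. 8 */
        /* EQ. 5: cross-fade the old look-ahead with the new input */
        current_frame[k] = w2 * current_frame[k] + w1 * new_speech[k];
    }
    /* EQ. 6: the rest of the new frame is copied unmodified */
    for (int k = LOOKAHEAD_LEN; k < FRAME_LEN; k++)
        current_frame[k] = new_speech[k];
}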
In step 311, a set of samples corresponding to current frame 105 is obtained from the signal waveform in response to the processing of steps 305-309 and 315-319. In step 313, an output signal is generated to represent the set of samples. For example, with an embodiment of the invention, linear predictive coefficients are determined from the samples in conjunction with an assumed speech mode.
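Purely as an illustration of how such coefficients can be obtained, the sketch below shows a textbook autocorrelation analysis followed by the Levinson-Durbin recursion; the order and function name are illustrative, and the sketch is not the standardized AMR procedure, which additionally applies windowing, lag windowing, and bandwidth expansion.

/* Illustrative 10th-order LPC analysis over n aligned samples x[0..n-1];
   the coefficients of A(z) = 1 + a[1]z^-1 + ... + a[10]z^-10 are returned
   in a[0..LPC_ORDER].                                                    */
#define LPC_ORDER 10

void lpc_from_samples(const float *x, int n, float *a)
{
    float r[LPC_ORDER + 1];

    /* Autocorrelation of the analysis segment */
    for (int i = 0; i <= LPC_ORDER; i++) {
        r[i] = 0.0f;
        for (int k = i; k < n; k++)
            r[i] += x[k] * x[k - i];
    }
    if (r[0] <= 0.0f)
        r[0] = 1.0f;   /* guard against an all-zero segment */

    /* Levinson-Durbin recursion */
    float err = r[0];
    a[0] = 1.0f;
    for (int i = 1; i <= LPC_ORDER; i++) {
        float acc = r[i];
        for (int j = 1; j < i; j++)
            acc += a[j] * r[i - j];
        float k_i = -acc / err;     /* i-th reflection coefficient */

        a[i] = k_i;
        for (int j = 1; j <= i / 2; j++) {   /* symmetric in-place update */
            float tmp = a[j] + k_i * a[i - j];
            a[i - j] += k_i * a[j];
            a[j] = tmp;
        }
        err *= (1.0f - k_i * k_i);  /* remaining prediction error */
    }
}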
Embodiments of the invention support other approaches when switching between look-ahead operation and look-ahead-free operation, in which the algorithmic time delay is changed. With an embodiment of the invention, the signal encoder is reset and the speech pointers are re-initialized according to the desired mode of operation (as shown in
Note that after the encoder reset, one should also reset the decoder to ensure decoder stability, since the encoder and decoder must be resynchronized. This can be performed by sending a homing frame to the decoder. This approach simplifies implementation, as only a few lines of the encoder source code may need to be modified to provide look-ahead-free operation. However, reduced voice quality may occur during the change of mode of operation: a codec reset can be expected to completely mute the decoder output for a short while, and normal operation is restored only after a few processed frames.
Embodiments of the invention may also utilize an approach in which the pointers are re-initialized without resetting the encoder when changing between look-ahead operation and look-ahead-free operation. When switching look-ahead operation off, this approach requires only resetting the pointer values from values shown in
Embodiments of the invention also utilize an approach in which pitch-synchronous methods exploit the long-term periodicity of speech when switching between the look-ahead mode and the look-ahead-free mode. Consequently, when switching off look-ahead operation, waveform shortening is performed by removing pieces of signal that are integer multiples of the current (pitch) period length. When switching on look-ahead operation, this approach repeats the past signal in segments that are integer multiples of the current (pitch) period length. For example, when the current pitch period equals a time duration spanning p samples, waveform shortening (i.e., removing a segment equal to the look-ahead time duration) is determined by:
current_frame(40-p+k) = new_speech(k)   (EQ. 9)
where 0 <= k < 160
Waveform extension (i.e., adding a segment equal to the look-ahead time duration) is determined by:
current_frame(k) = current_frame(k-p)   (EQ. 10)
where 0 <= k < p, and
current_frame(k+p) = new_speech(k)   (EQ. 11)
where 0 <= k < 160
With the above approach, the amount of waveform shortening or extension depends on the current pitch period length, i.e., the processing depends on the characteristics of the current input signal. Therefore, in most cases, it is not possible to exactly match the desired change in signal length. Furthermore, when shortening the signal waveform, one can cut away at most 5 ms of signal in order to still provide a full 20 ms frame of signal for encoding. Thus, if the current pitch period is longer than 5 ms, one cannot perform pitch-synchronous shortening of the signal; if the pitch period is shorter than 5 ms, one can remove only part of the signal waveform spanning the look-ahead time duration. Similarly, when extending the signal waveform, one needs to insert at least 5 ms of additional signal, which implies that, in the case of a pitch period shorter than 5 ms, one needs to repeat the pitch period as many times as required to obtain a first segment of at least 5 ms. Consequently, the introduced first segment may have a time duration longer than 5 ms.
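A C sketch of the pitch-synchronous processing according to EQs. 9-11 follows, generalized as discussed above so that the extension repeats the pitch period until at least the 5 ms look-ahead duration has been covered. The constants and the treatment of the signal preceding current_frame follow the earlier sketches; the function names are illustrative.

/* Pitch-synchronous waveform shortening (EQ. 9): p samples are removed,
   where p is the current pitch period (p <= LOOKAHEAD_LEN, see above).  */
void pitch_sync_shorten(float *current_frame, const float *new_speech, int p)
{
    for (int k = 0; k < FRAME_LEN; k++)
        current_frame[LOOKAHEAD_LEN - p + k] = new_speech[k];
}

/* Pitch-synchronous waveform extension (EQs. 10-11): the previous pitch
   period is repeated until at least LOOKAHEAD_LEN samples have been
   inserted, after which the new input frame is appended.                */
void pitch_sync_extend(float *current_frame, const float *new_speech, int p)
{
    int inserted = 0;
    while (inserted < LOOKAHEAD_LEN) {
        for (int k = 0; k < p; k++)          /* EQ. 10: repeat the past period */
            current_frame[inserted + k] = current_frame[inserted + k - p];
        inserted += p;
    }
    for (int k = 0; k < FRAME_LEN; k++)      /* EQ. 11: append the new frame   */
        current_frame[inserted + k] = new_speech[k];
}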
Thus, although the pitch-synchronous approach provides good voice quality with respect to the approaches that are described above, one should be cognizant of the following considerations:
Embodiments of the invention also support combining the pitch-synchronous approach with the other approaches described above. For example, in the case of non-speech and unvoiced input, one can use non-pitch-synchronous processing, while for voiced speech one uses pitch-synchronous processing. Processing can be further tuned by inserting the first segment using non-pitch-synchronous processing (since insertion is most probably time critical) and employing pitch-synchronous processing only for removing/shortening the signal waveform (since removal can be assumed to be less time critical).
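For illustration, such a combined strategy might be organized as in the following sketch, which reuses the hypothetical helper functions from the earlier sketches; the voicing decision and pitch period are assumed to come from the encoder's own classification and open-loop pitch analysis.

/* Illustrative selection between non-pitch-synchronous and pitch-synchronous
   processing when the mode of operation changes.                            */
void change_lookahead_mode(float *current_frame, const float *new_speech,
                           int enable_lookahead, int is_voiced, int pitch_period)
{
    if (enable_lookahead) {
        /* Insertion is assumed time critical: use the non-pitch-synchronous
           cross-fade of EQs. 1-4.                                           */
        insert_lookahead_segment(current_frame, new_speech);
    } else if (is_voiced && pitch_period <= LOOKAHEAD_LEN) {
        /* Voiced speech with a short enough pitch period: shorten the
           waveform pitch-synchronously (EQ. 9).                             */
        pitch_sync_shorten(current_frame, new_speech, pitch_period);
    } else {
        /* Non-speech or unvoiced input: use the cross-fade removal of
           EQs. 5-8.                                                         */
        remove_lookahead_segment(current_frame, new_speech);
    }
}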
In the above exemplary embodiments that support an AMR codec as shown in
As can be appreciated by one skilled in the art, a computer system with an associated computer-readable medium containing instructions for controlling the computer system can be utilized to implement the exemplary embodiments that are disclosed herein. The computer system may include at least one computer such as a microprocessor, digital signal processor, and associated peripheral electronic circuitry.
While the invention has been described with respect to specific examples including presently preferred modes of carrying out the invention, those skilled in the art will appreciate that there are numerous variations and permutations of the above described systems and techniques that fall within the spirit and scope of the invention as set forth in the appended claims.