In conventional communication systems, an encoder generates a stream of information bits representing voice or data traffic. This stream of bits is subdivided and grouped, concatenated with various control bits, and packed into a suitable format for transmission. Voice and data traffic may be transmitted in various formats according to the appropriate communication mechanism, such as, for example, frames, packets, subpackets, etc. For the sake of clarity, the term “transmission frame” will be used herein to describe the transmission format in which traffic is actually transmitted. The term “packet” will be used herein to describe the output of a speech coder. Speech coders are also referred to as voice coders, or “vocoders,” and the terms will be used interchangeably herein.
A vocoder extracts parameters relating to a model of voice information (such as human speech) generation and uses the extracted parameters to compress the voice information for transmission. Vocoders typically comprise an encoder and a decoder. A vocoder segments incoming voice information (e.g., an analog voice signal) into blocks, analyzes the incoming speech block to extract certain relevant parameters, and quantizes the parameters into binary or bit representation. The bit representation is packed into a packet, the packets are formatted into transmission frames and the transmission frames are transmitted over a communication channel to a receiver with a decoder. At the receiver, the packets are extracted from the transmission frames, and the decoder unquantizes the bit representations carried in the packets to produce a set of coding parameters. The decoder then re-synthesizes the voice segments, and subsequently, the original voice information using the unquantized parameters.
Different types of vocoders are deployed in various existing wireless and wireline communication systems, often using various compression techniques. Moreover, transmission frame formats and processing defined by one particular standard may differ significantly from those of other standards. For example, CDMA standards support the use of variable-rate vocoder frames in a spread spectrum environment, while GSM standards support the use of fixed-rate vocoder frames and multi-rate vocoder frames. Similarly, Universal Mobile Telecommunications Systems (UMTS) standards also support fixed-rate and multi-rate vocoders, but not variable-rate vocoders. For compatibility and interoperability between these communication systems, it may be desirable to enable the support of variable-rate vocoder frames within GSM and UMTS systems, and the support of non-variable rate vocoder frames within CDMA systems. Echo occurs in all of these communication systems; acoustic echo and electrical echo are two example types.
Acoustic echo is produced by poor voice coupling between an earpiece and a microphone in handsets and/or hands-free devices. Electrical echo results from 4-to-2 wire coupling within PSTN networks. Voice-compressing vocoders process voice including echo within the handsets and in wireless networks, which results in returned echo signals with highly variable properties. The echoed signals degrade voice call quality.
In one example of acoustic echo, sound from a loudspeaker is heard by a listener at a near end, as intended. However, this same sound at the near end is also picked up by the microphone, both directly and indirectly, after being reflected. The result of this reflection is the creation of echo, which, unless eliminated, is transmitted back to the far end and heard by the talker at the far end as echo.
If the conventional echo canceller/suppressor 100 is used in a packet switched network, the conventional echo canceller must completely decode the vocoder packets associated with voice signals transmitted in both directions to obtain echo cancellation parameters, because all conventional echo cancellation operations work with linear uncompressed speech. That is, the conventional echo canceller/suppressor 100 must extract the packets from the transmission frames, unquantize the bit representations carried in the packets to produce a set of coding parameters, and re-synthesize the voice segments before canceling echo. The conventional echo canceller/suppressor then cancels echo using the re-synthesized voice segments.
Because transmitted voice information is encoded into parameters (e.g., in the parametric domain) before transmission, whereas conventional echo suppressors/cancellers operate in the linear speech domain, conventional echo cancellation/suppression in a packet switched network is relatively difficult and complex. It may also add encoding and/or decoding delay and/or degrade voice quality because of, for example, the additional tandem coding involved.
Example embodiments are directed to methods and apparatuses for packet-based echo suppression/cancellation. One example embodiment provides a method for suppressing/cancelling echo. In this example embodiment, a reference voice packet is selected from a plurality of reference voice packets based on at least one encoded voice parameter associated with each of the plurality of reference voice packets and a targeted voice packet. Echo in the targeted voice packet is suppressed/cancelled based on the selected reference voice packet.
The present invention will become more fully understood from the detailed description given herein below and the accompanying drawings, wherein like elements are represented by like reference numerals, which are given by way of illustration only and thus are not limiting of the present invention and wherein:
Methods and apparatuses, according to example embodiments, may perform echo cancellation and/or echo suppression depending on, for example, the particular application within a packet switched communication system. Example embodiments will be described herein as echo cancellation/suppression, an echo canceller/suppressor, etc.
Hereinafter, for example purposes, vocoder packets suspected of carrying echoed voice information (e.g., voice information received at the near end and echoed back to the far end) will be referred to as targeted packets, and coding parameters associated with these targeted packets will be referred to as targeted packet parameters. Vocoder or parameter packets associated with originally transmitted voice information (e.g., potentially echoed voice information) from the far end used to determine whether targeted packets include echoed voice information will be referred to as reference packets. The coding parameters associated with the reference packets will be referred to as reference packet parameters.
As discussed above,
One example vocoder used to encode voice information is a Code Excited Linear Prediction (CELP) based vocoder. CELP-based vocoders encode digital voice information into a set of coding parameters. These parameters include, for example, adaptive codebook and fixed codebook gains, pitch/adaptive codebook, linear spectrum pairs (LSPs) and fixed codebooks. Each of these parameters may be represented by a number of bits. For example, for a full-rate packet of Enhanced Variable Rate CODEC (EVRC) vocoder, which is a well-known vocoder, the LSP is represented by 28 bits, the pitch and its corresponding delta are represented by 12 bits, the adaptive codebook gain is represented by 9 bits and the fixed codebook gain is represented by 15 bits. The fixed codebook is represented by 120 bits.
Referring still to
Packet domain echo cancellers/suppressors and/or methods for the same, according to example embodiments, utilize this similarity in cancelling/suppressing echo in transmitted signals by adaptively adjusting coding parameters associated with transmitted packets.
For example purposes, example embodiments will be described with regard to a CELP-based vocoder such as an EVRC vocoder. However, methods and/or apparatuses, according to example embodiments, may be used and/or adapted to be used in conjunction with any suitable vocoder.
The echo cancellation/suppression module 206 may cancel/suppress echo from a signal (e.g., a transmitted and/or received signal) based on at least one encoded voice parameter associated with at least one reference packet stored in the reference packet buffer memory 202 and at least one targeted packet stored in the targeted packet buffer 204. The echo cancellation/suppression module 206, and methods performed therein, will be discussed in more detail below.
The memory 208 may store intermediate values and/or voice packets such as voice packet similarity metrics, corresponding reference voice packets, targeted voice packets, etc. In at least one example embodiment, the memory 208 may store individual similarity metrics and/or overall similarity metrics. The memory 208 will be described in more detail below.
Returning to
The length of the buffer memory 202 may be determined based on the length of the echo tail, network delay and the trajectory match length. For example, if each vocoder packet carries a 20 ms voice segment, the echo tail length is equal to 180 ms and the trajectory match length is 120 ms (e.g., 6 packets), the buffer memory 202 may hold 15 reference packets. The maximum number of packets that may be stored in buffer 202 for reference packets may be represented by m.
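The arithmetic in this example can be sketched as follows. This is a minimal illustration only: the function name and the optional `network_delay_ms` argument are hypothetical, and the network delay is assumed to be zero in the worked figures because the example above does not give a value for it.

```python
import math

def reference_buffer_length(packet_ms, echo_tail_ms, trajectory_match_ms,
                            network_delay_ms=0):
    """Number of reference packets m needed to span the echo tail,
    the trajectory match window and (optionally) the network delay."""
    total_ms = echo_tail_ms + trajectory_match_ms + network_delay_ms
    # Round up so that a partially covered packet still gets a buffer slot.
    return math.ceil(total_ms / packet_ms)

# Figures from the text: 20 ms packets, 180 ms echo tail, 120 ms match length.
m = reference_buffer_length(20, 180, 120)
print(m)  # 15
```

With these figures, the echo tail accounts for 9 packets and the trajectory match length for 6 packets, giving the 15 reference packets stated above.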
Although
In at least one example, the echo tail length may be determined and/or defined by known network parameters of echo path or obtained using an actual searching process. Methods for determining echo tail length are well-known in the art. After having determined the echo tail length, methods according to at least some example embodiments may be performed within a time window equal to the echo tail length. The time window width may be equivalent to, for example, one or several transmission frames in length, or one or several packets in length. For example purposes, example embodiments will be described assuming that the echo tail length is equivalent to the length of a speech signal transmitted in a single transmission frame.
Example embodiments may be applicable to any echo tail length by matching reference packets stored in buffer 202 with targeted packets carrying echoed voice information. Whether a targeted packet contains echoed voice information may be determined by comparing a targeted packet with each of m reference packets stored in the buffer 202.
Referring to
At S306, if the counter value j is less than or equal to threshold value m, the echo cancellation/suppression module 206 extracts the encoded parameters from reference packet Rj at S308. Concurrently, at S308, the echo cancellation/suppression module 206 extracts the encoded parameters from the targeted packet T. Methods for extracting these parameters are well-known in the art. Thus, a detailed discussion has been omitted for the sake of brevity. As discussed above, example embodiments are described herein with regard to a CELP-based vocoder. For a CELP-based encoder, the reference packet parameters and the targeted packet parameters may include fixed codebook gains Gf, adaptive codebook gains Ga, pitch P and an LSP.
Still referring to
Double talk detection may be used to determine whether a reference packet Rj includes double talk. In an example embodiment, double talk may be detected by comparing encoded parameters extracted from the targeted packet T and encoded parameters extracted from the reference packet Rj. In the above-discussed CELP vocoder example, the encoded parameters may be fixed codebook gains Gf and adaptive codebook gains Ga.
The echo cancellation/suppression module 206 may determine whether double talk is present according to the conditions shown in Equation (1):
According to Equation (1), if the difference between the fixed codebook gain GfR for the reference packet Rj and the fixed codebook gain GfT for the targeted packet T is less than a fixed codebook gain threshold value Δf, double talk is present in the reference packet Rj and the double talk detection flag DT may be set to 1 (e.g., DT=1). Similarly, if the difference between the adaptive codebook gain GαR for the reference packet Rj and the adaptive codebook gain GαT for the targeted packet T is less than an adaptive codebook gain threshold value Δa, double talk is present in the reference packet Rj and the double talk detection flag DT may be set to 1 (e.g., DT=1). Otherwise, double talk is not present in the reference packet Rj and the double talk detection flag may not be set (e.g., DT=0).
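The double talk rule described above can be sketched in a few lines. This is an illustrative sketch, not the claimed implementation: the function and argument names are hypothetical, and the difference is assumed to be taken as the reference gain minus the targeted gain, since the text describes it only as "the difference".

```python
def double_talk_flag(gf_ref, gf_tgt, ga_ref, ga_tgt, delta_f, delta_a):
    """Return DT = 1 if double talk is indicated, otherwise DT = 0.

    Assumption: each difference is (reference gain - targeted gain).
    """
    fixed_diff = gf_ref - gf_tgt      # fixed codebook gain difference
    adaptive_diff = ga_ref - ga_tgt   # adaptive codebook gain difference
    if fixed_diff < delta_f or adaptive_diff < delta_a:
        return 1
    return 0

# Targeted gains well below the reference gains: no double talk flagged.
print(double_talk_flag(10.0, 2.0, 8.0, 1.0, delta_f=3.0, delta_a=3.0))  # 0
# Targeted fixed codebook gain close to the reference gain: DT is set.
print(double_talk_flag(10.0, 9.0, 8.0, 1.0, delta_f=3.0, delta_a=3.0))  # 1
```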
Referring back to
The similarity flags may be referred to as similarity indicators. The similarity flags or similarity indicators may include, for example, a pitch similarity flag (or indicator) PM and a plurality of LSP similarity flags (or indicators). The plurality of LSP similarity flags may include a plurality of bandwidth similarity flags BMi and a plurality of frequency similarity matching flags FMi.
Still referring to S312 of
As shown in Equation (2), PT is the pitch associated with the targeted packet, PR is the pitch associated with the reference packet Rj and Δp is a pitch threshold value. The pitch threshold value Δp may be determined based on experimental data obtained according to the specific type of vocoder used. As shown in Equation (2), if the absolute value of the difference between the pitch PT and the pitch PR is less than or equal to the threshold value Δp, the pitch PT is similar to the pitch PR and the pitch similarity flag PM may be set to 1. Otherwise, the pitch similarity flag PM may be set to 0.
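The pitch comparison described for Equation (2) can be sketched as below. The function name and the sample pitch values are hypothetical; the threshold would in practice come from experimental data for the vocoder in use.

```python
def pitch_similarity_flag(p_tgt, p_ref, delta_p):
    """Return PM = 1 if |PT - PR| <= delta_p, otherwise PM = 0."""
    return 1 if abs(p_tgt - p_ref) <= delta_p else 0

print(pitch_similarity_flag(40, 42, delta_p=3))  # 1
print(pitch_similarity_flag(40, 50, delta_p=3))  # 0
```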
Referring still to S312 of
Generally, a CELP vocoder utilizes a 10th order Linear Predictive Coding (LPC) predictive filter, which encodes 10 LSP values using vector quantization. In addition, each LSP pair defines a corresponding speech spectrum formant. A formant is a peak in an acoustic frequency spectrum resulting from the resonant frequencies of any acoustic system. Each particular formant may be expressed by bandwidth Bi given by Equation (3):
$$B_i = \mathrm{LSP}_{2i} - \mathrm{LSP}_{2i-1}, \quad i = 1, 2, \ldots, 5 \tag{3}$$
and center frequency Fi given by Equation (4):
As shown in Equations (3) and (4), Bi is the bandwidth of i-th formant, Fi is the center frequency of i-th formant, and LSP2i and LSP2i-1 are the i-th pair of LSP values.
In this example, for a 10th order LPC predictive filter, 5 pairs of LSP values may be generated.
Each of the first three formants may include significant or relatively significant spectrum envelope information for a voice segment. Consequently, LSP similarity evaluation may be performed based on the first three formants i=1, 2 and 3.
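As a rough illustration of how formant parameters might be derived from the 10 LSP values, the sketch below computes the bandwidth of each LSP pair as the difference of the pair, following Equation (3). The center frequency is computed as the midpoint of the pair, which is an assumption made for this sketch only. The function name and the sample LSP values (in Hz) are hypothetical.

```python
def formant_parameters(lsp):
    """Return (bandwidths, centers) for the 5 LSP pairs of a 10th order filter."""
    if len(lsp) != 10:
        raise ValueError("expected 10 LSP values")
    bandwidths = []
    centers = []
    for i in range(5):
        lower = lsp[2 * i]       # LSP_{2i-1} in the 1-based notation of the text
        upper = lsp[2 * i + 1]   # LSP_{2i}
        bandwidths.append(upper - lower)     # Equation (3)
        centers.append((upper + lower) / 2)  # assumed midpoint, see lead-in
    return bandwidths, centers

lsp = [200, 300, 700, 900, 1500, 1600, 2400, 2700, 3200, 3500]
bandwidths, centers = formant_parameters(lsp)
# Only the first three formants are used for the LSP similarity evaluation.
print(bandwidths[:3])  # [100, 200, 100]
print(centers[:3])     # [250.0, 800.0, 1550.0]
```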
A bandwidth similarity flag BMi, indicating whether a bandwidth BTi associated with a targeted packet T is similar to a bandwidth BRi associated with the reference packet Rj, for each formant i, for i=1, 2, 3, may be set according to Equation (5):
As shown in Equation (5), BTi is the i-th bandwidth associated with targeted packet T, BRi is the i-th bandwidth associated with reference packet Rj and ΔBi is the i-th bandwidth threshold used to determine whether the bandwidths BTi and BRi are similar. If BMi=1, both i-th bandwidths BTi and BRi are within a certain range of one another and may be considered similar. Otherwise, when BMi=0, the i-th bandwidths BTi and BRi may not be considered similar. Similar to the pitch threshold, each bandwidth threshold may be determined based on experimental data obtained according to the specific type of vocoder used.
Referring still to S312 of
In Equation (6), FTi is the i-th center frequency associated with targeted packet T, FRi is the i-th center frequency associated with reference packet Rj and ΔFi is an i-th center frequency threshold. The i-th center frequency threshold ΔFi may be indicative of the similarity between i-th target and reference center frequencies FTi and FRi, for i=1, 2 and 3. Similar to the pitch threshold and bandwidth thresholds, the frequency thresholds may be determined based on experimental data obtained according to the specific type of vocoder used.
FMi is a center frequency similarity flag for the i-th formant of a corresponding LSP pair. According to Equation (6), FMi=1 indicates that FTi and FRi are similar, whereas FMi=0 indicates that FTi and FRi are not similar.
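The bandwidth and center frequency flags for the first three formants can be sketched together. This sketch assumes that "similar" means the absolute difference does not exceed the corresponding threshold, by analogy with the pitch comparison of Equation (2). The function name and sample values are hypothetical.

```python
def lsp_similarity_flags(b_tgt, b_ref, f_tgt, f_ref, delta_b, delta_f):
    """Return (BM, FM): per-formant similarity flags for formants i = 1, 2, 3.

    Assumption: a flag is 1 when |target - reference| <= threshold.
    """
    bm = [1 if abs(bt - br) <= db else 0
          for bt, br, db in zip(b_tgt[:3], b_ref[:3], delta_b)]
    fm = [1 if abs(ft - fr) <= df else 0
          for ft, fr, df in zip(f_tgt[:3], f_ref[:3], delta_f)]
    return bm, fm

bm, fm = lsp_similarity_flags(
    b_tgt=[100, 200, 100], b_ref=[110, 260, 95],
    f_tgt=[250, 800, 1550], f_ref=[255, 790, 1700],
    delta_b=[20, 20, 20], delta_f=[50, 50, 50],
)
print(bm)  # [1, 0, 1]
print(fm)  # [1, 1, 0]
```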
Returning to
The echo cancellation/suppression module 206 may then calculate an overall voice packet similarity metric at S316. The overall voice packet similarity metric may be, for example, an overall similarity metric Sj. The overall similarity metric Sj may indicate the overall similarity between targeted packet T and reference packet Rj.
In at least one example embodiment, the overall similarity metric Sj associated with reference packet Rj may be calculated based on a plurality of individual voice packet similarity metrics. The plurality of individual voice packet similarity metrics may be individual similarity metrics.
The plurality of individual similarity metrics may be calculated based on at least a portion of the encoded parameters extracted from the targeted packet T and the reference packet Rj. In this example embodiment, the plurality of individual similarity metrics may include a pitch similarity metric Sp, bandwidth similarity metrics SBi, for i=1, 2 and 3, and frequency similarity metrics SFi, for i=1, 2 and 3. Each of the plurality of individual similarity metrics may be calculated concurrently.
For example the pitch similarity metric Sp may be calculated according to Equation (7):
The bandwidth similarity SBi for each of i formants may be calculated according to Equation (8):
As shown in Equation (8) and as discussed above, BTi is the bandwidth of i-th formant for targeted packet T, and BRi is the bandwidth of i-th formant for reference packet Rj.
Similarly, the center frequency similarity SFi for each of i formants may be calculated according to equation (9):
As shown in Equation (9) and as discussed above, FTi is the center frequency for the i-th formant for the targeted packet T and FRi is the center frequency of the i-th formant for the reference packet Rj.
After obtaining the plurality of individual similarity metrics, the overall similarity matching metric Sj may be calculated according to Equation (10):
In Equation (10), each individual similarity metric may be weighted by a corresponding weighting function. As shown, αp is a similarity weighting constant for pitch similarity metric Sp, αLSP is an overall similarity weighting constant for LSP spectrum similarity metrics SBi and SFi, βBi is an individual similarity weighting constant for the bandwidth similarity metric SBi and βFi is an individual similarity weighting constant for frequency similarity metric SFi.
The similarity weighting constants αp and αLSP may be determined so as to satisfy Equation (11) shown below.
$$\alpha_p + \alpha_{LSP} = 1 \tag{11}$$
Similarly, individual similarity weighting constants βBi and βFi may be determined so as to satisfy Equation (12) shown below.
$$\beta_{Bi} + \beta_{Fi} = 1, \quad i = 1, 2, 3 \tag{12}$$
According to at least some example embodiments, the weighting constants may be determined and/or adjusted based on empirical data such that Equations (11) and (12) are satisfied.
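One way the weighting could be combined is sketched below. The weighted-sum form used here is an assumption made for illustration and is not taken from Equation (10); only the constraints of Equations (11) and (12) come from the text. The individual metrics Sp, SBi and SFi are taken as inputs, and all names are hypothetical.

```python
import math

def overall_similarity(s_p, s_b, s_f, alpha_p, alpha_lsp, beta_b, beta_f):
    """Combine individual similarity metrics into one overall metric S_j.

    Assumed form: S = alpha_p*S_p + alpha_lsp * sum_i(beta_Bi*S_Bi + beta_Fi*S_Fi).
    """
    # Equation (11): the pitch and LSP weights must sum to 1.
    if not math.isclose(alpha_p + alpha_lsp, 1.0):
        raise ValueError("alpha_p + alpha_lsp must equal 1")
    # Equation (12): the bandwidth and frequency weights of each formant sum to 1.
    for bb, bf in zip(beta_b, beta_f):
        if not math.isclose(bb + bf, 1.0):
            raise ValueError("beta_Bi + beta_Fi must equal 1 for each formant")
    lsp_term = sum(bb * sb + bf * sf
                   for bb, sb, bf, sf in zip(beta_b, s_b, beta_f, s_f))
    return alpha_p * s_p + alpha_lsp * lsp_term

s_j = overall_similarity(
    s_p=0.2, s_b=[0.1, 0.1, 0.1], s_f=[0.3, 0.3, 0.3],
    alpha_p=0.5, alpha_lsp=0.5,
    beta_b=[0.5, 0.5, 0.5], beta_f=[0.5, 0.5, 0.5],
)
```

Validating the constraints before combining the metrics keeps empirically tuned weights from silently drifting away from Equations (11) and (12).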
Returning to
Returning to S314 of
Returning to S310 of
Returning to S306, if the counter value j is greater than threshold m, a vector trajectory matching operation may be performed at S321. Trajectory matching may be used to locate a correlation between a fixed codebook gain for the targeted packet and each fixed codebook gain for the stored reference packets. Trajectory matching may also be used to locate a correlation between the adaptive codebook gain for the targeted packet and the adaptive codebook gain for each reference packet vector. According to at least one example embodiment, vector trajectory matching may be performed using a Least Mean Square (LMS) and/or cross-correlation algorithm to determine a correlation between the targeted packet and each similar reference packet. Because LMS and cross-correlation algorithms are well-known in the art, a detailed discussion thereof has been omitted for the sake of brevity.
In at least one example embodiment, the vector trajectory matching may be used to verify the similarity between the targeted packet and each of the stored similar reference packets. In at least one example embodiment, the trajectory vector matching at S321 may be used to filter out similar reference packets failing a correlation threshold. Overall similarity metrics Sj associated with stored similar reference packets failing the correlation threshold may be removed from the memory 208. The correlation threshold may be determined based on experimental data as is well-known in the art.
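The filtering step can be illustrated with a zero-lag normalized cross-correlation between gain trajectories. This is a sketch under stated assumptions: the choice of normalized correlation, the dictionary layout keyed by reference index j, and all names and sample values are hypothetical, and an LMS-based match could be substituted.

```python
import math

def normalized_correlation(x, y):
    """Zero-lag normalized cross-correlation of two equal-length trajectories."""
    if len(x) != len(y) or not x:
        raise ValueError("trajectories must be non-empty and of equal length")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x) *
                    sum((b - mean_y) ** 2 for b in y))
    return num / den if den > 0 else 0.0

def filter_by_trajectory(metrics, tgt_traj, ref_trajs, threshold):
    """Keep only the metrics S_j whose reference trajectory passes the threshold."""
    return {j: s for j, s in metrics.items()
            if normalized_correlation(tgt_traj, ref_trajs[j]) >= threshold}

# Gain trajectories over a 6-packet match window (hypothetical values).
target = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
refs = {1: [1, 2, 3, 4, 5, 6],   # same shape as the target, scaled
        2: [6, 5, 4, 3, 2, 1]}   # opposite trend
kept = filter_by_trajectory({1: 0.2, 2: 0.1}, target, refs, threshold=0.8)
print(kept)  # {1: 0.2}
```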
Although the method of
At S322, the remaining stored overall similarity metrics Sj in the memory 208 may be searched to determine which of the similar reference packets includes echoed voice information. In other words, the similar reference packets may be searched to determine which reference packet matches the targeted packet. In example embodiments, the reference packet matching the targeted packet may be the reference packet with the minimum associated overall similarity metric Sj.
If the similarity metrics Sj are indexed in the memory by targeted packet T and reference packet Rj (methods for doing so are well-known and are omitted for the sake of brevity), the overall similarity metrics may be expressed as S(T, Rj), for j=1, 2, 3 . . . m.
Representing the overall similarity metrics as S(T, Rj), for j=1, 2, 3 . . . m, the minimum overall similarity metric Smin may be obtained using Equation (13):
$$S_{\min} = \min_{j = 1, 2, \ldots, m} S(T, R_j) \tag{13}$$
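The search for the minimum overall similarity metric can be sketched as follows, assuming the remaining metrics are held in a dictionary keyed by the reference packet index j. The container layout and the names are hypothetical.

```python
def select_matching_reference(metrics):
    """Return (j, S_min) for the reference packet with the smallest metric,
    or None if no similar reference packets remain."""
    if not metrics:
        return None
    j_min = min(metrics, key=metrics.get)
    return j_min, metrics[j_min]

print(select_matching_reference({1: 0.4, 2: 0.15, 3: 0.3}))  # (2, 0.15)
print(select_matching_reference({}))                         # None
```

Returning `None` for an empty set covers the case in which every candidate was filtered out before the search.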
Returning again to
For example, echo may be cancelled/suppressed by attenuating fixed codebook gains as shown in Equation (14):
$$G'_{fR} = W_f \cdot S \cdot G_{fR} \tag{14}$$
and/or adaptive codebook gains as shown in Equation (15):
$$G'_{\alpha R} = W_\alpha \cdot S \cdot G_{\alpha R} \tag{15}$$
As shown in Equation (14), GfR′ is an adjusted gain for a fixed codebook associated with a reference packet, and Wf is the gain weighting for the fixed codebook.
As shown in Equation (15), GαR′ is the adjusted gain for the adaptive codebook associated with the reference packet and Wα is the gain weighting for the adaptive codebook. Initially, both Wf and Wα may be equal to 1. However, these values may be adaptively adjusted according to, for example, speech characteristics (e.g., voiced or unvoiced) and/or the proportion of echo in targeted packets relative to reference packets.
According to example embodiments, adaptive codebook gains and fixed codebook gains of targeted packets are attenuated. For example, based on the similarity of a reference and targeted packet, gains of adaptive and fixed codebooks in targeted packets may be adjusted.
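The gain adjustment can be sketched as a pair of scalar multiplications. In this sketch the factor S that appears in Equations (14) and (15) is treated as a plain scalar supplied by the caller, which is an assumption, since the text does not define it further. The function and argument names are hypothetical.

```python
def attenuate_gains(g_fixed, g_adaptive, w_f=1.0, w_a=1.0, s=1.0):
    """Scale the fixed and adaptive codebook gains in the parametric domain.

    w_f and w_a start at 1 and may be adapted (e.g., for voiced or unvoiced
    speech); s is the assumed scalar factor S from Equations (14) and (15).
    """
    return w_f * s * g_fixed, w_a * s * g_adaptive

# Initial weights of 1 leave the gains unchanged.
print(attenuate_gains(8.0, 4.0))                      # (8.0, 4.0)
# Smaller weights attenuate the gains without decoding the packet.
print(attenuate_gains(8.0, 4.0, w_f=0.5, w_a=0.25))   # (4.0, 1.0)
```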
According to example embodiments, echo may be canceled/suppressed using extracted parameters in the parametric domain without decoding and re-encoding the targeted voice signal.
Although only a single iteration of the method shown in
The invention being thus described, it will be obvious that the same may be varied in many ways. Such variations are not to be regarded as a departure from the invention, and all such modifications are intended to be included within the scope of the invention.
Number | Date | Country |
---|---|---|
20080069016 A1 | Mar 2008 | US |