1. Field of the Invention
Embodiments are directed to methods and means for decoding background noise information in speech signal encoding methods.
2. Background of the Related Art
Since the beginnings of telecommunication, analog voice transmission for telephone calls has been subject to a bandwidth limitation. Voice transmission takes place in a limited frequency range of 300 Hz to 3400 Hz.
Such a limited range of frequencies is also specified in many voice signal encoding methods for present-day digital telecommunications. To this end, the bandwidth of the analog signal is limited prior to any encoding procedure. A codec is used for coding and decoding, and because its bandwidth is limited to the range between 300 Hz and 3400 Hz, it is referred to as a narrowband speech codec in the following text. The term codec is understood to mean both the coding rule for digital encoding of audio signals and the decoding rule for decoding data with the goal of reconstructing the audio signal.
One example of a narrowband speech codec is known as the ITU-T Standard G.729. The coding rule described therein provides for the transmission of a narrowband speech signal at a bit rate of 8 kbit/s.
Moreover, so-called wideband speech codecs are known, which provide encoding in an expanded frequency range for the purpose of improving the auditory impression. Such an expanded frequency range lies, for example, between a frequency of 50 Hz and 7000 Hz. One example of a wideband speech codec is known as the ITU-T Standard G.729.EV.
Customarily, encoding methods for wideband speech codecs are configured so as to be scalable. Scalability is here taken to mean that the transmitted encoded data contain various delimited blocks, which contain the narrowband component, the wideband component, and/or the full bandwidth of the encoded speech signal. Such a scalable configuration, on the one hand, allows downward compatibility on the part of the recipient and, on the other hand, in the case of limited data transmission capacities in the transmission channel, makes it easy for the sender and recipient to adjust the bit rate and the size of transmitted data frames.
To reduce the data transmission rate by means of a codec, customarily the data to be transmitted are compressed. Compression is achieved, for example, by encoding methods in which parameters for an excitation signal and filter parameters are specified for encoding the speech data. The filter parameters as well as the parameter that specifies the excitation signal are then transmitted to the receiver. There, with the aid of the codec, a synthetic speech signal is synthesized, which resembles the original speech signal as closely as possible in terms of a subjective auditory impression. With the aid of this method, which is also referred to as the “analysis by synthesis” method, the samples that are established and digitized are not transmitted themselves, but rather the parameters that were ascertained, which render a synthesis of the speech signal possible on the receiver's side.
A method for discontinuous transmission, which is also known in the field as DTX, affords an additional way to reduce the data transmission rate. The fundamental goal of DTX is to reduce the data transmission rate when there is a pause in speaking.
To this end, the sender employs speech pause recognition (Voice Activity Detection, VAD), which recognizes a speech pause if the signal falls below a certain level.
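The level-based speech pause recognition described above can be illustrated with a minimal sketch. The function name, the use of mean frame energy as the "signal level", and the threshold value are assumptions made for illustration; practical detectors typically evaluate additional features.

```python
def is_speech_pause(frame, level_threshold):
    """Return True if the frame's mean energy lies below level_threshold.

    frame           -- sequence of PCM samples (hypothetical input format)
    level_threshold -- assumed energy threshold; not specified in the text
    """
    if not frame:
        raise ValueError("frame must contain at least one sample")
    # Mean energy of the frame serves as the "signal level" in this sketch.
    mean_energy = sum(sample * sample for sample in frame) / len(frame)
    return mean_energy < level_threshold
```

A frame of near-silent samples is classified as a speech pause, while a frame with a clearly higher level is classified as active speech.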
Customarily, the receiver does not expect complete silence during a speech pause. On the contrary, complete silence would lead to annoyance on the receiver's part or even to the suspicion that the connection had been interrupted. For this reason, methods are employed to produce a so-called comfort noise.
A comfort noise is a noise synthesized to fill phases of silence on the receiver's side. The comfort noise serves to foster a subjective impression of a connection that continues to exist without requiring the data transmission rate that is used for the purpose of transmitting speech signals. In other words, less energy is expended for the sender to encode the noise than to encode the speech data. To synthesize—i.e., decode—the comfort noise in a manner still perceived by the receiver as realistic, data are transmitted at a far lower bit rate. The data transmitted in the process are also referred to within the field as SID (Silence Insertion Descriptor).
In the current state of the art, problems exist with the method for discontinuous transmission using wideband speech codecs, such as ITU-T G.729.1, G.722.2 or 3GPP AMR-WB, for example. The speech codecs referred to as scalable wideband typically support different data transmission rates in a wideband range of 50 to 7000 Hz.
Possible bit rates for encoding speech information are, for instance, 8, 12, 14, 16, . . . , 32 kbit/s, which are used in Standard G.729.1, for example. The bit rates of 8 and 12 kbit/s are applied in narrowband signals (50 Hz to 4 kHz). Bit rates of more than 12 kbit/s are applied to the upper spectrum of 4 to 7 kHz.
A change between the aforementioned bit rates is possible during a transmission. A sudden change between a wideband and a narrowband bit rate is known to have a disturbing effect on a human recipient. Such a transition takes place, for instance, as a consequence of a bitstream truncation, which can be caused by a transfer network between the sender and receiver, for example when additional connections are established or when the transfer network is congested. This truncation leads to a change in the bit rate and ultimately to a transition from wideband to narrowband transfer of the speech signal.
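The relationship between bit rate and bandwidth stated above (8 and 12 kbit/s narrowband, more than 12 kbit/s wideband) can be sketched as follows. The function and constant names are hypothetical, and the sketch only classifies a rate; it does not model how a network actually truncates a bitstream.

```python
# Highest bit rate that still carries only the narrowband component,
# taken from the rates named in the text (8 and 12 kbit/s).
NARROWBAND_MAX_KBPS = 12


def band_mode(bit_rate_kbps):
    """Classify a received bit rate as 'narrowband' or 'wideband'."""
    if bit_rate_kbps > NARROWBAND_MAX_KBPS:
        return "wideband"
    return "narrowband"


def truncation_changes_band(sent_kbps, received_kbps):
    """Return True if truncation moved the frame from wideband to narrowband."""
    return band_mode(sent_kbps) == "wideband" and band_mode(received_kbps) == "narrowband"
```

Under these assumptions, a frame sent at 32 kbit/s and truncated to 12 kbit/s changes from wideband to narrowband, whereas a truncation from 32 to 14 kbit/s stays within wideband.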
If the discontinuous transmission or DTX method is used in the encoding method, the data transmission rate for transmitting the respective data frame can be reduced. The DTX method is used precisely when a corresponding frame is characterized as a speech pause. Use of the DTX method reduces the data transmission rate due to two factors. First, on the side of the encoder, not all inactive frames have to be sent to the decoder. Second, a sent SID frame or inactive frame uses far fewer bits than a speech data frame.
Such a method requires voice activity detection (VAD) on the encoder side. By means of a voice activity detector, the encoder is informed as to whether the currently sampled frame to be encoded contains a speech signal or a speech pause with background noise. This classification affects the encoder's actions, which ascertain the perceptual characteristics of an inactive speech frame. Such perceptual characteristics include the energy transmitted, for instance, as well as spectral and temporal characteristics.
The encoder sends a specially identified frame, an SID (Silence Insertion Descriptor) frame, to the decoder. The decoder synthesizes a comfort noise based on the information contained in the SID frame. From the SID frame, the decoder can determine whether the noise information it contains is narrowband or wideband information.
A change in the bit rate (Bit Rate Switching) between narrowband and wideband information is a typical scenario for every scalable wideband speech codec. Handling a bit rate switch during a normal speech phase, i.e., in the absence of speech pauses, is amply described in the literature, but handling one during entry into a DTX phase is not yet known. Therefore, an urgent need exists to provide a method for bit rate switching during a DTX phase and/or during entry into a DTX phase in order to optimally respond to a switch between a narrowband and wideband bit rate before or during the transition into the DTX phase.
During a speech pause, a truncation of the bit rate is unlikely, because an SID frame already needs far fewer bits than an active speech data frame in “normal” codec operation, i.e., codec operation during an exclusively speaking phase.
This leads to a possible scenario in which the bit rate is changed during an active speaking phase, but in speech pauses, i.e., during the DTX phase, remains in a wideband mode. This can be very disturbing to the human recipient on the decoder side, because in this case the active speech frames are decoded in narrowband while the background noise in the speech pauses is rendered in wideband.
This is more likely to occur, for instance, in situations in which the speech data frame sent on the encoder end is truncated by the transmission network, while on the side of the transmission network, there is still sufficient capacity remaining for transmission of the wideband SID frame.
As yet, no method for switching the bit rate of the SID frame during a speech pause is known. The existing method for bitstream switching applies solely to normal codec operation during an active speaking phase.
Embodiments of the invention provide a method for bitstream switching of SID frames during a speech pause that results in improved quality of the signal synthesized by the decoder.
A basic idea of the invention is to ascertain information in the course of the bit rate switching during an active speech phase. The scalable nature of the invented method for use in speech signal encoding methods and codecs has already shown the feasibility of the codec for bit rate switching.
According to embodiments of the invention, during the speech phase, information on the percentage proportion of wideband active speech frames is collected in comparison to the narrowband active speech frames on the decoder side. In other words, the information on the nature of the background noise in a speech pause is not collected for the first time at the point of the switch, as has been suggested by the state of the art to this point. A higher percentage proportion of wideband active speech frames shows thereby that wideband use on the side of the codec is preferable, and therefore a need exists to synthesize, i.e., decode, wideband noise information during a DTX phase. In contrast, if a lower percentage proportion is determined, narrowband noise will be generated by the decoder upon entry into a DTX phase, even when the received SID frame would have allowed the synthesis—i.e., decoding—of wideband noise.
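The decoder-side bookkeeping described above can be sketched as follows. The class name, the method names, and the decision threshold of 0.5 are assumptions for illustration; the text only distinguishes a "higher" from a "lower" percentage proportion. The sketch also assumes that wideband noise can only be synthesized when the received SID frame carries wideband information.

```python
class WidebandShareTracker:
    """Collects the share of wideband active speech frames at the decoder."""

    def __init__(self, threshold=0.5):
        # threshold is an assumed value; the text gives no concrete number.
        self.threshold = threshold
        self.wideband_frames = 0
        self.active_frames = 0

    def observe_active_frame(self, is_wideband):
        """Record one received active speech frame."""
        self.active_frames += 1
        if is_wideband:
            self.wideband_frames += 1

    def share(self):
        """Proportion of wideband frames among all active frames so far."""
        if self.active_frames == 0:
            return 0.0
        return self.wideband_frames / self.active_frames

    def noise_mode_on_dtx_entry(self, sid_is_wideband):
        """Choose the comfort noise bandwidth on entry into the DTX phase."""
        if sid_is_wideband and self.share() >= self.threshold:
            return "wideband"
        # A low wideband share yields narrowband noise even if the SID
        # frame would have allowed wideband synthesis.
        return "narrowband"
```

With eight wideband and two narrowband active frames, the share is 0.8 and wideband noise is chosen for a wideband SID frame; with the proportions reversed, narrowband noise is chosen even though the SID frame is wideband.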
With this method, the aim of certain embodiments of the invention, namely to provide a method for bitstream switching of SID frames during a speech pause, is achieved. According to the solution presented here, switching between noise information with different bit rates is improved by determining a proportion of noise information with different bit rates. In contrast to a hard switch, this proportion can take any ratio between noise information with different bit rates.
Due to the variability and adaptability of the noise signal quality with respect to the previously collected speech signal quality (narrowband/wideband), the quality of the total resulting signal, that is, noise and speech signal, is considerably improved on the side of the receiver. Embodiments therefore may achieve an improved quality of the signal synthesized by the decoder.
Such an approach according to the invented method proves to be the foundation for advantageous further embodiments of the invention, which are the object of the subordinate claims.
If, according to the invented method, a decision is made to the effect that during a speech pause, a noise signal of a certain quality (i.e., wideband or narrowband) is synthesized, it can result that the active data frame is truncated on the side of the network in the last few frames during an active speech phase.
For clarification, it is initially assumed that the codec applied favors a wideband rendering mode and that a wideband transmission mode was also predominantly provided by the transmission network. This can lead to the case that a few active speech frames arrive as narrowband speech frames at the receiving decoder before the first SID frames are received there.
In this case, without additional measures, an abrupt transition from the narrowband speech signal to a wideband noise signal occurs during the first few SID frames. However, such a transition for returning to a wideband receiver status is so significant that this transition is generally considered disturbing to the receiver.
A further embodiment of the invention provides that, on entering into the DTX phase, initially predominantly narrowband decoding of the background noise information occurs, which is converted after a variable time period into predominantly wideband decoding. Such a transition occurs preferably quasi-continuously, with a transition adjusted to a specified proportional factor at discrete time points—which is why it is “quasi”-continuous.
According to a further embodiment of the invention, a method for fast switching is proposed in which a quasi-continuous transition from a narrowband (proportional factor=0) to a wideband (proportional factor=1) noise signal quality is carried out within a set time frame of 100 ms.
This transition is carried out on the side of the decoder.
The following values for the proportional factor have proven to be particularly advantageous for subjective human hearing, according to a further embodiment of the invention:
A proportional factor of 0 for the time point of entry into the DTX phase, therefore exclusively narrowband noise;
A proportional factor of 0.09525986892242 for a time point 20 ms after entry into the DTX phase;
A proportional factor of 0.19753086419753 for a time point 40 ms after entry into the DTX phase;
A proportional factor of 0.36595031245237 for a time point 60 ms after entry into the DTX phase;
A proportional factor of 0.62429507696997 for a time point 80 ms after entry into the DTX phase; and
A proportional factor of 1, therefore exclusively wideband signal, for a time point 100 ms after entry into the DTX phase.
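The schedule listed above can be sketched as a lookup table that is stepped at discrete time points. The table values are taken from the text; the function names, the choice to hold each value until the next 20 ms point, and the way the factor is applied (scaling a high-band noise component that is added to the narrowband component) are assumptions for illustration.

```python
# Proportional factor per time point (ms after entry into the DTX phase),
# as listed in the text.
FACTOR_BY_MS = {
    0: 0.0,
    20: 0.09525986892242,
    40: 0.19753086419753,
    60: 0.36595031245237,
    80: 0.62429507696997,
    100: 1.0,
}


def proportional_factor(ms_since_dtx_entry):
    """Return the factor valid at the given time (held between time points)."""
    if ms_since_dtx_entry < 0:
        raise ValueError("time since DTX entry must not be negative")
    # Snap down to the last 20 ms time point and cap at 100 ms.
    time_point = min((int(ms_since_dtx_entry) // 20) * 20, 100)
    return FACTOR_BY_MS[time_point]


def mix_comfort_noise(low_band, high_band, factor):
    """Assumed mixing rule: narrowband part plus scaled high-band part."""
    return [lo + factor * hi for lo, hi in zip(low_band, high_band)]
```

A factor of 0 leaves only the narrowband component, and a factor of 1 adds the full high-band component. As an arithmetic observation that the text does not state, the four intermediate values agree to the printed precision with (k/9)^4 for k = 5, 6, 7, 8.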
According to a further embodiment of the invention, it is assumed that the codec used favors a narrowband rendering mode and/or that a wideband transmission mode was not permitted by the transmission network in the past. This can lead to the case that a few active speech frames arrive as wideband speech frames at the receiving decoder before the first SID frames are received.
According to a further embodiment of the invention, it is provided that on entry into the DTX phase, initially predominantly wideband decoding of the background noise information takes place, which is converted into predominantly narrowband decoding after a variable amount of time. Such a transition preferably takes place quasi-continuously, in a manner similar to the above-described further embodiment, with the transition adjusted to a specified proportional factor at discrete time points.
According to a further embodiment of the invention, a method for fast switching is proposed in which a quasi-continuous transition from wideband (proportional factor=1) to narrowband (proportional factor=0) noise signal quality is carried out within a specified time period of 100 ms. This transition is carried out on the side of the decoder.
For the quasi-continuous transition from wideband to narrowband noise signal quality, the proportional factor has values as above, but set in reverse order.
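The reversed schedule can be sketched by reusing the listed values in reverse order. The names and the step-wise handling between the 20 ms time points are assumptions for illustration.

```python
# Factors for the narrowband-to-wideband transition at 0, 20, ..., 100 ms,
# as listed in the text.
NB_TO_WB_FACTORS = [
    0.0,
    0.09525986892242,
    0.19753086419753,
    0.36595031245237,
    0.62429507696997,
    1.0,
]

# Wideband-to-narrowband transition: same values in reverse order.
WB_TO_NB_FACTORS = list(reversed(NB_TO_WB_FACTORS))


def wb_to_nb_factor(ms_since_dtx_entry):
    """Return the factor for the wideband-to-narrowband transition."""
    if ms_since_dtx_entry < 0:
        raise ValueError("time since DTX entry must not be negative")
    # One table entry per 20 ms step; the last entry is held after 100 ms.
    index = min(int(ms_since_dtx_entry) // 20, len(WB_TO_NB_FACTORS) - 1)
    return WB_TO_NB_FACTORS[index]
```

In this sketch the factor starts at 1 on entry into the DTX phase, drops to 0.62429507696997 after 20 ms, and reaches 0 after 100 ms.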
An embodiment example with additional advantages and configurations of the invention is illustrated in greater detail in the following by means of the drawing.
In
At a third time t3, it is assumed that a transition into a DTX phase occurs because of a speech pause on the side of the sender. After the third time t3, SID frames SID are consequently sent at specified intervals.
After the third point t3, the situation previously explained commences: that in the past, during the phase of time between the second time t2 and the third time t3, a narrowband speech signal was transmitted, and after the third time point t3, from that point on a wideband noise signal is provided through the corresponding SID frame. The bit rate of the SID frame corresponds to 43 bit/20 ms=2.15 kbit/s at a length of 43 bits per SID frame and a period of 20 ms per SID frame sent.
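The bit rate arithmetic for the SID frame can be checked with a short sketch; the function name is hypothetical. Since one bit per millisecond equals one kbit/s, dividing the frame length in bits by the frame period in milliseconds gives the rate directly in kbit/s.

```python
def frame_bit_rate_kbps(bits_per_frame, frame_period_ms):
    """Bit rate in kbit/s for frames of the given size sent at the given period."""
    if frame_period_ms <= 0:
        raise ValueError("frame period must be positive")
    # bits / ms is numerically identical to kbit/s.
    return bits_per_frame / frame_period_ms


sid_rate = frame_bit_rate_kbps(43, 20)  # 43 bits every 20 ms
print(sid_rate)  # 2.15
```

This is well below the lowest speech bit rate of 8 kbit/s named earlier in the text.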
In this situation, the case occurs that on the decoder side an immediate, i.e., discontinuous, transition from a narrowband speech signal to a wideband noise signal will take place. Such an abrupt transition is perceived by a human recipient as acutely disturbing.
In
In
In the following, it is assumed that entry into a DTX phase occurred at a time t3 for the example of
According to embodiments herein, during the speech phase on the side of the decoder, information on the proportion of wideband active speech frames is collected in comparison to the narrowband active speech frames.
For the example of
On entering into a DTX phase at time t3 in the
In the
In
The transition into the DTX phase occurs at the time TIME of 0 ms shown in the drawing of
A proportion HB-SHARE of 0.09525986892242 at the time TIME of 20 ms;
A proportion HB-SHARE of 0.19753086419753 at the time TIME of 40 ms;
A proportion HB-SHARE of 0.36595031245237 at the time TIME of 60 ms; and
A proportion HB-SHARE of 0.62429507696997 at the time TIME of 80 ms.
Another embodiment of the invention provides a transition from a wideband speech signal to a narrowband noise signal in a similar manner.
For this purpose, a scenario is assumed which is slightly modified in reference to
Number | Date | Country | Kind |
---|---|---|---|
10-2008-009-720.9 | Feb 2008 | DE | national |
This application is the United States national phase under 35 U.S.C. § 371 of PCT International Patent Application No. PCT/EP2009/051120, filed on Feb. 2, 2009, and claiming priority to German National Application No. 10 2008 009 720.9, filed on Feb. 19, 2008. Those applications are incorporated by reference herein.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/EP09/51120 | 2/2/2009 | WO | 00 | 8/16/2010 |