1. Field of the Invention
Embodiments relate to encoding background noise information in voice signal encoding methods.
2. Description of the Related Art
Since the beginnings of telecommunication, a limitation of bandwidth for analog voice transmission has been designated for telephone calls. Voice transmission takes place at a limited frequency range of 300 Hz to 3400 Hz.
Such a limited range of frequencies is also designated in many voice signal encoding methods for present-day digital telecommunications. To this end, prior to any encoding procedure, the analog signal's bandwidth is delimited. In the process, a codec is used for coding and decoding, which, because of the described delimitation of its bandwidth between 300 Hz and 3400 Hz, is also referred to as a narrowband speech codec in the following text. The term codec is understood to mean both the coding requirement for digital coding of audio signals and the decoding requirement for decoding data with the goal of reconstructing the audio signal.
One example of a narrowband speech codec is known as the ITU-T Standard G.729. Transmission of a narrowband speech signal having a bit rate of 8 kbits/s is possible using the coding requirement described therein.
Moreover, so-called wideband speech codecs are known, which provide encoding in an expanded frequency range for the purpose of improving the auditory impression. Such an expanded frequency range lies, for example, between a frequency of 50 Hz and 7000 Hz. One example of a wideband speech codec is known as the ITU-T Standard G.729.EV.
Customarily, encoding methods for wideband speech codecs are configured so as to be scalable. Scalability is here taken to mean that the transmitted encoded data contain various delimited blocks, which contain the narrowband component, the wideband component, and/or the full bandwidth of the encoded speech signal. Such a scalable configuration, on the one hand, allows downward compatibility on the part of the recipient and, on the other hand, in the case of limited data transmission capacities in the transmission channel, makes it easy for the sender and recipient to adjust the bit rate and the size of transmitted data frames.
To reduce the data transmission rate by means of a codec, customarily the data to be transmitted are compressed. Compression is achieved, for example, by encoding methods in which parameters for an excitation signal and filter parameters are specified for encoding the speech data. The filter parameters as well as the parameter that specifies the excitation signal are then transmitted to the recipient. There, with the aid of the codec, a synthetic speech signal is synthesized, which resembles the original speech signal as closely as possible in terms of a subjective auditory impression. With the aid of this method, which is also referred to as the “analysis by synthesis” method, the samples that are established and digitized are not transmitted themselves, but rather the parameters that were ascertained, which render a synthesis of the speech signal possible on the recipient's side.
A method for discontinuous transmission, which is also known in the field as DTX, affords an additional way to reduce the data transmission rate. The fundamental goal of DTX is to reduce the data transmission rate when there is a pause in speaking.
To this end, the sender employs speech pause recognition (Voice Activity Detection, VAD), which recognizes a speech pause if a certain signal level is not met. Customarily, the recipient does not expect complete silence during a speech pause. On the contrary, complete silence would lead to annoyance on the recipient's part or even to the suspicion that the connection had been interrupted. For this reason, methods are employed to produce a so-called comfort noise.
A comfort noise is a noise synthesized to fill phases of silence on the recipient's side. The comfort noise serves to foster a subjective impression of a connection that continues to exist without requiring the data transmission rate that is used for the purpose of transmitting speech signals. In other words, less energy is expended for the sender to encode the noise than to encode the speech data. To synthesize the comfort noise in a manner still perceived by the recipient as realistic, data are transmitted at a far lower bit rate. The data transmitted in the process are also referred to within the field as SID (Silence Insertion Descriptor).
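The level-based speech pause recognition and the resulting choice between speech frames and SID frames can be illustrated with a minimal sketch. The threshold value, the function names, and the frame labels are assumptions made for illustration only. The text does not specify them, and a practical VAD uses considerably more elaborate criteria than a single level comparison.

```python
import math

# Hypothetical threshold; the text only says "a certain signal level".
LEVEL_THRESHOLD_DB = -40.0

def frame_level_db(samples):
    """Return the mean power of a frame in dB relative to full scale (1.0)."""
    power = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(power) if power > 0.0 else float("-inf")

def classify_frames(frames, threshold_db=LEVEL_THRESHOLD_DB):
    """Label each frame 'SPEECH' (full-rate coding) or 'SID' (noise description)."""
    return ["SPEECH" if frame_level_db(f) >= threshold_db else "SID"
            for f in frames]

loud = [0.5, -0.5, 0.5, -0.5]          # about -6 dB, treated as speech
quiet = [0.001, -0.001, 0.001, -0.001]  # about -60 dB, treated as a pause
labels = classify_frames([loud, quiet, loud])
```

For simplicity, the sketch labels every pause frame as an SID frame; how often SID data are actually sent during a pause is not covered here.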
Codecs presently in development focus on scalable encoding of speech information. A scalable approach yields an encoding result made up of different blocks, which contain the narrowband component of the original speech signal, the wideband component, or the full bandwidth of the speech signal, that is, a frequency range between 50 Hz and 7000 Hz, for example.
In the present scalable encoding method, the encoding of background noise information occurs either over the entire bandwidth of the input noise signal or over a section of the bandwidth of the input noise signal. The encoded noise signal is transmitted in SID frames by means of the DTX method and reconstructed on the receiver's side. The reconstructed, i.e., synthesized, comfort noise may then have a different quality than the synthesized speech information on the receiver's side. This negatively impacts the receiver's auditory impression.
Embodiments of the invention may provide an improved implementation of the DTX method in scalable speech codecs.
Further embodiments may provide, in the form of an SID frame, scalability similar to that known for the transmission of voice information.
One method for encoding an SID frame for transmission of background noise information in the application of a scalable voice encoding method provides for encoding of a narrowband component of the background noise information first and a wideband component second. The encoding is customarily simultaneous and takes place in different ways. However, the encoding of a component can obviously also take place staggered in time before or after the encoding of another component. In addition, both components can optionally be encoded in the same way. After both components are encoded, an SID frame is formed with separate areas for the first and second components. In other words, in the SID frame, a first data area records the data for the encoded first component, while a separate data area records data for the second encoded component.
An important advantage of embodiments of the invention is that it is specified, on the receiver's side, whether comfort noise should occur based on the wideband component of the transmitted SID frame or on the narrowband component. This is a particular advantage for acoustic reception on the receiver's end in a situation in which the transmission rate for speech information frames is decreased such that only narrowband voice information is transmitted. If narrowband speech information is synthesized in combination with wideband noise, as in the current state of the art, this is very annoying to the receiver. The aforementioned decrease of the transmission rate for speech information frames can be caused by high utilization (congestion) of the network between the sender and receiver, for example. The significantly smaller SID frames are not affected by such a network bottleneck. Thus, for them, there is no constraint to reduce either their data transmission rate or their content.
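The receiver-side choice described above can be sketched as follows. The function name, the string labels, and the rule that the narrowband component is always present are assumptions for illustration. The text only states that the receiver can base the comfort noise on either the wideband or the narrowband component of the SID frame.

```python
def select_comfort_noise(last_speech_bandwidth, sid_components):
    """Choose which SID components drive comfort noise synthesis.

    last_speech_bandwidth: "narrowband" or "wideband", the bandwidth of the
        most recently decoded speech frames (hypothetical representation).
    sid_components: set of component labels present in the received SID
        frame, for example {"LB", "HB"}.
    """
    if "LB" not in sid_components:
        # Assumption: every SID frame carries at least the narrowband part.
        raise ValueError("SID frame must carry the narrowband component")
    if last_speech_bandwidth == "wideband" and "HB" in sid_components:
        return ["LB", "HB"]  # wideband comfort noise
    return ["LB"]            # narrowband noise to match narrowband speech
```

With this rule, a receiver that has dropped to narrowband speech because of congestion ignores the wideband area of the SID frame, so speech and comfort noise keep the same bandwidth.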
According to a further advantageous embodiment of the invention, a third component is provided in the definition of the SID frame. This contains encoded background noise parameters which are encoded with a higher bit rate, although the third component still contains narrowband data (expanded narrowband or “Enhanced Low Band” data). The advantage of a definition of the SID frame with this third component lies in the ability to render a noise signal of increased quality in comparison to conventional narrowband encoding and thereby still remain in conformance with Standard G.729.B.
An embodiment example with additional advantages and configurations of the invention is illustrated in greater detail in the following by means of the drawing.
The FIGURE shows a structure of an SID frame according to the invention.
In the following, the technical background underlying the invention is described in greater detail, initially without reference to the drawing.
Discontinuous transmission (DTX) methods implemented in current scalable encoding methods for wideband speech codecs do not currently support the scalability feature for transmission of background noise information, which is intended for the transmission of speech information.
As a current workaround, encoding takes place either over the entire bandwidth of an input noise signal or over a section of the bandwidth of the input noise signal.
In the past, two main types of speech codecs were developed: on the one hand, narrowband speech codecs such as 3GPP AMR and ITU-T G.729, and on the other hand, wideband speech codecs such as 3GPP AMR-WB and ITU-T G.722. A narrowband speech codec encodes speech signals with a sampling rate of 8 kHz in a bandwidth which customarily has a frequency range lying between 300 Hz and 3400 Hz. A wideband speech codec encodes a speech signal with a sampling rate of 16 kHz in a bandwidth with a frequency range between 50 Hz and 7000 Hz.
Some of these codecs use DTX methods, i.e., discontinuous transmission methods, in order to reduce the total transmission rate in the communication channel. According to the DTX method, SID frames are sent where the bandwidth of the SID frame corresponds to the bandwidth of the speech signal. The background noise during a speech pause is described in an SID frame.
Codecs currently in development focus on scalable encoding. A scalable approach yields an encoding result made up of different blocks, which contain the narrowband component of the original speech signal, the wideband component, or the complete bandwidth of the speech signal, which is a frequency range between 50 Hz and 7000 Hz, for example. The wideband component customarily begins at a frequency of 4 kHz.
The existing DTX method does not currently support the scalable nature of codecs. Instead, encoding occurs either over the entire bandwidth of the input speech signal or over a section of the bandwidth of the input speech signal.
For clarification, the encoding method according to ITU-T Standard G.729.1 is described. This codec G.729.1 is a scalable speech codec in which the present non-scalable DTX method is applied to the entire bandwidth.
The encoding process during an active speech period, as opposed to a speech pause identified as a “Silent Period”, can be as follows:
The speech signal is separated into two components, namely a narrowband (Low Band) portion and a wideband (High Band) portion. Both signals are sampled at a sampling rate of 8 kHz. Partitioning into a narrowband and a wideband component takes place in a special band-pass filter, which is also called QMF (Quadrature Mirror Filter).
The narrowband component of the speech signal is encoded at bit rates of 8 and 12 kbit/s. A CELP (Code Excited Linear Prediction) process is used for encoding of the speech signal. For bit rates above 14 kbit/s, the narrowband component is further modified in consideration of the “Transform Codec” section of G.729.1. The wideband component of the current frame, again on condition that it contains speech signals, is encoded at a bit rate of 14 kbit/s by applying the TDBWE (Time Domain Bandwidth Extension) method. For a bit rate above 14 kbit/s, the transform codec section of G.729.1 is applied.
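The band split by a quadrature mirror filter can be illustrated with the simplest possible QMF pair, a two-tap (Haar) filter bank. This is a sketch under that assumption and not the filter used in G.729.1, which is considerably longer. It shows only the principle that a 16 kHz input yields a low-band and a high-band signal, each at half the input rate, from which the input can be rebuilt.

```python
import math

def qmf_analysis(x):
    """Split x (even length) into low-band and high-band signals at half rate.

    Uses the two-tap Haar pair as a stand-in for a real QMF design.
    """
    if len(x) % 2:
        raise ValueError("expected an even number of samples")
    s = 1.0 / math.sqrt(2.0)
    low = [s * (x[i] + x[i + 1]) for i in range(0, len(x), 2)]
    high = [s * (x[i] - x[i + 1]) for i in range(0, len(x), 2)]
    return low, high

def qmf_synthesis(low, high):
    """Recombine the two half-rate bands into the full-rate signal."""
    s = 1.0 / math.sqrt(2.0)
    x = []
    for l, h in zip(low, high):
        x.append(s * (l + h))
        x.append(s * (l - h))
    return x
```

Sixteen input samples produce eight samples per band, which corresponds to the 16 kHz to 8 kHz relationship between the input and the two components.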
The Standard G.729.1 does not provide a method for discontinuous transmission, so in speech pauses or “non-active voice periods”, a workaround is applied which is described in the following.
The speech signal is deconstructed into a narrowband and a wideband component, where both components are sampled at a frequency of 8 kHz. Decomposition takes place through a QMF filter as well.
The narrowband component is encoded by use of narrowband SID information. This narrowband SID information is sent to the receiver at a later point in time in an SID frame, which is compatible with Standard G.729. Additional measures as described above can contribute to an enhancement of the narrowband SID component.
The wideband component is encoded by applying a modified TDBWE method. During the so-called hangover periods, the speech signal continues to be encoded at a bit rate of 14 kbit/s, while the background noise detected in the speech pause is simultaneously analyzed and corresponding parameters are adjusted. The background noise is analyzed in terms of the energy of the noise signal and its frequency distribution. In contrast to the TDBWE method provided by Standard G.729.1, the temporal fine structure is not analyzed; rather, only an average of the energy over the frame is generated.
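The difference between analyzing the temporal fine structure and generating a single energy average per frame can be shown with a small sketch. The segment count and function names are hypothetical. The sketch does not reproduce the TDBWE parameter computation of G.729.1; it only contrasts the two levels of detail.

```python
def segment_energies(frame, num_segments):
    """Temporal fine structure: mean energy of each segment of the frame."""
    seg_len = len(frame) // num_segments
    return [
        sum(s * s for s in frame[k * seg_len:(k + 1) * seg_len]) / seg_len
        for k in range(num_segments)
    ]

def frame_average_energy(frame):
    """Coarse description used for noise: one mean energy for the whole frame."""
    return sum(s * s for s in frame) / len(frame)

frame = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
fine = segment_energies(frame, 2)       # one value per segment
coarse = frame_average_energy(frame)    # a single value for the frame
```

For roughly stationary background noise the single average loses little information, which is presumably why it suffices for the SID description.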
In the following, an embodiment of the invented method is explained based on the FIGURE.
The FIGURE shows an SID frame with separate areas for a narrowband first component LB (Low Band), a wideband second component HB (High Band) and an intermediate third component ELB (Enhanced Low Band).
The first component LB contains encoded background noise parameters, which are encoded at a bit rate of 8 kbit/s or lower. The data length of the first component LB is 15 bits, for example.
The second component HB contains encoded background noise parameters, which are encoded with a bit rate between 14 kbit/s and 32 kbit/s. The data length of the second component HB is 19 bits, for example.
The third component ELB contains encoded background noise parameters which are encoded at a bit rate of more than 8 kbit/s, such as 12 kbit/s for example. The data length of the third component ELB is 9 bits, for example. The advantage of a definition of the SID frame with a third component ELB consists of an option to render a noise signal of increased quality in comparison to conventional narrowband encoding methods while still remaining in conformance with Standard G.729.B.
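Using the example data lengths given above (15 bits for LB, 19 bits for HB, 9 bits for ELB), the formation of an SID frame with separate areas can be sketched as a bit-packing routine. The order LB, ELB, HB and the representation of each component as an unsigned integer are assumptions for illustration. The text fixes neither the bit order nor the parameter layout inside each area.

```python
# Example data lengths from the description.
LB_BITS, ELB_BITS, HB_BITS = 15, 9, 19

def pack_sid(lb, elb, hb):
    """Pack three component values into one integer, LB in the top bits.

    The area order LB | ELB | HB is an assumption for this sketch.
    """
    for name, value, bits in (("LB", lb, LB_BITS),
                              ("ELB", elb, ELB_BITS),
                              ("HB", hb, HB_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} does not fit in {bits} bits")
    return (lb << (ELB_BITS + HB_BITS)) | (elb << HB_BITS) | hb

def unpack_sid(frame):
    """Recover the (lb, elb, hb) values from a packed SID frame."""
    hb = frame & ((1 << HB_BITS) - 1)
    elb = (frame >> HB_BITS) & ((1 << ELB_BITS) - 1)
    lb = frame >> (ELB_BITS + HB_BITS)
    return lb, elb, hb
```

Because each component occupies its own area, a decoder that only understands the narrowband part can read the LB bits and ignore the rest, which is the basis of the scalability described here.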
During a speech pause, the characteristics of the background noise are acquired on the side of the encoder. The characteristics include the temporal distribution in particular as well as the spectral form of the background noise. For the acquisition process, a filter process is applied which considers the temporal and spectral parameters of the background noise from the previous frame. If significant changes in the character or in the strength of the background noise are revealed, a decision is made on the basis of threshold parameters (Threshold Values) about whether the acquired parameters need to be updated.
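The threshold-based update decision can be sketched as a comparison between the noise parameters last sent and those currently measured. The threshold values, the units, and the use of the largest per-band deviation as the measure of spectral change are assumptions for illustration; the text does not specify them.

```python
def needs_update(sent_energy_db, cur_energy_db,
                 sent_spectrum_db, cur_spectrum_db,
                 energy_threshold_db=2.0, spectral_threshold_db=3.0):
    """Return True if the noise changed enough to warrant new parameters.

    Both thresholds are hypothetical values chosen for this sketch.
    """
    energy_change = abs(cur_energy_db - sent_energy_db)
    spectral_change = max(
        abs(c - s) for c, s in zip(cur_spectrum_db, sent_spectrum_db)
    )
    return (energy_change > energy_threshold_db
            or spectral_change > spectral_threshold_db)
```

A small drift in level or spectral shape leaves the previously acquired parameters in place, while a change in strength or character beyond either threshold triggers an update.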
The following process is performed on the decoder or receiver side: When a “normal,” i.e., speech-signal-containing frame is received, customary decoding is performed. The bit rate for such a normal frame is typically 8 kbit/s or above. When an SID frame is received, comfort noise is synthesized, so that in the case of a wideband SID, wideband comfort noise is synthesized and distributed with a read-out gain factor.
Other embodiments include further details for inclusion of the DTX process in wideband codecs such as G.729.1, for example, and additional methods of modifying the TDBWE process, which support a synthesis of comfort noise during non-active frames, i.e., frames without speech information.
The following procedure is provided according to one embodiment.
This embodiment example is implemented in the following phases:
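$$\mathrm{tenv\_f}_{idx} = \alpha_{tenv} \cdot \mathrm{tenv}_{idx} + (1 - \alpha_{tenv}) \cdot \mathrm{tenv\_f}_{idx-1}$$
$$\mathrm{fenv\_f}_{idx}[i] = \alpha_{tenv} \cdot \mathrm{fenv}_{idx}[i] + (1 - \alpha_{tenv}) \cdot \mathrm{fenv\_f}_{idx-1}[i]$$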
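The two recursions above are first-order smoothing filters applied to the time envelope value and to each element of the frequency envelope. A minimal sketch follows, assuming that the filtered values of the previous frame are kept as state and that the same smoothing factor is used in both recursions, as written above. The function name and the example values are hypothetical.

```python
def smooth_envelopes(tenv_f_prev, fenv_f_prev, tenv, fenv, alpha_tenv):
    """Apply one step of the smoothing recursions for frame index idx.

    tenv_f_prev, fenv_f_prev: filtered values from frame idx-1.
    tenv, fenv: values measured for frame idx.
    alpha_tenv: smoothing factor between 0 and 1.
    """
    tenv_f = alpha_tenv * tenv + (1.0 - alpha_tenv) * tenv_f_prev
    fenv_f = [
        alpha_tenv * cur + (1.0 - alpha_tenv) * prev
        for cur, prev in zip(fenv, fenv_f_prev)
    ]
    return tenv_f, fenv_f

tenv_f, fenv_f = smooth_envelopes(0.0, [0.0, 8.0], 4.0, [4.0, 0.0], 0.25)
```

A small smoothing factor weights the history heavily, so the filtered envelopes follow slow changes of the background noise and suppress frame-to-frame fluctuations.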
Number | Date | Country | Kind |
---|---|---|---|
102008009719.5 | Feb 2008 | DE | national |
This application is the United States national phase under 35 U.S.C. §371 of International Application No. PCT/EP2009/051118, filed on Feb. 2, 2009, and claiming priority to German Patent Application No. 10 2008 009 719.5, filed on Feb. 19, 2008. Both of those applications are incorporated by reference herein.
Relation | Number | Date | Country |
---|---|---|---|
Parent | 12867969 | Aug 2010 | US |
Child | 14880490 | | US |