The present invention relates generally to processing of telecommunication signals. More particularly, the present invention relates to a method and apparatus for transcoding a bitstream encoded by a first voice coding format into a bitstream encoded by a second variable-rate voice coding format. Merely by way of example, the invention has been applied to variable-rate voice transcoding, but it would be recognized that the invention may also be applicable to other applications.
Telecommunication techniques have progressed through the years. One of the major goals of speech coding development is high-quality output speech at a low average data rate. One approach is to employ a variable bit-rate scheme, whereby the transmission rate is determined not only by the network traffic but also by the characteristics of the input speech signal. For example, when the signal is highly voiced, a high bit rate may be chosen; if the signal is weak, a low bit rate is chosen; and if the signal is mostly silence or background noise, a still lower bit rate is chosen. This often provides efficient allocation of the available bandwidth without sacrificing output voice quality. Such variable-rate coders include the TIA IS-127 Enhanced Variable Rate Codec (EVRC) and the 3rd Generation Partnership Project 2 (3GPP2) Selectable Mode Vocoder (SMV). These coders use Rate Set 1 of the Code Division Multiple Access (CDMA) communication standards IS-95 and cdma2000, which includes rates of 8.55 kbit/s (Rate 1 or full-rate), 4.0 kbit/s (half-rate), 2.0 kbit/s (quarter-rate) and 0.8 kbit/s (eighth-rate). SMV selects the bit rate based on the input speech characteristics and operates in one of six network-controlled modes, which limit the bit rate during high traffic. Depending on the mode of operation, different thresholds may be set to determine the rate usage percentages.
To accurately decide the desired transmission rate, and obtain high quality output speech at that rate, input speech frames are categorized into various classes. For example, in SMV, these classes include silence, unvoiced, onset, plosive, non-stationary voiced and stationary voiced speech. It is known that certain coding techniques are better suited for certain classes of sounds. Also, some types of sounds, for example, voice onsets or unvoiced-to-voiced transition regions, have higher perceptual significance and thus generally require higher coding accuracy than other classes of sounds, such as unvoiced speech. Thus, the speech frame classification may be used, not only to decide the most efficient transmission rate, but also the best-suited coding algorithm.
Accurate classification of input speech frames is desired to fully exploit the signal redundancies and perceptual importance. Typical frame classification techniques include voice activity detection, measuring the amount of noise in the signal, measuring the level of voicing, detecting speech onsets, and measuring the energy in a number of frequency bands. These measures generally require the calculation of numerous parameters, such as maximum correlation values, line spectral frequencies, and frequency transformations.
While coders such as SMV achieve much better quality at a lower average data rate than existing speech codecs at similar bit rates, the frame classification and rate determination algorithms are complex. In the case of a tandem connection of two speech vocoders, however, many of the measurements performed for frame classification have already been calculated in the source codec. This can be capitalized on in a transcoding framework. In transcoding from the bitstream format of one CELP codec to the bitstream format of another CELP codec, rather than fully decoding to PCM and re-encoding the speech signal, smart interpolation methods may be applied directly in the CELP parameter space. Hence parameters such as pitch lag, pitch gain, fixed codebook gain, line spectral frequencies and the source codec bit rate are available to the destination codec. This allows frame classification and rate determination of the destination voice codec to be performed quickly.
The simplest method of transcoding is a brute-force approach called tandem transcoding, shown in
Methods for “smart” transcoding similar to that illustrated in
Further, these transcoding methods do not cover transcoding between variable-rate voice coders, which determine the bit rate based on the characteristics of the input speech and, in some cases, external commands. During the transcoding process, the frame classification and rate decision of the destination voice codec are still computed in the speech signal domain. The transcoder thus requires the same amount of computational resources as the destination codec to classify frame types and to determine the bit rates. The smart transcoding of previous methods may lose part of its computational advantage, as the classification algorithms require parameters from intermediate stages of functions that have been omitted. For example, recalculation of the line spectral frequencies is often not performed in transcoding; however, the LPC prediction gain, LPC prediction error, autocorrelation function and reflection coefficients are often required in the classification and rate determination process.
From the above, it is seen that improved telecommunication techniques are desired.
According to the present invention, techniques for processing of telecommunication signals are provided. More particularly, the present invention relates to a method and apparatus for transcoding a bitstream encoded by a first voice coding format into a bitstream encoded by a second variable-rate voice coding format. Merely by way of example, the invention has been applied to variable-rate voice transcoding, but it would be recognized that the invention may also be applicable to other applications.
According to an aspect of the present invention, there is provided a voice transcoding apparatus comprising:
Numerous benefits are achieved using the present invention over conventional techniques. These benefits have been listed below:
To perform smart voice transcoding between variable-rate voice codecs;
To classify the destination codec frame type directly from the parameters of input source codec frames;
To determine the rate of the destination codec directly from the parameters of input source codec frames;
To improve voice quality through mapping parameters in the parameter space;
To reduce the computational complexity of the transcoding process;
To reduce the delay through the transcoding process;
To reduce the amount of memory required by the transcoding; and
To provide a generic transcoding architecture that may be adapted to current and future variable-rate codecs.
Depending upon the embodiment, one or more of these benefits may be achieved. These and other benefits are described throughout the present specification and more particularly below.
Other features and advantages of the present invention will be apparent from the following description taken in conjunction with the accompanying drawings, in which like reference characters designate the same or similar parts throughout the figures thereof.
The objectives, features, and advantages of the present invention, which are believed to be novel, are set forth in detail in the appended claims. The present invention, both as to its organization and manner of operation, together with further objectives and advantages, may best be understood by reference to the following description, in connection with the accompanying drawings.
A method and apparatus of the invention are discussed in detail below. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The cases of SMV and EVRC are used for the purpose of illustration and as examples. The methods described here are generic and apply to the transcoding between any pair of linear prediction-based voice codecs. A person skilled in the relevant art will recognize that other steps, configurations and arrangements can be used without departing from the spirit and scope of the present invention.
A block diagram of a tandem connection between two voice codecs is shown in
A diagram of the apparatus for transcoding between two variable bit-rate voice codecs of the present invention is shown in
First, the bitstream representing frames of data encoded according to the source voice codec is unpacked and unquantized by a bitstream unpacking module. The actual parameters extracted from the bitstream depend on the source codec and its bit rate, and may include line spectral frequencies, pitch delays, delta pitch delays, adaptive codebook gains, fixed codebook shapes, fixed codebook gains and frame energy. Particular voice codecs may also transmit information regarding spectral transition, interpolation factors and the switch predictor used, as well as other minor parameters. The unquantized parameters are passed to the intermediate parameters interpolation module.
The intermediate parameters interpolation module interpolates between different frame sizes, subframe sizes and sampling rates. This is required if there are differences in the frame size or subframe size of the source and destination codecs, in which case the transmission frequency of parameters may not be matched. Also, a difference in the sampling rate between the source codec and destination codec requires modification of parameters. The output interpolated parameters are passed to the smart frame classification and rate determination module and one of the mapping modules.
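The subframe-size interpolation described above can be illustrated with a minimal sketch. It assumes plain linear interpolation between subframe centres, which the text does not prescribe; the function name `resample_subframe_params` is hypothetical, and differences in frame size or sampling rate are not handled here.

```python
from typing import List


def resample_subframe_params(values: List[float], n_dst: int) -> List[float]:
    """Linearly interpolate per-subframe values onto n_dst subframes.

    Assumption (not from the specification): each value is treated as a
    sample at the centre of its subframe, and destination centres that fall
    outside the first or last source centre are clamped to the nearest value.
    """
    n_src = len(values)
    if n_src == 0 or n_dst <= 0:
        raise ValueError("need at least one source value and one destination subframe")
    if n_src == 1:
        return [values[0]] * n_dst

    out = []
    for j in range(n_dst):
        # Centre of destination subframe j, expressed on the source subframe grid.
        pos = (j + 0.5) * n_src / n_dst - 0.5
        pos = min(max(pos, 0.0), n_src - 1.0)
        lo = min(int(pos), n_src - 2)
        frac = pos - lo
        out.append((1.0 - frac) * values[lo] + frac * values[lo + 1])
    return out


# Example: three source-subframe gains mapped onto four destination subframes.
gains_dst = resample_subframe_params([1.0, 2.0, 3.0], 4)
```

Linear interpolation is a plausible choice for smoothly varying parameters such as gains; parameters such as codebook shapes would need a different treatment.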
The frame classification and rate determination module receives the unquantized interpolated parameters of the source codec and the external control commands of the destination codec, as shown in
The intermediate parameters interpolation module and the frame classification and rate determination module are linked to one of many parameter mapping modules by a switching module. The destination codec frame type and bit rate determined by the frame classification and rate determination module control which mapping module is to be chosen. Mapping modules may exist for each combination of bit-rate and frame class of the source codec to each bit rate and frame class of the destination codec.
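The switching step can be pictured as a lookup table keyed by the rate and class decision. The sketch below is illustrative only: the key layout, the two placeholder mapping functions and the names `MAPPING_MODULES` and `select_mapping_module` are assumptions, and the source frame class is left out of the key for brevity.

```python
from typing import Callable, Dict, Tuple

Params = Dict[str, object]
MappingFn = Callable[[Params], Params]
# Hypothetical key: (source rate, destination rate, destination frame class).
Key = Tuple[str, str, str]


def map_full_to_full_voiced(params: Params) -> Params:
    # Placeholder for a real mapping module; it only tags its output.
    return {**params, "module": "full->full/stationary_voiced"}


def map_eighth_to_eighth(params: Params) -> Params:
    # Placeholder: inactive frames carry only a frame energy in this sketch.
    return {"frame_energy": params["frame_energy"], "module": "eighth->eighth/silence"}


MAPPING_MODULES: Dict[Key, MappingFn] = {
    ("full", "full", "stationary_voiced"): map_full_to_full_voiced,
    ("eighth", "eighth", "silence"): map_eighth_to_eighth,
}


def select_mapping_module(src_rate: str, dst_rate: str, dst_class: str) -> MappingFn:
    """Act as the switching module: pick the mapping module for this frame."""
    key = (src_rate, dst_rate, dst_class)
    if key not in MAPPING_MODULES:
        raise LookupError(f"no mapping module registered for {key}")
    return MAPPING_MODULES[key]


module = select_mapping_module("eighth", "eighth", "silence")
mapped = module({"frame_energy": 0.25})
```

A table of this kind keeps the per-combination logic isolated, so a new rate or class pairing is added by registering one more entry.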
Each mapping module comprises a speech spectral parameter mapping unit, an excitation mapping unit, and a mapping strategy decision unit. The speech spectral parameter mapping unit maps the spectral parameters, usually line spectral pairs (LSPs) or line spectral frequencies (LSFs), of the source codec, directly to the spectral parameters of the destination codec. A calibration factor is calculated and used to calibrate the excitation to account for the differences in the quantised spectral parameters of the source and destination codec. The excitation mapping unit takes CELP excitation parameters including pitch lag, adaptive codebook gain, fixed codebook gain and fixed codebook codevectors from the interpolator and maps these to encoded CELP excitation parameters according to the destination codec.
Linked to the excitation mapping unit is a mapping strategy decision unit, which controls the type of excitation mapping to be used. Several mapping approaches may be used, including direct mapping from source codec to destination codec without any further analysis or iterations, analysis in the excitation domain, analysis in the filtered excitation domain, or a combination of these strategies, such as searching the adaptive codebook in the excitation space and the fixed codebook in the filtered excitation space. The mapping strategy decision unit determines which mapping strategy is to be applied. The decision may be based on available computational resources or minimum quality requirements and can change dynamically.
Except for the direct mapping strategy, in which parameters are directly mapped from source codec format to destination codec format without any analysis, the excitation signal is reconstructed. Reconstruction of the excitation during active speech requires the interpolated excitation parameters of pitch delays, adaptive codebook gains, fixed codebook shapes, and fixed codebook gains. During silence or noise, the parameters required are the signal energy, signal shape if available, and a random noise generator.
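As a rough illustration of the reconstruction step, the sketch below uses one common CELP formulation in which each excitation sample is the gain-scaled past excitation at the pitch lag plus the gain-scaled fixed codevector. It assumes an integer pitch lag and ignores fractional lags and interpolated delay contours; the function names and the definition of frame energy as a sum of squares are assumptions, not taken from the specification.

```python
import math
import random
from typing import List, Sequence


def reconstruct_subframe_excitation(
    past_exc: Sequence[float],
    pitch_lag: int,
    acb_gain: float,
    fcb_vector: Sequence[float],
    fcb_gain: float,
) -> List[float]:
    """Active speech: exc[n] = acb_gain * exc[n - pitch_lag] + fcb_gain * c[n]."""
    if pitch_lag < 1 or pitch_lag > len(past_exc):
        raise ValueError("pitch_lag must lie within the excitation history")
    history = list(past_exc)
    start = len(history)
    for n, c in enumerate(fcb_vector):
        # For lags shorter than the subframe this reuses samples just computed.
        adaptive = history[start + n - pitch_lag]
        history.append(acb_gain * adaptive + fcb_gain * c)
    return history[start:]


def reconstruct_noise_excitation(frame_energy: float, length: int, seed: int = 0) -> List[float]:
    """Silence or noise: scale random noise so its sum of squares equals frame_energy."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(length)]
    energy = sum(x * x for x in noise)
    scale = math.sqrt(frame_energy / energy) if energy > 0.0 else 0.0
    return [scale * x for x in noise]


exc = reconstruct_subframe_excitation([1.0, 0.0, 0.0, 0.0], 4, 0.5, [0.0, 1.0, 0.0, 0.0], 2.0)
```

The returned subframe would be appended to the excitation history before the next subframe is reconstructed.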
Current variable-rate voice codecs applicable to the present invention include EVRC and SMV which are based on the Relaxed CELP (RCELP) principle. Typical excitation quantization in RCELP codecs is performed by the technique shown in
Another mapping strategy is to perform both the adaptive codebook and fixed codebook searches in the excitation domain. A further mapping strategy is to perform both the adaptive codebook and fixed codebook searches in the filtered excitation domain. Alternatively, parameters may be directly mapped from source to destination codec format without any searching. It is noted that any combinations of the above strategies may also be used. The best strategy in terms of both high quality and low complexity will depend on the source and destination codecs and bit rates.
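One way to picture a dynamic strategy decision is a selection over a small table of strategies under a complexity budget and a minimum quality requirement. The cost and quality figures below are invented placeholders, not measurements, and the ordering they imply is an assumption; as noted above, the best strategy depends on the codecs and bit rates involved.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Strategy:
    name: str
    cost: float     # hypothetical relative complexity
    quality: float  # hypothetical relative quality score


# Placeholder figures for illustration only.
STRATEGIES = (
    Strategy("direct_mapping", 1.0, 1.0),
    Strategy("excitation_domain", 3.0, 2.0),
    Strategy("acb_excitation_fcb_filtered", 5.0, 3.0),
    Strategy("filtered_excitation_domain", 8.0, 4.0),
)


def choose_strategy(
    budget: float,
    min_quality: float = 0.0,
    strategies: Sequence[Strategy] = STRATEGIES,
) -> Strategy:
    """Pick the highest-quality strategy that fits the budget and quality floor."""
    candidates = [s for s in strategies if s.cost <= budget and s.quality >= min_quality]
    if not candidates:
        raise ValueError("no strategy satisfies the budget and quality constraints")
    return max(candidates, key=lambda s: s.quality)


# The decision can be re-evaluated per frame as the available budget changes.
chosen = choose_strategy(budget=4.0)
```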
A second-stage switching module links the interpolation and mapping module to the destination bitstream packing module. The destination bitstream packing module packs the destination CELP parameters in accordance with the destination codec standard. The parameters to be packed depend on the destination codec, the bit rate and frame type.
As an example, it is assumed that the source codec is the Enhanced Variable Rate Codec (EVRC) and the destination codec is the Selectable Mode Vocoder (SMV).
EVRC and SMV are both variable-rate codecs that determine the bit rate based on the characteristics of the input speech. These coders use Rate Set 1 of the Code Division Multiple Access communication standards IS-95 and cdma2000, which consists of the rates 8.55 kbit/s (Rate 1 or full-rate), 4.0 kbit/s (Rate ½ or half-rate), 2.0 kbit/s (Rate ¼ or quarter-rate) and 0.8 kbit/s (Rate ⅛ or eighth-rate). EVRC uses Rate 1, Rate ½, and Rate ⅛; it does not use quarter-rate. SMV uses all four rates and also operates in one of six network-controlled modes, Modes 0 to 5, which limit the bit rate during high traffic. Modes 4 and 5 are half-rate maximum modes. Depending on the mode of operation, different thresholds may be set to determine the rate usage percentages.
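The rate figures above translate directly into bits per frame and an average bit rate for a given rate usage. The sketch below assumes 20 ms frames, which is not stated in the text above; the dictionary and function names are illustrative, and the usage percentages in the example are made up.

```python
from typing import Dict

FRAME_DURATION_MS = 20  # assumption: 20 ms frames, not stated in the text

RATE_SET_1_KBPS: Dict[str, float] = {
    "full": 8.55,
    "half": 4.0,
    "quarter": 2.0,
    "eighth": 0.8,
}

EVRC_RATES = ("full", "half", "eighth")            # EVRC does not use quarter-rate
SMV_RATES = ("full", "half", "quarter", "eighth")  # SMV uses all four rates


def bits_per_frame(rate_name: str) -> int:
    """kbit/s multiplied by milliseconds gives bits; round to the nearest bit."""
    return round(RATE_SET_1_KBPS[rate_name] * FRAME_DURATION_MS)


def average_rate_kbps(usage: Dict[str, float]) -> float:
    """Average bit rate for rate-usage fractions that sum to 1."""
    if abs(sum(usage.values()) - 1.0) > 1e-9:
        raise ValueError("usage fractions must sum to 1")
    return sum(RATE_SET_1_KBPS[rate] * frac for rate, frac in usage.items())


# Made-up usage split: half of the frames at full rate, half at eighth rate.
avg = average_rate_kbps({"full": 0.5, "eighth": 0.5})
```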
A diagram of the apparatus for transcoding from EVRC to SMV is shown in
In transcoding from EVRC to SMV, the bitstream representing frames of data encoded according to EVRC is unpacked by a bitstream unpacking module. The actual parameters from the bitstream depend on the EVRC bit rate and include line spectral frequencies, spectral transition indicator, pitch delay, delta pitch delay, adaptive codebook gain, fixed codebook shapes, fixed codebook gains and frame energy. The unquantised parameters are passed to the intermediate parameters interpolation module.
The intermediate parameter interpolation module interpolates between the different subframe sizes of EVRC and SMV. EVRC has 3 subframes per frame, whereas SMV has 1, 2, 3, 4, or 10 subframes per frame depending on the bit rate and frame type. Depending on the parameter and coding strategy, subframe interpolation may or may not be required.
The frame classification and rate determination module receives the EVRC CELP parameters, the EVRC bit rate, the SMV network-controlled mode and any other SMV external commands. The frame classification and rate determination module produces a frame class and rate decision for SMV based on these inputs. The module comprises a classifier input parameter selector, which selects the EVRC parameters to be used as inputs to the classification task, M sub-classifiers, buffers to store past input parameters and past output values, and a final decision module. The sub-classifiers take as input the selected classification input parameters, the SMV network-controlled mode command, and past input and output values, and generate the frame class and rate decision. One sub-classifier may be used to determine the bit rate, and a second sub-classifier may be used to determine the frame class. The SMV frame class is either silence, noise-like, unvoiced, onset, non-stationary voiced or stationary voiced, and the SMV rate may be Rate 1, Rate ½, Rate ¼, or Rate ⅛. The SMV frame classification, using EVRC parameters, is performed according to a pre-defined configuration and classifier algorithm. The coefficients or rules of the classifier are determined during a prior EVRC-to-SMV classifier training or construction process. The final decision module enforces all SMV rate transition rules to ensure that illegal rate transitions are not allowed. For example, in SMV, a Rate 1 Type 1 frame cannot follow a Rate ⅛ frame. This frame classification and rate determination module replaces the SMV standard classifier, which requires a large amount of processing to derive the parameters and features required for classification. The SMV frame-processing functions are shown in
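The role of the final decision module can be sketched as a post-processing step on the sub-classifier outputs. The example below encodes only facts stated in this description: Modes 4 and 5 cap the rate at half-rate, the stationary voiced class uses Type 1 at Rates 1 and ½, and a Rate 1 Type 1 frame cannot follow a Rate ⅛ frame. How an illegal transition is resolved is not specified here, so demoting the frame class to non-stationary voiced is an assumption made purely for illustration, as are all function names.

```python
from typing import Optional, Tuple

HALF_RATE_MAX_MODES = (4, 5)


def subframe_type(rate: str, frame_class: str) -> Optional[int]:
    """Type 1 for stationary voiced frames at full or half rate, else Type 0."""
    if rate in ("full", "half"):
        return 1 if frame_class == "stationary_voiced" else 0
    return None  # quarter and eighth rate have a single processing algorithm


def final_decision(
    prev_rate: str, rate: str, frame_class: str, mode: int
) -> Tuple[str, str, Optional[int]]:
    """Apply the mode cap and the stated transition rule to a proposed decision."""
    if mode in HALF_RATE_MAX_MODES and rate == "full":
        rate = "half"
    if prev_rate == "eighth" and rate == "full" and subframe_type(rate, frame_class) == 1:
        # Assumed resolution (not from the specification): keep the rate but
        # fall back to Type 0 processing by demoting the frame class.
        frame_class = "non_stationary_voiced"
    return rate, frame_class, subframe_type(rate, frame_class)


decision = final_decision("eighth", "full", "stationary_voiced", mode=0)
```

A complete implementation would encode every transition rule of the SMV standard; only the single rule quoted above is represented here.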
The intermediate parameters interpolation module and the SMV smart frame classification and rate determination module are linked to one of many interpolation and mapping modules by a switching module. EVRC has a single processing algorithm for each rate, whereas SMV has two possible processing algorithms for each of Rate 1 and Rate ½, and a single processing algorithm for each of Rate ¼ and Rate ⅛. The SMV frame type and bit rate determined by the frame classification and rate determination module control which interpolation and mapping module is to be chosen. For Rates 1 and ½ of SMV, the stationary voiced frame class uses subframe processing Type 1 and all other frame classes use subframe processing Type 0. As shown in
For the EVRC-to-SMV transcoder, interpolation and mapping modules include:
Each mapping module comprises a speech spectral parameter mapping unit, an excitation mapping unit, and a mapping strategy decision unit. The speech spectral parameter mapping unit maps the EVRC line spectral frequencies directly to SMV line spectral frequencies. This occurs for all source EVRC bit rates. The parameters passed to the excitation mapping unit depend on the source EVRC bit rate. For EVRC Rates 1 and ½, the input CELP excitation parameters are the pitch lag, delta pitch lag (Rate 1 only), adaptive codebook gain, fixed codevectors, and fixed codebook gain. For EVRC Rate ⅛, typically inactive frames, the input excitation parameter is the frame energy. The excitation parameters are mapped to SMV excitation parameters, depending on the selected mapping module and mapping strategy. The mapping strategy decision module controls the mapping strategy to be used. In this example, the mapping strategy for active speech is to perform analysis in the excitation domain.
Using the EVRC excitation parameters of pitch delay, delta pitch delay, adaptive codebook gain, fixed codevectors, fixed codebook gains and frame energy, the excitation signal is reconstructed. To reduce complexity and quality degradations, the EVRC decoder operations of filtering the excitation signal by the synthesis filter to convert to the speech domain and post-filtering are not used. Similarly, the pre-processing operations of SMV are not used. These include silence enhancement, high-pass filtering, noise suppression and adaptive tilt filtering. Since the EVRC encoder contains noise-suppression operations, the transcoder does not include further noise-suppression functions.
In RCELP-based coders like EVRC and SMV, a fundamental part of the signal processing is in the modification of the speech to match an interpolated pitch track. This saves quantisation bits required for pitch representation, but involves a large amount of computation as pitch pulses must be detected and individually shifted or time-warped. For the EVRC-to-SMV transcoding example, the signal modification functions within the SMV encoder may be bypassed. This is due to the fact that similar signal modification has already been performed in the EVRC encoder. Hence the reconstructed excitation signal already possesses a smooth pitch characteristic and is already in a form amenable to efficient quantization. The target signal for the adaptive codebook search is thus the excitation signal, without pitch modifications, that has been calibrated to account for differences between the quantized EVRC LSFs and the quantized SMV LSFs.
Mapping of excitation parameters is performed as described in the previous section. Simplifications can be made to the fixed codebook search, as SMV contains multiple sub-codebooks for each rate and frame type. Since the EVRC bit rate, fixed codevector and fixed codebook structure are known, it may not be necessary to search all sub-codebooks to best match target excitation. Instead, each mapping module may contain a single fixed sub-codebook or a subset of the fixed sub-codebooks to reduce computational complexity.
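The restricted sub-codebook search can be sketched as follows. This is a generic least-squares codebook search in the excitation domain, not the SMV search procedure: for each candidate codevector c it maximizes (t·c)²/(c·c), which is equivalent to minimizing the squared error between the target t and the gain-scaled codevector. The tiny codebooks and all names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def search_fixed_codebook(
    target: Sequence[float],
    sub_codebooks: Dict[str, List[List[float]]],
    allowed: Sequence[str],
) -> Tuple[str, int, float]:
    """Search only the allowed sub-codebooks; return (name, index, gain)."""
    best = None
    for name in allowed:
        for idx, codevector in enumerate(sub_codebooks[name]):
            energy = sum(x * x for x in codevector)
            if energy == 0.0:
                continue  # an all-zero codevector cannot match any target
            corr = sum(t * x for t, x in zip(target, codevector))
            score = corr * corr / energy
            if best is None or score > best[0]:
                # corr / energy is the least-squares optimal gain for this codevector.
                best = (score, name, idx, corr / energy)
    if best is None:
        raise ValueError("no usable codevector in the allowed sub-codebooks")
    return best[1], best[2], best[3]


# Hypothetical sub-codebooks; a mapping module would pass only a subset in `allowed`.
codebooks = {
    "a": [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
    "b": [[0.0, 0.0, 1.0, 0.0]],
}
choice = search_fixed_codebook([0.0, 2.0, 0.0, 0.0], codebooks, allowed=["a"])
```

Restricting `allowed` to the sub-codebooks that are plausible for the known source rate and codebook structure is what reduces the search cost.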
A second-stage switching module links the interpolation and mapping module to the SMV bitstream packing module. The bitstream is packed according to the SMV frame type and bit rate. One SMV output frame is produced for each EVRC input frame.
The method and apparatus for voice transcoding between variable-rate coders described in this document are generic to all linear prediction-based voice codecs, and apply to any voice transcoder between the existing codecs G.723.1, GSM-AMR, EVRC, G.728, G.729, G.729A, QCELP, MPEG-4 CELP, SMV, AMR-WB, VMR and all other future voice codecs. The invention applies especially to those transcoders in which the destination coder makes use of rate determination and/or frame classification information.
The previous description of the preferred embodiment is provided to enable any person skilled in the art to make or use the present invention. The various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without the use of the inventive faculty. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.