The present invention relates generally to processing of telecommunication signals. More particularly, the invention provides a method and apparatus for classifying speech signals and determining a desired (e.g., efficient) transmission rate to code the speech signal with one encoding method when provided with the parameters of another encoding method. Merely by way of example, the invention has been applied to voice transcoding, but it would be recognized that the invention may also be applicable to other applications.
An important goal of speech coding development is to provide high quality output speech at a low average data rate. To achieve this, one approach adapts the transmission rate based on the network traffic. This is the approach adopted by the Adaptive Multi-Rate (AMR) codec used for Global System for Mobile (GSM) Communications. In AMR, one of eight data rates is selected by the network, and can be changed on a frame basis. Another approach is to employ a variable bit-rate scheme. Such a variable bit-rate scheme uses a transmission rate determined from the characteristics of the input speech signal. For example, when the signal is highly voiced, a high bit rate may be chosen, and when the signal is mostly silence or background noise, a low bit rate is chosen. This scheme often provides efficient allocation of the available bandwidth without sacrificing output voice quality. Such variable-rate coders include the TIA IS-127 Enhanced Variable Rate Codec (EVRC) and the 3rd Generation Partnership Project 2 (3GPP2) Selectable Mode Vocoder (SMV). These coders use Rate Set 1 of the Code Division Multiple Access (CDMA) communication standards IS-95 and cdma2000, which is made up of the rates 8.55 kbit/s (Rate 1 or full rate), 4.0 kbit/s (half rate), 2.0 kbit/s (quarter rate) and 0.8 kbit/s (eighth rate). SMV combines both adaptive rate approaches by selecting the bit rate based on the input speech characteristics as well as operating in one of six network-controlled modes, which limit the bit rate during high traffic. Depending on the mode of operation, different thresholds may be set to determine the rate usage percentages.
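The relationship between the Rate Set 1 bit rates and an average data rate can be illustrated with a short calculation. The following is a minimal sketch only: the 20 ms frame duration and the 40-bit quarter-rate frame size are assumptions (the latter follows from 2.0 kbit/s at 20 ms), and the function names and the usage percentages in the closing note are hypothetical.

```python
# Illustrative sketch only; names are hypothetical.
FRAME_DURATION_S = 0.020  # assumption: 20 ms speech frames

# Bits per frame for Rate Set 1. The quarter-rate value (40) is an
# assumption derived from 2.0 kbit/s at a 20 ms frame duration.
RATE_SET_1_BITS = {"full": 171, "half": 80, "quarter": 40, "eighth": 16}


def bit_rate_kbps(bits_per_frame, frame_duration_s=FRAME_DURATION_S):
    """Convert a frame size in bits to a bit rate in kbit/s."""
    return bits_per_frame / frame_duration_s / 1000.0


def average_rate_kbps(usage):
    """Average bit rate for a mapping of rate name -> fraction of frames."""
    total = sum(usage.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("usage fractions must sum to 1")
    return sum(bit_rate_kbps(RATE_SET_1_BITS[name]) * fraction
               for name, fraction in usage.items())
```

For instance, a hypothetical usage of 50% full rate and 50% eighth rate gives `average_rate_kbps({"full": 0.5, "eighth": 0.5})`, which is the mean of 8.55 and 0.8 kbit/s.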
To accurately decide the best transmission rate, and obtain high quality output speech at that rate, input speech frames are categorized into various classes. For example, in SMV, these classes include silence, unvoiced, onset, plosive, non-stationary voiced and stationary voiced speech. It is generally known that certain coding techniques are often better suited for certain classes of sounds. Also, certain types of sounds, for example, voice onsets or unvoiced-to-voiced transition regions, have higher perceptual significance and thus should require higher coding accuracy than other classes of sounds, such as unvoiced speech. Thus, the speech frame classification may be used, not only to decide the most efficient transmission rate, but also the best-suited coding algorithm.
Accurate classification of input speech frames is typically required to fully exploit the signal redundancies and perceptual importance. Typical frame classification techniques include voice activity detection, measuring the amount of noise in the signal, measuring the level of voicing, detecting speech onsets, and measuring the energy in a number of frequency bands. These measures would require the calculation of numerous parameters, such as maximum correlation values, line spectral frequencies, and frequency transformations.
While coders such as SMV achieve much better quality at a lower average data rate than existing speech codecs at similar bit rates, the frame classification and rate determination algorithms are generally complex. However, in the case of a tandem connection of two speech vocoders, many of the measurements needed to perform frame classification have already been calculated in the source codec. This can be capitalized on in a transcoding framework. In transcoding from the bitstream format of one Code Excited Linear Prediction (CELP) codec to the bitstream format of another CELP codec, rather than fully decoding to PCM and re-encoding the speech signal, smart interpolation methods may be applied directly in the CELP parameter space. Here, the term “smart” has the meaning commonly understood by one of ordinary skill in the art. Hence parameters such as pitch lag, pitch gain, fixed codebook gain, line spectral frequencies and the source codec bit rate are available to the destination codec. This allows frame classification and rate determination of the destination voice codec to be performed in a fast manner. Depending upon the application, many limitations can exist in one or more of the techniques described above.
Although there has been much improvement in techniques for voice transcoding, it would be desirable to have improved ways of processing telecommunication signals.
According to the present invention, techniques for processing of telecommunication signals are provided. More particularly, the invention provides a method and apparatus for classifying speech signals and determining a desired (e.g., efficient) transmission rate to code the speech signal with one encoding method when provided with the parameters of another encoding method. Merely by way of example, the invention has been applied to voice transcoding, but it would be recognized that the invention may also be applicable to other applications.
In a specific embodiment, the present invention provides a method and apparatus for frame classification and rate determination in voice transcoders. The apparatus includes a source bitstream unpacker that unpacks the bitstream from the source codec to provide the codec parameters, a parameter buffer that stores input and output parameters of previous frames and a frame classification and rate decision module (e.g., smart module) that uses the source codec parameters from the current frame and from previous frames to determine the frame class, rate and classification feature parameters for the destination codec. The source bitstream unpacker separates the bitstream code and unquantizes the sub-codes into the codec parameters. These codec parameters may include line spectral frequencies, pitch lag, pitch gains, fixed codebook gains, fixed codebook vectors, rate and frame energy, among other parameters. A subset of these parameters is selected by a parameter selector as inputs to the following frame classification and rate decision module. The frame classification and rate decision module comprises M sub-classifiers, buffers storing previous input and output parameters and a final decision module. The coefficients of the frame classification and rate decision module are pre-computed and pre-installed before operation of the system. The coefficients are obtained from previous training by a classifier construction module, which comprises a training set generation module, a learning module and an evaluation module. The final decision module takes the outputs of each sub-classifier, previous states, and external commands and determines the final frame class output, rate decision output and classification feature parameters output results. The classification feature parameters are used in some destination codecs for later encoding and processing of the speech.
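The arrangement of sub-classifiers, parameter buffer and final decision module described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the class and function names, the three toy sub-classifiers, their thresholds, the majority-vote combination, and the `max_rate` argument standing in for an external command are all hypothetical.

```python
from collections import Counter, deque

# Rates ordered from lowest to highest (labels are illustrative).
RATE_ORDER = ["eighth", "quarter", "half", "full"]


def final_decision(votes, max_rate=None):
    """Combine sub-classifier outputs by majority vote (an assumption),
    breaking ties toward the higher rate, then apply an external rate cap."""
    counts = Counter(votes)
    best = max(counts.values())
    rate = max((r for r, c in counts.items() if c == best),
               key=RATE_ORDER.index)
    if max_rate is not None and RATE_ORDER.index(rate) > RATE_ORDER.index(max_rate):
        rate = max_rate
    return rate


class FrameRateClassifier:
    """Runs M sub-classifiers and keeps a buffer of previous frames."""

    def __init__(self, sub_classifiers, history=3):
        self.sub_classifiers = list(sub_classifiers)
        self.buffer = deque(maxlen=history)  # previous (inputs, output) pairs

    def classify(self, params, max_rate=None):
        previous = list(self.buffer)
        votes = [clf(params, previous) for clf in self.sub_classifiers]
        rate = final_decision(votes, max_rate)
        self.buffer.append((params, rate))
        return rate


# Hypothetical sub-classifiers operating on unpacked source parameters.
def by_pitch_gain(params, previous):
    return "full" if params["pitch_gain"] > 0.7 else "half"


def by_energy(params, previous):
    return "eighth" if params["frame_energy"] < 1e-3 else "full"


def by_source_rate(params, previous):
    return params["source_rate"]
```

In this sketch the parameter dictionary plays the role of the output of the parameter selector; a real system would use pre-installed trained coefficients rather than hand-written thresholds.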
According to an alternative specific embodiment, the method includes deriving the speech parameters from the bitstream of the source codec, and determining the frame class, rate decision and classification feature parameters for the destination codec. This is done by providing the source codec's intermediate parameters and bit rate as inputs for the previously trained and constructed frame and rate classifier. The method also includes preparing training and testing data, training procedures and generating coefficients of the frame classification and rate decision module and pre-installing the trained coefficients into the system.
In yet an alternative specific embodiment, the invention provides a method for a classifier process derived using a training process. The training process comprises processing the input speech with the source codec to derive one or more source intermediate parameters from the source codec, processing the input speech with the destination codec to derive one or more destination intermediate parameters from the destination codec, and processing the source-coded speech, that is, speech that has been processed through the source codec, with the destination codec. The method also includes deriving a bit rate and a frame classification selection from the destination codec and correlating the source intermediate parameters from the source codec and the destination intermediate parameters from the destination codec. A step of processing the correlated source intermediate parameters and the destination intermediate parameters using a training process to build the classifier process is also included. The present method can use suitable commercial software or custom software for the classifier process. Merely as an example, such software can include, but is not limited to, Cubist, Rule Based Classification, by Rulequest, or alternatively custom software such as MuME Multi Modal Neural Computing Environment by Marwan Jabri.
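The pairing of source intermediate parameters with destination codec decisions can be sketched as follows. This is an illustration only: the two "codecs" are toy stand-ins that compute a frame energy, and every name, threshold and label here is a hypothetical placeholder rather than part of any real codec.

```python
def build_training_set(frames, source_encode, destination_encode):
    """Pair each frame's source parameters with the destination decisions."""
    training_set = []
    for frame in frames:
        source_params = source_encode(frame)            # classifier inputs
        rate, frame_class = destination_encode(frame)   # desired outputs
        training_set.append(
            {"inputs": source_params, "rate": rate, "frame_class": frame_class}
        )
    return training_set


def frame_energy(frame):
    """Mean squared sample value of one frame."""
    return sum(s * s for s in frame) / len(frame)


def toy_source_encode(frame):
    # Stand-in for the source codec: exposes two intermediate parameters.
    energy = frame_energy(frame)
    return {"frame_energy": energy,
            "source_rate_bits": 171 if energy > 0.01 else 16}


def toy_destination_encode(frame):
    # Stand-in for the destination codec's own rate and class decisions.
    if frame_energy(frame) > 0.01:
        return "Rate 1", "stationary voiced"
    return "Rate 1/8", "silence"
```

The resulting list of input and desired-output pairs is the kind of data a learning module could then consume.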
In alternative embodiments, the invention also provides a method for deriving each of the M sub-classifiers using an iterative training process. The method includes inputting to the classifier a training set of selected input speech parameters (e.g., pitch lag, line spectral frequencies, pitch gain, code gain, maximum pitch gain for the last 3 subframes, pitch lag of the previous frame, bit rate, bit rate of the previous frame, difference between the bit rates of the current and previous frames) and inputting to the classifier a training set of desired output parameters (e.g., frame class, bit rate, onset flag, noise-to-signal ratio, voice activity level, level of periodicity in the signal). The method also includes processing the selected input speech parameters to determine a predicted frame class and rate, and setting one or more classification model boundaries. The method also includes selecting a misclassification cost function and processing an error, based upon the misclassification cost function, between the predicted frame class and rate and the desired frame class and rate. Examples of the cost function and its associated criteria include a maximum number of iterations in the training process; a Least Mean Squared (LMS) error calculation, which is the sum of the squared differences between the desired output and the actual output; and a weighted error measure, in which classification errors are given a cost based on the extent of the error rather than all errors being treated as equal. For example, classifying a frame with a desired rate of Rate 1 (171 bits) as a Rate ⅛ (16 bits) frame can be given a higher cost than classifying it as a Rate ½ (80 bits) frame.
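The two error measures mentioned above can be sketched in a few lines. This is a minimal illustration, assuming that the weighted cost is simply proportional to the difference in frame size; that weighting, the 40-bit quarter-rate size, and all function names are assumptions made for the example.

```python
# Frame sizes in bits; 171, 80 and 16 come from the text above,
# 40 for the quarter rate is an assumption (2.0 kbit/s at 20 ms).
RATE_BITS = {"full": 171, "half": 80, "quarter": 40, "eighth": 16}


def lms_error(desired, actual):
    """Sum of squared differences between desired and actual outputs."""
    if len(desired) != len(actual):
        raise ValueError("sequences must have the same length")
    return sum((d - a) ** 2 for d, a in zip(desired, actual))


def weighted_cost(desired_rate, predicted_rate):
    """Cost that grows with the extent of the rate error (assumed to be
    the absolute difference in bits per frame)."""
    return abs(RATE_BITS[desired_rate] - RATE_BITS[predicted_rate])


def total_weighted_cost(desired_rates, predicted_rates):
    """Accumulate the weighted cost over a sequence of frames."""
    return sum(weighted_cost(d, p)
               for d, p in zip(desired_rates, predicted_rates))
```

With this weighting, mistaking a full-rate frame for an eighth-rate frame costs 155, while mistaking it for a half-rate frame costs 91, so the larger error is penalized more heavily.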
The method also includes repeating the setting of one or more classifier model boundaries based upon the error and the desired output parameters. Examples of such boundaries include, for a neural network classifier, the weights, the neuron structure (number of hidden layers, number of neurons in each layer, connections between the neurons), the learning rate, which indicates the relative size of the change in weights for each iteration, and the network algorithm (e.g., back propagation, conjugate gradient descent); and, for a decision tree classifier, the logical relationships, the decision boundary criteria (parameters used to define boundaries between classes, and boundary values) for each class, and the branch structure (maximum number of branches, maximum number of splits per branch, minimum cases covered by a branch).
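The roles of the learning rate, the iteration limit and the squared-error measure in an iterative training process can be illustrated with a single linear neuron trained by the delta (LMS) rule. This is a deliberately simplified stand-in, not back propagation or conjugate gradient descent, and the function name, default values and stopping tolerance are assumptions chosen for the example.

```python
def train_linear_neuron(samples, targets, learning_rate=0.1,
                        max_iterations=500, tolerance=1e-6):
    """Train one linear neuron with the delta (LMS) rule.

    Returns (weights, bias, error), where error is the sum of squared
    differences accumulated over the last pass through the training set.
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    error = float("inf")
    for _ in range(max_iterations):
        error = 0.0
        for x, target in zip(samples, targets):
            output = bias + sum(w * xi for w, xi in zip(weights, x))
            delta = target - output
            error += delta ** 2
            # The learning rate scales the size of each weight change.
            weights = [w + learning_rate * delta * xi
                       for w, xi in zip(weights, x)]
            bias += learning_rate * delta
        if error < tolerance:
            break  # stop early once the squared error is small enough
    return weights, bias, error
```

A larger learning rate takes bigger steps per iteration and may fail to settle, while a smaller one converges more slowly; the iteration limit bounds the training time in either case.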
A number of different classifier models and options are presented; however, the scope of this invention covers any classification techniques and learning methods.
Numerous benefits are achieved using the present invention over conventional techniques. For example, according to a specific embodiment, the present invention applies a smart frame and rate classifier in the transcoder between two voice codecs. According to other embodiments, the invention can also reduce the computational complexity of the frame classification and rate determination of the destination voice codec by exploiting the relationship between the parameters available from the source codec and the parameters typically required to perform frame classification and rate determination. Depending upon the embodiment, one or more of these benefits may be achieved. These and other benefits are described throughout the present specification and more particularly below.
Other features and advantages of the present invention will be apparent from the following description taken in conjunction with the accompanying drawings, in which like reference characters designate the same or similar parts throughout the figures thereof.
Certain objects, features, and advantages of the present invention, which are believed to be novel, are set forth with particularity in the appended claims. The present invention, both as to its organization and manner of operation, together with further objects and advantages, may best be understood by reference to the following description, taken in connection with the accompanying drawings.
A block diagram of a tandem connection between two voice codecs is shown in
The procedures for creating classifiers may vary and the following specific embodiments presented are examples for illustration. Other classifiers (and associated procedures) may also be used without deviating from the scope of the invention.
The coefficients of each classifier are pre-installed and are obtained previously by a classifier construction module, which comprises a training set generation module, a learning module and an evaluation module shown in
The resulting coefficients of the classifier are then pre-installed within the frame class and rate determination classifier.
Several embodiments for frame classifiers and rate classifiers are provided in the next section for illustration. Similar methods may be applied for training and construction of the frame class classifier. It is noted that each classifier may use a different classification method, that related features could be derived using additional classifiers, and that both rate and frame class may be determined using a single classifier structure. Further details of certain methods according to embodiments of the present invention are described throughout the present specification and more particularly below.
In order to show the embodiments of the present invention, an example of transcoding from a source codec EVRC bitstream to a destination codec SMV bitstream is shown.
According to the first embodiment, the Classifier 1 shown in
The procedure for training the neural network classifier is shown in
The resulting classifier coefficients are then pre-installed within the frame class and rate determination classifier. Other embodiments of the present invention may be found throughout the present specification and more particularly below.
According to another specific embodiment, a method is illustrated that may be similar to the previous embodiment except at least that the classification method used is a Decision Tree. A Decision Tree is a collection of ordered logical expressions that leads to a final category. An example of a decision tree classifier structure is illustrated in
Each criterion may take the form
For the rate determination classifier for SMV, the output classes are labeled Rate 1, Rate ½, Rate ¼ and Rate ⅛. Only one path through the decision tree is possible for each set of input parameters.
The size of the tree may be limited to suit implementation purposes.
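The single-path property of a decision tree can be sketched with a small nested structure. This is an illustration under stated assumptions: each criterion is assumed to compare one parameter against a threshold, and the tree contents, parameter names and threshold values are hypothetical rather than trained values. The rate labels are written in ASCII ("Rate 1/2" for Rate ½).

```python
# Hypothetical tree: each internal node tests "parameter <= threshold"
# (an assumed criterion form); leaves are rate labels.
RATE_TREE = {
    "parameter": "source_rate_bits", "threshold": 16,
    "le": "Rate 1/8",
    "gt": {
        "parameter": "pitch_gain", "threshold": 0.5,
        "le": {
            "parameter": "frame_energy", "threshold": 0.05,
            "le": "Rate 1/4",
            "gt": "Rate 1/2",
        },
        "gt": "Rate 1",
    },
}


def classify_rate(tree, params):
    """Follow exactly one path from the root to a leaf."""
    node = tree
    while isinstance(node, dict):
        branch = "le" if params[node["parameter"]] <= node["threshold"] else "gt"
        node = node[branch]
    return node


def tree_depth(tree):
    """Number of criteria on the longest path, useful for size limits."""
    if not isinstance(tree, dict):
        return 0
    return 1 + max(tree_depth(tree["le"]), tree_depth(tree["gt"]))
```

Because each node selects exactly one branch, every set of input parameters reaches exactly one leaf, and `tree_depth` gives a simple measure that an implementation could use to cap the size of the tree.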
The criteria of the decision tree can be obtained through similar training procedure as the embodiments shown in
An alternative embodiment will also be illustrated. Preferably, the present embodiment can be similar at least in part to the first and the second embodiments, except at least that the classification method used is a Rule-based Model classifier. Rule-based Model classifiers comprise a collection of unordered logical expressions, which lead to a final category or a continuous output value. The structure of a Rule-based Model classifier is illustrated in
Rule 1:
Each criterion may take the form
The continuous output variable may be compared to a set of predefined or adaptive thresholds to produce the final rate classification. For example,
The number of rules included may be limited to suit implementation purposes.
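A rule-based model that produces a continuous output and then compares it against thresholds can be sketched as follows. This is an illustration only: the rules, coefficients, thresholds and default value are hypothetical, and averaging the outputs of all matching rules is an assumption about how unordered rules are combined. The rate labels are written in ASCII ("Rate 1/2" for Rate ½).

```python
import operator

OPS = {">": operator.gt, "<=": operator.le}

# Hypothetical unordered rules: conditions plus a linear output model.
RULES = [
    {"conditions": [("pitch_gain", ">", 0.6)],
     "intercept": 0.6, "coefficients": {"pitch_gain": 0.4}},
    {"conditions": [("frame_energy", "<=", 0.01)],
     "intercept": 0.05, "coefficients": {}},
    {"conditions": [("source_rate_bits", ">", 16)],
     "intercept": 0.3, "coefficients": {"source_rate_bits": 0.002}},
]

# Hypothetical predefined thresholds, checked from highest to lowest.
THRESHOLDS = [(0.75, "Rate 1"), (0.45, "Rate 1/2"), (0.2, "Rate 1/4")]


def rule_output(params, rules=RULES, default=0.5):
    """Average the linear outputs of every rule whose conditions all hold."""
    outputs = []
    for rule in rules:
        if all(OPS[op](params[name], value)
               for name, op, value in rule["conditions"]):
            outputs.append(rule["intercept"] + sum(
                c * params[name] for name, c in rule["coefficients"].items()))
    return sum(outputs) / len(outputs) if outputs else default


def rate_from_output(value, thresholds=THRESHOLDS):
    """Map the continuous output to a rate class."""
    for threshold, label in thresholds:
        if value >= threshold:
            return label
    return "Rate 1/8"
```

Limiting the number of rules, as noted above, bounds the cost of `rule_output`, since every rule is evaluated for every frame.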
The invention of frame classification and rate determination described in this document is generic to all CELP based voice codecs, and applies to any voice transcoders between the existing codecs G.723.1, GSM-AMR, EVRC, G.728, G.729, G.729A, QCELP, MPEG-4 CELP, SMV, AMR-WB, VMR and any voice codecs that make use of frame classification and rate determination information.
The previous description of the preferred embodiment is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without the use of the inventive faculty. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein. For example, the functionality above may be combined or further separated, depending upon the embodiment. Certain features may also be added or removed. Additionally, the particular order of the features recited is not specifically required in certain embodiments, although it may be important in others. The sequence of processes can be carried out in computer code and/or hardware depending upon the embodiment. Of course, one of ordinary skill in the art would recognize many other variations, modifications, and alternatives.
Additionally, it is also understood that the examples and embodiments described herein are for illustrative purposes only and that various modifications or changes in light thereof will be suggested to persons skilled in the art and are to be included within the spirit and purview of this application and scope of the appended claims.