This invention relates in general to extending voice bandwidth and more particularly, to extending narrowband voice signals to wideband voice signals.
The use of portable electronic devices has increased dramatically in recent years. The primary purpose of cellular phones is voice communication. A cellular phone operates on voice signals by compressing the voice and sending the compressed voice signals over a communications network. The compression reduces the amount of data required to represent the voice signal and intentionally reduces the voice bandwidth. The voice bandwidth on a cellular phone is generally band limited to between 200 Hz and 4 KHz, whereas natural spoken voice resides within a bandwidth of approximately 20 Hz to 10 KHz. The voice band-limiting associated with the compression provides for more efficient transmission and reception of digital signals in a cellular communication system, reducing the amount of data and processing required to transmit and receive a voice signal over a cellular communication channel. Communication networks are allocated a fixed amount of spectrum within which they can transmit and receive voice data.
Voice is the composition of many frequency components spanning the natural voice bandwidth of 20 Hz to 10 KHz. As is known in the art, vocoders can compress voice. The compressed voice (i.e., vocoded voice) sufficiently preserves the original voice character and intelligibility even though it does not include all the frequency components of the original voice. Vocoding also introduces quantization effects which reduce the dynamic range of the voice and the overall voice quality. Moreover, vocoding can inherently remove the low frequency regions of voice as well as the high frequency regions. An analysis of vocoded voice reveals that the low frequency and high frequency components of speech are missing in comparison to the original voice signal that underwent the compression.
Compressing the voice bandwidth is a standard vocoding technique used in the voice communication industry to reduce the amount of data necessary to allow for efficient voice communication. However, the resulting bandwidth is less than the natural bandwidth of voice and results in inferior subjective audio quality and reduced intelligibility compared to wideband speech. Accordingly, wideband speech, having a bandwidth at least approximating the natural voice bandwidth, is desirable for enhanced audio quality.
Speech processing techniques such as Voice Bandwidth Extension have been tested and applied in an attempt to restore the missing low frequency and high frequency voice components. These techniques are generally applied to bandlimited speech that is non-vocoded; that is, certain frequency components are absent even though the voice has not been vocoded. Applied to such speech, Voice Bandwidth Extension can restore the frequency regions that are absent from the bandlimited voice in comparison to the original non-vocoded voice. Methods of Voice Bandwidth Extension include techniques which determine how the missing low frequency and high frequency components can be restored based on differences between the original non-vocoded voice signal and the bandlimited non-vocoded voice. However, applying Voice Bandwidth Extension to vocoded speech using mapping functions generated from non-vocoded voice can lead to artifacts and a reduction in perceived audio quality.
Embodiments of the invention are directed to a system and method for creating and using a wideband vocoder voice database. The wideband vocoder voice database can be employed in a bandwidth extension system for training mapping functions on wideband features of vocoded voice. The method can include filtering a wideband voice signal to produce a first filtered signal and a second filtered signal, vocoding the first filtered signal to produce a narrowband vocoded signal, adding the narrowband vocoded signal with the second filtered signal to produce a wideband vocoded signal, comparing wideband vocoded features of the wideband vocoded signal with wideband features of the wideband voice signal, and generating a mapping function based on one or more statistical differences between the wideband vocoded features and the wideband features. One or more features from the wideband vocoded signal can be extracted to create a wideband feature vector for storage in the wideband vocoded speech database. The method can also evaluate a speech quality difference between a narrowband vocoded voice signal and a wideband vocoded voice signal to determine an upper-bound voice quality based on the speech quality difference.
The features of the present invention, which are believed to be novel, are set forth with particularity in the appended claims. The invention, together with further objects and advantages thereof, may best be understood by reference to the following description, taken in conjunction with the accompanying drawings, in the several figures of which like reference numerals identify like elements, and in which:
While the specification concludes with claims defining the features of the invention that are regarded as novel, it is believed that the invention will be better understood from a consideration of the following description in conjunction with the drawing figures, in which like reference numerals are carried forward.
As required, detailed embodiments of the present invention are disclosed herein; however, it is to be understood that the disclosed embodiments are merely exemplary of the invention, which can be embodied in various forms. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a basis for the claims and as a representative basis for teaching one skilled in the art to variously employ the present invention in virtually any appropriately detailed structure. Further, the terms and phrases used herein are not intended to be limiting but rather to provide an understandable description of the invention.
The terms “a” or “an,” as used herein, are defined as one or more than one. The term “plurality,” as used herein, is defined as two or more than two. The term “another,” as used herein, is defined as at least a second or more. The terms “including” and/or “having,” as used herein, are defined as comprising (i.e., open language). The term “coupled,” as used herein, is defined as connected, although not necessarily directly, and not necessarily mechanically. The terms “program,” “software application,” and the like as used herein, are defined as a sequence of instructions designed for execution on a computer system. The term “suppress” can be defined as reducing or removing, either partially or completely. A program, computer program, or software application may include a subroutine, a function, a procedure, an object method, an object implementation, an executable application, an applet, a servlet, a source code, an object code, a shared library/dynamic load library and/or other sequence of instructions designed for execution on a computer system. The term “processor” can be defined as any number of suitable processors, controllers, units, or the like that carry out a pre-programmed or programmed set of instructions.
The term “narrowband signal” can be defined as a signal having a bandwidth corresponding to a telephone bandwidth of approximately 200 Hz to 4 KHz. The term “wideband signal” can be defined as a signal having a bandwidth that is greater than a narrowband signal. A “narrowband vocoded signal” can be defined as a vocoded signal having a bandwidth corresponding to a vocoder bandwidth of approximately 200 Hz to 4 KHz. A “wideband vocoded signal” can be defined as a narrowband vocoded signal that is artificially extended to include either, or both, low frequency components and high frequency components. The low frequency components and high frequency components may be vocoded or not vocoded. The term “wideband vocoded features” can be defined as features extracted from a wideband vocoded signal. The term “wideband features” can be defined as features extracted from a non-vocoded signal, such as PCM speech. The term “mapping function” can be defined as a mathematics-based hardware or software algorithm that translates a first feature set into a second feature set.
Embodiments of the invention concern a method of training voice bandwidth extension systems based on wideband feature mappings generated from a wideband vocoded database. The method can include comparing wideband vocoded features of a wideband vocoded signal with wideband features of a wideband voice signal, and generating a mapping function based on one or more statistical differences between the wideband vocoded features and the wideband features. The mapping function can describe changes to narrowband vocoded signals for extending a bandwidth of the narrowband vocoded signal to generate the wideband vocoded signal.
Embodiments of the invention also concern a system for extending the bandwidth of narrowband voice. The system can employ mapping functions derived from a pattern recognition training using the wideband vocoded voice database. The system can include a decoder for receiving a narrowband vocoded voice signal, and a processor for converting the narrowband vocoded voice signal to a wideband vocoded voice signal based on one or more mapping functions created during a training of a wideband vocoded voice database. The processor can map one or more narrowband features of the narrowband vocoded voice signal to one or more wideband features. In one arrangement, the processor can extend a set of narrowband reflection coefficients to a set of wideband reflection coefficients using one of the mapping functions for generating a wideband vocoded spectral envelope. The wideband vocoded spectral envelope can be combined with a wideband vocoded excitation signal to generate a wideband voice signal.
Referring to
As is known in the art, voice can undergo an encoding and decoding process, referred to as vocoding, that compresses the amount of data required to represent the voice. The decoding can be performed by the decoder 120. For example, an 8 KHz vocoder can reduce the storage of 16 KHz sampled voice by a factor of two. However, the encoding process reduces the voice bandwidth to achieve the higher compression, which results in a decoded signal 130 having half the bandwidth of the original voice. Accordingly, the BWE 140 can extend the bandwidth of voice beyond the bandwidth associated with the decoder 120 to restore the voice bandwidth to the range prior to vocoding. For example, the decoder 120 may have a maximum fixed sample rate of 8 KHz, which places a theoretical limit on the frequency range of the decoded voice at a bandwidth of 4 KHz. This follows from the Nyquist theorem, which states that the maximum reconstructed bandwidth is half of the sampling frequency. The BWE 140 can extend the band-limited voice up to 8 KHz as will be discussed ahead. The BWE 140 can restore the missing high and low frequencies of narrowband (NB) voice 130 by extrapolating features to derive wideband (WB) voice 150, which results in improved audio quality at the handset. The BWE 140, as applied to narrowband voice 130 at an output of the speech decoder 120, can enhance speech quality.
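As a minimal illustrative sketch (not part of the original disclosure; the helper name is hypothetical), the Nyquist relationship described above can be expressed as follows:

```python
# Sketch of the Nyquist relationship: the maximum reconstructable
# bandwidth is half the sampling frequency, so extending voice from a
# 4 KHz band to an 8 KHz band implies doubling the 8 KHz sample rate.

def max_bandwidth_hz(sample_rate_hz: float) -> float:
    """Nyquist limit: the highest frequency representable at a given rate."""
    return sample_rate_hz / 2.0

# A decoder operating at 8 KHz limits decoded voice to a 4 KHz band;
# a bandwidth-extension target of 8 KHz implies a 16 KHz sample rate.
print(max_bandwidth_hz(8_000))   # 4000.0
print(max_bandwidth_hz(16_000))  # 8000.0
```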
The decoder 120 and the BWE 140 can be implemented in a processor, such as one or more microprocessors, microcontrollers, digital signal processors (DSPs), combinations thereof or such other devices known to those having ordinary skill in the art, that is in communication with one or more associated memory devices, such as random access memory (RAM), dynamic random access memory (DRAM), and/or read only memory (ROM) or equivalents thereof, that store data and programs that may be executed by the processor. The system 100 can be included in a communication device such as a cell phone, a handset, a radio, a personal digital assistant, a portable media player and the like.
The system 100 can include a communications module (not shown) for communicating with one or more communication networks, such as a WLAN network or a cellular network including, but not limited to, GSM, CDMA, iDEN, OFDM, WiDEN, and the like. In practice, the system 100 can provide wireless connectivity over a radio frequency (RF) communication network or a Wireless Local Area Network (WLAN). Communication within the system 100 can be established using a wireless, copper wire, and/or fiber optic connection using any suitable protocol (e.g., TCP/IP, HTTP, etc.). The system 100 can also connect to the Internet over a WLAN. WLANs provide wireless access within a local geographical area. In typical WLAN implementations, the physical layer uses a variety of technologies such as 802.11b or 802.11g. The physical layer may use infrared, frequency hopping spread spectrum in the 2.4 GHz band, or direct sequence spread spectrum in the 2.4 GHz band.
Referring to
In particular, the spectral envelope extension module 145 can apply mapping functions for converting one or more LPC features of the NB voice signal 130 to LPC features of the WB voice signal 147. The mapping function can translate features of the NB spectral envelope to corresponding features of the WB spectral envelope. The LPC features can include, but are not limited to, reflection coefficients, cepstral coefficients, and Mel cepstral coefficients. Various feature sets suitable for applying mapping functions can be derived from the LPC features. The mapping functions can be generated during a training phase which associates changes in the features of a NB voice signal with changes in features of a corresponding WB voice signal.
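As one hedged example of extracting the reflection-coefficient features named above (an illustrative sketch, not the disclosed implementation; the frame length, model order, and function name are assumptions), the Levinson-Durbin recursion yields reflection (PARCOR) coefficients from a frame's autocorrelation:

```python
import numpy as np

def reflection_coefficients(frame: np.ndarray, order: int) -> np.ndarray:
    """Levinson-Durbin recursion returning the reflection (PARCOR)
    coefficients of a speech frame -- one candidate NB LPC feature set."""
    n = len(frame)
    # autocorrelation r[0..order] of the frame
    r = np.array([frame[: n - m] @ frame[m:] for m in range(order + 1)])
    a = np.zeros(order + 1)       # predictor polynomial coefficients
    a[0] = 1.0
    k = np.zeros(order)           # reflection coefficients
    err = r[0]                    # prediction error energy
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[1:i][::-1]
        ki = -acc / err
        k[i - 1] = ki
        prev = a.copy()
        a[i] = ki
        for j in range(1, i):     # update predictor coefficients
            a[j] = prev[j] + ki * prev[i - j]
        err *= 1.0 - ki * ki      # shrink the residual energy
    return k

rng = np.random.default_rng(0)
frame = rng.standard_normal(240)              # stand-in for a 30 ms frame at 8 KHz
k = reflection_coefficients(frame, order=10)  # |k_i| < 1 for a stable model
```

A mapping function of the kind described in the text would then translate such a NB coefficient vector into its WB counterpart.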
For example, referring to
GMMs can be useful in statistical modeling applications in which information representing general characteristics or trends must be extracted from a large amount of data. Mapping functions such as GMMs are useful in gaining statistical insight into large quantities of data and for applying the statistical information. It should be noted that Gaussian Mixture Models (GMM) are merely one example of a mapping function. Those of skill in the art will appreciate that there are different ways to implement mapping functions, such as Vector Quantization or Hidden Markov Models.
During training, the GMM 222 learns an optimal transformation, known as a mapping, which can be applied to a NB voice signal to convert it to a WB voice signal in accordance with the statistical information provided by the GMM 222 based on the learning. It should be noted that the GMM 222 provides statistical modeling capabilities based on the learning during training. For example, in practice, the GMM 222 can be presented off-line with input and output training data to learn statistics associated with the input-to-output data transformations of the NB features and WB features. In one arrangement, the GMM 222 can employ an Expectation-Maximization (EM) algorithm to learn the mapping between the NB features (143) and WB features (147).
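A hedged sketch of such a GMM-based feature mapping follows. It is illustrative only: the feature dimensions, stand-in training data, and helper names are hypothetical, and scikit-learn's `GaussianMixture` (which runs EM inside `fit`) stands in for the off-line training the passage describes.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

# Hypothetical stand-in training data: rows are joint [NB | WB] feature vectors.
rng = np.random.default_rng(1)
W = rng.standard_normal((4, 6))                          # hidden NB->WB relation
nb = rng.standard_normal((500, 4))                       # stand-in NB features
wb = 0.5 * nb @ W + 0.1 * rng.standard_normal((500, 6))  # correlated WB features
joint = np.hstack([nb, wb])

# The EM algorithm runs inside fit(), learning joint NB/WB feature statistics.
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(joint)

def map_nb_to_wb(x_nb: np.ndarray, gmm: GaussianMixture, d_nb: int = 4) -> np.ndarray:
    """Conditional-mean mapping E[WB | NB] under the trained joint GMM."""
    out = np.zeros(gmm.means_.shape[1] - d_nb)
    # responsibility of each mixture component for the NB observation
    resp = np.array([
        w * multivariate_normal.pdf(x_nb, m[:d_nb], C[:d_nb, :d_nb])
        for w, m, C in zip(gmm.weights_, gmm.means_, gmm.covariances_)
    ])
    resp /= resp.sum()
    for r, m, C in zip(resp, gmm.means_, gmm.covariances_):
        # per-component Gaussian conditional mean of WB given NB
        cond = m[d_nb:] + C[d_nb:, :d_nb] @ np.linalg.solve(C[:d_nb, :d_nb], x_nb - m[:d_nb])
        out += r * cond
    return out
```

Applied at runtime, `map_nb_to_wb` plays the role of the trained mapping: a NB feature vector in, an estimated WB feature vector out.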
Referring to
Understandably, bandwidth extension is based on the assumption that the NB speech correlates closely with the WB voice signal. To ensure an accurate feature mapping, the voice signals used in training should be reflective of the voice signals encountered during deployment. The quality of the voice used during the training has a significant impact on the quality of the bandwidth extension: good quality bandwidth extension of speech is possible when the feature mappings are an accurate representation of the voice signal undergoing the bandwidth extension. As an example, feature mappings can be generated for non-vocoded NB speech and non-vocoded wideband speech. The feature mappings are accurate when applied to non-vocoded NB speech. However, applying the non-vocoded mappings to vocoded NB speech can result in anomalies which can deteriorate speech quality. Accordingly, the same type of speech (vocoded or non-vocoded) should be used during training as during deployment. This includes using vocoded speech for training the GMM 222 when extending the bandwidth of NB vocoded speech.
However, in the case of vocoded speech, WB vocoded speech is not generally available. For example, referring back to
Understandably, one aspect of the invention is directed to creating a WB vocoded voice database 220 from NB vocoded speech. That is, WB vocoded voice signals are artificially created from NB vocoded speech to provide WB vocoded voice signals for training the GMMs (222) and creating mapping functions. For example, referring to
The system 300 can include a filter 301 for filtering the wideband voice signal to produce a first filtered signal 306 and a second filtered signal 331 corresponding to step 402. The system can include a vocoder 308 for vocoding the first filtered signal to produce a narrowband vocoded signal 130 corresponding to step 404. The vocoder can be at least one of a VSELP, AMBE, AMD, and CELP type vocoder. The system 300 can include a compensator 326 for time aligning the second filtered signal 331 with the narrowband vocoded signal 130 corresponding to step 406. The system can include a combiner 335 for adding the narrowband vocoded signal 130 with the compensated second filtered signal 340 to produce a wideband vocoded signal 150 for storage in the wideband vocoded speech database, corresponding to step 408. Alternatively, one or more features of the wideband vocoded signal 150 can be extracted to create a wideband feature vector for storage in the wideband vocoded speech database, as shown at step 410.
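The four steps just described (402 through 408) can be sketched end-to-end as follows. This is an illustrative sketch under stated assumptions: the band-split cutoff, filter orders, and the pass-through `fake_vocoder` stand-in are hypothetical, since a real VSELP/CELP codec is beyond a short example.

```python
import numpy as np
from scipy.signal import butter, sosfilt, resample_poly

FS_WB, FS_NB = 16_000, 8_000   # assumed wideband/narrowband sample rates

def split_bands(wb: np.ndarray, cutoff: float = 3_400.0):
    """Step 402: split wideband voice into a low band (to be vocoded)
    and a complementary high band."""
    lo = sosfilt(butter(8, cutoff, btype="low", fs=FS_WB, output="sos"), wb)
    hi = sosfilt(butter(8, cutoff, btype="high", fs=FS_WB, output="sos"), wb)
    return lo, hi

def fake_vocoder(x: np.ndarray) -> np.ndarray:
    """Step 404 stand-in: a real implementation would encode and decode."""
    return x.copy()

def build_wb_vocoded(wb: np.ndarray) -> np.ndarray:
    low, high = split_bands(wb)
    nb = resample_poly(low, 1, 2)          # down to 8 KHz for the vocoder
    nb = fake_vocoder(nb)
    nb = resample_poly(nb, 2, 1)           # back up to 16 KHz
    # Step 406: time alignment (the stand-in vocoder introduces no delay)
    n = min(len(nb), len(high))
    return nb[:n] + high[:n]               # Step 408: recombine the bands
```

The recombined output corresponds to the wideband vocoded signal stored, directly or as extracted feature vectors, in the wideband vocoded speech database.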
Upon creation of the WB vocoded voice database, training can take place. For example, referring back to
Referring to
A down sampler 307 is also included to lower the sampling rate of the banded signal 306. For example, the WB speech sampled at 16 KHz can be down-sampled by a factor of 2 for providing a sampling frequency of 8 KHz. Understandably, the vocoder 308 input specifications may require 8 KHz speech having a bandwidth of 300 Hz to 3.4 KHz. Various vocoders can have different input specifications which allow for different sampling and bandwidth requirements, which are herein contemplated. Aspects of the invention are not limited to the specifications provided, which are presented merely as examples. The bandwidth and sampling rate may vary for different vocoders. For example, the bandwidth may extend from 140 Hz to 3.8 KHz. The down-sampled and bandlimited WB speech can be processed by the vocoder 308 to produce NB vocoded voice 314. The vocoder 308 can include an encoder section 310 and a decoder section. Understandably, the vocoder 308 compresses and quantizes the speech, which can reduce data transmission requirements for a compromise in speech quality. The up-sampler 316 can resample the NB vocoded voice 314 to the WB sample rate. For example, the NB vocoded voice 314 having a sampling rate of 8 KHz can be up-sampled to 16 KHz. The LPF 318 can be applied to the up-sampled NB vocoded voice to suppress aliased frequency components resulting from the up-sampling. For example, the NB vocoded voice can be bandlimited to 8 KHz having an effective sampling rate of 16 KHz.
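The down-sampling, up-sampling, and image-suppressing low-pass filtering described above can be sketched as follows (rates and the test-tone frequency are illustrative; `scipy.signal.resample_poly` applies the required anti-alias and anti-image filtering internally):

```python
import numpy as np
from scipy.signal import resample_poly

fs_wb = 16_000
t = np.arange(fs_wb) / fs_wb             # one second of samples at 16 KHz
tone = np.sin(2 * np.pi * 1_000 * t)     # 1 KHz tone, inside the NB range

down = resample_poly(tone, 1, 2)  # 16 KHz -> 8 KHz (anti-alias LPF applied)
up = resample_poly(down, 2, 1)    # 8 KHz -> 16 KHz (image-suppressing LPF,
                                  # playing the role of LPF 318)
```

Because the tone lies well inside the narrowband range, the round trip through the lower sample rate is nearly lossless apart from filter edge effects.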
Briefly referring to
One aspect of bandwidth extension is to restore the missing frequency components. For example, the output WB vocoded voice signal 250 (See
Returning to the filter 301 of
The compensator 322 time aligns the second filtered signal 331 with the NB vocoded voice signal 130. Understandably, the vocoder 308 can introduce delays in processing which result in misalignment between the second filtered signal 331 with the NB vocoded voice signal 130. The compensator 322 can estimate a delay 330 between the second filtered signal 331 with the NB vocoded voice signal 130, and time-shift 333 the second filtered signal 331 to be coincident with the NB vocoded voice signal 130. The adder 335 can add the delayed second filtered signal 340 with the NB vocoded voice signal 130 to produce a wideband vocoded output signal 250. Notably, only the speech within the vocoder bandwidth is vocoded. For example, referring to
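The delay estimation performed by the compensator can be sketched with a simple cross-correlation search (an illustrative method only; the disclosure does not specify how the delay 330 is estimated, and the signal lengths and delay value here are hypothetical):

```python
import numpy as np

def estimate_delay(ref: np.ndarray, delayed: np.ndarray) -> int:
    """Estimate the lag of `delayed` relative to `ref` via cross-correlation,
    as a compensator might before time-shifting the second filtered signal."""
    xc = np.correlate(delayed, ref, mode="full")
    return int(np.argmax(xc)) - (len(ref) - 1)

rng = np.random.default_rng(2)
x = rng.standard_normal(2_000)                 # stand-in second filtered signal
d = np.concatenate([np.zeros(37), x])[:2_000]  # simulate a 37-sample vocoder delay
lag = estimate_delay(x, d)                     # recovers the 37-sample delay
```

The estimated lag would then drive the time shift that aligns the second filtered signal with the NB vocoded voice signal before the adder combines them.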
Referring back to
Furthermore, while a specific example of feature mapping and GMM training has been described, many such training mechanisms may be employed, and may depend on several factors in the design of the respective system, including vocoder types, bandwidth requirements, sample rates, and vocoder configurations. While the preferred embodiments of the invention have been illustrated and described for creating a wideband vocoder database suitable for training of bandwidth extension systems, it will be clear that the invention is not so limited. Numerous modifications, changes, variations, substitutions and equivalents will occur to those skilled in the art without departing from the spirit and scope of the present invention as defined by the appended claims.