1. Field of the Invention
The present invention relates generally to signal processing. More particularly, the present invention relates to speech signal processing.
2. Background Art
The VoIP (Voice over Internet Protocol) network is evolving to deliver better speech quality to end users by promoting and deploying wideband speech technology, which increases voice bandwidth by doubling the sampling frequency from 8 kHz to 16 kHz. This new sampling rate adds a new high band extending up to 7.5 kHz (8 kHz theoretical) and extends the low-frequency region of speech down to 50 Hz. The result is an enhancement of speech naturalness, differentiation, nuance, and, finally, comfort. In other words, wideband speech allows certain sounds to be heard more accurately, e.g. the fricative “s” and the plosive “p”.
The main applications being targeted to take advantage of this new technology are voice calls and conferencing, and multimedia audio services. Wideband speech technology aims to reach higher voice quality than legacy Carrier Class voice services based on narrowband speech, which has a sampling frequency of 8 kHz and a frequency range of 200 Hz to 3400 Hz (4 kHz theoretical). Whereas the legacy narrowband phone terminals prioritized the understandability of speech, the new wideband phone terminals will improve speech comfort. Wideband speech technology is also known in the art as “High Definition Voice” (HD Voice).
However, before wideband speech can be fully deployed in the infrastructure, i.e. in the network and terminals, an intermediate period of narrowband/wideband co-existence will have to take place. Experts estimate that the transition period from narrowband to wideband may take as long as several years because of the slow pace of upgrading infrastructure equipment to support wideband speech. In order to improve speech quality during this intermediate period, or in systems where narrowband and wideband speech co-exist, some signal processing researchers have proposed several models, which are mostly based on an extension mode of the CELP speech coding algorithm. Unfortunately, the proposed models consume high processing power while providing only a limited performance improvement.
Accordingly, there is a need in the art to address the intermediate period of narrowband/wideband co-existence, and to further improve speech quality for systems, where narrowband and wideband speech co-exist, in an efficient manner.
There are provided systems and methods for speech bandwidth extension, substantially as shown in and/or described in connection with at least one of the figures, as set forth more completely in the claims.
The features and advantages of the present invention will become more readily apparent to those ordinarily skilled in the art after reviewing the following detailed description and accompanying drawings, wherein:
The present application is directed to a system and method for speech bandwidth extension. The following description contains specific information pertaining to the implementation of the present invention. One skilled in the art will recognize that the present invention may be implemented in a manner different from that specifically discussed in the present application. Moreover, some of the specific details of the invention are not discussed in order not to obscure the invention. The specific details not described in the present application are within the knowledge of a person of ordinary skill in the art. The drawings in the present application and their accompanying detailed description are directed to merely exemplary embodiments of the invention. To maintain brevity, other embodiments of the invention, which use the principles of the present invention, are not specifically described in the present application and are not specifically illustrated by the present drawings.
Various embodiments of the present invention aim to deliver speech signal processing systems and methods for VoIP gateways as well as wideband phone terminals in order to enhance the speech emitted by the legacy narrowband phone terminals up to a wideband speech signal, so as to improve wideband voice quality for new wideband phone terminals. The new and novel speech signal processing algorithms of various embodiments of the present invention may be called “Speech Bandwidth Extension” (which may use acronyms: SBE or BWE). In various embodiments of the present invention the narrow bandwidth speech is extended in high and low frequencies close to the original natural wideband speech. As a result, wideband phone terminals according to the present invention would receive a speech quality for a narrowband speech signal that a regular wideband phone terminal would receive for a wideband speech signal.
For ease of discussion, speech bandwidth extension system 400 is depicted and described in four main elements or steps. The four elements or steps are: (1) a pre-processing (410) element or step for locating the frequencies at which the signal is cut off at the low and high ends; (2) a signal classifier (420) element or step for optimized extension, so as to distinguish noise/unvoiced, voice and music, in one embodiment of the present invention; (3) an optimized adaptive signal extension (430) element or step for low and high frequencies; and (4) a short and long term post processing (440) element or step for final quality assurance, such as a smooth merger with narrowband signals, equalization and gain adaptation.
Turning to the pre-processing (410) element or step, one embodiment includes a low pass filter covering [0, 300] Hz that can detect the presence or absence of low frequency speech signals, and a high pass filter above 3200 Hz that can detect the presence or absence of high frequencies. Detection or location of the frequencies at which the narrowband signal is cut off at the low and high ends can be used for further processing at the short and long term post processing (440) element or step, as explained below, for joining or connecting extended bandwidth signals at low and high frequencies to the existing narrowband signals. For example, at low frequencies, it may be determined where the signal is attenuated between 0-300 Hz, and at high frequencies, it may be determined where the frequency cut off occurs between 3,200-4,000 Hz.
Regarding signal classifier (420) element or step, as explained above, in one embodiment, an enhanced voice activity detector (VAD) may be used to discriminate between noise, voice and music. In other embodiments, a regular VAD can be used to discriminate between noise and voice. The VAD may also be enhanced to use energy, zero crossing and tilt of spectrum to measure flatness of spectrum, to further provide for a smoother switching such that voice does not cut off suddenly for transition to noise, e.g. overhang period for voice may be extended.
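As a rough illustration of the classifier described above, the sketch below labels frames using two of the named features, energy and zero crossing rate, and holds the voice decision for a few extra frames as an overhang. The thresholds, the overhang length and the function names are hypothetical; spectral tilt and music discrimination are left out for brevity.

```python
import math

def frame_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def classify_frames(frames, energy_thresh=1e-3, zcr_thresh=0.3, overhang=3):
    """Label each frame 'voice' or 'noise'.

    A frame is active when its energy is high and its zero crossing rate is low.
    After the last active frame, 'voice' is held for `overhang` further frames
    so that speech does not cut off abruptly.
    """
    labels = []
    hold = 0
    for frame in frames:
        active = (frame_energy(frame) > energy_thresh
                  and zero_crossing_rate(frame) < zcr_thresh)
        if active:
            hold = overhang
            labels.append("voice")
        elif hold > 0:
            hold -= 1
            labels.append("voice")
        else:
            labels.append("noise")
    return labels

# Usage: one voiced-like frame (200 Hz tone) surrounded by silence.
silent = [0.0] * 160
voiced = [0.5 * math.sin(2 * math.pi * 200 * i / 8000) for i in range(160)]
labels = classify_frames([silent, voiced, silent, silent, silent], overhang=2)
```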
Now, optimized adaptive signal extension (430) element or step can be divided into a high frequencies extension element or step and a low frequencies extension element.
As for the high frequencies extension element or step, the theoretical basis of the signal processing is as follows. In an embodiment of the present invention, speech bandwidth extension in the high frequencies exploits non-linear signal components mapped into the frequency domain. To simplify notation, the linear 16-bit sampled signal “x(n) for n=0 . . . N” is designated by “x”:
$\forall n \in [0, N],\quad x(n) \approx x$
The signal “x”, which designates the narrowband signal, is mapped into the interval [−1, 1], so that its absolute value lies in [0, 1], i.e. $|x| \le 1$. It is then transformed by a function f(x) whose values also lie in [−1, 1].
According to Taylor's series, f(x) can then be developed into a linear combination of powers of x by its limited development:
Taking advantage of the linearity of the Fourier transform, it follows:
in which the $F(e^{jn\theta})$ terms bring the new frequencies, and especially the high frequencies, needed for the speech bandwidth extension.
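The argument can be made concrete with a short worked example. The coefficients $a_k$ and the truncation order $K$ are notation assumed here for illustration and are not taken from the original equations; the input is taken to be a single tone $x(n) = \cos(n\theta)$.

```latex
% Truncated power-series form of f and linearity of the Fourier transform
f(x) \approx \sum_{k=0}^{K} a_k x^k
\quad\Longrightarrow\quad
\mathcal{F}\{f(x)\} \approx \sum_{k=0}^{K} a_k \, \mathcal{F}\{x^k\}

% Powers of a single tone x(n) = \cos(n\theta) contain multiples of \theta
x^2 = \tfrac{1}{2}\bigl(1 + \cos(2n\theta)\bigr), \qquad
x^3 = \tfrac{1}{4}\bigl(3\cos(n\theta) + \cos(3n\theta)\bigr)
```

Each power $x^k$ therefore contributes components at up to $k$ times the original frequency, which is how a non-linear function applied to a narrowband signal can populate the band above the original cut off.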
The choice of the function “f(x)” applied to the signal is also important. For voiced frames or voiced speech segments, in one embodiment of the present invention, a sigmoid function is applied:
for which the theoretical shape is shown in
At this point, for example, a centered sigmoid with an exponential scaling of a=10 is applied:
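The following is a minimal sketch of how such a sigmoid generates new frequencies. The exact formula is an assumption: a centered, odd-symmetric logistic sigmoid, 2/(1 + exp(−a·x)) − 1, is used with a=10. The 16 kHz sampling rate, the 1.5 kHz test tone and the helper names are likewise chosen only for illustration.

```python
import math

def centered_sigmoid(x, a=10.0):
    """Assumed form: odd-symmetric logistic sigmoid mapping [-1, 1] into (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-a * x)) - 1.0

def dft_magnitude(signal, k):
    """Magnitude of DFT bin k, normalised by the signal length."""
    n = len(signal)
    re = sum(s * math.cos(2 * math.pi * k * i / n) for i, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * k * i / n) for i, s in enumerate(signal))
    return math.hypot(re, im) / n

# 320 samples at 16 kHz give 50 Hz per bin: bin 30 is 1.5 kHz, bin 90 is 4.5 kHz.
n = 320
tone = [0.5 * math.sin(2 * math.pi * 30 * i / n) for i in range(n)]
shaped = [centered_sigmoid(s) for s in tone]

third_before = dft_magnitude(tone, 90)    # essentially zero: a pure tone has no harmonics
third_after = dft_magnitude(shaped, 90)   # third harmonic created by the non-linearity
```

In this sketch the 1.5 kHz tone lies inside the narrowband range, while its third harmonic at 4.5 kHz lands above the 4 kHz narrowband limit, in the region the extension is meant to fill.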
In order to provide a significant amount of new frequencies regardless of the input signal amplitude (small values fall into the limited non-linear part of the sigmoid, whereas high values should avoid falling into the higher non-linear part), an embodiment of the present invention utilizes the instantaneous gain provided by an Automatic Gain Control (AGC) to dynamically scale the sigmoid and obtain optimal harmonics generation, as depicted in
In one embodiment of the present invention, for unvoiced frames or unvoiced speech segments, a function different from the one used for voiced speech segments is applied, namely the following function:
$p_0 \approx 0,\quad 1 < p_1 < 2,\quad p_{i>1} \ll p_1$
$f_{\mathrm{poly}}(x) = x$
Next, the two transformed results may finally be adaptively mixed, with a programmable balance between the two components, in order to avoid phase discontinuity (artifact) and to deliver a smooth extended speech signal:
$F_{\mathrm{Final}}(x) = q(v) \times f_{\mathrm{sigmoid}}(x) + (1 - q(v)) \times f_{xp}(x)$
The adaptive balance may be defined by:
$q(v) \in [0, 1]$
The coefficient “v” determines the mixture as a function of the voiced profile of the speech signal, obtained from the VAD by combining energy, zero crossing and tilt measurements:
$q(v(\text{E-VAD}, t)) \in [0, 1]$
In one embodiment, for a voiced speech segment, q(v) of 50% may be chosen for an equal contribution from the sigmoid and polynomial functions, and for an unvoiced speech segment (also called fricative), q(v) of 10% may be chosen to afford a greater contribution from the polynomial function. Of course, the values of 50% and 10% are exemplary. Also, a time parameter ‘t’ can be used to smooth the transition between the two previous states.
It should also be noted that, in at least one embodiment in which the VAD detects a music signal, a function different from those used for voiced and unvoiced speech signals will be used to improve the music quality.
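The mixing rule can be sketched as follows, using the exemplary balances of 50% and 10%. The sigmoid form, the polynomial coefficients and the one-pole smoothing that stands in for the time parameter ‘t’ are assumptions made for illustration, as are all function names.

```python
import math

Q_VOICED = 0.5    # exemplary balance for voiced segments
Q_UNVOICED = 0.1  # exemplary balance for unvoiced segments

def centered_sigmoid(x, a=10.0):
    """Assumed sigmoid form: odd-symmetric logistic curve with scaling a."""
    return 2.0 / (1.0 + math.exp(-a * x)) - 1.0

def poly_shaper(x, coeffs=(0.0, 1.5, 0.05, 0.05)):
    """sum(p_i * x**i) with hypothetical coefficients: p0 = 0, 1 < p1 < 2, small higher terms."""
    return sum(p * x ** i for i, p in enumerate(coeffs))

def smooth_balance(q_prev, q_target, alpha=0.2):
    """Move q a fraction alpha toward its target each frame (hypothetical smoothing)."""
    return q_prev + alpha * (q_target - q_prev)

def mix(x, q):
    """q * f_sigmoid(x) + (1 - q) * f_poly(x), with q restricted to [0, 1]."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    return q * centered_sigmoid(x) + (1.0 - q) * poly_shaper(x)

# Usage: glide from the voiced balance toward the unvoiced one over a few frames.
q = Q_VOICED
trajectory = []
for _ in range(5):
    q = smooth_balance(q, Q_UNVOICED)
    trajectory.append(q)
sample_out = mix(0.2, q)
```

Smoothing q rather than switching it abruptly is one simple way to avoid a discontinuity when the classifier changes its decision.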
Turning to the low frequencies extension, the presence of low frequencies in the narrow band signals is primarily identified according to a spectral analysis. Next, an equalizer applies an adaptive amplification to low frequencies to compensate for the estimated attenuation. This processing allows the low frequencies to be recovered from network attenuation (Ref. to ideal ITU P.830 MIRS model) or terminal attenuation.
The fourth element or step, short-term and long-term post processing (440), is utilized for joining the new extended high frequencies in wideband areas, e.g. wideband signals 229A and 229B of
Thus, various embodiments of the present invention create a high frequency spectrum and recover a low frequency spectrum, based on the existing narrowband spectrum, that closely match a pure wideband speech signal; provide low complexity for minimizing voice system density, e.g. lower than the CELP codebook mapping extension model; and offer flexible extension from voice up to noise/music for covering voice and audio. It should be further noted that the bandwidth extension of the present invention would also apply to the next generation of wideband speech and audio signal communication, such as Super wide band with sampling frequencies of 14 kHz, 20 kHz, 32 kHz, up to Ultra wide band of 44.1 kHz, known as “Hi-Fi Voice”. In other words, a first band speech/audio may be extended to a second band speech/audio, where the second band speech/audio is wider than the first band speech/audio and includes the first band speech/audio.
From the above description of the invention it is manifest that various techniques can be used for implementing the concepts of the present invention without departing from its scope. Moreover, while the invention has been described with specific reference to certain embodiments, a person of ordinary skill in the art would recognize that changes can be made in form and detail without departing from the spirit and the scope of the invention. As such, the described embodiments are to be considered in all respects as illustrative and not restrictive. It should also be understood that the invention is not limited to the particular embodiments described herein, but is capable of many rearrangements, modifications, and substitutions without departing from the scope of the invention.
This application claims priority to U.S. Provisional Application No. 61/284,626, filed Dec. 21, 2009, which is hereby incorporated by reference in its entirety.
Number | Date | Country | |
---|---|---|---|
61284626 | Dec 2009 | US |