1. Technical Field
This system relates to bandwidth extension, and more particularly, to extending a high-frequency spectrum of a narrowband audio signal.
2. Related Art
Some telecommunication systems transmit speech across a limited frequency range. The receivers, transmitters, and intermediary devices that make up a telecommunication network may be band limited. These devices may limit speech to a bandwidth that significantly reduces intelligibility and introduces perceptually significant distortion that may corrupt speech.
While users may prefer listening to wideband speech, the transmission of such signals may require the building of new communication networks that support larger bandwidths. New networks may be expensive and may take time to become established. Since many established networks support only a narrowband speech bandwidth, there is a need for systems that extend signal bandwidths at receiving ends.
Bandwidth extension may be problematic. While some bandwidth extension methods reconstruct speech under ideal conditions, these methods cannot extend speech in noisy environments. Since it is difficult to model the effects of noise, the accuracy of these methods may decline in the presence of noise. Therefore, there is a need for a robust system that improves the perceived quality of speech.
A system extends the high-frequency spectrum of a narrowband audio signal in the time domain. The system extends the harmonics of vowels by introducing a nonlinearity in a narrowband signal. Extended consonants are generated by a random-noise generator. The system differentiates the vowels from the consonants by exploiting predetermined features of a speech signal.
Other systems, methods, features, and advantages will be, or will become, apparent to one with skill in the art upon examination of the following figures and detailed description. It is intended that all such additional systems, methods, features, and advantages be included within this description, be within the scope of the invention, and be protected by the following claims.
The system may be better understood with reference to the following drawings and description. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention. Moreover, in the figures, like referenced numerals designate corresponding parts throughout the different views.
A system extends the high-frequency spectrum of a narrowband audio signal in the time domain. The system extends the harmonics of vowels by introducing a nonlinearity in a narrowband signal. Extended consonants may be generated by a random-noise generator. The system differentiates the vowels from the consonants by exploiting predetermined features of a speech signal. These features may include the high low-frequency energy content of vowels, the high high-frequency energy content of consonants, the wider envelope of vowels relative to consonants and/or the background noise, and the mutual exclusiveness between consonants and vowels. Some systems smoothly blend the extended signals generated by the multiple modes, so that few or substantially no artifacts remain in the resultant signal. The system provides the flexibility of extending and shaping the consonants to a desired frequency level and spectral shape. Some systems also generate harmonics that are exact or nearly exact multiples of the pitch of the speech signal.
A method may also generate a high-frequency spectrum from a narrowband (NB) audio signal in the time domain. The method may extend the high-frequency spectrum of a narrowband audio signal. The method may use two or more techniques to extend the high-frequency spectrum. If the signal in consideration is a vowel, then the extended high-frequency spectrum may be generated by squaring the NB signal. If the signal in consideration is a consonant or background noise, a random signal is used to represent that portion of the extended spectrum. The generated high-frequency signals are filtered to adjust their spectral shapes and magnitudes and then combined with the NB signal.
The high-frequency extended signals may be blended temporally to minimize artifacts or discontinuities in the bandwidth-extended signal. The method provides the flexibility of extending and shaping the consonants to any desired frequency level and spectral shape. The method may also generate harmonics of the vowels that are exact or nearly exact multiples of the pitch of the speech signal.
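The link between squaring and pitch-locked harmonics can be checked numerically. The following Python sketch is illustrative only: the sample rate, the pitch, the number of harmonics, and the helper names (make_vowel_like, tone_magnitude) are assumptions and do not come from this description. It builds a vowel-like test signal from harmonics of a 200 Hz pitch that stop at 3 kHz, squares it, and measures the level at frequencies above the original band.

```python
import math

SAMPLE_RATE_HZ = 16000  # assumed wideband output rate
PITCH_HZ = 200.0        # assumed pitch of the test vowel
NUM_SAMPLES = 1600      # 0.1 s, a whole number of pitch periods


def make_vowel_like(num_samples):
    """Sum of pitch harmonics 1..15 (200 Hz to 3000 Hz), amplitude 1/h."""
    return [
        sum(math.sin(2 * math.pi * h * PITCH_HZ * n / SAMPLE_RATE_HZ) / h
            for h in range(1, 16))
        for n in range(num_samples)
    ]


def tone_magnitude(signal, freq_hz):
    """Amplitude of the component at freq_hz, found by correlating
    the signal with a cosine and a sine at that frequency."""
    w = 2 * math.pi * freq_hz / SAMPLE_RATE_HZ
    re = sum(s * math.cos(w * n) for n, s in enumerate(signal))
    im = sum(s * math.sin(w * n) for n, s in enumerate(signal))
    return 2.0 * math.hypot(re, im) / len(signal)


nb = make_vowel_like(NUM_SAMPLES)
squared = [s * s for s in nb]  # the nonlinearity

# 5000 Hz is the 25th multiple of the pitch; 5100 Hz is not a multiple.
print("NB      @5000 Hz:", tone_magnitude(nb, 5000.0))
print("squared @5000 Hz:", tone_magnitude(squared, 5000.0))
print("squared @5100 Hz:", tone_magnitude(squared, 5100.0))
```

In this sketch the squared signal gains a component at 5000 Hz, which is absent from the band-limited input, while 5100 Hz stays empty because squaring only produces sums and differences of the existing harmonic frequencies. The squared signal also contains low-frequency and in-band terms, which is one reason the generated signals are filtered before being combined with the NB signal.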
A block diagram of the high-frequency bandwidth extension system 100 is shown in
The level of background noise in the bandwidth extended signal, y(n), may be at the same spectral level as the background noise in the NB signal. Consequently, in moderate to high noise the background noise in the extended spectrum may be heard as a hissing sound. To suppress or dampen the background noise in the extended signal, the bandwidth extended signal, y(n), is then passed through a filter 122 that adaptively suppresses the extended background noise while allowing speech to pass through. The resulting signal, yBg(n), may be further processed by passing through an optional shaping filter 124. A shaping filter may enhance the consonants relative to the vowels and it may selectively vary the spectral shape of some or all of the signal. The selection may depend upon whether the speech segment is a consonant, vowel, or background noise.
The high-frequency signals generated by the random noise generator 104 and by squaring circuit 102 may not be at the correct magnitude levels for combining with the NB signal. Through gain factors, grnd(n) and gsqr(n), the magnitudes of the generated random noise and the squared NB signal may be adjusted. The notations and symbols used are:
To estimate the gain factor, grnd(n), the envelope of the high-pass-filtered NB signal, xh(n), is estimated. If the random noise generator output is adjusted so that it has a variance of unity, then grnd(n) is given by (12).
grnd(n)=Envelope[xh(n)] (12)
The envelope estimator is implemented by taking the absolute value of xh(n) and smoothing it with a filter such as a leaky integrator.
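A minimal Python sketch of such an envelope estimator follows. The one-pole form of the leaky integrator and the leak value of 0.99 are assumptions chosen for illustration; the description does not specify either.

```python
def envelope(signal, leak=0.99):
    """Rectify the signal and smooth it with a leaky integrator:
    env[n] = leak * env[n-1] + (1 - leak) * |x[n]|.
    The leak factor (0 < leak < 1) is an assumed tuning value."""
    if not 0.0 < leak < 1.0:
        raise ValueError("leak must lie strictly between 0 and 1")
    state = 0.0
    out = []
    for sample in signal:
        state = leak * state + (1.0 - leak) * abs(sample)
        out.append(state)
    return out


# A constant input of -2.0 settles toward an envelope of 2.0.
settled = envelope([-2.0] * 2000)[-1]
print(settled)
```

A leak value closer to 1 gives a smoother but slower-reacting envelope; a smaller value tracks onsets faster at the cost of more ripple.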
The gain factor, gsqr(n), adjusts the envelope of the squared, high-pass-filtered NB signal, ξh(n), so that it is at the same level as the envelope of the high-pass-filtered NB signal, xh(n). Consequently, gsqr(n) is given by (13).
The parameter, α, controls the mixing proportion between the gain-adjusted random signal and the gain-adjusted squared NB signal. The combined high-frequency generated signal is expressed as (14).
xc(n)=αgrnd(n)ξh(n)+(1−α)gsqr(n)eh(n) (14)
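The gain adjustment and mixing may be sketched in Python as follows. This is a hedged illustration, not the implementation of (13) and (14): equation (13) is not reproduced in this text, so the ratio of envelopes used for the squared-signal gain is inferred from the prose, and the pairing of each gain with its signal follows the prose description of the mixing (random-noise gain on the random signal, squared-signal gain on the squared signal). The function names, the leak value, and the division floor are assumptions.

```python
def leaky_envelope(signal, leak=0.99):
    """Absolute value followed by a one-pole leaky integrator (assumed form)."""
    state = 0.0
    out = []
    for sample in signal:
        state = leak * state + (1.0 - leak) * abs(sample)
        out.append(state)
    return out


def mix_high_band(x_h, squared_h, noise_h, alpha, leak=0.99, floor=1e-12):
    """Blend unit-variance noise and the squared signal, each scaled so
    that its envelope matches the envelope of x_h.

    alpha = 1 selects only the random signal; alpha = 0 selects only
    the squared signal.
    """
    env_x = leaky_envelope(x_h, leak)
    env_sq = leaky_envelope(squared_h, leak)
    mixed = []
    for n in range(len(x_h)):
        g_rnd = env_x[n]                        # gain for the random signal
        g_sqr = env_x[n] / max(env_sq[n], floor)  # assumed envelope ratio
        mixed.append(alpha * g_rnd * noise_h[n]
                     + (1.0 - alpha) * g_sqr * squared_h[n])
    return mixed


# With alpha = 0, a squared signal at level 0.25 is rescaled to the
# 0.5 level of x_h.
x_h = [0.5] * 50
squared_h = [0.25] * 50
noise_h = [1.0] * 50
print(mix_high_band(x_h, squared_h, noise_h, alpha=0.0)[-1])
```

The floor on the squared-signal envelope only guards against division by zero during silence and is not part of the described method.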
To estimate α, some systems measure whether the portion of speech is more random or more periodic; in other words, whether it has more vowel or more consonant characteristics. To differentiate the vowels from the consonants and background noise in block k of N speech samples, an energy measure, η(k), given by (15), may be used,
where N is the length of each block and σvoice is the average voice magnitude.
Another measure that may be used to detect the presence of vowels is the presence of low-frequency energy. The low-frequency energy may range between about 100 Hz and about 1000 Hz in a speech signal. By combining this condition with η(k), α may be estimated by (16).
In (16), Γα is an empirically determined threshold, ∥·∥ is an operator that denotes the absolute mean of the last N samples of data, σx is the low-frequency background noise energy, and γ(k) is given by (17).
In (17), the thresholds τl and τh may be empirically selected such that 0<τl<τh.
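Equation (17) is not reproduced in this text, so the following Python sketch does not implement it. It only illustrates, under that stated assumption, one common way a pair of ordered thresholds can turn a block measure into a soft value between 0 and 1 instead of a hard vowel/consonant decision: a piecewise-linear ramp. The function name and the example threshold values are hypothetical.

```python
def soft_threshold(measure, tau_low, tau_high):
    """Map a block measure onto [0, 1] with a linear ramp between two
    thresholds (assumed shape; not taken from equation (17))."""
    if not 0.0 < tau_low < tau_high:
        raise ValueError("thresholds must satisfy 0 < tau_low < tau_high")
    if measure <= tau_low:
        return 0.0
    if measure >= tau_high:
        return 1.0
    return (measure - tau_low) / (tau_high - tau_low)


for value in (0.5, 2.0, 5.0):
    print(value, soft_threshold(value, tau_low=1.0, tau_high=3.0))
```

A ramp of this kind avoids abrupt switching when a measure hovers near a single threshold, which is consistent with the goal of blending the two generation modes without audible discontinuities.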
The extended portion of the bandwidth extended signal, xe(n), may have a background noise spectrum level that is close to that of the NB signal. In moderate to high noise, this may be heard as a hissing sound. In some systems an adaptation filter may be used to suppress the level of the extended background noise while allowing speech to pass there through.
In some circumstances, the background noise may be suppressed to a level that is not perceived by the human ear. One approximate measure for obtaining the levels may be found from the threshold curves of tones masked by low-pass noise. For example, to sufficiently reduce the audibility of background noise above about 3.5 kHz, the power spectrum level above about 3.5 kHz is logarithmically tapered down so that the spectrum level at about 5.5 kHz is about 30 dB lower. In this application, the masking level may vary slightly with different speakers and different sound intensities.
In
h(k)=β1(k)h1+β2(k)h2+ . . . +βL(k)hL (18)
In (18) h(k) is the updated filter coefficient vector, h1, h2, . . . , hL are the L basis filter-coefficient vectors, and β1(k), β2(k), . . . , βL(k) are the L scalar coefficients that are updated after every N samples as (19).
βi(k)=fi(φh) (19)
In (19) fi(z) is a certain function of z and φh is the high-frequency signal to noise ratio, in decibels, and given by (20).
In some implementations of the adaptive filter 122, four basis filter-coefficient vectors, each of length 7 may be used. Amplitude responses of these exemplary vectors are plotted in
In (21), the thresholds τ1, τ2, τ3, and τ4 are estimated empirically, with τ1<τ2<τ3<τ4.
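The weighted combination in (18) may be sketched in Python as follows. The basis vectors and weights below are short placeholders chosen so the arithmetic is easy to follow; they are assumptions and are not the length-7 basis vectors or the SNR-dependent weights of the described filter. The helper names are likewise hypothetical.

```python
def combine_basis_filters(weights, basis_filters):
    """Form h(k) = beta_1(k) h_1 + ... + beta_L(k) h_L from L scalar
    weights and L basis coefficient vectors of equal length."""
    if len(weights) != len(basis_filters):
        raise ValueError("need one weight per basis filter")
    length = len(basis_filters[0])
    if any(len(h) != length for h in basis_filters):
        raise ValueError("basis filters must share one length")
    return [sum(w * h[j] for w, h in zip(weights, basis_filters))
            for j in range(length)]


def fir_filter(coeffs, signal):
    """Direct-form FIR filter: y[n] = sum_j coeffs[j] * x[n - j]."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for j, c in enumerate(coeffs):
            if n - j >= 0:
                acc += c * signal[n - j]
        out.append(acc)
    return out


# Placeholder basis: a pass-through filter and a 3-tap smoother.
h_pass = [1.0, 0.0, 0.0]
h_smooth = [0.25, 0.5, 0.25]

# The weights would be recomputed once per block of N samples.
h_k = combine_basis_filters([0.5, 0.5], [h_pass, h_smooth])
print(h_k)
print(fir_filter(h_k, [1.0, 0.0, 0.0, 0.0]))
```

Because only the L scalar weights change from block to block, the filter can be re-tuned cheaply without redesigning its coefficients.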
A shaping filter 124 may change the shape of the extended spectrum depending upon whether the speech signal under consideration is a vowel, consonant, or background noise. In the systems above, consonants may require more boost in the extended high-frequency spectrum than vowels or background noise. To this end, a circuit or process may be used to derive an estimate, ζ(k), and to classify the portion of speech as consonant or non-consonant. The parameter, ζ(k), may not be a hard classification between consonants and non-consonants, but, rather, may vary between about 0 and about 1 depending upon whether the speech signal under consideration has more consonant or non-consonant characteristics.
The parameter, ζ(k), may be estimated on the basis of the low-frequency and high-frequency SNRs and has two states, state 0 and state 1. When in state 0, the speech signal in consideration may be assumed to be either a vowel or background noise, and when in state 1, either a consonant or a high-formant vowel may be assumed. A state diagram depicting the two states and their transitions is shown in
When state is 0:
ζ(k)=0 (22)
When state is 1:
where χ(k) is given by
Thresholds, t1l, t1h, t2l, and t2h, may be dependent on the SNR as shown in (25).
In (25) I is a 4×1 unity column vector and thresholds, c1a, c2a, c3a, c4a, c1b, c2b, c3b, c4b, and Γt, are empirically selected.
The shaping filter may be based on the general adaptive filter in (18). In some systems, two basis filter-coefficient vectors, each of length 6, may be used. Their amplitude responses are shown in
The relationship or algorithm may be applied to speech data that has been passed over both CDMA and GSM networks. In
A time domain high-frequency bandwidth extension method may generate the periodic component of the extended spectrum by squaring the signal, and the non-periodic component by generating a random signal using a signal generator. The method classifies the periodic and non-periodic portions of speech through fuzzy logic or fuzzy estimates. Blending of the extended signals from the two modes of generation may be sufficiently smooth, with few or no artifacts or discontinuities. The method provides the flexibility of extending and shaping the consonants to a desired frequency level and provides extended harmonics that are exact or nearly exact multiples of the pitch frequency through filtering.
An alternative time domain high-frequency bandwidth extension method 800 may generate the periodic component of an extended spectrum. The alternative method 800 determines if a signal represents a vowel or a consonant by detecting distinguishing features of a vowel, a consonant, or some combination at 802. If a vowel is detected in a portion of the narrowband signal, the method generates a portion of the high-frequency spectrum by generating a nonlinearity at 804. A nonlinearity may be generated in some methods by squaring that portion of the narrowband signal. If a consonant is detected in a portion of the narrowband signal, the method generates a second portion of the high-frequency spectrum by generating a random signal at 806. The generated signals are conditioned at 808 and 810 before they are combined with the NB signal at 812. In some methods, the conditioning may include filtering, amplifying, or mixing the respective signals or a combination of these functions. In other methods, the conditioning may compensate for signal attenuation, noise, or signal distortion or some combination of these functions. In yet other methods, the conditioning improves the processed signals.
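The flow of method 800 may be outlined in Python as follows. This is a simplified sketch under stated assumptions: the vowel/consonant decision at 802 is passed in as a flag rather than detected, the conditioning at 808 and 810 is reduced to a single gain, and no high-pass filtering is applied, so the generated signal is not confined to the extended band as it would be in a practical system. All function names and the gain value are hypothetical.

```python
import random


def generate_high_band(nb_block, is_vowel, rng):
    """Steps 804/806: square the block for a vowel, otherwise draw
    unit-variance Gaussian noise of the same length."""
    if is_vowel:
        return [s * s for s in nb_block]
    return [rng.gauss(0.0, 1.0) for _ in nb_block]


def condition(block, gain):
    """Steps 808/810, reduced here to amplification only (assumption)."""
    return [gain * s for s in block]


def extend_block(nb_block, is_vowel, gain, rng):
    """Step 812: add the conditioned generated signal to the NB block."""
    generated = condition(generate_high_band(nb_block, is_vowel, rng), gain)
    return [x + e for x, e in zip(nb_block, generated)]


rng = random.Random(0)  # seeded so repeated runs give the same noise
print(extend_block([0.5, -0.5], is_vowel=True, gain=2.0, rng=rng))
print(extend_block([0.5, -0.5], is_vowel=False, gain=0.1, rng=rng))
```

A hard flag is used only to keep the outline short; the fuzzy estimates described earlier would replace it with a mixing proportion between the two generated signals.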
In
Each of the systems and methods described above may be encoded in a signal bearing medium, a computer readable medium such as a memory, programmed within a device such as one or more integrated circuits, or processed by a controller or a computer. If the methods are performed by software, the software may reside in a memory resident to or interfaced to the processor, controller, buffer, or any other type of non-volatile or volatile memory interfaced to, or resident to, speech extension logic. The logic may comprise hardware (e.g., controllers, processors, circuits, etc.), software, or a combination of hardware and software. The memory may retain an ordered listing of executable instructions for implementing logical functions. A logical function may be implemented through digital circuitry, through source code, through analog circuitry, or through an analog source such as an analog electrical or optical signal. The software may be embodied in any computer-readable or signal-bearing medium, for use by, or in connection with, an instruction executable system, apparatus, or device. Such a system may include a computer-based system, a processor-containing system, or another system that may selectively fetch instructions from an instruction executable system, apparatus, or device that may also execute instructions.
A “computer-readable medium,” “machine-readable medium,” “propagated-signal” medium, and/or “signal-bearing medium” may comprise any apparatus that contains, stores, communicates, propagates, or transports software for use by or in connection with an instruction executable system, apparatus, or device. The machine-readable medium may selectively be, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. A non-exhaustive list of examples of a machine-readable medium would include: an electrical connection “electronic” having one or more wires, a portable magnetic or optical disk, a volatile memory such as a Random Access Memory “RAM” (electronic), a Read-Only Memory “ROM” (electronic), an Erasable Programmable Read-Only Memory (EPROM or Flash memory) (electronic), or an optical fiber (optical). A machine-readable medium may also include a tangible medium upon which software is printed, as the software may be electronically stored as an image or in another format (e.g., through an optical scan), then compiled, and/or interpreted or otherwise processed. The processed medium may then be stored in a computer and/or machine memory.
The above described systems may be embodied in many technologies and configurations that receive spoken words. In some applications the systems are integrated within or form a unitary part of a speech enhancement system. The speech enhancement system may interface or couple instruments and devices within structures that transport people or things, such as a vehicle. These and other systems may interface cross-platform applications, controllers, or interfaces.
While various embodiments of the invention have been described, it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the invention. Accordingly, the invention is not to be restricted except in light of the attached claims and their equivalents.
This application claims the benefit of priority from U.S. Provisional Application No. 60/903,079, Feb. 23, 2007. The entire content of the application is incorporated by reference, except that in the event of any inconsistent disclosure from the present application, the disclosure herein shall be deemed to prevail.
Number | Date | Country | |
---|---|---|---|
20080208572 A1 | Aug 2008 | US |
Number | Date | Country | |
---|---|---|---|
60903079 | Feb 2007 | US |