This application claims the benefit of priority from European Patent 07021121.4, filed Oct. 29, 2007, which is incorporated by reference.
1. Technical Field
This disclosure relates to verbal communication and in particular to signal reconstruction.
2. Related Art
Mobile communications may use networks of transmitter to convey telephone calls from one destination to another. The quality of these calls may suffer from the naturally occurring or system generated interference that degrades the quality or performance of the communication channels. The interference and noise may affect the conversion of words into a machine readable input.
Some systems attempt to improve speech quality by only suppressing noise. Since the noise is not entirely eliminated, intelligibility may not sufficiently improve. Low signal-to-noise ratios may not be detected by some speech recognition systems. Therefore, there is a need for a system to improve intelligibility in communication systems.
A system enhances the quality of a digital speech signal that may include noise. The system identifies vocal expressions that correspond to the digital speech signal. A signal-to-noise ratio of the digital speech signal is measured before a portion of the digital speech signal is synthesized. The selected portion of the digital signal may have a signal-to-noise ratio below a predetermined level and the synthesis may be based on speaker identification.
Other systems, methods, features, and advantages will be, or will become, apparent to one with skill in the art upon examination of the following figures and detailed description. It is intended that all such additional systems, methods, features and advantages be included within this description, be within the scope of the invention, and be protected by the following claims.
The system may be better understood with reference to the following drawings and description. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention. Moreover, in the figures, like referenced numerals designate corresponding parts throughout the different views.
Systems may transmit, store, manipulate, and synthesize speech. Some systems identify speakers by comparing speech represented in digital formats. Based on power levels, a system may synthesize a portion of a digital speech signal. The power levels may be below a programmable threshold. The system may convert portions of the digital speech signal into aural signals based on speaker identification.
One or more sensors or input devices may convert sound into an analog signal or digital data stream 102 (in
A partial speech synthesis at 114 may be based on an identification of the speaker at 110. Speaker-dependent data at 112 may be processed during the synthesis that includes significant noise levels. The speaker-dependent data may comprise one or more pitch pulse prototypes (e.g., samples) and spectral envelopes. The samples and envelopes may be extracted from a current speech signal, a previous speech signal, or retrieved from a local or remote central or distributed database. Cepstral coefficients, line spectral frequencies, and/or speaker-dependent features may also be processed.
In some systems portions of a digital speech signal having power levels greater than a predetermined level or within a range are filtered at 116. The filter may selectively pass content or speech while attenuating, dampening, or minimizing noise. The selected signal and portions of the synthesized digital speech signal may be adaptively combined at 118. The combination and selected filtering may be based on a measured SNR. If the SNR (e.g., in a frequency sub-band) is sufficiently high, a predetermined pass-band and/or attenuation level may be selected and applied.
Some systems may minimize artifacts by combining only filtered and synthesized signals. The entire digital speech signal may be filtered or processed. A Wiener filter may estimate the noise contributions of the entire signal by processing each bin and sub-band. A speech synthesizer may process the relatively noisy signal portions. The combination of synthesized and filtered signal may be adapted based on a predetermined SNR level.
When the signal-to-noise ratio of one or more segments of a digital speech signal falls below (or is below) a threshold (e.g., a predetermined level), the segment(s) may be synthesized through one or more pitch pulse prototypes (or models) and spectral envelopes. The pitch pulse prototypes and envelopes may be derived from an identified speech segment. In some systems, a pitch pulse prototype represents an obtained excitation signal (spectrum) that represents the signal that would be detected near the vocal chords or a vocal tract of the identified speaker. The (short-term) spectral envelope may represent the tone color. Some systems calculate a predictive error filter through a Linear Predictive Coding (LPC) method. The coefficients of the predictive error filter may be applied or processed to parametrically determine the spectral envelope. In an alternative system, spectral envelope models are processed based on line spectral frequencies, cepstral coefficients, and/or mel-frequency cepstral coefficients.
A pitch pulse prototype and/or spectral envelope may be extracted from a speech signal or a previously analyzed speech signal obtained from a common speaker. A codebook database may retain spectral envelopes associated or trained by the identified speaker. The spectral envelope E(ejΩ
E(ejΩ
where ES(ejΩ
By a mapping function, the spectral envelope E(ejΩ
In some systems one or more portions of the synthesized speech signal may be filtered. The filter may comprise a window function that selectively passes certain elements of the signal before the elements are combined with one or more filtered portions of the speech signal. A windowing functions like a Hann window or a Hamming window, for example, may adapt the power of the filtered synthesized speech signal to that of the noise reduced signal parts. The function may smooth portions of the signal. In some applications the smoothed portions may be near one or more edges of a current signal frame.
Some systems identify speakers through speaker models. A speaker model may include a stochastic speaker model that may be trained by a known speaker on-line or off-line. Some stochastic speech models include Gaussian mixture models (GMM) and Hidden Markov Models (HMM). If an unknown speaker is identified, on-line training may generate a new speaker-dependent model. Some on-line training generates high-quality feature samples (e.g., pitch pulse prototypes, spectral envelopes etc.) when the training occurs under controlled conditions and when speaker is identified within a high confidence interval.
In those instances when speaker identification is not complete or a speaker is unknown, the speaker-independent data (e.g., pitch pulse prototypes, spectral envelopes, etc.) may be processed to partially synthesize speech. An analysis of the speech signal from an unknown speaker may extract new pitch pulse prototypes and spectral envelopes. The prototypes and envelopes may be assigned to the previously unknown speaker for future identification (e.g., during processing within a common session or whenever processing vocal expressions from that speaker).
When retained in a computer readable storage medium the process may comprise computer-executable instructions. The instructions may identify a speaker whose vocal expressions correspond to a digital speech signal. A speech input 202 of
The alternative system of
The analysis processor 306 may comprise separate physical or logical units or may be a unitary device (that may keep power consumption low). The analysis processor 306 may be configured to process digital signals in a sub-band regime (which allows for very efficient processing). The processor 306 may interface or include an optional analysis filter bank that applies a Hann window that divides the digital speech signal into sub-band signals. The processor 306 may interface or include an optional synthesis filter bank (that may apply the same window function as an analysis filter bank that may be part of or interface the analysis processor 306). The synthesis filter bank may synthesize some or all of the sub-band signals that are processed by the mixer 312 to obtain an enhanced digital speech signal.
Some alternative systems may include or interface a delay device and/or a filter that applies window functions. The delay device may be programmed or configured to delay the noise reduced digital speech signal. The window function may filter the synthesized portion of the digital speech signal. Some alternative systems may further include a local or remote central or distributed codebook database that retains speaker-dependent or speaker-independent spectral envelopes. The synthesizer 310 may be programmed or configured to synthesize some of the digital speech signal based on a spectral envelope accessed from the codebook database. In some applications, the synthesizer 310 may be configured or programmed to combine spectral envelopes that were estimated from the digital speech signal and retrieved from the codebook database. A combination may be formed through a linear mapping.
Some systems may include or interface an identification database. The identification database may retain training data that may identify a speaker. The analysis processor 306 in this system and the systems described above may be programmed or configured to identify the speaker by processing or generating a stochastic speech model. In the alternative systems (including those described) may interface or include a database that retains speaker-independent data (as, e.g., speaker-independent pitch pulse prototypes) that may facilitate speech synthesis when identification is incomplete or identification has failed. Each of the systems and alternatives described may process and convert one or more signals into a mediated verbal communication. The systems may interface or may be part of an in-vehicle (
Speakers may be identified in noisy environments (e.g., within vehicles). Some systems may assign a pitch pulse prototype to users that speak in noisy environments. In some processes one or more stochastic speaker-independent speech models (e.g., a GMM) may be trained by two or more different speakers articulating two or more different utterances (e.g., through a k-means or expectation maximization (EM) algorithm)). A speaker-independent model such as a Universal Background Model may be adapted or serve as a template for some speaker-dependent models. A speech signal articulated in a low-perturbed environment and exclusive noisy backgrounds (without speech) may be stored in a local or remote centrally located or distributed database. The stored representations may facilitate a statistical modeling of noise influences on speech (characteristics and/or features). Through this retention, the process may account for or compensate for the influence noise may have on some or all selected speech segments. In some processes the data may affect the extraction of feature vectors that may be processed to generate a spectral envelope.
Unperturbed feature vectors may be estimated from perturbed feature vectors by processing data associated with background noise. The data may represent the noise detected in vehicle cabins that may correspond to different speeds, interior and/or exterior climate conditions, road conditions, etc. Unperturbed speech samples of a Universal Background Model may be modified by noise signals (or modifications associated or assigned to them) and the relationships of unperturbed and perturbed features of the speech signals may be monitored and stored on or off-line. Data representing statistical relationships may be further processed when estimating feature vectors (and, e.g., the spectral envelope). In some processes, heavily perturbed low-frequency parts of processed speech signals may be removed or deleted during training and/or through the enhancement process of
In
For a relatively high SNR, some noise reduction filter may enhance the quality of speech signals. Under highly perturbed conditions, the same noise reduction filter may not be as effective. Because of this condition, the process may determine or estimate which parts of the detected speech signal exhibit an SNR below a predetermined or pre-programmed SNR level (e.g. below 3 dB) and which parts exhibit an SNR that exceeds that level. Those parts of the speech signal with relatively low perturbations (SNR above the predetermined level) are filtered at 608 by some a noise reduction filter. The filter may comprise a Wiener filter. Those portions of the speech signal with relatively high perturbations (SNR below the predetermined level) may be synthesized (or reconstructed) at 610 before the signal is combined with the filtered portions at 612.
The system that synthesizes the speech signal exhibiting high perturbations may access and process speaker-dependent pitch pulse prototypes retained in a database. When speaker is identified at 604, associated pitch pulse prototypes (that may comprise the long-term correlations) may be retrieved and combined with spectral envelopes (that may comprise short term correlations) to synthesize speech. In an alternative process, the pitch pulse prototypes may be extracted from a speaker's vocal expression, in particular, from utterances subject to relatively low perturbations.
To reliably extract some pitch pulse prototypes, the average SNR may be sufficiently high for a frequency that ranges from the speaker's average pitch frequency to a level that's about five to about ten times that frequency. The current pitch frequency may be estimated with sufficient accuracy. In addition, a suitable spectral distance measure may be made by e.g.,
where Y(ejΩ
When these conditions are satisfied, the spectral envelope may be extracted and stripped from the speech signal (consisting of L sub-frames) through a predictive error filtering, for example. The pitch pulse that is located closest to a middle or a selected frame, may be shifted so that it is positioned exactly or near the middle of the frame. In some processes, a Hann window may be overlaid across the frame. The spectrum of a speaker-dependent pitch pulse prototype may be obtained through a Discrete Fourier Transform and power normalization.
When a speaker is identified and if the environmental conditions allow for a precise estimate of a new pitch impulse, some processes extract two or more (e.g., a variety) speaker-dependent pitch pulse prototypes for different pitch frequencies. When synthesizing portion of the speech signal, a selected pitch pulse prototype may be processed that has a fundamental frequency substantially near the current estimated pitch frequency. When a number (e.g., predetermined number) of the extracted pitch pulses prototypes differ from those stored by a predetermined measure, one or more of the extracted pitch pulses prototypes may be written to memory (or a database) to replace the previously stored prototype. Through this dynamic refresh process or cycle, the process may renew the prototypes with more accurate representations. A reliable speech synthesis may be sustained even under atypical conditions that may cause undesired or outlier pitch pulses to be retained in memory (or the database).
At 612, the synthesized and noise reduced portions of the speech signal are combined. The result or enhanced speech signal may be generated or received by an in-vehicle or out-of-vehicle system. The system may comprise a navigation system interfaced to a structure for transporting persons or things (e.g., a vehicle shown in
A classifier 706 may discriminate the signal segments that display a noise-like structure (an unvoiced portion in which no periodicity may be apparent) and a quasi-periodic segment (a voiced portion) of the speech sub-band signals. A pitch estimator 708 may estimate the pitch frequency fp(n). The pitch frequency fp(n) may be estimated through an autocorrelation analysis, cepstral analysis, etc. A spectral envelope detector 710 may estimate the spectral envelope E(ejΩ
The excitation spectrum P(ejΩ
where m denotes a time instant in a current signal frame n. For each frame signal synthesis is performed by a synthesizer 714 wherever (within the frame) a pitch frequency is determined to obtain the synthesis signal vector ŝr(n). Transitions from voiced (fp determined) to unvoiced portions may be smoothed to avoid artifacts. The synthesis signal ŝr(n) may be multiplied (e.g., a multiplier) by the same window function that was applied by the analysis filter bank 702 to adapt the power of both the synthesis and noise reduced signals ŝg(n) and ŝr(n).
After the signal is transformed to the frequency domain through a Fast Fourier Transformer or controller 716 the synthesis signal ŝr(n) and the time delayed noise reduced signal ŝg(n) are adaptively mixed by mixer 718. Delay is introduced in the noise reduction path by a delay unit (or delayer) 722 to compensate for the processing delay in the upper branch of
The excitation signal may be shaped with the estimated spectral envelope. In
Based on the SNR determined by the noise reduction filter 704 of
where SNR0 denotes a suitable predetermined level with which the current SNR of a signal (portion) is compared.
The extracted spectral envelope Es(ejΩ
E(ejΩ
In the above examples, speaker-dependent data may be processed to partially synthesize speech. In some applications speaker identification may be difficult in noisy environments and reliable identification may not occur with the speaker's first utterance. In some alternative systems, speaker-independent data (pitch pulse prototypes, spectral envelopes) may be processed (in these conditions) to partially reconstruct a detected speech signal until the current speaker is or may be identified. After successful identification, the systems may continue to process speaker-dependent data.
While signals are processed in each time frame, speaker-dependent features may be extracted from the speech signal and may be compared with stored features. By this comparison, some or all of the extracted speaker-dependent features may replace the previously stored features (e.g., data). This process may occur under many conditions including environments subject to a higher level of transient or background noise. Other alternate systems and methods may include combinations of some or all of the structure and functions described above or shown in one or more or each of the figures. These systems or methods are formed from any combination of structures and function described or illustrated within the figures.
The methods, systems, and descriptions above may be encoded in a signal bearing medium, a computer readable medium or a computer readable storage medium such as a memory that may comprise unitary or separate logic, programmed within a device such as one or more integrated circuits, or processed by a controller or a computer. If the methods or descriptions are performed by software, the software or logic may reside in a memory resident to or interfaced to one or more processors, digital signal processors, or controllers, a communication interface, a wireless system, a powertrain controller, body control module, an entertainment and/or comfort controller of a vehicle, a non-vehicle system or non-volatile or volatile memory remote from or resident to the a speech recognition device or processor. The memory may retain an ordered listing of executable instructions for implementing logical functions. A logical function may be implemented through digital circuitry, through source code, through analog circuitry, or through an analog source such as through an analog electrical, or audio signals.
The software may be embodied in any computer-readable storage medium or signal-bearing medium, for use by, or in connection with an instruction executable system or apparatus resident to a vehicle or a hands-free or wireless communication system. Alternatively, the software may be embodied in a navigation system or media players (including portable media players) and/or recorders. Such a system may include a computer-based system, a processor-containing system that includes an input and output interface that may communicate with an automotive, vehicle, or wireless communication bus through any hardwired or wireless automotive communication protocol, combinations, or other hardwired or wireless communication protocols to a local or remote destination, server, or cluster.
A computer-readable medium, machine-readable storage medium, propagated-signal medium, and/or signal-bearing medium may comprise any medium that contains, stores, communicates, propagates, or transports software for use by or in connection with an instruction executable system, apparatus, or device. The machine-readable storage medium may selectively be, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. A non-exhaustive list of examples of a machine-readable medium would include: an electrical or tangible connection having one or more links, a portable magnetic or optical disk, a volatile memory such as a Random Access Memory “RAM” (electronic), a Read-Only Memory “ROM,” an Erasable Programmable Read-Only Memory (EPROM or Flash memory), or an optical fiber. A machine-readable medium may also include a tangible medium upon which software is printed, as the software may be electronically stored as an image or in another format (e.g., through an optical scan), then compiled by a controller, and/or interpreted or otherwise processed. The processed medium may then be stored in a local or remote computer and/or a machine memory.
While various embodiments of the invention have been described, it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the invention. Accordingly, the invention is not to be restricted except in light of the attached claims and their equivalents.
Number | Date | Country | Kind |
---|---|---|---|
07021121.4 | Oct 2007 | EP | regional |