The disclosed embodiments relate to systems and methods for detecting and processing a desired acoustic signal in the presence of acoustic noise.
Many noise suppression algorithms and techniques have been developed over the years. Most of the noise suppression systems in use today for speech communication systems are based on a single-microphone spectral subtraction technique first developed in the 1970s and described, for example, by S. F. Boll in “Suppression of Acoustic Noise in Speech using Spectral Subtraction,” IEEE Trans. on ASSP, pp. 113-120, 1979. These techniques have been refined over the years, but the basic principles of operation have remained the same. See, for example, U.S. Pat. No. 5,687,243 of McLaughlin, et al., and U.S. Pat. No. 4,811,404 of Vilmur, et al. Generally, these techniques make use of a single-microphone Voice Activity Detector (VAD) to determine the background noise characteristics, where “voice” is generally understood to include human voiced speech, unvoiced speech, or a combination of voiced and unvoiced speech.
The VAD has also been used in digital cellular systems. As an example of such a use, see U.S. Pat. No. 6,453,291 of Ashley, where a VAD configuration appropriate to the front-end of a digital cellular system is described. Further, some Code Division Multiple Access (CDMA) systems utilize a VAD to minimize the effective radio spectrum used, thereby allowing for more system capacity. Also, Global System for Mobile Communication (GSM) systems can include a VAD to reduce co-channel interference and to reduce battery consumption on the client or subscriber device.
These typical single-microphone VAD systems are significantly limited in capability as a result of the analysis of acoustic information received by the single microphone, wherein the analysis is performed using typical signal processing techniques. In particular, limitations in performance of these single-microphone VAD systems are noted when processing signals having a low signal-to-noise ratio (SNR), and in settings where the background noise varies quickly. Thus, similar limitations are found in noise suppression systems using these single-microphone VADs.
Many limitations of these typical single-microphone VAD systems were overcome with the introduction of the Pathfinder noise suppression system by Aliph of San Francisco, Calif. (http://www.aliph.com), described in detail in the Related Applications. The Pathfinder noise suppression system differs from typical noise cancellation systems in several important ways. For example, it uses an accurate voice activity detection (VAD) signal along with two or more microphones, where the microphones detect a mix of both noise and speech signals. While the Pathfinder noise suppression system can be used with and integrated in a number of communication systems and signal processing systems, a variety of devices and/or methods can likewise be used to supply the VAD signal. Further, a number of microphone types and configurations can be used to provide acoustic signal information to the Pathfinder system.
In the drawings, the same reference numbers identify identical or substantially similar elements or acts. To easily identify the discussion of any particular element or act, the most significant digit or digits in a reference number refer to the Figure number in which that element is first introduced (e.g., element 105 is first introduced and discussed with respect to FIG. 1).
The headings provided herein are for convenience only and do not necessarily affect the scope or meaning of the claimed invention. The following description provides specific details for a thorough understanding of, and enabling description for, embodiments of the invention. However, one skilled in the art will understand that the invention may be practiced without these details. In other instances, well-known structures and functions have not been shown or described in detail to avoid unnecessarily obscuring the description of the embodiments of the invention.
Numerous communication systems are described below, including both handset and headset devices, which use a variety of microphone configurations to receive acoustic signals of an environment. The microphone configurations include, for example, a two-microphone array including two unidirectional microphones, and a two-microphone array including one unidirectional microphone and one omnidirectional microphone, but are not so limited. The communication systems can also include Voice Activity Detection (VAD) devices to provide voice activity signals that include information of human voicing activity. Components of the communications systems receive the acoustic signals and voice activity signals and, in response, automatically generate control signals from data of the voice activity signals. Components of the communication systems use the control signals to automatically select a denoising method appropriate to data of frequency subbands of the acoustic signals. The selected denoising method is applied to the acoustic signals to generate denoised acoustic signals when the acoustic signals include speech and noise.
Numerous microphone configurations are described below for use with the Pathfinder noise suppression system. As such, each configuration is described in detail along with a method of use to reduce noise transmission in communication devices, in the context of the Pathfinder system. When the Pathfinder noise suppression system is referred to, it should be kept in mind that the reference includes any noise suppression system that estimates the noise waveform and subtracts it from a signal, and that uses or is capable of using the disclosed microphone configurations and VAD information for reliable operation. Pathfinder is simply a convenient reference implementation of a system that operates on signals comprising desired speech signals along with noise. Thus, the use of these physical microphone configurations includes but is not limited to applications such as communications, speech recognition, and voice-feature control of applications and/or devices.
The terms “speech” or “voice” as used herein generally refer to voiced, unvoiced, or mixed voiced and unvoiced human speech. Unvoiced speech or voiced speech is distinguished where necessary. However, the term “speech signal” or “speech”, when used in contrast to noise, simply refers to any desired portion of a signal and does not necessarily have to be human speech. It could, as an example, be music or some other type of desired acoustic information. As used in the Figures, “speech” means any signal of interest, whether human speech, music, or any other signal that the user desires to hear.
In the same manner, “noise” refers to unwanted acoustic information that distorts a desired speech signal or makes it more difficult to comprehend. “Noise suppression” generally describes any method by which noise is reduced or eliminated in an electronic signal.
Moreover, the term “VAD” is generally defined as a vector or array signal, data, or information that in some manner represents the occurrence of speech in the digital or analog domain. A common representation of VAD information is a one-bit digital signal sampled at the same rate as the corresponding acoustic signals, with a zero value representing that no speech has occurred during the corresponding time sample, and a unity value indicating that speech has occurred during the corresponding time sample. While the embodiments described herein are generally described in the digital domain, the descriptions are also valid for the analog domain.
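As a minimal illustrative sketch (not part of the original disclosure), the one-bit digital VAD representation can be pictured as follows; the instantaneous-energy threshold here is a hypothetical stand-in for whatever VAD device or algorithm actually supplies the decision:

```python
import numpy as np

def one_bit_vad(samples, energy_threshold=1e-3):
    """Toy sketch of the one-bit VAD representation: one 0/1 value per
    acoustic sample, with unity where speech is judged present. The
    instantaneous-energy test is a hypothetical placeholder for a real
    VAD device or algorithm."""
    samples = np.asarray(samples, dtype=float)
    return (samples ** 2 > energy_threshold).astype(np.uint8)

# A burst of "speech" surrounded by near-silence
print(one_bit_vad([0.001, 0.5, -0.4, 0.002]))  # -> [0 1 1 0]
```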
The term “Pathfinder”, unless otherwise specified, denotes any denoising system using two or more microphones, a VAD device and algorithm, and which estimates the noise in a signal and subtracts it from that signal. The Aliph Pathfinder system is simply a convenient reference for this type of denoising system, although it is more capable than the above definition. In some cases (such as the microphone arrays described in
The Pathfinder system is a digital signal processing (DSP)-based acoustic noise suppression and echo-cancellation system. The Pathfinder system, which can couple to the front-end of speech processing systems, uses VAD information and received acoustic information to reduce or eliminate noise in desired acoustic signals by estimating the noise waveform and subtracting it from a signal including both speech and noise. The Pathfinder system is described further below and in the Related Applications.
Components of the signal processing system 100, for example the noise removal system 105, couple to the microphones MIC 1 and MIC 2 via wireless couplings, wired couplings, and/or a combination of wireless and wired couplings. Likewise, the VAD system 106 couples to components of the signal processing system 100, like the noise removal system 105, via wireless couplings, wired couplings, and/or a combination of wireless and wired couplings. As an example, the VAD devices and microphones described below as components of the VAD system 106 can comply with the Bluetooth wireless specification for wireless communication with other components of the signal processing system, but are not so limited.
The communications device 170 includes both handset and headset communication devices, but is not so limited. Handsets or handset communication devices include, but are not limited to, portable communication devices that include microphones, speakers, communications electronics and electronic transceivers, such as cellular telephones, portable or mobile telephones, satellite telephones, wireline telephones, Internet telephones, wireless transceivers, wireless communication radios, personal digital assistants (PDAs), and personal computers (PCs).
Headset or headset communication devices include, but are not limited to, self-contained devices including microphones and speakers generally attached to and/or worn on the body. Headsets often function with handsets via couplings with the handsets, where the couplings can be wired, wireless, or a combination of wired and wireless connections. However, the headsets can communicate independently with components of a communications network.
The VAD device 140 includes, but is not limited to, accelerometers, skin surface microphones (SSMs), and electromagnetic devices, along with the associated software or algorithms. Further, the VAD device 140 includes acoustic microphones along with the associated software. The VAD devices and associated software are described in U.S. patent application Ser. No. 10/383,162, entitled VOICE ACTIVITY DETECTION (VAD) DEVICES AND METHODS FOR USE WITH NOISE SUPPRESSION SYSTEMS, filed Mar. 5, 2003.
The configurations described below of each handset/headset design include the location and orientation of the microphones and the method used to obtain a reliable VAD signal. All other components (including the speaker and mounting hardware for headsets and the speaker, buttons, plugs, physical hardware, etc. for the handsets) are inconsequential for the operation of the Pathfinder noise suppression algorithm and will not be discussed in great detail, with the exception of the mounting of unidirectional microphones in the handset or headset. The mounting is described to provide information for the proper ventilation of the directional microphones. Those familiar with the state of the art will not have difficulty mounting the unidirectional microphones correctly given the placement and orientation information in this application.
Furthermore, the method of coupling (either physical or electromagnetic or otherwise) of the headsets described below is inconsequential. The headsets described work with any type of coupling, so they are not specified in this disclosure. Finally, the microphone configuration 110 and the VAD 130 are independent, so that any microphone configuration can work with any VAD device/method, unless it is desired to use the same microphones for both the VAD and the microphone configuration. In this case the VAD can place certain requirements on the microphone configuration. These exceptions are noted in the text.
Microphone Configurations
The Pathfinder system, although using particular microphone types (omnidirectional or unidirectional, including the amount of unidirectionality) and microphone orientations, is not sensitive to the typical distribution of responses of individual microphones of a given type. Thus the microphones do not need to be matched in terms of frequency response nor do they need to be especially sensitive or expensive. In fact, configurations described herein have been constructed using inexpensive off-the-shelf microphones, which have proven to be very effective. As an aid to review, the Pathfinder setup is shown in
There are many different types of microphones in use today, but generally speaking, there are two main categories: omnidirectional (referred to herein as “OMNI microphones” or “OMNI”) and unidirectional (referred to herein as “UNI microphones” or “UNI”). The OMNI microphones are characterized by relatively consistent spatial response with respect to relative acoustic signal location, and UNI microphones are characterized by responses that vary with respect to the relative orientation of the acoustic source and the microphone. Specifically, the UNI microphones are normally designed to be less responsive behind and to the sides of the microphone so that signals from the front of the microphone are emphasized relative to those from the sides and rear.
There are several types of UNI microphones (although really only one type of OMNI) and the types are differentiated by the microphone's spatial response.
Microphone Arrays Including Mixed OMNI and UNI Microphones
In an embodiment, an OMNI and UNI microphone are mixed to form a two-microphone array for use with the Pathfinder system. The two-microphone array includes combinations where the UNI microphone is the speech microphone and combinations in which the OMNI microphone is the speech microphone, but is not so limited.
UNI Microphone as Speech Microphone
With reference to
The general configurations 310 and 320 show how the microphones can be oriented in a general fashion as well as a possible implementation of this setup for a handset and a headset, respectively. The UNI microphone, as the speech microphone, points toward the user's mouth. The OMNI has no specific orientation, but its location in this embodiment physically shields it from speech signals as much as possible. This setup works well for the Pathfinder system since the speech microphone contains mostly speech and the noise microphone mainly noise. Thus, the speech microphone has a high signal-to-noise ratio (SNR) and the noise microphone has a lower SNR. This enables the Pathfinder algorithm to be effective.
OMNI Microphone as Speech Microphone
In this embodiment, and referring to
In this configuration where the speech microphone is an OMNI, the UNI is oriented in such a way as to keep the amount of speech in the UNI microphone small compared to the amount of speech in the OMNI. This means that the UNI will be oriented away from the speaker's mouth, and the amount it is oriented away from the speaker is denoted by θ, which can vary between 0 and 180 degrees, where θ describes the angle between the direction of one microphone and the direction of another microphone in any plane.
The embodiments of
Microphone Arrays Including Two UNI Microphones
The microphone array of an embodiment includes two UNI microphones, where a first UNI microphone is the speech microphone and a second UNI microphone is the noise microphone. In the following description the maximum of the spatial response of the speech UNI is assumed oriented toward the user's mouth.
Noise UNI Microphone Oriented Away from Speaker
Similar to the configurations described above with reference to
UNI/UNI Microphone Array
When using the UNI/UNI microphone array, the same type of UNI microphone (cardioid, supercardioid, etc.) should be used. If this is not the case, one microphone could detect signals that the other microphone does not detect, causing a reduction in noise suppression effectiveness. The two UNI microphones should be oriented in the same direction, toward the speaker. Obviously the noise microphone will pick up a lot of speech, so the full version of the Pathfinder system should be used to avoid de-signaling.
Placement of the two UNI microphones on the axis that includes the user's mouth at one end and the noise microphone at the other, together with a microphone spacing d equal to the distance sound travels in an integer number of sample periods, keeps the differential transfer function between the two microphones simple and therefore allows the Pathfinder system to operate at peak efficiency. As an example, if the acoustic data is sampled at 8 kHz, the time between samples is 1/8000 seconds, or 0.125 milliseconds. The speed of sound in air is pressure and temperature dependent, but at sea level and room temperature it is about 345 meters per second. Therefore in 0.125 milliseconds sound travels 345 × 0.000125 ≈ 4.3 centimeters, and the microphones should be spaced about 4.3 centimeters apart, or 8.6 cm, or 12.9 cm, and so on.
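A short arithmetic sketch of the spacing rule just described (the function name and multiple count are ours; the 345 m/s figure is from the text):

```python
SPEED_OF_SOUND_M_S = 345.0  # sea level, room temperature (approximate)

def sample_spacings_cm(sample_rate_hz, n_multiples=4):
    """Microphone spacings whose acoustic travel time is a whole number of
    sample periods; at 8 kHz one period (0.125 ms) is about 4.3 cm."""
    per_sample_m = SPEED_OF_SOUND_M_S / sample_rate_hz
    return [round(100.0 * per_sample_m * k, 1) for k in range(1, n_multiples + 1)]

print(sample_spacings_cm(8000))  # [4.3, 8.6, 12.9, 17.2]
```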
For example, and with reference to the two-UNI array described above, for speech originating at the user's mouth the microphone outputs are related by

M2(z)=Cz−1M1(z), so that H2(z)=M2(z)/M1(z)=Cz−1

where Mn(z) is the discrete digital output from microphone n, C is a constant depending on the distance from MIC 1 to the acoustic source and the response of the microphones, and z−1 is a delay of a single sample in the discrete digital domain. Essentially, for acoustic energy originating from the user's mouth, the information captured by MIC 2 is the same as that captured by MIC 1, only delayed by a single sample (due to the 4.3 cm separation) and with a different amplitude. This simple H2(z) could be hardcoded for this array configuration and used with Pathfinder to denoise noisy speech with minimal distortion.
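A sketch of what hardcoding this H2(z) might look like in a digital implementation; the gain value c=0.8 is purely illustrative, as the text defines C only as distance- and microphone-dependent:

```python
import numpy as np

def apply_hardcoded_h2(mic1, c=0.8):
    """Model MIC 2's view of the speech as MIC 1's output delayed by one
    sample and scaled: H2(z) = C * z^-1. The gain c is a hypothetical
    value; C actually depends on source distance and microphone response."""
    mic1 = np.asarray(mic1, dtype=float)
    delayed = np.concatenate(([0.0], mic1[:-1]))  # z^-1: one-sample delay
    return c * delayed
```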
Microphone Arrays Including Two OMNI Microphones
The microphone array of an embodiment includes two OMNI microphones, where a first OMNI microphone is the speech microphone and a second OMNI microphone is the noise microphone.
As with the UNI/UNI microphone array described above, perfect alignment between the two OMNI microphones and the speaker's mouth is not strictly necessary, although that alignment offers the best performance. This configuration is a likely implementation for handsets, for both price reasons (OMNIs are less expensive than UNIs) and packaging reasons (it is simpler to properly vent OMNIs than UNIs).
Voice Activity Detection (VAD) Devices
Referring to
General Electromagnetic Sensor (GEMS) VAD
The GEMS is a radiofrequency (RF) interferometer that operates in the 1-5 GHz frequency range at very low power, and can be used to detect vibrations of very small amplitude. The GEMS is used to detect vibrations of the trachea, neck, cheek, and head associated with the production of speech. These vibrations occur due to the opening and closing of the vocal folds associated with speech production, and detecting them can lead to a very accurate noise-robust VAD, as described in the Related Applications.
As the GEMS is an RF sensor, it uses an antenna. Very small (from approximately 4 mm by 7 mm to about 20 mm by 20 mm) micropatch antennae have been constructed and used that allow the GEMS to detect vibrations. These antennae are designed to be close to the skin for maximum efficiency. Other antennae may be used as well. The antennae may be mounted in the handset or earpiece in any manner, the only restriction being that sufficient energy to detect the vibration must reach the vibrating objects. In some cases this will require skin contact, in others skin contact may not be needed.
Surface Skin Vibration-Based VAD
As described in the Related Applications, accelerometers and devices called Skin Surface Microphones (SSMs) can be used to detect the skin vibrations that occur due to the production of speech. However, these sensors can be polluted by exterior acoustic noise, and so care must be taken in their placement and use. Accelerometers are well known and understood, and the SSM is a device that can also be used to detect vibrations, although not with the same fidelity as the accelerometer. Fortunately, constructing a VAD does not require high fidelity reproduction of the underlying vibration, just the ability to determine if vibrations are taking place. For this the SSM is well suited.
The SSM is a conventional microphone modified to prevent airborne acoustic information from coupling with the microphone's detecting elements. A layer of silicone gel or other covering changes the impedance of the microphone and prevents airborne acoustic information from being detected to a significant degree. Thus this microphone is shielded from airborne acoustic energy but is able to detect acoustic waves traveling in media other than air as long as it maintains physical contact with the media.
During speech, when the accelerometer/SSM is placed on the cheek or neck, vibrations associated with speech production are easily detected. However, the airborne acoustic data is not significantly detected by the accelerometer/SSM. The tissue-borne acoustic signal, upon detection by the accelerometer/SSM, is used to generate a VAD signal used to process and denoise the signal of interest.
Skin Vibrations in the Ear
One placement that can be used to cut down on the amount of external noise detected by the accelerometer/SSM and assure a good fit is to place the accelerometer/SSM in the ear canal. This is already done in some commercial products, such as Temco's Voiceducer, where the vibrations are directly used as the input to a communication system. In the noise suppression systems described herein, however, the accelerometer signal is only used to calculate a VAD signal. Therefore the accelerometer/SSM in the ear can be less sensitive and require less bandwidth, and thus be less expensive.
Skin Vibrations Outside the Ear
There are many locations outside the ear from which the accelerometer/SSM can detect skin vibrations associated with the production of speech. The accelerometer/SSM may be mounted in the handset or earpiece in any manner, the only restriction being that reliable skin contact is required to detect the skin-borne vibrations associated with the production of speech.
The areas of sensitivity 1102-1108 include areas of optimal sensitivity A-F where speech can be reliably detected by an SSM, under an embodiment. The areas of optimal sensitivity A-F include, but are not limited to, the area behind the ear A, the area below the ear B, the mid-cheek area C of the jaw, the area in front of the ear canal D, the area E inside the ear canal in contact with the mastoid bone or other vibrating tissue, and the nose F. Placement of an accelerometer/SSM in the proximity of any of these areas of sensitivity 1102-1108 will work with a headset, but a handset requires contact with the cheek, jaw, head, or neck. The above areas are only meant to guide, and there may be other areas not specified where useful vibrations can also be detected.
Two-Microphone Acoustic VAD
These VADs, which include array VAD, Pathfinder VAD, and stereo VAD, operate with two microphones and without any external hardware. Each of the array VAD, Pathfinder VAD, and stereo VAD takes advantage of the two-microphone configuration in a different way, as described below.
Array VAD
The array VAD, described further in the Related Applications, arranges the microphones in a simple linear array and detects the speech using the characteristics of the array. It functions best when the microphones and the user's mouth are collinear and the microphones are separated by a multiple of the distance sound travels in one sample period. That is, if the sampling frequency of the system is 8 kHz, and the speed of sound is approximately 345 m/s, then in one sample sound will travel
d = (345 m/s)·(1/8000 s) ≈ 4.3 cm
and the microphones should be separated by 4.3, 8.6, 12.9, … cm. Embodiments of the array VAD in both handsets and headsets are the same as the microphone configurations of
Pathfinder VAD
The Pathfinder VAD, also described further in the Related Applications, uses the gain of the differential transfer function H1(z) of the Pathfinder technique to determine when voicing is occurring. As such, it can be used with virtually any of the microphone configurations above with little modification. Very good performance has been noted with the UNI/UNI microphone configuration described above with reference to
Stereo VAD
The stereo VAD, also described further in the Related Applications, uses the difference in frequency amplitude from the noise and the speech to determine when speech is occurring. It uses a microphone configuration in which the SNR is larger in the speech microphone than in the noise microphone. Again, virtually any of the microphone configurations above can be configured to work with this VAD technique, but very good performance has been noted with the UNI/UNI microphone configuration described above with reference to
Manually Activated VAD
In this embodiment, the user or an outside observer manually activates the VAD, using a pushbutton or switching device. This can even be done offline, on a recording of the data recorded using one of the above configurations. Activation of the manual VAD device, or manually overriding an automatic VAD device like those described above, results in generation of a VAD signal. As this VAD does not rely on the microphones, it may be used with equal utility with any of the microphone configurations above.
Single-Microphone/Conventional VAD
Any conventional acoustic method can also be used with either or both of the speech and noise microphones to construct the VAD signal used by Pathfinder for noise suppression. For example, a conventional mobile phone VAD (see U.S. Pat. No. 6,453,291 of Ashley, where a VAD configuration appropriate to the front-end of a digital cellular system is described) can be used with the speech microphone to construct a VAD signal for use with the Pathfinder noise suppression system. In another embodiment, a “close talk” or gradient microphone may be used to record a high-SNR signal near the mouth, through which a VAD signal may be easily calculated. This microphone could be used as the speech microphone of the system, or could be completely separate. In the case where the gradient microphone is also used as the speech microphone of the system, the gradient microphone takes the place of the UNI microphones in either of the microphone array including mixed OMNI and UNI microphones when the UNI microphone is the speech microphone (described above with reference to
Pathfinder Noise Suppression System
As described above,
A VAD signal 106, derived in some manner, is used to control the method of noise removal. The acoustic information coming into MIC 1 is denoted by m1(n). The information coming into MIC 2 is similarly labeled m2(n). In the z (digital frequency) domain, we can represent them as M1(z) and M2(z). Thus
M1(z)=S(z)+N(z)H1(z)
M2(z)=N(z)+S(z)H2(z) (1)
This is the general case for all realistic two-microphone systems. There is always some leakage of noise into MIC 1, and some leakage of signal into MIC 2. Equation 1 has four unknowns and only two relationships and, therefore, cannot be solved explicitly.
However, perhaps there is some way to solve for some of the unknowns in Equation 1 by other means. Examine the case where the signal is not being generated, that is, where the VAD indicates voicing is not occurring. In this case, s(n)=S(z)=0, and Equation 1 reduces to
M1n(z)=N(z)H1(z)
M2n(z)=N(z)
where the n subscript on the M variables indicates that only noise is being received. This leads to

H1(z)=M1n(z)/M2n(z)
Now, H1(z) can be calculated using any of the available system identification algorithms and the microphone outputs when only noise is being received. The calculation should be done adaptively in order to allow the system to track any changes in the noise.
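The text leaves the choice of system-identification algorithm open; below is a minimal single-band sketch assuming a normalized LMS (NLMS) adaptive filter, with illustrative tap count and step size. The filter h stands in for H1 (the noise path from MIC 2 to MIC 1), adapts only while the VAD reports no speech, and its residual realizes the subtraction S(z) ≈ M1(z) − M2(z)H1(z) derived below:

```python
import numpy as np

def pathfinder_h1_denoise(mic1, mic2, vad, n_taps=32, mu=0.1, eps=1e-8):
    """Single-band sketch: adapt an FIR model h of H1 via NLMS during
    noise-only periods (vad == 0) and output the residual M1 - h*M2."""
    h = np.zeros(n_taps)
    out = np.zeros(len(mic1))
    for n in range(len(mic1)):
        x = mic2[max(0, n - n_taps + 1):n + 1][::-1]  # recent MIC 2 samples, newest first
        x = np.pad(x, (0, n_taps - len(x)))           # zero-pad at the start of the signal
        noise_est = h @ x                             # MIC 2 noise passed through modeled H1
        e = mic1[n] - noise_est                       # residual: estimate of the speech sample
        if vad[n] == 0:                               # noise only: track changes in the noise
            h += mu * e * x / (x @ x + eps)
        out[n] = e
    return out
```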
After solving for one of the unknowns in Equation 1, H2(z) can be solved for by using the VAD to determine when voicing is occurring with little noise. When the VAD indicates voicing, but the recent history (on the order of 1 second or so) of the microphones indicates low levels of noise, assume that n(n)=N(z)≈0. Then Equation 1 reduces to

M1s(z)=S(z)

M2s(z)=S(z)H2(z)

where the s subscript indicates that only speech is being received, which leads to

H2(z)=M2s(z)/M1s(z)
This calculation for H2(z) appears to be just the inverse of the H1(z) calculation, but remember that different inputs are being used as the calculation now takes place when speech is being produced. Note that H2(z) should be relatively constant, as there is always just a single source (the user) and the relative position between the user and the microphones should be relatively constant. Use of a small adaptive gain for the H2(z) calculation works well and makes the calculation more robust in the presence of noise.
Following the calculation of H1(z) and H2(z) above, they are used to remove the noise from the signal. Rewriting Equation 1 as

S(z)=M1(z)−N(z)H1(z)

N(z)=M2(z)−S(z)H2(z)
allows solving for S(z)

S(z)=[M1(z)−M2(z)H1(z)]/[1−H1(z)H2(z)]. (2)
Generally, H2(z) is quite small, and H1(z) is less than unity, so for most situations at most frequencies
H2(z)H1(z)<<1,
and the signal can be calculated using
S(z)≈M1(z)−M2(z)H1(z). (3)
Therefore the assumption is made that H2(z) is not needed, and H1(z) is the only transfer function that needs to be calculated. While H2(z) can be calculated if desired, good microphone placement and orientation can obviate the need for the H2(z) calculation.
Significant noise suppression can only be achieved through the use of multiple subbands in the processing of acoustic signals. This is because most adaptive filters used to calculate transfer functions are of the FIR type, which use only zeros, and not poles, to approximate a system that generally contains both poles and zeros, as in

H(z)=B(z)/A(z)=(b0+b1z−1+b2z−2+…)/(1+a1z−1+a2z−2+…).
Such a model can be sufficiently accurate given enough taps, but this can greatly increase computational cost and convergence time. What generally occurs in an energy-based adaptive filter system such as the least-mean squares (LMS) system is that the system matches the magnitude and phase well at a small range of frequencies that contain more energy than other frequencies. This allows the LMS to fulfill its requirement to minimize the energy of the error to the best of its ability, but this fit may cause the noise in areas outside of the matching frequencies to rise, reducing the effectiveness of the noise suppression.
The use of subbands alleviates this problem. The signals from both the primary and secondary microphones are filtered into multiple subbands, and the resulting data from each subband (which can be frequency shifted and decimated if desired, but it is not necessary) is sent to its own adaptive filter. This forces the adaptive filter to try to fit the data in its own subband, rather than just where the energy is highest in the signal. The noise-suppressed results from each subband can be added together to form the final denoised signal at the end. Keeping everything time-aligned and compensating for filter shifts is not easy, but the result is a much better model to the system at the cost of increased memory and processing requirements.
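A sketch of that subband structure, assuming a bank of Butterworth filters (band edges and filter order are illustrative) and reusing the single-band routine from the earlier sketch; frequency shifting and decimation are omitted, as the text notes they are optional:

```python
import numpy as np
from scipy.signal import butter, lfilter

def subband_denoise(mic1, mic2, vad, fs=8000, edges=(0, 1000, 2000, 3000, 3999)):
    """Filter both microphones into subbands, run an independent adaptive
    filter per band, and sum the per-band residuals into one output."""
    total = np.zeros(len(mic1))
    for lo, hi in zip(edges[:-1], edges[1:]):
        if lo == 0:  # DC-anchored band: low-pass instead of band-pass
            b, a = butter(4, hi / (fs / 2), btype="low")
        else:
            b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        band1 = lfilter(b, a, mic1)
        band2 = lfilter(b, a, mic2)
        total += pathfinder_h1_denoise(band1, band2, vad)  # per-band H1 model
    return total
```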
At first glance, it may seem as if the Pathfinder algorithm is very similar to other algorithms such as classical ANC (adaptive noise cancellation), shown in
Regarding the use of VAD to control adaptation of the noise suppression system to the received signals, classical ANC uses no VAD information. Since, during speech production, there is signal in the reference microphone, adapting the coefficients of H1(z) (the path from the noise to the primary microphone) during speech production would result in the removal of a large part of the speech energy from the signal of interest. The result is signal distortion and reduction (de-signaling). Therefore, the various methods described above use VAD information to construct a sufficiently accurate VAD signal to instruct the Pathfinder system when to adapt the coefficients of H1 (noise only) and H2 (if needed, when speech is being produced).
An important difference between classical ANC and the Pathfinder system involves subbanding of the acoustic data, as described above. Many subbands are used by the Pathfinder system to support application of the LMS algorithm on information of the subbands individually, thereby ensuring adequate convergence across the spectrum of interest and allowing the Pathfinder system to be effective across the spectrum.
Because the ANC algorithm generally uses the LMS adaptive filter to model H1, and this model uses all zeros to build filters, it is unlikely that a “real” functioning system can be modeled accurately in this way. Functioning systems almost invariably have both poles and zeros, and therefore have very different frequency responses than those of the LMS filter. Often, the best the LMS can do is to match the phase and magnitude of the real system at a single frequency (or a very small range), so that outside this range the model fit is very poor and can result in an increase of noise energy in these areas. Therefore, application of the LMS algorithm across the entire spectrum of the acoustic data of interest often results in degradation of the signal of interest at frequencies with a poor magnitude/phase match.
Finally, the Pathfinder algorithm supports operation with the acoustic signal of interest in the reference microphone of the system. Allowing the acoustic signal to be received by the reference microphone means that the microphones can be much more closely positioned relative to each other (on the order of a centimeter) than in classical ANC configurations. This closer spacing simplifies the adaptive filter calculations and enables more compact microphone configurations/solutions. Also, special microphone configurations have been developed that minimize signal distortion and de-signaling, and support modeling of the signal path between the signal source of interest and the reference microphone.
In an embodiment, the use of directional microphones ensures that the transfer function does not approach unity. Even with directional microphones, some signal is received into the noise microphone. If this is ignored and it is assumed that H2(z)=0, then, assuming a perfect VAD, there will be some distortion. This can be seen by referring to Equation 2 and solving for the result when H2(z) is not included:
S(z)[1−H2(z)H1(z)]=M1(z)−M2(z)H1(z). (4)
This shows that the signal will be distorted by the factor [1−H2(z)H1(z)]. Therefore, the type and amount of distortion will change depending on the noise environment. With very little noise, H1(z) is approximately zero and there is very little distortion. With noise present, the amount of distortion may change with the type, location, and intensity of the noise source(s). Good microphone configuration design minimizes these distortions.
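The distortion factor can be evaluated numerically; the following sketch uses illustrative FIR stand-ins for H1(z) and H2(z), with hypothetical coefficient values:

```python
import numpy as np

def distortion_factor(h1, h2, n_fft=1024):
    """Magnitude of [1 - H2(z)H1(z)] across frequency for FIR coefficient
    arrays h1 and h2; values near unity mean the H2(z)=0 assumption
    introduces little distortion at that frequency."""
    H1 = np.fft.rfft(h1, n_fft)
    H2 = np.fft.rfft(h2, n_fft)
    return np.abs(1.0 - H2 * H1)

# Modest noise coupling (h1) and small speech leakage (h2): factor stays near 1
print(distortion_factor(np.array([0.3, 0.1]), np.array([0.1]))[:4].round(3))
```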
The calculation of H1 in each subband is implemented when the VAD indicates that voicing is not occurring or when voicing is occurring but the SNR of the subband is sufficiently low. Conversely, H2 can be calculated in each subband when the VAD indicates that speech is occurring and the subband SNR is sufficiently high. However, with proper microphone placement and processing, signal distortion can be minimized and only H1 need be calculated. This significantly reduces the processing required and simplifies the implementation of the Pathfinder algorithm. Where classical ANC does not allow any signal into MIC 2, the Pathfinder algorithm tolerates signal in MIC 2 when using the appropriate microphone configuration. An embodiment of an appropriate microphone configuration, as described above with reference to
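The adaptation rules in the preceding paragraph reduce to a simple per-subband gate; a sketch with hypothetical dB thresholds (the text requires only “sufficiently low/high” SNR):

```python
def adaptation_target(vad_active, subband_snr_db, snr_low_db=0.0, snr_high_db=10.0):
    """Decide which transfer function, if any, may adapt in this subband:
    H1 when noise dominates, H2 (optional) during clean speech, else none."""
    if not vad_active or subband_snr_db < snr_low_db:
        return "H1"   # no voicing, or voicing with low subband SNR
    if subband_snr_db > snr_high_db:
        return "H2"   # voicing with high subband SNR (only if H2 is used)
    return None       # mixed conditions: freeze both filters
```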
Perhaps the best way to demonstrate the dependence of the noise suppression on the VAD is to examine the effect of VAD errors on the denoising in the context of a VAD failure. There are two types of errors that can occur. False positives (FP) are when the VAD indicates that voicing has occurred when it has not, and false negatives (FN) are when the VAD does not detect that speech has occurred. False positives are only troublesome if they happen too often, as an occasional FP will only cause the H1 coefficients to stop updating briefly, and experience has shown that this does not appreciably affect the noise suppression performance. False negatives, on the other hand, can cause problems, especially if the SNR of the missed speech is high.
Assuming that there is speech and noise in both microphones of the system, and the system only detects the noise because the VAD failed and returned a false negative, the signal at MIC 2 is
M2=H1N+H2S,
where the z's have been suppressed for clarity. Since the VAD indicates only the presence of noise, the system attempts to model the system above as a single noise and a single transfer function according to
TF model=H̃1Ñ.
The Pathfinder system uses an LMS algorithm to calculate H̃1, but the LMS algorithm is generally best at modeling time-invariant, all-zero systems. Since it is unlikely that the noise and speech signal are correlated, the system generally models either the speech and its associated transfer function or the noise and its associated transfer function, depending on the SNR of the data in MIC 1, the ability to model H1 and H2, and the time-invariance of H1 and H2, as described below.
Regarding the SNR of the data in MIC 1, a very low SNR (less than zero (0)) tends to cause the Pathfinder system to converge to the noise transfer function. In contrast, a high SNR (greater than zero (0)) tends to cause the Pathfinder system to converge to the speech transfer function. As for the ability to model H1 and H2, if either H1 or H2 is more easily modeled using LMS (an all-zero model), the Pathfinder system tends to converge to that respective transfer function.
In describing the dependence of the system modeling on the time-invariance of H1 and H2, consider that LMS is best at modeling time-invariant systems. Thus, the Pathfinder system would generally tend to converge to H2, since H2 changes much more slowly than H1 is likely to change.
If the LMS models the speech transfer function rather than the noise transfer function, then the speech is classified as noise and removed as long as the coefficients of the LMS filter remain the same or similar. Therefore, after the Pathfinder system has converged to a model of the speech transfer function H2 (which can occur in on the order of a few milliseconds), any subsequent speech (even speech where the VAD has not failed) has energy removed from it as well, because the system “assumes” that this speech is noise: its transfer function is similar to the one modeled when the VAD failed. In this case, where H2 is primarily being modeled, the noise will be either unaffected or only partially removed.
The end result of the process is a reduction in volume and distortion of the cleaned speech, the severity of which is determined by the variables described above. If the system tends to converge to H1, the subsequent gain loss and distortion of the speech will not be significant. If, however, the system tends to converge to H2, then the speech can be severely distorted.
This VAD failure analysis does not attempt to describe the subtleties associated with the use of subbands and the location, type, and orientation of the microphones, but is meant to convey the importance of the VAD to the denoising. The results above are applicable to a single subband or an arbitrary number of subbands, because the interactions in each subband are the same.
In addition, the dependence on the VAD and the problems arising from VAD errors described in the above VAD failure analysis are not limited to the Pathfinder noise suppression system. Any adaptive filter noise suppression system that uses a VAD to determine how to denoise will be similarly affected. In this disclosure, when the Pathfinder noise suppression system is referred to, it should be kept in mind that all noise suppression systems that use multiple microphones to estimate the noise waveform and subtract it from a signal including both speech and noise, and that depend on VAD for reliable operation, are included in that reference. Pathfinder is simply a convenient reference implementation.
The microphone and VAD configurations described above are for use with communication systems, wherein the communication systems comprise: a voice detection subsystem receiving voice activity signals that include information of human voicing activity and automatically generating control signals using information of the voice activity signals; and a denoising subsystem coupled to the voice detection subsystem, the denoising subsystem including microphones coupled to provide acoustic signals of an environment to components of the denoising subsystem, a configuration of the microphones including two unidirectional microphones separated by a distance and having an angle between maximums of a spatial response curve of each microphone, components of the denoising subsystem automatically selecting at least one denoising method appropriate to data of at least one frequency subband of the acoustic signals using the control signals and processing the acoustic signals using the selected denoising method to generate denoised acoustic signals, wherein the denoising method includes generating a noise waveform estimate associated with noise of the acoustic signals and subtracting the noise waveform estimate from the acoustic signal when the acoustic signal includes speech and noise.
The two unidirectional microphones are separated by a distance approximately in the range of zero (0) to 15 centimeters.
The two unidirectional microphones have an angle between maximums of a spatial response curve of each microphone approximately in the range of zero (0) to 180 degrees.
The voice detection subsystem of an embodiment further comprises at least one glottal electromagnetic micropower sensor (GEMS) including at least one antenna for receiving the voice activity signals, and at least one voice activity detector (VAD) algorithm for processing the GEMS voice activity signals and generating the control signals.
The voice detection subsystem of another embodiment further comprises at least one accelerometer sensor in contact with skin of a user for receiving the voice activity signals, and at least one voice activity detector (VAD) algorithm for processing the accelerometer sensor voice activity signals and generating the control signals.
The voice detection subsystem of yet another embodiment further comprises at least one skin-surface microphone sensor in contact with skin of a user for receiving the voice activity signals, and at least one voice activity detector (VAD) algorithm for processing the skin-surface microphone sensor voice activity signals and generating the control signals.
The voice detection subsystem can also receive voice activity signals via couplings with the microphones.
The voice detection subsystem of still another embodiment further comprises two unidirectional microphones separated by a distance and having an angle between maximums of a spatial response curve of each microphone, wherein the distance is approximately in the range of zero (0) to 15 centimeters and wherein the angle is approximately in the range of zero (0) to 180 degrees, and at least one voice activity detector (VAD) algorithm for processing the voice activity signals and generating the control signals.
The voice detection subsystem of other alternative embodiments further comprises at least one manually activated voice activity detector (VAD) for generating the voice activity signals.
The communications system of an embodiment further includes a portable handset that includes the microphones, wherein the portable handset includes at least one of cellular telephones, satellite telephones, portable telephones, wireline telephones, Internet telephones, wireless transceivers, wireless communication radios, personal digital assistants (PDAs), and personal computers (PCs). The portable handset can include at least one of the voice detection subsystem and the denoising subsystem.
The communications system of an embodiment further includes a portable headset that includes the microphones along with at least one speaker device. The portable headset couples to at least one communication device selected from among cellular telephones, satellite telephones, portable telephones, wireline telephones, Internet telephones, wireless transceivers, wireless communication radios, personal digital assistants (PDAs), and personal computers (PCs). The portable headset couples to the communication device using at least one of wireless couplings, wired couplings, and combination wireless and wired couplings.
The communication device can include at least one of the voice detection subsystem and the denoising subsystem. Alternatively, the portable headset can include at least one of the voice detection subsystem and the denoising subsystem.
The portable headset described above is a portable communication device selected from among cellular telephones, satellite telephones, portable telephones, wireline telephones, Internet telephones, wireless transceivers, wireless communication radios, personal digital assistants (PDAs), and personal computers (PCs).
The microphone and VAD configurations described above are for use with communication systems of alternative embodiments, wherein the communication systems comprise: a voice detection subsystem receiving voice activity signals that include information of human voicing activity and automatically generating control signals using information of the voice activity signals; and a denoising subsystem coupled to the voice detection subsystem, the denoising subsystem including microphones coupled to provide acoustic signals of an environment to components of the denoising subsystem, a configuration of the microphones including an omnidirectional microphone and a unidirectional microphone separated by a distance, components of the denoising subsystem automatically selecting at least one denoising method appropriate to data of at least one frequency subband of the acoustic signals using the control signals and processing the acoustic signals using the selected denoising method to generate denoised acoustic signals, wherein the denoising method includes generating a noise waveform estimate associated with noise of the acoustic signals and subtracting the noise waveform estimate from the acoustic signal when the acoustic signal includes speech and noise.
The omnidirectional and unidirectional microphones are separated by a distance approximately in the range of zero (0) to 15 centimeters.
The omnidirectional microphone is oriented to capture signals from at least one speech signal source and the unidirectional microphone is oriented to capture signals from at least one noise signal source, wherein an angle between the speech signal source and a maximum of a spatial response curve of the unidirectional microphone is approximately in the range of 45 to 180 degrees.
The voice detection subsystem of an embodiment further comprises at least one glottal electromagnetic micropower sensor (GEMS) including at least one antenna for receiving the voice activity signals, and at least one voice activity detector (VAD) algorithm for processing the GEMS voice activity signals and generating the control signals.
The voice detection subsystem of another embodiment further comprises at least one accelerometer sensor in contact with skin of a user for receiving the voice activity signals, and at least one voice activity detector (VAD) algorithm for processing the accelerometer sensor voice activity signals and generating the control signals.
The voice detection subsystem of yet another embodiment further comprises at least one skin-surface microphone sensor in contact with skin of a user for receiving the voice activity signals, and at least one voice activity detector (VAD) algorithm for processing the skin-surface microphone sensor voice activity signals and generating the control signals.
The voice detection subsystem of yet other embodiments further comprises two unidirectional microphones separated by a distance and having an angle between maximums of a spatial response curve of each microphone, wherein the distance is approximately in the range of zero (0) to 15 centimeters and wherein the angle is approximately in the range of zero (0) to 180 degrees, and at least one voice activity detector (VAD) algorithm for processing the voice activity signals and generating the control signals.
The voice detection subsystem can also include at least one manually activated voice activity detector (VAD) for generating the voice activity signals.
The communications system of an embodiment further includes a portable handset that includes the microphones, wherein the portable handset includes at least one of cellular telephones, satellite telephones, portable telephones, wireline telephones, Internet telephones, wireless transceivers, wireless communication radios, personal digital assistants (PDAs), and personal computers (PCs). The portable handset can include at least one of the voice detection subsystem and the denoising subsystem.
The communications system of an embodiment further includes a portable headset that includes the microphones along with at least one speaker device. The portable headset couples to at least one communication device selected from among cellular telephones, satellite telephones, portable telephones, wireline telephones, Internet telephones, wireless transceivers, wireless communication radios, personal digital assistants (PDAs), and personal computers (PCs). The portable headset couples to the communication device using at least one of wireless couplings, wired couplings, and combination wireless and wired couplings. In one embodiment, the communication device includes at least one of the voice detection subsystem and the denoising subsystem. In an alternative embodiment, the portable headset includes at least one of the voice detection subsystem and the denoising subsystem.
The portable headset described above is a portable communication device selected from among cellular telephones, satellite telephones, portable telephones, wireline telephones, Internet telephones, wireless transceivers, wireless communication radios, personal digital assistants (PDAs), and personal computers (PCs).
The microphone and VAD configurations described above are for use with communication systems comprising: at least one transceiver for use in a communications network; a voice detection subsystem receiving voice activity signals that include information of human voicing activity and automatically generating control signals using information of the voice activity signals; and a denoising subsystem coupled to the voice detection subsystem, the denoising subsystem including microphones coupled to provide acoustic signals of an environment to components of the denoising subsystem, a configuration of the microphones including a first microphone and a second microphone separated by a distance and having an angle between maximums of a spatial response curve of each microphone, components of the denoising subsystem automatically selecting at least one denoising method appropriate to data of at least one frequency subband of the acoustic signals using the control signals and processing the acoustic signals using the selected denoising method to generate denoised acoustic signals, wherein the denoising method includes generating a noise waveform estimate associated with noise of the acoustic signals and subtracting the noise waveform estimate from the acoustic signal when the acoustic signal includes speech and noise.
In an embodiment, each of the first and second microphones is a unidirectional microphone, wherein the distance is approximately in the range of zero (0) to 15 centimeters and the angle is approximately in the range of zero (0) to 180 degrees.
In an embodiment, the first microphone is an omnidirectional microphone and the second microphone is a unidirectional microphone, wherein the first microphone is oriented to capture signals from at least one speech signal source and the second microphone is oriented to capture signals from at least one noise signal source, wherein an angle between the speech signal source and a maximum of a spatial response curve of the second microphone is approximately in the range of 45 to 180 degrees.
The transceiver of an embodiment includes the first and second microphones, but is not so limited.
The transceiver can couple information between the communications network and a user via a headset. The headset used with the transceiver can include the first and second microphones.
Aspects of the invention may be implemented as functionality programmed into any of a variety of circuitry, including programmable logic devices (PLDs), such as field programmable gate arrays (FPGAs), programmable array logic (PAL) devices, electrically programmable logic and memory devices and standard cell-based devices, as well as application specific integrated circuits (ASICs). Some other possibilities for implementing aspects of the invention include: microcontrollers with memory (such as electronically erasable programmable read only memory (EEPROM)), embedded microprocessors, firmware, software, etc. If aspects of the invention are embodied as software during at least one stage of manufacturing (e.g., before being embedded in firmware or in a PLD), the software may be carried by any computer-readable medium, such as magnetically- or optically-readable disks (fixed or floppy), modulated on a carrier signal or otherwise transmitted, etc.
Furthermore, aspects of the invention may be embodied in microprocessors having software-based circuit emulation, discrete logic (sequential and combinatorial), custom devices, fuzzy (neural) logic, quantum devices, and hybrids of any of the above device types. Of course the underlying device technologies may be provided in a variety of component types, e.g., metal-oxide semiconductor field-effect transistor (MOSFET) technologies like complementary metal-oxide semiconductor (CMOS), bipolar technologies like emitter-coupled logic (ECL), polymer technologies (e.g., silicon-conjugated polymer and metal-conjugated polymer-metal structures), mixed analog and digital, etc.
Unless the context clearly requires otherwise, throughout the description and the claims, the words “comprise,” “comprising,” and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is to say, in a sense of “including, but not limited to.” Words using the singular or plural number also include the plural or singular number respectively. Additionally, the words “herein,” “hereunder,” “above,” “below,” and words of similar import, when used in this application, shall refer to this application as a whole and not to any particular portions of this application. When the word “or” is used in reference to a list of two or more items, that word covers all of the following interpretations of the word: any of the items in the list, all of the items in the list and any combination of the items in the list.
The above descriptions of embodiments of the invention are not intended to be exhaustive or to limit the invention to the precise forms disclosed. While specific embodiments of, and examples for, the invention are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize. The teachings of the invention provided herein can be applied to other processing systems and communication systems, not only the communication systems described above.
The elements and acts of the various embodiments described above can be combined to provide further embodiments. These and other changes can be made to the invention in light of the above detailed description.
All of the above references and United States patent applications are incorporated herein by reference. Aspects of the invention can be modified, if necessary, to employ the systems, functions and concepts of the various patents and applications described above to provide yet further embodiments of the invention.
In general, in the following claims, the terms used should not be construed to limit the invention to the specific embodiments disclosed in the specification and the claims, but should be construed to include all processing systems that operate under the claims to provide the noise suppression methods described herein. Accordingly, the invention is not limited by the disclosure, but instead the scope of the invention is to be determined entirely by the claims.
While certain aspects of the invention are presented below in certain claim forms, the inventors contemplate the various aspects of the invention in any number of claim forms. For example, while only one aspect of the invention is recited as embodied in a computer-readable medium, other aspects may likewise be embodied in a computer-readable medium. Accordingly, the inventors reserve the right to add additional claims after filing the application to pursue such additional claim forms for other aspects of the invention.
This application claims priority from U.S. Patent Application No. 60/368,209, entitled MICROPHONE AND VOICE ACTIVITY DETECTION (VAD) CONFIGURATIONS FOR USE WITH PORTABLE COMMUNICATION SYSTEMS, filed Mar. 27, 2002. Further, this application relates to the following U.S. Patent Applications: Application Ser. No. 09/905,361, entitled METHOD AND APPARATUS FOR REMOVING NOISE FROM ELECTRONIC SIGNALS, filed Jul. 12, 2001; application Ser. No. 10/159,770, entitled DETECTING VOICED AND UNVOICED SPEECH USING BOTH ACOUSTIC AND NONACOUSTIC SENSORS, filed May 30, 2002; application Ser. No. 10/301,237, entitled METHOD AND APPARATUS FOR REMOVING NOISE FROM ELECTRONIC SIGNALS, filed Nov. 21, 2002; and application Ser. No. 10/383,162, entitled VOICE ACTIVITY DETECTION (VAD) DEVICES AND METHODS FOR USE WITH NOISE SUPPRESSION SYSTEMS, filed Mar. 5, 2003.