The disclosed embodiments relate to systems and methods for detecting and processing a desired signal in the presence of acoustic noise.
Many noise suppression algorithms and techniques have been developed over the years. Most of the noise suppression systems in use today for speech communication systems are based on a single-microphone spectral subtraction technique first developed in the 1970s and described, for example, by S. F. Boll in “Suppression of Acoustic Noise in Speech using Spectral Subtraction,” IEEE Trans. on ASSP, pp. 113-120, 1979. These techniques have been refined over the years, but the basic principles of operation have remained the same. See, for example, U.S. Pat. No. 5,687,243 of McLaughlin, et al., and U.S. Pat. No. 4,811,404 of Vilmur, et al. Generally, these techniques make use of a microphone-based Voice Activity Detector (VAD) to determine the background noise characteristics, where “voice” is generally understood to include human voiced speech, unvoiced speech, or a combination of voiced and unvoiced speech.
The VAD has also been used in digital cellular systems. As an example of such a use, see U.S. Pat. No. 6,453,291 of Ashley, where a VAD configuration appropriate to the front-end of a digital cellular system is described. Further, some Code Division Multiple Access (CDMA) systems utilize a VAD to minimize the effective radio spectrum used, thereby allowing for more system capacity. Also, Global System for Mobile Communication (GSM) systems can include a VAD to reduce co-channel interference and to reduce battery consumption on the client or subscriber device.
These typical microphone-based VAD systems are significantly limited in capability as a result of the addition of environmental acoustic noise to the desired speech signal received by the single microphone, wherein the analysis is performed using typical signal processing techniques. In particular, limitations in performance of these microphone-based VAD systems are noted when processing signals having a low signal-to-noise ratio (SNR), and in settings where the background noise varies quickly. Thus, similar limitations are found in noise suppression systems using these microphone-based VADs.
A VAD signal 104, derived in some manner, is used to control the method of noise removal, and is related to the noise suppression technique discussed below as shown in
M1(z)=S(z)+N(z)H1(z)
M2(z)=N(z)+S(z)H2(z) (1)
This is the general case for all realistic two-microphone systems. There is always some leakage of noise into MIC 1, and some leakage of signal into MIC 2. Equation 1 has four unknowns and only two relationships and, therefore, cannot be solved explicitly. However, perhaps there is some way to solve for some of the unknowns in Equation 1 by other means. Examine the case where the signal is not being generated, that is, where the VAD indicates voicing is not occurring. In this case, s(n)=S(z)=0, and Equation 1 reduces to
M1n(z)=N(z)H1(z)
M2n(z)=N(z)
where the n subscript on the M variables indicates that only noise is being received. This leads to
H1(z)=M1n(z)/M2n(z)
Now, H1(z) can be calculated using any of the available system identification algorithms and the microphone outputs when only noise is being received. The calculation should be done adaptively in order to allow the system to track any changes in the noise.
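As an illustration of the adaptive calculation described above, the following minimal sketch identifies an FIR approximation of H1(z) from a noise-only segment using a normalized LMS (NLMS) update. The choice of NLMS, the tap count, the step size, and the synthetic noise path true_h1 are assumptions made for this example only; any system identification algorithm could be substituted.

```python
import random

def nlms_identify(reference, target, num_taps=4, step=0.5, eps=1e-8):
    """Estimate FIR taps h so that reference filtered by h approximates target."""
    h = [0.0] * num_taps
    history = [0.0] * num_taps  # most recent reference samples, newest first
    for x, d in zip(reference, target):
        history = [x] + history[:-1]
        estimate = sum(hk * xk for hk, xk in zip(h, history))
        err = d - estimate
        norm = sum(xk * xk for xk in history) + eps
        h = [hk + step * err * xk / norm for hk, xk in zip(h, history)]
    return h

# Noise-only segment (VAD reports no voicing): M2n = N and M1n = N filtered by H1.
random.seed(0)
true_h1 = [0.6, -0.3, 0.1]  # hypothetical noise path, for illustration only
noise = [random.gauss(0.0, 1.0) for _ in range(4000)]
m2n = noise
m1n = [sum(true_h1[k] * noise[n - k] for k in range(len(true_h1)) if n - k >= 0)
       for n in range(len(noise))]

h1_est = nlms_identify(m2n, m1n, num_taps=4)
print([round(v, 3) for v in h1_est])
```

In a running system the update would be applied only while the VAD indicates that no voicing is occurring, so that the estimate tracks changes in the noise without being corrupted by speech.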
After solving for one of the unknowns in Equation 1, H2(z) can be solved for by using the VAD to determine when voicing is occurring with little noise. When the VAD indicates voicing, but the recent (on the order of 1 second or so) history of the microphones indicates low levels of noise, assume that n(s)=N(z)≈0. Then Equation 1 reduces to
M1s(z)=S(z)
M2s(z)=S(z)H2(z)
which in turn leads to
H2(z)=M2s(z)/M1s(z)
This calculation for H2(z) appears to be just the inverse of the H1(z) calculation, but remember that different inputs are being used. Note that H2(z) should be relatively constant, as there is always just a single source (the user) and the relative position between the user and the microphones should be relatively constant. Use of a small adaptive gain for the H2(z) calculation works well and makes the calculation more robust in the presence of noise.
Following the calculation of H1(z) and H2(z) above, they are used to remove the noise from the signal. Rewriting Equation 1 as
S(z)=M1(z)−N(z)H1(z)
N(z)=M2(z)−S(z)H2(z)
S(z)=M1(z)−[M2(z)−S(z)H2(z)]H1(z)
S(z)[1−H2(z)H1(z)]=M1(z)−M2(z)H1(z)
allows solving for S(z):
S(z)=[M1(z)−M2(z)H1(z)]/[1−H2(z)H1(z)] (3)
Generally, H2(z) is quite small, and H1(z) is less than unity, so for most situations at most frequencies
H2(z)H1(z)<<1,
and the signal can be estimated using
S(z)≈M1(z)−M2(z)H1(z) (4)
Therefore the assumption is made that H2(z) is not needed, and H1(z) is the only transfer function to be calculated. While H2(z) can be calculated if desired, good microphone placement and orientation can obviate the need for the H2(z) calculation.
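The effect of dropping H2(z) can be checked numerically at a single frequency. The sketch below evaluates the microphone model of Equation 1 with complex values, then compares the exact solution obtained from S(z)[1−H2(z)H1(z)]=M1(z)−M2(z)H1(z) with the approximation of Equation 4. All numeric values are hypothetical and chosen only for illustration.

```python
def recover_exact(m1, m2, h1, h2):
    """Solve S[1 - H2*H1] = M1 - M2*H1 for S at one frequency."""
    return (m1 - m2 * h1) / (1 - h2 * h1)

def recover_approx(m1, m2, h1):
    """Equation 4: S is approximately M1 - M2*H1 when H2*H1 << 1."""
    return m1 - m2 * h1

# Hypothetical single-frequency values (complex gains and spectra).
s = 1.0 + 0.5j      # speech
n = -0.4 + 0.8j     # noise
h1 = 0.7 - 0.2j     # noise path into MIC 1
h2 = 0.05 + 0.02j   # speech leakage into MIC 2 (small)

m1 = s + n * h1     # Equation 1, first line
m2 = n + s * h2     # Equation 1, second line

s_exact = recover_exact(m1, m2, h1, h2)
s_approx = recover_approx(m1, m2, h1)
distortion = s_approx / s   # equals 1 - h2*h1

print(abs(s_exact - s), abs(distortion))
```

With these values the approximation removes the noise term entirely and leaves the speech scaled by 1−H2(z)H1(z), which stays close to unity as long as H2(z) is small.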
Significant noise suppression can best be achieved through the use of multiple subbands in the processing of acoustic signals. This is because most adaptive filters used to calculate transfer functions are of the FIR type, which use only zeros and not poles to calculate a system that contains both zeros and poles as
Such a model can be sufficiently accurate given enough taps, but this can greatly increase computational cost and convergence time. What generally occurs in an energy-based adaptive filter system such as the least-mean-squares (LMS) system is that the system matches the magnitude and phase well at a small range of frequencies that contains more energy than other frequencies. This allows the LMS to fulfill its requirement to minimize the energy of the error to the best of its ability, but this fit may cause the noise in areas outside of the matching frequencies to rise, reducing the effectiveness of the noise suppression.
The use of subbands alleviates this problem. The signals from both the primary and secondary microphones are filtered into multiple subbands, and the resulting data from each subband (which can be frequency shifted and decimated if desired, although this is not necessary) is sent to its own adaptive filter. This forces the adaptive filter to try to fit the data in its own subband, rather than just where the energy is highest in the signal. The noise-suppressed results from each subband can be added together to form the final denoised signal at the end. Keeping everything time-aligned and compensating for filter shifts is essential, and the result is a much better model of the system than the single-subband model, at the cost of increased memory and processing requirements.
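A minimal sketch of the subband idea follows. It assumes a deliberately simple two-band split (a moving-average low band and its time-aligned complement) as a stand-in for a real filter bank, an NLMS canceller per band, and a synthetic noise-only input with a hypothetical noise path h1. None of these choices is taken from the embodiment; the sketch only shows each band being fitted by its own adaptive filter and the band outputs being summed with a common delay.

```python
import random

def split_two_bands(x, length=5):
    """Split x into a moving-average low band and the complementary high band.

    Both bands share a delay of (length - 1) // 2 samples, so
    low[n] + high[n] equals x[n - delay].
    """
    delay = (length - 1) // 2
    low, high = [], []
    for n in range(len(x)):
        window = [x[n - k] if n - k >= 0 else 0.0 for k in range(length)]
        lo = sum(window) / length
        delayed = x[n - delay] if n - delay >= 0 else 0.0
        low.append(lo)
        high.append(delayed - lo)
    return low, high, delay

def nlms_cancel(reference, primary, num_taps=4, step=0.5, eps=1e-8):
    """Adaptively predict primary from reference and return the residual."""
    taps = [0.0] * num_taps
    history = [0.0] * num_taps
    residual = []
    for x, d in zip(reference, primary):
        history = [x] + history[:-1]
        estimate = sum(t * h for t, h in zip(taps, history))
        err = d - estimate
        norm = sum(h * h for h in history) + eps
        taps = [t + step * err * h / norm for t, h in zip(taps, history)]
        residual.append(err)
    return residual

# Synthetic noise-only input: MIC 2 hears the noise, MIC 1 hears it through h1.
random.seed(1)
h1 = [0.6, -0.3, 0.1]  # hypothetical noise path
noise = [random.gauss(0.0, 1.0) for _ in range(4000)]
m2 = noise
m1 = [sum(h1[k] * noise[n - k] for k in range(len(h1)) if n - k >= 0)
      for n in range(len(noise))]

m1_low, m1_high, delay = split_two_bands(m1)
m2_low, m2_high, _ = split_two_bands(m2)

# Each subband gets its own adaptive filter; the residuals are then summed.
denoised = [a + b for a, b in zip(nlms_cancel(m2_low, m1_low),
                                  nlms_cancel(m2_high, m1_high))]
```

Because both microphones are split by the same filters, the bands stay time-aligned and the summed output corresponds to the primary signal delayed by `delay` samples with the noise removed.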
An example of the noise suppression performance using this system with an SSM VAD device is shown in
More information may be found in the applications referenced above in the Introduction, part 1.
Microphone Configuration
In an embodiment of the Pathfinder noise suppression system, unidirectional or omnidirectional microphones may be employed. A variety of microphone configurations that enable Pathfinder are shown in the references in the Introduction, part 2. We will examine only a single embodiment as implemented in the Jawbone headset, but many implementations are possible as described in the references cited in the Introduction, so we are not so limited by this embodiment.
The use of directional microphones has been very successful and is used to ensure that the transfer functions H1(z) and H2(z) remain significantly different. If they are too similar, the desired speech of the user can be significantly distorted. Even when they are dissimilar, some speech signal is received by the noise microphone. If it is assumed that H2(z)=0, then, as in Equation 4 above, even assuming a perfect VAD there will be some distortion. This can be seen by referring to Equation 3 and solving for the result when H2(z) is not included:
S(z)[1−H2(z)H1(z)]=M1(z)−M2(z)H1(z) (5)
This shows that the signal will be distorted by the factor [1−H2(z)H1(z)]. Therefore, the type and amount of distortion will change depending on the noise environment. With very little noise, H1(z) is nearly zero and there is very little distortion. With noise present, the amount of distortion may change with the type, location, and intensity of the noise source(s). Good microphone configuration design minimizes these distortions.
An embodiment of an appropriate microphone configuration is one in which two directional microphones are used as shown in configuration 500 in
is greater than 1 (730), and where the response of Mic1 is less than that of Mic2, G is less than 1 (720). Clearly, as the angle f between the microphones is varied, the amount of overlap, and thus the areas where G is greater or less than one, varies as well. This variation affects the noise suppression performance both in terms of the amount of noise suppression and the amount of speech distortion, and a good compromise between the two must be found by adjusting f until satisfactory performance is realized.
In addition, the overlap of microphone responses can be induced or further changed by the addition of front and rear vents to the microphone mount. These vents change the response of the microphone by altering the delay between the front and rear faces of the diaphragm. Thus, vents can be used to alter the response overlap and thereby change the denoising performance of the system.
A good microphone configuration can be difficult to construct. The foundation of the process is to use two microphones that have similar noise fields and different speech fields. Simply put, to the microphones the noise should appear to be about the same and the speech should be different. This similarity for noise and difference for speech allows the algorithm to remove noise efficiently and remove speech poorly, which is desired. Proximity effects can be used to further increase the noise/speech difference (NSD) when the microphones are located close to the mouth, but orientation is the primary difference vehicle when the microphones are more than about five to ten centimeters from the mouth. The NSD is defined as the amount of difference in the speech energy detected by the microphones minus the difference in the noise energy in dB. NSDs of 4-6 dB result in both good noise suppression and low speech distortion. NSDs of 0-4 dB result in excellent noise suppression but high speech distortion, and NSDs of 6+ dB result in good to poor noise suppression and very low speech distortion. Naturally, since the response of a directional microphone is directly related to frequency, the NSD will also be frequency dependent, and different frequencies of the same noise or speech may be denoised or devoiced by different amounts depending on the NSD for that frequency.
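The NSD definition above can be written out as a short calculation. In this sketch the function names, the example energies, and the handling of the range boundaries (the stated ranges meet at 4 dB and 6 dB) are assumptions made for illustration.

```python
import math

def level_db(energy):
    """Convert an energy value to dB."""
    return 10.0 * math.log10(energy)

def nsd_db(speech_e1, speech_e2, noise_e1, noise_e2):
    """Speech energy difference between the microphones minus the noise energy difference, in dB."""
    speech_diff = level_db(speech_e1) - level_db(speech_e2)
    noise_diff = level_db(noise_e1) - level_db(noise_e2)
    return speech_diff - noise_diff

def describe(nsd):
    """Map an NSD value onto the qualitative ranges given in the text."""
    if nsd < 0:
        return "outside the ranges discussed"
    if nsd < 4:
        return "excellent noise suppression, high speech distortion"
    if nsd <= 6:
        return "good noise suppression, low speech distortion"
    return "good to poor noise suppression, very low speech distortion"

# Hypothetical energies: speech is 3x stronger in Mic1, noise is equal in both.
example = nsd_db(speech_e1=3.0, speech_e2=1.0, noise_e1=1.0, noise_e2=1.0)
print(round(example, 2), describe(example))
```

Since the NSD is frequency dependent, the same calculation would in practice be repeated per frequency band rather than once for the whole signal.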
Another very important stipulation is that there should be little or no noise in Mic1 that is not detected in some way by Mic2. In fact, generally, the closer the levels (energies) of the noise in Mic1 and Mic2, the better the noise suppression. However, if the speech levels are about the same in both microphones, then speech distortion due to de-voicing will also be high, and the overall increase in SNR may be low. Therefore it is crucial that the noise levels be as similar as possible while the speech levels are as different as possible. It is normally not possible to simultaneously minimize noise differences while maximizing speech differences, so a compromise must be made. Experimentation with a configuration can often yield one that works reasonably well for noise suppression and acceptable speech distortion.
In summary, the design process rules can be stated as follows:
In the configuration above, the amount of response overlap, and therefore the angle f between the axes of the microphones, will depend on the responses of the microphones as well as the mounting and venting of the microphones. However, a usable configuration is readily found through experimentation.
The microphone configuration implementation described above is a specific implementation of one of many possible implementations, but the scope of this application is not so limited. There are many ways to specifically implement the ideas and techniques presented above, and the specified implementation is simply one of many that are possible. For example, the references cited in the Introduction contain many different variations on the configuration of the microphones.
VAD Device
The VAD device for the Jawbone headset is based upon the references given in the Introduction part 3. It is an acoustic vibration sensor, also referred to as a speech sensing device or a Skin Surface Microphone (SSM), and is described below. The acoustic vibration sensor is similar to a microphone in that it captures speech information from the head area of a human talker, even in noisy environments. However, it is different from a conventional microphone in that it is designed to be more sensitive to speech frequencies detected on the skin of the user than to environmental acoustic noise. This technique is normally only successful for a limited range of frequencies (normally ~100 Hz to 1000 Hz, depending on the noise level), but this is normally sufficient for excellent VAD performance.
Previous solutions to this problem have either been vulnerable to noise, physically too large for certain applications, or cost prohibitive. In contrast, the acoustic vibration sensor described herein accurately detects and captures speech vibrations in the presence of substantial airborne acoustic noise, yet within a smaller and cheaper physical package. The noise-immune speech information provided by the acoustic vibration sensor can subsequently be used in downstream speech processing applications (speech enhancement and noise suppression, speech encoding, speech recognition, talker verification, etc.) to improve the performance of those applications.
The following description provides specific details for a thorough understanding of, and enabling description for, embodiments of a transducer. However, one skilled in the art will understand that the invention may be practiced without these details. In other instances, well-known structures and functions have not been shown or described in detail to avoid unnecessarily obscuring the description of the embodiments of the invention.
The sensor also includes electret material 120 and the associated components and electronics coupled to receive acoustic signals from the talker via the coupler 110 and the diaphragm 108 and convert the acoustic signals to electrical signals. Electrical contacts 130 provide the electrical signals as an output. Alternative embodiments can use any type/combination of materials and/or electronics to convert the acoustic signals to electrical signals and output the electrical signals.
The coupler 110 of an embodiment is formed using materials having acoustic impedances similar to the impedance of human skin (the characteristic acoustic impedance of skin is approximately 1.5×10^6 Pa×s/m). The coupler 110, therefore, is formed using a material that includes at least one of silicone gel, dielectric gel, thermoplastic elastomers (TPE), and rubber compounds, but is not so limited. As an example, the coupler 110 of an embodiment is formed using Kraiburg TPE products. As another example, the coupler 110 of an embodiment is formed using Sylgard® Silicone products.
The coupler 110 of an embodiment includes a contact device 112 that includes, for example, a nipple or protrusion that protrudes from either or both sides of the coupler 110. In operation, a contact device 112 that protrudes from both sides of the coupler 110 includes one side of the contact device 112 that is in contact with the skin surface of the talker and another side of the contact device 112 that is in contact with the diaphragm, but the embodiment is not so limited. The coupler 110 and the contact device 112 can be formed from the same or different materials.
The coupler 110 transfers acoustic energy efficiently from skin/flesh of a talker to the diaphragm, and seals the diaphragm from ambient airborne acoustic signals. Consequently, the coupler 110 with the contact device 112 efficiently transfers acoustic signals directly from the talker's body (speech vibrations) to the diaphragm while isolating the diaphragm from acoustic signals in the airborne environment of the talker (characteristic acoustic impedance of air is approximately 415 Pa×s/m). The diaphragm is isolated from acoustic signals in the airborne environment of the talker by the coupler 110 because the coupler 110 prevents the signals from reaching the diaphragm, thereby reflecting and/or dissipating much of the energy of the acoustic signals in the airborne environment. Consequently, the sensor 100 responds primarily to acoustic energy transferred from the skin of the talker, not air. When placed against the head of the talker, the sensor 100 picks up speech-induced acoustic signals on the surface of the skin while airborne acoustic noise signals are largely rejected, thereby increasing the signal-to-noise ratio and providing a very reliable source of speech information.
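The size of the impedance mismatch can be illustrated with the standard plane-wave, normal-incidence intensity transmission coefficient, T = 4·Z1·Z2/(Z1+Z2)^2. This idealized formula is a textbook relation brought in here as an assumption; it does not appear in the description above and ignores the actual geometry of the sensor. The impedance values are the approximate figures quoted in the text for air (415 Pa×s/m) and for skin and skin-matched materials (1.5×10^6 Pa×s/m).

```python
import math

def intensity_transmission(z1, z2):
    """Fraction of incident acoustic intensity transmitted across a boundary
    between media of characteristic impedance z1 and z2 (plane wave, normal incidence)."""
    return 4.0 * z1 * z2 / (z1 + z2) ** 2

Z_AIR = 415.0    # Pa*s/m, approximate value quoted in the text
Z_SKIN = 1.5e6   # Pa*s/m, approximate value quoted in the text

air_to_coupler = intensity_transmission(Z_AIR, Z_SKIN)
skin_to_coupler = intensity_transmission(Z_SKIN, Z_SKIN)  # ideally matched coupler

print(f"air to coupler:  {air_to_coupler:.2e} "
      f"({10 * math.log10(air_to_coupler):.1f} dB)")
print(f"skin to coupler: {skin_to_coupler:.2f}")
```

Under these idealized assumptions only about a thousandth of the airborne intensity crosses into a skin-matched material, while a perfectly matched skin-to-coupler boundary passes essentially all of it, which is consistent with the isolation behavior described above.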
Performance of the sensor 100 is enhanced through the use of the seal provided between the diaphragm and the airborne environment of the talker. The seal is provided by the coupler 110. A modified gradient microphone is used in an embodiment because it has pressure ports on both ends. Thus, when the first port 104 is sealed by the coupler 110, the second port 106 provides a vent for air movement through the sensor 100. The second port is not required for operation, but does increase the sensitivity of the device to tissue-borne acoustic signals. The second port also allows more environmental acoustic noise to be detected by the device, but the device's diaphragm's sensitivity to environmental acoustic noise is significantly decreased by the loading of the coupler 110, so the increase in sensitivity to the user's speech is greater than the increase in sensitivity to environmental noise.
The acoustic vibration sensor provides very accurate Voice Activity Detection (VAD) in high noise environments, where high noise environments include airborne acoustic environments in which the noise amplitude is as large if not larger than the speech amplitude as would be measured by conventional microphones. Accurate VAD information provides significant performance and efficiency benefits in a number of important speech processing applications including but not limited to: noise suppression algorithms such as the Pathfinder algorithm available from AliphCom, San Francisco, Calif. and described in the Related Applications; speech compression algorithms such as the Enhanced Variable Rate Coder (EVRC) deployed in many commercial systems; and speech recognition systems.
In addition to providing signals having an improved signal-to-noise ratio, the acoustic vibration sensor uses only minimal power to operate (on the order of 200 microamps, for example). In contrast to alternative solutions that require power, filtering, and/or significant amplification, the acoustic vibration sensor uses a standard microphone interface to connect with signal processing devices. The use of the standard microphone interface avoids the additional expense and size of interface circuitry in a host device and supports use of the sensor in highly mobile applications where power usage is an issue.
As described above, the sensor includes additional electronic materials as appropriate that couple to receive acoustic signals from the talker via the coupler 410, the silicon gel 409, and the diaphragm 408 and convert the acoustic signals to electrical signals representative of human speech. Alternative embodiments can use any type/combination of materials and/or electronics to convert the acoustic signals to electrical signals representative of human speech.
The coupler 410 and/or gel 409 of an embodiment are formed using materials having impedances matched to the impedance of human skin. As such, the coupler 410 is formed using a material that includes at least one of silicone gel, dielectric gel, thermoplastic elastomers (TPE), and rubber compounds, but is not so limited. The coupler 410 transfers acoustic energy efficiently from skin/flesh of a talker to the diaphragm, and seals the diaphragm from ambient airborne acoustic signals. Consequently, the coupler 410 efficiently transfers acoustic signals directly from the talker's body (speech vibrations) to the diaphragm while isolating the diaphragm from acoustic signals in the airborne environment of the talker. The diaphragm is isolated from acoustic signals in the airborne environment of the talker by the silicon gel 409/coupler 410 because the silicon gel 409/coupler 410 prevents the signals from reaching the diaphragm, thereby reflecting and/or dissipating much of the energy of the acoustic signals in the airborne environment. Consequently, the sensor 400 responds primarily to acoustic energy transferred from the skin of the talker, not air. When placed against the head of the talker, the sensor 400 picks up speech-induced acoustic signals on the surface of the skin while airborne acoustic noise signals are largely rejected, thereby increasing the signal-to-noise ratio and providing a very reliable source of speech information.
There are many locations outside the ear from which the acoustic vibration sensor can detect skin vibrations associated with the production of speech. The sensor can be mounted in a device, handset, or earpiece in any manner, the only restriction being that reliable skin contact is used to detect the skin-borne vibrations associated with the production of speech.
Note that the silicon gel (block 702) is an optional component that depends on the embodiment of the sensor being manufactured, as described above. Consequently, the manufacture of an acoustic vibration sensor 100 that includes a contact device 112 (referring to
VAD Device Performance
The SSM device described above has been implemented and used in a variety of systems at AliphCom. Most importantly, the SSM is a vital part of the Jawbone headset and its proper functionality is critical to the overall performance of the Jawbone headset. Without the SSM or a similar device supplying VAD information, the noise suppression performance of the Jawbone headset would be very poor.
Referring again to
During speech, when the SSM is placed on the cheek or neck, vibrations associated with speech production are easily detected. However, the airborne acoustic data is not significantly detected by the SSM. The tissue-borne acoustic signal, upon detection by the SSM, is used to generate the VAD signal in processing and denoising the signal of interest, as described above with reference to the energy/threshold method outlined in
The implementation described above is a specific implementation of a VAD transducer, but the scope of this application is not so limited. There are many ways to specifically implement the ideas and techniques presented above, and the specified implementation is simply one of many that are possible.
Dynamic Audio Enhancement
Dynamic Audio Enhancement is a technique developed by AliphCom to help the user better hear the person he or she is conversing with. It uses the VAD above to determine when the person is not speaking, and during that time, a long-term estimate of the environmental noise power is calculated. It also calculates an estimate of the average power of the far-end signal that the user is trying to hear. The goal is to increase intelligibility over a wide range of noise levels with respect to incoming far-end levels; that is, a wide range of signal to noise ratio: far-end speech/near-end noise. The system varies the gain of the loudspeaker and filters the incoming far-end to attain these goals.
The DAE system comprises three stages:
These sub-systems operate on frames of 16 samples at a time (2 ms at 8 kHz) but are not so limited. First, the far-end signal is statically filtered through an FIR high-pass filter. Then, for each frame, the FL and NL sub-systems calculate the average power level in dB, Lf or Ln respectively, and pass it to the GM sub-system. Finally, the gain management sub-system slowly varies the gain so that a specific target SNR can be attained. This gain multiplies the far-end signal and provides the signal to be sent to the speaker.
It has been demonstrated that raising high frequencies of speech can improve intelligibility. We use a 33-tap high-pass FIR to do so, but are not so limited.
Power levels are measured in the frequency range of 250 Hz-4000 Hz. They are calculated for each frame and filtered over a large number of frames (equivalent to 1 second of signal) using a cascade of two moving average (MA) filters. The moving average filter was chosen for its ability to completely “forget the past” after a period of time corresponding to the length of its impulse response, preventing large impulses from affecting the system's response for too long. Furthermore, a cascade of two filters was chosen, in which the second filter is fed with the decimated output of the first stage, guaranteeing low memory usage. One long MA would have required as many as 500 taps, whereas a cascade of two requires only 25+20=45.
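The memory saving of the cascade can be illustrated with a short sketch. It assumes that the first stage averages non-overlapping blocks of 25 frame levels and emits one value per block (the decimation), and that the second stage averages the 20 most recent block values, so that 25×20=500 frames (1 second at 2 ms per frame) are covered with 45 stored values. The class name, the block-wise decimation, and the start-up behavior are assumptions made for this example.

```python
from collections import deque

class CascadedMovingAverage:
    """Two-stage moving average: 25 frame levels per block, 20 blocks per output."""

    def __init__(self, first_len=25, second_len=20):
        self.first = deque(maxlen=first_len)    # per-frame levels in dB
        self.second = deque(maxlen=second_len)  # decimated block averages
        self.first_len = first_len
        self.count = 0
        self.output = None                      # None until the first block completes

    def update(self, level_db):
        """Feed one frame level; return the current long-term average."""
        self.first.append(level_db)
        self.count += 1
        if self.count == self.first_len:        # one decimated value per block
            self.count = 0
            self.second.append(sum(self.first) / len(self.first))
            self.output = sum(self.second) / len(self.second)
        return self.output

    def memory(self):
        return self.first.maxlen + self.second.maxlen

# A single large impulse is completely forgotten once it leaves the window.
ma = CascadedMovingAverage()
ma.update(100.0)
outputs = [ma.update(0.0) for _ in range(524)]
print(ma.memory(), outputs[498], outputs[-1])
```

Here the impulse still contributes to the average after 500 frames but has dropped out entirely after 525 frames, which is the "forget the past" property that motivated the moving average choice.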
More specifically, once the power p is measured in the current frame and converted into a log scale (dB), it is processed by the following system:
A delay mechanism is implemented that removes possible unvoiced regions from the measurements (250 ms before any valid voicing frame and 200 ms after). This adds latency to the overall delay of the system and explains the delay mentioned above.
In addition, since a single false positive from the VAD can freeze adaptation for as long as 450 ms, a pulse rejection technique is used as follows: a frame is declared as voiced if there were at least 20 voiced frames among the most recent 25 frames.
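The pulse rejection rule can be sketched with a sliding window of raw VAD decisions. Whether the current frame counts as one of the 25 frames is not stated, so this example assumes that it does; the class and parameter names are hypothetical.

```python
from collections import deque

class PulseRejector:
    """Declare a frame voiced only if enough recent raw VAD frames were voiced."""

    def __init__(self, window=25, min_voiced=20):
        self.recent = deque(maxlen=window)  # raw VAD decisions, current frame included
        self.min_voiced = min_voiced

    def update(self, raw_vad):
        self.recent.append(bool(raw_vad))
        return sum(self.recent) >= self.min_voiced

# An isolated false positive is rejected.
rejector = PulseRejector()
isolated = [rejector.update(v) for v in [0] * 10 + [1] + [0] * 30]

# Sustained voicing is accepted once 20 voiced frames have accumulated.
rejector = PulseRejector()
sustained = [rejector.update(1) for _ in range(25)]

print(any(isolated), sustained.index(True))
```

With 2 ms frames this rule delays the voiced decision by roughly 40 ms at the onset of real speech, in exchange for ignoring short spurious pulses.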
Concerning the far-end signal, it is obvious that the level should not be measured during silences or comfort noise. This requires us to be able to detect speech in the far-end signal (“far-end activity”) on a wide range of cell phones and volume settings. This normally is not an issue, and it is likely that a single fixed energy threshold can be used to separate comfort noise from weak speech. Otherwise, one can also use a system that ignores energies below the lowest 10% of the observed energy range, for example.
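One possible reading of the "lowest 10% of the observed energy range" rule is sketched below: the threshold is placed 10% of the way from the lowest to the highest observed frame level in dB, and frames at or below it are ignored. This interpretation, the batch processing of a list of levels, and the function name are assumptions; a running implementation would track the observed range over time instead.

```python
def far_end_activity(frame_levels_db, fraction=0.10):
    """Flag frames whose level lies above the lowest `fraction` of the observed range."""
    lowest = min(frame_levels_db)
    highest = max(frame_levels_db)
    threshold = lowest + fraction * (highest - lowest)
    return [level > threshold for level in frame_levels_db]

# Hypothetical frame levels in dB: comfort noise near -60 dB, speech well above it.
levels = [-60.0, -58.0, -30.0, -25.0, -59.0, -20.0]
active = far_end_activity(levels)
print(active)
```

Only the frames flagged as active would then contribute to the far-end level Lf.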
Concerning the noise microphone, the problem is more challenging: it seems quite regrettable to limit noise level measurements to non-speech and non-echo frames (only around 30% of frames). However, the energy of the near-end speech in the noise microphone can be substantial, even if an LMS-based algorithm similar to Pathfinder, or Pathfinder itself, is used to remove the speech. Since we cannot make assumptions about the near-end speech intensity, it seems that we have no choice but to stop measuring the noise level when near-end speech occurs.
Second, the energy of an echo from the far-end speech can be large as well. The measurement is performed on the echo-cancelled signal, but that signal can still contain significant residual echo. When measurements are performed in the presence of echo, the system can be led to raise the speaker's gain G, which increases the echo, and so on. This positive feedback loop is certainly not desirable. Since the gain is limited by a maximal value, it can actually start oscillating under certain conditions. There are ways around this, such as limiting the rate at which the gain can increase, but we have found the system to be much more reliable if the noise power level is only calculated when there is no near- or far-end speech taking place.
A cutoff is used on the incoming levels Lf and Ln in order to prevent problems at start-up:
Lf=max(Lf, −60 dB)
Ln=max(Ln, −60 dB)
The projected signal-to-noise ratio R is calculated. This is the SNR that would be reached if the gain remains unchanged:
R=Lf−Ln+20*log10(G)
The difference with the target SNR T is:
dR=R−T
Finally, a decision is made to change the gain if the actual SNR is too far from the target:
If dR<−3 dB, then G=1.05*G
If dR>3 dB, then G=0.95*G
Otherwise the gain remains unchanged. Also, the gain is saturated if it reaches a maximum gain limit (0 dB) or a minimum gain limit (−18 dB). This lower limit is chosen such that it leads to a speaker volume that is 3 dB above the level achieved when the DSP system is bypassed. Consequently, the system guarantees that the volume of the speaker increases by at least 3 dB at start-up. In fact, when the system is powered up, G starts at the minimum value and converges to whatever gain corresponds to the desired target SNR.
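The gain management steps can be sketched as a per-frame update. This example assumes that the gain is raised by 5% when the projected SNR falls more than 3 dB below the target, lowered by 5% when it exceeds the target by more than 3 dB, and left alone otherwise, with G treated as a linear amplitude gain limited to the range −18 dB to 0 dB. The function name, the example levels, and the target SNR are hypothetical.

```python
import math

G_MAX = 1.0                  # 0 dB maximum gain limit
G_MIN = 10 ** (-18 / 20)     # -18 dB minimum gain limit
LEVEL_FLOOR_DB = -60.0       # cutoff applied to incoming levels at start-up

def update_gain(gain, lf_db, ln_db, target_snr_db, deadband_db=3.0):
    """Return the speaker gain for the next frame."""
    lf = max(lf_db, LEVEL_FLOOR_DB)
    ln = max(ln_db, LEVEL_FLOOR_DB)
    projected = lf - ln + 20.0 * math.log10(gain)   # R
    d_r = projected - target_snr_db                 # dR = R - T
    if d_r < -deadband_db:
        gain *= 1.05     # projected SNR too low: raise the gain
    elif d_r > deadband_db:
        gain *= 0.95     # projected SNR too high: lower the gain
    return min(max(gain, G_MIN), G_MAX)             # saturate at the limits

# Start at the minimum gain and let the loop converge (hypothetical levels).
gain = G_MIN
for _ in range(200):
    gain = update_gain(gain, lf_db=-20.0, ln_db=-30.0, target_snr_db=10.0)
final_dr = (-20.0 - (-30.0) + 20.0 * math.log10(gain)) - 10.0

# In louder noise the target cannot be reached and the gain saturates at 0 dB.
loud = G_MIN
for _ in range(200):
    loud = update_gain(loud, lf_db=-20.0, ln_db=-10.0, target_snr_db=10.0)

print(round(gain, 3), round(final_dr, 2), loud)
```

The 3 dB deadband keeps the gain from hunting around the target, and the fixed 5% step bounds how quickly the speaker level can change from one frame to the next.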
Jawbone Headset
The Jawbone headset is a specific combination of the techniques and principles discussed above. It is presented as an explicit implementation of the techniques and algorithms discussed above, but the construction of a headset with the specified techniques and algorithms is not so limited to the configuration shown below. Many different configurations are possible whereby the techniques and algorithms discussed above may be implemented.
The physical Jawbone headset consists of two main components: an earpiece and a control module. The earpiece can be worn on either ear of the user. The control module, which is connected to the earpiece via a wire, can be clipped to the user's clothing during use. A unique attribute of the headset design is the design aesthetic of each component and, equally, of the two components together. These attributes are described in detail below:
The Jawbone headset is a comfortable, bi-aural earpiece containing a number of transducers, which is attached via a wire to a control module bearing integrated circuits for processing the transducer signals. It uses the technology described above to suppress environmental noise so that the user can be understood more clearly. It also uses a technique dubbed DAE so that the user can hear the conversation more clearly.
By virtue of its design and the signal processing technology integrated within it, this headset is comfortable and stable when worn on either ear and is able to deliver great incoming and outgoing audio quality to its user in a wide range of noise environments.
The Earpiece (
The earpiece is made up of an earloop 120, an earbud barrel 130, and a body 240, which are connected together as one device prior to operation by the user. Once assembled during manufacture, there is no requirement for the user to remove any components from the headset. The headset is intended for use on either ear, and on one ear at a time. The objective in such a design is to ensure that the headset is mechanically stable on either ear, comfortable on either ear, and that the acoustic transducers are properly positioned during use.
The first mechanical design achievement is the ability for the headset to be used on either ear, without the need to remove any components. In addition, the electronic wiring that is used to connect the headset to a mobile phone or other device must be fed through the earloop 120 to ensure proper stability and comfort for the user. If this wiring is not fed through the earloop, but is rather allowed to drop directly down from the body of the earpiece, the stability of the headset can be significantly compromised. The body 240 is attached to the earbud barrel 130, around which the body is free to rotate. The “polarity” of the headset (i.e. whether it is configured for the left or right ear) is changed by rotating the body 240 through a 180° angle around the earbud barrel. Since the earloop is symmetrical along the plane of its core, the headset feels and functions in exactly the same way on both ears.
The second mechanical design achievement is the spring-loaded-body mechanism, which ensures that the body 240 is always turned inwards towards the cheek during use. This feature achieves three important requirements:
The spring-loading of the body is achieved by means of a symmetrical metal spring element 520 and a bi-polar cam 510 which together generate a torsional force between the earpiece body 810 and the earloop 500 respectively, around a rotational axis which is the earloop core. Note that the earloop is mechanically fastened to the cam, and the body is mechanically fastened to the spring. The spring is free to rotate within the cam. The metal spring is symmetrical in one axis, and the cam is symmetrical along the rotational axis, ensuring the headset behaves in exactly the same manner on each ear. When the earpiece is placed on the ear, the angle [θ] between the earloop 820 and the body 810 is widened, forcing the cam to rotate within and against the spring. The spring provides a reactive torsional force which operates to reduce the angle [θ] between the body 810 and the earloop 820. The body is thus always kept in contact with the user's cheek and the primary microphone 710 is always aligned toward the user's mouth.
The third mechanical design achievement is the 3-point headset mounting system, which ensures that the headset is stable and comfortable on a wide variety of ear anatomies. The first feature of this system is the semi-rigid, but elastic, earloop 820, which lightly grips the root of the pinna (see
The Jawbone headset captures the speech and VAD information in the earpiece. This information is then routed to the control module where the VAD and noise levels are calculated and the audio from Mic1 is noise suppressed. The output of this process is a cleaned speech signal. This cleaned speech signal may be directed to any number of communications devices such as mobile phones, landline phones, portable phones, Internet telephones, wireless transceivers, personal digital assistants (PDAs), VOIP telephones, and personal computers. The control module can be connected to the communication device using wired or wireless connections. The control module can be separated from the earpiece (as in the Jawbone implementation) or can be built into the earpiece, headset, or any device designed to be worn on the body.
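The signal flow described above (a VAD decision, a noise level estimate, and suppression applied to the Mic1 audio) can be illustrated with a minimal sketch. The sketch is not the noise suppression algorithm of the referenced applications; it substitutes a simple broadband energy-based gain for illustration only. The function names, the frame representation, and the parameters vad_threshold, smoothing, and gain_floor are all assumptions that do not appear in this description.

```python
from typing import List, Optional, Sequence, Tuple

Frame = Sequence[float]


def frame_energy(frame: Frame) -> float:
    """Mean squared amplitude of one frame of samples."""
    return sum(x * x for x in frame) / len(frame)


def suppress(
    mic1_frames: Sequence[Frame],
    sensor_frames: Sequence[Frame],
    vad_threshold: float = 0.01,  # hypothetical energy threshold for the VAD sensor
    smoothing: float = 0.9,       # hypothetical smoothing factor for the noise level
    gain_floor: float = 0.1,      # hypothetical lower bound on the applied gain
) -> Tuple[List[List[float]], List[bool], List[Optional[float]]]:
    """Illustrative VAD-gated suppression of Mic1 frames.

    Returns the cleaned frames, the per-frame VAD flags, and the noise
    level in effect for each frame (None until a noise-only frame is seen).
    """
    cleaned: List[List[float]] = []
    flags: List[bool] = []
    noise_levels: List[Optional[float]] = []
    noise: Optional[float] = None

    for mic, sensor in zip(mic1_frames, sensor_frames):
        # VAD decision taken from the separate sensor signal, not from Mic1.
        speaking = frame_energy(sensor) > vad_threshold
        energy = frame_energy(mic)

        # Update the noise level only while the VAD reports no speech.
        if not speaking:
            if noise is None:
                noise = energy
            else:
                noise = smoothing * noise + (1.0 - smoothing) * energy

        # Broadband gain: close to 1 when the frame is well above the
        # noise level, limited to gain_floor when it is at the noise level.
        if noise is None or energy == 0.0:
            gain = 1.0
        else:
            gain = max(gain_floor, 1.0 - noise / energy)

        cleaned.append([gain * x for x in mic])
        flags.append(speaking)
        noise_levels.append(noise)

    return cleaned, flags, noise_levels
```

In this sketch a frame with no sensor activity is attenuated to the gain floor, while a frame with sensor activity and energy well above the tracked noise level passes nearly unchanged. Gating the noise estimate on a sensor that is separate from Mic1 is the design point being illustrated: the estimate is not corrupted by the user's own speech.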
This application is a continuation of U.S. Nonprovisional application Ser. No. 11/199,856, entitled “Noise Suppressing Multi-Microphone Headset” and filed on Aug. 8, 2005, which claims the benefit of U.S. Provisional Application No. 60/599,468, entitled “Jawbone Headset” and filed Aug. 6, 2004; this application further claims the benefit of U.S. Provisional Application No. 60/599,618, entitled “Wind and High Noise Compensation in a Headset” and filed Aug. 6, 2004, all of which are herein incorporated by reference for all purposes.
This application is related to the following U.S. patent applications assigned to AliphCom, of San Francisco, Calif. These include:
1. A unique noise suppression algorithm (reference Method and Apparatus for Removing Noise from Electronic Signals, filed Nov. 21, 2002, and Voice Activity Detector (VAD)—Based Multiple Microphone Acoustic Noise Suppression, filed Sep. 18, 2003).
2. A unique microphone arrangement and configuration (reference Microphone and Voice Activity Detection (VAD) Configurations for use with Communications Systems, filed Mar. 27, 2003).
3. A unique voice activity detection (VAD) sensor, algorithm, and technique (reference Acoustic Vibration Sensor, filed Jan. 30, 2004, and Voice Activity Detection (VAD) Devices and Systems, filed Nov. 20, 2003).
4. An incoming audio enhancement system named Dynamic Audio Enhancement (DAE) that filters and amplifies the incoming audio in order to make it simpler for the user to better hear the person on the other end of the conversation (i.e. the “far end”).
5. A unique headset configuration that uses several new techniques to ensure proper positioning of the loudspeaker, microphones, and VAD sensor as well as a comfortable and stable position.
All of the U.S. patents referenced herein are incorporated by reference herein in their entirety.