The need for hands-free communication has led to an increased popularity in the use of headsets with mobile phones and other speech interface devices. Concerns for comfort, portability, and cachet have led to the desire for headsets with a small form factor. Inherent to this size constraint is the requirement that the microphone be placed farther from the user's mouth, generally increasing its susceptibility to environmental noise. This has meant a tradeoff between audio performance and usability features such as comfort, portability, and cachet.
A first set of signals from an array of one or more microphones, and a second signal from a reference microphone are used to calibrate a set of filter parameters such that the filter parameters minimize a difference between the second signal and a beamformer output signal that is based on the first set of signals. Once calibrated, the filter parameters are used to form a beamformer output signal that is filtered using a non-linear adaptive filter that is adapted based on portions of a signal that do not contain speech, as determined by a speech detection sensor.
A variety of other variations and embodiments besides those illustrative examples specifically discussed herein are also contemplated within the scope of the claims for the present invention, and will be apparent to those skilled in the art from the entirety of the present disclosure.
A variety of methods and apparatus are encompassed within different embodiments, an illustrative sampling of which is described herein. For example,
Embodiments are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with various embodiments include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, telephony systems, distributed computing environments that include any of the above systems or devices, and the like.
Embodiments may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Some embodiments are designed to be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules are located in both local and remote computer storage media including memory storage devices.
With reference to
Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation,
The computer 110 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
A user may enter commands and information into the computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195.
The computer 110 is operated in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110. The logical connections depicted in
When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
Memory 204 includes an operating system 212, application programs 214 as well as an object store 216. During operation, operating system 212 is preferably executed by processor 202 from memory 204. Operating system 212, in one preferred embodiment, is a WINDOWS® CE brand operating system commercially available from Microsoft Corporation. Operating system 212 is preferably designed for mobile devices, and implements database features that can be utilized by applications 214 through a set of exposed application programming interfaces and methods. The objects in object store 216 are maintained by applications 214 and operating system 212, at least partially in response to calls to the exposed application programming interfaces and methods.
Communication interface 208 represents numerous devices and technologies that allow mobile device 200 to send and receive information. The devices include wired and wireless modems, satellite receivers and broadcast tuners to name a few. Mobile device 200 can also be directly connected to a computer to exchange data therewith. In such cases, communication interface 208 can be an infrared transceiver or a serial or parallel communication connection, all of which are capable of transmitting streaming information.
Input/output components 206 include a variety of input devices such as a touch-sensitive screen, buttons, rollers, and a microphone as well as a variety of output devices including an audio generator, a vibrating device, and a display. The devices listed above are by way of example and need not all be present on mobile device 200. In addition, other input/output devices may be attached to or found with mobile device 200.
The first two air microphones 303, 305 are preferred-direction microphones and are noise-canceling. The third microphone 307 is omnidirectional; that is, it is not a preferred-direction microphone. Microphones 303 and 305 are configured to receive primarily the user's speech, while microphone 307 is configured to receive ambient noise, in addition to the user's speech. The omnidirectional third microphone 307 is thereby used both as part of the microphone array, and for capturing ambient noise for downstream adaptive filtering. This difference in function does not necessarily imply difference in structure; it is contemplated that all three microphones 303, 305, 307 are physically identical within normal tolerances in one illustrative embodiment, although their placement and orientation suit them particularly for their functions. Microphones 303 and 305 face toward the direction expected for the user's mouth, while microphone 307 faces in a direction expected to be directly away from the user's ear, thus making it more likely for microphone 307 to sample ambient noise in addition to the user's speech. Microphone 307 may be described as omnidirectional not because it receives sounds from every direction necessarily, but in the sense that it faces the user's ambient environment rather than being particularly aimed in a preferred direction toward a user's mouth.
Although each of microphones 303, 305, and 307 detects and includes in its transmitted signal some finite amount of both speech and noise, the signal associated with the omnidirectional microphone 307 is designated separately as a speech-plus-noise signal, since it is expected to feature a substantially greater noise-to-speech ratio than the signals received by the preferred-direction microphones 303 and 305.
Although this embodiment is depicted with one omnidirectional microphone 307 and two preferred-direction microphones in the microphone array, this is illustrative only, and many other arrangements may occur in various embodiments. For example, in another embodiment there may be only a single preferred-direction microphone and a single omnidirectional microphone; while in another example, three or more preferred-direction microphones may be included in an array; while in yet another embodiment, two or more omnidirectional microphones may be used—for example, to face two different ambient noise directions away from the user.
Regarding headset 301, the general direction of boom 311 defines a preferred direction for the directional array of microphones 303, 305, 307 as a whole, and particularly for microphones 303 and 305 individually. The headset 301 may be worn with the air microphones 303 and 305 oriented generally toward the user's mouth, and the microphone 307 oriented along a generally common line with microphones 303 and 305, in this embodiment. Omnidirectional microphone 307 is situated generally at the ear canal, in normal use, while the bone sensor 309 rests on the skull behind the ear. The bone-conductive sensor is highly insensitive to ambient noise, and as such, provides robust speech activity detection.
Bone sensor 309 is one example of a speech indicator sensor, configured for providing an indicator signal that is configured to indicate when the user is speaking and when the user is not speaking. Bone sensor 309 is configured to contact a user's head just behind the ear, where it receives vibrations that pass through the user's skull, such as those corresponding to speech. Other types of speech indicator sensors may occur in various embodiments, including a bone sensor configured to contact the user's jaw, or a throat microphone that measures the user's throat vibrations, as additional illustrative examples. A speech indicator may also take the form of a function of signal information, such as the audio energy received by the microphones. The energy level of the sensor signal may be compared to a stored threshold level of energy, pre-selected to match the threshold of energy anticipated for the user's speech. Microphones 303, 305, 307 are conventional air conduction microphones used to convert audio vibrations into electrical signals.
The filter parameters used by beamformer 423 are calibrated using a close-talking microphone reference signal 449, in one embodiment. Using a small sample of training recordings in which a user's speech is captured by both the microphone array 411 and a close-talking reference microphone 431, a calibration algorithm 421 associated with beamformer 423 operates to set the filters for the microphones of array 411. Close-talking microphone 431 is generally only used for calibration; then once system 401 is calibrated, reference microphone 431 is no longer needed, as suggested by the dashed lines associated with reference microphone 431.
Array 411 may form part of a headset, such as headset 301 of
Step 505 includes dividing Ym and R into time increments and frequency subbands as Ym,t[k] and Rt[k]. These steps may include additional details such as in one illustrative embodiment that might include conversion of the signals from analog to digital form, dividing the signals into time-domain samples, performing fast Fourier transforms on these time-domain samples, and thereby providing a signal in the form of subbands of frequency-domain frames.
In one illustrative example, analog-to-digital converters sample the analog signals at 16 kHz and 16 bits per sample, thereby creating 32 kilobytes of speech data per second. These digital signals are provided in new frames every 10 milliseconds, each of which includes 20 milliseconds worth of data. In this particular embodiment, therefore, the time-domain samples are partitioned in increments of 20 milliseconds each, with each frame overlapping the previous frame by half. Alternative embodiments may use increments of 25 milliseconds, or a timespan anywhere in a range from substantially less than 20 milliseconds to substantially more than 25 milliseconds. The frequency-domain frames may also occur in different forms. With each frame overlapping the previous frame by half, the number of subbands is designated here as N/2, where N is the size of a Discrete Fourier Transform (DFT).
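As an illustrative sketch (not part of the embodiments above), the framing and transform steps just described — 16 kHz sampling, 20 ms frames produced every 10 ms with half overlap, and a DFT per frame — might be implemented along the following lines in Python. The Hann analysis window and the 512-point DFT size are assumptions made for the example, not values specified above:

```python
import numpy as np

SAMPLE_RATE = 16000              # 16 kHz sampling, as in the example above
FRAME_LEN = SAMPLE_RATE // 50    # 20 ms -> 320 samples per frame
HOP = FRAME_LEN // 2             # a new frame every 10 ms (half overlap)

def to_subbands(signal, n_fft=512):
    """Split a time-domain signal into half-overlapping 20 ms frames
    and return the DFT of each frame (rows: frames, cols: subbands)."""
    frames = []
    for start in range(0, len(signal) - FRAME_LEN + 1, HOP):
        frame = signal[start:start + FRAME_LEN]
        frame = frame * np.hanning(FRAME_LEN)       # analysis window (assumed)
        frames.append(np.fft.rfft(frame, n=n_fft))  # n_fft/2 + 1 subbands
    return np.array(frames)

# one second of a 1 kHz tone as a toy input
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
Y = to_subbands(np.sin(2 * np.pi * 1000 * t))
```

With a 512-point DFT the subband spacing is 31.25 Hz, so the 1 kHz tone falls exactly in subband 32.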
These or other potential method steps will be recognized by those skilled in the art as advantageously contributing to embodiments similar to method 501. Some of the details of some of these and other potential method steps are also understood in the art, and need not be reviewed in detail here.
At step 507, the time-ordered frames and frequency subbands of the array signals Ym,t[k] and the reference signal Rt[k] are used to calibrate a set of filter parameters Hn[k] for beamformer 423. This involves solving a linear system which minimizes a function of the difference between the reference signal Rt[k] and the output signal Zt[k], which is a function of the set of signals Ym,t[k] from the array, and the filter parameters Hn[k]. This linear system and these functions are explained as follows.
In the illustrative example of the subband filter-and-sum linear beamforming architecture, the kth subband of the short-time Fourier transform of the signal produced by microphone m at frame t is represented as $Y_{m,t}[k]$, and the beamformer output can be expressed as:

$$Z_t[k] = \sum_{m=1}^{M} H_m[k]\, Y_{m,t}[k] \quad \text{Eq. 1}$$
where $H_m[k]$ is the filter coefficient applied to subband k of microphone m and M is the total number of microphones in the array. If the reference signal from the close-talking microphone 431 is defined as $R_t[k]$, the goal of the proposed calibration algorithm is to find the array parameters that minimize the following objective function:

$$\varepsilon_k = \sum_t \left| R_t[k] - Z_t[k] \right|^2 \quad \text{Eq. 2}$$
Equation 2 is therefore a function of the difference between the reference signal $R_t[k]$ and the beamformer output signal $Z_t[k]$. Minimizing this function is thus a method of minimizing the difference between the output $R_t[k]$ from a reference microphone 431 and the beamformer output signal $Z_t[k]$ produced by a beamformer 423, applying calibration parameters or filter coefficients $H_m[k]$ derived from the present method to signals $Y_{m,t}[k]$ from a headset microphone array 411, according to one illustrative embodiment. Minimizing the function of Equation 2 may be done by taking the partial derivative of Equation 2 with respect to $H_m^*[k]$, where $H_m^*[k]$ represents the complex conjugate of $H_m[k]$, and setting the result to zero; this gives:

$$\sum_t \left( Z_t[k] - R_t[k] \right) Y_{m,t}^*[k] = 0 \quad \text{Eq. 3}$$
where $Y_{m,t}^*[k]$ is the complex conjugate of $Y_{m,t}[k]$. By rearranging the terms of Equation 3, this becomes:

$$\sum_{n=1}^{M} H_n[k] \sum_t Y_{n,t}[k]\, Y_{m,t}^*[k] = \sum_t R_t[k]\, Y_{m,t}^*[k], \quad m = 1, \ldots, M \quad \text{Eq. 4}$$
The filter coefficients {H1[k], . . . , HM[k]} can then be found by solving the linear system in Equation 4, as represented in step 507 of method 501 of
Method 501 can thereby include minimizing the function $\varepsilon_k$ of the difference between the reference signal $R_t[k]$ and the beamformer output signal $Z_t[k]$, including by taking the derivative of the function $\varepsilon_k$ with respect to the complex conjugate $H_m^*[k]$ of the filter parameters $H_m[k]$, setting the derivative equal to zero, and solving the resulting linear system, as in Equation 4 and as depicted in step 507.
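The calibration of step 507 amounts to an ordinary complex least-squares problem solved independently in each subband. The following is a minimal sketch, assuming the signals are already in time-frequency form and using a generic least-squares solver rather than any particular implementation of the normal equations in Equation 4 (the two are mathematically equivalent):

```python
import numpy as np

def calibrate_filters(Y, R):
    """Per-subband least-squares calibration of beamformer filters.

    Y: complex array of shape (T, M, K) -- frames x microphones x subbands
    R: complex array of shape (T, K)    -- close-talking reference signal
    Returns H of shape (M, K) minimizing
    sum_t |R_t[k] - sum_m H_m[k] Y_{m,t}[k]|^2 for each subband k.
    """
    T, M, K = Y.shape
    H = np.zeros((M, K), dtype=complex)
    for k in range(K):
        A = Y[:, :, k]   # T x M system matrix for subband k
        H[:, k], *_ = np.linalg.lstsq(A, R[:, k], rcond=None)
    return H
```

Given noiseless synthetic data generated from known filters, this recovers those filters exactly, which is a convenient sanity check for an implementation.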
With the filter parameters Hm[k] calibrated, beamformer 423 is ready to receive a new set of signals Ym,t[k] from the array 411 at step 509. These new signals are then used to generate an output signal Zt[k] 451 as a function of the new set of signals and the stored filter parameters Hm[k], as depicted in step 511 of method 501.
The calibrated beamformer will generally not be able to remove all possible ambient noise from the signal. To reflect this, the beamformer output Z may be modeled as:
$$Z_t = G_Z X_t + H_{Z,t} V_t \quad \text{Eq. 5}$$

where $G_Z$ is the spectral tilt induced by the array, $V_t$ is the ambient noise, and $H_{Z,t}$ is the effective filter formed by the beamforming process.
To further enhance the output signal, a non-linear adaptive filter may be applied to the output of the calibrated beamformer. This filter relies on noise information from an omnidirectional microphone and exploits the precise speech activity detection provided by a speech indicator sensor, such as the particular example of the bone-conductive sensor 309 in the illustrative embodiment in
In system 601 of
At step 703, beamformer 623 uses the signals from microphones 603, 605, and 607 in Equation 1 above to form a first signal having a specified noise characteristic. This first signal is a beamformed primary speech signal, having a noise characteristic that represents a function of the signals from microphones 603, 605, and 607, for example. At step 705, speech activity indicator 625 uses the signal from speech activity sensor 609 to indicate which portions of the primary speech signal contain the user's speech and which do not. In one method performed in association with speech activity indicator 625, the energy level of the sensor signal is compared to a stored threshold level of energy, pre-selected to distinguish between speech and the absence of speech as calibrated to the specific instrument, to determine if the user is speaking.
Instead of using a separate speech activity sensor, other embodiments detect when the user is speaking using the microphone array 611. Under one embodiment, the overall rate of energy being detected by the array of microphones may be used to determine when the user is speaking. Alternatively, the rate of energy being detected by a directional array of microphones from a source coinciding with a preferred direction of the array may be used to determine when the user is speaking. Either of these may be calibrated to provide a fairly effective indication of the occurrence or absence of the user's speech. Additional types of speech activity sensors besides these illustrative examples are also contemplated in various embodiments.
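The energy-threshold indicator described above can be sketched as follows. The threshold value here is a hypothetical calibration constant chosen for the example, not one specified by the embodiments:

```python
import numpy as np

def detect_speech(frame, threshold):
    """Energy-threshold speech activity indicator: returns True when the
    frame's mean energy exceeds a pre-selected threshold, which would be
    calibrated to the specific instrument in practice."""
    return float(np.mean(frame ** 2)) > threshold

# toy frames: low-level ambient noise vs. a speech-like high-energy tone
quiet = 0.01 * np.ones(320)
loud = np.sin(np.linspace(0, 40 * np.pi, 320))
```

The same comparison could equally be applied to the output of a bone sensor or throat microphone, whose energy tracks the user's own speech far more selectively than an air microphone's.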
Speech activity indicator 625 provides an indicator signal to non-linear adaptive filter 627 to indicate when the user is speaking. Non-linear adaptive filter 627 also receives the primary speech signal output from beamformer 623, which is formed using Equation 1 above, and microphone signal 650 from microphone 607, constituting a second signal having a second noise characteristic, at step 707. Microphone 607 is oriented to serve as an omnidirectional microphone rather than a preferred-direction microphone, and the second signal is anticipated to have a noise characteristic with a greater component of ambient noise. Filter 627 uses these signals to perform non-linear adaptive filtering. This includes estimating a magnitude of a noise transfer function based on portions of the first signal and second signal that do not represent speech, as depicted in step 709. Filter 627 then generates a filtered output signal as a function of the primary speech signal, the indicator signal, and microphone signal 650, as depicted in step 711. An example of such a mechanism is presented as follows, according to one illustrative embodiment. With $Y_o$ defined as the omnidirectional microphone signal 650, this signal can be modeled as:
$$Y_{o,t} = G_o X_t + H_{o,t} V_t \quad \text{Eq. 6}$$
The following additional variables may also be defined as follows:
$$\tilde{X}_t = G_o X_t \quad \text{Eq. 7}$$

$$\tilde{V}_t = H_{o,t} V_t \quad \text{Eq. 8}$$

$$\tilde{G}_Z = G_Z / G_o \quad \text{Eq. 9}$$

$$\tilde{H}_{Z,t} = H_{Z,t} / H_{o,t} \quad \text{Eq. 10}$$
Substituting Equations 7-10 into Equations 5 and 6 gives:
$$Z_t = \tilde{G}_Z \tilde{X}_t + \tilde{H}_{Z,t} \tilde{V}_t \quad \text{Eq. 11}$$

$$Y_{o,t} = \tilde{X}_t + \tilde{V}_t \quad \text{Eq. 12}$$
In essence, $\tilde{G}_Z$ is the signal transfer function between the beamformer output and the omnidirectional microphone, while $\tilde{H}_{Z,t}$ is the corresponding noise transfer function.
$\tilde{H}_{Z,t}$ in Equation 11 is a function of time. However, if this variation over time is modeled as strictly a function of its phase, while its magnitude is relatively constant, then $\tilde{H}_{Z,t}$ may be rewritten as:
$$\tilde{H}_{Z,t} = |\tilde{H}_Z|\, e^{j\phi_t} \quad \text{Eq. 13}$$
If the speech X and the noise V can be modeled to be uncorrelated, equations 11-13 can be combined to obtain:
$$|Z_t|^2 = |\tilde{G}_Z|^2 |\tilde{X}_t|^2 + |\tilde{H}_Z|^2 |\tilde{V}_t|^2 \quad \text{Eq. 14}$$

$$|Y_{o,t}|^2 = |\tilde{X}_t|^2 + |\tilde{V}_t|^2 \quad \text{Eq. 15}$$
Solving for $|\tilde{X}_t|^2$ using these two equations leads to:

$$|\tilde{X}_t|^2 = \frac{|Z_t|^2 - |\tilde{H}_Z|^2 |Y_{o,t}|^2}{|\tilde{G}_Z|^2 - |\tilde{H}_Z|^2} \quad \text{Eq. 16}$$

Because the denominator of Equation 16 is constant over time, it acts simply as a gain factor. Therefore, $|\tilde{X}_t|^2$ (after accounting for the gain factor) can be estimated simply as:

$$|\tilde{X}_t|^2 = |Z_t|^2 - |\tilde{H}_Z|^2 |Y_{o,t}|^2 \quad \text{Eq. 17}$$
This leads to an estimate of the magnitude of $\tilde{X}_t$ as:

$$|\tilde{X}_t| = |Z_t| \sqrt{\max\!\left(1 - \frac{|\tilde{H}_Z|^2\, |Y_{o,t}|^2}{|Z_t|^2},\ \varepsilon\right)} \quad \text{Eq. 18}$$

where $\varepsilon$ is a small constant and the square-root value represents an adaptive noise suppression factor. As can be seen, the noise suppression factor is a function of the microphone signal $Y_{o,t}$ and $|\tilde{H}_Z|^2$, which forms an effective filter coefficient. As in other magnitude-domain noise suppression algorithms, e.g. spectral subtraction, the phase of the beamformer output signal $Z$ may be used for the filter output as well. Thus, the final estimate of $\tilde{X}$ is:
$$\tilde{X}_t = |\tilde{X}_t|\, e^{j\angle Z_t} \quad \text{Eq. 19}$$

where $\angle Z_t$ represents the phase of $Z_t$.
$|\tilde{H}_Z|$ is estimated using non-speech frames, which are identified based on the signal from speech activity indicator 625. In these frames, Equations 14 and 15 simplify to:
$$|Z_t|^2 = |\tilde{H}_Z|^2 |\tilde{V}_t|^2 \quad \text{Eq. 20}$$

$$|Y_{o,t}|^2 = |\tilde{V}_t|^2 \quad \text{Eq. 21}$$
Using these expressions, the least-squares solution for $|\tilde{H}_Z|$ is:

$$|\tilde{H}_Z|^2 = \frac{\sum_t |Z_t|^2\, |Y_{o,t}|^2}{\sum_t |Y_{o,t}|^4} \quad \text{Eq. 22}$$

where the sums are taken over the non-speech frames.
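The non-linear adaptive filtering just described can be sketched for a single subband as follows: the noise transfer magnitude is fitted by least squares on non-speech frames (as in Equations 20 and 21), then subtracted in the magnitude domain (as in Equation 17), with the phase of the beamformer output reused. The spectral-subtraction-style floor `eps` is an assumption for the example:

```python
import numpy as np

def estimate_noise_gain(Z, Yo, speech_mask):
    """Least-squares fit of |H~_Z|^2 over non-speech frames, where the
    model |Z_t|^2 = |H~_Z|^2 |Y_o,t|^2 holds (cf. Eqs. 20-21).

    Z, Yo: complex arrays of per-frame subband values
    speech_mask: boolean array, True where the user is speaking
    """
    z2 = np.abs(Z[~speech_mask]) ** 2
    y2 = np.abs(Yo[~speech_mask]) ** 2
    return np.sum(z2 * y2) / np.sum(y2 ** 2)

def suppress(Z, Yo, h2, eps=1e-2):
    """Magnitude-domain noise suppression (cf. Eq. 17) with a small
    floor eps, reusing the phase of the beamformer output Z."""
    x2 = np.maximum(np.abs(Z) ** 2 - h2 * np.abs(Yo) ** 2,
                    eps * np.abs(Z) ** 2)   # floored estimate of |X~_t|^2
    return np.sqrt(x2) * np.exp(1j * np.angle(Z))
```

On noise-only frames where the beamformer output is simply a scaled copy of the omnidirectional signal, the fitted gain recovers the square of that scale, and the suppressed output is driven down to the floor.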
In other embodiments, the primary speech signal is formed using a delay-and-sum beamformer, which delays one or more signals in a microphone array and then sums the signals. Specifically, the primary speech signal is formed using a function that incorporates a time delay in superposing signals from the microphones of the microphone array 611 to enhance signals representing sound coming from a source in a preferred direction relative to the array. That is, the function may impose a time shift on the signals from each microphone in the array prior to superposing their signals into a combined signal.
For example, with reference once more to
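The delay-and-sum operation described above can be sketched with integer-sample steering delays; fractional delays, commonly used in practice, are omitted here for simplicity:

```python
import numpy as np

def delay_and_sum(signals, delays):
    """Advance each microphone signal by its steering delay (in samples)
    and superpose, reinforcing sound arriving from the preferred direction.

    signals: list of equal-length 1-D arrays, one per microphone
    delays: integer sample delays at which the wavefront reached each mic
    """
    out = np.zeros(len(signals[0]))
    for sig, d in zip(signals, delays):
        out += np.roll(sig, -d)   # undo the propagation delay
    return out / len(signals)     # average to keep unity gain
```

When the per-microphone delays exactly cancel the propagation delays of the target source, the shifted signals align and their average reproduces the source waveform, while uncorrelated noise is attenuated by the averaging.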
In the systems of
Embodiments of calibrated beamformers, non-linear adaptive filters and associated processes, and devices embodying these new technologies, such as those illustrative embodiments illustrated herein, also have useful applicability to a wide range of technologies. They are applicable in combination with a broad range of additional microphone array processing methods and devices.
These are indicative of a few of the various additional features and elements that may be comprised in different embodiments corresponding to the claims herein. Although the present invention has been described with reference to particular illustrative embodiments, workers skilled in the art will recognize that changes may be made in form and detail without departing from the metes and bounds of the claimed invention.
Number | Date | Country | |
---|---|---|---|
20070088544 A1 | Apr 2007 | US |