The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.
Small computing devices such as personal digital assistants (PDAs) and portable phones are used with ever increasing frequency by people in their day-to-day activities. With the increase in processing power now available for the microprocessors used to run these devices, the functionality of these devices is increasing and, in some cases, merging. For instance, many portable phones can now be used to access and browse the Internet as well as to store personal information such as addresses, phone numbers and the like. Likewise, PDAs and other forms of computing devices are being designed to function as telephones.
In many instances, mobile phones, PDAs and the like are increasingly being used in situations that require hands-free communication, which generally places the microphone assembly in a less than optimal position when in use. For instance, the microphone assembly can be incorporated in the housing of the phone or PDA. However, if the user is operating the device in a hands-free mode, the device is usually spaced significantly away from, and not directly in front of, the user's mouth. Environmental or ambient noise can be significant relative to the user's speech in this less than optimal position. Stated another way, a low signal-to-noise ratio (SNR) is present for the captured speech. Given that mobile devices are commonly used in noisy environments, a low SNR is clearly undesirable.
To address this problem, at least in part, mobile phones and other devices can also be operated using a headset worn by the user. The headset includes a microphone and is connected either by wire or wirelessly to the device. For reasons of comfort, convenience and style, most users prefer headset designs that are compact and lightweight. Typically, these designs require the microphone to be located at some distance from the user's mouth, for example, alongside the user's head. This positioning again is suboptimal and, when compared to a well-placed, close-talking microphone, yields a significant decrease in the SNR of the captured speech signal.
One way to improve sound capture performance, with or without a headset, is to capture the speech signal using multiple microphones configured as an array. Microphone array processing improves the SNR by spatially filtering the sound field, in essence pointing the array toward the signal of interest, which improves overall directivity. However, noise reduction of the signal after the microphone array is still necessary, and current signal processing algorithms have had only limited success in this regard.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the background.
A noise reduction system and a method of noise reduction includes a microphone array comprising a first microphone, a second microphone, and a third microphone. Each microphone has a known position and a known directivity pattern. An instantaneous direction-of-arrival (IDOA) module determines a first phase difference quantity and a second phase difference quantity. The first phase difference quantity is based on phase differences between non-repetitive pairs of input signals received by the first microphone and the second microphone, while the second phase difference quantity is based on phase differences between non-repetitive pairs of input signals received by the first microphone and the third microphone. A spatial noise reduction module computes an estimate of a desired signal from an a priori spatial signal-to-noise ratio and an a posteriori spatial signal-to-noise ratio, both based on the first and second phase difference quantities.
One concept herein described provides spatial noise suppression for a microphone array. Generally, spatial noise reduction is obtained using a suppression rule that exploits the spatio-temporal distribution of noise and speech with respect to multiple dimensions.
However, before describing further aspects, it may be useful to first describe exemplary computing devices or environments that can implement the description provided below.
In addition to the examples herein provided, other well known computing systems, environments, and/or configurations may be suitable for use with concepts herein described. Such systems include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The concepts herein described may be embodied in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Those skilled in the art can implement the description and/or figures herein as computer-executable instructions, which can be embodied on any form of computer readable media discussed below.
The concepts herein described may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
With reference to
Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation,
The computer 110 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
A user may enter commands and information into the computer 110 through input devices such as a keyboard 162, a microphone (herein an array) 163, and a pointing device 161, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195.
The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110. The logical connections depicted in
When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user-input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
It should be noted that the concepts herein described can be carried out on a computer system such as that described with respect to
Memory 204 is implemented as non-volatile electronic memory such as random access memory (RAM) with a battery back-up module (not shown) such that information stored in memory 204 is not lost when the general power to mobile device 200 is shut down. A portion of memory 204 is preferably allocated as addressable memory for program execution, while another portion of memory 204 is preferably used for storage, such as to simulate storage on a disk drive.
Memory 204 includes an operating system 212, application programs 214 as well as an object store 216. During operation, operating system 212 is preferably executed by processor 202 from memory 204. Operating system 212 is designed for mobile devices, and implements database features that can be utilized by applications 214 through a set of exposed application programming interfaces and methods. The objects in object store 216 are maintained by applications 214 and operating system 212, at least partially in response to calls to the exposed application programming interfaces and methods.
Communication interface 208 represents numerous devices and technologies that allow mobile device 200 to send and receive information. The devices include wired and wireless modems, satellite receivers and broadcast tuners to name a few. Mobile device 200 can also be directly connected to a computer to exchange data therewith. In such cases, communication interface 208 can be an infrared transceiver or a serial or parallel communication connection, all of which are capable of transmitting streaming information.
Input/output components 206 include a variety of input devices such as a touch-sensitive screen, buttons, rollers, as well as a variety of output devices including an audio generator, a vibrating device, and a display. The devices listed above are by way of example and need not all be present on mobile device 200.
In particular, device 200 includes an array microphone assembly 232 and, in one embodiment, an optional analog-to-digital (A/D) converter 234, the noise reduction modules described below, and an optional recognition program stored in memory 204. By way of example, speech signals generated in response to audible information, instructions or commands from a user of device 200 are digitized by A/D converter 234. The noise reduction modules process the digitized speech signals to obtain an estimate of clean speech. A speech recognition program executed on device 200, or remotely, can perform normalization and/or feature extraction functions on the clean speech signals to obtain intermediate speech recognition results. Using communication interface 208, speech data can be transmitted to a remote recognition server (not shown), the results of which are provided back to device 200. Alternatively, recognition can be performed on device 200. Computer 110 processes speech input from microphone array 163 in a similar manner to that described above.
At this point it should be noted that, in one embodiment, the modules 304 (modules 306, 308, 310 and 312) can operate as a computer process entirely within a microphone array computing device, with the microphone array 302 receiving raw audio inputs from its various microphones and then providing a processed audio output at 314. In this embodiment, the microphone array computing device includes an integral computer processor and support modules (similar to the computing elements of
When the microphone array 302 contains only some of the modules 304, or simply contains sufficient components to receive audio signals from the plurality of microphones forming the array and provide those signals to an external computing device that performs the remaining processes, device drivers or device description files can be used. Device drivers or device description files contain data defining the operational characteristics of the microphone array, such as gain, sensitivity, array geometry, etc., and can be provided separately for the microphone array 302, so that the modules residing within the external computing device can be adjusted automatically for that specific microphone array.
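Purely as an illustration of what such a device description might contain, the following sketch defines a hypothetical descriptor in Python; the field names and layout are illustrative assumptions, not taken from any particular driver format:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class MicDescriptor:
    # Hypothetical per-microphone entry: position relative to the array
    # origin in meters, gain in dB, and a directivity label.
    position: Tuple[float, float, float]
    gain_db: float = 0.0
    directivity: str = "omni"  # e.g. "omni", "cardioid"

@dataclass
class ArrayDescriptor:
    # Hypothetical contents of a device description file for one array.
    model: str
    sample_rate_hz: int
    mics: List[MicDescriptor] = field(default_factory=list)

# Example: a three-element linear array with 5 cm element spacing.
desc = ArrayDescriptor(
    model="example-array",
    sample_rate_hz=16000,
    mics=[MicDescriptor(position=(-0.05, 0.0, 0.0)),
          MicDescriptor(position=(0.0, 0.0, 0.0)),
          MicDescriptor(position=(0.05, 0.0, 0.0))],
)
```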
In one embodiment, beamformer module 306 employs a time-invariant or fixed beamformer approach. In this manner, the desired beam is designed off-line, incorporated in beamformer module 306 and used to process signals in real time. However, although this time-invariant beamformer will be discussed below, it should be understood that this is but one exemplary embodiment and that other beamformer approaches can be used. In particular, the type of beamformer herein described should not be used to limit the scope or applicability of the spatial noise reduction module 310 described below.
Generally, the microphone array 302 can be considered as having M microphones with known positions. The microphones or sensors sample the sound field at locations P_m = (x_m, y_m, z_m), where m = {1, . . . , M} is the microphone index. Each of the M sensors has a known directivity pattern U_m(f,c), where f is the frequency band index and c represents the location of the sound source in either a radial or a rectangular coordinate system. The microphone directivity pattern is a complex function, providing the spatio-temporal transfer function of the channel. For an ideal omni-directional microphone, U_m(f,c) is constant for all frequencies and source locations. A microphone array can have microphones of different types, so U_m(f,c) can vary as a function of m.
As is known to those skilled in the art, a sound signal originating at a particular location, c, relative to a microphone array is affected by a number of factors. For example, given a sound signal, S(f), originating at point c, the signal actually captured by each microphone can be defined by Equation (1), as illustrated below:
X_m(f, P_m) = D_m(f,c) A_m(f) U_m(f,c) S(f)   Eq. 1
where D_m(f,c) represents the delay and the decay due to the distance between the source and the microphone. This is expressed as
where V is the speed of sound and F_m(f,c) represents the spectral changes in the sound due to the directivity of the human mouth and the diffraction caused by the user's head. It is assumed that the signal decay due to energy losses in the air can be ignored. The term A_m(f) in Eq. (1) is the frequency response of the system preamplifier and analog-to-digital conversion (ADC). In most cases we can use the approximation A_m(f) =
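As a minimal sketch of this capture model, the following assumes ideal omni-directional microphones (U_m = 1), a unit preamplifier/ADC response (A_m = 1), F_m = 1, and a free-field delay-and-decay term; these simplifications are illustrative assumptions, not requirements of the model above:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; the constant V in the text above

def captured_spectrum(S, freqs, mic_pos, src_pos):
    """Sketch of Eq. 1 for one microphone, X_m(f) = D_m(f,c) * S(f),
    with U_m = A_m = F_m = 1 assumed, so that D_m carries only the
    propagation delay (phase shift) and the 1/distance decay."""
    dist = np.linalg.norm(np.asarray(src_pos, float) - np.asarray(mic_pos, float))
    D = np.exp(-2j * np.pi * freqs * dist / SPEED_OF_SOUND) / dist
    return D * S

# Example: one microphone at the origin, a source 30 cm away.
freqs = np.linspace(100.0, 7900.0, 256)    # frequency bins in Hz
S = np.ones(freqs.size, dtype=complex)     # flat test source spectrum
X = captured_spectrum(S, freqs, (0.0, 0.0, 0.0), (0.3, 0.0, 0.0))
```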
The exemplary beamformer design described herein operates in the digital domain rather than directly on the analog signals received by the microphone array. Therefore, any audio signals captured by the microphone array are first digitized using conventional A/D conversion techniques. To avoid unnecessary aliasing effects, the audio signal is processed in frames longer than two times the period of the lowest frequency in the modulated complex lapped transform (MCLT) work band. For example, for a lowest frequency of 200 Hz, whose period is 5 ms, frames longer than 10 ms would be used.
The beamformer herein described uses the modulated complex lapped transform (MCLT) in the beam design because of the advantages of the MCLT for integration with other audio processing components, such as audio compression modules. However, the techniques described herein are easily adaptable for use with other frequency-domain decompositions, such as the FFT or FFT-based filter banks, for example.
Assuming that the audio signal is processed in frames longer than twice the period of the lowest frequency in the frequency band of interest, the signals from all sensors are combined using a filter-and-sum beamformer as:
where W_m(f) are the weights for each sensor m and subband f, and Y(f) is the beamformer output. (Note: Throughout this description the frame index is omitted for simplicity.) The set of all coefficients W_m(f) is stored as an N×M complex matrix W, where N is the number of frequency bins (e.g., MCLT subbands) in a discrete-time filter bank, and M is the number of microphones. A block diagram of the beamformer is provided in
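A minimal sketch of the filter-and-sum combination, assuming the weights are available as an N×M complex matrix W and the per-microphone subband spectra as a matching matrix X (both names are illustrative):

```python
import numpy as np

def filter_and_sum(W, X):
    """Filter-and-sum beamformer: Y(f) = sum over m of W_m(f) * X_m(f).
    W, X: complex arrays of shape (N, M) -- N frequency bins, M microphones.
    Returns the beamformer output Y of shape (N,)."""
    return np.sum(W * X, axis=1)

# Example with random data: N = 256 bins, M = 3 microphones.
rng = np.random.default_rng(0)
N, M = 256, 3
W = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))
X = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))
Y = filter_and_sum(W, X)
```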
The matrix W is computed using the known methodology described by I. Tashev and H. Malvar in “A New Beamformer Design Algorithm for Microphone Arrays,” Proceedings of ICASSP 2005, Philadelphia, March 2005, or in U.S. Patent Application US2005/0195988, published Sep. 8, 2005. In order to do so, the filter F_m(f,c) in Eq. (2) must be determined. Its value can be estimated theoretically using a physical model, or measured directly by using a close-talking microphone as a reference.
However, it should be noted again that the beamformer herein described is but one exemplary type; other types can be employed.
In any beamformer design, there is a tradeoff between ambient noise reduction and instrumental noise gain. In one embodiment, more significant ambient noise reduction was obtained at the expense of increased instrumental noise gain. However, this additional noise is stationary and can easily be removed using stationary noise suppression module 308. Besides removing the stationary part of the ambient noise remaining after the time-invariant beamformer, the stationary noise suppression module 308 reduces the instrumental noise from the microphones and preamplifiers.
Stationary noise suppression modules are known to those skilled in the art. In one embodiment, stationary noise suppression module 308 can use a gain-based noise suppression algorithm with MMSE power estimation and a suppression rule similar to that described by P. J. Wolfe and S. J. Godsill, in “Simple alternatives to the Ephraim and Malah suppression rule for speech enhancement,” published in the Proceedings of the IEEE Workshop on Statistical Signal Processing, pages 496-499, 2001. However, it should be understood that this is but one exemplary embodiment and that other stationary noise suppression modules can be used. In particular, the type of stationary noise suppression module herein described should not be used to limit the scope or applicability of the spatial noise reduction module 310 described below.
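By way of a sketch only, the gain of the MMSE spectral power estimator, one of the simple alternatives discussed by Wolfe and Godsill, can be applied per frequency bin as follows; the crude maximum-likelihood a priori SNR estimate below is an illustrative stand-in for the decision-directed estimate typically used in practice:

```python
import numpy as np

def mmse_power_gain(xi, gamma):
    """Gain of the MMSE spectral power estimator:
        H = sqrt( xi/(1+xi) * (1+v)/gamma ),  v = gamma*xi/(1+xi),
    where xi is the a priori SNR and gamma the a posteriori SNR."""
    v = gamma * xi / (1.0 + xi)
    return np.sqrt(xi / (1.0 + xi) * (1.0 + v) / gamma)

def suppress(Y, noise_var):
    """Apply the gain to a noisy spectrum Y given a per-bin noise
    variance estimate. xi below is a crude ML a priori SNR estimate."""
    gamma = np.maximum(np.abs(Y) ** 2 / noise_var, 1e-6)
    xi = np.maximum(gamma - 1.0, 1e-6)
    return mmse_power_gain(xi, gamma) * Y
```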
The output of the stationary noise suppression module 308 is then processed by spatial noise suppression module 310. Operation of module 310 can be explained as follows. For each frequency bin f, the stationary noise suppressor output Y(f) = R(f)·exp(jθ(f)) consists of a signal S(f) = A(f)·exp(jα(f)) and noise D(f). If it is assumed that they are uncorrelated, then Y(f) = S(f) + D(f).
Given an array of microphones, the instantaneous direction-of-arrival (IDOA) information for a particular frequency bin can be found based on the phase differences of non-repetitive pairs of input signals. In particular, for M microphones (where M is at least three) these phase differences form an (M−1)-dimensional space, spanning all potential IDOAs. In one embodiment as illustrated in
δ_1(f) (between microphones 1 and 2) and δ_2(f) (between microphones 1 and 3) exist, thereby forming a two-dimensional space. In this space each physical point from real space has a corresponding point. However, the converse is not true, i.e., there are points in this two-dimensional space without corresponding points in real space.
As appreciated by those skilled in the art, the technique described herein can be extended to more than three microphones. Generally, if an IDOA vector is defined in this space as
Δ(f) = [δ_1(f), δ_2(f), . . . , δ_{M−1}(f)]   Eq. 4
where
δ_{j−1}(f) = arg(X_1(f)) − arg(X_j(f)),   j = {2, . . . , M}   Eq. 5
then the signal and noise variances in this space can be defined as
λ_Y(f|Δ) = E[|Y(f|Δ)|²]
λ_D(f|Δ) = E[|D(f|Δ)|²]   Eq. 6
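A short sketch of Eqs. 4-6 in code, with exponential averaging standing in for the expectation operator E[·] (the smoothing constant is an illustrative choice):

```python
import numpy as np

def idoa_vector(X):
    """IDOA vector per Eqs. 4-5: Delta(f) = [delta_1(f), ..., delta_{M-1}(f)],
    with delta_{j-1}(f) = arg(X_1(f)) - arg(X_j(f)).
    X: complex array of shape (N, M) with the per-bin microphone spectra.
    Returns phase differences wrapped to [-pi, pi), shape (N, M-1)."""
    delta = np.angle(X[:, :1]) - np.angle(X[:, 1:])
    return (delta + np.pi) % (2.0 * np.pi) - np.pi

def update_variance(lam, value, alpha=0.95):
    """Running estimate of E[|.|^2] in Eq. 6, realized as exponential
    averaging; alpha = 0.95 is an illustrative smoothing constant."""
    return alpha * lam + (1.0 - alpha) * np.abs(value) ** 2
```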
The a priori spatial SNR ξ(f|Δ) and the a posteriori spatial SNR γ(f|Δ) can be defined as follows:
Based on these equations and the minimum-mean square error spectral power estimator, the suppression rule can be generalized to
where ∂(f|Δ) is defined as
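These definitions can be written out in one consistent form as follows, following the MMSE spectral power estimator of Wolfe and Godsill cited earlier; this is offered as a plausible reading rather than a verbatim reproduction of the referenced equations:

```latex
\xi(f\mid\Delta) = \frac{\lambda_Y(f\mid\Delta) - \lambda_D(f\mid\Delta)}
                        {\lambda_D(f\mid\Delta)},
\qquad
\gamma(f\mid\Delta) = \frac{|Y(f)|^{2}}{\lambda_D(f\mid\Delta)},
\qquad
H(f\mid\Delta) = \sqrt{\frac{\xi(f\mid\Delta)}{1+\xi(f\mid\Delta)}
                 \cdot \frac{1+\partial(f\mid\Delta)}{\gamma(f\mid\Delta)}},
\qquad
\partial(f\mid\Delta) = \frac{\xi(f\mid\Delta)\,\gamma(f\mid\Delta)}
                             {1+\xi(f\mid\Delta)}.
```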
Thus, for each frequency bin of the beamformer output, the IDOA vector Δ(f) is estimated based on the phase differences of the microphone array input signals {X_1(f), . . . , X_M(f)}. The spatial noise suppressor output for this frequency bin is then computed as
A(f)=H(f|Δ)·|Y(f)| Eq. 11
which can be used to obtain an estimate of the clean speech signal (desired signal) from S(f) = A(f)·exp(jθ(f)).
Note that this is a gain-based estimator and accordingly the phase of the beamformer output signal is directly applied.
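A brief sketch of this final step, assuming H has already been evaluated for the bin's IDOA cell (names illustrative):

```python
import numpy as np

def apply_spatial_gain(Y, H):
    """Eq. 11: amplitude estimate A(f) = H(f|Delta) * |Y(f)|, with the
    phase theta(f) of the beamformer output applied directly, giving the
    clean-speech estimate S(f) = A(f) * exp(j*theta(f))."""
    A = H * np.abs(Y)
    return A * np.exp(1j * np.angle(Y))
```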
Method 500 provided in
At step 504, a determination is made as to whether the frame has a desired signal relative to noise therein. In the embodiment described, the desired signal is speech activity from the user, for example, whether the user of the headset having the microphone array is speaking. (However, in another embodiment, the desired signal could take any number of forms.)
At step 504, in the exemplary embodiment herein described, each audio frame is classified as having speech from the user therein or just having noise. In
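The frame classifier itself is not detailed here; purely for illustration, a frame could be labeled against a running noise floor with a simple energy threshold, where both the threshold and the smoothing factor are arbitrary choices:

```python
import numpy as np

def classify_frame(frame, noise_floor, threshold=3.0, alpha=0.98):
    """Toy speech/noise frame classifier. Returns (is_speech, noise_floor).
    The noise floor is tracked only on frames labeled as noise."""
    energy = float(np.mean(np.asarray(frame, dtype=np.float64) ** 2))
    is_speech = energy > threshold * noise_floor
    if not is_speech:
        noise_floor = alpha * noise_floor + (1.0 - alpha) * energy
    return is_speech, noise_floor
```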
At step 506, based on whether the user is speaking during a given frame, the signal or noise spatial variances λ_Y and λ_D, as provided by Eq. 6, are calculated for each frequency bin and used to update the corresponding signal or noise model at the position in the dimensional space computed at step 502.
In practical realizations of the proposed spatial noise reduction algorithm implemented by module 310, the (M−1)-dimensional space of the phase differences is discretized. Empirically, it has been found that using 10 bins to cover the range [−π,+π] provides adequate precision and results in a phase-difference resolution of 36°. This converts λ_Y and λ_D into square matrices for each frequency bin. In addition to updating the current cell in λ_Y and λ_D, the averaging operator E[·] can perform “aging” of the values in the other matrix cells.
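For the three-microphone (two-dimensional) case, the discretization and the "aging" update might be sketched as follows; the averaging and aging factors, and the decay-toward-zero interpretation of aging, are illustrative choices:

```python
import numpy as np

N_BINS = 10  # cells covering [-pi, +pi], i.e. 36 degrees per cell

def cell_index(delta):
    """Map one phase difference in [-pi, pi] to a cell index 0..9."""
    return min(int((delta + np.pi) / (2.0 * np.pi) * N_BINS), N_BINS - 1)

def update_model(lam, delta1, delta2, power, alpha=0.9, aging=0.999):
    """Update the 10x10 variance matrix lam (lambda_Y or lambda_D) of one
    frequency bin: slowly 'age' every cell, then exponentially average the
    newly observed power into the cell selected by the current IDOA."""
    lam *= aging
    i, j = cell_index(delta1), cell_index(delta2)
    lam[i, j] = alpha * lam[i, j] + (1.0 - alpha) * power
    return lam
```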
In one embodiment, to increase the adaptation speed of the spatial noise suppressor, the signal and noise variance matrices λ_Y and λ_D are computed for a limited number of equally spaced frequency subbands. The values for the remaining frequency bins can then be computed using linear interpolation or a nearest-neighbor technique, as sketched below. Also, in another embodiment, the computed value for a frequency bin can be duplicated or used for other frequencies having the same dimensional space position. In this manner, the signal and noise variance matrices λ_Y and λ_D can adapt more quickly, for example, to moving noise.
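One way to realize the linear interpolation across subbands, sketched with numpy (the anchor-subband spacing is an illustrative choice):

```python
import numpy as np

def interpolate_over_bins(anchor_bins, anchor_values, n_bins):
    """Linearly interpolate a quantity computed only at a limited set of
    equally spaced anchor frequency bins out to all n_bins bins."""
    return np.interp(np.arange(n_bins), anchor_bins, anchor_values)

# Example: one variance-matrix cell computed at every 32nd of 256 bins.
anchors = np.arange(0, 256, 32)
values = np.linspace(1.0, 0.1, anchors.size)   # placeholder cell values
full = interpolate_over_bins(anchors, values, 256)
```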
By way of example, the variance matrices for the subband around 1000 Hz are shown in
Method 700 in
Although the subject matter has been described in language directed to specific environments, structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not limited to the environments, specific features or acts described above as has been held by the courts. Rather, the environments, specific features and acts described above are disclosed as example forms of implementing the claims.
The present application is a divisional of and claims priority of U.S. patent application Ser. No. 11/316,002, filed Dec. 22, 2005, the content of which is hereby incorporated by reference in its entirety.
| | Number | Date | Country |
|---|---|---|---|
| Parent | 11316002 | Dec 2005 | US |
| Child | 12464390 | | US |