One aspect of the disclosure herein relates to digital signal processing techniques for reducing audible noise from an audio signal that contains voice or speech that is being picked up by a mobile phone.
Mobile phones can be used in acoustically different ambient environments, where the user's voice (speech) that is picked up during a phone call or during a recording session is usually mixed with a variety of types and levels of other undesirable sounds (including ambient sounds and the voice of another talker). These undesirable sounds (also referred to as noise) are often picked up on the microphone(s) and thus often degrade the acquisition of the desired speech. For example, pickup of such undesirable sounds can reduce the intelligibility of the user's speech as heard at the far end of a phone call. Pickup of such undesirable sounds can also lead to significant voice distortion, particularly after having been processed by voice coders in a cellular communication network. For at least these reasons, it is typically desirable to apply a high-quality digital noise suppression process to the mixture of speech and noise of the acquired audio signal before passing the signal to the next steps in its transmission to the far end, e.g. passing the signal to a cell voice coder in a baseband communications chip of the mobile phone.
In the handset mode of operation (against the ear) in some current mobile phones, audio signals from more than one microphone can be used together in a multiple (e.g. two)-microphone noise suppression process. The general approach relies on the fact that some microphones, or combinations of microphones, can be used more effectively than others to estimate either the desired speech or the unwanted noise components. Such estimates help in the noise suppression process. In some cell phones, the appropriate choice of microphones or combination of microphones is clear, e.g. microphones closer to the user's mouth would have a higher signal-to-noise ratio (SNR) than those farther away, the signal being the desired speech. SNRs can also be tested or computed, a priori, during the design process. This could be done by either measuring (with known noise) or estimating (with unknown noise) a stationary noise spectrum for the microphone signal, and then further estimating spectrums of the desired speech when such speech is active. The ratio of the two spectrums is used to estimate the SNR. The microphone signal having the largest SNR is then selected to be the voice dominant input of the two-microphone noise suppression (NS) process. Conversely, the microphone having the lower SNR can be used to better estimate or predict the noise spectrum, both stationary and dynamic.
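The a-priori SNR comparison described above can be sketched as follows. This is a minimal illustration only: the function names, the dictionary layout, and the use of a single broadband ratio (summing all bins) are assumptions made for the example and are not taken from the disclosure.

```python
import math

def snr_db(speech_power, noise_power, floor=1e-12):
    """Broadband SNR in dB from a speech power spectrum estimate and a
    stationary noise power spectrum estimate (lists of per-bin power).
    Summing the bins into one broadband figure is an assumption."""
    speech = max(sum(speech_power), floor)
    noise = max(sum(noise_power), floor)
    return 10.0 * math.log10(speech / noise)

def pick_inputs(mics):
    """mics: dict mapping a (hypothetical) microphone name to a tuple
    (speech_power_spectrum, noise_power_spectrum).
    Returns (voice_dominant_mic, noise_reference_mic)."""
    ranked = sorted(mics, key=lambda name: snr_db(*mics[name]), reverse=True)
    # Largest SNR becomes the voice dominant input, lowest SNR the noise reference.
    return ranked[0], ranked[-1]

# Hypothetical two-microphone handset: the bottom microphone is nearer the mouth.
example = {
    "bottom": ([4.0, 4.0], [1.0, 1.0]),
    "top": ([2.0, 2.0], [1.0, 1.0]),
}
voice_mic, noise_mic = pick_inputs(example)
```

In this example the bottom microphone is about 3 dB better than the top microphone, so it would be chosen as the voice dominant input.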
The inventors herein have recognized that, while effective, a two-microphone noise suppression process has some limits as it is sometimes not able to accurately estimate either the desired speech (voice) spectrum or the noise spectrum. For example, sometimes the two-microphone noise suppression process does not work well in the presence of transient background noise (including a competing talker). In addition, the desired speech component and noise component can often be present in an acquired audio signal at high levels on both microphones. Thus an a-priori determination or selection of mics for noise estimation and those for voice estimation may not hold in all circumstances. Noise estimation, which is a computation or estimate of the noise component by itself, plays a key role when trying to remove noise components from a microphone signal without distorting the speech components therein. For greater accuracy, a multi-microphone noise estimation process needs i) increased “voice separation”, where voice separation refers to the sound pressure differences of the desired speech as seen on one set of microphones compared to another group of microphones, and ii) improved “noise matching”, where noise matching refers to how well the noise picked up on one group of microphones matches that on another group of microphones. Increased voice separation improves the ability of the audio system to estimate the desired speech spectrum and speech activity. Better noise matching improves the ability of one group of microphones, often those with lower SNR, to be used to predict the noise on another group of microphones.
Practically, voice separation can be defined as a measure of the difference between the energy or power spectrums of the desired speech component as seen on two audio channels, an audio channel being an individual microphone or a linear combination of microphones, that are active during a phone call or during a recording session. If the noise components on the two channels are approximately the same (there is good noise matching), the voice separation value itself can be viewed as the difference between the energy or power spectrums, or even the SNRs, of the two channels. Thus, when desired speech is active, an energy or power spectrum difference between the two channels is expected, in line with the SNR difference. The parameters of a Voice Activity Detector (VAD) or of a noise estimator, where the latter could be part of a noise suppressor, can therefore be adjusted based on the voice separation value. Determinations of voice activity can be made in different frequency sub-bands, which typically helps to improve both the noise estimation process and the speech estimation process. Generally, as the voice separation value increases, the accuracy of VAD decisions and signal estimations may be improved. Increased voice separation also helps differentiate desired speech from other signals, like transient noise, which may show similar properties to speech.
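As a rough illustration of the definition above, the voice separation value can be expressed per frequency sub-band as a level difference in decibels between the two channels. The function name, the dB scale, and the numerical floor are assumptions chosen for this sketch.

```python
import math

def voice_separation_db(power_a, power_b, floor=1e-12):
    """Per-sub-band level difference (dB) between channel A and channel B.
    power_a, power_b: per-sub-band power measured while desired speech is
    active. A positive value means the sub-band is stronger on channel A."""
    return [
        10.0 * math.log10(max(a, floor) / max(b, floor))
        for a, b in zip(power_a, power_b)
    ]

# Two sub-bands: speech is 10 dB and 20 dB stronger on channel A.
separation = voice_separation_db([10.0, 100.0], [1.0, 1.0])
```

When the noise on the two channels is well matched, these values can be read as approximate per-band SNR differences between the channels.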
Noise matching considers the characteristics of noises captured by the two audio pickup channels, an audio channel being an individual microphone or a linear combination of microphones, that are active during a phone call or during a recording session. For a pair of ideal omni-directional microphones, noises that are either diffuse or emanating from a very far distance (noises in the far field of the microphones) often will show a very similar sound pressure pattern on the two microphones. Though there may be differences in the time of arrival of signals due to the microphones being separated in space, for two closely spaced microphones the general power spectrums of audio signals received by the two microphones can be very similar. Practically, when microphones are mounted on a device, covered with meshes, and placed against surfaces, the signals seen on the two microphones can contain some spectral differences. In this case, even with diffuse or far-field sources, the signals produced by the microphones are different and the spectral shapes of the responses, and thus the noise, do not “match”. In some cases, a correction factor may be determined and applied to compensate for any gross frequency variation between responses of the various microphones or combinations of microphones, such that the spectral shapes of the responses “match”. This enables the system to use one set of microphones to better predict the noise on another set of microphones. When noise matching is achieved between signals, it also means that the voice separation value of the signals relates more directly to the SNR differences between groups of microphones. Thus, VADs and other estimators can operate more effectively. 
If, however, groups of mics, either due to separation in space or other effects, acquire audio signals including very different noise components, and as a result there is no fixed compensation that can be pre-determined and applied (such as the correction factor discussed above), and noise matching is therefore not achieved consistently, then prediction of noise, VAD determinations and speech estimation may be degraded.
An embodiment herein aims to maintain the effectiveness (or accuracy) of a noise estimation process in different ambient environments. In particular, when using beamforming or a combination of microphones to produce each audio channel, the maintenance of voice separation and noise matching may not be trivial. In fact, beamforming by itself can sometimes create a frequency dependent scaling of components in an audio channel, which by its very nature has an effect both on voice separation and on noise matching. At the same time, beamforming is very useful in compensating for and adapting to different environments and device positions relative to the desired talker, etc. Here, the audio system aims to maintain sufficiently large voice separation and noise-matching simultaneously in a variety of cases. The audio system may improve voice separation and noise-matching even over cases where acceptable voice separation and noise-matching can be achieved by a non-adaptive system. In the audio system, each audio channel or “beam” can be defined as a linear combination of the raw signals available from multiple microphones. Such a group of microphones often constitutes a microphone array or a microphone cluster. For example, on a mobile phone, a cluster may be localized on one part of the phone, e.g. the bottom. A cluster may include some microphones from the bottom and some microphones from the top.
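The notion of a beam as a linear combination of raw microphone signals can be sketched as a weighted sum of time-aligned sample sequences. This is a simplified illustration: the function name and the use of one real-valued scalar weight per microphone are assumptions, and a practical beamformer would typically use per-microphone delays or frequency-dependent weights.

```python
def form_beam(mic_signals, weights):
    """Combine raw microphone signals into one audio channel ("beam").
    mic_signals: list of equal-length sample lists, one per microphone.
    weights: one scalar weight per microphone (simplifying assumption)."""
    if len(mic_signals) != len(weights):
        raise ValueError("need exactly one weight per microphone signal")
    n_samples = len(mic_signals[0])
    if any(len(sig) != n_samples for sig in mic_signals):
        raise ValueError("all microphone signals must have the same length")
    # Each output sample is the weighted sum of the microphones at that instant.
    return [
        sum(w * sig[i] for w, sig in zip(weights, mic_signals))
        for i in range(n_samples)
    ]

# Hypothetical two-microphone difference beam (weights +1 and -1).
beam = form_beam([[1.0, 2.0], [0.5, 0.5]], [1.0, -1.0])
```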
An embodiment herein aims to address the problem of how to adaptively or dynamically, e.g., during in-the-field use of a mobile phone that can be in a changing ambient environment, analyze available microphone signals that generate a plurality of acoustic beams to determine an appropriate pair or group of beams, such that at least one pair shows both good voice separation and good noise matching. In one embodiment, one acoustic beam, often the one with larger SNR, is used to pick-up a desired local voice (referred to as a “voice beam”) and the other beam, typically having lower SNR, is used to pick up undesired ambient noise (referred to as a “noise beam”). Together the voice and noise beams drive VAD decisions, and the prediction and estimation processes previously mentioned. In this regard, in one embodiment, three or more acoustic pickup beams may be produced by any suitable combination of the microphone signals such that the acoustic pickup beams are simultaneously available, and a pair of the beams may be selected from these three or more available beams as inputs to a two-channel noise suppression process or a VAD. The analysis of the microphone signals and the available beams may be based on a number of factors, including positions of the microphones, and location information such as: the location of the source of the local desired voice relative to the positions of the microphones, the location of the source(s) of the ambient noise(s) relative to the positions of the microphones, the direction of the audio signal including the local voice relative to the position of the microphones, and the direction of the noise signal including the ambient noise relative to the position of the microphones. 
In one embodiment, these factors are also analyzed in order to determine which microphones should be assigned to produce a beam to pick up ambient noise (referred to as a “noise beam”) and to produce a beam to pick up a desired local voice (referred to as a “voice beam”).
In order to improve the reliability or accuracy of noise-matching and voice separation (which is expected to further improve the accuracy of the noise estimate computed by the noise suppression process), the beams may also be coordinated and designed. The acoustic pickup beams may be coordinated and designed based on a variety of factors including locations of the microphones, local voice and (ambient) noises as discussed above. In some embodiments, coordination and design of the beams may also include shaping the beams, directing the beams and identifying or assigning a subset of the microphones used to produce the beam. In this regard, in one embodiment, it is expected that the local voice or primary talker is closer to a first subset of microphones than another subset of microphones, and the acoustic pick up beam defined by the signals available from the first subset of microphones is considered to be the “voice beam”. In this embodiment, a second subset of microphones is assigned to produce a beam to pick up the ambient noise, and the acoustic pick up beam defined by the signals available from this subset of microphones is considered to be the “noise beam”. In other embodiments, the audio system may use audio-based blind source separation and estimation, or a camera, to locate a primary talker and/or any noise sources in the environment and to correlate this information with audio signals in order to determine which microphones should be used to generate a voice beam and which microphones should be used to generate a noise beam.
In one embodiment, possible pairs of noise beams and voice beams that may be produced by the microphone signals are tested based on the positions of the microphones, the locations of the local voice and the ambient noise and the directions of the local voice and the ambient noise to determine which beam pairs maintain thresholds for voice-separation and noise-matching. For example, thresholds are defined to maintain sufficiently large voice separation and noise-matching and two or more acoustic pickup beams are selected for input to a noise suppressor based on satisfaction of the thresholds. To determine whether there is sufficient noise-matching between two acoustic pick up beams, in one embodiment, instantaneous and average ratios are obtained over a time interval between a strength of a noise component in one beam and a strength of a noise component in another beam. In this regard, a conventional noise estimator may be used to extract the respective noise components, so that the respective strengths of the noise components may be calculated. The strengths of the respective noise components may be computed as power spectra in the spectral or frequency domain. The instantaneous and average ratios of the strengths of the respective noise components on the two pickup beams are then compared to the thresholds for noise-matching and if the thresholds for noise-matching are met, these beams are determined as being acceptable for noise-matching. Furthermore, a computed statistical central tendency of the difference in instantaneous and average ratios between the two beams can also be considered. This characterized central difference, which can be considered a long-term average of the differences, can be used to compute the correction factor for noise-matching. 
In one embodiment, the correction factor may be applied to compensate for any gross or stable frequency differences between responses of the various beams, such that after compensation the spectral shapes of the responses improve in matching.
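One possible reading of the noise-matching test and correction factor is sketched below for a single frequency band. The function names, the dB scale, the threshold values, the use of the median as the central tendency, and the choice to apply the correction before comparing against the thresholds are all assumptions made for illustration.

```python
import math
from statistics import mean, median

def noise_ratios_db(noise_power_a, noise_power_b, floor=1e-12):
    """Instantaneous (per-frame) ratio in dB between the noise strength on
    beam A and the noise strength on beam B."""
    return [
        10.0 * math.log10(max(a, floor) / max(b, floor))
        for a, b in zip(noise_power_a, noise_power_b)
    ]

def check_noise_matching(noise_power_a, noise_power_b,
                         inst_limit_db=3.0, avg_limit_db=1.5):
    """Returns (matched, correction_db). Both limits are hypothetical.
    correction_db is the central tendency (median) of the per-frame ratios,
    treated here as the stable gross offset between the two beams."""
    inst = noise_ratios_db(noise_power_a, noise_power_b)
    correction_db = median(inst)
    # Residual mismatch after removing the gross offset.
    residual = [r - correction_db for r in inst]
    matched = (max(abs(r) for r in residual) <= inst_limit_db
               and abs(mean(residual)) <= avg_limit_db)
    return matched, correction_db

# Beam A is consistently about 3 dB louder in noise: matched after correction.
stable = check_noise_matching([2.0, 2.0, 2.0, 2.0], [1.0, 1.0, 1.0, 1.0])
# The ratio jumps between 0 dB and 20 dB: no fixed correction can match these.
erratic = check_noise_matching([1.0, 100.0, 1.0, 100.0], [1.0, 1.0, 1.0, 1.0])
```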
To determine whether there is sufficiently large voice separation between two acoustic pick up beams, in one embodiment, initial ratios are obtained between a strength of the noise beam and a strength of the voice beam. The strengths of the respective beams may be computed as power spectra in the spectral or frequency domain. These ratios are considered during intervals of time when it is determined by a VAD or other means that the desired local talker is active. In embodiments in which a correction factor for noise-matching is used, this factor is applied appropriately to the initial ratios to account for the effect the correction factor would have on initial ratios if the correction factor had been applied first. Then the instantaneous and average ratios of the corrected ratios are obtained and compared to thresholds for voice separation.
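The voice separation test can be sketched in a similar way for a single frequency band. The sketch assumes the ratio is expressed as voice beam over noise beam in dB, that the noise-matching correction is given as a dB offset with the same orientation, and that the thresholds shown are placeholders; none of these conventions is stated in the disclosure.

```python
import math
from statistics import mean

def check_voice_separation(voice_power, noise_power, vad, correction_db=0.0,
                           inst_min_db=3.0, avg_min_db=6.0, floor=1e-12):
    """Returns (separated, average_db).
    voice_power, noise_power: per-frame strengths of the voice and noise beams.
    vad: per-frame flags, truthy when the desired local talker is active.
    correction_db: noise-matching correction (hypothetical sign convention:
    voice beam level minus noise beam level for matched noise)."""
    corrected = []
    for v, n, active in zip(voice_power, noise_power, vad):
        if not active:
            continue  # only frames with active speech are considered
        initial_db = 10.0 * math.log10(max(v, floor) / max(n, floor))
        corrected.append(initial_db - correction_db)
    if not corrected:
        return False, None  # no speech frames observed, nothing to decide
    average_db = mean(corrected)
    separated = min(corrected) >= inst_min_db and average_db >= avg_min_db
    return separated, average_db

good = check_voice_separation([10.0, 1.0, 10.0], [1.0, 1.0, 1.0], [1, 0, 1])
weak = check_voice_separation([10.0, 1.0, 10.0], [1.0, 1.0, 1.0], [1, 0, 1],
                              correction_db=6.0)
```

In the second call, a 6 dB gross offset between the beams accounts for most of the apparent 10 dB difference, so the corrected separation of 4 dB falls below the placeholder average threshold.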
If a pair of a voice beam and a noise beam is determined to satisfy the thresholds for noise-matching and voice separation, these beams can be selected for input to a noise suppressor or a voice activity detector (VAD). The selected voice beam that is voice dominant is provided as a voice input signal to a multi-channel noise suppression process or VAD, and the noise beam that is noise dominant is provided as a noise input signal to a multi-channel noise suppression process or VAD. This should enable the noise suppression process to produce more accurate voice activity decisions and noise and voice estimates, which in turn should lead to a less distorted, noise-suppressed voice output signal produced by the noise suppression process. In other embodiments, more than two beams may be selected as input to the multi-channel noise suppressor or the VAD. Also, in embodiments in which multiple pairs of beams satisfy the thresholds for voice separation and noise-matching, selection of the beams balances the individual measures of voice separation and noise matching in order to select an appropriate beam pair. Long-term trends of the individual measures of voice separation and noise matching may also be considered, as well as the past selection of beams. If no pair of beams is found to satisfy the thresholds for voice separation and noise-matching, the audio system may default to a single-channel noise suppression process, for example using the beam with the best estimated SNR as the single input to such a single-channel suppression process.
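The selection logic, including the single-channel fallback, might be sketched as follows. The data layout, the threshold values, and in particular the scoring rule used to balance the two measures (separation minus mismatch, both in dB) are assumptions for illustration; the disclosure does not specify how the balance is computed.

```python
def select_beams(pair_metrics, beam_snr_db, sep_min_db=6.0, mismatch_max_db=3.0):
    """pair_metrics: dict mapping (voice_beam, noise_beam) names to a tuple
    (voice_separation_db, noise_mismatch_db).
    beam_snr_db: dict mapping each beam name to its estimated SNR in dB.
    Returns (mode, beams). Thresholds and scoring are hypothetical."""
    qualifying = {
        pair: metrics for pair, metrics in pair_metrics.items()
        if metrics[0] >= sep_min_db and metrics[1] <= mismatch_max_db
    }
    if qualifying:
        # Balance the two measures: reward separation, penalize mismatch.
        best = max(qualifying, key=lambda p: qualifying[p][0] - qualifying[p][1])
        return "two_channel", best
    # No pair meets both thresholds: fall back to the single best-SNR beam.
    return "single_channel", (max(beam_snr_db, key=beam_snr_db.get),)

# Hypothetical candidate beams b1..b3 with made-up measurements.
pairs = {
    ("b1", "b2"): (8.0, 1.0),
    ("b1", "b3"): (10.0, 2.5),
    ("b2", "b3"): (4.0, 0.5),
}
snrs = {"b1": 12.0, "b2": 5.0, "b3": 2.0}
choice = select_beams(pairs, snrs)
fallback = select_beams(pairs, snrs, sep_min_db=20.0)
```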
In order to improve control over coordination and design of the acoustic pickup beams, the microphones may be considered collectively as a microphone array or cluster whose geometrical relationship may be fixed and “known”. In these embodiments, in the case where there are two or more microphone clusters, and each cluster can produce a respective pick up beam, the microphone clusters are spatially separated, and a cluster may be defined as two or more microphones whose relative distance to each other is smaller than a distance to one of the microphones of another cluster.
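The geometric cluster definition above can be checked with a short distance test. The sketch uses a strict reading, requiring the largest distance inside the cluster to be smaller than the smallest distance to any microphone of the other cluster; that interpretation, the function name, and the example coordinates are assumptions.

```python
import math
from itertools import combinations

def is_cluster(cluster, other_mics):
    """cluster, other_mics: lists of microphone positions as coordinate tuples.
    Returns True when every pair inside `cluster` is closer together than any
    microphone of `cluster` is to a microphone in `other_mics`."""
    if len(cluster) < 2 or not other_mics:
        return False  # a cluster needs two or more microphones and a reference
    largest_intra = max(math.dist(p, q) for p, q in combinations(cluster, 2))
    smallest_inter = min(math.dist(p, q) for p in cluster for q in other_mics)
    return largest_intra < smallest_inter

# Hypothetical handset, positions in centimeters: two microphones at the
# bottom edge and two at the top edge.
bottom = [(0.0, 0.0), (1.0, 0.0)]
top = [(0.0, 14.0), (1.0, 14.0)]
```

With these coordinates, `is_cluster(bottom, top)` holds, while a grouping that mixes one bottom and one top microphone does not.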
In one embodiment, the approach described above is used together with phase-based interference cancellers.
The above summary does not include an exhaustive list of all aspects of the present invention. It is contemplated that the invention includes all systems and methods that can be practiced from all suitable combinations of the various aspects summarized above, as well as those disclosed in the Detailed Description below and particularly pointed out in the claims filed with the application. Such combinations have particular advantages not specifically recited in the above summary.
The embodiments herein are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that references to “an” or “one” embodiment of the invention in this disclosure are not necessarily to the same embodiment, and they mean at least one. Also, in the interest of conciseness and reducing the total number of figures, a given figure may be used to illustrate the features of more than one embodiment of the invention, and not all elements in the figure may be required for a given embodiment.
Several embodiments of the invention with reference to the appended drawings are now explained. Whenever aspects are not explicitly defined, the scope of the invention is not limited only to the parts shown, which are meant merely for the purpose of illustration. Also, while numerous details are set forth, it is understood that some embodiments of the invention may be practiced without these details. In other instances, well-known circuits, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
The processes described below are performed by an audio system whose user as depicted in
A number of microphones 1 and 2 may be integrated within the housing of the audio system. An example is depicted in
In the embodiment of
It should be understood that other arrangements of microphones that may be viewed collectively as a microphone array or cluster whose geometrical relationship may be fixed and “known” at the time of manufacture are possible, e.g. arrangements of two or more microphones in the housing of a tablet computer, a laptop computer, or a desktop computer. In one embodiment, arrangements of any suitable number of microphones and microphone clusters in the housing of a tablet computer, a laptop computer, or a desktop computer are possible. In one embodiment, distributed arrangements of microphones and microphone clusters are possible. For example, the microphones and microphone clusters of the audio system may be arranged in separate housings of tablet computers, laptop computers, desktop computers, mobile phones or other audio systems. In one embodiment, the apparatuses and processes described herein may be implemented in audio systems for homes and vehicles, as well as speakers in consumer electronics devices.
Returning to
The microphones 1 and 2 including the individual sensitivities and directivities of the microphones included therein may be known and considered when configuring the beam analyzers 150 and 155, or defining each of the beams, such that the microphones 1 and 2 are each treated as a microphone array or cluster. Each of the beam analyzers 150 and 155 may be a digital processor that can utilize any suitable combination of the microphone signals in order to produce a number of acoustic pick up beams. Glancing at
Beams of other shapes and/or using other combinations of the microphones 1 and 2 (including ones that are not shown) are possible and may be suitable for a particular type of audio system, as a function of the shape of the housing, the geometrical relationship between the microphones 1 and 2, the sensitivities and directivities of the microphones 1 and 2, and the expected holding positions of the audio system by the user (e.g., handset mode vs. speaker phone mode). Design and production of the beams is discussed in more detail below with respect to the embodiments illustrated in
Beam analyzers 150 and 155 receive as input the signals from microphones 1 and 2 and analyze the microphone signals to coordinate and design beams to be produced and tested. In one embodiment, beam analyzers 150 and 155 may each include a digital processor that can utilize any suitable combination of the microphone signals in order to produce a number of acoustic pick up beams, and pairs of the produced beams are analyzed for voice separation and noise matching. The design of the beams can include a selection of a pair of beams from a plurality of beam pair options. Such a design can include a more flexible definition of beams based on analysis of the device in use. One example would be to use estimations of a direction of arrival of both the desired speech source and other noise sources. Beam analyzers 150 and 155 may be communicatively coupled to each other in order to share information such as statistics on beams and microphones. The noise suppressor 104 can also pass information back to the beam analyzers. In this regard, noise suppressor 104 may be communicatively coupled to one or more of the beam analyzers in order to share information such as, for example, voice activity information.
Thus, beam analyzer 153 may generate a number of beams in order to analyze beam pairs, and may test candidate beam pairs in order to select a pair of beams having appropriate voice separation and appropriate noise matching. Beam analyzer 153 may share the generated beams with beam selectors 130 and 135. In addition, beam analyzer 153 may provide to the beam selectors 130 and 135 the selection information indicating the pair of beams to be selected from the plurality of candidate beams. In particular, voice beam selector 130 may receive the candidate beams and the selection information from beam analyzer 153, and may select the appropriate voice beam to forward as the voice dominant input to noise suppressor 104. Noise beam selector 135 may also receive the candidate beams and the selection information from beam analyzer 153, and may select the appropriate noise beam to forward as the noise dominant input to noise suppressor 104.
Generally, in the embodiment of
As one example, suitable combinations of the signals from microphones 1 and 2 may generate a number of acoustic pick up beams. Beam analyzers 150 and 155 may each analyze the received microphone signals to determine which of the microphone signals will produce a beam that captures a desired source (such as a local voice) and an undesired source (such as ambient noise), respectively. The determination may be based on a variety of factors. For example, beam analyzers 150 and 155 may each determine a beam to be selected based on positions of the microphones 1 and 2, which may be known at the time of manufacture. In addition, the audio system may obtain location information about the source of the voice signal and/or the source of the noise signal relative to the positions of the microphones. The directions of a voice signal and/or a noise signal relative to the positions of the microphones may also be estimated. In this regard, in some embodiments, a camera may be used to locate a primary talker and/or any noise sources in an environment and to correlate this information with microphone signals, and the camera may provide this information to beam analyzers 150 and 155. In this way, beam analyzers 150 and 155 obtain the locations of the sources of the local voice and ambient noise, such that a “voice” beam may be selected and designed to pick up a desired voice and a “noise” beam may be selected and designed to pick up ambient noises. Also, in some embodiments, a blind source estimation technique may be used to analyze the microphone signals to determine locations and directions of a voice signal and a noise signal. For example, since the locations of the microphones are known, it is possible to perform blind source estimation to determine information on an angle at which the noise or voice source is located relative to the location of the microphones. Generally, beam analyzers 150 and 155 communicate to share information in order to select the voice and noise beams. 
In one embodiment, beam analyzers 150 and 155 compute how well a pair of beams match in estimating noise. Elements similar to those in the noise suppressor, such as Time to Frequency Calculators, Power Spectrum Calculators, Voice Activity Detectors, and Undesired Signal Power Spectrum Estimators, can also be included in the beam analyzers. In one embodiment, beam analyzers 150 and 155 compute the difference in signal strength between the beams when the desired speech is present. In both of these embodiments, such comparisons can be based on power spectrums of the two beams, which advantageously allows noise matching and voice separation to be considered both in time and frequency. In another embodiment the average difference in level between beams is determined when doing comparisons on noise matching. This average difference in level, if it shows a stable tendency over time, e.g. it does not change beyond a certain level (e.g. set or predetermined threshold) over time, can be used to compensate for gross average differences which may be due to the beamforming itself. This compensation is accounted for in both noise-matching and voice-separation determinations.
As mentioned above, in one embodiment, production of the beams by beam analyzers 150 and 155 includes design and coordination of a beam to pick up a desired local voice (referred to as a “voice beam”) and a beam to pick up ambient noise (referred to as a “noise beam”), including shaping the beams and directing the beams.
In one embodiment, the positions of the microphones 1 and 2, the locations of the local voice and noise sources and the directions of the local voice and noise sources may be used together with the digitized microphone signals to determine which of microphones 1 and 2 should be assigned to produce the beam to pick up ambient noise (a “noise beam”) and which should be assigned to produce the beam to pick up a desired local voice (a “voice beam”). Also, in one embodiment, assignment of the microphone clusters includes assigning a subset of the microphones used to produce the beam. For example, in the embodiment of
In one embodiment, rather than performing beam forming, the beam analyzer forwards the digitized microphone signals from microphones 1 to a “voice” beamformer, and forwards the digitized microphone signals from microphones 2 to a “noise” beamformer. In this embodiment, the beam analyzer may be communicatively coupled to the beamformers in order to share information needed for beam forming, such as, for example, assignment information indicating a first subset of microphones to be used to generate a voice beam and a second subset of microphones to be used to generate a noise beam. The beamformers may each be a digital processor that can utilize any suitable combination of the microphone signals in order to produce a number of acoustic pick up beams. For example, the voice beamformer may produce a voice beam using a combination of at least two of the microphones 1 to pick up the desired local voice, according to the instructions provided by the beam analyzer, and the noise beamformer may produce a noise beam using a combination of at least two of the microphones 2 to pick up the ambient noise, according to the instructions provided by the beam analyzer, such that criteria for voice separation and noise-matching are maintained as described above. The beam analyzer may also provide as input to the voice beamformer and the noise beamformer the instructions for design and production of the beams, as described above in connection with
Returning to the embodiment of
In one embodiment, beam analyzers 150 and 155 obtain two main states of the audio system, one associated with an active state of the local voice and another associated with an inactive state of the local voice. For example, during in-the-field use of a mobile phone, the system may obtain a first state associated with the local voice (or near end desired source) being active and a second state associated with the local voice being inactive. In one embodiment, the noise suppressor 104 itself supplies the system with information regarding these two main states. For example, a VAD may be used to determine whether audio frames are in the active state of the local voice (e.g., when the VAD outputs a decision indicating speech, VAD=1) and the inactive state of the local voice (e.g., when the VAD outputs a decision indicating non-speech, VAD=0). In other embodiments, state information may be determined based on differences between strengths of beams or other statistics regarding the audio system. Voice activity decisions can also be made in a soft way, e.g. as a probability of local voice being present in which case there is a value from 0 to 1, or in different frequency subbands.
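A soft voice activity decision can be mapped onto the two main states with a pair of probability cutoffs, as in the sketch below. The cutoff values, the function name, and the choice to leave ambiguous frames out of both states are assumptions, not details from the disclosure.

```python
def split_frames_by_state(vad_probabilities, active_min=0.7, inactive_max=0.3):
    """vad_probabilities: per-frame probability (0 to 1) that local voice is
    present. Returns (active_indices, inactive_indices). Frames whose
    probability lies strictly between the two hypothetical cutoffs are
    treated as ambiguous and assigned to neither state."""
    active, inactive = [], []
    for index, probability in enumerate(vad_probabilities):
        if probability >= active_min:
            active.append(index)
        elif probability <= inactive_max:
            inactive.append(index)
    return active, inactive

active_frames, inactive_frames = split_frames_by_state([0.0, 0.9, 0.5, 1.0, 0.2])
```

A hard VAD output (0 or 1) is handled by the same function, since 1 falls in the active range and 0 in the inactive range.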
For audio frames (of a pair of beam signals) that are found to be in the “inactive” state, strengths from the pairs of beams are compared by beam analyzers 150 and 155 in order to determine whether there is sufficient noise-matching between the beams. With respect to noise-matching, improved noise matching can help to improve the accuracy of noise estimation process and/or VAD that may be part of a multi-channel or two-channel suppression process (further described below in connection with
In comparing strengths of beams to determine whether there is sufficient noise-matching between two beams, weights used for filtering one beam to match another may be estimated using a gradient descent technique such as a least mean squares (LMS) algorithm. The weights may also be applied directly to power spectrums of the beams, with a weight for each power spectral bin. A given weight could be, for example, the average ratio between the energy in a given bin when comparing the two power spectrums of the pair of beams. In other embodiments, stability of such a frequency dependent scaling may be considered by beam analyzers 150 and 155. As one example, instantaneous and average ratios may be obtained over a time interval (e.g., a digital audio time frame) between a strength of a noise component (e.g. power spectrum bin) in one beam and a strength of a noise component in another beam (e.g. the same power spectrum bin), and the stability of the ratios over time may be considered to determine whether there is sufficient noise-matching. If a ratio is not stable over time, it may be determined that the relatively fixed gross compensation discussed above does not apply. If a ratio is stable over time, it may be determined that the relatively fixed gross compensation does apply and may be used in equalizing the beams before determining noise matching. In some embodiments, a noise estimator may first be used to process the noise beam (the noise dominant input) and the voice beam (the voice dominant input) to compute the respective noise components, and the respective strengths of these noise components are used to determine instantaneous and average ratios over the time interval. The instantaneous ratios may be computed directly in the discrete time domain on a frame by frame basis. Alternatively, the instantaneous ratios may be computed in the discrete time domain at different points in time in each audio frame. 
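The per-bin weighting option mentioned above, where each weight is the average ratio between the two power spectrums in a given bin, can be sketched as follows. The function names and the frame layout are assumptions, and the sketch covers only the direct average-ratio option, not the gradient descent (LMS) alternative.

```python
def per_bin_weights(spectra_a, spectra_b, floor=1e-12):
    """spectra_a, spectra_b: lists of frames, each frame a list of per-bin
    power values, taken from frames where local voice is inactive.
    Returns one weight per bin: the average over frames of power_a / power_b."""
    n_frames = len(spectra_a)
    n_bins = len(spectra_a[0])
    return [
        sum(frame_a[k] / max(frame_b[k], floor)
            for frame_a, frame_b in zip(spectra_a, spectra_b)) / n_frames
        for k in range(n_bins)
    ]

def equalize(spectrum_b, weights):
    """Scale beam B's power spectrum bin by bin so it better matches beam A."""
    return [w * p for w, p in zip(weights, spectrum_b)]

# Two frames, two bins: beam A is 2x stronger in bin 0 and 4x in bin 1.
weights = per_bin_weights([[2.0, 8.0], [2.0, 8.0]], [[1.0, 2.0], [1.0, 2.0]])
matched_b = equalize([1.0, 2.0], weights)
```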
In other embodiments, the strengths of the voice and noise beams are computed as power spectra in the spectral or frequency domain, or they may be computed as energy spectra. This may be based on having first transformed the primary and secondary sound pick up channels on a frame by frame basis into the frequency domain (also referred to as spectral domain.)
In one embodiment, if the frequency dependent scaling estimation between two beams is very dynamic in strength and spectral shape, it is possible that the two beams are not picking up similar noise sources (i.e., not “matching”). In such a situation the two beams may not be appropriate for multi-channel noise suppression. On the other hand, if the frequency dependent scaling estimation between two beams is stable with respect to strength and spectral shape, it is possible that the two beams are picking up similar unintended noise sources (i.e., “matching”) and are candidates for selection. In one example embodiment, thresholds may be set for variation in strength and spectral shape of the frequency dependent scaling estimation between the two beams, and the variations in strength and spectral shape of the frequency dependent scaling estimation are compared to these thresholds in order to determine whether there is sufficient noise-matching between two beams. For instance, if values of the frequency dependent scaling estimation during the “inactive” periods are, for example: (5, 10, 1, 22, 11, 5, 100, 1, etc.), the beam analyzers may determine that the beams do not meet the thresholds for noise-matching, since the values in the sequence vary widely and the sequence is thus unstable. On the other hand, if values of the frequency dependent scaling estimation are, for example: (5, 4, 5, 4.5, 4.5, 4.5, etc.) or (11, 13, 11, 12, 11, 11, etc.) or (100, 110, 105, 120, 105, etc.), the beam analyzers may determine that the beams meet the thresholds for noise-matching, since the variation between the values in the sequence remains small over time and the sequence is thus stable. In these examples, the thresholds for noise-matching may be set such that the variation between the values of the frequency dependent scaling estimation should not be greater than a predetermined value.
In these examples, the sequence of values of the frequency dependent scaling estimation may be values obtained from the microphone signals at different audio frames according to one embodiment. In other embodiments, the sequence of values is obtained at a different point in time in each audio frame.
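The stability test on the example sequences above can be sketched as follows. The text does not say how variation is measured, so this sketch assumes a relative measure (population standard deviation divided by the mean) and a hypothetical limit of 0.25; both are illustrative choices, not values from the description.

```python
from statistics import mean, pstdev


def is_stable(values, max_relative_variation=0.25):
    """Return True if the scaling values vary little relative to their mean.

    max_relative_variation is an assumed, illustrative threshold.
    """
    m = mean(values)
    if m <= 0:
        return False
    return pstdev(values) / m <= max_relative_variation


# Example sequences taken from the description above.
unstable = [5, 10, 1, 22, 11, 5, 100, 1]
stable_sequences = [
    [5, 4, 5, 4.5, 4.5, 4.5],
    [11, 13, 11, 12, 11, 11],
    [100, 110, 105, 120, 105],
]
```

A relative measure is used here because the three stable examples sit at very different absolute levels (about 4.5, 11.5 and 109), so a single absolute limit would not treat them alike.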
In some embodiments, the frequency dependent scaling estimation discussed above is also used to determine the correction factor for the selected beams, in order to equalize (“EQ”) the selected beams and spectrally shape them to compensate for variations in their far-field frequency responses. According to these embodiments, if the thresholds for noise-matching are met (i.e., if there is sufficient noise-matching between two beams), a computed statistical central tendency of the instantaneous and average ratios (which may be, for example, a mean of the instantaneous and average ratios) is set as a correction factor for noise-matching. It is therefore possible to have the strength of one beam at a similar level and general spectral shape as the strength of another beam, and to compensate for any frequency variation between responses of the various beams, such that the spectral shapes of the responses “match”.
For audio frames that are found to be in an “active” state, a measure of difference between strengths from two beams is considered by beam analyzers 150 and 155 in order to determine whether there is sufficient voice-separation between the two beams. In this regard, generally, a voice-separation value may be a measure of the difference between the strength of a primary sound pick up beam, and the strength of a secondary sound pick up beam, where the local voice (primary talker's voice) is expected to be more strongly picked up by the primary beam than the secondary beam. In this case, the voice-dominated primary beam may be considered a “voice beam” and the secondary beam may be considered a “noise beam”. In order to improve the reliability or accuracy of the voice separation value for a given beam (which is expected to further improve the accuracy of the noise estimate computed by the noise suppression process), the difference calculation may be performed after having spectrally shaped the noise beam, the voice beam, or both, using the correction factor so as to compensate for any frequency response variation between the far field responses exhibited by the voice beam and the noise beam.
According to one embodiment, for a pair of beams to have sufficient voice-separation, the strength of a desired voice beam may exceed the strength of an undesired noise beam by a threshold decibel (dB) amount. In other words, the voice-separation value for the two beams may be greater than or equal to the threshold amount. As one example, studies show that the voice separation value may be high when the talker's voice is more prominently reflected in the primary channel than in the secondary channel, e.g. by about 14 dB or higher. The separation value drops when the mobile phone handset is no longer being held in its optimal or normal position, for example dropping to about 10 dB and even further in a high ambient noise environment to no more than 5 dB.
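A minimal sketch of the threshold test on the voice-separation value follows, assuming time-domain frames, mean-square power as the strength measure, and a hypothetical 10 dB threshold. The optional `correction_db` argument stands in for a noise-matching correction expressed in dB; all names and default values are assumptions for illustration.

```python
import math


def strength_db(frame):
    """Mean-square power of a time-domain frame, in dB."""
    power = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(power + 1e-12)  # small offset avoids log(0)


def separation_db(voice_frame, noise_frame, correction_db=0.0):
    """Voice-beam strength minus (optionally corrected) noise-beam strength."""
    return strength_db(voice_frame) - (strength_db(noise_frame) + correction_db)


def has_sufficient_separation(voice_frame, noise_frame,
                              threshold_db=10.0, correction_db=0.0):
    """Compare the separation value against an assumed threshold in dB."""
    return separation_db(voice_frame, noise_frame, correction_db) >= threshold_db
```

For example, a voice frame with amplitude 1.0 and a noise frame with amplitude 0.1 differ by about 20 dB in this measure and would pass the assumed 10 dB threshold.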
According to some embodiments, to determine whether there is sufficient voice-separation between two beams, ratios are considered between a strength of a voice beam (a desired signal or an acoustic pickup beam dominated primarily by a primary talker's voice) and a strength of a noise beam (an undesired signal, or an acoustic pickup beam dominated primarily by noise). For example, initially, ratios are obtained between a strength of the noise beam and a strength of the voice beam. In embodiments in which the correction factor for noise matching has been determined, these ratios may be adjusted by applying the correction factor for noise-matching. In such embodiments, these adjusted ratios are compared to set thresholds for voice-separation in order to determine whether there is sufficient voice-separation between the two beams. In some embodiments, the adjusted ratios are used to obtain instantaneous and average ratios over a time interval (e.g., a digital audio time frame), and the instantaneous and average ratios are compared to the set thresholds to determine whether there is sufficient voice-separation. The instantaneous ratios may be computed directly in the discrete time domain on a frame by frame basis. Alternatively, the instantaneous ratios may be computed in the discrete time domain at different points in time in each audio frame. In other embodiments, the strengths of the voice and noise beams are computed as power spectra in the spectral or frequency domain. This may be based on having first transformed the primary and secondary sound pick up channels on a frame by frame basis into the frequency domain (also referred to as spectral domain.)
In some embodiments, frequency dependent scaling estimation is also considered by beam analyzers 150 and 155 during the “active” frames. In a case where a voice beam with a positive (>0 dB) signal-to-noise ratio (SNR) is assumed, there is often an expected rise in beam strength when a desired voice signal is present and relative levels or strengths of desired (voice) and undesired (noise) components may be estimated for each beam. This provides both signal-to-noise ratio measurements as well as measures of the voice level, and therefore indicates the amount of voice-separation. In the case where a positive SNR is assumed, a frequency dependent scaling estimation that is sufficiently stable between a pair of beams during “active” frames indicates strong voice components on both beams. In such a case, the pair of beams may not be an appropriate candidate for selection since they imply small voice separation. In particular, if one of the beams is not dominated by the desired voice, and the other is, as would be a prerequisite for having some voice separation, it is expected that when the desired voice is active the frequency-dependent energy on the beams would be different and constantly changing with that of the desired voice.
As discussed above, the ratios and values used to analyze noise-matching and voice-separation may be computed in the spectral domain, for each digital audio time frame. For example, there may be a voice separation vector and/or a correction factor vector defined, that has a number of values that are associated with a corresponding number of frequency bins. Alternatively, the voice separation value and/or the correction factor may be a statistical measure of the central tendency, e.g. average, of the difference (subtraction or ratio) between the primary and secondary input audio channels, as an aggregate of all audio frequency bins, or alternatively across a limited band in which the local voice is expected (e.g. 400 Hz to 1 kHz), or a limited number of frequency bins, of the spectral representation of each frame. A sequence of such vectors or values is continually computed, each being a function of a respective time frame of the digital audio. An audio signal can be digitized or sampled into frames that are each, for example, between 5 and 50 milliseconds long, where there may be some time overlap between consecutive frames.
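The framing and band-limiting described above can be sketched as follows. The 20 ms frame length and 50% overlap are assumed values chosen from within the 5 to 50 millisecond range mentioned, and the helper names are hypothetical; the band limits default to the 400 Hz to 1 kHz example given in the text.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=20.0, overlap=0.5):
    """Split a sample sequence into fixed-length, overlapping frames.

    frame_ms and overlap are assumed, illustrative values.
    """
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]


def band_bins(n_fft, sample_rate_hz, lo_hz=400.0, hi_hz=1000.0):
    """Indices of one-sided spectrum bins whose centre frequency is in band."""
    return [k for k in range(n_fft // 2 + 1)
            if lo_hz <= k * sample_rate_hz / n_fft <= hi_hz]
```

With a 16 kHz sample rate and a 320-point transform, the bin spacing is 50 Hz, so the 400 Hz to 1 kHz band covers bins 8 through 20 in this sketch. A voice separation value restricted to that band would then average only over those indices.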
In one embodiment, the strengths of the voice and noise beams (the primary and secondary channels, respectively, or the desired and undesired signals, respectively) are computed as power spectra in the spectral or frequency domain. This may be based on having first transformed the primary and secondary sound pick up channels on a frame by frame basis into the frequency domain (also referred to as spectral domain.) Alternatively, the strengths of the primary and secondary sound pick up channels may be computed directly in the discrete time domain, on a frame by frame basis. An example voice separation value may be an average log spectral difference measure as follows:
Here, N is the number of frequency bins in the frequency domain representation of the digital audio frame, PSpri and PSsec are the power spectra of the primary and secondary channels, respectively, and i is the frequency index. This is an example where the strength of a signal is an average (over N frequency bins) power. Other ways of defining the voice separation value, based on a difference computation, are possible, where the term “difference” is understood to refer not just to a subtraction of logarithmic values, as in the example formula above, but also to a ratio calculation.
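The formula itself is not reproduced in the text here. One form consistent with the stated definitions, offered only as an assumption, is VS = (1/N) Σ_{i=1..N} [10·log10 PSpri(i) − 10·log10 PSsec(i)], where the base-10 logarithm and the factor of 10 (giving a result in dB) are choices made for this sketch. The code below computes that quantity; the function name and the `eps` guard are hypothetical.

```python
import math


def average_log_spectral_difference_db(ps_pri, ps_sec, eps=1e-12):
    """Average, over N bins, of the per-bin log power difference in dB.

    ps_pri, ps_sec: power spectra of the primary and secondary channels
    for one frame (equal-length lists of non-negative floats).
    """
    n = len(ps_pri)
    if n == 0 or n != len(ps_sec):
        raise ValueError("spectra must be non-empty and of equal length")
    total = sum(
        10.0 * math.log10(p + eps) - 10.0 * math.log10(s + eps)
        for p, s in zip(ps_pri, ps_sec)
    )
    return total / n
```

For instance, primary bins of 100 and 10 against secondary bins of 1 and 1 give per-bin differences of 20 dB and 10 dB, and therefore an average of 15 dB under these assumptions.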
Also, with respect to the noise estimate produced by the noise estimator, each noise component extracted from the noise beam and the voice beam may be a respective noise estimate vector, where this vector has several spectral noise estimate components, each being a value associated with a different audio frequency bin. This is based on a frequency domain representation of the discrete time audio signal, within a given time interval or frame. A spectral component or value within a noise estimate vector may refer to magnitude, energy, or power, in a single frequency bin.
As described above, an embodiment herein aims to appropriately test and select two of several beams that are simultaneously available, for example during a phone call or during a meeting or recording session, as being the primary pickup channel (e.g., voice dominant input) and the secondary pickup channel (e.g., noise dominant input) of the two-channel noise suppressor 104. In other embodiments, more than two beams may be selected. In order to select the two or more beams, one or more of (1) the frequency dependent scaling estimation, (2) the stability of the frequency dependent scaling estimation and (3) SNR values may be considered by the beam analyzers during both “active” and “inactive” frames. For example, beam analyzers 150 and 155 may test one or more of these factors (1) to (3) for each pair of available beams that may be produced by microphone signals against the thresholds for noise-matching and voice-separation. In the example illustrated in
In the case that multiple pairs of beams satisfy the thresholds, the criteria of voice-separation and noise-matching are balanced by the beam analyzers 150 and 155 in order to select an appropriate beam pair. For example, beam analyzers 150 and 155 may determine which beam of the pair is the voice beam and may select the beam pair having the highest voice separation for input to the two-channel noise suppressor (or a VAD). Referring to
It is therefore possible to choose two or more of several, simultaneously available acoustic pickup beams for input to a two-channel noise suppression process, thereby enabling the noise suppressor to produce a noise reduced voice input signal as illustrated in
It is therefore possible to coordinate the choice, design and use of acoustic pickup beams to drive a noise suppression process, while maintaining good voice-separation and noise-matching. In addition, the noise suppression process may be simplified, since the spatially separated clusters of microphones 1 and 2 are used together with the beam analyzers 150 and 155 to produce beam pairs and beam selectors 130 and 135 to select an appropriate beam pair for input to the noise suppressor 104.
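One way to picture the selection step is sketched below: candidate beam pairs that pass both criteria are kept, and the pair with the highest voice separation is chosen. The `BeamPairStats` record, its field names, and the 10 dB minimum are hypothetical and are introduced only for this example; the description leaves open how the two criteria are balanced.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BeamPairStats:
    name: str                   # hypothetical label for a (voice, noise) beam pair
    noise_matched: bool         # outcome of the noise-matching test
    voice_separation_db: float  # voice-separation value for the pair


def select_beam_pair(candidates: List[BeamPairStats],
                     min_separation_db: float = 10.0) -> Optional[BeamPairStats]:
    """Pick the eligible pair with the highest voice separation, if any."""
    eligible = [c for c in candidates
                if c.noise_matched and c.voice_separation_db >= min_separation_db]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c.voice_separation_db)
```

Returning `None` when no pair qualifies leaves the fallback behavior to the caller, which is a design choice of this sketch rather than something stated in the description.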
The beam analyzers 150 and 155 and the beam selectors 130 and 135 operate in parallel, where the term “parallel” here means that the sampling intervals or frames over which the audio signals are processed have to, for the most part, overlap in terms of absolute time. In addition, the beam analyzers 150 and 155 may be communicatively coupled to each other and to beam selectors 130 and 135 such that these components may exchange information and data. Indeed, to make comparisons on noise matching and voice separation the system compares pairs of beams created in 150 to those created in 155.
In the embodiment illustrated in
Referring to
In the embodiment of
It will be appreciated that the two-channel noise suppressor 104 illustrated in
With respect to a voice activity detector (VAD), a selected voice beam is provided to a voice dominant input of the VAD, and a selected noise beam is provided to a noise dominant input of the VAD. In one embodiment, such a VAD is implemented by first computing
ΔX(k)=|X1(k)|−|X2(k)|
where X1(k) is the spectral domain component of the voice dominant input signal, and X2(k) is that of the noise dominant input signal. In other words, the term ΔX(k) in the equation above is the difference in magnitude of spectral component k of the two input signals. Next, a binary VAD output decision (Speech or Non-speech) for spectral component k is produced as the result of a comparison between ΔX(k) and a threshold: if ΔX(k) is greater than the threshold, the decision for bin k is Speech, but if ΔX(k) is less than the threshold, the decision is Non-speech. The binary VAD output decision may be used by any available speech processing algorithms including, for example, automatic speech recognition engines.
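The per-bin decision described above can be sketched as follows, with the two inputs given as lists of complex spectral components. The function name is hypothetical, and treating the case where ΔX(k) exactly equals the threshold as Non-speech is an assumption, since the text only addresses the greater-than and less-than cases.

```python
def vad_decisions(x1, x2, threshold):
    """Per-bin binary VAD: True means Speech, False means Non-speech.

    x1: spectral components of the voice dominant input.
    x2: spectral components of the noise dominant input.
    """
    decisions = []
    for a, b in zip(x1, x2):
        delta = abs(a) - abs(b)  # magnitude difference for this bin
        decisions.append(delta > threshold)
    return decisions
```

For example, with a threshold of 2.0, a bin where the voice dominant input has magnitude 5 and the noise dominant input has magnitude 1 is marked Speech, while a bin with equal magnitudes is marked Non-speech.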
For convenience, the embodiment of
Also, in some embodiments, echo-coupling may be considered by the beam analyzers. Additionally, the beam analyzers may augment analysis of the beams with models representing a voice signal and a noise signal. For example, in addition to enhancing performance of the VAD, linear-predictive models of short and long-term correlations may be used to detect a primary talker's voice and to help differentiate between voice and noise signals on different beams. In these situations, various considerations may be used; for example, it may be considered that noise beams should not include strong voice components.
By virtue of the arrangement of
It is therefore possible to coordinate the production, selection and use of acoustic pick up beams to drive a VAD, a noise estimation process and SNR calculations of a noise suppressor, and to set voice-separation and noise-matching criteria to ensure that these processes and calculations are effective in using two pick up beams. These criteria may include direct measures of how the power spectrum of one beam compares to the power spectrum of another beam. These criteria may also include measures of how the difference or ratio of the two power spectra changes dynamically over time.
In the embodiment illustrated in
The embodiment of
As illustrated in the example of
In the embodiment of
Referring to
In the embodiments illustrated by
It is therefore possible to generate a voice beam and a noise beam that have sufficient voice-separation and noise-matching, such that unnecessary suppression of voice components and unmatched suppression of noise components may be avoided. In fact, in some situations where the microphones of one cluster have a similar fixed geometrical relationship to each other as the microphones of another cluster, and where the operating characteristics of one cluster are similar to the operating characteristics of another cluster, it may be possible for beamformers to generate the voice beam and the noise beam according to a similar design. In this way, the beam pair including a voice beam and a noise beam is provided as input to a noise suppressor or VAD, and a noise suppression process may be simplified. For example, both beams in
Turning to
The memory 1106 has stored therein instructions that when executed by the processor 1102 produce the acoustic pickup beams using the microphone signals, compute voice separation values and correction factors (as described above), select one of the acoustic pickup beams (as described above in connection with
Some portions of the preceding detailed descriptions have been presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the ways used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities. It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the above discussion, it is appreciated that throughout the description, discussions utilizing terms such as those set forth in the claims below, refer to the action and processes of an audio system, or similar electronic device, that manipulates and transforms data represented as physical (electronic) quantities within the system's registers and memories into other data similarly represented as physical quantities within the system memories or registers or other such information storage, transmission or display devices.
The processes and blocks described herein are not limited to the specific examples described and are not limited to the specific orders used as examples herein. Rather, any of the processing blocks may be re-ordered, combined or removed, performed in parallel or in serial, as necessary, to achieve the results set forth above. The processing blocks associated with implementing the audio system may be performed by one or more programmable processors executing one or more computer programs stored on a non-transitory computer readable storage medium to perform the functions of the system. All or part of the audio system may be implemented as special purpose logic circuitry (e.g., an FPGA (field-programmable gate array) and/or an ASIC (application-specific integrated circuit)). All or part of the audio system may be implemented using electronic hardware circuitry that includes electronic devices such as, for example, at least one of a processor, a memory, a programmable logic device or a logic gate. Further, processes can be implemented in any combination of hardware devices and software components.
While certain embodiments have been described and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of and not restrictive on the broad invention, and the invention is not limited to the specific constructions and arrangements shown and described, since various other modifications may occur to those of ordinary skill in the art. The description is thus to be regarded as illustrative instead of limiting.
Number | Date | Country | |
---|---|---|---|
20180033447 A1 | Feb 2018 | US |