Some microphones, for example, micro-electro-mechanical systems (MEMS) microphones, have an omnidirectional response (that is, they are equally sensitive to sound in all directions). However, in some applications it is desirable to have an unequally sensitive microphone. A remote speaker microphone, as used, for example, in public safety communications, should be more sensitive to the voice of the user than it is to ambient noise. Some remote speaker microphones use beamforming arrays of multiple microphones (for example, a broadside array or an endfire array) to form a directional response (that is, a beam pattern). Adaptive beamforming algorithms may be used to steer the beam pattern toward the desired sounds (for example, speech), while attenuating unwanted sounds (for example, ambient noise).
The accompanying figures, where like reference numerals refer to identical or functionally similar elements throughout the separate views, together with the detailed description below, are incorporated in and form part of the specification, and serve to further illustrate embodiments of concepts that include the claimed invention, and explain various principles and advantages of those embodiments.
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help to improve understanding of embodiments of the present invention.
The apparatus and method components have been represented where appropriate by conventional symbols in the drawings, showing only those specific details that are pertinent to understanding the embodiments of the present invention so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.
One exemplary embodiment provides a method for beamforming audio signals received from a microphone array. The method includes receiving, by an electronic processor communicatively coupled to the microphone array, at least one audio signal from the microphone array. The method includes determining a plurality of beams based on the at least one audio signal. The method includes receiving, by the electronic processor, from a vibration microphone communicatively coupled to the electronic processor, at least one vibration signal. The method includes time aligning the at least one vibration signal and the at least one audio signal. The method includes determining a plurality of correlation values, each of the plurality of correlation values based on one of the plurality of beams and the at least one vibration signal. The method includes determining a peak correlation value based on the plurality of correlation values, and selecting one of the plurality of beams based on the peak correlation value.
Another embodiment provides a beamforming system. The beamforming system includes a microphone array, a vibration microphone, and an electronic processor communicatively coupled to the microphone array and the vibration microphone. The electronic processor is configured to receive at least one audio signal from the microphone array. The electronic processor is configured to determine a plurality of beams based on the at least one audio signal. The electronic processor is configured to receive from the vibration microphone at least one vibration signal. The electronic processor is configured to time align the at least one vibration signal and the at least one audio signal. The electronic processor is configured to determine a plurality of correlation values, each of the plurality of correlation values based on one of the plurality of beams and the at least one vibration signal. The electronic processor is configured to determine a peak correlation value based on the plurality of correlation values, and select one of the plurality of beams based on the peak correlation value.
Another embodiment provides a remote speaker microphone. The remote speaker microphone includes a microphone array, a vibration microphone, and an electronic processor communicatively coupled to the microphone array and the vibration microphone. The electronic processor is configured to receive at least one audio signal from the microphone array. The electronic processor is configured to determine a plurality of beams based on the at least one audio signal. The electronic processor is configured to receive from the vibration microphone at least one vibration signal. The electronic processor is configured to time align the at least one vibration signal and the at least one audio signal. The electronic processor is configured to determine a plurality of correlation values, each of the plurality of correlation values based on one of the plurality of beams and the at least one vibration signal. The electronic processor is configured to determine a peak correlation value based on the plurality of correlation values, and select one of the plurality of beams based on the peak correlation value.
For ease of description, some or all of the exemplary systems presented herein are illustrated with a single exemplar of each of its component parts. Some examples may not describe or illustrate all components of the systems. Other exemplary embodiments may include more or fewer of each of the illustrated components, may combine some components, or may include additional or alternative components.
It should be noted that, in the following specification, the terms “beamforming” and “adaptive beamforming” refer to microphone beamforming using a microphone array, and one or more known or future-developed beamforming algorithms, or combinations thereof.
The memory 108 may include read-only memory (ROM), random access memory (RAM), other non-transitory computer-readable media, or a combination thereof. The electronic processor 106 is configured to retrieve instructions and data from the memory 108 and execute, among other things, instructions to perform the methods described herein.
The input/output interface 110 is configured to receive input and to provide system output. The input/output interface 110 obtains information and signals from, and provides information and signals to (for example, over one or more wired and/or wireless connections), devices both internal and external to the remote speaker microphone 102 (for example, the microphone array 112, the portable radio 120, and the vibration microphone 104).
The microphone array 112 includes two or more microphones capable of sensing sound, for example, the speech sound waves 150 generated by a speech source 152 (for example, a human speaking). The microphone array 112 converts the speech sound waves 150 to electrical signals, and transmits the electrical signals to the electronic processor 106 via the input/output interface 110. The electronic processor 106 processes the electrical signals received from the microphone array 112 according to the methods described herein. The electronic processor 106 provides the processed electrical signals to the portable radio 120 for voice encoding and transmission.
The vibration microphone 104 is a microphone capable of sensing vibrations, for example, the speech vibrations 154 made by the speech source 152. The vibration microphone 104 is communicatively coupled to the electronic processor 106 via the input/output interface 110. The vibration microphone 104 converts the speech vibrations 154 to electrical signals, and transmits the electrical signals to the electronic processor 106 via the input/output interface 110.
Although the vibration microphone 104 and the microphone array 112 both convert speech signals from the speech source 152 into electrical signals, they differ in at least three respects.
First, unlike the microphone array 112, the vibration microphone 104 senses vibrations in the speech source, and not sound waves transmitted through the air. In some embodiments (for example, using a bone conduction microphone, an in-ear microphone, or a tooth bone conduction microphone), the vibration microphone 104 senses the speech vibrations 154 through direct physical contact with the speech source 152. In other embodiments (for example, using a laser or other optical microphone), the vibration microphone 104 senses the speech vibrations 154 without direct physical contact with the speech source 152.
Second, when an ambient noise source 160 (for example, a vehicle, a crowd, or environmental noise) produces ambient sound waves 164, the microphone array 112 picks up both the speech sound waves 150 and the ambient sound waves 164. However, the vibration microphone 104, even in the presence of the ambient sound waves 164, picks up only the speech vibrations 154 generated by the speech source 152.
Third, the vibration microphone 104 is sensitive to vibrations within a limited frequency range, for example 100 Hz to 1 kHz, and outputs electrical signals having a corresponding frequency range. However, this range does not contain enough speech spectrum to be encoded by a typical voice encoder and used as a primary audio input to a transmitter of the portable radio 120.
Oftentimes, the speech source 152 is not the only source of sound waves near the remote speaker microphone 102. For example, a police officer using the remote speaker microphone 102 may be in an environment with an ambient noise source 160 (for example, in a vehicle, or in a crowd), which produces ambient sound waves 164. In order to assure timely and accurate communications, the microphones of the microphone array 112 are configured to produce a directional response (that is, a beam pattern) to pick up desirable sound waves (for example, from the speech source 152), while attenuating undesirable sound waves (for example, from the ambient noise source 160).
Adaptive beamforming algorithms use electronic signal processing (for example, executed by the electronic processor 106) to digitally “steer” the beam pattern 202 to focus on a desired sound (for example, speech) and to attenuate ambient noise. Accordingly, beamforming algorithms may be used with a microphone array (for example, the microphone array 112) to isolate or extract speech sound under noisy conditions. However, current beamforming algorithms are effective with a signal-to-noise ratio (SNR) down to about zero dB, at which point the algorithms struggle to separate speech from ambient noise. Thus, in high ambient noise situations, the signal-to-noise ratio may not be sufficient for current beamforming algorithms to correctly steer the beam pattern 202.
As noted above, the vibration microphone 104 remains unaffected by ambient noise, but only captures useful audio between 100 Hz and 1 kHz. However, between 100 Hz and 1 kHz, the electrical signals produced from the speech vibrations 154 highly correlate to the electrical signals produced from the speech sound waves 150 for the same speech source 152.
Because the vibration microphone 104 does not capture ambient noise, the high degree of correlation between the signals is also exhibited in noisy environments.
As noted above, adaptive beamforming algorithms steer a beam to focus on a desired sound and to attenuate ambient noise. However, when the signal-to-noise ratio between the desired sound and the ambient noise is too low (for example, at or below zero dB), current beamforming algorithms may steer the beam incorrectly, and fail to effectively pick up the desired sound. Accordingly, embodiments provide, among other things, methods for beamforming audio signals received from a microphone array.
By way of example, the methods presented are described in terms of the remote speaker microphone 102.
At block 706, the electronic processor 106 receives at least one vibration signal from the vibration microphone 104. The vibration signal is an electrical signal based on the speech vibrations 154 detected by the vibration microphone 104.
Because the time bases for the acoustic and vibration microphones may differ (for example, when the vibration microphone 104 communicates the vibration signal over a wireless link), at block 707, the electronic processor 106 time aligns the vibration signal and the audio signal. For example, where the time bases differ by a constant known delay, the electronic processor 106 may implement an all-pass filter (for example, in the time or frequency domain) that has a group delay equal to the known constant delay and that is applied to the leading signal(s). In another example, when the time bases differ by a constant unknown delay, the electronic processor 106 may perform a one-time cross-correlation or similar operation to determine the unknown constant delay, which may then be fed into an all-pass filter and applied to the leading signal(s). In another example, when the time bases differ by a varying unknown delay, the electronic processor 106 may periodically calculate a cross-correlation at the output of an adaptive all-pass filter, where the coefficients are adapted to maximize the peak signal power in the cross-correlations.
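By way of illustration only, the constant unknown delay case may be sketched as follows. The sketch assumes that the audio signal leads the vibration signal by a whole number of samples; the function names, the search range, and the synthetic test signals are hypothetical and are not taken from any embodiment. A pure integer-sample delay is one simple filter with a flat magnitude response and a constant group delay.

```python
import random

def estimate_delay(reference, delayed, max_lag):
    """Return the lag (in samples, 0..max_lag) that maximizes the
    cross-correlation between `reference` and `delayed`."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        # Correlate reference[i] with delayed[i + lag] over the overlap.
        overlap = min(len(reference), len(delayed) - lag)
        score = sum(reference[i] * delayed[i + lag] for i in range(overlap))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def delay_signal(signal, lag):
    """Delay the leading signal by `lag` samples (zero-padded, same length)."""
    return [0.0] * lag + signal[:len(signal) - lag]

random.seed(0)
audio = [random.gauss(0.0, 1.0) for _ in range(400)]   # stand-in audio signal
true_delay = 7                                         # hypothetical link delay
vibration = [0.0] * true_delay + audio[:-true_delay]   # lagging vibration signal

lag = estimate_delay(audio, vibration, max_lag=20)     # one-time estimate
aligned_audio = delay_signal(audio, lag)               # delay the leading signal
```

A fractional or time-varying delay would call for an interpolating or adaptive filter in place of the integer shift shown here.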
At block 708, the electronic processor 106 filters the vibration signal. The vibration signal may be filtered by processing the vibration signal through a high-pass filter (for example, with a cutoff frequency of 100 Hz), a low-pass filter (for example, with a cutoff frequency of approximately 1 kHz), or both. The formant content of the speech being detected is proportional to the volume of the speech source 152. Accordingly, some embodiments adjust the low-pass filter adaptively based on the formant content of the speech, to prevent loss of the higher frequency content captured by the vibration microphone 104 under such conditions. In some embodiments, the electronic processor 106 does not filter the vibration signal.
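As a non-limiting sketch, the band limiting described above may be approximated with a cascade of first-order filters. The filter order, the 8 kHz sample rate, and all function names below are assumptions chosen for brevity; a practical design would likely use steeper filters.

```python
import math

def one_pole_low_pass(signal, cutoff_hz, sample_rate_hz):
    """First-order low-pass: y[n] = y[n-1] + a * (x[n] - y[n-1])."""
    a = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate_hz)
    out, y = [], 0.0
    for x in signal:
        y += a * (x - y)
        out.append(y)
    return out

def one_pole_high_pass(signal, cutoff_hz, sample_rate_hz):
    """First-order high-pass formed as the input minus its low-passed version."""
    low = one_pole_low_pass(signal, cutoff_hz, sample_rate_hz)
    return [x - l for x, l in zip(signal, low)]

def band_limit(signal, sample_rate_hz, low_hz=100.0, high_hz=1000.0):
    """High-pass at low_hz, then low-pass at high_hz."""
    high_passed = one_pole_high_pass(signal, low_hz, sample_rate_hz)
    return one_pole_low_pass(high_passed, high_hz, sample_rate_hz)

def tone(freq_hz, sample_rate_hz, n):
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate_hz) for i in range(n)]

def rms(signal):
    return math.sqrt(sum(x * x for x in signal) / len(signal))

fs = 8000  # assumed sample rate in Hz
# Output level of unit-amplitude tones below, inside, and above the band.
levels = {f: rms(band_limit(tone(f, fs, fs), fs)) for f in (20, 400, 3500)}
```

In this sketch the 400 Hz tone passes with little attenuation, while the 20 Hz and 3500 Hz tones are reduced. An adaptive variant could raise `high_hz` when more high-frequency content is present.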
At block 710, the electronic processor 106 filters the plurality of beams to generate a plurality of filtered beams. In some embodiments, the plurality of filtered beams is generated by processing the plurality of beams through a low-pass filter (for example, with a cutoff frequency of approximately 1 kHz). In some embodiments, the electronic processor 106 does not filter the plurality of beams.
At block 712, the electronic processor 106 determines a plurality of correlation values (for example, cross-correlation values). Each one of the plurality of correlation values is based on one of the plurality of filtered beams generated at block 710, and the filtered vibration signal. For each of the plurality of filtered beams, the electronic processor 106 determines a value based on the degree of correlation between that filtered beam and the filtered vibration signal. At block 714, the electronic processor 106 determines the peak correlation value. The peak correlation value is the value that indicates the highest degree of correlation with the filtered vibration signal. Because two signals with a high degree of correlation were likely produced by the same speech input, it can be inferred that the beam associated with the peak correlation value is the beam aligned most closely to the speech source 152.
Accordingly, at block 716, the electronic processor 106 selects one of the plurality of beams based on the peak correlation value. The electrical signal produced by the selected beam may then be further processed (for example, by using other noise reduction algorithms) or transmitted to the portable radio 120 for voice encoding and transmission.
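The selection performed at blocks 712 through 716 may be illustrated by the following sketch. It assumes time-aligned, equal-length signals and uses a zero-lag normalized cross-correlation as the correlation value. The function names, the three synthetic beams, and their gains are hypothetical.

```python
import math
import random

def correlation(a, b):
    """Zero-lag normalized cross-correlation of two equal-length signals."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0

def select_beam(beams, vibration):
    """Return (index of the peak-correlation beam, list of correlation values)."""
    values = [correlation(beam, vibration) for beam in beams]
    peak_index = max(range(len(values)), key=values.__getitem__)
    return peak_index, values

def make_noise(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]

random.seed(1)
n = 2000
# Stand-in speech: a 300 Hz tone sampled at an assumed 8 kHz.
speech = [math.sin(2.0 * math.pi * 300 * i / 8000) for i in range(n)]
# Three hypothetical beams; only beam 1 points toward the talker.
speech_gains = [0.1, 1.0, 0.3]
beams = [[g * s + w for s, w in zip(speech, make_noise(n))] for g in speech_gains]
vibration = speech  # the vibration signal carries speech without ambient noise

peak_index, values = select_beam(beams, vibration)
```

In this synthetic case the noise power in every beam exceeds the speech power, yet the beam with the largest speech gain still yields the peak correlation value.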
In some embodiments, the correlation values may be power values.
At block 802, the electronic processor 106 divides the filtered vibration signal into a plurality of vibration signal sub-bands between, for example, 100 Hz and approximately 1 kHz. At block 804, the electronic processor 106 determines whether a correlation value has been determined for each of the plurality of filtered beams. When there are unprocessed filtered beams, the electronic processor 106 divides the next of the plurality of beams to be processed into sub-bands (for example, between 100 Hz and approximately 1 kHz), to generate a plurality of beam sub-bands, at block 806.
At block 808, the electronic processor 106 multiplies each of the plurality of vibration signal sub-bands by each of the plurality of beam sub-bands to generate a plurality of sub-band outputs. At block 810, the electronic processor 106 processes the plurality of sub-band outputs through a moving-average filter (for example, a fast Fourier transform) to generate a plurality of filtered sub-band outputs. In some embodiments, the corner frequency of the moving-average filter is selected to match the cross-correlation length that is being emulated (for example, one second). The number of sub-bands generated at blocks 802 and 806 may be based on the fast Fourier transform used at block 810. For example, a 128-point fast Fourier transform would result in 28 sub-bands, while a 512-point fast Fourier transform would result in 115 sub-bands.
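The relationship between transform size and sub-band count may be checked with a short calculation. The sample rate is not stated above; the sketch below assumes a 4 kHz sample rate and counts transform bins whose center frequency falls in the half-open range from 100 Hz up to, but not including, 1 kHz. Under those assumptions, which are illustrative only, the counts match the example values given above. The function name is hypothetical.

```python
def count_sub_bands(fft_size, sample_rate_hz, low_hz=100.0, high_hz=1000.0):
    """Count FFT bins whose center frequency lies in [low_hz, high_hz)."""
    bin_width_hz = sample_rate_hz / fft_size
    return sum(1 for k in range(fft_size // 2 + 1)
               if low_hz <= k * bin_width_hz < high_hz)

print(count_sub_bands(128, 4000))  # 28
print(count_sub_bands(512, 4000))  # 115
```

Other sample rates or band-edge conventions would give different counts.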
At block 812, the plurality of filtered sub-band outputs is summed to determine a correlation value for the filtered beam being processed. Returning to block 804, when correlation values have been determined for each of the plurality of filtered beams, the electronic processor 106 determines a peak filtered sub-band output value at block 814. The peak filtered sub-band output value corresponds to the beam with the highest signal power. At block 816, the electronic processor 106 selects one of the plurality of beams based on the peak filtered sub-band output value. The electrical signal produced by the selected beam may then be further processed (for example, by using other noise reduction algorithms) or transmitted to the portable radio 120 for voice encoding and transmission.
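One possible reading of blocks 802 through 816 is sketched below. The sketch makes several assumptions that are not part of the description above: sub-bands are formed with a naive per-frame discrete Fourier transform, the moving-average filter is replaced by a plain average over all frames, and the real part of each sub-band product is used as the power value. All names, the 4 kHz sample rate, and the synthetic signals are hypothetical.

```python
import cmath
import math
import random

def band_bins(frame, bins):
    """Naive DFT of one frame, evaluated only at the requested bin indices."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
            for k in bins]

def sub_band_correlation(beam, vibration, frame_size, bins):
    """Sum over sub-bands of the frame-averaged product of the sub-band signals."""
    totals = [0.0] * len(bins)
    frames = len(beam) // frame_size
    for m in range(frames):
        lo, hi = m * frame_size, (m + 1) * frame_size
        b = band_bins(beam[lo:hi], bins)        # beam sub-bands
        v = band_bins(vibration[lo:hi], bins)   # vibration signal sub-bands
        for j in range(len(bins)):
            # Multiply the sub-bands; the real part is the in-phase power.
            totals[j] += (v[j] * b[j].conjugate()).real
    # Average over frames (stand-in for the moving average), then sum.
    return sum(t / frames for t in totals)

random.seed(2)
fs, frame_size, n = 4000, 64, 2048  # assumed sample rate, frame length, duration
bins = [k for k in range(frame_size // 2 + 1) if 100.0 <= k * fs / frame_size < 1000.0]
speech = [math.sin(2 * math.pi * 300 * i / fs) + math.sin(2 * math.pi * 600 * i / fs)
          for i in range(n)]
gains = [0.1, 1.0, 0.3]  # hypothetical beams; beam 1 points toward the talker
beams = [[g * s + random.gauss(0.0, 1.0) for s in speech] for g in gains]

values = [sub_band_correlation(beam, speech, frame_size, bins) for beam in beams]
selected = max(range(len(values)), key=values.__getitem__)
```

A sliding moving average, rather than the single block average used here, would allow the selected beam to track a moving talker.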
Accordingly, by use of the method 700 or the method 800, a beamforming algorithm may be used effectively in low signal-to-noise environments, where it may otherwise be ineffective.
Some embodiments may integrate the vibration microphone signal into an adaptive beamforming algorithm more directly. That is, instead of using the correlation with the vibration microphone signal to choose between beams, the correlation between the vibration signal and the audio signal could be used to assist in the formation of the beams to steer the beams more directly to the source of the speech. To do this, the correlation of the beams to the vibration signal would be used in determining the beamforming algorithm weights.
An adaptive beamformer uses an adjustable set of weights (for example, filter coefficients) to combine multiple microphone sources into a single signal with improved spatial directivity. The adaptive beamforming algorithm uses numerical optimization to modify or update these weights as the environment varies. Such algorithms use many possible optimization schemes (for example, least mean squares, sample matrix inversion, and recursive least squares). Such optimization schemes depend on what criteria are used as an objective function (that is, what parameter to optimize). For example, when the main lobe of a beam is in a known fixed direction, beamforming could be based on maximizing signal-to-noise ratio or minimizing total noise not in the direction of the main lobe, thereby steering the nulls to the loudest interfering source.
When extra information about a user's speech is known (for example, the vibration signal described above), the extra information can be incorporated into the objective function. For example, rather than maximizing signal-to-noise ratio or minimizing noise variance as the objective function, the numerical optimization could adapt the weights to maximize the correlation of the beamformer output with the vibration microphone signal. This objective function would have the advantage of being able to steer the main lobe as well as the nulls, because the beamformer has information about where the desired speech signal is, and it does not have to assume a fixed beam direction. Such a beamformer could improve signal-to-noise ratio by both increasing the desired signal and decreasing competing noise. In some embodiments, this may be combined with a constraint on the main beamforming lobe to keep it within a limited range.
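As a simplified illustration of using the vibration signal in the objective function, the sketch below adapts one weight per microphone with a least mean squares update that drives the beamformer output toward the vibration signal. This substitutes a squared-error objective for the correlation-maximizing objective described above. It also assumes single-tap weights, a vibration signal scaled like the speech, and a noise-only second microphone. The function name, step size, and signals are hypothetical.

```python
import math
import random

def lms_beamformer(mics, reference, step=0.02):
    """Adapt one weight per microphone so the weighted sum tracks `reference`."""
    weights = [0.0] * len(mics)
    for n in range(len(reference)):
        x = [mic[n] for mic in mics]
        y = sum(w * xi for w, xi in zip(weights, x))   # beamformer output
        error = reference[n] - y                       # mismatch to vibration signal
        weights = [w + step * error * xi for w, xi in zip(weights, x)]
    return weights

random.seed(3)
n = 4000
speech = [math.sin(2.0 * math.pi * 300 * i / 8000) for i in range(n)]
noise = [random.gauss(0.0, 1.0) for _ in range(n)]
mics = [[s + w for s, w in zip(speech, noise)],  # hypothetical mic 0: speech plus noise
        noise]                                   # hypothetical mic 1: noise only

weights = lms_beamformer(mics, reference=speech)
output = [weights[0] * a + weights[1] * b for a, b in zip(mics[0], mics[1])]
residual = math.sqrt(sum((o - s) ** 2 for o, s in zip(output, speech)) / n)
```

In this synthetic case the weights approach 1 and -1, so the noise common to both microphones is cancelled while the speech is retained. Because the actual vibration signal is band limited, a practical version would likely band limit both signals before computing the error.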
In some embodiments, the beamforming algorithms may be modified based on where the audio and vibration signals most strongly correlate. For example, in a time domain beamformer, the beamformer may band limit the signals before calculating the correlation for the objective function. In some embodiments, for example in a multiband or frequency domain beamformer, the correlation-based objective may be used for the frequency bands in which the correlation holds, while the other bands may use the more standard objective functions. In some embodiments, frequency bands outside the correlation range, but close to it, could be constrained to be in the same or similar shape to the bands within the correlation range.
In the foregoing specification, specific embodiments have been described. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the invention as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of present teachings.
The benefits, advantages, solutions to problems, and any element(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as critical, required, or essential features or elements of any or all the claims. The invention is defined solely by the appended claims including any amendments made during the pendency of this application and all equivalents of those claims as issued.
Moreover, in this document, relational terms such as first and second, top and bottom, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms “comprises,” “comprising,” “has,” “having,” “includes,” “including,” “contains,” “containing” or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises, has, includes, contains a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by “comprises . . . a,” “has . . . a,” “includes . . . a,” or “contains . . . a” does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises, has, includes, contains the element. The terms “a” and “an” are defined as one or more unless explicitly stated otherwise herein. The terms “substantially,” “essentially,” “approximately,” “about” or any other version thereof, are defined as being close to as understood by one of ordinary skill in the art, and in one non-limiting embodiment the term is defined to be within 10%, in another embodiment within 5%, in another embodiment within 1% and in another embodiment within 0.5%. The term “coupled” as used herein is defined as connected, although not necessarily directly and not necessarily mechanically. A device or structure that is “configured” in a certain way is configured in at least that way, but may also be configured in ways that are not listed.
It will be appreciated that some embodiments may be comprised of one or more generic or specialized processors (or “processing devices”) such as microprocessors, digital signal processors, customized processors and field programmable gate arrays (FPGAs) and unique stored program instructions (including both software and firmware) that control the one or more processors to implement, in conjunction with certain non-processor circuits, some, most, or all of the functions of the method and/or apparatus described herein. Alternatively, some or all functions could be implemented by a state machine that has no stored program instructions, or in one or more application specific integrated circuits (ASICs), in which each function or some combinations of certain of the functions are implemented as custom logic. Of course, a combination of the two approaches could be used.
Moreover, an embodiment can be implemented as a computer-readable storage medium having computer readable code stored thereon for programming a computer (e.g., comprising a processor) to perform a method as described and claimed herein. Examples of such computer-readable storage mediums include, but are not limited to, a hard disk, a CD-ROM, an optical storage device, a magnetic storage device, a ROM (Read Only Memory), a PROM (Programmable Read Only Memory), an EPROM (Erasable Programmable Read Only Memory), an EEPROM (Electrically Erasable Programmable Read Only Memory) and a Flash memory. Further, it is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein will be readily capable of generating such software instructions and programs and ICs with minimal experimentation.
The Abstract of the Disclosure is provided to allow the reader to quickly ascertain the nature of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. In addition, in the foregoing Detailed Description, it can be seen that various features are grouped together in various embodiments for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment. Thus the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as a separately claimed subject matter.