The present disclosure relates to audio processing in arrays of bidirectional microphones.
In a compact teleconference device, the speaker and microphone are typically placed close to each other. When the distance between the speaker and the microphone is short, omnidirectional microphones pick up considerable echo. Unidirectional microphones also pick up substantial echo, especially at low frequencies, due to the proximity effect. Bidirectional microphones with their axes oriented perpendicular to the speaker reject echo signals significantly better than omnidirectional or unidirectional microphones.
In teleconference devices, small circular arrays of bidirectional microphones may use gain sharing/mixing to cover a room with multiple talkers. However, conventional gain sharing/mixing may result in poor performance when using bidirectional microphones. Bidirectional microphones pick up sound from either end of the microphone with opposite polarity. When mixing the output of two bidirectional microphones, the polarity of each signal may cause the total output to cancel out a meaningful signal.
The techniques presented herein provide a method for a device including a plurality of bidirectional microphones to generate an output audio signal that optimizes the echo rejection of the bidirectional microphones. The method includes receiving audio from an audio source and generating an audio signal from each of the bidirectional microphones. The method further includes forming a plurality of audio beams from combinations of the audio signals generated from the plurality of bidirectional microphones. Each audio beam captures audio from either a respective positive polarity zone or a respective negative polarity zone. The method also includes determining a direction of the audio source and selecting a perpendicular audio beam pair based on the direction of the audio source. The selected perpendicular audio beam pair includes a primary audio beam aimed toward the direction of the audio source and a secondary beam perpendicular to the primary audio beam. The method further includes generating an output signal by combining the primary audio beam with the secondary audio beam based on a comparison of which respective polarity zone the audio is captured for the primary audio beam and the secondary audio beam.
Bidirectional microphones have better echo rejection than omnidirectional or unidirectional microphones when the speaker is disposed near the microphones. However, bidirectional microphones pick up audio signals from both a front end (e.g., with positive polarity) and a back end (e.g., with negative polarity). When an audio source is in the positive polarity zone of one microphone and in the negative polarity zone of another, signals from the two microphones may cancel each other out when being mixed together. The techniques described herein use beamforming and gainsharing mixing techniques with a small circular array of bidirectional microphones to resolve the polarity conflict of bidirectional microphones when doing gainsharing mixing. In a circular array with three bidirectional microphones, the techniques described herein also provide for a method to estimate sound direction without any ambiguity in determining from which direction (e.g., front or back) a sound originates.
As used herein, a bidirectional microphone refers to a sound input device that records audio signals with a positive polarity in one direction and a negative polarity in the opposite direction. A bidirectional microphone may be constructed with a single transducer (e.g., a ribbon) or from an array of multiple transducers (e.g., Micro-Electro-Mechanical System (MEMS) transducers). A typical pickup pattern of a bidirectional microphone is cos(θ), with two lobes in opposite directions along the axis of the microphone and a deep null perpendicular to the axis. From the outputs of two bidirectional microphones spaced Φ degrees apart, a virtual bidirectional microphone pointing to any angle Ψ can be formed by combining the two outputs with gains of c1 and c2, respectively:
cos(θ+Ψ)=c1*cos(θ)+c2*cos(θ+Φ)
where c2=sin(Ψ)/sin(Φ); and
c1=cos(Ψ)−c2*cos(Φ).
If sound directed from the angle Ψ arrives at two microphones at different times (e.g., the microphones are spaced apart), then a proper delay may be introduced to compensate for the difference.
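The steering gains can be checked numerically. The following is a minimal sketch, assuming the first microphone has pattern cos(θ), the second has pattern cos(θ+Φ), and the virtual microphone is to have pattern cos(θ+Ψ); the gains are obtained by expanding the cosines and matching the cos(θ) and sin(θ) terms. The function names are hypothetical and used only for illustration.

```python
import math

def steering_gains(phi_deg, psi_deg):
    """Gains (c1, c2) that steer a virtual figure-eight pattern to psi.

    Assumes microphone 1 has pattern cos(theta) and microphone 2 has
    pattern cos(theta + phi), with phi not a multiple of 180 degrees.
    """
    phi = math.radians(phi_deg)
    psi = math.radians(psi_deg)
    c2 = math.sin(psi) / math.sin(phi)          # matches the sin(theta) terms
    c1 = math.cos(psi) - c2 * math.cos(phi)     # matches the cos(theta) terms
    return c1, c2

def virtual_pattern(theta_deg, phi_deg, psi_deg):
    """Response of the combined (virtual) microphone to a source at theta."""
    c1, c2 = steering_gains(phi_deg, psi_deg)
    theta = math.radians(theta_deg)
    phi = math.radians(phi_deg)
    return c1 * math.cos(theta) + c2 * math.cos(theta + phi)

# Compare against the target pattern cos(theta + psi) for one configuration.
for theta in (0, 30, 90, 200):
    target = math.cos(math.radians(theta + 45))
    print(theta, round(virtual_pattern(theta, 90, 45), 6), round(target, 6))
```

The sketch covers only the gain computation; any delay compensation for spaced microphones is left out.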
To cover 360° of space, at least two bidirectional microphones may be used to form a small circular array. The axes of the two microphones are configured to be perpendicular to each other. Each microphone covers 180° of space, with 90° in front of the microphone and 90° behind the microphone. The minimal sensitivity of the two-microphone array is at 45° off the axis of either microphone. A sound at 45° off the axis (cos(45°)) is picked up 3 dB lower than a sound that is on axis (cos(0°)).
Referring to
In one example, the microphones 110 and 120 produce audio signals S1 and S2 (e.g., audio beams 130 and 135), respectively. Combining S1 and S2 with beamforming may be used to create two more beams S3 and S4 (e.g., audio beams 140 and 145) along the directions of 45°/225° and 135°/315°, respectively. When sound comes from 45°/225°, it reaches the two microphones at the same time, and no compensation for any difference in the time of arrival is necessary. The two audio beams S3 and S4 may be generated from the microphone outputs S1 and S2 according to:
S3=(S1+S2)/√2; and
S4=(S1−S2)/√2.
With a total of four audio beams (e.g., S1, S2, S3 and S4) covering a room, when there is only one audio source in the room, the audio source is within 22.5° of the central axis of one of the beams, leading to a worst case of 0.7 dB down from an audio source that is in line with the axis of one of the audio beams.
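The coverage figures can be reproduced with a short calculation. This is a sketch only, assuming ideal pickup patterns of cos(θ) and sin(θ) for the two perpendicular microphones and a √2 normalization that gives the diagonal beams unit on-axis gain; the function names are hypothetical.

```python
import math

def beam_gains(theta_deg):
    """Absolute gains of beams S1..S4 for a source at theta_deg."""
    t = math.radians(theta_deg)
    s1 = math.cos(t)                 # assumed pattern, microphone axis at 0 degrees
    s2 = math.sin(t)                 # assumed pattern, microphone axis at 90 degrees
    s3 = (s1 + s2) / math.sqrt(2)    # diagonal beam along 45/225 degrees
    s4 = (s1 - s2) / math.sqrt(2)    # diagonal beam along 135/315 degrees
    return [abs(s1), abs(s2), abs(s3), abs(s4)]

def worst_case_db(num_beams, step_deg=0.5):
    """Loss of the best available beam at the least favorable source angle."""
    angles = [k * step_deg for k in range(int(360 / step_deg))]
    worst = min(max(beam_gains(a)[:num_beams]) for a in angles)
    return 20 * math.log10(worst)

print(round(worst_case_db(2), 2))   # two microphone beams only
print(round(worst_case_db(4), 2))   # all four beams
```

Under these assumptions the two printed values come out near 3 dB and 0.7 dB of loss, respectively.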
When there are multiple audio sources in a room, gainsharing techniques (e.g., implemented by gainsharing logic 170) may smooth the transition between audio sources by mixing more than one beam without attenuating any one source over another source. In this way, each source may be received by the microphone array according to the output:
output=Σai*Si, where ai is beam gain and Si is beam signal.
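The weighted sum can be sketched as follows. The rule used here for choosing the beam gains (each gain proportional to that beam's short-term power, normalized so the gains sum to one) is an assumption borrowed from common gain-sharing automixers; the text above does not specify how the gains ai are computed, and the function name and parameters are hypothetical.

```python
def gainshare_mix(beams, powers, floor=1e-12):
    """Mix beams as output = sum(a_i * S_i) with power-proportional gains.

    beams  -- list of equal-length sample lists, one per beam (S_i)
    powers -- short-term power estimate for each beam
    floor  -- small constant that avoids division by zero during silence
    """
    total = sum(powers) + floor
    gains = [p / total for p in powers]          # a_i, assumed sharing rule
    length = len(beams[0])
    output = [sum(g * b[k] for g, b in zip(gains, beams)) for k in range(length)]
    return output, gains

out, gains = gainshare_mix([[1.0, 1.0], [0.0, 2.0]], [3.0, 1.0])
print(gains, out)
```

Because the gains always sum to approximately one, a beam that becomes louder takes gain away from the others gradually rather than through a hard switch.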
Different bidirectional microphones may receive audio from the same source in the room with different polarities. For example, referring to
In another example, when an audio source is at 315°, the polarities of the signals S1 and S2 are opposite, so adding S1 to S2 reduces the signal strength while subtracting S2 from S1 enhances the signal strength. Consequently, when there are two audio sources, one at 45° and the other at 315°, simply combining the beams S1 and S2 attenuates the audio signal from one source while enhancing the audio signal from the other source, regardless of whether the signals are mixed by adding or subtracting.
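The polarity conflict can be illustrated numerically. The sketch assumes ideal pickup patterns of cos(θ) and sin(θ) for the two microphones, an assumption chosen to agree with the S3 and S4 beam axes given earlier; it shows that a plain sum favors one diagonal direction and a plain difference favors the other. The function name is hypothetical.

```python
import math

def mixed_levels(theta_deg):
    """Return (|S1 + S2|, |S1 - S2|) for a unit source at theta_deg."""
    t = math.radians(theta_deg)
    s1, s2 = math.cos(t), math.sin(t)   # assumed pickup patterns
    return abs(s1 + s2), abs(s1 - s2)

for angle in (45, 315):
    added, subtracted = mixed_levels(angle)
    print(f"{angle} deg: sum={added:.3f} difference={subtracted:.3f}")
```

Neither fixed mixing rule serves both source positions, which is why the polarity of the secondary beam has to be chosen per source.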
Referring now to
The beam group selection logic 160 receives the location information 210 and the audio signals S1, S2, S3, and S4. The four audio beam signals S1, S2, S3, and S4 may be divided into two beam groups, such as S1 and S2 in a first audio beam group and S3 and S4 in a second audio beam group. The two beams that form the same beam group (e.g., S3 and S4 in the second audio beam group) point to two directions that are perpendicular to each other. Before mixing the two beams in each group together, the beams should be de-correlated, since the two beams may be formed by the same microphone inputs. A Hilbert filter may be used for the purpose of de-correlation, but other schemes such as all-pass filters may be used. De-correlated beams in the same group can be mixed together by gainsharing techniques. Each group of beams may be used to cover a whole room with two perpendicular beams. The beam group selection logic 160 selects a beam group with a primary audio beam signal 220 and a secondary audio beam signal 225. The primary audio beam signal 220 and the secondary audio beam signal 225 are sent to the gainsharing logic 170 to be mixed into an output signal 230 that covers the entire room, but is primarily aimed at the audio source.
In one example, bidirectional microphones typically have a deep null at ±90° to the beam axis, and the signal strength does not change significantly about 0°. Using the weakest beam to detect the audio source direction is more reliable and accurate than using the strongest beam due to the significant change in sensitivity caused by the deep null. In the audio direction logic 150, the Signal-to-Noise Ratio (SNR) of each of the beams is first measured to find the maximum SNR. If the maximum SNR is above a predefined threshold (THR1), then the current maximum SNR is compared to the previous maximum SNR. If the current maximum SNR is higher than the previous maximum SNR, then that is an indication of the rising side of a speech signal. Detecting the audio source direction based on the rising side of a speech signal is typically more reliable at detecting a new talker than detecting based on a preset SNR threshold. When the current maximum SNR is above THR1 and higher than the previous maximum SNR, the audio direction logic 150 determines the audio beam with the minimum SNR and compares the maximum SNR and the minimum SNR to ensure that the difference exceeds another predefined threshold (THR2). The audio source direction 210 is initially determined to be perpendicular to the direction of the weakest beam. The other beam in the beam group with the weakest beam should point in the talker direction and have the strongest SNR. The audio direction logic 150 may confirm the audio source direction 210 by verifying that the other beam in the group has the strongest SNR, or at least an SNR very close to the maximum SNR (e.g., within a predefined threshold THR3).
Referring now to
The audio direction logic 150 determines whether the difference between the maximum SNR and the minimum SNR exceeds a second predetermined threshold at 350. In one example, if the difference between the maximum SNR and the minimum SNR exceeds the second predetermined threshold, then the audio direction logic 150 confirms, at 360, that the audio beam with the minimum SNR is paired with an audio beam that has an SNR within a third predetermined threshold of the maximum SNR. If the difference between the maximum SNR and the minimum SNR exceeds the second predetermined threshold and the SNR of the beam paired with the weakest SNR beam is within the third predetermined threshold of the maximum SNR, then the audio direction logic 150 determines the audio source direction at 370.
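The threshold checks described above can be sketched as a single decision function. This is an illustrative sketch only: the function name, the argument layout (a list of per-beam SNR values in dB and a table mapping each beam to its perpendicular partner), and the example threshold values are assumptions, not details taken from the text.

```python
def detect_direction(snr, partner, prev_max, thr1, thr2, thr3):
    """Return the index of the beam judged to point at the talker, or None.

    snr      -- per-beam SNR values in dB
    partner  -- partner[i] is the index of the beam perpendicular to beam i
    prev_max -- maximum SNR observed in the previous measurement
    """
    max_snr = max(snr)
    if max_snr <= thr1 or max_snr <= prev_max:
        return None                       # no rising speech onset detected
    weakest = min(range(len(snr)), key=lambda i: snr[i])
    if max_snr - snr[weakest] <= thr2:
        return None                       # weakest beam is not in a deep null
    candidate = partner[weakest]          # beam perpendicular to the weakest beam
    if max_snr - snr[candidate] > thr3:
        return None                       # partner is not close to the strongest
    return candidate

# Four beams S1..S4 with groups (S1, S2) and (S3, S4); values are hypothetical.
pairs = [1, 0, 3, 2]
print(detect_direction([20.0, 2.0, 15.0, 14.0], pairs, prev_max=12.0,
                       thr1=10.0, thr2=10.0, thr3=3.0))
```

Returning None leaves the previous direction estimate in place, which is one simple way to ignore frames that fail any of the checks.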
Referring now to
The device also includes audio direction logic 470, beam group selection logic 480, and gainsharing logic 490. The audio direction logic 470 is configured to determine from what direction audio is being received. The beam group selection logic 480 is configured to select the appropriate pair of perpendicular audio beams such that one of the beams is directed as close as possible to the direction of the audio source. The gainsharing logic 490 is configured to combine the signals from the selected audio beam pair in order to generate an output audio signal that optimizes the sensitivity of the microphone array without introducing harsh switching artifacts as audio is received from different directions during a conversation.
Referring now to
In other words, the six audio beams B1, B2, B3, B4, B5, and B6 (e.g., audio beams 440, 445, 450, 455, 460, and 465) may be formed from the audio signals m1, m2, and m3 (e.g., from microphones 410, 420, and 430) according to:
B1=m1;
B2=(m2−m3)/√3;
B3=m2;
B4=(m1−m3)/√3;
B5=m3;
B6=(m2−m1)/√3.
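The six beam formulas can be checked against ideal pickup patterns. The sketch below assumes the three microphone axes are at 0°, 120°, and 240° (an assumption, since the array geometry is not restated here) and verifies numerically that each difference beam has unit peak gain and is perpendicular to the microphone beam it is grouped with. The names are hypothetical.

```python
import math

MIC_AXES_DEG = (0.0, 120.0, 240.0)   # assumed axes of the three microphones

def mic_patterns(theta_deg):
    """Ideal figure-eight responses m1, m2, m3 for a source at theta_deg."""
    return [math.cos(math.radians(theta_deg - a)) for a in MIC_AXES_DEG]

def six_beams(m1, m2, m3):
    """Form B1..B6 from the three microphone signals."""
    r3 = math.sqrt(3)
    return [m1, (m2 - m3) / r3, m2, (m1 - m3) / r3, m3, (m2 - m1) / r3]

def peak_and_overlap(i, j):
    """Peak |gain| of beam j and the inner product of patterns i and j."""
    peak, overlap = 0.0, 0.0
    for theta in range(360):
        beams = six_beams(*mic_patterns(theta))
        peak = max(peak, abs(beams[j]))
        overlap += beams[i] * beams[j]
    return peak, overlap

for i, j in ((0, 1), (2, 3), (4, 5)):
    peak, overlap = peak_and_overlap(i, j)
    print(f"B{i + 1}/B{j + 1}: peak={peak:.3f} overlap={overlap:.3f}")
```

An inner product near zero over a full circle indicates that the two patterns in a group are orthogonal, i.e. their axes are 90° apart.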
The six beams are divided into three beam groups: beams B1/B2 are in a first group, beams B3/B4 are in a second group, and beams B5/B6 are in a third group. The two beams in each group are perpendicular to each other. Each beam group includes all three microphone inputs with different polarity and gain. The audio direction logic 470 and beam group selection logic 480 may function similarly to the audio direction logic 150 and beam group selection logic 160, described with respect to
The final output of the microphone array device may be determined by the gainsharing logic to be:
output=gm*Bm+gs*p*Bs,
where gm and gs are the gains of the main beam Bm (i.e., the primary audio beam) and the secondary audio beam Bs, respectively, in the selected perpendicular audio beam group, and p is the polarity of the secondary beam, either +1.0 or −1.0.
To ensure that the gainsharing logic does not attenuate the overall sound signal due to correlation between the main beam and the secondary beam, the final determination of the polarity of the secondary beam may be based on a comparison of the power of the overall signal obtained by mixing the main beam and the secondary beam with two different polarities:
Bp=Bm+Bs
Bn=Bm−Bs
where Bp is the overall beam output calculated with positive polarity and Bn is the overall beam output calculated with negative polarity.
Referring now to
In other words, if the SNR for Bn is higher than the SNR for Bp by more than the predefined threshold, then the polarity switches to negative if the polarity was previously positive, and remains negative if the polarity was previously negative. Similarly, if the SNR for Bp is higher than the SNR for Bn by more than the predefined threshold, then the polarity switches to positive if the polarity was previously negative, and remains positive if the polarity was previously positive. If the SNRs for Bn and Bp are within the predefined threshold of each other, then the polarity remains the same to provide some hysteresis in switching polarity.
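The polarity decision with hysteresis, together with the final mix output=gm*Bm+gs*p*Bs, can be sketched as follows. The function names and the example values are hypothetical; the SNR values of Bp and Bn are assumed to be measured elsewhere and passed in as numbers in dB.

```python
def update_polarity(snr_bp, snr_bn, prev_polarity, threshold):
    """Choose the secondary-beam polarity p (+1.0 or -1.0) with hysteresis.

    snr_bp -- SNR of Bp = Bm + Bs
    snr_bn -- SNR of Bn = Bm - Bs
    """
    if snr_bn - snr_bp > threshold:
        return -1.0                 # negative mix is clearly better
    if snr_bp - snr_bn > threshold:
        return 1.0                  # positive mix is clearly better
    return prev_polarity            # too close to call: keep the previous choice

def mix_output(bm, bs, gm, gs, polarity):
    """Compute output = gm*Bm + gs*p*Bs sample by sample."""
    return [gm * m + gs * polarity * s for m, s in zip(bm, bs)]

p = update_polarity(snr_bp=9.0, snr_bn=15.0, prev_polarity=1.0, threshold=3.0)
print(p, mix_output([1.0, 0.5], [0.2, -0.4], gm=0.8, gs=0.2, polarity=p))
```

Keeping the previous polarity inside the threshold band avoids rapid sign flips when the two candidate mixes have nearly equal SNR.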
Bidirectional microphones do not distinguish whether a sound originates from the front or back. A small array of two bidirectional microphones may retain this ambiguity in sound direction. However, with three bidirectional microphones, assistant audio beams may be formed to differentiate the direction from which a sound originates.
Referring now to
When audio comes from 0°, it arrives at microphone 730 before arriving at microphone 720. When audio comes from 180°, it arrives at microphone 720 before arriving at microphone 730. The time difference between the audio arriving at microphone 730 and at microphone 720 is defined by T=d/s. When the audio direction is 0°±30°, the microphone 720 and the microphone 730 receive the audio with the same polarity, with a difference in signal that is at most 1.24 dB (cos(30°)). Two assistant beams 740P and 740N may be formed as:
740P=720(t)−730(t+T);
740N=720(t+T)−730(t).
When the audio direction is 0°±30°, the SNR of the assistant audio beam 740P would be much lower than that of the assistant audio beam 740N. When the audio direction is 180°±30°, the SNR of the assistant audio beam 740N would be much lower than that of the assistant audio beam 740P. Essentially, the assistant audio beams 740P and 740N behave like a pair of unidirectional, endfire arrays pointing in opposite directions.
Similarly, when audio comes from 60°±30° or 240°±30°, or from 120°±30° or 300°±30°, four more assistant beams 750N, 750P, 760N, and 760P may be formed to detect sound direction:
750P=730(t)−710(t+T);
750N=730(t+T)−710(t);
760P=710(t)−720(t+T);
760N=710(t+T)−720(t).
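The delay-and-subtract behavior of one assistant beam pair can be sketched with sampled signals. Several assumptions are made and should be read as such: the notation x(t+T) is interpreted as the output of microphone x delayed by T, an interpretation chosen because it reproduces the stated behavior (740P weak for a source near 0°); the delay T is rounded to a whole number of samples; and both microphones are assumed to pick up the source with equal gain and the same polarity. All names are hypothetical.

```python
import random

def delay(signal, n):
    """Delay a sample list by n samples, padding the start with zeros."""
    return [0.0] * n + signal[:len(signal) - n]

def power(signal):
    """Mean-square power of a sample list."""
    return sum(x * x for x in signal) / len(signal)

def assistant_beams(mic_a, mic_b, n):
    """Return (P, N) with P = a - delayed(b) and N = delayed(a) - b."""
    beam_p = [a - b for a, b in zip(mic_a, delay(mic_b, n))]
    beam_n = [a - b for a, b in zip(delay(mic_a, n), mic_b)]
    return beam_p, beam_n

random.seed(0)
source = [random.uniform(-1.0, 1.0) for _ in range(2000)]
n = 3                                   # hypothetical value of T in samples
mic_730 = source                        # a source at 0 degrees reaches 730 first
mic_720 = delay(source, n)              # and reaches 720 n samples later
beam_740p, beam_740n = assistant_beams(mic_720, mic_730, n)
print(power(beam_740p) < power(beam_740n))
```

Swapping the roles of the two microphone signals models a source at 180°, in which case the other beam of the pair is the one that cancels.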
Assistant beams may also be used to confirm the audio direction estimation and the selection of the primary beam. When there are multiple audio sources active at the same time from different directions in a room, the difference between the SNR of the positive assistant beam (e.g., assistant audio beam 740P) and the SNR of the negative assistant beam (e.g., assistant audio beam 740N) corresponding to the direction of the strongest beam pointing to the primary audio source would be smaller than when there is only one audio source in the same direction.
Referring now to
In other words, when the difference between the SNRs of the assistant audio beams corresponding to the direction of the strongest beam is less than a predefined threshold Thr_p_n, and the difference between the SNR of the strongest beam and the SNR of the weakest beam is less than a predefined threshold Thr_m, then multiple audio sources are detected in the room. In this case, the main beam may be selected simply by using the strongest beam, rather than the beam that is perpendicular to the weakest beam.
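The multiple-source test and the resulting main-beam choice can be sketched as follows. The function name, the argument layout, and the example numbers are hypothetical; the caller is assumed to supply the SNRs of the assistant beam pair that corresponds to the direction of the strongest beam.

```python
def select_main_beam(snr, partner, assist_p_snr, assist_n_snr, thr_p_n, thr_m):
    """Return (main_beam_index, multiple_sources_detected).

    snr          -- per-beam SNR values in dB
    partner      -- partner[i] is the index of the beam perpendicular to beam i
    assist_p_snr -- SNR of the positive assistant beam for the strongest direction
    assist_n_snr -- SNR of the negative assistant beam for the strongest direction
    """
    strongest = max(range(len(snr)), key=lambda i: snr[i])
    weakest = min(range(len(snr)), key=lambda i: snr[i])
    multiple = (abs(assist_p_snr - assist_n_snr) < thr_p_n
                and snr[strongest] - snr[weakest] < thr_m)
    # Several talkers: trust the strongest beam. One talker: use the null.
    main = strongest if multiple else partner[weakest]
    return main, multiple

pairs = [1, 0, 3, 2, 5, 4]
print(select_main_beam([14.0, 9.0, 16.0, 11.0, 12.0, 10.0], pairs,
                       assist_p_snr=8.0, assist_n_snr=9.5, thr_p_n=3.0, thr_m=10.0))
```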
Referring now to
At 1040, the device determines the direction of the audio source. In one example, the device estimates the direction of the audio source through the SNR of the audio beams. At 1050, the device selects a perpendicular audio beam pair based on the direction of the audio source. The perpendicular audio beam pair includes a primary audio beam aimed closest to the direction of the audio source and a secondary audio beam perpendicular to the primary audio beam. In one example, the device may select the secondary audio beam as having the lowest SNR of the audio beams and the primary audio beam as the audio beam perpendicular to the secondary audio beam. Alternatively, the device may select the primary audio beam as having the highest SNR and the secondary beam as the audio beam perpendicular to the primary beam.
At 1060, the device generates an output signal by combining the primary audio beam with the secondary audio beam based on a comparison of which respective polarity zone the audio is captured for the primary audio beam and the secondary audio beam. In one example, the output signal is generated through gainsharing techniques to minimize artifacts due to switching to a different perpendicular audio beam pair.
Referring now to
The computer system 1101 further includes a read only memory (ROM) 1105 or other static storage device (e.g., programmable ROM (PROM), erasable PROM (EPROM), and electrically erasable PROM (EEPROM)) coupled to the bus 1102 for storing static information and instructions for the processor 1103.
The computer system 1101 also includes a disk controller 1106 coupled to the bus 1102 to control one or more storage devices for storing information and instructions, such as a magnetic hard disk 1107, and a removable media drive 1108 (e.g., floppy disk drive, read-only compact disc drive, read/write compact disc drive, compact disc jukebox, tape drive, and removable magneto-optical drive, solid state drive, etc.). The storage devices may be added to the computer system 1101 using an appropriate device interface (e.g., small computer system interface (SCSI), integrated device electronics (IDE), enhanced-IDE (E-IDE), direct memory access (DMA), ultra-DMA, or universal serial bus (USB)).
The computer system 1101 may also include special purpose logic devices (e.g., application specific integrated circuits (ASICs)) or configurable logic devices (e.g., simple programmable logic devices (SPLDs), complex programmable logic devices (CPLDs), and field programmable gate arrays (FPGAs)), that, in addition to microprocessors and digital signal processors may individually, or collectively, include types of processing circuitry. The processing circuitry may be located in one device or distributed across multiple devices.
The computer system 1101 may also include a display controller 1109 coupled to the bus 1102 to control a display 1110, such as a cathode ray tube (CRT), liquid crystal display (LCD) or light emitting diode (LED) display, for displaying information to a computer user. The computer system 1101 includes input devices, such as a keyboard 1111 and a pointing device 1112, for interacting with a computer user and providing information to the processor 1103. The pointing device 1112, for example, may be a mouse, a trackball, track pad, touch screen, or a pointing stick for communicating direction information and command selections to the processor 1103 and for controlling cursor movement on the display 1110. In addition, a printer may provide printed listings of data stored and/or generated by the computer system 1101.
The computer system 1101 performs a portion or all of the processing steps of the operations presented herein in response to the processor 1103 executing one or more sequences of one or more instructions contained in a memory, such as the main memory 1104. Such instructions may be read into the main memory 1104 from another computer readable storage medium, such as a hard disk 1107 or a removable media drive 1108. One or more processors in a multi-processing arrangement may also be employed to execute the sequences of instructions contained in main memory 1104. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions. Thus, embodiments are not limited to any specific combination of hardware circuitry and software.
As stated above, the computer system 1101 includes at least one computer readable storage medium or memory for holding instructions programmed according to the embodiments presented, for containing data structures, tables, records, or other data described herein. Examples of computer readable storage media are compact discs, hard disks, floppy disks, tape, magneto-optical disks, PROMs (EPROM, EEPROM, flash EPROM), DRAM, SRAM, SD RAM, or any other magnetic medium, compact discs (e.g., CD-ROM, DVD), or any other optical medium, punch cards, paper tape, or other physical medium with patterns of holes, or any other medium from which a computer can read.
Stored on any one or on a combination of non-transitory computer readable storage media, embodiments presented herein include software for controlling the computer system 1101, for driving a device or devices for implementing the operations presented herein, and for enabling the computer system 1101 to interact with a human user (e.g., a network administrator). Such software may include, but is not limited to, device drivers, operating systems, development tools, and applications software. Such computer readable storage media further includes a computer program product for performing all or a portion (if processing is distributed) of the processing presented herein.
The computer code devices may be any interpretable or executable code mechanism, including but not limited to scripts, interpretable programs, dynamic link libraries (DLLs), Java classes, and complete executable programs. Moreover, parts of the processing may be distributed for better performance, reliability, and/or cost.
The computer system 1101 also includes a communication interface 1113 coupled to the bus 1102. The communication interface 1113 provides a two-way data communication coupling to a network link 1114 that is connected to, for example, a local area network (LAN) 1115, or to another communications network 1116 such as the Internet. For example, the communication interface 1113 may be a wired or wireless network interface card to attach to any packet switched (wired or wireless) LAN. As another example, the communication interface 1113 may be an asymmetrical digital subscriber line (ADSL) card, an integrated services digital network (ISDN) card or a modem to provide a data communication connection to a corresponding type of communications line. Wireless links may also be implemented. In any such implementation, the communication interface 1113 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
The network link 1114 typically provides data communication through one or more networks to other data devices. For example, the network link 1114 may provide a connection to another computer through a local area network 1115 (e.g., a LAN) or through equipment operated by a service provider, which provides communication services through a communications network 1116. The local area network 1115 and the communications network 1116 use, for example, electrical, electromagnetic, or optical signals that carry digital data streams, and the associated physical layer (e.g., CAT 5 cable, coaxial cable, optical fiber, etc.). The signals through the various networks and the signals on the network link 1114 and through the communication interface 1113, which carry the digital data to and from the computer system 1101, may be implemented in baseband signals, or carrier wave based signals. The baseband signals convey the digital data as unmodulated electrical pulses that are descriptive of a stream of digital data bits, where the term “bits” is to be construed broadly to mean symbol, where each symbol conveys at least one or more information bits. The digital data may also be used to modulate a carrier wave, such as with amplitude, phase and/or frequency shift keyed signals that are propagated over a conductive media, or transmitted as electromagnetic waves through a propagation medium. Thus, the digital data may be sent as unmodulated baseband data through a “wired” communication channel and/or sent within a predetermined frequency band, different than baseband, by modulating a carrier wave. The computer system 1101 can transmit and receive data, including program code, through the network(s) 1115 and 1116, the network link 1114 and the communication interface 1113. Moreover, the network link 1114 may provide a connection through a LAN 1115 to a mobile device 1117 such as a personal digital assistant (PDA), tablet computer, laptop computer, or cellular telephone.
In summary, the techniques described herein leverage the improved echo rejection of bidirectional microphones over omnidirectional or unidirectional microphones when a speaker is close to an array of microphones. The output signal from the microphone array is generated by combining beamforming and gainshare mixing while resolving the polarity conflict that arises when mixing signals from different bidirectional microphones. Additionally, for arrays of three or more bidirectional microphones, techniques are presented for estimating the direction of the audio source without ambiguity.
In one form, a method is provided for a device including a plurality of bidirectional microphones to generate an output audio signal that optimizes the echo rejection of the bidirectional microphones. The method includes receiving audio from an audio source and generating an audio signal from each of the bidirectional microphones. The method further includes forming a plurality of audio beams from combinations of the audio signals generated from the plurality of bidirectional microphones. Each audio beam captures audio from either a respective positive polarity zone or a respective negative polarity zone. The method also includes determining a direction of the audio source and selecting a perpendicular audio beam pair based on the direction of the audio source. The selected perpendicular audio beam pair includes a primary audio beam aimed toward the direction of the audio source and a secondary beam perpendicular to the primary audio beam. The method further includes generating an output signal by combining the primary audio beam with the secondary audio beam based on a comparison of which respective polarity zone the audio is captured for the primary audio beam and the secondary audio beam.
In another form, an apparatus is provided comprising a plurality of bidirectional microphones and a processor. Each bidirectional microphone is configured to receive audio from an audio source and generate an audio signal. The processor is configured to form a plurality of audio beams from combinations of the audio signals generated from the plurality of bidirectional microphones. Each audio beam captures audio from either a respective positive polarity zone or a respective negative polarity zone. The processor is also configured to determine a direction of the audio source and select a perpendicular audio beam pair based on the direction of the audio source. The selected audio beam pair includes a primary audio beam aimed toward the direction of the audio source and a secondary audio beam perpendicular to the primary audio beam. The processor is further configured to generate an output signal by combining the primary audio beam with the secondary audio beam based on a comparison of which respective polarity zones the audio is captured for the primary audio beam and the secondary audio beam.
In yet another form, one or more non-transitory computer readable storage media are encoded with software comprising computer executable instructions that, when the software is executed by a processor, cause the processor to receive audio of an audio source at a plurality of bidirectional microphones and generate an audio signal from each of the bidirectional microphones. The software is operable to cause the processor to form a plurality of audio beams from combinations of the audio signals generated from the plurality of bidirectional microphones. Each audio beam captures audio from either a respective positive polarity zone or a respective negative polarity zone. The software is also operable to cause the processor to determine a direction of the audio source and select a perpendicular audio beam pair. The selected perpendicular audio beam pair includes a primary audio beam aimed toward the direction of the audio source and a secondary audio beam perpendicular to the primary audio beam. The software is further operable to cause the processor to generate an output signal by combining the primary audio beam with the secondary audio beam based on a comparison of which respective polarity zones the audio is captured for the primary audio beam and the secondary audio beam.
The above description is intended by way of example only. Although the present disclosure has been described in detail with reference to particular arrangements and configurations, these example configurations and arrangements may be changed significantly without departing from the scope of the present disclosure. Moreover, certain components may be combined, separated, eliminated, or added based on particular needs and implementations. Although the techniques are illustrated and described herein as embodied in one or more specific examples, it is nevertheless not intended to be limited to the details shown, since various modifications and structural changes may be made within the scope and range of equivalents of this disclosure. For instance, while microphone arrays with greater than three bidirectional microphones are not explicitly described herein, similar techniques may be adapted to provide larger microphone arrays with the polarity-sensitive techniques described herein.
This application claims priority to U.S. Provisional Application No. 62/645,447, filed Mar. 20, 2018, the entirety of which is incorporated herein by reference.