This disclosure generally relates to audio devices and systems. More particularly, the disclosure relates to beamforming in audio devices.
Various audio applications benefit from effective sound (i.e., audio signal) pickup. For example, effective voice pickup and/or noise suppression can enhance audio communication systems, audio playback, and situational awareness of audio device users. However, conventional audio devices and systems can fail to adequately pick up (or, detect and/or characterize) audio signals, particularly far field audio signals.
All examples and features mentioned below can be combined in any technically possible way.
Various implementations include enhancing far-field sound pickup. Particular implementations utilize an adaptive beamformer to enhance far-field sound pickup, such as far-field voice pickup.
In some particular aspects, a method of sound enhancement for a system having microphones for far-field pickup includes: generating, using at least two microphones, a primary beam focused on a previously unknown desired signal look direction, the primary beam producing a primary signal configured to enhance the desired signal; generating, using at least two microphones, a reference beam focused on the desired signal look direction, the reference beam producing a reference signal configured to reject the desired signal; and removing, using at least one processor, components that correlate to the reference signal from the primary signal.
In some particular aspects, a system includes: a plurality of microphones for far-field pickup; and at least one processor configured to: generate, using at least two of the microphones, a primary beam focused on a previously unknown desired signal look direction, the primary beam producing a primary signal configured to enhance the desired signal, generate, using at least two of the microphones, a reference beam focused on the desired signal look direction, the reference beam producing a reference signal configured to reject the desired signal, and remove components that correlate to the reference signal from the primary signal.
Implementations may include one of the following features, or any combination thereof.
In certain implementations, the method further includes: prior to generating at least one of the primary beam or the reference beam, determining whether the desired signal activity is detected in an environment of the system.
In some cases, the desired signal relates to voice and the determination of whether voice is detected in the environment of the system includes using voice activity detector processing.
In particular aspects, generating the reference beam uses the same at least two microphones used to generate the primary beam.
In some implementations, at least one of the primary beam or the reference beam is generated using in-situ tuned beamformers.
In certain aspects, the desired signal look direction is selected by a user via manual input.
In particular cases, the desired signal look direction is selected automatically using source localization and beam selector technologies.
In some aspects, the method further includes: prior to removing the components that correlate to the reference signal from the primary signal, generating, using at least two microphones, multiple beams focused on different directions to assist with selecting the primary beam for producing the primary signal.
In particular implementations, the method further includes: removing, using the at least one processor, audio rendered by the system from the primary and reference signals via acoustic echo cancellation.
In certain cases, the system includes at least one of a wearable audio device, a hearing aid device, a speaker, a conferencing system, a vehicle communication system, a smartphone, a tablet, or a computer.
In some aspects, removing from the primary signal components that correlate to the reference signal includes filtering the reference signal to generate a noise estimate signal and subtracting the noise estimate signal from the primary signal.
In particular cases, the method further includes enhancing the spectral amplitude of the primary signal based upon the noise estimate signal to provide an output signal.
In some implementations, filtering the reference signal includes adaptively adjusting filter coefficients.
In certain aspects, adaptively adjusting filter coefficients includes at least one of a background process or monitoring when speech is not detected.
In particular cases, generating at least one of the primary beam or the reference beam includes using superdirective array processing.
In some aspects, the method further includes deriving the reference signal using a delay-and-subtract speech cancellation technique from the at least two microphones used to generate the reference beam.
In certain implementations, the desired signal relates to speech.
In particular cases, the desired signal does not relate to speech.
Two or more features described in this disclosure, including those described in this summary section, may be combined to form implementations not specifically described herein.
The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features, objects and advantages will be apparent from the description and drawings, and from the claims.
It is noted that the drawings of the various implementations are not necessarily to scale. The drawings are intended to depict only typical aspects of the disclosure, and therefore should not be considered as limiting the scope of the implementations. In the drawings, like numbering represents like elements between the drawings.
This disclosure is based, at least in part, on the realization that far field sound pickup can be enhanced using an adaptive beamformer. For example, approaches can include generating dual beams, one focused to enhance the desired signal look direction (e.g., primary sound beam, such as primary speech beam), and the second to reject the desired signal only (e.g., null beam for noise reference). The approaches also include performing adaptive signal processing on these beams to enhance pickup from the desired signal look direction.
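The dual-beam idea can be illustrated with a minimal two-microphone sketch. It assumes the look-direction source reaches the second microphone a whole number of samples after the first, and it uses a plain delay-and-sum for the primary beam and a delay-and-subtract for the reference (null) beam. The function names and the integer-lag model are illustrative assumptions, not details taken from the implementations described herein.

```python
import math

def delay(signal, k):
    """Delay a sequence by k >= 0 whole samples, zero-padding the start."""
    if k == 0:
        return list(signal)
    return [0.0] * k + list(signal[:-k])

def dual_beams(mic_a, mic_b, lag):
    """Form a primary and a reference beam from two microphone signals.

    `lag` is the (assumed known) number of samples by which the
    look-direction source arrives later at mic_b than at mic_a.
    """
    aligned_a = delay(mic_a, lag)  # time-align mic_a with mic_b
    # Delay-and-sum: look-direction components add coherently.
    primary = [0.5 * (a + b) for a, b in zip(aligned_a, mic_b)]
    # Delay-and-subtract: look-direction components cancel (a null).
    reference = [a - b for a, b in zip(aligned_a, mic_b)]
    return primary, reference

# Hypothetical source arriving from the look direction with a 2-sample lag.
source = [math.sin(2 * math.pi * 0.05 * n) for n in range(64)]
mic_a = source
mic_b = delay(source, 2)

primary, reference = dual_beams(mic_a, mic_b, lag=2)
```

In this idealized case the reference beam contains none of the look-direction source, so whatever it does pick up in a real environment can serve as a noise reference for later adaptive processing.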
In particular cases, such as in fixed installation uses and/or scenarios where a signal processing system can be trained, in-situ tuned beamformers are used to enhance sound pickup. In additional cases, a beam selector can be deployed to select a desired signal look direction. In still further cases, approaches include receiving a user interface command to define the desired signal look direction. The approaches disclosed according to various implementations can be employed in systems including wearable audio devices, fixed devices such as fixed installation-type audio devices, transportation-type devices (e.g., audio systems in automobiles, airplanes, trains, etc.), portable audio devices such as portable speakers, multimedia systems such as multimedia bars (e.g., soundbars and/or video bars), audio and/or video conferencing systems, and/or microphone or other sound pickup systems configured to work in conjunction with an audio and/or video system.
As used herein the term “far field” or “far-field” refers to a distance (e.g., between microphone(s) and sound source) of approximately at least one meter (or, three to five wavelengths). In contrast to certain conventional approaches for enhancing near field sound pickup (e.g., user voice pickup in a wearable device that is only centimeters from a user's mouth), various implementations are configured to enhance sound pickup at a distance of three or more wavelengths from the source. In particular cases, the digital signal processor used to process far field signals uses acoustic echo cancelation (AEC) and/or beamforming in order to process far field signals detected by system microphones. The terms “look direction” and “signal look direction” can refer to the direction, such as an approximately straight-line direction, between a set of microphones and a given sound source or sources. As described herein, aspects can include enhancing (e.g., amplifying and/or improving signal-to-noise ratio) acoustic signals from a desired signal look direction, such as the direction from which a user is speaking in the far field.
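To relate the one-meter and wavelength-based descriptions of the far field, the following short sketch converts a distance into a number of wavelengths. It assumes a speed of sound of 343 m/s (roughly dry air at 20 °C); that value and the function names are assumptions made for illustration.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value, roughly dry air at 20 degrees C

def wavelength_m(freq_hz, speed=SPEED_OF_SOUND_M_S):
    """Acoustic wavelength in meters at the given frequency."""
    return speed / freq_hz

def distance_in_wavelengths(distance_m, freq_hz):
    """Express a source-to-microphone distance as a count of wavelengths."""
    return distance_m / wavelength_m(freq_hz)

# How many wavelengths does one meter span at a few example frequencies?
for freq in (250.0, 1000.0, 4000.0):
    print(freq, round(distance_in_wavelengths(1.0, freq), 2))
```

Under this assumption, one meter corresponds to roughly three wavelengths at 1 kHz, and to fewer wavelengths at lower frequencies.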
Commonly labeled components in the FIGURES are considered to be substantially equivalent components for the purposes of illustration, and redundant discussion of those components is omitted for clarity.
The system 10 is shown including a plurality of microphones (mics) 20 for far-field acoustic signal (e.g., sound) pickup. In certain implementations, the plurality of microphones 20 includes at least two microphones. In particular cases, the microphones 20 include an array of three, four, five or more microphones (e.g., up to eight microphones). In additional cases, the microphones 20 include multiple arrays of microphones. The system 10 further includes at least one processor, or processor unit (PU(s)) 30, which can be coupled with a memory 40 that stores a program (e.g., program code) 50 for performing far field sound enhancement according to various implementations. In some cases, memory 40 is physically co-located with processor(s) 30; in other implementations, the memory 40 is physically separated from the processor(s) 30 and is otherwise accessible by the processor(s) 30. In some cases, the memory 40 may include a flash memory and/or non-volatile random access memory (NVRAM). In particular cases, memory 40 stores microcode of a program (e.g., far field sound processing program) 50 for processing and controlling the processor(s) 30, and may also store a variety of reference data. In certain cases, the processor(s) 30 include one or more microprocessors and/or microcontrollers for executing functions as dictated by program 50. In certain cases, processor(s) 30 include at least one digital signal processor (DSP) 60 configured to perform signal processing functions described herein. In certain cases, the DSP(s) 60 may be implemented as a chipset that includes separate and multiple analog and digital processors. In particular cases, when the instructions of program 50 are executed by the processor(s) 30, the DSP 60 performs functions described herein. In certain cases, the processor(s) 30 are also coupled to one or more electro-acoustic transducer(s) 70 for providing an audio output.
The system 10 can include a communication unit 80 in some cases, which can include a wireless (e.g., Bluetooth module, Wi-Fi module, etc.) and/or hard-wired (e.g., cabled) communication system. The system 10 can also include additional electronics 100, such as a power manager and/or power source (e.g., battery or power connector), memory, sensors (e.g., inertial measurement unit(s) (IMU(s)), accelerometers/gyroscope/magnetometers, optical sensors, voice activity detection systems), etc. Certain of the above-noted components depicted in
In certain cases, the processor(s) 30 execute the program 50 to take actions using, for example, the digital signal processor (DSP) 60.
As illustrated in
P1: generating, using at least two of the microphones 20, a primary beam focused on a previously unknown desired signal look direction. In various implementations, e.g., as illustrated in
In certain cases, the desired signal look direction can be selected manually using a beam selector. For example, the DSP 60 can include a beam selector (not shown) between the filter bank 110 and the fixed beamformer 120 that is configured to receive manual beam control commands, e.g., from a user interface or a controller. In these cases, a user can select the signal look direction based on a known direction of a far field sound source relative to the system 10. However, in other cases, the beam selector is configured to automatically (e.g., without user interaction) select the desired signal look direction. In these cases, the beam selector can select a desired signal look direction based on one or more selection factors relating to the input signal detected by microphones 20, which can include signal power, sound pressure level (SPL), correlation, delay, frequency response, coherence, acoustic signature (e.g., a combination of SPL and frequency), etc. In additional cases, the beam selector includes a machine learning engine (e.g., a trainable logic engine and/or artificial neural network) that can select the desired signal look direction based on feedback from prior signal look direction selections, e.g., similar known look directions selected in the past, and/or known prior null directions. In still further cases, the beam selector performs a progressive adjustment to the beam width based on one or more selection factors, e.g., initially selecting a wide beam width (and canceling a remaining portion of the environment 5), and narrowing the beam width as successive selection factors are reinforced, e.g., successively receiving high power signals or acoustic signatures matching a desired sound profile such as a user's speech.
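As one illustration of automatic selection, the sketch below picks a look direction using signal power alone, which is only one of the selection factors listed above. The function names and the candidate frames are hypothetical; a practical selector would likely combine several factors and smooth its decision over time.

```python
def beam_power(samples):
    """Mean squared amplitude of one candidate beam's output frame."""
    return sum(x * x for x in samples) / len(samples)

def select_beam(beam_outputs):
    """Return the index of the candidate beam with the highest mean power."""
    powers = [beam_power(frame) for frame in beam_outputs]
    return max(range(len(powers)), key=powers.__getitem__)

# Hypothetical output frames from three candidate beams, each steered
# toward a different direction.
candidates = [
    [0.01, -0.02, 0.01, 0.00],
    [0.50, -0.40, 0.45, -0.55],
    [0.10, -0.05, 0.08, -0.02],
]

selected = select_beam(candidates)
```

Here the second candidate carries the most energy, so its direction would be taken as the desired signal look direction for this frame.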
P2: generating, using at least two of the microphones 20, a reference beam focused on the desired signal look direction. In various implementations, e.g., as illustrated in
In some implementations, generating the primary beam and/or reference beam includes using super-directive array processing algorithms that enhance (e.g., maximize) the speech signal-to-noise ratio (SNR) or directivity, such as a generalized eigenvalue (GEV) solver or a minimum variance distortionless response (MVDR) solver.
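For reference, a standard MVDR solution at a single frequency bin is w = R⁻¹d / (dᴴR⁻¹d), where R is the noise covariance matrix and d is the steering vector toward the look direction. The sketch below works this out for a two-microphone array. The plane-wave steering model, the microphone spacing, the speed of sound, and the example covariance values are all assumptions chosen for illustration.

```python
import cmath
import math

def steering_vector(freq_hz, spacing_m, angle_deg, speed_of_sound=343.0):
    """Plane-wave steering vector for two microphones on a line (assumed model)."""
    tau = spacing_m * math.cos(math.radians(angle_deg)) / speed_of_sound
    return [1 + 0j, cmath.exp(-2j * math.pi * freq_hz * tau)]

def invert_2x2(m):
    """Inverse of a 2x2 complex matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mvdr_weights(noise_cov, d):
    """w = R^-1 d / (d^H R^-1 d) for a two-element array."""
    r_inv = invert_2x2(noise_cov)
    r_inv_d = [r_inv[i][0] * d[0] + r_inv[i][1] * d[1] for i in range(2)]
    denom = d[0].conjugate() * r_inv_d[0] + d[1].conjugate() * r_inv_d[1]
    return [x / denom for x in r_inv_d]

def beam_response(w, d):
    """Beamformer gain w^H d toward the direction described by d."""
    return w[0].conjugate() * d[0] + w[1].conjugate() * d[1]

def noise_power(w, noise_cov):
    """Output noise power w^H R w."""
    total = 0j
    for i in range(2):
        for j in range(2):
            total += w[i].conjugate() * noise_cov[i][j] * w[j]
    return total.real

# Hypothetical setup: 5 cm spacing, 1 kHz bin, source at 60 degrees,
# partially correlated noise between the two microphones.
d = steering_vector(1000.0, 0.05, 60.0)
R = [[1.0 + 0j, 0.6 + 0j], [0.6 + 0j, 1.0 + 0j]]

w_mvdr = mvdr_weights(R, d)
w_sum = [x / 2 for x in d]  # delay-and-sum weights, for comparison
```

Both weight vectors pass the look direction with unit gain. By construction, the MVDR weights yield output noise power no higher than the delay-and-sum weights for the same noise covariance.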
In certain cases, an optional process P2A includes generating, using at least two of the microphones 20 (
In various implementations, process P2A is performed prior to a subsequent process P3, which includes: removing components that correlate to the reference signal 220 from the primary signal 210. In various implementations, removing components that correlate to the reference signal 220 from the primary signal 210 (e.g., to generate the normalized least mean squares (NLMS) error signal) includes: a) filtering the reference signal to generate a noise estimate signal and b) subtracting the noise estimate signal from the primary signal. In certain of these cases, the process further includes enhancing the spectral amplitude of the primary signal 210 based on the noise estimate signal to provide an output signal. In certain cases, filtering the reference signal includes adaptively adjusting filter coefficients, which can include, for example, at least one of a background process or monitoring when speech is not detected. Additional aspects of removing components that correlate to the reference signal 220 from the primary signal 210 are described in U.S. Pat. No. 10,311,889 (“Audio Signal Processing for Noise Reduction,” or the '889 Patent), herein incorporated by reference in its entirety.
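The filter-and-subtract step can be sketched as a time-domain NLMS noise canceller. This is a generic textbook form rather than the specific processing of the '889 Patent: it adapts on every sample instead of only when speech is absent, and the tap count, step size, and synthetic signals are illustrative assumptions.

```python
import math
import random

def nlms_cancel(primary, reference, taps=8, mu=0.5, eps=1e-8):
    """Remove from `primary` what an adaptive FIR filter predicts from `reference`.

    Returns the error signal (primary minus the noise estimate).
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # recent reference samples, newest first
    output = []
    for p, r in zip(primary, reference):
        history = [r] + history[:-1]
        # a) filter the reference to get a noise estimate
        noise_estimate = sum(w * x for w, x in zip(weights, history))
        # b) subtract the noise estimate from the primary signal
        error = p - noise_estimate
        # adapt the coefficients, normalized by reference energy
        norm = sum(x * x for x in history) + eps
        weights = [w + mu * error * x / norm for w, x in zip(weights, history)]
        output.append(error)
    return output

def mean_square(values):
    return sum(v * v for v in values) / len(values)

# Synthetic example: a weak tone (the desired signal) buried in noise that
# also appears, unfiltered, in the reference signal.
rng = random.Random(0)
n_samples = 4000
noise = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
desired = [0.1 * math.sin(2 * math.pi * 0.01 * n) for n in range(n_samples)]
primary = [
    desired[n] + 0.8 * noise[n] + 0.3 * (noise[n - 1] if n > 0 else 0.0)
    for n in range(n_samples)
]
reference = noise

cleaned = nlms_cancel(primary, reference)
```

Because the reference contains none of the desired tone in this synthetic setup, the filter converges toward the noise path and the error signal approaches the desired signal. In practice some desired signal leaks into the reference beam, which is one reason to restrict adaptation to periods when speech is not detected.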
In certain implementations, e.g., with respect to
In some cases, e.g., where multiple users 15 are present in an environment 5, the system 10 can be configured to generate multiple primary beams, each associated with one of the users 15, e.g., for voice pickup from two or more users 15 in the room. These implementations can be beneficial, e.g., in conferencing scenarios, meeting scenarios, etc. In additional cases, the system 10 can be configured to adjust the primary and/or reference beam direction based on user movement within the environment 5. For example, the system 10 can adjust the primary and/or reference beam direction by evaluating multiple candidate beams to select a beam associated with the user's speech (e.g., a beam with a particular acoustic signature and/or signal strength), mixing multiple candidate beams (e.g., beams determined to be proximate to the user's last-known speaking direction), or performing source (e.g., user 15) tracking with a location tracking system such as an optical system (e.g., camera) and/or a location identifier such as a location tracking system on an electronic device that is on or otherwise carried by the user (e.g., smartphone, smart watch, wearable audio device, etc.). Examples of location-based tracking systems such as beacons and/or wearable location tracking systems are described in U.S. Pat. No. 10,547,937 and U.S. patent application Ser. No. 16/732,549 (both entitled, “User-Controlled Beam Steering in Microphone Array”), each of which is incorporated by reference in its entirety.
In particular implementations, the primary beam and/or the reference beam is/are generated using in-situ tuned beamformers. For example, in
In certain implementations, the echo canceler 180 removes audio rendered by the system 10 from the primary and reference signals via acoustic echo cancelation. For example, referring to
In various implementations, the desired signal relates to speech. In these cases, the system 10 is configured to enhance far field sound in the environment 5 that includes a speech, or voice, signal, e.g., the voice of one or more users 15 (
In other implementations, the desired signal does not relate to speech. In these cases, the system 10 is configured to enhance far field sound in the environment 5 that does not include a user's voice signal, or excludes the user's voice signal. For example, the system 10 can be configured to enhance a far field sound including a signal other than a speech signal. Examples of far field sounds other than speech that may be desirably enhanced include, but are not limited to: i) pickup of sounds made by an instrument, including for example, pickup of isolated playback of a single instrument within a band or orchestra, and/or enhancement/amplification of sound from an instrument played within a noisy environment; ii) pickup of sounds made during a sporting event, such as the contact of a baseball bat on a baseball, a basketball swishing through a net, or a football player being tackled by another player; iii) pickup of sounds made by animals, such as movement of animals within an environment and/or animal sounds or cries (e.g., the bark of a dog, purr of a cat, howl of a wolf, neigh of a horse, roar of a lion, etc.); and/or iv) pickup of nature sounds, such as the rustling of leaves, crackle of a fire, or the crash of a wave. Pickup of far field sounds other than voice can be deployed in a number of applications, for example, to enhance functionality in one or more systems. For example, a monitoring device such as a child monitor and/or pet monitor can be configured to detect far field sounds such as the rustling of a baby or the bark of a dog and provide an alert (e.g., via a user interface) relating to the sound/activity.
In particular additional implementations, the system 10 can be part of a wearable device such as a wearable audio device and/or a wearable smart device and can aid in enhancing sound pickup, e.g., as part of a distributed audio system. In certain cases, the system 10 can be deployed in a hearing aid, for example, to aid in picking up the sound of others (e.g., a voice of a conversation partner or a desired signal source) in the far field in order to enhance playback to the hearing aid user of those sound(s). The system 10 can also be deployed in a hearing aid to reduce noise in the user's speech, e.g., as is detectable in the far field. Additionally, the system 10 can enable enhanced hearing for a hearing aid user, e.g., of far field sound.
In any case, the system 10 can beneficially enhance far field signal pickup with beamforming. Certain prior approaches, such as described in the '889 Patent, can beneficially enhance voice pickup in near field use scenarios, for example in user-worn audio devices such as headphones, earphones, audio eyeglasses, and other wearable audio devices. The various implementations disclosed herein can beneficially enhance far field signal pickup, for example, with beamformers that are focused on the far field and corresponding null formers in a target direction. At least one distinction between voice pickup in a user-worn audio device and sound (e.g., voice) pickup in the far field is that the far field system 10 disclosed according to various implementations cannot always benefit from a priori information about source locations. In various implementations, the source location(s) is rarely identified a priori, because for example, given user(s) 15 are seldom located in a fixed location within the environment 5 when speaking. Additionally, a given environment 5 (e.g., a conference room, large office space, meeting facility, transportation vehicle, etc.) can include multiple source location(s) such as seats, and the system 10 will not benefit from identifying which seats will be occupied prior to executing sound pickup processes according to implementations.
One or more of the above described systems and methods, in various examples and combinations, may be used to capture far field sound (e.g., voice signals) and isolate or enhance those far field sounds relative to background noise, echoes, and other talkers. Any of the systems and methods described, and variations thereof, may be implemented with varying levels of reliability based on, e.g., microphone quality, microphone placement, acoustic ports, headphone frame design, threshold values, selection of adaptive, spectral, and other algorithms, weighting factors, window sizes, etc., as well as other criteria that may accommodate varying applications and operational parameters.
It is to be understood that any of the functions of methods and components of systems disclosed herein may be implemented or carried out in a digital signal processor (DSP), a microprocessor, a logic controller, logic circuits, and the like, or any combination of these, and may include analog circuit components and/or other components with respect to any particular implementation. Any suitable hardware and/or software, including firmware and the like, may be configured to carry out or implement components of the aspects and examples disclosed herein.
While the above describes a particular order of operations performed by certain implementations of the invention, it should be understood that such order is illustrative, as alternative embodiments may perform the operations in a different order, combine certain operations, overlap certain operations, or the like. References in the specification to a given embodiment indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic.
The functionality described herein, or portions thereof, and its various modifications (hereinafter “the functions”) can be implemented, at least in part, via a computer program product, e.g., a computer program tangibly embodied in an information carrier, such as one or more non-transitory machine-readable media, for execution by, or to control the operation of, one or more data processing apparatus, e.g., a programmable processor, a computer, multiple computers, and/or programmable logic components.
A computer program can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a network.
Actions associated with implementing all or part of the functions can be performed by one or more programmable processors executing one or more computer programs to perform the functions of the calibration process. All or part of the functions can be implemented as special purpose logic circuitry, e.g., an FPGA and/or an ASIC (application-specific integrated circuit). Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. Components of a computer include a processor for executing instructions and one or more memory devices for storing instructions and data.
In various implementations, unless otherwise noted, electronic components described as being “coupled” can be linked via conventional hard-wired and/or wireless means such that these electronic components can communicate data with one another. Additionally, sub-components within a given component can be considered to be linked via conventional pathways, which may not necessarily be illustrated.
A number of implementations have been described. Nevertheless, it will be understood that additional modifications may be made without departing from the scope of the inventive concepts described herein, and, accordingly, other embodiments are within the scope of the following claims.